[
https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302340#comment-15302340
]
ASF GitHub Bot commented on MAHOUT-1863:
----------------------------------------
Github user andrewpalumbo commented on the pull request:
https://github.com/apache/mahout/pull/235#issuecomment-221923050
Oh yes - I'd forgotten that we're now allowing user-defined directories in
these scripts, so we don't know how deep the path will be. So that simple
alternative won't work without, as you mentioned, a loop - and these scripts
are already complicated enough. We've discussed tearing them down completely
and redoing them, but haven't had a chance. (Would you be interested? :))
I'll have to test this out, but I'm for committing this as is. It needs
to at least be working on Hadoop 2.
We can then look at making all the scripts Hadoop 1 compatible again
later.
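For reference, the loop-based alternative mentioned above might look something like the sketch below. This is an illustrative assumption, not the scripts' actual code: it creates each component of an arbitrarily deep HDFS path one level at a time, which is the kind of loop a Hadoop 1 `fs -mkdir` (which lacks `-p`) would need. The `DFS` variable mirrors the one the example scripts already use; the function name `mkdir_parents` is made up for this sketch.

```shell
#!/bin/sh
# Hypothetical Hadoop-1-compatible "mkdir -p": issue one mkdir per path
# component, ignoring "already exists" failures along the way.
DFS=${DFS:-"hadoop fs"}

mkdir_parents() {
  path=$1
  # Absolute paths start the prefix at "/", relative ones at ".".
  case $path in /*) prefix="" ;; *) prefix="." ;; esac
  old_ifs=$IFS
  IFS=/
  set -- $path          # split the path on "/" into positional parameters
  IFS=$old_ifs
  for part in "$@"; do
    [ -n "$part" ] || continue
    prefix="$prefix/$part"
    $DFS -mkdir "$prefix" 2>/dev/null || true   # ignore existing directories
  done
}
```

Whether that complexity is worth carrying in the scripts is exactly the question raised above; `-mkdir -p` on Hadoop 2 does the same thing in one line.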
> cluster-syntheticcontrol.sh errors out with "Input path does not exist"
> -----------------------------------------------------------------------
>
> Key: MAHOUT-1863
> URL: https://issues.apache.org/jira/browse/MAHOUT-1863
> Project: Mahout
> Issue Type: Bug
> Affects Versions: 0.12.0
> Reporter: Albert Chu
> Priority: Minor
>
> Running cluster-syntheticcontrol.sh on 0.12.0 resulted in this error:
> {noformat}
> Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>     at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>     at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>     at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
>     at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
>     at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
>     at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> {noformat}
> It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to the patch
> {noformat}
> commit 23267a0bef064f3351fd879274724bcb02333c4a
> {noformat}
> One change in question:
> {noformat}
> - $DFS -mkdir testdata
> + $DFS -mkdir ${WORK_DIR}/testdata
> {noformat}
> now requires that the -p option be passed to -mkdir, since the parent
> directories of ${WORK_DIR}/testdata may not already exist. That fix is simple.
> Another change:
> {noformat}
> - $DFS -put ${WORK_DIR}/synthetic_control.data testdata
> + $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
> {noformat}
> appears to break the example, because in
> examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
> examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
> the input directory is hard-coded as just 'testdata', so
> ${WORK_DIR}/testdata needs to be passed in as an option.
> Reverting the lines listed above fixes this problem, but presumably
> re-introduces the original problem reported in MAHOUT-1773.
> I first attempted to fix this by simply passing the option "--input
> ${WORK_DIR}/testdata" to the command in the script. However, once any one
> option is specified, a number of other options become required as well.
> I considered modifying the above Job.java files to take a minimal number of
> arguments and default the rest, but that would also have required changes
> to DefaultOptionCreator.java to make some required options optional, and I
> didn't want to go down the path of determining which other examples depend
> on those options being required.
> So I simply passed every required option into cluster-syntheticcontrol.sh,
> using whatever defaults were hard-coded into the Job.java files above.
> I'm sure there's a better way to do this, and I'm happy to supply a patch,
> but thought I'd start with this.
> Github pull request to be sent shortly.
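(Editor's note: a hedged sketch of what "passing every required option" might look like for the k-means variant is below. The flag names and default values are assumptions inferred from DefaultOptionCreator.java and the hard-coded defaults described above, not the committed fix; verify them against your Mahout version. It is not runnable without a Hadoop cluster and a `mahout` installation.)

```shell
# Illustrative only - flag names and defaults are assumptions, not the
# committed fix. Spells out every required option so the hard-coded
# 'testdata' path in Job.java is never consulted.
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
  --input  ${WORK_DIR}/testdata \
  --output ${WORK_DIR}/output \
  --distanceMeasure org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  --t1 80 --t2 55 \
  --convergenceDelta 0.5 \
  --maxIter 10
```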
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)