[
https://issues.apache.org/jira/browse/MAHOUT-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301235#comment-15301235
]
Andrew Palumbo commented on MAHOUT-1863:
----------------------------------------
I agree that there is probably an easier fix than supplying all of the
parameters; this script has been around for a while. IIRC there was a recent
change made so that users could provide their own working directory by
exporting an env var, e.g. {{export WORK_DIR=/home/myworkdir}}. The reason was
that some OS {{/tmp}} directories (the original hard-coded default base dir)
were either not directly accessible to some users or not large enough to
accommodate large files (I can't remember which), but the script should have
defaulted back to {{/tmp}}. The PR may be an easier place to discuss that.
Thanks again.
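The fallback behavior described above could be sketched with the standard shell default-expansion idiom; the variable name {{WORK_DIR}} matches the script, while the default path is an assumption for illustration:

```shell
#!/bin/sh
# Honor a user-exported WORK_DIR; otherwise fall back to a /tmp location.
# ${var:-default} uses the default when var is unset or empty.
WORK_DIR=${WORK_DIR:-/tmp/mahout-work-${USER}}
echo "Using work directory: ${WORK_DIR}"
```

With nothing exported this prints a path under {{/tmp}}; with {{export WORK_DIR=/home/myworkdir}} it prints the user's choice.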
> cluster-syntheticcontrol.sh errors out with "Input path does not exist"
> -----------------------------------------------------------------------
>
> Key: MAHOUT-1863
> URL: https://issues.apache.org/jira/browse/MAHOUT-1863
> Project: Mahout
> Issue Type: Bug
> Affects Versions: 0.12.0
> Reporter: Albert Chu
> Priority: Minor
>
> Running cluster-syntheticcontrol.sh on 0.12.0 resulted in this error:
> {noformat}
> Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://apex156:54310/user/achu/testdata
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
> at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
> at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
> at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.run(Job.java:133)
> at org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job.main(Job.java:62)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
> at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> {noformat}
> It appears cluster-syntheticcontrol.sh breaks under 0.12.0 due to this commit:
> {noformat}
> commit 23267a0bef064f3351fd879274724bcb02333c4a
> {noformat}
> One change in question:
> {noformat}
> - $DFS -mkdir testdata
> + $DFS -mkdir ${WORK_DIR}/testdata
> {noformat}
> now requires that the -p option be specified to -mkdir. This fix is simple.
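As a local analogy for that requirement: {{hadoop fs -mkdir}}, like POSIX {{mkdir}}, fails when intermediate directories are missing unless {{-p}} is passed. The paths below are throwaway examples:

```shell
#!/bin/sh
# Demonstrate why -mkdir now needs -p: the parent "work" directory
# does not exist yet, so a plain mkdir of work/testdata fails.
tmp=$(mktemp -d)
mkdir "$tmp/work/testdata" 2>/dev/null \
  && echo "plain mkdir succeeded" \
  || echo "plain mkdir failed: parent missing"
mkdir -p "$tmp/work/testdata" && echo "mkdir -p succeeded"
rm -rf "$tmp"
```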
> Another change:
> {noformat}
> - $DFS -put ${WORK_DIR}/synthetic_control.data testdata
> + $DFS -put ${WORK_DIR}/synthetic_control.data ${WORK_DIR}/testdata
> {noformat}
> appears to break the example because, in:
> examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java
> examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java
> the input path is hard coded as just 'testdata', so ${WORK_DIR}/testdata
> needs to be passed in as an option.
> Reverting the lines listed above fixes the problem. However, reverting
> presumably reintroduces the original problem reported in MAHOUT-1773.
> I originally attempted to fix this by simply passing the option "--input
> ${WORK_DIR}/testdata" to the command in the script. However, once any one
> option is specified, a number of other options become required as well.
> I considered modifying the above Job.java files to take a minimal number of
> arguments and default the rest, but that would have also required changes to
> DefaultOptionCreator.java to make required options optional, and I didn't
> want to go down the path of determining which options the other examples do
> or do not require.
> So I just passed in every required option into cluster-syntheticcontrol.sh to
> fix this, using whatever defaults were hard coded into the Job.java files
> above.
> I'm sure there's a better way to do this, and I'm happy to supply a patch,
> but thought I'd start with this.
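The workaround of passing the paths explicitly might look roughly like the fragment below. The {{--input}}/{{--output}} flag names follow Mahout's DefaultOptionCreator; the {{WORK_DIR}} default and output path are illustrative assumptions, and the real script would pass the job's remaining required options as well:

```shell
#!/bin/sh
# Hypothetical sketch: build an explicit option list so the job no longer
# relies on the hard-coded relative "testdata" path. Echoed rather than
# executed, since running it needs a Hadoop/Mahout installation.
WORK_DIR=${WORK_DIR:-/tmp/mahout-work-${USER}}
OPTS="--input ${WORK_DIR}/testdata --output ${WORK_DIR}/output"
echo "mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job $OPTS"
```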
> Github pull request to be sent shortly.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)