Hi Sebastian,
Thank you very much; using the --tempDir parameter fixed the problem.
As you mentioned, it would be really nice if there were a single step
that outputs item recommendations for users as well as the user-user and
item-item similarities. An alternative would be to split the
RecommenderJob class into separate jobs that rely on each other's
output. That would be even better for my case: I am using AWS EMR, and
if this information is not in the main output of a step, I have to copy
it out of HDFS manually, which is much harder to script.
Best regards,
Thomas
On 08.02.2011 17:46, Sebastian Schelter wrote:
Hi Thomas,
you can also use the parameter --tempDir to explicitly point a job to a
temp directory.
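For example, you could point each step at its own temp location in its
Args (the "temp/..." value below is just a placeholder, any two distinct
paths will do):

   "Args": [
      "--input", "s3n://recommendertest/data/<jobid>/aggregateWatched/",
      "--output", "s3n://recommendertest/data/<jobid>/userRecommendations/",
      "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
      "--numRecommendations", "100",
      "--tempDir", "temp/recommenderJob"
   ]

With a distinct --tempDir per step, RecommenderJob no longer trips over
the temp data left behind by the first step.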
By the way, I realize that our users shouldn't need to execute both jobs
the way you do, because the similar-items computation is already
contained in RecommenderJob. We should add an option that makes it write
out the similar items in a nice form, so that running both jobs can be
avoided.
I'm gonna create a ticket for this.
--sebastian
On 08.02.2011 17:37, Sean Owen wrote:
I would not run them in the same root directory / key prefix. Put them
both under different namespaces.
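Roughly like this (the prefixes here are only examples):

   ItemSimilarityJob: --output s3n://bucket/itemSimilarity/output/ --tempDir itemSimilarity/temp/
   RecommenderJob:    --output s3n://bucket/recommender/output/    --tempDir recommender/temp/

That way nothing the two jobs write can end up under the same prefix.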
On Tue, Feb 8, 2011 at 4:34 PM, Thomas Söhngen <[email protected]> wrote:
Hi fellow data crunchers,
I am running a JobFlow with a step using
"org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" and a
following step using
"org.apache.mahout.cf.taste.hadoop.item.RecommenderJob". The first step
works without problems, but the second one throws an exception:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory temp/itemIDIndex already exists and is not empty
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:124)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:818)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
    at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:165)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:328)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
It looks like the second job is using the same temporary output directories
as the first job. How can I avoid this? Or even better: If some of the tasks
are already done and cached in the first step, how could I use them so that
they don't have to be recomputed in the second step?
Best regards,
Thomas
PS: This is the actual JobFlow definition in JSON:
[
[......],
{
"Name": "MR Step 2: Find similiar items",
"HadoopJarStep": {
"MainClass":
"org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
"Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
"Args": [
"--input",
"s3n://recommendertest/data/<jobid>/aggregateWatched/",
"--output", "s3n://recommendertest/data/<jobid>/similiarItems/",
"--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
"--maxSimilaritiesPerItem", "100"
]
}
},
{
"Name": "MR Step 3: Find items for user",
"HadoopJarStep": {
"MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
"Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
"Args": [
"--input",
"s3n://recommendertest/data/<jobid>/aggregateWatched/",
"--output",
"s3n://recommendertest/data/<jobid>/userRecommendations/",
"--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
"--numRecommendations", "100"
]
}
}
]