Hi Sebastian,
Thank you very much; using the --tempDir parameter fixed the problem.
As you mentioned, it would be really nice if there were a single step
that outputs item recommendations for users as well as the user-user and
item-item similarities. An alternative would be to split the
RecommenderJob class into separate jobs that rely on each other's
output. That would be even better for my case: I am using AWS EMR, and
if this information is not in the main output of a step, I have to copy
it out of HDFS manually, which is much harder to script.
Best regards,
Thomas
On 08.02.2011 17:46, Sebastian Schelter wrote:
Hi Thomas,
you can also use the parameter --tempDir to explicitly point a job to a
temp directory.
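For example, you could point each step at its own temp location in its
Args (the "temp/..." value below is just a placeholder, any two distinct
paths will do):

   "Args": [
      "--input", "s3n://recommendertest/data/<jobid>/aggregateWatched/",
      "--output", "s3n://recommendertest/data/<jobid>/userRecommendations/",
      "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
      "--numRecommendations", "100",
      "--tempDir", "temp/recommenderJob"
   ]

With a distinct --tempDir per step, RecommenderJob no longer trips over
the temp data left behind by the first step.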
By the way, I realize that our users shouldn't need to execute both jobs
the way you do, because the similar-items computation is already
contained in RecommenderJob. We should add an option that makes it write
out the similar items in a nice form, so that running both jobs can be
avoided.
I'm gonna create a ticket for this.
--sebastian
On 08.02.2011 17:37, Sean Owen wrote:
I would not run them in the same root directory / key prefix. Put them
both under different namespaces.
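Roughly like this (the prefixes here are only examples):

   ItemSimilarityJob: --output s3n://bucket/itemSimilarity/output/ --tempDir itemSimilarity/temp/
   RecommenderJob:    --output s3n://bucket/recommender/output/    --tempDir recommender/temp/

That way nothing the two jobs write can end up under the same prefix.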
On Tue, Feb 8, 2011 at 4:34 PM, Thomas Söhngen <[email protected]> wrote:
Hi fellow data crunchers,
I am running a JobFlow with a step using
"org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" and a
following step using
"org.apache.mahout.cf.taste.hadoop.item.RecommenderJob". The first step
works without problems, but the second one throws an exception:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory temp/itemIDIndex already exists and is not empty
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:124)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:818)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
    at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:165)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:328)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
It looks like the second job is using the same temporary output directories
as the first job. How can I avoid this? Or even better: If some of the tasks
are already done and cached in the first step, how could I use them so that
they don't have to be recomputed in the second step?
Best regards,
Thomas
PS: This is the actual JobFlow definition in JSON:
[
[......],
{
"Name": "MR Step 2: Find similiar items",
"HadoopJarStep": {
"MainClass":
"org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
"Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
"Args": [
"--input",
"s3n://recommendertest/data/<jobid>/aggregateWatched/",
"--output", "s3n://recommendertest/data/<jobid>/similiarItems/",
"--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
"--maxSimilaritiesPerItem", "100"
]
}
},
{
"Name": "MR Step 3: Find items for user",
"HadoopJarStep": {
"MainClass": "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
"Jar": "s3n://recommendertest/mahout-core/mahout-core-0.4-job.jar",
"Args": [
"--input",
"s3n://recommendertest/data/<jobid>/aggregateWatched/",
"--output",
"s3n://recommendertest/data/<jobid>/userRecommendations/",
"--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
"--numRecommendations", "100"
]
}
}
]