[ https://issues.apache.org/jira/browse/MAHOUT-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679843#comment-13679843 ]
Jake Mannix commented on MAHOUT-1147:
-------------------------------------
Hmmm:
13/06/10 12:58:44 INFO cvb.CVB0Driver: About to run: Writing final topic/term distributions from /tmp/mahout-work-jake/reuters-lda-model/model-20 to /tmp/mahout-work-jake/reuters-lda
13/06/10 12:58:45 INFO input.FileInputFormat: Total input paths to process : 10
13/06/10 12:58:46 INFO cvb.CVB0Driver: About to run: Writing final document/topic inference from /tmp/mahout-work-jake/reuters-out-matrix/matrix to /tmp/mahout-work-jake/reuters-lda-topics
13/06/10 12:58:47 INFO input.FileInputFormat: Total input paths to process : 1
13/06/10 12:58:52 INFO mapred.JobClient: Running job: job_201306101136_0057
13/06/10 12:58:53 INFO mapred.JobClient: map 0% reduce 0%
13/06/10 12:59:50 INFO mapred.JobClient: map 20% reduce 0%
13/06/10 12:59:56 INFO mapred.JobClient: map 40% reduce 0%
13/06/10 12:59:59 INFO mapred.JobClient: map 60% reduce 0%
13/06/10 13:00:02 INFO mapred.JobClient: map 80% reduce 0%
13/06/10 13:00:05 INFO mapred.JobClient: map 100% reduce 0%
13/06/10 13:00:08 INFO mapred.JobClient: Job complete: job_201306101136_0057
13/06/10 13:00:08 INFO mapred.JobClient: Counters: 6
13/06/10 13:00:08 INFO mapred.JobClient: Job Counters
13/06/10 13:00:08 INFO mapred.JobClient: Launched map tasks=10
13/06/10 13:00:08 INFO mapred.JobClient: Data-local map tasks=10
13/06/10 13:00:08 INFO mapred.JobClient: FileSystemCounters
13/06/10 13:00:08 INFO mapred.JobClient: HDFS_BYTES_READ=6690610
13/06/10 13:00:08 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=6690610
13/06/10 13:00:08 INFO mapred.JobClient: Map-Reduce Framework
13/06/10 13:00:08 INFO mapred.JobClient: Map input records=20
13/06/10 13:00:08 INFO mapred.JobClient: Spilled Records=0
13/06/10 13:00:08 INFO mapred.JobClient: Running job: job_201306101136_0058
13/06/10 13:00:09 INFO mapred.JobClient: map 0% reduce 0%
13/06/10 13:00:12 INFO mapred.JobClient: map 100% reduce 0%
13/06/10 13:10:17 INFO mapred.JobClient: Task Id : attempt_201306101136_0058_m_000000_0, Status : FAILED
java.lang.NullPointerException
at org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.cleanup(CVB0DocInferenceMapper.java:99)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Task attempt_201306101136_0058_m_000000_0 failed to report status for 602 seconds. Killing!
13/06/10 13:10:18 INFO mapred.JobClient: map 0% reduce 0%
13/06/10 13:10:27 INFO mapred.JobClient: map 100% reduce 0%
> CVB Bug in CVB0Driver causes doc/topic distributions to be trained on random
> matrix
> -----------------------------------------------------------------------------------
>
> Key: MAHOUT-1147
> URL: https://issues.apache.org/jira/browse/MAHOUT-1147
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.7
> Environment: Eclipse IDE
> Java code base
> CVB0Driver Class
> setModelPaths(Job job, Path modelPath) - method
> Reporter: Jack Pay
> Assignee: Jake Mannix
> Labels: bug, cvb, fix, suggestion
> Fix For: 0.8
>
> Attachments: MAHOUT-1147.patch, MAHOUT-1147.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Problem:
> When training the doc/topic model, no paths for the term/topic model are
> found (the lookup returns null).
> These paths are set using setModelPaths in CVB0Driver.
> Reason for Problem:
> This method is called for a variety of Job instances.
> The Job is passed to the method instead of the Configuration object given to
> the Job.
> The method then retrieves the Configuration from the Job instance itself.
> I believe that this Configuration instance is a clone of the original.
> This is a problem because MODEL_PATHS is set on the clone, which is
> discarded when the given Job completes.
> The original Configuration has no MODEL_PATHS string set and therefore
> returns null.
> The code falls back to a new random matrix when it cannot find a model.
> This happens every time, because MODEL_PATHS is never set on the
> Configuration instance that is actually used.
> Solution:
> Do not pass the Job to the setModelPaths method; instead, pass the
> Configuration instance that was passed into the method which created the Job.
> i.e.
> change from:
> setModelPaths(Job job, Path modelPath)
> to:
> setModelPaths(Configuration conf, Path modelPath)
> And change all calling methods accordingly.
> So far, the little testing I have done suggests that this solves the problem.
>
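For anyone following the reasoning in the description, here is a minimal sketch of the hypothesis it describes, assuming (as the reporter believes) that the Job works on a private copy of the Configuration. Conf and Job below are hypothetical stand-ins, not the Hadoop classes, and the key value and path are made up for illustration; this is not the CVB0Driver code.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfCopyDemo {
    // Hypothetical key; the real MODEL_PATHS value in CVB0Driver is not shown here.
    static final String MODEL_PATHS = "model.paths";

    /** Hypothetical stand-in for a configuration: a string map with a copy constructor. */
    static final class Conf {
        private final Map<String, String> values = new HashMap<>();
        Conf() {}
        Conf(Conf other) { values.putAll(other.values); }
        void set(String key, String value) { values.put(key, value); }
        String get(String key) { return values.get(key); }
    }

    /** Hypothetical stand-in for a job that keeps its own copy of the conf it was built from. */
    static final class Job {
        private final Conf conf;
        Job(Conf original) { this.conf = new Conf(original); }
        Conf getConfiguration() { return conf; }
    }

    // Variant the report describes as problematic: the key lands on the job's copy only.
    static void setModelPaths(Job job, String modelPath) {
        job.getConfiguration().set(MODEL_PATHS, modelPath);
    }

    // Variant the report proposes: the key is set on the caller's own configuration.
    static void setModelPaths(Conf conf, String modelPath) {
        conf.set(MODEL_PATHS, modelPath);
    }

    // Mirrors the described fallback: no model path means a random matrix is used.
    static String describeModel(Conf conf) {
        String paths = conf.get(MODEL_PATHS);
        return paths == null ? "random matrix" : "model from " + paths;
    }

    public static void main(String[] args) {
        Conf original = new Conf();
        setModelPaths(new Job(original), "/path/to/model");
        System.out.println(describeModel(original)); // prints "random matrix"

        Conf fixed = new Conf();
        setModelPaths(fixed, "/path/to/model");
        System.out.println(describeModel(fixed));    // prints "model from /path/to/model"
    }
}
```

Under that assumption, the first call leaves the caller's configuration untouched, which matches the "returns null, so a random matrix is used" symptom in the description. Whether the real Job and Configuration behave this way in CVB0Driver would need to be checked against the Hadoop version in use.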