The issue is that the numTerms in the dictionary is 0:

learning for LDA on reuters-lda/reuters-matrix/matrix (numTerms: 0), finding 5-topics, with document/topic prior 1.0E-4, topic/term prior 1.0E-4. Maximum iterations to run will be 2, unless the change in perplexity is less than 0.0. Topic model output (p(term|topic) for each topic) will be stored

Could you check your seq2sparse output and the terms in the dictionary?

On Friday, March 14, 2014 12:16 PM, Steven Cullens <srcull...@gmail.com> wrote:

Hi,

I'm running Mahout 0.9 and Hadoop 1.1.1 and I'm following the examples/bin/cluster-reuters.sh script, but manually entering commands because the script crashes. Data preparation runs smoothly, but when I call cvb, it times out prior to writing topics. Any ideas? Thanks in advance and here are the commands:

# comparison of output of seqdumper with raw text files was fine
bin/mahout seqdirectory -i file:///home/hduser/software/mahout-distribution-0.9/examples/reuters-out -o reuters-lda/seqfile -c UTF-8 -chunk 64 -xm sequential

# output had dictionary and a frequency file. tf-vector part file had text file name for key and vector of large number (word id?) : integer (word count?)
bin/mahout seq2sparse -i reuters-lda/seqfile -o reuters-lda/vectors -ow --maxDFPercent 85 --namedVector

# key was replaced by integer: Key: 0: Value: /reut2-000.sgm-0.txt:{26587:3.0,19426:6.0,41154:1.0
bin/mahout rowid -i reuters-lda/vectors/tf-vectors -o reuters-lda/reuters-matrix

# times out prior to writing topics.
bin/mahout cvb -i reuters-lda/reuters-matrix/matrix -o reuters-lda/lda -k 5 -ow -x 2 -dict reuters-lda/vectors/dictonary.file-* -dt reuters-lda/topics -mt reuters-lda/model

Here's the output of the last step:

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /home/hduser/software/hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=/home/hduser/software/hadoop-1.1.1/conf
MAHOUT-JOB: /home/hduser/software/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.
14/03/14 11:41:58 WARN driver.MahoutDriver: No cvb.props found on classpath, will use command-line arguments only
14/03/14 11:41:58 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0.0], --dictionary=[reuters-lda/vectors/dictonary.file-*], --doc_topic_output=[reuters-lda/topics], --doc_topic_smoothing=[1.0E-4], --endPhase=[2147483647], --input=[reuters-lda/reuters-matrix/matrix], --iteration_block_size=[10], --maxIter=[2], --max_doc_topic_iters=[10], --num_reduce_tasks=[10], --num_topics=[5], --num_train_threads=[4], --num_update_threads=[1], --output=[reuters-lda/lda], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[1.0E-4], --test_set_fraction=[0.0], --topic_model_temp_dir=[reuters-lda/model]}
14/03/14 11:41:58 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes (0th-derivative approximation) learning for LDA on reuters-lda/reuters-matrix/matrix (numTerms: 0), finding 5-topics, with document/topic prior 1.0E-4, topic/term prior 1.0E-4. Maximum iterations to run will be 2, unless the change in perplexity is less than 0.0. Topic model output (p(term|topic) for each topic) will be stored reuters-lda/lda.
Random initialization seed is 7962, holding out 0.0 of the data for perplexity check
14/03/14 11:41:58 INFO cvb.CVB0Driver: Dictionary to be used located reuters-lda/vectors/dictonary.file-*
p(topic|docId) will be stored reuters-lda/topics
14/03/14 11:41:58 INFO cvb.CVB0Driver: Current iteration number: 0
14/03/14 11:41:58 INFO cvb.CVB0Driver: About to run iteration 1 of 2
14/03/14 11:41:58 INFO cvb.CVB0Driver: About to run: Iteration 1 of 2, input path: reuters-lda/model/model-0
14/03/14 11:41:59 INFO input.FileInputFormat: Total input paths to process : 1
14/03/14 11:41:59 INFO mapred.JobClient: Running job: job_201403131444_0034
14/03/14 11:42:00 INFO mapred.JobClient: map 0% reduce 0%
14/03/14 11:42:11 INFO mapred.JobClient: map 86% reduce 0%
14/03/14 11:42:14 INFO mapred.JobClient: map 100% reduce 0%
14/03/14 11:42:22 INFO mapred.JobClient: map 100% reduce 3%
14/03/14 11:42:23 INFO mapred.JobClient: map 100% reduce 6%
14/03/14 11:42:24 INFO mapred.JobClient: map 100% reduce 20%
14/03/14 11:42:32 INFO mapred.JobClient: map 100% reduce 30%
14/03/14 11:42:33 INFO mapred.JobClient: map 100% reduce 40%
14/03/14 11:42:40 INFO mapred.JobClient: map 100% reduce 43%
14/03/14 11:42:41 INFO mapred.JobClient: map 100% reduce 60%
14/03/14 11:42:48 INFO mapred.JobClient: map 100% reduce 66%
14/03/14 11:42:49 INFO mapred.JobClient: map 100% reduce 73%
14/03/14 11:42:50 INFO mapred.JobClient: map 100% reduce 80%
14/03/14 11:42:57 INFO mapred.JobClient: map 100% reduce 83%
14/03/14 11:42:58 INFO mapred.JobClient: map 100% reduce 100%
14/03/14 11:42:59 INFO mapred.JobClient: Job complete: job_201403131444_0034
14/03/14 11:42:59 INFO mapred.JobClient: Counters: 29
14/03/14 11:42:59 INFO mapred.JobClient: Job Counters
14/03/14 11:42:59 INFO mapred.JobClient: Launched reduce tasks=10
14/03/14 11:42:59 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12784
14/03/14 11:42:59 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/03/14 11:42:59 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/03/14 11:42:59 INFO mapred.JobClient: Launched map tasks=1
14/03/14 11:42:59 INFO mapred.JobClient: Data-local map tasks=1
14/03/14 11:42:59 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=85618
14/03/14 11:42:59 INFO mapred.JobClient: File Output Format Counters
14/03/14 11:42:59 INFO mapred.JobClient: Bytes Written=1040
14/03/14 11:42:59 INFO mapred.JobClient: FileSystemCounters
14/03/14 11:42:59 INFO mapred.JobClient: FILE_BYTES_READ=258
14/03/14 11:42:59 INFO mapred.JobClient: HDFS_BYTES_READ=6924921
14/03/14 11:42:59 INFO mapred.JobClient: FILE_BYTES_WRITTEN=279071
14/03/14 11:42:59 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1040
14/03/14 11:42:59 INFO mapred.JobClient: File Input Format Counters
14/03/14 11:42:59 INFO mapred.JobClient: Bytes Read=6924788
14/03/14 11:42:59 INFO mapred.JobClient: Map-Reduce Framework
14/03/14 11:42:59 INFO mapred.JobClient: Map output materialized bytes=178
14/03/14 11:42:59 INFO mapred.JobClient: Map input records=21578
14/03/14 11:42:59 INFO mapred.JobClient: Reduce shuffle bytes=178
14/03/14 11:42:59 INFO mapred.JobClient: Spilled Records=10
14/03/14 11:42:59 INFO mapred.JobClient: Map output bytes=30
14/03/14 11:42:59 INFO mapred.JobClient: CPU time spent (ms)=7510
14/03/14 11:42:59 INFO mapred.JobClient: Total committed heap usage (bytes)=323293184
14/03/14 11:42:59 INFO mapred.JobClient: Combine input records=5
14/03/14 11:42:59 INFO mapred.JobClient: SPLIT_RAW_BYTES=133
14/03/14 11:42:59 INFO mapred.JobClient: Reduce input records=5
14/03/14 11:42:59 INFO mapred.JobClient: Reduce input groups=5
14/03/14 11:42:59 INFO mapred.JobClient: Combine output records=5
14/03/14 11:42:59 INFO mapred.JobClient: Physical memory (bytes) snapshot=579133440
14/03/14 11:42:59 INFO mapred.JobClient: Reduce output records=5
14/03/14 11:42:59 INFO mapred.JobClient: Virtual memory (bytes) snapshot=4194500608
14/03/14 11:42:59 INFO mapred.JobClient: Map output records=5
14/03/14 11:42:59 INFO cvb.CVB0Driver: About to run iteration 2 of 2
14/03/14 11:42:59 INFO cvb.CVB0Driver: About to run: Iteration 2 of 2, input path: reuters-lda/model/model-1
14/03/14 11:43:00 INFO input.FileInputFormat: Total input paths to process : 1
14/03/14 11:43:00 INFO mapred.JobClient: Running job: job_201403131444_0035
14/03/14 11:43:01 INFO mapred.JobClient: map 0% reduce 0%
14/03/14 11:43:12 INFO mapred.JobClient: map 82% reduce 0%
14/03/14 11:43:15 INFO mapred.JobClient: map 100% reduce 0%
14/03/14 11:43:22 INFO mapred.JobClient: map 100% reduce 6%
14/03/14 11:43:23 INFO mapred.JobClient: map 100% reduce 13%
14/03/14 11:43:24 INFO mapred.JobClient: map 100% reduce 20%
14/03/14 11:43:30 INFO mapred.JobClient: map 100% reduce 23%
14/03/14 11:43:32 INFO mapred.JobClient: map 100% reduce 40%
14/03/14 11:43:39 INFO mapred.JobClient: map 100% reduce 43%
14/03/14 11:43:40 INFO mapred.JobClient: map 100% reduce 50%
14/03/14 11:43:41 INFO mapred.JobClient: map 100% reduce 60%
14/03/14 11:43:49 INFO mapred.JobClient: map 100% reduce 66%
14/03/14 11:43:50 INFO mapred.JobClient: map 100% reduce 80%
14/03/14 11:43:57 INFO mapred.JobClient: map 100% reduce 83%
14/03/14 11:43:59 INFO mapred.JobClient: map 100% reduce 100%
14/03/14 11:43:59 INFO mapred.JobClient: Job complete: job_201403131444_0035
14/03/14 11:43:59 INFO mapred.JobClient: Counters: 29
14/03/14 11:43:59 INFO mapred.JobClient: Job Counters
14/03/14 11:43:59 INFO mapred.JobClient: Launched reduce tasks=10
14/03/14 11:43:59 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=11051
14/03/14 11:43:59 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/03/14 11:43:59 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/03/14 11:43:59 INFO mapred.JobClient: Launched map tasks=1
14/03/14 11:43:59 INFO mapred.JobClient: Data-local map tasks=1
14/03/14 11:43:59 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=87458
14/03/14 11:43:59 INFO mapred.JobClient: File Output Format Counters
14/03/14 11:43:59 INFO mapred.JobClient: Bytes Written=1040
14/03/14 11:43:59 INFO mapred.JobClient: FileSystemCounters
14/03/14 11:43:59 INFO mapred.JobClient: FILE_BYTES_READ=258
14/03/14 11:43:59 INFO mapred.JobClient: HDFS_BYTES_READ=6925961
14/03/14 11:43:59 INFO mapred.JobClient: FILE_BYTES_WRITTEN=288058
14/03/14 11:43:59 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1040
14/03/14 11:43:59 INFO mapred.JobClient: File Input Format Counters
14/03/14 11:43:59 INFO mapred.JobClient: Bytes Read=6924788
14/03/14 11:43:59 INFO mapred.JobClient: Map-Reduce Framework
14/03/14 11:43:59 INFO mapred.JobClient: Map output materialized bytes=178
14/03/14 11:43:59 INFO mapred.JobClient: Map input records=21578
14/03/14 11:43:59 INFO mapred.JobClient: Reduce shuffle bytes=178
14/03/14 11:43:59 INFO mapred.JobClient: Spilled Records=10
14/03/14 11:43:59 INFO mapred.JobClient: Map output bytes=30
14/03/14 11:43:59 INFO mapred.JobClient: CPU time spent (ms)=7080
14/03/14 11:43:59 INFO mapred.JobClient: Total committed heap usage (bytes)=323293184
14/03/14 11:43:59 INFO mapred.JobClient: Combine input records=5
14/03/14 11:43:59 INFO mapred.JobClient: SPLIT_RAW_BYTES=133
14/03/14 11:43:59 INFO mapred.JobClient: Reduce input records=5
14/03/14 11:43:59 INFO mapred.JobClient: Reduce input groups=5
14/03/14 11:43:59 INFO mapred.JobClient: Combine output records=5
14/03/14 11:43:59 INFO mapred.JobClient: Physical memory (bytes) snapshot=579375104
14/03/14 11:43:59 INFO mapred.JobClient: Reduce output records=5
14/03/14 11:43:59 INFO mapred.JobClient: Virtual memory (bytes) snapshot=4192706560
14/03/14 11:43:59 INFO mapred.JobClient: Map output records=5
14/03/14 11:43:59 INFO cvb.CVB0Driver: Completed 2 iterations in 120 seconds
14/03/14 11:43:59 INFO cvb.CVB0Driver: Perplexities: ()
14/03/14 11:43:59 INFO cvb.CVB0Driver: About to run: Writing final topic/term distributions from reuters-lda/model/model-2 to reuters-lda/lda
14/03/14 11:43:59 INFO input.FileInputFormat: Total input paths to process : 10
14/03/14 11:43:59 INFO cvb.CVB0Driver: About to run: Writing final document/topic inference from reuters-lda/reuters-matrix/matrix to reuters-lda/topics
14/03/14 11:44:00 INFO input.FileInputFormat: Total input paths to process : 1
14/03/14 11:44:00 INFO mapred.JobClient: Running job: job_201403131444_0036
14/03/14 11:44:01 INFO mapred.JobClient: map 0% reduce 0%
14/03/14 11:44:11 INFO mapred.JobClient: map 10% reduce 0%
14/03/14 11:44:12 INFO mapred.JobClient: map 20% reduce 0%
14/03/14 11:44:14 INFO mapred.JobClient: map 40% reduce 0%
14/03/14 11:44:16 INFO mapred.JobClient: map 60% reduce 0%
14/03/14 11:44:19 INFO mapred.JobClient: map 80% reduce 0%
14/03/14 11:44:21 INFO mapred.JobClient: map 90% reduce 0%
14/03/14 11:44:22 INFO mapred.JobClient: map 100% reduce 0%
14/03/14 11:44:22 INFO mapred.JobClient: Job complete: job_201403131444_0036
14/03/14 11:44:22 INFO mapred.JobClient: Counters: 19
14/03/14 11:44:22 INFO mapred.JobClient: Job Counters
14/03/14 11:44:22 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=30038
14/03/14 11:44:22 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/03/14 11:44:22 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/03/14 11:44:22 INFO mapred.JobClient: Launched map tasks=10
14/03/14 11:44:22 INFO mapred.JobClient: Data-local map tasks=10
14/03/14 11:44:22 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/03/14 11:44:22 INFO mapred.JobClient: File Output Format Counters
14/03/14 11:44:22 INFO mapred.JobClient: Bytes Written=1040
14/03/14 11:44:22 INFO mapred.JobClient: FileSystemCounters
14/03/14 11:44:22 INFO mapred.JobClient: HDFS_BYTES_READ=2420
14/03/14 11:44:22 INFO mapred.JobClient: FILE_BYTES_WRITTEN=250140
14/03/14 11:44:22 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1040
14/03/14 11:44:22 INFO mapred.JobClient: File Input Format Counters
14/03/14 11:44:22 INFO mapred.JobClient: Bytes Read=1040
14/03/14 11:44:22 INFO mapred.JobClient: Map-Reduce Framework
14/03/14 11:44:22 INFO mapred.JobClient: Map input records=5
14/03/14 11:44:22 INFO mapred.JobClient: Physical memory (bytes) snapshot=364630016
14/03/14 11:44:22 INFO mapred.JobClient: Spilled Records=0
14/03/14 11:44:22 INFO mapred.JobClient: CPU time spent (ms)=800
14/03/14 11:44:22 INFO mapred.JobClient: Total committed heap usage (bytes)=162529280
14/03/14 11:44:22 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3778887680
14/03/14 11:44:22 INFO mapred.JobClient: Map output records=5
14/03/14 11:44:22 INFO mapred.JobClient: SPLIT_RAW_BYTES=1380
14/03/14 11:44:22 INFO mapred.JobClient: Running job: job_201403131444_0037
14/03/14 11:44:23 INFO mapred.JobClient: map 0% reduce 0%
14/03/14 11:54:28 INFO mapred.JobClient: Task Id : attempt_201403131444_0037_m_000000_0, Status : FAILED
org.apache.mahout.math.IndexException: Index 26587 is outside allowable range of [0,0)
    at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:195)
    at org.apache.mahout.clustering.lda.cvb.TopicModel.pTopicGivenTerm(TopicModel.java:374)
    at org.apache.mahout.clustering.lda.cvb.TopicModel.trainDocTopicModel(TopicModel.java:287)
    at org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:41)
    at org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:28)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Task attempt_201403131444_0037_m_000000_0 failed to report status for 600 seconds. Killing!
attempt_201403131444_0037_m_000000_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201403131444_0037_m_000000_0: log4j:WARN Please initialize the log4j system properly.
14/03/14 12:04:35 INFO mapred.JobClient: Task Id : attempt_201403131444_0037_m_000000_1, Status : FAILED
org.apache.mahout.math.IndexException: Index 26587 is outside allowable range of [0,0)
    at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:195)
    at org.apache.mahout.clustering.lda.cvb.TopicModel.pTopicGivenTerm(TopicModel.java:374)
    at org.apache.mahout.clustering.lda.cvb.TopicModel.trainDocTopicModel(TopicModel.java:287)
    at org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:41)
    at org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:28)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Task attempt_201403131444_0037_m_000000_1 failed to report status for 600 seconds. Killing!
attempt_201403131444_0037_m_000000_1: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201403131444_0037_m_000000_1: log4j:WARN Please initialize the log4j system properly.
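
The dictionary check suggested at the top of this thread could be sketched roughly as follows. One thing that stands out in the cvb command is that -dict is given as reuters-lda/vectors/dictonary.file-*, whereas seq2sparse normally names its dictionary dictionary.file-0. If that glob matches no files, it would be consistent with the reported numTerms: 0 and with the later "Index 26587 is outside allowable range of [0,0)" failure, though this is an inference from the log rather than something confirmed here. The helper name count_matches is hypothetical and the demo runs against a scratch directory on the local filesystem. The HDFS commands at the end are only commented suggestions that assume the output lives under the relative paths used above.

```shell
# Hypothetical helper: print how many existing paths a glob pattern expands to.
count_matches() {
  n=0
  for f in $1; do                 # left unquoted on purpose so the glob expands
    [ -e "$f" ] && n=$((n + 1))   # an unmatched glob stays literal and fails -e
  done
  echo "$n"
}

# Scratch directory that mimics the seq2sparse output layout (local stand-in).
demo=$(mktemp -d)
mkdir -p "$demo/vectors"
: > "$demo/vectors/dictionary.file-0"

count_matches "$demo/vectors/dictionary.file-*"   # prints 1
count_matches "$demo/vectors/dictonary.file-*"    # prints 0 (spelling used with -dict)

rm -rf "$demo"

# Against the real output, similar checks might look like this (not run here):
#   hadoop fs -ls reuters-lda/vectors/
#   bin/mahout seqdumper -i reuters-lda/vectors/dictionary.file-0 | head
```

If the listing shows dictionary.file-0 with a non-trivial size, rerunning cvb with the corrected -dict path would be the next thing to try.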