[ 
https://issues.apache.org/jira/browse/MAHOUT-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977991#action_12977991
 ] 

Sebastian Schelter commented on MAHOUT-577:
-------------------------------------------

I'd also suggest that you check the job's status, as Joris recommends.

Also be aware that CooccurrencesMapper emits every cooccurring column
pair for each vector. The number of pairs per vector is n*(n-1)/2,
where n is the number of non-zero entries in the vector. This number
grows quadratically, so it is very important to use the job only on
sparse matrices and to prune rows with lots of entries if possible.
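
To make the quadratic growth concrete, here is a minimal Python sketch. It is
not Mahout code: the function names, the dict-based row representation and
the pruning threshold are all assumptions made for illustration. It counts
the pairs one row would produce, lists them for a small row, and drops rows
that have too many non-zero entries.

```python
from itertools import combinations


def pairs_per_row(n):
    """Number of unordered column pairs for a row with n non-zero entries."""
    return n * (n - 1) // 2


def nonzero_count(row):
    """Count the non-zero entries of a row stored as {column_index: value}."""
    return sum(1 for value in row.values() if value != 0)


def cooccurring_pairs(row):
    """List every unordered pair of columns that are non-zero in the row."""
    columns = sorted(col for col, value in row.items() if value != 0)
    return list(combinations(columns, 2))


def prune_rows(rows, max_nonzero):
    """Keep only the rows with at most max_nonzero non-zero entries."""
    return {row_id: row for row_id, row in rows.items()
            if nonzero_count(row) <= max_nonzero}


# Hypothetical data: one sparse row and one row with 1000 non-zero entries.
rows = {
    0: {1: 0.5, 4: 0.2, 7: 0.3},
    1: {col: 1.0 for col in range(1000)},
}

for row_id, row in rows.items():
    n = nonzero_count(row)
    print(row_id, n, pairs_per_row(n))

pruned = prune_rows(rows, max_nonzero=100)
```

Under these assumptions, a fully dense row with 138351 columns would produce
pairs_per_row(138351) = 9570430425 pairs on its own, which is why dense
input is a problem for this kind of job.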

It looks like you are applying this to the output of LDA. I'm not familiar
with the structure of the output matrix that algorithm produces, but could
it be that it's very dense?

> RowSimilarityJob hangs during CooccurrencesMapper
> -------------------------------------------------
>
>                 Key: MAHOUT-577
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-577
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>         Environment: Linux Debian 5.0.5, 12GB Ram, Hadoop 20.3 installation 
>            Reporter: Maya Hristakeva
>            Priority: Blocker
>
> Hello,
> When trying to run a RowSimilarityJob on a matrix ( 146682 x 138351 ), the 
> job gets through the RowWeightMapper and WeightedOccurrencesPerColumnReducer, 
> and hangs during the CooccurrencesMapper although it shows that the map tasks 
> are 100% complete. 
> The command I use to run the job is: 
> hadoop jar mahout-core-0.4-job.jar 
> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob 
> -Dmapred.input.dir=/user/maya.hristakeva/mahout/core4/tf/1/0.001/title/12_07_10/lda/5/lda-sim/ldaCompressedDocumentsMatrix
>  
> -Dmapred.output.dir=/user/maya.hristakeva/mahout/core4/tf/1/0.001/title/12_07_10/lda/5/lda-sim/ldaDocumentSimilarityMatrix
>  -Dmapred.reduce.tasks=8 -Dmapred.map.tasks=200 
> -Dmapred.job.name=LDA_ROW_SIMILARITY_TEST --tempDir 
> /user/maya.hristakeva/temp/lda/5 --numberOfColumns 138351 
> --similarityClassname 
> org.apache.mahout.math.hadoop.similarity.vector.DistributedEuclideanDistanceVectorSimilarity
>  --maxSimilaritiesPerRow 10
> And the output of the mappers which are 100% complete, but hanging is: 
> syslog logs
> 01-05 18:30:00,835 INFO org.apache.hadoop.mapred.MapTask: bufstart = 
> 29085149; bufend = 39038598; bufvoid = 99614720
> 2011-01-05 18:30:00,835 INFO org.apache.hadoop.mapred.MapTask: kvstart = 
> 65461; kvend = 327605; length = 327680
> 2011-01-05 18:30:06,241 INFO org.apache.hadoop.mapred.MapTask: Finished spill 
> 94
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: Spilling map 
> output: record full = true
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: bufstart = 
> 39038598; bufend = 48983989; bufvoid = 99614720
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: kvstart = 
> 327605; kvend = 262068; length = 327680
> 2011-01-05 18:30:14,528 INFO org.apache.hadoop.mapred.MapTask: Finished spill 
> 95
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: Spilling map 
> output: record full = true
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: bufstart = 
> 48983989; bufend = 58929384; bufvoid = 99614720
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: kvstart = 
> 262068; kvend = 196531; length = 327680
> 2011-01-05 18:30:22,615 INFO org.apache.hadoop.mapred.MapTask: Finished spill 
> 96
> .
> .
> .
> This problem does not occur when I use a toy matrix of 100 x 100, but once I 
> give it the original matrix of ..... the problem is always reproducible. 
> Any ideas on what could be causing this? 
> Thanks, 
> Maya Hristakeva

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
