[
https://issues.apache.org/jira/browse/MAHOUT-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved MAHOUT-577.
------------------------------
Resolution: Not A Problem
So the conclusion here is that it's more or less working as intended? Yes, I
agree: it's going to take a while to compute billions of row-row pairs, and
that's not a sparse/dense issue. I can't think of any obvious options that
could be added to trade off accuracy for speed.
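To put a rough number on "billions", here is a minimal sketch that counts the unordered pairs of distinct rows for the row count given in the report. It assumes the worst case, where every pair of rows co-occurs in at least one column; the real job may emit fewer (or, counted per column, more) records. The function name max_row_pairs is made up for this illustration.

```python
def max_row_pairs(num_rows: int) -> int:
    """Number of unordered pairs of distinct rows: n * (n - 1) / 2."""
    return num_rows * (num_rows - 1) // 2


rows = 146682  # row count from the report
pairs = max_row_pairs(rows)

# The count grows quadratically: doubling the rows roughly quadruples the pairs.
print(f"{rows:,} rows -> up to {pairs:,} row-row pairs")
print(f"{2 * rows:,} rows -> up to {max_row_pairs(2 * rows):,} row-row pairs")
```

For 146682 rows that upper bound is 10,757,731,221 pairs, which is why the 100 x 100 toy matrix (4,950 pairs at most) finishes immediately.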
I can imagine small tweaks to reduce the amount of data written, like writing
floats instead of doubles. If it's I/O bound then saving 8 bytes per datum
might be non-trivial. I bet a closer look will turn up other optimizations. But
none of this will change the O() of the computation.
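As a back-of-the-envelope check on the float-versus-double idea, the sketch below compares serialized record sizes. The record layout (one 4-byte int column index plus two numeric values) is an assumption made for illustration, not something taken from the job's source; with a different layout the saving per record would differ.

```python
import struct

# Hypothetical record layout (an assumption, not taken from Mahout's source):
# a 4-byte int column index followed by two numeric values, big-endian, unpadded.
DOUBLE_RECORD = struct.Struct(">idd")  # values stored as 8-byte doubles
FLOAT_RECORD = struct.Struct(">iff")   # values stored as 4-byte floats


def bytes_saved(num_records: int) -> int:
    """Total bytes saved by writing floats instead of doubles."""
    return num_records * (DOUBLE_RECORD.size - FLOAT_RECORD.size)


print("bytes per record with doubles:", DOUBLE_RECORD.size)
print("bytes per record with floats: ", FLOAT_RECORD.size)
print("GB saved per billion records: ", bytes_saved(1_000_000_000) / 1e9)
```

Under that assumed layout the saving is a constant factor per record, so it would shrink the spill and shuffle volume but leave the number of records unchanged.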
I suppose another "workaround" is to not feed in so many rows and compute only
on a subset that is of interest. If there is some useful way to filter out the
uninteresting rows (small norm? not sure) then maybe that could become a new
option to the job.
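A minimal sketch of what such a pre-filter could look like, done outside the job before the matrix is fed in. Everything here is hypothetical: the sparse-row representation, the names l2_norm and filter_rows, and the min_norm threshold are illustrative only, and whether a small norm really marks a row as uninteresting is an open question.

```python
import math
from typing import Dict, Iterable, Iterator, Tuple

# A sparse row is modelled as {column index: value}; this is an assumed format.
SparseRow = Dict[int, float]


def l2_norm(row: SparseRow) -> float:
    """Euclidean norm of a sparse row."""
    return math.sqrt(sum(value * value for value in row.values()))


def filter_rows(
    rows: Iterable[Tuple[int, SparseRow]], min_norm: float
) -> Iterator[Tuple[int, SparseRow]]:
    """Yield only the (row id, row) entries whose norm is at least min_norm."""
    for row_id, row in rows:
        if l2_norm(row) >= min_norm:
            yield row_id, row


matrix = [
    (0, {1: 3.0, 2: 4.0}),  # norm 5.0, kept
    (1, {0: 0.01}),         # norm 0.01, dropped
    (2, {}),                # empty row, norm 0.0, dropped
]
kept = list(filter_rows(matrix, min_norm=0.1))
print([row_id for row_id, _ in kept])
```

Since the pair count is quadratic in the number of rows, dropping half the rows would cut the work to roughly a quarter.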
> RowSimilarityJob hangs during CooccurrencesMapper
> -------------------------------------------------
>
> Key: MAHOUT-577
> URL: https://issues.apache.org/jira/browse/MAHOUT-577
> Project: Mahout
> Issue Type: Bug
> Components: Collaborative Filtering
> Affects Versions: 0.4
> Environment: Linux Debian 5.0.5, 12GB Ram, Hadoop 20.3 installation
> Reporter: Maya Hristakeva
> Fix For: 0.5
>
>
> Hello,
> When trying to run a RowSimilarityJob on a matrix ( 146682 x 138351 ), the
> job gets through the RowWeightMapper and WeightedOccurrencesPerColumnReducer,
> and hangs during the CooccurrencesMapper although it shows that the map tasks
> are 100% complete.
> The command I use to run the job is:
> hadoop jar mahout-core-0.4-job.jar \
>   org.apache.mahout.math.hadoop.similarity.RowSimilarityJob \
>   -Dmapred.input.dir=/user/maya.hristakeva/mahout/core4/tf/1/0.001/title/12_07_10/lda/5/lda-sim/ldaCompressedDocumentsMatrix \
>   -Dmapred.output.dir=/user/maya.hristakeva/mahout/core4/tf/1/0.001/title/12_07_10/lda/5/lda-sim/ldaDocumentSimilarityMatrix \
>   -Dmapred.reduce.tasks=8 -Dmapred.map.tasks=200 \
>   -Dmapred.job.name=LDA_ROW_SIMILARITY_TEST \
>   --tempDir /user/maya.hristakeva/temp/lda/5 \
>   --numberOfColumns 138351 \
>   --similarityClassname org.apache.mahout.math.hadoop.similarity.vector.DistributedEuclideanDistanceVectorSimilarity \
>   --maxSimilaritiesPerRow 10
> The output of the mappers, which show as 100% complete but are hanging, is:
> syslog logs
> 01-05 18:30:00,835 INFO org.apache.hadoop.mapred.MapTask: bufstart = 29085149; bufend = 39038598; bufvoid = 99614720
> 2011-01-05 18:30:00,835 INFO org.apache.hadoop.mapred.MapTask: kvstart = 65461; kvend = 327605; length = 327680
> 2011-01-05 18:30:06,241 INFO org.apache.hadoop.mapred.MapTask: Finished spill 94
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: bufstart = 39038598; bufend = 48983989; bufvoid = 99614720
> 2011-01-05 18:30:09,208 INFO org.apache.hadoop.mapred.MapTask: kvstart = 327605; kvend = 262068; length = 327680
> 2011-01-05 18:30:14,528 INFO org.apache.hadoop.mapred.MapTask: Finished spill 95
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: bufstart = 48983989; bufend = 58929384; bufvoid = 99614720
> 2011-01-05 18:30:17,328 INFO org.apache.hadoop.mapred.MapTask: kvstart = 262068; kvend = 196531; length = 327680
> 2011-01-05 18:30:22,615 INFO org.apache.hadoop.mapred.MapTask: Finished spill 96
> .
> .
> .
> This problem does not occur when I use a toy matrix of 100 x 100, but once I
> give it the original matrix of ..... the problem is always reproducible.
> Any ideas on what could be causing this?
> Thanks,
> Maya Hristakeva
--
This message is automatically generated by JIRA.