[
https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167338#comment-13167338
]
[email protected] commented on MAHOUT-923:
------------------------------------------------------
bq. On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
bq. > Hm. I hope I did not misread the code or miss something.
bq. >
bq. > 1 -- I am not sure this will actually work as intended unless the # of
reducers is coerced to 1, of which I see no mention in the code.
bq. > 2 -- mappers do nothing, passing on all the row pressure to the sort, which
is absolutely not necessary, even if you use combiners. This is going to be
especially the case if you coerce 1 reducer and no combiners. IMO mean
computation should be pushed up to the mappers to avoid the sort pressures of
MapReduce. Then reduction becomes largely symbolic (but you do need to pass on
the # of rows the mapper has seen to the reducer in order for that operation to
apply correctly).
bq. > 3 -- I am not sure -- is NullWritable as a key legit? In my experience the
sequence file reader cannot instantiate it because NullWritable is a singleton
and its creation is prohibited by making the constructor private.
Thanks, Dmitriy.
Regarding 1, if I understand correctly, the number of reducers depends on the
number of unique keys. Since all keys are set to the same value (null), all of
the mapper outputs should arrive at the same reducer. This seems to work in the
unit test, but I may be missing something.
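To make the question concrete, here is a minimal sketch of why a single constant key ends up on one reducer. It assumes the default hash partitioning rule (key hash code masked to be non-negative, taken modulo the number of reduce tasks). The class and method names are hypothetical, and it uses plain Java rather than the Hadoop API. As far as I know the reducer count itself is a job setting, so if more than one reducer were configured the others would simply receive no records.

```java
public class ConstantKeyPartitionSketch {

    // Assumed partitioning rule: non-negative hash modulo the reducer count.
    public static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Stand-in for a key that is identical for every mapper output record.
        Integer constantKey = 1;
        for (int reducers = 1; reducers <= 4; reducers++) {
            // Every record carries the same key, so every record gets this partition.
            System.out.println(reducers + " reducer(s): partition "
                    + partitionFor(constantKey, reducers));
        }
    }
}
```

Whatever the reducer count, the partition for the constant key is the same for every record, so all mapper outputs meet in one reduce call.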
Regarding 2, that makes a lot of sense. I'm wondering how many rows should be
processed per mapper? I guess there is a trade-off between scalability
(processing more rows within a single map job means that each row must have
fewer columns) and speed? Is there someplace in the SSVD code where the matrix
is split into slices of rows that I could use as a reference?
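If I read the suggestion correctly, each mapper would keep one running column-wise sum plus a row count, and the reducer would only merge those partials. Below is a rough sketch of that idea in plain Java, not the actual patch: the names RowMeanSketch, Partial, mapSplit and reduce are made up for illustration, and double[] stands in for a Mahout Vector. In this form the per-mapper state is one vector and one counter, so it should not grow with the number of rows in the split.

```java
import java.util.Arrays;

public class RowMeanSketch {

    /** Partial result from one mapper: rows seen and their column-wise sum. */
    public static final class Partial {
        public final int rows;
        public final double[] sum;

        public Partial(int rows, double[] sum) {
            this.rows = rows;
            this.sum = sum;
        }
    }

    /** Map side: accumulate all rows of one split locally, emit one partial. */
    public static Partial mapSplit(double[][] split, int numCols) {
        double[] sum = new double[numCols];
        for (double[] row : split) {
            for (int j = 0; j < numCols; j++) {
                sum[j] += row[j];
            }
        }
        return new Partial(split.length, sum);
    }

    /** Reduce side: merge the partials, then divide by the total row count. */
    public static double[] reduce(Iterable<Partial> partials, int numCols) {
        double[] total = new double[numCols];
        long rows = 0;
        for (Partial p : partials) {
            rows += p.rows;
            for (int j = 0; j < numCols; j++) {
                total[j] += p.sum[j];
            }
        }
        if (rows == 0) {
            return total; // no rows seen: this sketch just returns zeros
        }
        for (int j = 0; j < numCols; j++) {
            total[j] /= rows;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] splitA = {{1, 2}, {3, 4}}; // rows seen by a first mapper
        double[][] splitB = {{5, 6}};         // rows seen by a second mapper
        double[] mean = reduce(
                Arrays.asList(mapSplit(splitA, 2), mapSplit(splitB, 2)), 2);
        System.out.println(Arrays.toString(mean)); // prints [3.0, 4.0]
    }
}
```

Carrying the row count with each partial is what keeps the result correct when splits have different sizes. Averaging the per-mapper means directly would weight the two splits above equally.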
Regarding 3, I believe NullWritable is OK. It's used pretty extensively in
TimesSquaredJob in DistributedRowMatrix. However, if you feel there is some
disadvantage to this I could replace "NullWritable.get()" with "new
IntWritable(1)" (that is, set all of the keys to 1). Would that be more
suitable?
- Raphael
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/#review3838
-----------------------------------------------------------
On 2011-12-12 00:30:24, Raphael Cendrillon wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/3147/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2011-12-12 00:30:24)
bq.
bq.
bq. Review request for mahout.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. Here's a patch with a simple job to calculate the row mean (column-wise
mean). One outstanding issue is the combiner: it requires a writable class,
IntVectorTupleWritable, where the Int stores the number of rows and the Vector
stores the column-wise sum.
bq.
bq.
bq. This addresses bug MAHOUT-923.
bq. https://issues.apache.org/jira/browse/MAHOUT-923
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095
bq. /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
bq. /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095
bq.
bq. Diff: https://reviews.apache.org/r/3147/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq. JUnit test
bq.
bq.
bq. Thanks,
bq.
bq. Raphael
bq.
bq.
> Row mean job for PCA
> --------------------
>
> Key: MAHOUT-923
> URL: https://issues.apache.org/jira/browse/MAHOUT-923
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Affects Versions: 0.6
> Reporter: Raphael Cendrillon
> Assignee: Raphael Cendrillon
> Fix For: Backlog
>
> Attachments: MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a
> Distributed Row Matrix for use in PCA.
--
This message is automatically generated by JIRA.