Re: Review Request: Row mean job for PCA

Lance Norskog Sun, 11 Dec 2011 23:59:37 -0800

NullWritable is used in two places here: as the key between mapper and
reducer, and as the key of the pairs saved in a SequenceFile. As the
mapper->reducer key, it works.
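
On the single-reducer question: a constant key always hashes to the
same partition, so every mapper output reaches the same reduce task
however many are configured. A minimal sketch in plain Java, assuming
the job uses hash partitioning (Hadoop's default HashPartitioner
computes the same expression). PartitionSketch is a hypothetical name,
not part of the patch:

```java
final class PartitionSketch {
    /**
     * Picks a reduce task for a key by hashing it.
     * Assumption: the job has not installed a custom partitioner.
     */
    static int partition(Object key, int numReduceTasks) {
        // Mask the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

With more than one reduce task configured, the remaining tasks would
simply receive no input.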


In Mahout, vectors and matrices in SequenceFiles are stored as
<IntWritable,VectorWritable> pairs. Even though this job runs in the
middle of another job, its output should follow that convention.

You do need exactly one reducer, so combiners may be worthwhile.
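
A minimal sketch of the combiner idea in plain Java, without Hadoop or
Mahout types. Each mapper or combiner emits a row count together with a
column-wise sum, and the single reducer merges them and divides once.
PartialSum and its methods are hypothetical names standing in for the
IntVectorTupleWritable mentioned in the review request:

```java
import java.util.List;

/** Hypothetical partial result: number of rows seen plus their column-wise sum. */
final class PartialSum {
    final long rows;
    final double[] sum;

    PartialSum(long rows, double[] sum) {
        this.rows = rows;
        this.sum = sum;
    }

    /** What one mapper would emit for the rows it has seen. */
    static PartialSum ofRows(List<double[]> rows, int columns) {
        double[] sum = new double[columns];
        for (double[] row : rows) {
            for (int j = 0; j < columns; j++) {
                sum[j] += row[j];
            }
        }
        return new PartialSum(rows.size(), sum);
    }

    /** Associative merge, usable in both a combiner and the reducer. */
    PartialSum merge(PartialSum other) {
        double[] merged = sum.clone();
        for (int j = 0; j < merged.length; j++) {
            merged[j] += other.sum[j];
        }
        return new PartialSum(rows + other.rows, merged);
    }

    /** Final reducer step: divide the total sum by the total row count. */
    double[] mean() {
        double[] mean = new double[sum.length];
        for (int j = 0; j < sum.length; j++) {
            mean[j] = sum[j] / rows;
        }
        return mean;
    }
}
```

Because the merge only adds counts and sums, it gives the same result
whether it runs zero, one, or many times before the reducer, which is
the property a combiner needs.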

The person using this job knows the right vector type to use. The job
may receive many sparse vectors whose mean is a dense vector. Or the
output may be a vector that writes to a database. Or something else.
In fact, I may just want to convert a vector from dense to sparse, and
I could achieve that with this job.
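
A rough sketch of letting the caller pick the output vector type, in
plain Java. SimpleVector, DenseSimpleVector, SparseSimpleVector and
RowMean are hypothetical stand-ins, not Mahout classes. The caller
passes a factory, and the mean is written into whatever it creates:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.IntFunction;

/** Hypothetical minimal vector interface (a stand-in, not Mahout's Vector). */
interface SimpleVector {
    int size();
    double get(int index);
    void set(int index, double value);
}

final class DenseSimpleVector implements SimpleVector {
    private final double[] values;
    DenseSimpleVector(int size) { values = new double[size]; }
    public int size() { return values.length; }
    public double get(int i) { return values[i]; }
    public void set(int i, double v) { values[i] = v; }
}

final class SparseSimpleVector implements SimpleVector {
    private final int size;
    private final Map<Integer, Double> nonZeros = new TreeMap<>();
    SparseSimpleVector(int size) { this.size = size; }
    public int size() { return size; }
    public double get(int i) { return nonZeros.getOrDefault(i, 0.0); }
    public void set(int i, double v) {
        // Store only non-zero entries.
        if (v == 0.0) nonZeros.remove(i); else nonZeros.put(i, v);
    }
    int storedEntries() { return nonZeros.size(); }
}

final class RowMean {
    /** Column-wise mean of a non-empty list of rows, written into a caller-chosen vector. */
    static <V extends SimpleVector> V mean(List<? extends SimpleVector> rows,
                                           IntFunction<V> factory) {
        int columns = rows.get(0).size();
        V result = factory.apply(columns);
        for (int j = 0; j < columns; j++) {
            double sum = 0.0;
            for (SimpleVector row : rows) {
                sum += row.get(j);
            }
            result.set(j, sum / rows.size());
        }
        return result;
    }
}
```

Calling RowMean.mean with a single dense row and SparseSimpleVector::new
as the factory yields a sparse copy of that row, which is the
dense-to-sparse conversion described above.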

On Sun, Dec 11, 2011 at 7:58 PM, Raphael Cendrillon
<[email protected]> wrote:
>
>
>> On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
>> > Hm. I hope I did not misread the code or miss something.
>> >
>> > 1 -- I am not sure this will actually work as intended unless the # of
>> > reducers is coerced to 1, of which I see no mention in the code.
>> > 2 -- mappers do nothing, passing on all the row pressure to sort, which is
>> > absolutely not necessary, even if you use combiners. This is going to be
>> > especially the case if you coerce 1 reducer and no combiners. IMO mean
>> > computation should be pushed up to mappers to avoid the sort pressure of
>> > MapReduce. Then reduction becomes largely symbolic (but you do need to pass
>> > the # of rows each mapper has seen on to the reducer, in order for that
>> > operation to apply correctly).
>> > 3 -- I am not sure: is NullWritable legitimate as a key? In my experience the
>> > sequence file reader cannot instantiate it, because NullWritable is a
>> > singleton and its creation is prohibited by a private constructor.
>
> Thanks Dmitriy.
>
> Regarding 1, if I understand correctly the number of reducers depends on the 
> number of unique keys. Since all keys are set to the same value (null), then 
> all of the mapper outputs should arrive at the same reducer. This seems to 
> work in the unit test, but I may be missing something?
>
> Regarding 2, that makes a lot of sense. I'm wondering how many rows should be
> processed per mapper? I guess there is a trade-off between scalability
> (processing more rows within a single map job means that each row must have
> fewer columns) and speed? Is there someplace in the SSVD code where the
> matrix is split into slices of rows that I could use as a reference?
>
> Regarding 3, I believe NullWritable is OK. It's used pretty extensively in
> TimesSquaredJob in DistributedRowMatrix. However if you feel there is some
> disadvantage to this I could replace "NullWritable.get()" with "new 
> IntWritable(1)" (that is, set all of the keys to 1). Would that be more 
> suitable?
>
>
> - Raphael
>
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/3147/#review3838
> -----------------------------------------------------------
>
>
> On 2011-12-12 00:30:24, Raphael Cendrillon wrote:
>>
>> -----------------------------------------------------------
>> This is an automatically generated e-mail. To reply, visit:
>> https://reviews.apache.org/r/3147/
>> -----------------------------------------------------------
>>
>> (Updated 2011-12-12 00:30:24)
>>
>>
>> Review request for mahout.
>>
>>
>> Summary
>> -------
>>
>> Here's a patch with a simple job to calculate the row mean (column-wise
>> mean). One outstanding issue is the combiner, which requires a writable class
>> IntVectorTupleWritable, where the Int stores the number of rows and the
>> Vector stores the column-wise sum.
>>
>>
>> This addresses bug MAHOUT-923.
>>     https://issues.apache.org/jira/browse/MAHOUT-923
>>
>>
>> Diffs
>> -----
>>
>>   /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095
>>   /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
>>   /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095
>>
>> Diff: https://reviews.apache.org/r/3147/diff
>>
>>
>> Testing
>> -------
>>
>> JUnit test
>>
>>
>> Thanks,
>>
>> Raphael
>>
>>
>



-- 
Lance Norskog
[email protected]
