You've convinced me that this is probably a bad idea. You never know when this
might come back to bite you later.
On 12 Dec, 2011, at 12:50 AM, Dmitriy Lyubimov wrote:
> Oh, now I remember what the deal with NullWritable was.
>
> Yes, a sequence file would read it, as in:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.SequenceFile;
>
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.getLocal(conf);
> Path testPath = new Path("name.seq");
>
> // write a file keyed on the NullWritable singleton
> IntWritable iw = new IntWritable();
> SequenceFile.Writer w =
>     SequenceFile.createWriter(fs, conf, testPath,
>                               NullWritable.class, IntWritable.class);
> w.append(NullWritable.get(), iw);
> w.close();
>
> // read it back explicitly; next() fills in the passed writables
> SequenceFile.Reader r = new SequenceFile.Reader(fs, testPath, conf);
> while (r.next(NullWritable.get(), iw)) { }
> r.close();
>
>
> but SequenceFileInputFormat would not. I.e. it is OK if you read it
> explicitly, but I don't think one can use such files as input for
> other MR jobs.
>
> But since in this case there's no MR job to consume that output (and
> there unlikely ever will be), I guess it is OK to save NullWritable in
> this case...
>
> -d
>
> On Mon, Dec 12, 2011 at 12:30 AM, [email protected]
> (Commented) (JIRA) <[email protected]> wrote:
>>
>> [
>> https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167406#comment-13167406
>> ]
>>
>> [email protected] commented on MAHOUT-923:
>> ------------------------------------------------------
>>
>>
>>
>> bq. On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
>> bq. > Hm. I hope I did not misread the code or miss something.
>> bq. >
>> bq. > 1 -- I am not sure this will actually work as intended unless the #
>> of reducers is coerced to 1, of which I see no mention in the code.
>> bq. > 2 -- mappers do nothing, passing all the row pressure on to the
>> sort, which is absolutely not necessary even if you use combiners. This is
>> going to be especially the case if you coerce 1 reducer and no combiners.
>> IMO the mean computation should be pushed up to the mappers to avoid the
>> sort pressures of map reduce. Then reduction becomes largely symbolic (but
>> you do need to pass the # of rows the mapper has seen on to the reducer,
>> in order for that operation to apply correctly).
>> bq. > 3 -- I am not sure -- is NullWritable as a key legit? In my
>> experience the sequence file reader cannot instantiate it because
>> NullWritable is a singleton and its creation is prohibited by making the
>> constructor private.
>> bq.
>> bq. Raphael Cendrillon wrote:
>> bq. Thanks, Dmitriy.
>> bq.
>> bq. Regarding 1, if I understand correctly the number of reducers
>> depends on the number of unique keys. Since all keys are set to the same
>> value (null), all of the mapper outputs should arrive at the same reducer.
>> This seems to work in the unit test, but I may be missing something?
>> bq.
>> bq. Regarding 2, that makes a lot of sense. I'm wondering how many rows
>> should be processed per mapper? I guess there is a trade-off between
>> scalability (processing more rows within a single map job means that each
>> row must have fewer columns) and speed? Is there someplace in the SSVD code
>> where the matrix is split into slices of rows that I could use as a
>> reference?
>> bq.
>> bq. Regarding 3, I believe NullWritable is OK. It's used pretty
>> extensively in TimesSquaredJob in DistributedRowMatrix. However, if you
>> feel there is some disadvantage to this I could replace "NullWritable.get()"
>> with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be
>> more suitable?
>> bq.
>> bq.
>>
>> The NullWritable objection is withdrawn. Apparently I haven't looked into
>> Hadoop in too long; amazingly, it seems to work now.
>>
>>
>> 1 -- I don't think your statement about the # of reduce tasks is true.
>>
>> The job (or, rather, the user) sets the number of reduce tasks via a config
>> property. Most users will follow the Hadoop recommendation and set that to
>> ~95% of the capacity they want to take (usually the whole cluster). So in a
>> production environment you are virtually _guaranteed_ to get something like
>> 75 reducers on a 40-node cluster, and consequently 75 output files (unless
>> users really read the details of your job and figure out you meant it to be
>> just 1).
>> Now, it is true that only one file will actually end up containing anything;
>> the rest of the task slots will just be occupied doing nothing.
>>
>> So there are two problems with that scheme: a) a job that allocates so many
>> task slots that do nothing is not a good citizen, since a real production
>> cluster is always shared among multiple jobs; b) your code assumes the
>> result will end up in partition 0, whereas contractually it may end up in
>> any of the 75 files (in reality, with the default hash partitioner and key
>> 1, it will wind up in partition 0001, unless there is only one reducer, as
>> I guess was the case in your test).
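>>
>> (To see b) concretely: the default HashPartitioner is, modulo version
>> drift, just
>>
>> public int getPartition(K key, V value, int numReduceTasks) {
>>   return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>> }
>>
>> and IntWritable(1).hashCode() is 1, so with 75 reducers the one non-empty
>> group lands in partition 1 % 75 = 1, i.e. part-00001, not part-00000.)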
>>
>> 2 -- It is simple. When you send n rows to reducers, they are shuffled and
>> sorted. Sending massive sets to reducers has two effects: first, even if
>> they all group under the same key, they are still sorted at a cost of
>> ~ n log(n/p), where p is the number of partitions, assuming a uniform
>> distribution (which this is not, because you are sending everything to the
>> same place). Just because we can run a distributed sort doesn't mean we
>> should. Secondly, all these rows are physically moved to the reduce tasks,
>> which is still ~n rows of i/o. Finally, what makes your case especially
>> problematic is that you are sending everything to the same reducer, i.e.
>> you are not actually sorting in a distributed way but rather doing a simple
>> single-threaded sort at the one reducer that happens to get all the input.
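>>
>> (To put rough illustrative numbers on it: with n = 10^8 rows and p = 75
>> partitions, n log2(n/p) is about 10^8 * 20 = 2 * 10^9 comparisons, plus
>> moving all 10^8 rows across the network, versus emitting one pre-summed
>> vector per map task as described below.)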
>>
>> So that scheme allocates a lot of task slots that are not used; does a sort
>> that is not needed; and does it in a single reducer thread over the entire
>> input, which is not parallel at all.
>>
>> Instead, consider this: each map task keeps a state consisting of
>> (sum(X), k). It keeps updating it (sum += x, k++) for every new row x. At
>> the end of the cycle (in cleanup) it writes only one tuple, (sum(X), k),
>> as output. So we have reduced the complexity of the sort and the i/o from
>> millions of elements down to just the # of maps (which is perhaps a
>> handful and in reality rarely overshoots 500 mappers). That is at least
>> about 4 orders of magnitude.
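>>
>> A minimal sketch of what such a mapper could look like, using the
>> IntVectorTupleWritable (k, sum(X)) pair writable that the review request
>> mentions; the class and all names here are illustrative, not the actual
>> patch:
>>
>> import java.io.IOException;
>> import org.apache.hadoop.io.IntWritable;
>> import org.apache.hadoop.mapreduce.Mapper;
>> import org.apache.mahout.math.DenseVector;
>> import org.apache.mahout.math.Vector;
>> import org.apache.mahout.math.VectorWritable;
>>
>> public class RowMeanMapper extends
>>     Mapper<IntWritable, VectorWritable, IntWritable, IntVectorTupleWritable> {
>>
>>   private Vector sum;  // running column-wise sum, sum(X)
>>   private int k;       // number of rows this mapper has seen
>>
>>   @Override
>>   protected void map(IntWritable row, VectorWritable v, Context ctx) {
>>     if (sum == null) {
>>       sum = new DenseVector(v.get().size());
>>     }
>>     sum = sum.plus(v.get());  // sum += x
>>     k++;                      // k++
>>   }
>>
>>   @Override
>>   protected void cleanup(Context ctx)
>>       throws IOException, InterruptedException {
>>     if (sum != null) {
>>       // exactly one output tuple per map task, under a constant key
>>       ctx.write(new IntWritable(1), new IntVectorTupleWritable(k, sum));
>>     }
>>   }
>> }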
>>
>> Now, we send that handful of tuples to a single reducer which just does the
>> combining (sum(X) += sum_i(X); n += n_i, where i indexes the tuples the
>> reducer sees). And because it is only a handful, the reducer also runs very
>> quickly, so the fact that we coerced it to be 1 is pretty benign. A volume
>> of anywhere between 1 and 500 vectors to sum up doesn't warrant distributed
>> computation.
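>>
>> The matching reducer, in the same illustrative vein (getInt() and
>> getVector() are assumed accessors on that hypothetical tuple writable):
>>
>> import org.apache.hadoop.mapreduce.Reducer;
>>
>> public class RowMeanReducer extends
>>     Reducer<IntWritable, IntVectorTupleWritable, IntWritable, VectorWritable> {
>>
>>   @Override
>>   protected void reduce(IntWritable key,
>>                         Iterable<IntVectorTupleWritable> tuples, Context ctx)
>>       throws IOException, InterruptedException {
>>     Vector total = null;
>>     long n = 0;
>>     for (IntVectorTupleWritable t : tuples) {
>>       // sum(X) += sum_i(X); n += n_i
>>       total = (total == null) ? t.getVector() : total.plus(t.getVector());
>>       n += t.getInt();
>>     }
>>     // the mean row is sum(X) / n
>>     ctx.write(key, new VectorWritable(total.divide(n)));
>>   }
>> }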
>>
>> But you have to make sure there's only 1 reducer no matter what the user
>> put into the config, and you have to make sure you do all the heavy lifting
>> in the mappers.
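>>
>> Coercing it is a single line in the driver, and it overrides whatever the
>> user put into mapred.reduce.tasks (again just a sketch):
>>
>> Job job = new Job(conf, "DistributedRowMatrix row mean");
>> job.setNumReduceTasks(1);  // ignore the user's configured reducer count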
>>
>> Finally, you don't even need to coerce to 1 reducer. You could still have
>> several (with uniformly distributed keys) and do the final combine in the
>> front end of the method. However, given the small size and triviality of
>> the reduction, that is probably not warranted. Coercing to 1 reducer is OK
>> in this case IMO.
>>
>> 3 -- I guess any writable is OK but NullWritable. Maybe something has
>> changed; I remember falling into that pitfall several generations of Hadoop
>> ago. You can verify by staging a simple experiment: write a sequence file
>> with NullWritable as either key or value and try to read it back. In my
>> test long ago it would write OK but not read back. I believe a similar
>> approach is used with keys in shuffle and sort. There is a reflection-based
>> writable factory inside which tries to use the default constructor of the
>> class to bring it up, which is (was) not available for NullWritable.
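>>
>> (For what it's worth, a plausible explanation for why it works now: the
>> reflection path in ReflectionUtils.newInstance calls setAccessible(true)
>> on the declared constructor, which bypasses the private-constructor
>> restriction. Roughly, from memory:
>>
>> import java.lang.reflect.Constructor;
>> import org.apache.hadoop.io.NullWritable;
>>
>> Constructor<NullWritable> c = NullWritable.class.getDeclaredConstructor();
>> c.setAccessible(true);             // bypasses the private constructor
>> NullWritable nw = c.newInstance();
>>
>> Older factories that relied on Class.newInstance() would fail here, since
>> that honors the private modifier.)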
>>
>>
>> - Dmitriy
>>
>>
>> -----------------------------------------------------------
>> This is an automatically generated e-mail. To reply, visit:
>> https://reviews.apache.org/r/3147/#review3838
>> -----------------------------------------------------------
>>
>>
>> On 2011-12-12 00:30:24, Raphael Cendrillon wrote:
>> bq.
>> bq. -----------------------------------------------------------
>> bq. This is an automatically generated e-mail. To reply, visit:
>> bq. https://reviews.apache.org/r/3147/
>> bq. -----------------------------------------------------------
>> bq.
>> bq. (Updated 2011-12-12 00:30:24)
>> bq.
>> bq.
>> bq. Review request for mahout.
>> bq.
>> bq.
>> bq. Summary
>> bq. -------
>> bq.
>> bq. Here's a patch with a simple job to calculate the row mean (column-wise
>> mean). One outstanding issue is the combiner: this requires a writable class
>> IntVectorTupleWritable, where the Int stores the number of rows and the
>> Vector stores the column-wise sum.
>> bq.
>> bq.
>> bq. This addresses bug MAHOUT-923.
>> bq. https://issues.apache.org/jira/browse/MAHOUT-923
>> bq.
>> bq.
>> bq. Diffs
>> bq. -----
>> bq.
>> bq.
>> /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java
>> 1213095
>> bq.
>> /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java
>> PRE-CREATION
>> bq.
>> /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java
>> 1213095
>> bq.
>> bq. Diff: https://reviews.apache.org/r/3147/diff
>> bq.
>> bq.
>> bq. Testing
>> bq. -------
>> bq.
>> bq. Junit test
>> bq.
>> bq.
>> bq. Thanks,
>> bq.
>> bq. Raphael
>> bq.
>> bq.
>>
>>
>>
>>> Row mean job for PCA
>>> --------------------
>>>
>>> Key: MAHOUT-923
>>> URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>> Project: Mahout
>>> Issue Type: Improvement
>>> Components: Math
>>> Affects Versions: 0.6
>>> Reporter: Raphael Cendrillon
>>> Assignee: Raphael Cendrillon
>>> Fix For: Backlog
>>>
>>> Attachments: MAHOUT-923.patch
>>>
>>>
>>> Add a map-reduce job for calculating the mean row (column-wise mean) of a
>>> Distributed Row Matrix, for use in PCA.
>>
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA
>> administrators:
>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>>