You've convinced me that this is probably a bad idea. You never know when this
might come back to bite you later.
On 12 Dec, 2011, at 12:50 AM, Dmitriy Lyubimov wrote:
> Oh, now I remember what the deal with NullWritable was.
>
> Yes, a sequence file would read it, as in:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.SequenceFile;
>
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.getLocal(conf);
> Path testPath = new Path("name.seq");
>
> // write a file keyed on the NullWritable singleton
> IntWritable iw = new IntWritable();
> SequenceFile.Writer w =
>     SequenceFile.createWriter(fs, conf, testPath,
>                               NullWritable.class, IntWritable.class);
> w.append(NullWritable.get(), iw);
> w.close();
>
> // read it back explicitly; next() fills in the passed writables
> SequenceFile.Reader r = new SequenceFile.Reader(fs, testPath, conf);
> while (r.next(NullWritable.get(), iw)) { }
> r.close();
>
>
> but SequenceFileInputFormat would not. I.e. it is OK if you read it
> explicitly, but I don't think one can use such files as input for
> other MR jobs.
>
> But since in this case there's no MR job to consume that output (and
> there unlikely ever will be), I guess it is OK to save NullWritable in
> this case...
>
> -d
>
> On Mon, Dec 12, 2011 at 12:30 AM, [email protected]
> (Commented) (JIRA) <[email protected]> wrote:
>>
>> [
>> https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167406#comment-13167406
>> ]
>>
>> [email protected] commented on MAHOUT-923:
>> ------------------------------------------------------
>>
>>
>>
>> bq. On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
>> bq. > Hm. I hope I did not misread the code or miss something.
>> bq. >
>> bq. > 1 -- I am not sure this will actually work as intended unless the #
>> of reducers is coerced to 1, of which I see no mention in the code.
>> bq. > 2 -- mappers do nothing, passing all the row pressure on to the
>> sort, which is absolutely not necessary even if you use combiners. This is
>> going to be especially the case if you coerce 1 reducer and no combiners.
>> IMO the mean computation should be pushed up to the mappers to avoid the
>> sort pressures of map reduce. Then reduction becomes largely symbolic (but
>> you do need to pass the # of rows the mapper has seen on to the reducer,
>> in order for that operation to apply correctly).
>> bq. > 3 -- I am not sure -- is NullWritable as a key legit? In my
>> experience the sequence file reader cannot instantiate it because
>> NullWritable is a singleton and its creation is prohibited by making the
>> constructor private.
>> bq.
>> bq. Raphael Cendrillon wrote:
>> bq. Thanks, Dmitriy.
>> bq.
>> bq. Regarding 1, if I understand correctly the number of reducers
>> depends on the number of unique keys. Since all keys are set to the same
>> value (null), all of the mapper outputs should arrive at the same reducer.
>> This seems to work in the unit test, but I may be missing something?
>> bq.
>> bq. Regarding 2, that makes a lot of sense. I'm wondering how many rows
>> should be processed per mapper? I guess there is a trade-off between
>> scalability (processing more rows within a single map job means that each
>> row must have fewer columns) and speed? Is there someplace in the SSVD code
>> where the matrix is split into slices of rows that I could use as a
>> reference?
>> bq.
>> bq. Regarding 3, I believe NullWritable is OK. It's used pretty
>> extensively in TimesSquaredJob in DistributedRowMatrix. However, if you
>> feel there is some disadvantage to this I could replace "NullWritable.get()"
>> with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be
>> more suitable?
>> bq.
>> bq.
>>
>> The NullWritable objection is withdrawn. Apparently I haven't looked into
>> Hadoop in too long; amazingly, it seems to work now.
>>
>>
>> 1 -- I don't think your statement about the # of reduce tasks is true.
>>
>> The job (or, rather, the user) sets the number of reduce tasks via a config
>> property. Most users will follow the Hadoop recommendation and set that to
>> ~95% of the capacity they want to take (usually the whole cluster). So in a
>> production environment you are virtually _guaranteed_ to get something like
>> 75 reducers on a 40-node cluster, and consequently 75 output files (unless
>> users really read the details of your job and figure out you meant it to be
>> just 1).
>> Now, it is true that only one file will actually end up containing anything;
>> the rest of the task slots will just be occupied doing nothing.
>>
>> So there are two problems with that scheme: a) a job that allocates so many
>> task slots that do nothing is not a good citizen, since a real production
>> cluster is always shared among multiple jobs; b) your code assumes the
>> result will end up in partition 0, whereas contractually it may end up in
>> any of the 75 files (in reality, with the default hash partitioner and key
>> 1, it will wind up in partition 0001, unless there is only one reducer, as
>> I guess was the case in your test).
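>>
>> (To see b) concretely: the default HashPartitioner is, modulo version
>> drift, just
>>
>> public int getPartition(K key, V value, int numReduceTasks) {
>>   return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>> }
>>
>> and IntWritable(1).hashCode() is 1, so with 75 reducers the one non-empty
>> group lands in partition 1 % 75 = 1, i.e. part-00001, not part-00000.)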
>>
>> 2 -- It is simple. When you send n rows to reducers, they are shuffled and
>> sorted. Sending massive sets to reducers has two effects: first, even if
>> they all group under the same key, they are still sorted at a cost of
>> ~ n log(n/p), where p is the number of partitions, assuming a uniform
>> distribution (which this is not, because you are sending everything to the
>> same place). Just because we can run a distributed sort doesn't mean we
>> should. Secondly, all these rows are physically moved to the reduce tasks,
>> which is still ~n rows of i/o. Finally, what makes your case especially
>> problematic is that you are sending everything to the same reducer, i.e.
>> you are not actually sorting in a distributed way but rather doing a simple
>> single-threaded sort at the one reducer that happens to get all the input.
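>>
>> (To put rough illustrative numbers on it: with n = 10^8 rows and p = 75
>> partitions, n log2(n/p) is about 10^8 * 20 = 2 * 10^9 comparisons, plus
>> moving all 10^8 rows across the network, versus emitting one pre-summed
>> vector per map task as described below.)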
>>
>> So that scheme allocates a lot of task slots that are not used; does a sort
>> that is not needed; and does it in a single reducer thread over the entire
>> input, which is not parallel at all.
>>
>> Instead, consider this: each map task keeps a state consisting of
>> (sum(X), k). It keeps updating it (sum += x, k++) for every new row x. At
>> the end of the cycle (in cleanup) it writes only one tuple, (sum(X), k),
>> as output. So we have reduced the complexity of the sort and the i/o from
>> millions of elements down to just the # of maps (which is perhaps a
>> handful and in reality rarely overshoots 500 mappers). That is at least
>> about 4 orders of magnitude.
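>>
>> A minimal sketch of what such a mapper could look like, using the
>> IntVectorTupleWritable (k, sum(X)) pair writable that the review request
>> mentions; the class and all names here are illustrative, not the actual
>> patch:
>>
>> import java.io.IOException;
>> import org.apache.hadoop.io.IntWritable;
>> import org.apache.hadoop.mapreduce.Mapper;
>> import org.apache.mahout.math.DenseVector;
>> import org.apache.mahout.math.Vector;
>> import org.apache.mahout.math.VectorWritable;
>>
>> public class RowMeanMapper extends
>>     Mapper<IntWritable, VectorWritable, IntWritable, IntVectorTupleWritable> {
>>
>>   private Vector sum;  // running column-wise sum, sum(X)
>>   private int k;       // number of rows this mapper has seen
>>
>>   @Override
>>   protected void map(IntWritable row, VectorWritable v, Context ctx) {
>>     if (sum == null) {
>>       sum = new DenseVector(v.get().size());
>>     }
>>     sum = sum.plus(v.get());  // sum += x
>>     k++;                      // k++
>>   }
>>
>>   @Override
>>   protected void cleanup(Context ctx)
>>       throws IOException, InterruptedException {
>>     if (sum != null) {
>>       // exactly one output tuple per map task, under a constant key
>>       ctx.write(new IntWritable(1), new IntVectorTupleWritable(k, sum));
>>     }
>>   }
>> }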
>>
>> Now, we send that handful of tuples to a single reducer which just does the
>> combining (sum(X) += sum_i(X); n += n_i, where i indexes the tuples the
>> reducer sees). And because it is only a handful, the reducer also runs very
>> quickly, so the fact that we coerced it to be 1 is pretty benign. A volume
>> of anywhere between 1 and 500 vectors to sum up doesn't warrant distributed
>> computation.
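>>
>> The matching reducer, in the same illustrative vein (getInt() and
>> getVector() are assumed accessors on that hypothetical tuple writable):
>>
>> import org.apache.hadoop.mapreduce.Reducer;
>>
>> public class RowMeanReducer extends
>>     Reducer<IntWritable, IntVectorTupleWritable, IntWritable, VectorWritable> {
>>
>>   @Override
>>   protected void reduce(IntWritable key,
>>                         Iterable<IntVectorTupleWritable> tuples, Context ctx)
>>       throws IOException, InterruptedException {
>>     Vector total = null;
>>     long n = 0;
>>     for (IntVectorTupleWritable t : tuples) {
>>       // sum(X) += sum_i(X); n += n_i
>>       total = (total == null) ? t.getVector() : total.plus(t.getVector());
>>       n += t.getInt();
>>     }
>>     // the mean row is sum(X) / n
>>     ctx.write(key, new VectorWritable(total.divide(n)));
>>   }
>> }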
>>
>> But you have to make sure there's only 1 reducer no matter what the user
>> put into the config, and you have to make sure you do all the heavy lifting
>> in the mappers.
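>>
>> Coercing it is a single line in the driver, and it overrides whatever the
>> user put into mapred.reduce.tasks (again just a sketch):
>>
>> Job job = new Job(conf, "DistributedRowMatrix row mean");
>> job.setNumReduceTasks(1);  // ignore the user's configured reducer count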
>>
>> Finally, you don't even need to coerce to 1 reducer. You could still have
>> several (with uniformly distributed keys) and do the final combine in the
>> front end of the method. However, given the small size and triviality of
>> the reduction, that is probably not warranted. Coercing to 1 reducer is OK
>> in this case IMO.
>>
>> 3 -- I guess any writable is OK but NullWritable. Maybe something has
>> changed; I remember falling into that pitfall several generations of Hadoop
>> ago. You can verify by staging a simple experiment: write a sequence file
>> with NullWritable as either key or value and try to read it back. In my
>> test long ago it would write OK but not read back. I believe a similar
>> approach is used with keys in shuffle and sort. There is a reflection-based
>> writable factory inside which tries to use the default constructor of the
>> class to bring it up, which is (was) not available for NullWritable.
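>>
>> (For what it's worth, a plausible explanation for why it works now: the
>> reflection path in ReflectionUtils.newInstance calls setAccessible(true)
>> on the declared constructor, which bypasses the private-constructor
>> restriction. Roughly, from memory:
>>
>> import java.lang.reflect.Constructor;
>> import org.apache.hadoop.io.NullWritable;
>>
>> Constructor<NullWritable> c = NullWritable.class.getDeclaredConstructor();
>> c.setAccessible(true);             // bypasses the private constructor
>> NullWritable nw = c.newInstance();
>>
>> Older factories that relied on Class.newInstance() would fail here, since
>> that honors the private modifier.)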
>>
>>
>> - Dmitriy
>>
>>
>> -----------------------------------------------------------
>> This is an automatically generated e-mail. To reply, visit:
>> https://reviews.apache.org/r/3147/#review3838
>> -----------------------------------------------------------
>>
>>
>> On 2011-12-12 00:30:24, Raphael Cendrillon wrote:
>> bq.
>> bq. -----------------------------------------------------------
>> bq. This is an automatically generated e-mail. To reply, visit:
>> bq. https://reviews.apache.org/r/3147/
>> bq. -----------------------------------------------------------
>> bq.
>> bq. (Updated 2011-12-12 00:30:24)
>> bq.
>> bq.
>> bq. Review request for mahout.
>> bq.
>> bq.
>> bq. Summary
>> bq. -------
>> bq.
>> bq. Here's a patch with a simple job to calculate the row mean (column-wise
>> mean). One outstanding issue is the combiner: this requires a writable class
>> IntVectorTupleWritable, where the Int stores the number of rows and the
>> Vector stores the column-wise sum.
>> bq.
>> bq.
>> bq. This addresses bug MAHOUT-923.
>> bq. https://issues.apache.org/jira/browse/MAHOUT-923
>> bq.
>> bq.
>> bq. Diffs
>> bq. -----
>> bq.
>> bq.
>> /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java
>> 1213095
>> bq.
>> /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java
>> PRE-CREATION
>> bq.
>> /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java
>> 1213095
>> bq.
>> bq. Diff: https://reviews.apache.org/r/3147/diff
>> bq.
>> bq.
>> bq. Testing
>> bq. -------
>> bq.
>> bq. Junit test
>> bq.
>> bq.
>> bq. Thanks,
>> bq.
>> bq. Raphael
>> bq.
>> bq.
>>
>>
>>
>>> Row mean job for PCA
>>> --------------------
>>>
>>> Key: MAHOUT-923
>>> URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>> Project: Mahout
>>> Issue Type: Improvement
>>> Components: Math
>>> Affects Versions: 0.6
>>> Reporter: Raphael Cendrillon
>>> Assignee: Raphael Cendrillon
>>> Fix For: Backlog
>>>
>>> Attachments: MAHOUT-923.patch
>>>
>>>
>>> Add a map-reduce job for calculating the mean row (column-wise mean) of a
>>> Distributed Row Matrix, for use in PCA.
>>
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA
>> administrators:
>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>>