If it's coherent with the rest of the code there, I guess it is benign to use it for this particular purpose. I can't think of a case where we'd want to pull exactly one vector into an MR job.

On Mon, Dec 12, 2011 at 12:54 AM, Raphael Cendrillon <cendrillon1...@gmail.com> wrote:
> You've convinced me that this is probably a bad idea. You never know when
> this might come back to bite us later.
>
> On 12 Dec, 2011, at 12:50 AM, Dmitriy Lyubimov wrote:
>
>> Oh, now I remember what the deal with NullWritable was.
>>
>> Yes, a sequence file would read it, as in:
>>
>> Configuration conf = new Configuration();
>> FileSystem fs = FileSystem.getLocal(conf);
>> Path testPath = new Path("name.seq");
>>
>> IntWritable iw = new IntWritable();
>> SequenceFile.Writer w =
>>     SequenceFile.createWriter(fs,
>>                               conf,
>>                               testPath,
>>                               NullWritable.class,
>>                               IntWritable.class);
>> w.append(NullWritable.get(), iw);
>> w.close();
>>
>> SequenceFile.Reader r = new SequenceFile.Reader(fs, testPath, conf);
>> while (r.next(NullWritable.get(), iw));
>> r.close();
>>
>> but SequenceFileInputFormat would not. I.e. it is ok if you read such a
>> file explicitly, but I don't think one can use it as an input for other
>> MR jobs.
>>
>> But since in this case there's no MR job to consume that output (and
>> there likely never will be), I guess it is ok to save NullWritable in
>> this case...
>>
>> -d
>>
>> On Mon, Dec 12, 2011 at 12:30 AM, jirapos...@reviews.apache.org
>> (Commented) (JIRA) <j...@apache.org> wrote:
>>>
>>> [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167406#comment-13167406 ]
>>>
>>> jirapos...@reviews.apache.org commented on MAHOUT-923:
>>> ------------------------------------------------------
>>>
>>> bq. On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
>>> bq. > Hm. I hope I did not misread the code or miss something.
>>> bq. >
>>> bq. > 1 -- I am not sure this will actually work as intended unless the
>>> bq. > # of reducers is coerced to 1, of which I see no mention in the code.
>>> bq. > 2 -- The mappers do nothing, passing all the row pressure on to the
>>> bq. > sort, which is absolutely not necessary even if you use combiners,
>>> bq. > and especially if you coerce 1 reducer and no combiners. IMO the
>>> bq. > mean computation should be pushed up to the mappers to avoid the
>>> bq. > sort pressures of map reduce. Then the reduction becomes largely
>>> bq. > symbolic (but you do need to pass the # of rows each mapper has
>>> bq. > seen on to the reducer, in order for that operation to apply
>>> bq. > correctly).
>>> bq. > 3 -- I am not sure -- is NullWritable as a key legit? In my
>>> bq. > experience the sequence file reader cannot instantiate it, because
>>> bq. > NullWritable is a singleton and its creation is prohibited by
>>> bq. > making the constructor private.
>>>
>>> bq. Raphael Cendrillon wrote:
>>> bq. Thanks Dmitriy.
>>> bq.
>>> bq. Regarding 1, if I understand correctly the number of reducers
>>> bq. depends on the number of unique keys. Since all keys are set to the
>>> bq. same value (null), all of the mapper outputs should arrive at the
>>> bq. same reducer. This seems to work in the unit test, but I may be
>>> bq. missing something?
>>> bq.
>>> bq. Regarding 2, that makes a lot of sense. I'm wondering how many
>>> bq. rows should be processed per mapper? I guess there is a trade-off
>>> bq. between scalability (processing more rows within a single map job
>>> bq. means that each row must have fewer columns) and speed? Is there
>>> bq. someplace in the SSVD code where the matrix is split into slices of
>>> bq. rows that I could use as a reference?
>>> bq.
>>> bq. Regarding 3, I believe NullWritable is OK. It's used pretty
>>> bq. extensively in TimesSquaredJob in DistributedRowMatrix.
>>> bq. However, if you feel there is some disadvantage to this I could
>>> bq. replace "NullWritable.get()" with "new IntWritable(1)" (that is, set
>>> bq. all of the keys to 1). Would that be more suitable?
>>>
>>> The NullWritable objection is withdrawn. Apparently I haven't looked
>>> into hadoop for too long; amazingly, it seems to work now.
>>>
>>> 1 -- I don't think your statement about the # of reduce tasks is true.
>>>
>>> The job (or, rather, the user) sets the number of reduce tasks via a
>>> config property. All users will follow the hadoop recommendation to set
>>> that to 95% of the capacity they want to take (usually the whole
>>> cluster). So in a production environment you are virtually _guaranteed_
>>> to get something like 75 reducers on a 40-node cluster, and consequently
>>> 75 output files (unless users really read the details of your job and
>>> figure out you meant it to be just 1). Now, it is true that only one of
>>> those files will actually end up containing anything, and the rest of
>>> the task slots will just be occupied doing nothing.
>>>
>>> So there are two problems with that scheme: a) a job that allocates so
>>> many task slots that do nothing is not a good citizen, since a real
>>> production cluster is always shared among multiple jobs; b) your code
>>> assumes the result will end up in partition 0, whereas contractually it
>>> may end up in any of the 75 files. (In reality, with the default hash
>>> partitioner and key 1, it will wind up in partition 0001 unless there's
>>> exactly one reducer, as I guess there was in your test -- see the
>>> partitioner arithmetic sketch below.)
>>>
>>> 2 -- It is simple. When you send n rows to the reducers, they are
>>> shuffled and sorted. Sending massive sets to the reducers has two
>>> effects: first, even if they all group under the same key, they are
>>> still sorted, at a cost of ~ n log(n/p) where p is the number of
>>> partitions, assuming a uniform distribution (which this is not, because
>>> you are sending everything to the same place). Just because we can run a
>>> distributed sort doesn't mean we should. Secondly, all these rows are
>>> physically moved to the reduce tasks, which is still ~ n rows of i/o.
>>> Finally, what makes your case especially problematic is that you are
>>> sending everything to the same reducer, i.e. you are not actually
>>> sorting in a distributed way, but rather doing a simple single-threaded
>>> sort in the one reducer that happens to get all the input.
>>>
>>> So that scheme allocates a lot of task slots that are never used, does a
>>> sort that is not needed, and does it in a single reducer thread over the
>>> entire input, which is not parallel at all.
>>>
>>> Instead, consider this: the map keeps a state consisting of (sum(x), k)
>>> and updates it -- sum += x, k++ -- for every new row x. At the end of
>>> the cycle (in cleanup()) it writes just one tuple (sum(x), k) as its
>>> output. So we have reduced the sort and io complexity from millions of
>>> elements to just the # of maps (which is perhaps a handful, and in
>>> reality rarely overshoots 500 mappers). That is at least about 4 orders
>>> of magnitude.
>>>
>>> Now we send that handful of tuples to a single reducer, which just does
>>> the combining (sum(x) += sum_i(x); n += n_i, where i indexes the
>>> incoming tuples). And because it is only a handful, the reducer also
>>> runs very quickly, so the fact that we coerced it to be 1 is pretty
>>> benign. A volume of anywhere between 1 and 500 vectors to sum up doesn't
>>> warrant distributed computation.
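>>>
>>> An untested sketch of what I mean follows. The class names
>>> (RowSumMapper / RowMeanReducer) and the trick of carrying the row count
>>> in an extra trailing vector slot are made up here purely for
>>> illustration -- this is not the patch code -- and it assumes
>>> <IntWritable, VectorWritable> rows as in DistributedRowMatrix:
>>>
>>> import java.io.IOException;
>>> import org.apache.hadoop.io.IntWritable;
>>> import org.apache.hadoop.mapreduce.Mapper;
>>> import org.apache.hadoop.mapreduce.Reducer;
>>> import org.apache.mahout.math.DenseVector;
>>> import org.apache.mahout.math.Vector;
>>> import org.apache.mahout.math.VectorWritable;
>>>
>>> // Mapper state is (sum(x), k). A single augmented vector [ sum(x) | k ]
>>> // is emitted from cleanup(), so the shuffle carries one tuple per
>>> // mapper instead of n rows.
>>> public class RowSumMapper
>>>     extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {
>>>
>>>   private Vector sum;  // running column-wise sum
>>>   private long k;      // number of rows seen by this mapper
>>>
>>>   @Override
>>>   protected void map(IntWritable key, VectorWritable v, Context ctx) {
>>>     sum = sum == null ? v.get().clone() : sum.plus(v.get());
>>>     k++;
>>>   }
>>>
>>>   @Override
>>>   protected void cleanup(Context ctx)
>>>       throws IOException, InterruptedException {
>>>     if (sum == null) {
>>>       return;  // empty split: contribute nothing
>>>     }
>>>     Vector out = new DenseVector(sum.size() + 1);
>>>     out.viewPart(0, sum.size()).assign(sum);
>>>     out.set(sum.size(), k);  // row count rides in the last slot
>>>     ctx.write(new IntWritable(1), new VectorWritable(out));
>>>   }
>>> }
>>>
>>> // The reducer combines the handful of per-mapper tuples and divides.
>>> public class RowMeanReducer
>>>     extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
>>>
>>>   @Override
>>>   protected void reduce(IntWritable key, Iterable<VectorWritable> vals,
>>>                         Context ctx)
>>>       throws IOException, InterruptedException {
>>>     Vector acc = null;
>>>     for (VectorWritable v : vals) {
>>>       acc = acc == null ? v.get().clone() : acc.plus(v.get());
>>>     }
>>>     int n = acc.size() - 1;  // last slot holds the total row count
>>>     Vector mean = acc.viewPart(0, n).divide(acc.get(n));
>>>     ctx.write(new IntWritable(1), new VectorWritable(mean));
>>>   }
>>> }
>>>
>>> and in the driver, regardless of what the user configured:
>>>
>>> job.setNumReduceTasks(1);  // combining <= ~500 tuples is cheap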
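>>>
>>> And to make b) under point 1 concrete, this is the arithmetic the
>>> default HashPartitioner does (a tiny illustrative sketch; 75 is just
>>> the example reducer count from above):
>>>
>>> IntWritable key = new IntWritable(1);
>>> int numReduceTasks = 75;
>>> // HashPartitioner.getPartition(): non-negative hash modulo # reducers
>>> int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>>> // IntWritable.hashCode() is the int value itself, so partition == 1:
>>> // the single record lands in part-00001, not part-00000.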
>>>
>>> But you have to make sure there's only 1 reducer no matter what the
>>> user put into the config, and you have to make sure you do all the
>>> heavy lifting in the mappers.
>>>
>>> Finally, you don't even need to coerce to 1 reducer. You could still
>>> have several (but uniformly distributed) and do the final combine in
>>> the front end of the method. However, given the small size and
>>> triviality of the reduction processing, that is probably not warranted.
>>> Coercing to 1 reducer is ok in this case IMO.
>>>
>>> 3 -- I guess any writable is ok except NullWritable. Maybe something
>>> has changed; I remember falling into that pitfall several generations
>>> of hadoop ago. You can verify by staging a simple experiment: write a
>>> sequence file with NullWritable as either key or value and try to read
>>> it back. In my test long ago it would write ok but not read back. I
>>> believe a similar approach is used with keys in shuffle and sort. There
>>> is a reflection writable factory inside which tries to use the default
>>> constructor of the class to instantiate it, and that constructor is
>>> (was) not available for NullWritable.
>>>
>>> - Dmitriy
>>>
>>> -----------------------------------------------------------
>>> This is an automatically generated e-mail. To reply, visit:
>>> https://reviews.apache.org/r/3147/#review3838
>>> -----------------------------------------------------------
>>>
>>> On 2011-12-12 00:30:24, Raphael Cendrillon wrote:
>>> bq.
>>> bq. -----------------------------------------------------------
>>> bq. This is an automatically generated e-mail. To reply, visit:
>>> bq. https://reviews.apache.org/r/3147/
>>> bq. -----------------------------------------------------------
>>> bq.
>>> bq. (Updated 2011-12-12 00:30:24)
>>> bq.
>>> bq.
>>> bq. Review request for mahout.
>>> bq.
>>> bq.
>>> bq. Summary
>>> bq. -------
>>> bq.
>>> bq. Here's a patch with a simple job to calculate the row mean
>>> bq. (column-wise mean). One outstanding issue is the combiner; this
>>> bq. requires a writable class, IntVectorTupleWritable, where the Int
>>> bq. stores the number of rows and the Vector stores the column-wise
>>> bq. sum.
>>> bq.
>>> bq.
>>> bq. This addresses bug MAHOUT-923.
>>> bq. https://issues.apache.org/jira/browse/MAHOUT-923
>>> bq.
>>> bq.
>>> bq. Diffs
>>> bq. -----
>>> bq.
>>> bq. /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095
>>> bq. /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
>>> bq. /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095
>>> bq.
>>> bq. Diff: https://reviews.apache.org/r/3147/diff
>>> bq.
>>> bq.
>>> bq. Testing
>>> bq. -------
>>> bq.
>>> bq. JUnit test
>>> bq.
>>> bq.
>>> bq. Thanks,
>>> bq.
>>> bq. Raphael
>>>
>>>
>>>> Row mean job for PCA
>>>> --------------------
>>>>
>>>>              Key: MAHOUT-923
>>>>              URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>>          Project: Mahout
>>>>       Issue Type: Improvement
>>>>       Components: Math
>>>> Affects Versions: 0.6
>>>>         Reporter: Raphael Cendrillon
>>>>         Assignee: Raphael Cendrillon
>>>>          Fix For: Backlog
>>>>
>>>>      Attachments: MAHOUT-923.patch
>>>>
>>>>
>>>> Add a map reduce job for calculating the mean row (column-wise mean)
>>>> of a Distributed Row Matrix, for use in PCA.