Amareshwari,

Thanks for the suggestion. Can you show a streaming jobconf that uses
"mapred.job.classpath.archives" to add a custom combiner to the classpath?
I've tried several variations, but the jar doesn't seem to get added to the
classpath properly...
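For reference, here is a sketch of the kind of streaming invocation I've been attempting. All paths, host names, and the combiner class name are placeholders, and this is untested; per the rest of this thread it may be exactly what fails without the HADOOP-3570 patch:

```shell
# Sketch only (placeholders throughout): ship the combiner jar via the
# distributed cache and try to put it on the task classpath with the
# mapred.job.classpath.archives property, then name the combiner class.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.3-streaming.jar \
    -input  /user/pete/input \
    -output /user/pete/output \
    -mapper  mapper.py \
    -reducer reducer.py \
    -combiner com.example.MyCombiner \
    -cacheArchive hdfs://namenode:9000/user/pete/mycombiner.jar#mycombiner \
    -jobconf mapred.job.classpath.archives=/user/pete/mycombiner.jar \
    -file mapper.py \
    -file reducer.py
```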
-Pete

On Mon, Apr 6, 2009 at 12:17 AM, Amareshwari Sriramadasu <amar...@yahoo-inc.com> wrote:

> You can add your jar to the distributed cache and add it to the classpath
> by passing it in the configuration property "mapred.job.classpath.archives".
>
> -Amareshwari
>
> Peter Skomoroch wrote:
>
>> If I need to use a custom streaming combiner jar in Hadoop 0.18.3, is
>> there a way to add it to the classpath without the following patch?
>>
>> https://issues.apache.org/jira/browse/HADOOP-3570
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200809.mbox/%3c48cf78e3.10...@yahoo-inc.com%3e
>>
>> On Sat, Mar 28, 2009 at 2:28 PM, Peter Skomoroch
>> <peter.skomor...@gmail.com> wrote:
>>
>>> Paco,
>>>
>>> Thanks, good ideas on the combiner. I'm going to tweak things a bit as
>>> you suggest and report back later...
>>>
>>> -Pete
>>>
>>> On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN <cet...@gmail.com> wrote:
>>>
>>>> hi peter,
>>>>
>>>> thinking aloud on this - the trade-offs may depend on:
>>>>
>>>> * how much grouping would be possible (tracking a PDF would be
>>>>   interesting for metrics)
>>>> * locality of key/value pairs (distributed among mapper and reducer
>>>>   tasks)
>>>>
>>>> to that point, will there be much time spent in the shuffle? if so,
>>>> it's probably cheaper to shuffle/sort the grouped row vectors than the
>>>> many small key,value pairs.
>>>>
>>>> in any case, when i had a similar situation on a large data set
>>>> (2-3 TB shuffle), a good pattern to follow was:
>>>>
>>>> * mapper emitted small key,value pairs
>>>> * combiner grouped them into row vectors
>>>>
>>>> that combiner may get invoked both at the end of the map phase and at
>>>> the beginning of the reduce phase (more benefit)
>>>>
>>>> also, using byte arrays where possible to represent values may save
>>>> much shuffle time
>>>>
>>>> best,
>>>> paco
>>>>
>>>> On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch
>>>> <peter.skomor...@gmail.com> wrote:
>>>>
>>>>> Hadoop streaming question: If I am forming a matrix M by summing a
>>>>> number of elements generated on different mappers, is it better to
>>>>> emit tons of lines from the mappers with small key,value pairs for
>>>>> each element, or should I group them into row vectors before sending
>>>>> to the reducers?
>>>>>
>>>>> For example, say I'm summing frequency count matrices M for each user
>>>>> on a different map task, and the reducer combines the resulting
>>>>> sparse user count matrices for use in another calculation.
>>>>>
>>>>> Should I emit the individual elements:
>>>>>
>>>>> i (j, Mij) \n
>>>>> 3 (1, 3.4) \n
>>>>> 3 (2, 3.4) \n
>>>>> 3 (3, 3.4) \n
>>>>> 4 (1, 2.3) \n
>>>>> 4 (2, 5.2) \n
>>>>>
>>>>> Or posting-list style vectors?
>>>>>
>>>>> 3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
>>>>> 4 ((1, 2.3), (2, 5.2)) \n
>>>>>
>>>>> Using vectors will at least save some message space, but are there
>>>>> any other benefits to this approach in terms of Hadoop streaming
>>>>> overhead (sorts etc.)?
>>>>> I think buffering issues will not be a huge concern, since the
>>>>> length of the vectors has a reasonable upper bound and they will be
>>>>> in a sparse format...

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
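P.S. For the archives, here is the sort of streaming combiner I have in mind for the grouping pattern Paco describes (mapper emits small key,value pairs, combiner groups them into row vectors). The tab-separated "i<TAB>j,v" record format here is just an illustrative assumption, not a fixed format from this thread:

```python
import sys
from itertools import groupby

def group_rows(lines):
    """Group sorted 'i<TAB>j,v' element records into posting-list style
    row vectors: 'i<TAB>(j1,v1) (j2,v2) ...'.

    Duplicate (i, j) entries are summed, matching the matrix-summing use
    case; input is assumed sorted by key i, as Hadoop guarantees for
    combiner/reducer input within a task."""
    out = []
    parsed = []
    for line in lines:
        i, jv = line.rstrip("\n").split("\t")
        j, v = jv.split(",")
        parsed.append((i, int(j), float(v)))
    for i, elems in groupby(parsed, key=lambda t: t[0]):
        row = {}
        for _, j, v in elems:
            row[j] = row.get(j, 0.0) + v  # sum repeated elements
        vec = " ".join("(%d,%g)" % (j, row[j]) for j in sorted(row))
        out.append("%s\t%s" % (i, vec))
    return out

if __name__ == "__main__":
    # In a streaming job this script runs as the combiner (and could run
    # again as the reducer), reading the sorted mapper output from stdin.
    for rec in group_rows(sys.stdin):
        sys.stdout.write(rec + "\n")
```

The same script can serve as both combiner and reducer here, since grouping row vectors is idempotent as long as the reducer re-parses and re-merges its input.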