Amareshwari,

Thanks for the suggestion. Can you show a streaming jobconf that uses
"mapred.job.classpath.archives" to add a custom combiner to the classpath?

I've tried several variations, but the jar doesn't seem to get added to the
classpath properly...
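
Here's roughly the shape of what I've been trying (the jar path, class
name, and symlink below are placeholders, not my real job):

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.3-streaming.jar \
    -input in -output out \
    -mapper mapper.py -reducer reducer.py \
    -combiner com.example.MyCombiner \
    -cacheArchive hdfs:///user/pete/combiner.jar#combinerjar \
    -jobconf mapred.job.classpath.archives=/user/pete/combiner.jar

I wasn't sure whether mapred.job.classpath.archives should point at the
HDFS path or at the unpacked symlink, so I tried a few variations of that
last line.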

-Pete

On Mon, Apr 6, 2009 at 12:17 AM, Amareshwari Sriramadasu <
amar...@yahoo-inc.com> wrote:

> You can add your jar to the distributed cache and add it to the classpath by
> passing it in the configuration property "mapred.job.classpath.archives".
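>
> (If I remember the 0.18 API correctly, this is the same property that
> DistributedCache.addArchiveToClassPath sets for Java jobs, e.g.
>
>   DistributedCache.addArchiveToClassPath(
>       new Path("/user/you/your.jar"), conf);
>
> where the path is a placeholder. That call also registers the archive in
> the distributed cache, so the jar needs to be in the cache, not just
> named in the property.)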
>
> -Amareshwari
>
> Peter Skomoroch wrote:
>
>> If I need to use a custom streaming combiner jar in Hadoop 0.18.3, is
>> there a way to add it to the classpath without the following patch?
>>
>> https://issues.apache.org/jira/browse/HADOOP-3570
>>
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200809.mbox/%3c48cf78e3.10...@yahoo-inc.com%3e
>>
>> On Sat, Mar 28, 2009 at 2:28 PM, Peter Skomoroch
>> <peter.skomor...@gmail.com> wrote:
>>
>>> Paco,
>>>
>>> Thanks, good ideas on the combiner.  I'm going to tweak things a bit as
>>> you
>>> suggest and report back later...
>>>
>>> -Pete
>>>
>>> On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN <cet...@gmail.com> wrote:
>>>>
>>>> hi peter,
>>>> thinking aloud on this -
>>>>
>>>> trade-offs may depend on:
>>>>
>>>>  * how much grouping would be possible (tracking a PDF would be
>>>> interesting for metrics)
>>>>  * locality of key/value pairs (distributed among mapper and reducer
>>>> tasks)
>>>>
>>>> to that point, will there be much time spent in the shuffle?  if so,
>>>> it's probably cheaper to shuffle/sort the grouped row vectors than the
>>>> many small key,value pairs
>>>>
>>>> in any case, when i had a similar situation on a large data set (2-3
>>>> TB shuffle) a good pattern to follow was:
>>>>
>>>>  * mapper emitted small key,value pairs
>>>>  * combiner grouped into row vectors
>>>>
>>>> that combiner may get invoked both at the end of the map phase and at
>>>> the beginning of the reduce phase (more benefit)
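>>>>
>>>> fwiw, the grouping logic is tiny whatever language the combiner ends
>>>> up in - a rough python sketch, where the tab-separated "i, j, value"
>>>> input format is just illustrative:
>>>>
>>>>   #!/usr/bin/env python
>>>>   # group per-element "i <tab> j <tab> value" lines into one sparse
>>>>   # row vector per row index i (posting-list style output)
>>>>   import sys
>>>>   from collections import defaultdict
>>>>
>>>>   rows = defaultdict(list)
>>>>   for line in sys.stdin:
>>>>       i, j, v = line.rstrip("\n").split("\t")
>>>>       rows[i].append((int(j), float(v)))
>>>>
>>>>   for i, pairs in rows.items():
>>>>       vec = ",".join("(%d,%g)" % p for p in sorted(pairs))
>>>>       print("%s\t%s" % (i, vec))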
>>>>
>>>> also, representing values as byte arrays where possible can save a
>>>> lot of shuffle time
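>>>>
>>>> e.g., packing a row's (j, value) pairs into a byte array, base64'd so
>>>> it survives the line-oriented streaming channel - the "<if" int/float
>>>> layout here is just one possible choice:
>>>>
>>>>   import base64, struct
>>>>
>>>>   def pack_row(pairs):
>>>>       # 8 bytes per element: little-endian int32 column, float32 value
>>>>       buf = b"".join(struct.pack("<if", j, v) for j, v in pairs)
>>>>       return base64.b64encode(buf)
>>>>
>>>>   def unpack_row(encoded):
>>>>       buf = base64.b64decode(encoded)
>>>>       return [struct.unpack_from("<if", buf, k)
>>>>               for k in range(0, len(buf), 8)]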
>>>>
>>>> best,
>>>> paco
>>>>
>>>> On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch
>>>> <peter.skomor...@gmail.com> wrote:
>>>>
>>>>> Hadoop streaming question: If I am forming a matrix M by summing a
>>>>> number of elements generated on different mappers, is it better to emit
>>>>> tons of lines from the mappers with small key,value pairs for each
>>>>> element, or should I group them into row vectors before sending to the
>>>>> reducers?
>>>>>
>>>>> For example, say I'm summing frequency count matrices M for each user
>>>>> on a different map task, and the reducer combines the resulting sparse
>>>>> user count matrices for use in another calculation.
>>>>>
>>>>> Should I emit the individual elements:
>>>>>
>>>>> i (j, Mij) \n
>>>>> 3 (1, 3.4) \n
>>>>> 3 (2, 3.4) \n
>>>>> 3 (3, 3.4) \n
>>>>> 4 (1, 2.3) \n
>>>>> 4 (2, 5.2) \n
>>>>>
>>>>> Or posting list style vectors?
>>>>>
>>>>> 3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
>>>>> 4 ((1, 2.3), (2, 5.2)) \n
>>>>>
>>>>> Using vectors will at least save some message space, but are there any
>>>>> other benefits to this approach in terms of Hadoop streaming overhead
>>>>> (sorts etc.)?  I think buffering issues will not be a huge concern,
>>>>> since the length of the vectors has a reasonable upper bound and they
>>>>> will be in a sparse format...
>>>>>
>>>>> --
>>>>> Peter N. Skomoroch
>>>>> 617.285.8348
>>>>> http://www.datawrangling.com
>>>>> http://delicious.com/pskomoroch
>>>>> http://twitter.com/peteskomoroch
>>>>
>>> --
>>> Peter N. Skomoroch
>>> 617.285.8348
>>> http://www.datawrangling.com
>>> http://delicious.com/pskomoroch
>>> http://twitter.com/peteskomoroch


-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
