Re: Hadoop & Python

Todd Lipcon Thu, 21 May 2009 10:23:33 -0700

On Thu, May 21, 2009 at 5:19 AM, Dan Milstein <dmilst...@hubspot.com> wrote:


> One thing about the | sort | sh combiner.sh approach: you do have to be
> careful about memory if you're doing that -- if a mapper instance sees a
> large number of rows, you'll be asking sort to sort *all* of those before
> passing them to the combiner.  Hadoop itself only hands off some bounded
> number of output keys at a time to the combiner, which is much safer for
> large data sets.
>

The Unix "sort" utility already handles this case reasonably well. It sorts
in a configurable memory buffer and spills to temporary files (in /tmp by
default) when the input exceeds that buffer. The manpage doesn't say which
algorithm it uses, but I presume it's a merge sort. I think the default
buffer size is fairly small, so you may get better performance with
something like "sort -S 512M".
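
To make the "| sort | combiner" pipeline concrete, here is a minimal sketch
of a Python combiner that could sit after sort. It assumes the mapper emits
"key<TAB>integer count" lines (that record format and the name combine() are
assumptions for illustration, not something from this thread). Because sort
has already grouped identical keys together, the combiner only holds one
key's running total in memory at a time.

```python
import sys
from itertools import groupby


def combine(lines):
    """Yield one 'key<TAB>sum' line per run of identical keys.

    Assumes the input is already sorted by key (e.g. by Unix sort) and
    that each line looks like 'key<TAB>integer'. Lines without a tab
    are skipped.
    """
    # Parse lazily so only the current line is held in memory.
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in lines
        if "\t" in line
    )
    # groupby only merges *adjacent* equal keys, which is why the
    # input has to be sorted first.
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield "%s\t%d" % (key, total)


# Skip reading stdin when run interactively with nothing piped in.
if __name__ == "__main__" and not sys.stdin.isatty():
    for out in combine(sys.stdin):
        print(out)
```

With hypothetical script names, the mapper command would then look something
like "python mapper.py | sort -S 512M | python combiner.py". The -S option
sets the size of sort's in-memory buffer in GNU sort.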

-Todd

>
>
> On May 19, 2009, at 5:04 PM, Peter Skomoroch wrote:
>
>> Whoops, should have googled it first.  Looks like this is now fixed in
>> trunk, HADOOP-4842.  For people stuck using 0.18.3, a workaround appears
>> to be adding something like "| sort | sh combiner.sh" to the call of the
>> mapper script (via Klaas Bosteels).
>>
>> Would be great to get this patched into distributions like EMR and
>> Cloudera.
>>
>> On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch
>> <peter.skomor...@gmail.com> wrote:
>>
>>> One area I'm curious about is the requirement that any combiners in
>>> Streaming jobs be Java classes.  Are there any plans to change this in
>>> the future?  Prototyping streaming jobs in Python is great, and the
>>> ability to use a Python combiner would help performance a lot without
>>> needing to move to Java.
>>>
>>>
>>>
>>>
>>> On Tue, May 19, 2009 at 4:30 PM, Amr Awadallah <a...@cloudera.com> wrote:
>>>
>>>> S d,
>>>>
>>>> It is totally fine to use Python streaming if it does the job you are
>>>> after. There will be a slight performance hit, but that is noise
>>>> assuming your cluster is a small one. If you are operating a large
>>>> cluster continuously, then once your logic is stabilized using Python
>>>> it might make sense to convert/operationalize some jobs to Java (or
>>>> C++ Pipes) to improve performance, either to finish quicker or to
>>>> reduce the number of servers needed.
>>>>
>>>> You should also take a look at Pig and Hive; both are higher-level
>>>> languages and very easy to learn:
>>>>
>>>> http://www.cloudera.com/hadoop-training-pig-introduction
>>>>
>>>> http://www.cloudera.com/hadoop-training-hive-introduction
>>>>
>>>> -- amr
>>>>
>>>>
>>>> s d wrote:
>>>>
>>>>> Thanks.
>>>>> So in the overall scheme of things, what is the general feeling about
>>>>> using Python for this? I like the ease of deploying and reading Python
>>>>> compared with Java, but want to make sure using Python over Hadoop is
>>>>> scalable and is standard practice, not something done only for
>>>>> prototyping and small-scale tests.
>>>>>
>>>>>
>>>>> On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard <a...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Streaming is slightly slower than native Java jobs.  Otherwise Python
>>>>>> works great in streaming.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> On Tue, May 19, 2009 at 8:36 AM, s d <s.d.sau...@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Hi,
>>>>>>> How robust is using Hadoop with Python over the streaming protocol?
>>>>>>> Any disadvantages (performance? flexibility?)?  It just strikes me
>>>>>>> that Python is so much more convenient when it comes to deploying
>>>>>>> and crunching text files.
>>>>>>> Thanks,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Peter N. Skomoroch
>>> 617.285.8348
>>> http://www.datawrangling.com
>>> http://delicious.com/pskomoroch
>>> http://twitter.com/peteskomoroch
>>>
>>>
>>
>>
>
>
