Whoops, should have googled it first.  Looks like this is now fixed in
trunk (HADOOP-4842).  For people stuck on 0.18.3, a workaround appears to be
appending something like "| sort | sh combiner.sh" to the command that runs
the mapper script (via Klaas Bosteels).
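
For anyone who wants the combiner itself in Python rather than sh, here is a
rough sketch of what could sit at the end of that pipe.  It assumes
word-count style records: sorted "key<TAB>integer" lines on stdin.  The file
name combiner.py and the combine() helper are made up for illustration and
are not part of Hadoop.

```python
#!/usr/bin/env python
"""combiner.py (hypothetical name): sum values per key for a streaming job."""
import sys
from itertools import groupby


def combine(lines):
    """Yield one 'key<TAB>total' line per key.

    Assumes the input is already sorted by key and that every non-blank
    line has the form 'key<TAB>integer' (an assumption of this sketch).
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    # groupby only merges adjacent records, which is why the sort step matters.
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield "%s\t%d" % (key, total)


if __name__ == "__main__":
    for out in combine(sys.stdin):
        print(out)
```

Following the workaround above, the mapper command would then end in
something like "| sort | python combiner.py".  The sort only sees the output
of a single map task, so the reducer still has to do the final aggregation.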

Would be great to get this patched into distributions like EMR and Cloudera.

On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch
<peter.skomor...@gmail.com> wrote:

> One area I'm curious about is the requirement that any combiners in
> Streaming jobs be Java classes.  Are there any plans to change this in the
> future?  Prototyping streaming jobs in Python is great, and the ability to
> use a Python combiner would help performance a lot without needing to move
> to Java.
>
>
>
>
> On Tue, May 19, 2009 at 4:30 PM, Amr Awadallah <a...@cloudera.com> wrote:
>
>> S d,
>>
>>  It is totally fine to use Python streaming if it does the job you are
>> after. There will be a slight performance hit, but that is noise assuming
>> your cluster is a small one. If you are operating a large cluster
>> continuously, then once your logic is stabilized using Python it might make
>> sense to convert/operationalize some jobs to Java (or C pipes) to improve
>> performance, for the purpose of finishing quicker or reducing the number of
>> servers needed.
>>
>>  You should also take a look at Pig and Hive; they are both higher-level
>> languages and very easy to learn:
>>
>> http://www.cloudera.com/hadoop-training-pig-introduction
>>
>> http://www.cloudera.com/hadoop-training-hive-introduction
>>
>> -- amr
>>
>>
>> s d wrote:
>>
>>> Thanks.
>>> So in the overall scheme of things, what is the general feeling about
>>> using python for this? I like the ease of deploying and reading python
>>> compared with Java but want to make sure using python over hadoop is
>>> scalable & is standard practice and not something done only for
>>> prototyping and small scale tests.
>>>
>>>
>>> On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard <a...@cloudera.com>
>>> wrote:
>>>
>>>
>>>
>>>> Streaming is slightly slower than native Java jobs.  Otherwise Python
>>>> works great in streaming.
>>>>
>>>> Alex
>>>>
>>>> On Tue, May 19, 2009 at 8:36 AM, s d <s.d.sau...@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>>> Hi,
>>>>> How robust is using hadoop with python over the streaming protocol? Any
>>>>> disadvantages (performance? flexibility?) ?  It just strikes me that
>>>>> python is so much more convenient when it comes to deploying and
>>>>> crunching text files.
>>>>> Thanks,
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
> --
> Peter N. Skomoroch
> 617.285.8348
> http://www.datawrangling.com
> http://delicious.com/pskomoroch
> http://twitter.com/peteskomoroch
>



-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
