Whoops, should have googled it first. Looks like this is now fixed in trunk (HADOOP-4842). For people stuck on Hadoop 0.18.3, a workaround appears to be appending something like "| sort | sh combiner.sh" to the mapper command, so the combiner runs inside the map task on the mapper's sorted output (via Klaas Bosteels).
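For anyone who wants to try the workaround with a Python combiner instead of a shell script, here is a minimal sketch of what that stage might look like. It assumes word-count style records, i.e. tab-separated "key<TAB>integer" lines, and relies on the preceding "sort" so that identical keys arrive together. The file name combiner.py and the function names are just illustrations of mine, not anything taken from HADOOP-4842 or from Klaas's post.

```python
#!/usr/bin/env python
# combiner.py (hypothetical name): sums integer counts per key.
# Assumes input lines look like "key\tcount" and are already sorted by key.
import sys
from itertools import groupby


def parse(lines):
    """Yield (key, int_value) pairs from tab-separated lines, skipping blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, _, value = line.partition("\t")
        yield key, int(value)


def combine(lines):
    """Yield one "key\\tsum" line per run of identical keys in the input."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(value for _, value in group))


if __name__ == "__main__":
    for record in combine(sys.stdin):
        print(record)
```

The mapper command would then presumably become something like "mapper.py | sort | combiner.py". Note that this only pre-aggregates within a single map task, so the reducer still has to do the final sum.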
Would be great to get this patched into distributions like EMR and Cloudera.

On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch <peter.skomor...@gmail.com> wrote:

> One area I'm curious about is the requirement that any combiners in
> Streaming jobs be java classes. Are there any plans to change this in the
> future? Prototyping streaming jobs in Python is great, and the ability to
> use a Python combiner would help performance a lot without needing to move
> to Java.
>
> On Tue, May 19, 2009 at 4:30 PM, Amr Awadallah <a...@cloudera.com> wrote:
>
>> S d,
>>
>> It is totally fine to use Python streaming if it does the job you are
>> after, there will be a slight performance hit, but that is noise assuming
>> your cluster is a small one. If you are operating a large cluster
>> continuously, then once your logic is stabilized using Python it might make
>> sense to convert/operationalize some jobs to Java (or C pipes) to improve
>> performance for purpose of finishing quicker or reducing number of servers
>> needed.
>>
>> You should also take a look at PIG and Hive, they are both higher level
>> languages and very easy to learn:
>>
>> http://www.cloudera.com/hadoop-training-pig-introduction
>>
>> http://www.cloudera.com/hadoop-training-hive-introduction
>>
>> -- amr
>>
>> s d wrote:
>>
>>> Thanks.
>>> So in the overall scheme of things, what is the general feeling about
>>> using python for this? I like the ease of deploying and reading python
>>> compared with Java but want to make sure using python over hadoop is
>>> scalable & is standard practice and not something done only for
>>> prototyping and small scale tests.
>>>
>>> On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard <a...@cloudera.com>
>>> wrote:
>>>
>>>> Streaming is slightly slower than native Java jobs. Otherwise Python
>>>> works great in streaming.
>>>>
>>>> Alex
>>>>
>>>> On Tue, May 19, 2009 at 8:36 AM, s d <s.d.sau...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> How robust is using hadoop with python over the streaming protocol? Any
>>>>> disadvantages (performance? flexibility?) ? It just strikes me that
>>>>> python is so much more convenient when it comes to deploying and
>>>>> crunching text files.
>>>>> Thanks,
>
> --
> Peter N. Skomoroch
> 617.285.8348
> http://www.datawrangling.com
> http://delicious.com/pskomoroch
> http://twitter.com/peteskomoroch

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch