Direct link to HADOOP-4842: https://issues.apache.org/jira/browse/HADOOP-4842
On Tue, May 19, 2009 at 5:04 PM, Peter Skomoroch <peter.skomor...@gmail.com> wrote:

> Whoops, should have googled it first. Looks like this is now fixed in
> trunk, HADOOP-4842. For people stuck using 18.3, a workaround appears to be
> adding something like "| sort | sh combiner.sh" to the call of the mapper
> script (via Klaas Bosteels)
>
> Would be great to get this patched into distributions like EMR and Cloudera
>
> On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch <peter.skomor...@gmail.com> wrote:
>
>> One area I'm curious about is the requirement that any combiners in
>> Streaming jobs be java classes. Are there any plans to change this in the
>> future? Prototyping streaming jobs in Python is great, and the ability to
>> use a Python combiner would help performance a lot without needing to move
>> to Java.
>>
>> On Tue, May 19, 2009 at 4:30 PM, Amr Awadallah <a...@cloudera.com> wrote:
>>
>>> S d,
>>>
>>> It is totally fine to use Python streaming if it does the job you are
>>> after, there will be a slight performance hit, but that is noise assuming
>>> your cluster is a small one. If you are operating a large cluster
>>> continuously, then once your logic is stabilized using Python it might make
>>> sense to convert/operationalize some jobs to Java (or C pipes) to improve
>>> performance for purpose of finishing quicker or reducing number of servers
>>> needed.
>>>
>>> You should also take a look at PIG and Hive, they are both higher level
>>> languages and very easy to learn:
>>>
>>> http://www.cloudera.com/hadoop-training-pig-introduction
>>>
>>> http://www.cloudera.com/hadoop-training-hive-introduction
>>>
>>> -- amr
>>>
>>> s d wrote:
>>>
>>>> Thanks.
>>>> So in the overall scheme of things, what is the general feeling about
>>>> using python for this? I like the ease of deploying and reading python
>>>> compared with Java but want to make sure using python over hadoop is
>>>> scalable & is standard practice and not something done only for
>>>> prototyping and small scale tests.
>>>>
>>>> On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard <a...@cloudera.com> wrote:
>>>>
>>>>> Streaming is slightly slower than native Java jobs. Otherwise Python
>>>>> works great in streaming.
>>>>>
>>>>> Alex
>>>>>
>>>>> On Tue, May 19, 2009 at 8:36 AM, s d <s.d.sau...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> How robust is using hadoop with python over the streaming protocol?
>>>>>> Any disadvantages (performance? flexibility?) ? It just strikes me that
>>>>>> python is so much more convenient when it comes to deploying and
>>>>>> crunching text files.
>>>>>> Thanks,
>>
>> --
>> Peter N. Skomoroch
>> 617.285.8348
>> http://www.datawrangling.com
>> http://delicious.com/pskomoroch
>> http://twitter.com/peteskomoroch
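
For anyone trying the 18.3 workaround quoted in this thread (appending something like "| sort | sh combiner.sh" to the mapper call), here is a rough sketch of what the combiner stage could look like if written in Python instead of shell. This is only an illustration under assumptions that are not from the thread: the file name combiner.py is hypothetical, the mapper is assumed to emit tab-separated "key<TAB>integer count" lines, and the combine step is assumed to be a simple per-key sum.

```python
#!/usr/bin/env python
"""Hypothetical combiner.py: sums integer counts per key.

Assumes stdin holds lines of the form "key<TAB>count" that are already
sorted by key (for example by the "| sort" step of the workaround).
"""
import sys
from itertools import groupby


def parse(lines):
    """Yield (key, count) pairs from tab-separated lines, skipping blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, _, value = line.partition("\t")
        yield key, int(value)


def combine(lines):
    """Yield one "key<TAB>total" line per run of identical adjacent keys."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(count for _, count in group))


if __name__ == "__main__":
    for out in combine(sys.stdin):
        print(out)
```

With a script like this, the mapper call would end in something like "| sort | python combiner.py" in place of "| sort | sh combiner.sh". Because groupby only merges adjacent equal keys, the sort step has to stay in the pipeline, and the reducer still needs to do the final aggregation, since each map task only pre-combines its own output.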