On Thu, May 21, 2009 at 5:19 AM, Dan Milstein <dmilst...@hubspot.com> wrote:
> One thing about the | sort | sh combiner.sh approach: you do have to be
> careful about memory if you're doing that -- if a mapper instance sees a
> large number of rows, you'll be asking sort to sort *all* of those before
> passing them to the combiner. Hadoop itself only hands off some bounded
> number of output keys at a time to the combiner, which is much safer for
> large data sets.

The unix "sort" utility already does some smartness here. It has a
configurable memory buffer it uses for sorting, and spills to /tmp by
default. The manpage doesn't say what algorithm it's actually using, but I
presume it's a mergesort. I think the default memory usage is something
pretty small - you may get better performance using "sort -S 512M" or so.

-Todd

> On May 19, 2009, at 5:04 PM, Peter Skomoroch wrote:
>
>> Whoops, should have googled it first. Looks like this is now fixed in
>> trunk, HADOOP-4842. For people stuck using 18.3, a workaround appears to
>> be adding something like "| sort | sh combiner.sh" to the call of the
>> mapper script (via Klaas Bosteels)
>>
>> Would be great to get this patched into distributions like EMR and
>> Cloudera
>>
>> On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch
>> <peter.skomor...@gmail.com> wrote:
>>
>>> One area I'm curious about is the requirement that any combiners in
>>> Streaming jobs be java classes. Are there any plans to change this in
>>> the future? Prototyping streaming jobs in Python is great, and the
>>> ability to use a Python combiner would help performance a lot without
>>> needing to move to Java.
>>>
>>> On Tue, May 19, 2009 at 4:30 PM, Amr Awadallah <a...@cloudera.com> wrote:
>>>
>>>> S d,
>>>>
>>>> It is totally fine to use Python streaming if it does the job you are
>>>> after, there will be a slight performance hit, but that is noise
>>>> assuming your cluster is a small one. If you are operating a large
>>>> cluster continuously, then once your logic is stabilized using Python
>>>> it might make sense to convert/operationalize some jobs to Java (or C
>>>> pipes) to improve performance for purpose of finishing quicker or
>>>> reducing number of servers needed.
>>>>
>>>> You should also take a look at PIG and Hive, they are both higher level
>>>> languages and very easy to learn:
>>>>
>>>> http://www.cloudera.com/hadoop-training-pig-introduction
>>>>
>>>> http://www.cloudera.com/hadoop-training-hive-introduction
>>>>
>>>> -- amr
>>>>
>>>> s d wrote:
>>>>
>>>>> Thanks.
>>>>> So in the overall scheme of things, what is the general feeling about
>>>>> using python for this? I like the ease of deploying and reading python
>>>>> compared with Java but want to make sure using python over hadoop is
>>>>> scalable & is standard practice and not something done only for
>>>>> prototyping and small scale tests.
>>>>>
>>>>> On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard <a...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> Streaming is slightly slower than native Java jobs. Otherwise Python
>>>>>> works great in streaming.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> On Tue, May 19, 2009 at 8:36 AM, s d <s.d.sau...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> How robust is using hadoop with python over the streaming protocol?
>>>>>>> Any disadvantages (performance? flexibility?) ? It just strikes me
>>>>>>> that python is so much more convenient when it comes to deploying
>>>>>>> and crunching text files.
>>>>>>> Thanks,
>>>
>>> --
>>> Peter N. Skomoroch
>>> 617.285.8348
>>> http://www.datawrangling.com
>>> http://delicious.com/pskomoroch
>>> http://twitter.com/peteskomoroch
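
For reference, the "| sort | combiner" workaround discussed in this thread could be sketched in Python roughly as follows. This is only an illustration under assumptions the thread does not state: the mapper is assumed to emit tab-separated "key<TAB>integer count" lines, and the function name `combine` is hypothetical. Because the input arrives sorted, equal keys are adjacent and the combiner only needs to keep one running total at a time.

```python
from itertools import groupby


def combine(lines):
    """Yield one "key<TAB>total" string per run of identical keys.

    Assumes every non-blank line looks like "key<TAB>integer" and that
    equal keys are adjacent, i.e. the input was sorted upstream.
    """
    # Drop blank lines, strip the newline, and split on the first tab only.
    records = (
        line.rstrip("\n").split("\t", 1) for line in lines if line.strip()
    )
    # groupby streams through the input lazily, so only the current key's
    # running total is held in memory.
    for key, group in groupby(records, key=lambda rec: rec[0]):
        total = sum(int(value) for _, value in group)
        yield "%s\t%d" % (key, total)


if __name__ == "__main__":
    # Demo on an in-memory list; a real combiner would pass sys.stdin here.
    for out in combine(["apple\t1\n", "apple\t2\n", "pear\t5\n"]):
        print(out)
```

In a real pipeline the function would be fed `sys.stdin`, with the sort step sitting between the mapper and this script (for example with the larger "sort -S 512M" buffer suggested above). Sorting in the C locale is probably the safer choice, so that lines sharing a key are guaranteed to end up next to each other.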