One thing about the "| sort | sh combiner.sh" approach: you do have to be careful about memory. If a mapper instance sees a large number of rows, you'll be asking sort to sort *all* of them before passing them to the combiner. Hadoop itself only hands off some bounded number of output keys at a time to the combiner, which is much safer for large data sets.

In dumbo itself, Klaas added "combine a chunk at a time" to address this problem.
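
To make the chunk-at-a-time idea concrete, here is a rough sketch in Python. This is not dumbo's actual code. The function names, the max_keys threshold, and the tab-separated "key<TAB>count" format are all assumptions made for illustration. It buffers partial sums and flushes them whenever the buffer holds max_keys distinct keys, so memory stays bounded regardless of how many rows the mapper sees.

```python
import sys
from collections import defaultdict


def combine_in_chunks(pairs, max_keys=10000):
    """Sum values per key, flushing once max_keys distinct keys are buffered.

    max_keys is a hypothetical tuning knob, not a Hadoop or dumbo setting.
    """
    buffer = defaultdict(int)
    for key, value in pairs:
        buffer[key] += value
        if len(buffer) >= max_keys:
            # Emit the partial sums and start over so memory stays bounded.
            for item in sorted(buffer.items()):
                yield item
            buffer.clear()
    # Emit whatever is left once the input is exhausted.
    for item in sorted(buffer.items()):
        yield item


def parse(lines):
    """Turn 'key<TAB>count' lines into (key, int) pairs, skipping blank lines."""
    for line in lines:
        if line.strip():
            key, _, value = line.rstrip("\n").partition("\t")
            yield key, int(value)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for use as a filter: call main() at the bottom of the script."""
    for key, total in combine_in_chunks(parse(stdin)):
        stdout.write("%s\t%d\n" % (key, total))
```

The same key can be emitted more than once (once per chunk it appears in). That is fine for a combiner, since the reducer still sums the partial results.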

(and, yes, overall, getting combines fully supported in streaming is awesome)

-D

On May 19, 2009, at 5:04 PM, Peter Skomoroch wrote:

Whoops, should have googled it first. Looks like this is now fixed in trunk (HADOOP-4842). For people stuck using 0.18.3, a workaround appears to be adding something like "| sort | sh combiner.sh" to the call of the mapper script (via Klaas Bosteels).
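
As an illustration of what combiner.sh might wrap, here is a minimal Python sketch. It assumes the mapper emits tab-separated "key<TAB>count" lines and that the input is already sorted by key, which is what the sort in the pipe provides. The function names and the summing logic are hypothetical and not taken from any actual script.

```python
import sys
from itertools import groupby


def combine_sorted(lines):
    """Collapse runs of identical keys in sorted 'key<TAB>count' lines."""
    parsed = (
        line.rstrip("\n").split("\t", 1)
        for line in lines
        if line.strip()  # ignore blank lines
    )
    # groupby only merges adjacent equal keys, hence the need for sorted input.
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield key, sum(int(count) for _, count in group)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for use as a filter: call main() at the bottom of the script."""
    for key, total in combine_sorted(stdin):
        stdout.write("%s\t%d\n" % (key, total))
```

Because it streams through the sorted input one key at a time, the script itself holds very little in memory; the sorting cost sits in the sort step before it.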

It would be great to get this patched into distributions like EMR and Cloudera.

On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch <peter.skomor...@gmail.com> wrote:

One area I'm curious about is the requirement that any combiners in Streaming jobs be Java classes. Are there any plans to change this in the future? Prototyping streaming jobs in Python is great, and the ability to use a Python combiner would help performance a lot without needing to move to Java.




On Tue, May 19, 2009 at 4:30 PM, Amr Awadallah <a...@cloudera.com> wrote:

S d,

It is totally fine to use Python streaming if it does the job you are after. There will be a slight performance hit, but that is noise assuming your cluster is a small one. If you are operating a large cluster continuously, then once your logic has stabilized in Python it might make sense to convert/operationalize some jobs to Java (or C pipes) to improve performance, for the purpose of finishing sooner or reducing the number of servers needed.

You should also take a look at Pig and Hive. They are both higher-level languages and very easy to learn:

http://www.cloudera.com/hadoop-training-pig-introduction

http://www.cloudera.com/hadoop-training-hive-introduction

-- amr


s d wrote:

Thanks.
So in the overall scheme of things, what is the general feeling about using Python for this? I like the ease of deploying and reading Python compared with Java, but I want to make sure that using Python over Hadoop is scalable and is standard practice, not something done only for prototyping and small-scale tests.


On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard <a...@cloudera.com> wrote:



Streaming is slightly slower than native Java jobs. Otherwise Python works great in streaming.

Alex

On Tue, May 19, 2009 at 8:36 AM, s d <s.d.sau...@gmail.com> wrote:



Hi,
How robust is using Hadoop with Python over the streaming protocol? Any disadvantages (performance? flexibility?)? It just strikes me that Python is so much more convenient when it comes to deploying and crunching text files.
Thanks,









--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch




