Re: Hadoop & Python

Dan Milstein Thu, 21 May 2009 05:20:01 -0700

One thing about the | sort | sh combiner.sh approach: you do have to be careful about memory if you're doing that -- if a mapper instance sees a large number of rows, you'll be asking sort to sort *all* of those before passing them to the combiner. Hadoop itself only hands off some bounded number of output keys at a time to the combiner, which is much safer for large data sets.

In dumbo itself, Klaas added "combine a chunk at a time" to address this problem.
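
To make the "chunk at a time" idea concrete, here is a minimal sketch of a bounded-memory combine step. It is not dumbo's actual implementation: the function names, the max_keys parameter, and the assumption that values are integer counts in tab-separated key/value lines are all illustrative.

```python
from collections import defaultdict


def parse(lines):
    """Split tab-separated 'key<TAB>count' lines into (key, int) pairs."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, int(value)


def combine_in_chunks(pairs, max_keys=10000):
    """Sum counts per key, holding at most max_keys distinct keys in memory.

    max_keys is a hypothetical tuning knob. Whenever the buffer fills up,
    the partial sums are emitted and the buffer is cleared, so a key can
    appear more than once in the output.
    """
    buffer = defaultdict(int)
    for key, value in pairs:
        buffer[key] += value
        if len(buffer) >= max_keys:
            # Flush the partial sums collected so far.
            for item in sorted(buffer.items()):
                yield item
            buffer.clear()
    # Emit whatever is left at the end of the input.
    for item in sorted(buffer.items()):
        yield item
```

In a streaming mapper this would be fed from parse(sys.stdin) and its output written back as tab-separated lines. Because the sums are only partial, the reducer still has to add up every value it sees for a key.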

(and, yes, overall, getting combines fully supported in streaming is awesome)


-D

On May 19, 2009, at 5:04 PM, Peter Skomoroch wrote:

Whoops, should have googled it first.  Looks like this is now fixed in
trunk, HADOOP-4842. For people stuck using 18.3, a workaround appears to be
adding something like "| sort | sh combiner.sh" to the call of the mapper
script (via Klaas Bosteels)
Would be great to get this patched into distributions like EMR and Cloudera
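
For illustration, the combiner at the end of such a pipe could itself be written in Python. The sketch below assumes the mapper emits tab-separated "key<TAB>count" lines and that sort has already grouped equal keys together; the function name combine_sorted and the counting job are hypothetical, not something from the workaround itself.

```python
from itertools import groupby


def combine_sorted(lines):
    """Sum integer counts per key over lines that are already sorted by key.

    Only one key's running total is held at a time, so the combiner itself
    needs very little memory; the sorting is done upstream by sort.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)
```

A script wrapping this would read from sys.stdin and print each pair back as "key<TAB>total", taking the place of combiner.sh in the pipe.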
On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch
<peter.skomor...@gmail.com> wrote:
One area I'm curious about is the requirement that any combiners in
Streaming jobs be java classes. Are there any plans to change this in the
future? Prototyping streaming jobs in Python is great, and the ability to
use a Python combiner would help performance a lot without needing to move
to Java.
On Tue, May 19, 2009 at 4:30 PM, Amr Awadallah <a...@cloudera.com> wrote:
S d,
It is totally fine to use Python streaming if it does the job you are
after, there will be a slight performance hit, but that is noise assuming
your cluster is a small one. If you are operating a large cluster
continuously, then once your logic is stabilized using Python it might make
sense to convert/operationalize some jobs to Java (or C pipes) to improve
performance for purpose of finishing quicker or reducing number of servers
needed.
You should also take a look at PIG and Hive, they are both higher level
languages and very easy to learn:

http://www.cloudera.com/hadoop-training-pig-introduction

http://www.cloudera.com/hadoop-training-hive-introduction

-- amr


s d wrote:
Thanks.
So in the overall scheme of things, what is the general feeling about
using python for this? I like the ease of deploying and reading python compared
with Java but want to make sure using python over hadoop is scalable & is
standard practice and not something done only for prototyping and small
scale tests.
On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard <a...@cloudera.com>
wrote:
Streaming is slightly slower than native Java jobs. Otherwise Python works
great in streaming.

Alex

On Tue, May 19, 2009 at 8:36 AM, s d <s.d.sau...@gmail.com> wrote:
Hi,
How robust is using hadoop with python over the streaming protocol? Any
disadvantages (performance? flexibility?)? It just strikes me that python
is so much more convenient when it comes to deploying and crunching text
files.
Thanks,
--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
