One thing about the "| sort | sh combiner.sh" approach: you do have to
be careful about memory if you're doing that -- if a mapper instance
sees a large number of rows, you'll be asking sort to sort *all* of
those before passing them to the combiner. Hadoop itself only hands
off some bounded number of output keys at a time to the combiner,
which is much safer for large data sets.
In dumbo itself, Klaas added "combine a chunk at a time" to address
this problem.
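
To make the chunk-at-a-time idea concrete, here is a rough Python
sketch. It is not dumbo's actual code, just an illustration under my
own assumptions: a word-count style job whose values can simply be
summed, and hypothetical names (combine_in_chunks, map_words, MAX_KEYS)
that don't come from any library. The mapper holds at most a bounded
number of distinct keys, then flushes partial sums and starts over.

```python
import sys
from collections import defaultdict

MAX_KEYS = 10000  # hypothetical bound on distinct keys buffered at once


def combine_in_chunks(pairs, max_keys=MAX_KEYS):
    """Yield partially summed (key, value) pairs.

    The buffer is flushed whenever it holds max_keys distinct keys, so
    memory use stays bounded no matter how many rows the mapper sees.
    """
    buffer = defaultdict(int)
    for key, value in pairs:
        buffer[key] += value
        if len(buffer) >= max_keys:
            # Emit this chunk in key order, then start a fresh one.
            for item in sorted(buffer.items()):
                yield item
            buffer.clear()
    # Emit whatever is left in the final, possibly smaller, chunk.
    for item in sorted(buffer.items()):
        yield item


def map_words(lines):
    """Toy mapper (assumed for illustration): emit (word, 1) per word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Run as a streaming mapper: read lines, write "key<TAB>count"."""
    for key, total in combine_in_chunks(map_words(stdin)):
        stdout.write("%s\t%d\n" % (key, total))
```

Calling main() at the bottom of the script turns it into a streaming
mapper. The same key can still show up in more than one chunk, so the
reducer has to do the final sum; the point is only that no single sort
or buffer ever covers the whole mapper input.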
(and, yes, overall, getting combines fully supported in streaming is
awesome)
-D
On May 19, 2009, at 5:04 PM, Peter Skomoroch wrote:
Whoops, should have googled it first. Looks like this is now fixed in
trunk, HADOOP-4842. For people stuck using 0.18.3, a workaround
appears to be adding something like "| sort | sh combiner.sh" to the
call of the mapper script (via Klaas Bosteels).
Would be great to get this patched into distributions like EMR and
Cloudera.
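
For anyone trying the "| sort | sh combiner.sh" workaround, the
combiner at the end of the pipe could look roughly like the Python
below. This is only a sketch under assumptions of mine: the mapper
writes "key<TAB>count" lines, the counts are integers that can be
summed, and the function names (parse, combine_sorted) are made up for
the example. It relies on sort having put equal keys on adjacent lines.

```python
import sys
from itertools import groupby


def parse(lines):
    """Turn "key<TAB>count" lines into (key, int) pairs, skipping blanks."""
    for line in lines:
        if not line.strip():
            continue
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, int(value)


def combine_sorted(lines):
    """Sum the counts of adjacent lines that share a key.

    groupby only merges neighbours, so the input must already be
    sorted by key (which is what the "| sort" stage provides).
    """
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield key, sum(value for _, value in group)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read sorted mapper output and write one combined line per key."""
    for key, total in combine_sorted(stdin):
        stdout.write("%s\t%d\n" % (key, total))
```

With main() called at the bottom of the file, combiner.sh would just be
a one-line wrapper that runs this script. Only one key's running total
is held in memory at a time, since the sorting is left to sort.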
On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch
<peter.skomor...@gmail.com> wrote:
One area I'm curious about is the requirement that any combiners in
Streaming jobs be Java classes. Are there any plans to change this in
the future? Prototyping streaming jobs in Python is great, and the
ability to use a Python combiner would help performance a lot without
needing to move to Java.
On Tue, May 19, 2009 at 4:30 PM, Amr Awadallah <a...@cloudera.com> wrote:
S d,
It is totally fine to use Python streaming if it does the job you are
after. There will be a slight performance hit, but that is noise
assuming your cluster is a small one. If you are operating a large
cluster continuously, then once your logic is stabilized using Python
it might make sense to convert/operationalize some jobs to Java (or
C++ Pipes) to improve performance, either to finish quicker or to
reduce the number of servers needed.
You should also take a look at Pig and Hive. They are both
higher-level languages and very easy to learn:
http://www.cloudera.com/hadoop-training-pig-introduction
http://www.cloudera.com/hadoop-training-hive-introduction
-- amr
s d wrote:
Thanks.
So in the overall scheme of things, what is the general feeling about
using Python for this? I like the ease of deploying and reading Python
compared with Java, but want to make sure using Python over Hadoop is
scalable and is standard practice, not something done only for
prototyping and small-scale tests.
On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard <a...@cloudera.com> wrote:
Streaming is slightly slower than native Java jobs. Otherwise Python
works great in streaming.
Alex
On Tue, May 19, 2009 at 8:36 AM, s d <s.d.sau...@gmail.com> wrote:
Hi,
How robust is using Hadoop with Python over the streaming protocol?
Any disadvantages (performance? flexibility?)? It just strikes me that
Python is so much more convenient when it comes to deploying and
crunching text files.
Thanks,
--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch