S d,
It is totally fine to use Python streaming if it does the job you are
after. There will be a slight performance hit, but that is noise if
your cluster is small. If you are operating a large cluster
continuously, then once your logic has stabilized in Python it might
make sense to convert some jobs to Java (or C++ Pipes) to improve
performance, either to finish quicker or to reduce the number of
servers needed.
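For a sense of what a streaming job in Python looks like, here is a
minimal word-count sketch. It assumes Python 3 and the usual streaming
contract (input lines on stdin, tab-separated key/value records on
stdout, reducer input already sorted by key). The names map_lines and
reduce_lines and the "map"/"reduce" command-line argument are made up
for illustration, not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Sum the counts for each key.

    Assumes records arrive sorted by key, which is how the streaming
    framework hands mapper output to a reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: the same script is run with "map" or "reduce"
# as its only argument; with no argument it does nothing.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    sys.stdout.writelines(record + "\n" for record in step(sys.stdin))
```

With streaming, a script like this would be passed through the -mapper
and -reducer options; the streaming jar location depends on your
install. You can also test it locally by piping a text file through the
map step, sort, and the reduce step.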
You should also take a look at Pig and Hive. They are both higher-level
languages and very easy to learn:
http://www.cloudera.com/hadoop-training-pig-introduction
http://www.cloudera.com/hadoop-training-hive-introduction
-- amr
s d wrote:
Thanks.
So in the overall scheme of things, what is the general feeling about using
python for this? I like the ease of deploying and reading python compared
with Java but want to make sure using python over hadoop is scalable & is
standard practice and not something done only for prototyping and small
scale tests.
On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard <a...@cloudera.com> wrote:
Streaming is slightly slower than native Java jobs. Otherwise Python works
great in streaming.
Alex
On Tue, May 19, 2009 at 8:36 AM, s d <s.d.sau...@gmail.com> wrote:
Hi,
How robust is using hadoop with python over the streaming protocol? Any
disadvantages (performance? flexibility?)? It just strikes me that python
is so much more convenient when it comes to deploying and crunching text
files.
Thanks,