Thanks. Up to what number of servers and what file sizes would the performance hit stay minor? I am concerned about implementing it all only to rewrite it later to scale economically. Thanks for all the information.
On Tue, May 19, 2009 at 1:30 PM, Amr Awadallah <a...@cloudera.com> wrote:

> S d,
>
> It is totally fine to use Python streaming if it does the job you are
> after, there will be a slight performance hit, but that is noise assuming
> your cluster is a small one. If you are operating a large cluster
> continuously, then once your logic is stabilized using Python it might make
> sense to convert/operationalize some jobs to Java (or C pipes) to improve
> performance for purpose of finishing quicker or reducing number of servers
> needed.
>
> You should also take a look at PIG and Hive, they are both higher level
> languages and very easy to learn:
>
> http://www.cloudera.com/hadoop-training-pig-introduction
>
> http://www.cloudera.com/hadoop-training-hive-introduction
>
> -- amr
>
> s d wrote:
>
>> Thanks.
>> So in the overall scheme of things, what is the general feeling about
>> using python for this? I like the ease of deploying and reading python
>> compared with Java but want to make sure using python over hadoop is
>> scalable & is standard practice and not something done only for
>> prototyping and small scale tests.
>>
>> On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard <a...@cloudera.com>
>> wrote:
>>
>>> Streaming is slightly slower than native Java jobs. Otherwise Python
>>> works great in streaming.
>>>
>>> Alex
>>>
>>> On Tue, May 19, 2009 at 8:36 AM, s d <s.d.sau...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> How robust is using hadoop with python over the streaming protocol? Any
>>>> disadvantages (performance? flexibility?) ? It just strikes me that
>>>> python is so much more convenient when it comes to deploying and
>>>> crunching text files.
>>>> Thanks,
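
For readers unfamiliar with the approach discussed in this thread, here is a minimal sketch of a Python job for Hadoop Streaming. It is not taken from any of the messages above. It only illustrates the general contract: the mapper and reducer read lines on stdin and write tab-separated key/value lines on stdout, and the reducer receives its input sorted by key. The word-count task, the file name `wordcount.py`, and the `map`/`reduce` command-line argument are assumptions made for illustration.

```python
#!/usr/bin/env python
"""Word count for a streaming job (hypothetical file name: wordcount.py).

Run as `wordcount.py map` for the map step or `wordcount.py reduce`
for the reduce step; this argument convention is an assumption.
"""
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_lines(lines):
    """Sum the counts for each key.

    Assumes the input is already sorted by key, which the framework
    does between the map and reduce phases.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(count) for _, count in group))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pick the step from the first argument and stream stdin to stdout.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Such a script can be checked locally without a cluster by piping a text file through the map step, then `sort`, then the reduce step. On a cluster, the same two commands would be handed to the streaming jar as the mapper and the reducer.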