Re: Hadoop streaming or pipes ..

2012-04-07 Thread Mark question
Thanks all, and Charles, you guided me to the Baidu slides titled
"Introduction to Hadoop C++ Extension":
http://hic2010.hadooper.cn/dct/attach/Y2xiOmNsYjpwZGY6ODI5
They describe Baidu's experience, and the sixth slide shows exactly what I
was looking for. Besides offering no performance gain, pipes still makes
memory hard to manage, hence the development of HCE.

Thanks,
Mark



Hadoop streaming or pipes ..

2012-04-05 Thread Mark question
Hi guys,
  quick question:
   Are there any performance gains from hadoop streaming or pipes over
Java? From what I've read, it's only to ease testing by using your favorite
language. So I guess it is eventually translated to bytecode then executed.
Is that true?

Thank you,
Mark


Re: Hadoop streaming or pipes ..

2012-04-05 Thread Robert Evans
Both streaming and pipes do very similar things.  They fork/exec a separate
process that runs whatever you want it to run.  The JVM that is running Hadoop
then communicates with this process to send the data over and get the
processing results back.  The difference between streaming and pipes is that
streaming uses stdin/stdout for this communication, so preexisting tools like
grep, sed and awk can be used here.  Pipes uses a custom protocol with a C++
library to communicate.  The C++ library is tagged with SWIG-compatible data so
that it can be wrapped to provide APIs in other languages like Python or Perl.
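
As an illustration of the stdin/stdout contract described above (not part of
the original message), here is a minimal sketch of a streaming-style mapper in
Python. It assumes the common convention of one record per line with a
tab-separated key and value on output. The function names `map_lines` and
`run_mapper` and the word-count logic are hypothetical, chosen only for the
example.

```python
import io
import sys


def map_lines(lines):
    """Yield a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    # The child process reads records from stdin, one per line, and
    # writes "key<TAB>value" lines to stdout for the JVM to read back.
    for key, value in map_lines(stdin):
        stdout.write(f"{key}\t{value}\n")


# Small in-memory demo so the sketch runs without a cluster.
sample_input = io.StringIO("a b\na\n")
captured = io.StringIO()
run_mapper(sample_input, captured)
print(captured.getvalue(), end="")
```

In a real job, a script like this would call `run_mapper()` with the default
streams and be passed to the streaming jar through its `-mapper` option.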

I am not sure what the performance difference is between the two, but in my own 
work I have seen a significant performance penalty from using either of them, 
because there is a somewhat large overhead of sending all of the data out to a 
separate process just to read it back in again.
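
To get a rough feel for that round-trip cost (again, not part of the original
message), the sketch below pushes the same data through an in-process identity
function and through a child process that only echoes stdin to stdout. It is
not a Hadoop benchmark. The child program, the data size and the names
`in_process` and `via_child` are assumptions made for the example, and the
timings will vary by machine.

```python
import subprocess
import sys
import time

# Child program: copy stdin to stdout unchanged (stands in for an identity mapper).
CHILD = "import sys; sys.stdout.write(sys.stdin.read())"


def in_process(data: str) -> str:
    # Touch every record without leaving the current process.
    return "".join(data.splitlines(keepends=True))


def via_child(data: str) -> str:
    # Spawn a regular OS process, send the data over, and read it back.
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=data,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


data = "key\tvalue\n" * 100_000

start = time.perf_counter()
local_out = in_process(data)
local_seconds = time.perf_counter() - start

start = time.perf_counter()
child_out = via_child(data)
child_seconds = time.perf_counter() - start

print(f"in-process: {local_seconds:.4f}s, via child process: {child_seconds:.4f}s")
```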

--Bobby Evans





Re: Hadoop streaming or pipes ..

2012-04-05 Thread Mark question
Thanks for the response, Robert. So the overhead will be in read/write
and communication. But is the newly spawned process a JVM or a regular
process?

Thanks,
Mark





Re: Hadoop streaming or pipes ..

2012-04-05 Thread Charles Earl
Also bear in mind that there is a kind of detour involved: a pipes map task
must send its key/value data back to the Java process, which then passes it on
to reduce (more or less).
I think that the Hadoop C++ Extension (HCE, there is a patch) is supposed to be
faster.
I would be interested to know if the community has any experience with HCE
performance.
C
