RE: streaming + binary input/output data?

John Menzer Tue, 15 Apr 2008 00:33:31 -0700

ahhhh...i understand...that might be a problem!

well in that case i would need to scan each base64 encoded line for the
'\n' sequence before making any use of it and before adding my own '\n'. i
am quite sure that this would cost a lot of performance, which in turn
would reduce the benefit of using binary data.


that's bad news!

i would need an encoder that automatically avoids the '\n' sequence...
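
A minimal sketch of such an encoder, written in Python rather than C purely for illustration. It relies on the fact that standard base64 output is drawn only from A-Z, a-z, 0-9, '+', '/' and '=', so an encoder that does not wrap its output never emits '\n'. The helper names encode_record and decode_record and the choice of little-endian 32-bit floats are assumptions for this example, not part of hadoop streaming.

```python
import base64
import struct


def encode_record(values):
    """Pack floats as little-endian 32-bit values and base64-encode them."""
    raw = struct.pack("<%df" % len(values), *values)
    # b64encode (unlike base64.encodebytes) never inserts line breaks.
    return base64.b64encode(raw)


def decode_record(line):
    """Inverse of encode_record; tolerates the trailing '\\n' added per record."""
    raw = base64.b64decode(line.rstrip(b"\n"))
    return list(struct.unpack("<%df" % (len(raw) // 4), raw))


if __name__ == "__main__":
    every_byte = bytes(range(256))
    # Raw binary data can contain the newline byte 0x0A ...
    print(b"\n" in every_byte)
    # ... but its base64 encoding cannot.
    print(b"\n" in base64.b64encode(every_byte))
    # One record per line: append '\n' after encoding, strip it before decoding.
    line = encode_record([1.5, -2.25, 0.0]) + b"\n"
    print(decode_record(line))
```

With an encoder like this, no per-line scan for '\n' should be needed. Encoders that wrap their output (for example MIME-style wrapping at 76 characters) would need wrapping switched off.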



Pra wrote:
> 
> John,
>    
>   My meaning didn't come through. 
>    
>   If you encode binary data and treat it like any piece of text going
> through hadoop's default input format, at some point your binary data
> might have a piece that looks like 00001010, in hex it might be 0A, and in
> ascii, might it not be interpreted as \n?
>    
>   Wouldn't you need to ensure that, throughout all of your binary data,
> you don't have a piece of data that might be interpreted as a \n?
>    
>   You may need to define your own input format for this to work.
>    
>   
> 
> John Menzer <[EMAIL PROTECTED]> wrote:
>   
> Sure! Some equivalent should be possible. 
> And as Runping already posted, there have been some ideas about
> implementing binary data processing in hadoop streaming:
> https://issues.apache.org/jira/browse/HADOOP-1722
> However this hasn't happened yet.
> 
> That's why I am looking for a minimum-effort-work-around.
> 
> Reading binary data (in my case the data are floats being processed by a
> C-coded mapper) just seems to be much faster than parsing them from txt
> (to float).
> 
> I am going to implement a base64 version to find out whether it's still
> faster than text-parsing.
> 
> John
> 
> 
> 
> Pra wrote:
>> 
>> John,
>> 
>> That's an interesting approach, but isn't it possible that an equivalent
>> \n might get encoded in the binary data?
>> 
>> John Menzer wrote:
>> 
>> so you mean you changed the hadoop streaming source code?
>> actually i am not really willing to change the source code if it's not
>> necessary.
>> 
>> so i thought about simply encoding the input binary data to txt (e.g.
>> with base64) and then adding a '\n' after each line to make it splittable
>> for streaming.
>> after reading from stdin my C program would just have to decode it,
>> map/reduce it and then encode it back to base64 to write to stdout.
>> 
>> what do you think about that? worth a try?
>> 
>> 
>> 
>> Joydeep Sen Sarma wrote:
>>> 
>>> actually - this is possible - but changes to streaming are required.
>>> 
>>> at one point - we had gotten rid of the '\n' and '\t' separators between
>>> the keys and the values in the streaming code and streamed byte arrays
>>> directly to scripts (and then decoded them in the script). it worked
>>> perfectly fine. (in fact we were streaming thrift generated byte streams -
>>> encoded in java land and decoded in python land :-))
>>> 
>>> the binary data on hdfs is best stored as sequencefiles (if u store binary
>>> data in (what looks to hadoop as) a text file - then bad things will
>>> happen). if stored as sequencefiles - hadoop doesn't care about newlines
>>> and tabs - those are purely artifacts of streaming.
>>> 
>>> also - the streaming code (for unknown reasons) doesn't allow a
>>> SequenceFileInputFormat. there were minor tweaks we had to make to the
>>> streaming driver to allow this stuff ..
>>> 
>>> 
>>> -----Original Message-----
>>> From: Ted Dunning [mailto:[EMAIL PROTECTED]
>>> Sent: Mon 4/7/2008 7:43 AM
>>> To: core-user@hadoop.apache.org
>>> Subject: Re: streaming + binary input/output data?
>>> 
>>> 
>>> I don't think that binary input works with streaming because of the
>>> assumption of one record per line.
>>> 
>>> If you want to script map-reduce programs, would you be open to a Groovy
>>> implementation that avoids these problems?
>>> 
>>> 
>>> On 4/7/08 6:42 AM, "John Menzer" wrote:
>>> 
>>>> 
>>>> hi,
>>>> 
>>>> i would like to use binary input and output data in combination with
>>>> hadoop streaming.
>>>> 
>>>> the reason why i want to use binary data is that parsing text to float
>>>> seems to consume a lot of time compared to directly reading the binary
>>>> floats.
>>>> 
>>>> i am using a C-coded mapper (getting streaming data from stdin and
>>>> writing to stdout) and no reducer.
>>>> 
>>>> so my question is: how do i implement binary input/output in this
>>>> context?
>>>> as far as i understand i need to put a '\n' char at the end of each
>>>> binary-'line', so hadoop knows how to split/distribute the input data
>>>> among the nodes and how to collect it for output(??)
>>>> 
>>>> is this approach reasonable?
>>>> 
>>>> thanks,
>>>> john
>>> 
>>> 
>>> 
>>> 
>> 
>> -- 
>> View this message in context:
>> http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16656661.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>> 
>> 
>> 
>> 
>> 
> 
> -- 
> View this message in context:
> http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16691343.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> 
> 
> 
>         
> 

-- 
View this message in context: 
http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16696744.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
