Sure, an equivalent '\n' byte could well turn up in raw binary data.
And as Runping already posted, there have been some ideas about
implementing binary data processing in Hadoop streaming:
https://issues.apache.org/jira/browse/HADOOP-1722
However, this hasn't happened yet.

That's why I am looking for a minimum-effort workaround.

Reading binary data (in my case the data are floats processed by a C-coded
mapper) just seems to be much faster than parsing them from text to float.

I am going to implement a base64 version to find out whether it's still
faster than text parsing.

John



Pra wrote:
> 
> John,
>    
>   That's an interesting approach, but isn't it possible that an equivalent
> \n might get encoded in the binary data?
> 
> John Menzer <[EMAIL PROTECTED]> wrote:
>   
> so you mean you changed the Hadoop streaming source code?
> actually I am not really willing to change the source code if it's not
> necessary.
> 
> so I thought about simply encoding the input binary data as text (e.g. with
> base64) and then adding a '\n' after each record to make it splittable for
> streaming.
> after reading from stdin my C program would just have to decode it,
> map/reduce it, and then encode it back to base64 before writing to stdout.
> 
> what do you think about that? worth a try?
> 
> 
> 
> Joydeep Sen Sarma wrote:
>> 
>> actually - this is possible - but changes to streaming are required.
>> 
>> at one point - we had gotten rid of the '\n' and '\t' separators between
>> the keys and the values in the streaming code and streamed byte arrays
>> directly to scripts (and then decoded them in the script). it worked
>> perfectly fine. (in fact we were streaming thrift-generated byte streams -
>> encoded in java land and decoded in python land :-))
>> 
>> the binary data on hdfs is best stored as SequenceFiles (if you store
>> binary data in (what looks to hadoop like) a text file - then bad things
>> will happen). if stored this way - hadoop doesn't care about newlines and
>> tabs - those are purely artifacts of streaming.
>> 
>> also - the streaming code (for unknown reasons) doesn't allow a
>> SequenceFileInputFormat. there were minor tweaks we had to make to the
>> streaming driver to allow this stuff ..
>> 
>> 
>> -----Original Message-----
>> From: Ted Dunning [mailto:[EMAIL PROTECTED]
>> Sent: Mon 4/7/2008 7:43 AM
>> To: core-user@hadoop.apache.org
>> Subject: Re: streaming + binary input/output data?
>> 
>> 
>> I don't think that binary input works with streaming because of the
>> assumption of one record per line.
>> 
>> If you want to script map-reduce programs, would you be open to a Groovy
>> implementation that avoids these problems?
>> 
>> 
>> On 4/7/08 6:42 AM, "John Menzer" wrote:
>> 
>>> 
>>> hi,
>>> 
>>> i would like to use binary input and output data in combination with
>>> hadoop streaming.
>>> 
>>> the reason why i want to use binary data is that parsing text to float
>>> seems to consume a lot of time compared to directly reading the binary
>>> floats.
>>> 
>>> i am using a C-coded mapper (getting streaming data from stdin and
>>> writing to stdout) and no reducer.
>>> 
>>> so my question is: how do i implement binary input/output in this
>>> context? as far as i understand, i need to put a '\n' char at the end of
>>> each binary 'line' so hadoop knows how to split/distribute the input data
>>> among the nodes and how to collect it for output(??)
>>> 
>>> is this approach reasonable?
>>> 
>>> thanks,
>>> john
>> 
>> 
>> 
>> 
> 
> -- 
> View this message in context:
> http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16656661.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> 
