Actually, there is an old jira about the same issue:
https://issues.apache.org/jira/browse/HADOOP-1722

Runping


> -----Original Message-----
> From: John Menzer [mailto:[EMAIL PROTECTED]
> Sent: Saturday, April 12, 2008 2:45 PM
> To: core-user@hadoop.apache.org
> Subject: RE: streaming + binary input/output data?
> 
> 
> so you mean you changed the hadoop streaming source code?
> actually i am not really willing to change the source code if it's not
> necessary.
> 
> so i thought about simply encoding the input binary data to text (e.g.
> with base64) and then adding a '\n' after each record to make it
> splittable for streaming.
> after reading from stdin my C program would just have to decode it,
> map/reduce it and then encode it back to base64 to write to stdout.
> 
> what do you think about that? worth a try?
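[The base64-per-record scheme described above can be sketched as follows. This is a minimal illustration, not code from the thread; it assumes each record is a fixed-length array of little-endian 32-bit floats, and the helper names `encode_record`/`decode_record` are made up here.]

```python
# Sketch of the proposed scheme: pack floats to binary, base64-encode,
# and emit one line per record. Base64's alphabet never contains '\n',
# so the newline delimiter is safe, at the cost of a ~33% size overhead.
import base64
import struct

def encode_record(floats):
    """Pack a list of floats and base64-encode it as one text line."""
    raw = struct.pack('<%df' % len(floats), *floats)
    return base64.b64encode(raw).decode('ascii')

def decode_record(line, n):
    """Reverse of encode_record: one base64 line -> n floats."""
    raw = base64.b64decode(line.strip())
    return list(struct.unpack('<%df' % n, raw))

rec = [1.5, -2.25, 3.0]          # exactly representable in float32
line = encode_record(rec)
assert '\n' not in line           # delimiter cannot collide with payload
assert decode_record(line, 3) == rec
```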
> 
> 
> 
> Joydeep Sen Sarma wrote:
> >
> > actually - this is possible - but changes to streaming are required.
> >
> > at one point - we had gotten rid of the '\n' and '\t' separators
> > between the keys and the values in the streaming code and streamed
> > byte arrays directly to scripts (and then decoded them in the
> > script). it worked perfectly fine. (in fact we were streaming
> > thrift-generated byte streams - encoded in java land and decoded in
> > python land :-))
> >
> > the binary data on hdfs is best stored as sequencefiles (if you
> > store binary data in what looks to hadoop like a text file - then
> > bad things will happen). if stored this way - hadoop doesn't care
> > about newlines and tabs - those are purely artifacts of streaming.
> >
> > also - the streaming code (for unknown reasons) doesn't allow a
> > SequenceFileInputFormat. there were minor tweaks we had to make to
> > the streaming driver to allow this stuff ..
> >
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:[EMAIL PROTECTED]
> > Sent: Mon 4/7/2008 7:43 AM
> > To: core-user@hadoop.apache.org
> > Subject: Re: streaming + binary input/output data?
> >
> >
> > I don't think that binary input works with streaming because of the
> > assumption of one record per line.
> >
> > If you want to script map-reduce programs, would you be open to a
> > Groovy implementation that avoids these problems?
> >
> >
> > On 4/7/08 6:42 AM, "John Menzer" <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> hi,
> >>
> >> i would like to use binary input and output data in combination
> >> with hadoop streaming.
> >>
> >> the reason why i want to use binary data is that parsing text to
> >> float seems to consume a lot of time compared to directly reading
> >> the binary floats.
> >>
> >> i am using a C-coded mapper (getting streaming data from stdin and
> >> writing
> >> to stdout) and no reducer.
> >>
> >> so my question is: how do i implement binary input/output in this
> >> context?
> >> as far as i understand i need to put a '\n' char at the end of each
> >> binary 'line', so hadoop knows how to split/distribute the input
> >> data among the nodes and how to collect it for output(??)
> >>
> >> is this approach reasonable?
> >>
> >> thanks,
> >> john
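[The catch with a raw '\n' after each binary record - and the reason the replies above steer toward base64 or SequenceFiles - is that the delimiter byte can occur inside the float data itself. A quick demonstration, assuming little-endian IEEE-754 32-bit floats:]

```python
# The byte 0x0a ('\n') can appear *inside* a packed float, so a reader
# splitting on newlines would break the record mid-payload.
import struct

raw = b'\x0a\x00\x80\x3f'            # a perfectly valid float32 bit pattern
value = struct.unpack('<f', raw)[0]  # a number slightly above 1.0
assert b'\n' in struct.pack('<f', value)  # repacking reproduces the newline byte
```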
> >
> >
> >
> >
> 
> --
> View this message in context: http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16656661.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
