Actually, there is an old jira about the same issue: https://issues.apache.org/jira/browse/HADOOP-1722
Runping

> -----Original Message-----
> From: John Menzer [mailto:[EMAIL PROTECTED]
> Sent: Saturday, April 12, 2008 2:45 PM
> To: core-user@hadoop.apache.org
> Subject: RE: streaming + binary input/output data?
>
> So you mean you changed the Hadoop streaming source code? Actually, I am
> not really willing to change the source code if it isn't necessary.
>
> So I thought about simply encoding the binary input data as text (e.g.
> with Base64) and then adding a '\n' after each record to make it
> splittable for streaming. After reading from stdin, my C program would
> just have to decode it, map/reduce it, and then encode it back to Base64
> before writing to stdout.
>
> What do you think about that? Worth a try?
>
>
> Joydeep Sen Sarma wrote:
> >
> > Actually, this is possible, but changes to streaming are required.
> >
> > At one point we got rid of the '\n' and '\t' separators between the
> > keys and the values in the streaming code and streamed byte arrays
> > directly to the scripts (which then decoded them). It worked perfectly
> > fine. (In fact, we were streaming Thrift-generated byte streams:
> > encoded in Java land and decoded in Python land. :-))
> >
> > Binary data on HDFS is best stored as SequenceFiles (if you store
> > binary data in what looks to Hadoop like a text file, bad things will
> > happen). Stored this way, Hadoop doesn't care about newlines and tabs;
> > those are purely artifacts of streaming.
> >
> > Also, the streaming code (for unknown reasons) doesn't allow a
> > SequenceFileInputFormat. There were minor tweaks we had to make to the
> > streaming driver to allow this.
> >
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:[EMAIL PROTECTED]
> > Sent: Mon 4/7/2008 7:43 AM
> > To: core-user@hadoop.apache.org
> > Subject: Re: streaming + binary input/output data?
> >
> >
> > I don't think binary input works with streaming because of the
> > assumption of one record per line.
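John's Base64 workaround above can be sketched as follows. This is an illustrative sketch only (in Python rather than the thread's C, with hypothetical helper names): each record of raw floats is packed to bytes, Base64-encoded (the Base64 alphabet contains no '\n' or '\t', so streaming's separators stay unambiguous), and newline-terminated so streaming can split records.

```python
import base64
import struct

def encode_records(records):
    """Encode each record (a list of floats) as one Base64 text line.

    Hypothetical sketch of the workaround discussed in the thread: pack
    the floats as little-endian 32-bit values, Base64-encode the bytes,
    and join the lines with '\n' so streaming sees one record per line.
    """
    lines = []
    for floats in records:
        raw = struct.pack("<%df" % len(floats), *floats)
        lines.append(base64.b64encode(raw).decode("ascii"))
    return "\n".join(lines) + "\n"

def decode_record(line):
    """Inverse: recover the list of floats from one Base64 text line."""
    raw = base64.b64decode(line.strip())
    return list(struct.unpack("<%df" % (len(raw) // 4), raw))
```

The C mapper would perform the same decode/encode steps on stdin/stdout; the cost of one Base64 pass per record is the price paid for keeping streaming's line-oriented framing intact.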
> >
> > If you want to script map-reduce programs, would you be open to a
> > Groovy implementation that avoids these problems?
> >
> >
> > On 4/7/08 6:42 AM, "John Menzer" <[EMAIL PROTECTED]> wrote:
> >
> >> Hi,
> >>
> >> I would like to use binary input and output data in combination with
> >> Hadoop streaming.
> >>
> >> The reason I want to use binary data is that parsing text to float
> >> seems to consume a lot of time compared to directly reading the
> >> binary floats.
> >>
> >> I am using a C-coded mapper (reading data from stdin and writing to
> >> stdout) and no reducer.
> >>
> >> So my question is: how do I implement binary input/output in this
> >> context? As far as I understand, I need to put a '\n' char at the end
> >> of each binary 'line' so Hadoop knows how to split/distribute the
> >> input data among the nodes and how to collect it for output(?).
> >>
> >> Is this approach reasonable?
> >>
> >> Thanks,
> >> John
>
> --
> View this message in context: http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16656661.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
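The mapper side of the scheme John asks about can be illustrated with a minimal sketch, assuming Base64-encoded records of little-endian floats as proposed earlier in the thread (Python for brevity here, though the thread's actual mapper is C; the count/sum output is a hypothetical toy metric, not anything from the thread).

```python
import base64
import struct
import sys

def map_line(line):
    """Decode one Base64-encoded record of little-endian 32-bit floats
    and emit a tab-separated 'count<TAB>sum' line, the usual
    key<TAB>value shape streaming expects from a mapper."""
    raw = base64.b64decode(line.strip())
    floats = struct.unpack("<%df" % (len(raw) // 4), raw)
    return "%d\t%f" % (len(floats), sum(floats))

if __name__ == "__main__":
    # Hadoop streaming hands the mapper its input on stdin, one
    # newline-terminated record per line.
    for line in sys.stdin:
        if line.strip():
            print(map_line(line))
```

Because the payload never leaves the Base64 alphabet, the record itself can never collide with the '\n' and '\t' separators that streaming imposes, which is exactly the concern Ted raises about the one-record-per-line assumption.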