Sure! Some equivalent should be possible. And as Runping already posted, there have been some ideas about implementing binary data processing in Hadoop Streaming: https://issues.apache.org/jira/browse/HADOOP-1722 However, this hasn't happened yet.
That's why I am looking for a minimum-effort workaround. Reading binary data (in my case the data are floats being processed by a C-coded mapper) just seems to be much faster than parsing them from text (to float). I am going to implement a base64 version to find out whether it's still faster than text parsing.

John Pra wrote:
>
> John,
>
> That's an interesting approach, but isn't it possible that an equivalent
> \n might get encoded in the binary data?
>
> John Menzer <[EMAIL PROTECTED]> wrote:
>
> So you mean you changed the Hadoop Streaming source code? Actually I am
> not really willing to change the source code if it's not necessary.
>
> So I thought about simply encoding the input binary data to text (e.g.
> with base64) and then adding a '\n' after each line to make it splittable
> for streaming. After reading from stdin, my C program would just have to
> decode it, map/reduce it, and then encode it back to base64 to write to
> stdout.
>
> What do you think about that? Worth a try?
>
>
> Joydeep Sen Sarma wrote:
>>
>> Actually, this is possible, but changes to streaming are required.
>>
>> At one point we had gotten rid of the '\n' and '\t' separators between
>> the keys and the values in the streaming code and streamed byte arrays
>> directly to scripts (and then decoded them in the script). It worked
>> perfectly fine. (In fact we were streaming Thrift-generated byte
>> streams: encoded in Java land and decoded in Python land. :-))
>>
>> Binary data on HDFS is best stored as SequenceFiles (if you store
>> binary data in what looks to Hadoop like a text file, bad things will
>> happen). If stored this way, Hadoop doesn't care about newlines and
>> tabs; those are purely artifacts of streaming.
>>
>> Also, the streaming code (for unknown reasons) doesn't allow a
>> SequenceFileInputFormat. There were minor tweaks we had to make to the
>> streaming driver to allow this stuff ...
>>
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:[EMAIL PROTECTED]]
>> Sent: Mon 4/7/2008 7:43 AM
>> To: core-user@hadoop.apache.org
>> Subject: Re: streaming + binary input/output data?
>>
>>
>> I don't think that binary input works with streaming because of the
>> assumption of one record per line.
>>
>> If you want to script map-reduce programs, would you be open to a Groovy
>> implementation that avoids these problems?
>>
>>
>> On 4/7/08 6:42 AM, "John Menzer" wrote:
>>
>>>
>>> Hi,
>>>
>>> I would like to use binary input and output data in combination with
>>> Hadoop Streaming.
>>>
>>> The reason I want to use binary data is that parsing text to float
>>> seems to consume a lot of time compared to directly reading the
>>> binary floats.
>>>
>>> I am using a C-coded mapper (getting streaming data from stdin and
>>> writing to stdout) and no reducer.
>>>
>>> So my question is: how do I implement binary input/output in this
>>> context? As far as I understand, I need to put a '\n' char at the end
>>> of each binary 'line' so Hadoop knows how to split/distribute the
>>> input data among the nodes and how to collect it for output(?).
>>>
>>> Is this approach reasonable?
>>>
>>> Thanks,
>>> John
>>
>
> --
> View this message in context:
> http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16656661.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>

--
View this message in context: http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16691343.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.