Hello,

Thanks a lot for the information. It helped me figure out a solution to this problem.
I posted a sketch of the solution on StackOverflow (http://stackoverflow.com/a/19295610/337194) for anybody who is interested.

Best regards,
Youssef Hatem

On Oct 9, 2013, at 14:08, Peter Marron wrote:

> Hi,
>
> The only way that I could find was to override the various InputWriter and
> OutputWriter classes, as defined by the configuration settings:
>
>     stream.map.input.writer.class
>     stream.map.output.reader.class
>     stream.reduce.input.writer.class
>     stream.reduce.output.reader.class
>
> which was painful. Hopefully someone will tell you the _correct_ way to do
> this. If not I will provide more details.
>
> Regards,
>
> Peter Marron
> Trillium Software UK Limited
>
> Tel : +44 (0) 118 940 7609
> Fax : +44 (0) 118 940 7699
> E: peter.mar...@trilliumsoftware.com
>
> -----Original Message-----
> From: Youssef Hatem [mailto:youssef.ha...@rwth-aachen.de]
> Sent: 09 October 2013 12:14
> To: user@hadoop.apache.org
> Subject: Problem with streaming exact binary chunks
>
> Hello,
>
> I wrote a very simple InputFormat and RecordReader to send binary data to
> mappers. Binary data can contain anything (including \n, \t, \r); here is
> what next() may actually send:
>
>     public class MyRecordReader implements
>             RecordReader<BytesWritable, BytesWritable> {
>         ...
>         public boolean next(BytesWritable key, BytesWritable ignore)
>                 throws IOException {
>             ...
>             byte[] result = new byte[8];
>             for (int i = 0; i < result.length; ++i)
>                 result[i] = (byte) (i + 1);
>             result[3] = (byte) '\n';
>             result[4] = (byte) '\n';
>
>             key.set(result, 0, result.length);
>             return true;
>         }
>     }
>
> As you can see, I am using BytesWritable to send eight bytes: 01 02 03 0a
> 0a 06 07 08. I also use Hadoop-1722 typed bytes (by setting -D
> stream.map.input=typedbytes).
> According to the documentation of typed bytes, the mapper should receive
> the following byte sequence:
>
>     00 00 00 08 01 02 03 0a 0a 06 07 08
>
> However, the bytes are somehow modified and I get the following sequence
> instead:
>
>     00 00 00 08 01 02 03 09 0a 09 0a 06 07 08
>
>     0a = '\n'
>     09 = '\t'
>
> It seems that Hadoop (streaming?) parsed the newline character as a
> separator and inserted '\t', which I assume is the key/value separator for
> streaming.
>
> Is there any workaround to send *exactly* the same byte sequence, no
> matter what characters are in it? Thanks in advance.
>
> Best regards,
> Youssef Hatem
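[Editorial note: for readers following the thread, the length-prefixed layout Youssef expects can be reproduced with plain java.io. This is a minimal sketch, not part of the thread: the class name TypedBytesSketch is invented here, and only the 4-byte length prefix shown in the quoted expected sequence is modeled, not the full typed-bytes type-code header.]

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of the length-prefixed frame described in the thread:
// a 4-byte big-endian length followed by the raw payload bytes.
public class TypedBytesSketch {
    static byte[] frame(byte[] payload) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(payload.length); // 4-byte big-endian length prefix
        out.write(payload);           // raw payload, no escaping of \n or \t
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Same eight bytes Youssef's next() produces: 01 02 03 0a 0a 06 07 08.
        byte[] data = new byte[8];
        for (int i = 0; i < data.length; ++i) data[i] = (byte) (i + 1);
        data[3] = (byte) '\n';
        data[4] = (byte) '\n';

        StringBuilder hex = new StringBuilder();
        for (byte b : frame(data)) hex.append(String.format("%02x ", b));
        System.out.println(hex.toString().trim());
        // prints "00 00 00 08 01 02 03 0a 0a 06 07 08"
    }
}
```

This matches the "should receive" sequence in the quoted message; the corrupted sequence shows a 09 (tab) inserted before each 0a (newline) in the payload.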
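[Editorial note: one generic workaround, not taken from the thread or from the linked StackOverflow answer, is to make the payload delimiter-safe before it reaches streaming's line-oriented plumbing, e.g. by Base64-encoding records in the RecordReader and decoding them in the mapper. Whether this fits the original pipeline is an assumption; the sketch below only demonstrates that the round trip preserves bytes containing '\n'.]

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical workaround sketch: Base64 output contains no '\n' or '\t',
// so a line-oriented key/value writer cannot corrupt the encoded record.
public class Base64RoundTrip {
    public static void main(String[] args) {
        // The problematic record from the thread, with two embedded newlines.
        byte[] original = new byte[8];
        for (int i = 0; i < original.length; ++i) original[i] = (byte) (i + 1);
        original[3] = (byte) '\n';
        original[4] = (byte) '\n';

        // RecordReader side: encode before handing the record to streaming.
        String wire = Base64.getEncoder().encodeToString(original);

        // Mapper side: decode the line back into the exact original bytes.
        byte[] decoded = Base64.getDecoder().decode(wire);
        System.out.println(Arrays.equals(original, decoded)); // prints "true"
    }
}
```

The cost is roughly 33% size overhead on the wire, traded for not having to override the InputWriter/OutputReader classes mentioned earlier in the thread.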