If you don't want the key in the final output, you can set it like this in Java:

    job.setOutputKeyClass(NullWritable.class);
It will just print the value in the output file. I don't know how to do it in Python.

On 9/10/14, Dmitry Sivachenko <trtrmi...@gmail.com> wrote:
> Hello!
>
> Imagine the following common task: I want to process a big text file
> line by line using the streaming interface -- run the unix grep command,
> for instance, or some other line-by-line processing, e.g. line.upper().
> I copy the file to HDFS.
>
> Then I run a map task on this file which reads one line, modifies it in
> some way, and then writes it to the output.
>
> TextInputFormat suits reading well: its key is the offset in bytes
> (meaningless in my case) and the value is the line itself, so I can
> iterate over lines like this (in Python):
>
>     for line in sys.stdin:
>         print(line.upper())
>
> The problem arises with TextOutputFormat: it tries to split the resulting
> line on mapreduce.output.textoutputformat.separator, which results in an
> extra separator in the output if this character is missing from the line
> (an extra TAB at the end if we stick to the defaults).
>
> Is there any way to write the result of a streaming task without any
> internal processing, so it appears exactly as the script produces it?
>
> If this is impossible with Hadoop, which works with key/value pairs,
> maybe there are other frameworks on top of HDFS which allow this?
>
> Thanks in advance!
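To see where the extra TAB comes from, here is a minimal pure-Python simulation (not Hadoop itself; the function names are illustrative, not Hadoop API) of what the streaming layer does: it splits each mapper stdout line at the first separator into a key/value pair, and TextOutputFormat then re-joins them with the separator, so a line with no TAB gains a trailing one.

```python
# Simulation of Hadoop streaming's key/value handling with default settings.
# Not Hadoop code -- names are illustrative.

SEP = "\t"  # default value of mapreduce.output.textoutputformat.separator

def split_streaming_line(line):
    """Streaming splits mapper output on the first separator; if the
    separator is absent, the whole line is the key and the value is empty."""
    key, _, value = line.partition(SEP)
    return key, value

def text_output_format(key, value):
    """TextOutputFormat writes key, separator, value."""
    return key + SEP + value

line = "HELLO WORLD"  # mapper output containing no TAB
key, value = split_streaming_line(line)
print(repr(text_output_format(key, value)))  # -> 'HELLO WORLD\t' (extra TAB)
```

This is why suppressing the key (e.g. via NullWritable in a Java job) avoids the extra separator: there is then nothing to join the value to.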