Re: Writing output from streaming task without dealing with key/value

2014-09-11 Thread Dmitry Sivachenko
After a streaming job writes some data to stdout, some Hadoop code receives it
and splits it into a key/value pair before it reaches TextOutputFormat.
Can anyone point me to that piece of code, please?

Thanks!

On 11 Sep 2014, at 0:37, Dmitry Sivachenko trtrmi...@gmail.com wrote:

 
 On 10 Sep 2014, at 22:33, Felix Chern idry...@gmail.com wrote:
 
 Use ‘tr -s’ to strip out tabs?
 
 $ echo -e "a\t\t\tb"
 a                       b
 
 $ echo -e "a\t\t\tb" | tr -s "\t"
 a       b
 
 
 There can be tabs in the input, and I want to keep the input lines without any
 modification.
 
 Actually this is a rather standard task: process lines one by one without
 inserting extra characters.  There should be a standard solution for it, IMO.
 



Re: Writing output from streaming task without dealing with key/value

2014-09-11 Thread Dmitry Sivachenko
Okay, FWIW I found the solution:

https://issues.apache.org/jira/browse/MAPREDUCE-6085

Thanks to all who replied.





Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Susheel Kumar Gadalay
If you don't want the key in the final output, you can set it like this in Java:

job.setOutputKeyClass(NullWritable.class);

It will just print the value in the output file.

I don't know how to do it in Python.

On 9/10/14, Dmitry Sivachenko trtrmi...@gmail.com wrote:
 Hello!

 Imagine the following common task: I want to process a big text file
 line by line using the streaming interface.
 Run the Unix grep command, for instance.  Or some other line-by-line processing,
 e.g. line.upper().
 I copy the file to HDFS.

 Then I run a map task on this file which reads one line, modifies it in some
 way and then writes it to the output.

 TextInputFormat suits this well for reading: its key is the offset in bytes
 (meaningless in my case) and the value is the line itself, so I can iterate
 over the lines like this (in Python):
 import sys
 for line in sys.stdin:
     # line still ends with "\n", so write it as is instead of print()ing it
     sys.stdout.write(line.upper())

 The problem arises with TextOutputFormat: it tries to split the resulting
 line on mapreduce.output.textoutputformat.separator, which results in an extra
 separator in the output if this character is missing from the line (for
 instance, an extra TAB at the end if we stick to the defaults).

 Is there any way to write the result of a streaming task without any internal
 processing, so that it appears exactly as the script produces it?

 If this is impossible with Hadoop, which works with key/value pairs, maybe
 there are other frameworks that work on top of HDFS and allow
 this?

 Thanks in advance!
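
To make the symptom in the question concrete, here is a minimal Python model of the behavior it describes. This is not Hadoop code: split_stream_line, text_output_line and round_trip are hypothetical helpers, and the model simply assumes that each line printed by the script is split at the first TAB and that key and value are then joined again with the separator.

```python
def split_stream_line(line, sep="\t"):
    """Split one line of script output at the first separator.

    Assumption of this model: a line without a separator becomes
    (line, ""), i.e. the whole line is the key and the value is empty.
    """
    key, _, value = line.partition(sep)
    return key, value


def text_output_line(key, value, sep="\t"):
    """Join key and value with the separator, as the question describes."""
    return key + sep + value


def round_trip(line):
    """Model what lands in the output file for one line the script printed."""
    key, value = split_stream_line(line)
    return text_output_line(key, value)


if __name__ == "__main__":
    for line in ["HELLO WORLD", "COL1\tCOL2"]:
        print(repr(line), "->", repr(round_trip(line)))
```

Under these assumptions a line that already contains a TAB comes back unchanged, while a line without one gains a trailing TAB, which is the extra character the question is about.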


Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Rich Haase
In Python, or any streaming program, just set the output value to the empty
string and you will get something like key\t.





-- 
*Kernighan's Law*
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it.


Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Dmitry Sivachenko

On 10 Sep 2014, at 22:05, Rich Haase rdha...@gmail.com wrote:

 In Python, or any streaming program, just set the output value to the empty
 string and you will get something like key\t.
 


I see, but I want to use many existing programs (like UNIX grep), and I don't
want to have an extra \t in the output.

Is there any way to achieve this?  Or maybe it is possible to write a custom
XxxOutputFormat to work around that issue?

(Something opposite to TextInputFormat: it passes the input line to the script's
stdin without any modification, so there should be a way to write stdout to a
file as is.)


Thanks!





Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Rich Haase
You can write a custom output format, or you can write your MapReduce job
in Java and use a NullWritable as Susheel recommended.

grep (and every other *nix text-processing command I can think of) would
not be hindered by a trailing tab character.  It's even quite easy to strip
away that tab character, if you don't want it, during the post-processing
steps you want to perform with *nix commands.







Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Dmitry Sivachenko

On 10 Sep 2014, at 22:19, Rich Haase rdha...@gmail.com wrote:

 You can write a custom output format


Any clues as to how this can be done?



 , or you can write your MapReduce job in Java and use a NullWritable as
 Susheel recommended.
 
 grep (and every other *nix text-processing command I can think of) would not
 be hindered by a trailing tab character.  It's even quite easy to strip away
 that tab character, if you don't want it, during the post-processing steps you
 want to perform with *nix commands.


The problem is that when the line itself contains a TAB in the middle, there
will be no extra trailing TAB at the end.
So it is not that simple:
you never know whether a TAB comes from the original line or is an extra TAB
added by TextOutputFormat.

Thanks!
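
The ambiguity described in this message can be sketched in a few lines of Python. The helpers written_line and naive_cleanup are hypothetical, and the model assumes (as the thread does) that a script line is split at the first TAB and written back as key, TAB, value.

```python
def written_line(script_line, sep="\t"):
    """Model of the line that ends up in the output file.

    Assumption: the line is split at the first separator and then
    written back as key + separator + value.
    """
    key, _, value = script_line.partition(sep)
    return key + sep + value


def naive_cleanup(output_line):
    """Post-processing attempt: drop one trailing TAB if there is one."""
    return output_line[:-1] if output_line.endswith("\t") else output_line


if __name__ == "__main__":
    for original in ["abc", "abc\t", "a\tb"]:
        stored = written_line(original)
        restored = naive_cleanup(stored)
        print(repr(original), repr(stored), repr(restored), restored == original)
```

In this model "abc" and "abc\t" are stored identically, so no post-processing step can restore both; stripping the trailing TAB repairs the first line and corrupts the second.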

Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Shahab Yunus
Examples (the top ones are related to streaming jobs):

http://www.infoq.com/articles/HadoopOutputFormat
http://research.neustar.biz/2011/08/30/custom-inputoutput-formats-in-hadoop-streaming/
http://stackoverflow.com/questions/12759651/how-to-override-inputformat-and-outputformat-in-hadoop-application

Regards,
Shahab



Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Dmitry Sivachenko


 On 10 Sep 2014, at 22:47, Shahab Yunus shahab.yu...@gmail.com wrote:
 
 Examples (the top ones are related to streaming jobs):
 
 http://www.infoq.com/articles/HadoopOutputFormat
 http://research.neustar.biz/2011/08/30/custom-inputoutput-formats-in-hadoop-streaming/
 http://stackoverflow.com/questions/12759651/how-to-override-inputformat-and-outputformat-in-hadoop-application
 


Thanks for the links.  The problem is that in RecordWriter.write() I get two
parameters: key and value. If one of them is empty, I have no way to tell whether
I should output the delimiter (because it was present in the original line) or not.

What is the proper way to work around that issue?


 


Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Felix Chern
Use ‘tr -s’ to strip out tabs?

 $ echo -e "a\t\t\tb"
a                       b

 $ echo -e "a\t\t\tb" | tr -s "\t"
a       b
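
For readers who want to try the suggestion without a shell, here is a rough Python equivalent of squeezing repeated tabs. squeeze_tabs is a hypothetical helper that only approximates what tr -s does for the TAB character.

```python
import re


def squeeze_tabs(line):
    """Collapse every run of consecutive TABs into a single TAB."""
    return re.sub(r"\t+", "\t", line)


if __name__ == "__main__":
    print(repr(squeeze_tabs("a\t\t\tb")))      # run of three TABs becomes one
    print(repr(squeeze_tabs("line\t")))        # a single trailing TAB is kept
    print(repr(squeeze_tabs("x\t\ty\t")))      # genuine empty field is lost
```

Squeezing leaves a single trailing TAB in place and also collapses runs of TABs that belong to the data, so it changes input lines rather than preserving them.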





Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Felix Chern
If you don’t want anything to get inserted, just set your output to key only or
value only.
TextOutputFormat$LineRecordWriter won’t insert anything unless both key and
value are set:

public synchronized void write(K key, V value)
  throws IOException {

  boolean nullKey = key == null || key instanceof NullWritable;
  boolean nullValue = value == null || value instanceof NullWritable;
  if (nullKey && nullValue) {
    return;
  }
  if (!nullKey) {
    writeObject(key);
  }
  // the separator is written only when both key and value are present
  if (!(nullKey || nullValue)) {
    out.write(keyValueSeparator);
  }
  if (!nullValue) {
    writeObject(value);
  }
  out.write(newline);
}
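
The branching in the method above can be traced with a small Python port. This is only a sketch for experimenting with the cases: write_record is a hypothetical function that returns the text instead of writing to a stream, and None stands in for both Java null and NullWritable.

```python
def write_record(key, value, sep="\t", newline="\n"):
    """Return the text the write() logic above would emit for one record.

    None plays the role of Java null / NullWritable in this sketch.
    """
    null_key = key is None
    null_value = value is None
    if null_key and null_value:
        return ""  # nothing is written at all
    parts = []
    if not null_key:
        parts.append(str(key))
    if not (null_key or null_value):
        parts.append(sep)  # separator only when both sides are present
    if not null_value:
        parts.append(str(value))
    parts.append(newline)
    return "".join(parts)


if __name__ == "__main__":
    for key, value in [("k", "v"), ("k", None), (None, "v"), ("k", "")]:
        print(repr((key, value)), "->", repr(write_record(key, value)))
```

Note the last case: an empty string is not null, so the separator is still written. That matches the trailing TAB reported earlier in the thread, on the assumption that a streaming line without a TAB reaches the writer with an empty value rather than a null one.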




Re: Writing output from streaming task without dealing with key/value

2014-09-10 Thread Dmitry Sivachenko
On 11 Sep 2014, at 0:47, Felix Chern idry...@gmail.com wrote:

 If you don’t want anything to get inserted, just set your output to key only or
 value only.
 TextOutputFormat$LineRecordWriter won’t insert anything unless both key and
 value are set:


If I output the value only, for instance, and my line contains a TAB, will
everything before the TAB be lost?
If I output the key only, and my line contains a TAB, will everything after the
TAB be lost?
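
The concern raised here can be illustrated with a short Python sketch. It does not answer what Hadoop actually does; it only assumes, as the question does, that the line has already been split at the first TAB before the key-only or value-only choice is made. split_first_tab, key_only and value_only are hypothetical names.

```python
def split_first_tab(line):
    """Split a line at the first TAB into (key, value)."""
    key, _, value = line.partition("\t")
    return key, value


def key_only(line):
    """What would be written if only the key were emitted."""
    return split_first_tab(line)[0]


def value_only(line):
    """What would be written if only the value were emitted."""
    return split_first_tab(line)[1]


if __name__ == "__main__":
    line = "left\tright"
    print(repr(key_only(line)))    # the part after the TAB is gone
    print(repr(value_only(line)))  # the part before the TAB is gone
```

Under that assumption neither choice preserves a line that contains a TAB, and value_only returns an empty string for a line with no TAB at all.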


 