Managing stdout in streaming

Keith Wiley Tue, 01 Feb 2011 12:55:57 -0800

So streaming uses stdout to organize the mapper/reducer output, one record per 
line with each key/val split at the first TAB.


(Presumably multiple TABs are permitted and simply become embedded in the 
value string; I haven't experimented with this yet.)
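
To make the split rule concrete, here is a minimal sketch of "split at the first TAB" as described above. It is only an illustration, not Hadoop's own code; the function name split_record is made up for this example, and returning an empty value for a line with no TAB is an assumption of the sketch.

```python
def split_record(line):
    """Split one streaming record into (key, value) at the first TAB."""
    line = line.rstrip("\n")
    # partition() splits at the FIRST separator only, so any later TABs
    # remain embedded in the value string.
    key, _sep, value = line.partition("\t")
    return key, value


if __name__ == "__main__":
    # The second TAB stays inside the value.
    print(split_record("k1\tv1\tv2\n"))
```

Under this reading, a record with several TABs keeps everything after the first one as its value.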

Obviously, one must be very careful not to write any debugging or logging 
output to stdout.  It seems fairly straightforward to simply use stderr 
instead, such that all associated output appears in the job tracker logs.

Buuuuut, what if I'm using a third-party library and I can't tell it to send 
output elsewhere?  I know that it is possible to redirect stdout using tricks 
like freopen(), but I believe it can be quite tricky to redirect stdout back to 
its original stream.  So if I directed stdout away from the original stream for 
processing, I'm not sure how I would latch it back onto the stream for the 
purpose of generating my mapper/reducer output data (in the Hadoop streaming 
TAB-delimited line-per-record format).
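
One approach that might work on Linux is to redirect at the file-descriptor level instead of with freopen(): duplicate descriptor 1 to keep a handle on the original stream, point descriptor 1 at stderr while the library runs, then copy the saved descriptor back. The sketch below assumes a Python mapper; os.dup and os.dup2 wrap the POSIX dup() and dup2() calls, so the same steps should be available from C. The names redirect_fd and noisy_library_call are hypothetical, invented for this illustration.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def redirect_fd(src_fd=1, dst_fd=2):
    """Temporarily make src_fd refer to whatever dst_fd refers to."""
    sys.stdout.flush()               # push out records already buffered
    saved_fd = os.dup(src_fd)        # keep a handle on the original stream
    try:
        os.dup2(dst_fd, src_fd)      # src_fd now writes where dst_fd writes
        yield
    finally:
        sys.stdout.flush()           # flush anything written while redirected
        os.dup2(saved_fd, src_fd)    # reattach src_fd to the original stream
        os.close(saved_fd)


def noisy_library_call():
    """Hypothetical stand-in for a third-party routine that prints to stdout."""
    os.write(1, b"library chatter\n")
    return 42


if __name__ == "__main__":
    with redirect_fd():              # stdout (fd 1) -> stderr (fd 2)
        result = noisy_library_call()
    # Back on the original stdout: emit a TAB-delimited record.
    sys.stdout.write("some_key\t%d\n" % result)
```

One caveat with this sketch: if the library writes through its own buffered C stdio stream, that buffer probably needs to be flushed before the descriptor is restored, otherwise the buffered text could still come out on the original stream afterwards.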

Any thoughts on this?  The cluster is running Linux, incidentally.  I realize 
details like that become important when one starts fiddling with redirecting 
streams and such.

Thank you.

________________________________________________________________________________
Keith Wiley               kwi...@keithwiley.com               www.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered."
  -- Keith Wiley
________________________________________________________________________________
