>On 1/15/08 12:54 PM, "Miles Osborne" <[EMAIL PROTECTED]> wrote:
>
>> surely the clean way (in a streaming environment) would be to define a
>> representation of some kind which serialises the output.
>> 
>> http://en.wikipedia.org/wiki/Serialization
>> 
>> after your mappers and reducers have completed, you would then have some
>> code which deserialises (unpacks) the output as desired.  this would easily
>> allow you to reconstruct the two files from a single (set of) file
>> fragments.


On Tue, 15 Jan 2008 12:56:12 PST, Ted Dunning wrote: 
>Also, this gives you a solution to your race condition (by using hadoop's
>mechanisms) and it also gives you much higher
>throughput/reliability/scalability than writing to NFS can possibly give
>you.
>

I agree that serializing and using the standard Hadoop output stream
best leverages the Hadoop mechanisms.  I even labeled it the "proper"
way, and talked about serialization (without using that word):

>> On 15/01/2008, John Heidemann <[EMAIL PROTECTED]> wrote:
>>...
>>> There's a second way, which is where most of the discussion has gone,
>>> call it the "proper" way:
>>> 
>>> Rather than writing files as side-effects, the argument is to just
>>> output the data with the standard hadoop mechanism.  In streaming, this
>>> means through stdout.
>>>...
>>> But I actually think this is not viable for us,
>>> because we're writing images which are binary.
>>>...
>>> If we go that way, then we're basically packing many files into one.
>>> It seems to me cleanest, if one wants to do that, to use some
>>> existing format, like tar or zip or cpio, or maybe the hadoop multi-file
>>> format.  But this way seems fraught with peril, since we have to fight
>>> streaming and custom record output, and then still extract the files
>>> after output completes anyway.  Lots and lots of work---it feels like
>>> this can't be right.
>>> 
>>> (Another hacky way to make this work in streaming is to convert binary
>>> to ASCII, like base-64-encoding the files.  Been there in SQL.  Done that.
>>> Don't want to do it again.  It still has all the encoding and
>>> post-processing junk. :-)

BUT...I'm suggesting that should not be the ONLY viable way.

Two reasons:  first, yes, serialization can work.  But you've put a lot
of layers of junk in the way, all of which have to be done and undone.
This can easily become a lot of code, and it can easily eat into any
performance advantage.

On the other hand, if Hadoop would just send a signal to a reducer it is
about to abort, rather than only closing its stdin/stdout, then a few
lines of signal-capture code and a few more to unlink the temp file do
everything on the reducer side.  Add a few lines of signal-sending code
in Hadoop streaming and a few more on the commit side, and you end up
with about 50 lines of code.  Compare that to serialization, which means
hundreds of lines of stubs or large libraries to handle something like
tar or zip, plus potential storage overhead (if you convert to base-64
or something), and more storage overhead because you have to store (at
least temporarily) both serialized and unserialized versions.
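
To make that concrete, here is roughly what I have in mind on the
reducer side.  This is only a sketch in C (the file names are made up,
and it assumes streaming would deliver SIGTERM before closing our
pipes, which is exactly the change I'm asking for):

/* Sketch only: a streaming reducer that writes a binary side-effect
 * file and cleans it up if it is aborted with SIGTERM.  File names
 * are made up for illustration. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static char tmp_path[256];              /* side-effect file being written */

static void on_abort(int sig)
{
    (void)sig;
    unlink(tmp_path);                   /* discard the partial file */
    _exit(1);
}

int main(void)
{
    snprintf(tmp_path, sizeof(tmp_path), "image-part.%d.tmp", (int)getpid());
    signal(SIGTERM, on_abort);          /* the signal I'd like streaming to send */
    signal(SIGINT, on_abort);

    FILE *out = fopen(tmp_path, "wb");
    if (out == NULL)
        return 1;

    /* normal reduce loop: read records from stdin, write image data */
    char line[4096];
    while (fgets(line, sizeof(line), stdin) != NULL)
        fwrite(line, 1, strlen(line), out);  /* stand-in for real image output */

    fclose(out);

    /* "commit": only a completed run gives the file its final name */
    char final_path[256];
    snprintf(final_path, sizeof(final_path), "image-part.%d.png", (int)getpid());
    rename(tmp_path, final_path);
    return 0;
}

That's the whole cleanup story on our side; the only missing piece is
the signal itself.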

Second, the Google folks found side-effects useful enough that they
support them, documented them in Dean and Ghemawat, and seem to use them
internally.  Perhaps Hadoop should consider the costs of supporting
side-effects before discarding them?


Going back to part of Ted's comment and his performance objection:

On Tue, 15 Jan 2008 12:56:12 PST, Ted Dunning wrote: 
>Also, this gives you a solution to your race condition (by using hadoop's
>mechanisms) and it also gives you much higher
>throughput/reliability/scalability than writing to NFS can possibly give
>you.

About the throughput issue: if you don't want to write to NFS (we can
at our current cluster size, but I know others are luckier than us :-),
you can just write the side-effect files into HDFS and get all the
throughput/reliability/scalability you would get with Hadoop's standard
mechanisms.
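
For example (a sketch with made-up paths and no error handling), the
reducer can write the image locally and then, as its commit step, push
the finished file into HDFS with the command-line client:

/* Sketch only: "commit" a finished side-effect file by shelling out to
 * the HDFS command-line client.  Paths here are made up. */
#include <stdio.h>
#include <stdlib.h>

/* local:    finished temp file on local disk
 * hdfs_dir: e.g. "/user/johnh/side-effects" (hypothetical)
 * returns the exit status of the copy */
static int commit_to_hdfs(const char *local, const char *hdfs_dir)
{
    char cmd[1024];
    snprintf(cmd, sizeof(cmd), "hadoop dfs -put %s %s/", local, hdfs_dir);
    return system(cmd);                 /* real code should quote and check args */
}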


To try to clarify what I'm hearing, though, I think the answer to my
question:

>So what do the Hadoop architects think about side-effects and recovering
>from half-run jobs?  Does hadoop intend to support side-effects (for
>interested users, obviously not as standard practice)?  If we were in
>Java would we get a signal we could use to do cleanup?

is that Hadoop does NOT currently support side-effects, because people
didn't really consider them.

And there's some push-back against side-effects as being not very
clean.  (Which I agree with to first order, but not strongly enough
that I think they should be forbidden.)

Are folks so anti-side-effect that if we submit the 10-line
signal-sending patch to streaming it will be given a -1?  (Footnote:
it's a 10-line C patch; I have to confirm what it looks like in Java.)
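
(For the curious, the mechanism on the sending side is just the usual
kill-then-wait pattern.  Here is a C sketch of the idea, with made-up
names; the actual streaming code is Java, so the real patch will look
different:)

/* Sketch of the sending side: when aborting a streaming child, signal
 * it and allow a short grace period before giving up on it.  The pid
 * and timing here are placeholders. */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void abort_child(pid_t child_pid)
{
    kill(child_pid, SIGTERM);       /* let the reducer unlink its temp files */
    sleep(1);                       /* grace period; tune as needed */
    if (waitpid(child_pid, NULL, WNOHANG) == 0)
        kill(child_pid, SIGKILL);   /* still running: force it down */
}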

   -John Heidemann
