Chris,
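Your version uses LongWritable as the map output key type, which
changes the nature of the job completely: with TextInputFormat the keys
are byte offsets, so the sort compares numbers rather than the text of
each line. You should use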
${hadoop} jar hadoop-0.17-examples.jar sort -m <num maps> \
> -r 88 \
> -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
> -outFormat org.apache.hadoop.mapred.lib.NullOutputFormat \
> -outKey org.apache.hadoop.io.Text \
> -outValue org.apache.hadoop.io.Text \
> <input dir> <output dir (ignored)>
instead.
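
In case it helps, here is a rough sketch of the difference in plain
shell rather than Hadoop. It only mimics how the two input formats key
each line; the function names (offset_keyed, text_keyed, sample) are
made up for illustration, and it assumes tab-separated ASCII input.

```shell
#!/bin/sh
# Sketch only: mimics how the two input formats key each input line.
# The function names are hypothetical; this is not Hadoop code.

# TextInputFormat style: key = byte offset of the line, value = whole line.
offset_keyed() {
  LC_ALL=C awk '{ printf "%d\t%s\n", off, $0; off += length($0) + 1 }'
}

# KeyValueTextInputFormat style: key = text before the first tab,
# value = the rest of the line (empty when the line has no tab).
text_keyed() {
  LC_ALL=C awk '{
    i = index($0, "\t")
    if (i == 0) printf "%s\t\n", $0
    else        printf "%s\t%s\n", substr($0, 1, i - 1), substr($0, i + 1)
  }'
}

# Tiny made-up input standing in for one file of records.
sample() { printf 'pear\t3\napple\t7\nfig\t1\n'; }

# Sorting by numeric offset leaves this sample in its original order.
sample | offset_keyed | LC_ALL=C sort -n -k1,1

# Sorting by the text key actually reorders the records.
sample | text_keyed | LC_ALL=C sort -t "$(printf '\t')" -k1,1
```

With offsets as keys the sort has nothing to reorder in this sample,
whereas with Text keys it has to compare the line contents, which, as
far as I can tell, is closer to what streaming does with cat as the
mapper.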
Runping
> -----Original Message-----
> From: Chris Douglas [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, June 03, 2008 11:35 AM
> To: [email protected]
> Subject: Re: Stackoverflow
>
> >> By "not exactly small," do you mean each line is long or that there
> >> are many records?
> >
> > Well, not small in the sense that, even if I could get my boss to
> > allow me to give you the data, transferring it might be painful.
> > (E.g. the job that aborted had about 12M lines with ~2.6GB of data,
> > so the lines are not really long, but longer than 80 chars.)
>
> Ah, I see. Would it be possible to run the Java sort example over
> your data? It would be helpful to verify that this is not specific to
> streaming.
>
> ${hadoop} jar hadoop-0.17-examples.jar sort -m <num maps> \
> -r 88 \
> -inFormat org.apache.hadoop.mapred.TextInputFormat \
> -outFormat org.apache.hadoop.mapred.lib.NullOutputFormat \
> -outKey org.apache.hadoop.io.LongWritable \
> -outValue org.apache.hadoop.io.Text \
> <input dir> <output dir (ignored)>
>
> This should be close to streaming with cat as the mapper.
>
> >> util.QuickSort is only used on the map side, so this shouldn't have
> >> anything to do with the reduce. Is it always and only the *last* map
> >
> > Nope, although sometimes it happens earlier.
>
> Is it always the same splits when you re-run your job? Though
> distributing the full dataset may not be feasible, if there are
> splits that fail consistently then we might be able to work from that.
>
> >> task that fails? If I sent you a patch that would print a trace with
> >> the partitions, would you mind running it? Do you have any other
> >> settings that differ from the defaults? -C
> >
> > If you tell me how to apply it, I'm happy to. (I'm not the biggest
> > Java hotshot on this planet; I'm just using the provided 0.17.0 jars.
> > I guess I would have to patch the source and run ant. On all nodes
> > or just the control?)
>
> Unfortunately, it would need to be deployed to all the TaskTrackers,
> and it would be pretty invasive (i.e. I was planning on logging all
> the offsets from the sort as the stack unwinds from the exception).
> I'll test something and send it to you, and if it's not too much
> trouble you can try it.
>
> > My hadoop-site.xml:
> > [snip]
>
> Nothing suspect there. -C