Ah; you're right, of course. Sorry about that. -C
On Jun 3, 2008, at 12:00 PM, Runping Qi wrote:
Chris,
Your version will use LongWritable as the map output key type, which
changes the nature of the job completely. You should use
${hadoop} jar hadoop-0.17-examples.jar sort -m <num maps> \
-r 88 \
-inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
-outFormat org.apache.hadoop.mapred.lib.NullOutputFormat \
-outKey org.apache.hadoop.io.Text \
-outValue org.apache.hadoop.io.Text \
<input dir> <output dir (ignored)>
instead.
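To illustrate why the key type matters, here is a minimal plain-Java sketch of how the two input formats derive keys. It does not use the Hadoop classes themselves; the class and method names (InputFormatKeys, textInputFormat, keyValueTextInputFormat) are hypothetical stand-ins. It assumes the documented behaviour: TextInputFormat keys each line by its byte offset, while KeyValueTextInputFormat splits each line at the first tab.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not Hadoop code: mimics how each input format keys a record.
class InputFormatKeys {

    // Like TextInputFormat: key = byte offset of the line start, value = whole line.
    static List<Map.Entry<Long, String>> textInputFormat(String data) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : data.split("\n")) {
            records.add(new AbstractMap.SimpleEntry<>(offset, line));
            // Advance past the line's bytes plus the newline terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // Like KeyValueTextInputFormat: key = text before the first tab, value = the rest.
    // A line without a tab becomes the key, with an empty value.
    static List<Map.Entry<String, String>> keyValueTextInputFormat(String data) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        for (String line : data.split("\n")) {
            int tab = line.indexOf('\t');
            String key = tab < 0 ? line : line.substring(0, tab);
            String value = tab < 0 ? "" : line.substring(tab + 1);
            records.add(new AbstractMap.SimpleEntry<>(key, value));
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "banana\t1\napple\t2\n";
        System.out.println(textInputFormat(data));         // keys are offsets
        System.out.println(keyValueTextInputFormat(data)); // keys are the text fields
    }
}
```

With offsets as keys, the map output within a file arrives already in ascending key order, so the sort is presumably exercised very differently than with arbitrary Text keys.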
Runping
-----Original Message-----
From: Chris Douglas [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 03, 2008 11:35 AM
To: [email protected]
Subject: Re: Stackoverflow
By "not exactly small," do you mean each line is long or that there
are many records?
Well, not small in the sense that, even if I could get my boss to
allow me to give you the data, transferring it might be painful.
(E.g. the job that aborted had about 12M lines with ~2.6GB of data =>
the lines are not really long, but longer than 80 chars)
Ah, I see. Would it be possible to run the Java sort example over
your data? It would be helpful to verify that this is not specific to
streaming.
${hadoop} jar hadoop-0.17-examples.jar sort -m <num maps> \
-r 88 \
-inFormat org.apache.hadoop.mapred.TextInputFormat \
-outFormat org.apache.hadoop.mapred.lib.NullOutputFormat \
-outKey org.apache.hadoop.io.LongWritable \
-outValue org.apache.hadoop.io.Text \
<input dir> <output dir (ignored)>
This should be close to streaming with cat as the mapper.
util.QuickSort is only used on the map side, so this shouldn't have
anything to do with the reduce. Is it always and only the *last* map
task that fails?
Nope, although sometimes it happens earlier.
Is it always the same splits when you re-run your job? Though
distributing the full dataset may not be feasible, if there are
splits that fail consistently then we might be able to work from
that.
If I sent you a patch that would print a trace with
the partitions, would you mind running it? Do you have any other
settings that differ from the defaults? -C
If you tell me how to apply it, I'm happy to. (I'm not the biggest
Java hotshot on this planet; I'm just using the provided 0.17.0 jars.
I guess I would have to patch the source and run ant. On all nodes or
just the control?)
Unfortunately, it would need to be deployed to all the TaskTrackers,
and it would be pretty invasive (i.e. I was planning on logging all
the offsets from the sort as the stack unwinds from the exception).
I'll test something and send it to you, and if it's not too much
trouble you can try it.
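As a rough illustration of the idea of logging offsets while the stack unwinds (this is not the actual patch, and it does not reproduce Hadoop's util.QuickSort), here is a minimal sketch. Each recursive frame catches the failure, records its own [lo, hi] range, and rethrows. As an assumption made for the sake of a deterministic example, a hypothetical depth limit (DepthExceeded) stands in for a real StackOverflowError, and a naive last-element pivot is used so that sorted input recurses deeply.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Hadoop patch: a quicksort that records the range
// handled by every frame on the call stack as a failure propagates upward.
class SortTrace {

    // Stand-in for StackOverflowError; carries the ranges collected while unwinding.
    static class DepthExceeded extends RuntimeException {
        final List<int[]> ranges = new ArrayList<>();
    }

    static void quickSort(int[] a, int lo, int hi, int depth, int maxDepth) {
        if (lo >= hi) {
            return;
        }
        if (depth > maxDepth) {
            throw new DepthExceeded(); // simulated overflow
        }
        try {
            int p = partition(a, lo, hi);
            quickSort(a, lo, p - 1, depth + 1, maxDepth);
            quickSort(a, p + 1, hi, depth + 1, maxDepth);
        } catch (DepthExceeded e) {
            // Record this frame's offsets on the way out, innermost frame first.
            e.ranges.add(new int[] { lo, hi });
            throw e;
        }
    }

    // Lomuto partition with the last element as pivot (degenerates on sorted input).
    static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) {
                int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++;
            }
        }
        int tmp = a[i]; a[i] = a[hi]; a[hi] = tmp;
        return i;
    }

    public static void main(String[] args) {
        int[] sorted = new int[100];
        for (int i = 0; i < sorted.length; i++) {
            sorted[i] = i;
        }
        try {
            quickSort(sorted, 0, sorted.length - 1, 0, 10);
        } catch (DepthExceeded e) {
            for (int[] r : e.ranges) {
                System.out.println("unwinding frame: lo=" + r[0] + " hi=" + r[1]);
            }
        }
    }
}
```

With a genuine StackOverflowError the handler would have very little stack left to work with, so a real patch would likely need to keep the per-frame logging as cheap as possible.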
My hadoop-site.xml:
[snip]
Nothing suspect there. -C