On Tuesday 03 June 2008 08:35:10 Chris Douglas wrote:
> > I have no Java implementation of my job, sorry.
>
> Since it's all in the map side, IdentityMapper/IdentityReducer is
> fine, as long as both the splits and the number of reduce tasks are
> the same.
>
> > The data is a representation for loglines, and not exactly small,
> > e.g. the
> > stuff has already been reduced once.
>
> By "not exactly small," do you mean each line is long or that there
> are many records?
Well, not small in the sense that even if I could get my boss to allow me to
give you the data, transferring it might be painful. (E.g. the job that
aborted had about 12M lines with ~2.6GB of data => the lines are not really
long, but longer than 80 chars.)
The expected result was that around 8-10M lines would be output by the
reduce task. (The lines are of two different types: one type means that all
key/values but the first one can be dropped, and the second is the more
classical type where all values need to be added up.)
Because the stuff has already been reduced in big chunks, I'd only expect a
~20% reduction. Still, that's useful, considering that each of these lines
turns into at least one SQL statement after it leaves the hadoop cluster.
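To make the two line types concrete: the sketch below is just plain Java, not my actual streaming job — the type markers ("F"/"S"), the field layout, and the class name are all made up for illustration. It only shows the per-key aggregation I described above.

```java
import java.util.Arrays;
import java.util.List;

public class LogReduce {
    // Reduce all values seen for one key.
    // Type "F": all key/values but the first can be dropped.
    // Type "S": the classical case -- all values are added up.
    // (Type markers are hypothetical placeholders.)
    static long reduce(String type, List<Long> values) {
        if (type.equals("F")) {
            return values.get(0);           // keep only the first value
        }
        long sum = 0;
        for (long v : values) {
            sum += v;                       // classical additive reduce
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Long> values = Arrays.asList(7L, 3L, 9L);
        System.out.println(reduce("F", values));  // 7
        System.out.println(reduce("S", values));  // 19
    }
}
```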
>
> > The interesting thing is that it happens inside the last Map task,
> > not in the
> > reducer tasks.
> > As you can see above the mapper cmd is rather on the simple side.
>
> util.QuickSort is only used on the map side, so this shouldn't have
> anything to do with the reduce. Is it always and only the *last* map
Nope, although sometimes it happens earlier.
> task that fails? If I sent you a patch that would print a trace with
> the partitions, would you mind running it? Do you have any other
> settings that differ from the defaults? -C
If you tell me how to apply it, I'm happy to. (I'm not the biggest Java
hotshot on this planet; I'm just using the provided 0.17.0 jars. I guess I
would have to patch the source and run ant. On all nodes or just the
control node?)
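For my own notes, I'd guess the steps look roughly like the following — the patch file name is a placeholder, and the -p strip level depends on how the patch was generated:

```shell
# Unpack the 0.17.0 source release and apply the patch to it.
cd hadoop-0.17.0
patch -p0 < sort-trace.patch      # placeholder name; try -p1 if -p0 fails

# Rebuild the core jar with ant (build.xml ships with the source).
ant jar

# The rebuilt jar would then replace the stock hadoop core jar on the
# nodes -- since the trace is on the map side, presumably every node
# that runs map tasks needs it, not just the control node.
```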
And no, it's mostly untuned from the default hadoop config; paths and network
addresses have been configured, everything else is left as is.
OTOH, I would have to get enough data into my work queue to have a big
enough chunk to reproduce it, I guess. Then again, that's not so bad: I still
have over 1TB of logfiles for May to process, so I would just need to take
the brakes off hadoop to produce the data needed.
Thanks,
Andreas
My hadoop-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://ec2-67-202-58-97.compute-1.amazonaws.com:9000</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>XXXXXX</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>XXXX</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>ec2-67-202-58-97.compute-1.amazonaws.com:9001</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/mnt/tmp/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>dfs.name.dir</name><value>/mnt/hadoop/namedir</value>
</property>
<property>
<name>dfs.data.dir</name><value>/mnt/hadoop/datadir</value>
</property>
<property>
<name>mapred.local.dir</name><value>/mnt/hadoop/mrlocal</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>