OK, continuing our earlier conversation...

I have a job that schedules 100 map tasks (a small number, just for testing), 
passing data via a set of 100 sequence files. This is based on the PiEstimator 
example that ships with the distribution.

The data consist of a blob of serialised state, around 20MB in size. I have 
added various checks, including checksums, to reduce the risk of data 
corruption or misalignment.
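
For illustration only, a checksum check of this kind might look like the sketch below. It is plain Java with no Hadoop dependencies. The choice of CRC32 and the names BlobChecksum, crc32 and matches are assumptions, since the post does not say which checksum is used. The explicit length argument is there so that only the payload bytes are checksummed if the buffer is larger than the payload.

```java
import java.util.zip.CRC32;

// Hypothetical helper: CRC32 is an assumption, not the checksum named in the post.
final class BlobChecksum {

    // Checksum only the first `length` bytes of the buffer.
    static long crc32(byte[] buffer, int length) {
        CRC32 crc = new CRC32();
        crc.update(buffer, 0, length);
        return crc.getValue();
    }

    // True if the payload still matches the checksum recorded before serialisation.
    static boolean matches(byte[] buffer, int length, long expected) {
        return crc32(buffer, length) == expected;
    }
}
```

Computing the checksum once before writing the sequence file and again inside the mapper would show whether the blob changed in transit.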

The mapper takes the blob of data as its value input and an integer in the 
range 0-99 as its key (passed as a LongWritable).

Each mapper then does some processing, based upon the deserialised contents of 
the blob and the integer key value (0-99).

The reducer then selects the minimum value that was produced across all of the 
mappers.

Unfortunately, this process generates an incorrect value when compared to 
a simple iterative solution.
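
To make the comparison concrete, here is a minimal sketch in plain Java with no Hadoop dependencies. The class name MinBaseline, the use of long for the mapper output, and the compute function are all assumptions standing in for the real per-key processing, which the post does not describe.

```java
import java.util.function.LongUnaryOperator;

// Hypothetical sketch: `compute` stands in for the real per-key processing.
final class MinBaseline {

    // Reducer-style step: the smallest value among all mapper outputs.
    static long reduceMin(Iterable<Long> mapperOutputs) {
        long min = Long.MAX_VALUE;
        for (long value : mapperOutputs) {
            min = Math.min(min, value);
        }
        return min;
    }

    // Iterative baseline: run the same processing for keys 0..keyCount-1 in one loop.
    static long iterativeMin(int keyCount, LongUnaryOperator compute) {
        long min = Long.MAX_VALUE;
        for (long key = 0; key < keyCount; key++) {
            min = Math.min(min, compute.applyAsLong(key));
        }
        return min;
    }
}
```

If the mappers run the same processing on the same blob, reduceMin over their outputs should equal iterativeMin(100, compute).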

After inspecting the results, it seems that the mappers are generating correct 
values for even-numbered keys, but incorrect values for odd-numbered keys. I 
am logging the values of the keys, so I am confident that these are correct. 
My serialisation checks also make me confident that the ‘value’ blobs are not 
getting corrupted, so it’s all something of a mystery.
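
A per-key comparison along these lines is one way to confirm the even/odd pattern. This is a sketch in plain Java. KeyParityCheck, mismatchedKeys and allOdd are hypothetical names, the baseline function stands in for the iterative solution, and long outputs are an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.LongUnaryOperator;

// Hypothetical diagnostic: compares each mapper output with the iterative result for that key.
final class KeyParityCheck {

    // Returns, in ascending order, the keys whose mapper output differs from the baseline.
    static List<Long> mismatchedKeys(Map<Long, Long> mapperOutputs, LongUnaryOperator baseline) {
        List<Long> mismatched = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : mapperOutputs.entrySet()) {
            long key = entry.getKey();
            if (entry.getValue() != baseline.applyAsLong(key)) {
                mismatched.add(key);
            }
        }
        Collections.sort(mismatched);
        return mismatched;
    }

    // True if every key in the list is odd (also true for an empty list).
    static boolean allOdd(List<Long> keys) {
        for (long key : keys) {
            if (key % 2 == 0) {
                return false;
            }
        }
        return true;
    }
}
```

Feeding it the logged (key, result) pairs from the mappers would show whether every mismatch falls on an odd key or whether the pattern is only approximate.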

Harsh J: Previously, you indicated that this might be a “...key/val data issue… 
...Perhaps bad partitioning/grouping is happening as a result of that”. I 
apologise for the lack of detail, but do you think this still might be the 
case? If so, could you refer me to some place that gives more detail on this 
type of issue?

With apologies for continuing to be a nuisance :-(

Andy D
