Johannes Herr created HADOOP-10921:
--------------------------------------

             Summary: MapFile.fix fails silently when file is block compressed
                 Key: HADOOP-10921
                 URL: https://issues.apache.org/jira/browse/HADOOP-10921
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 0.20.2
            Reporter: Johannes Herr


MapFile provides a method 'fix' to reconstruct missing 'index' files. If the 
'data' file is block compressed, the method computes offsets that are too 
large, so keys are later not found in the mapfile. (See the attached test 
case.)

Tested against 0.20.2 but the trunk version looks like it has the same problem.

The cause of the problem is that 'dataReader.getPosition()' is used to find 
the offset to write for the next entry that should be indexed. When the file 
is block compressed, however, 'dataReader.getPosition()' seems to return the 
position of the next compressed block, not of the block that contains the last 
entry. This position will thus be too large in most cases, and a seek 
operation with this offset will incorrectly report the key as not present.

I think it's not obvious how to fix this, since SequenceFile.Reader does not 
provide the offset of the currently buffered entries. I've experimented with 
watching the offset change, and that seems to work mostly, but it is quite 
ugly and not exact in edge cases.
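
To make the overshoot concrete, here is a minimal, self-contained sketch. It 
is a toy model, not the Hadoop classes: the block layout, the offsets, the 
index interval of 2, and every name in it (BlockIndexModel, naiveIndex, 
watchingIndex, contains, found) are assumptions made up for illustration. 
naiveIndex records the position seen after the previous read, which is how 
the problem is described above. watchingIndex is one possible reading of 
"watching the offset change": it remembers the position before the last 
change and treats that as the start of the current block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a block-compressed 'data' file. Not the Hadoop classes. */
final class BlockIndexModel {
    // Three blocks of three sorted keys each; block i occupies the bytes
    // [STARTS[i], STARTS[i + 1]). All numbers are made up.
    static final long[] STARTS = {100, 200, 300, 400};
    static final String[][] BLOCKS = {{"a", "b", "c"}, {"d", "e", "f"}, {"g", "h", "i"}};

    /** Mimics a reader that buffers a whole block on the first read from it. */
    static final class Reader {
        private long pos = STARTS[0];
        private final List<String> buffered = new ArrayList<>();

        long getPosition() { return pos; }

        void seek(long offset) { pos = offset; buffered.clear(); }

        String next() {
            if (buffered.isEmpty()) {
                int b = Arrays.binarySearch(STARTS, pos);
                if (b < 0 || b >= BLOCKS.length) return null; // no block starts here
                buffered.addAll(Arrays.asList(BLOCKS[b]));
                pos = STARTS[b + 1]; // the stream now sits at the NEXT block
            }
            return buffered.remove(0);
        }
    }

    static final class Entry {
        final String key;
        final long offset;
        Entry(String key, long offset) { this.key = key; this.offset = offset; }
    }

    /** Indexes every interval-th key with the position seen after the previous read. */
    static List<Entry> naiveIndex(int interval) {
        Reader r = new Reader();
        List<Entry> index = new ArrayList<>();
        long pos = r.getPosition();
        int cnt = 0;
        for (String k = r.next(); k != null; k = r.next()) {
            if (++cnt % interval == 0) index.add(new Entry(k, pos));
            pos = r.getPosition();
        }
        return index;
    }

    /** Indexes with the position seen before the last change (the block start). */
    static List<Entry> watchingIndex(int interval) {
        Reader r = new Reader();
        List<Entry> index = new ArrayList<>();
        long prev = r.getPosition();
        long blockStart = prev;
        int cnt = 0;
        for (String k = r.next(); k != null; k = r.next()) {
            long now = r.getPosition();
            if (now != prev) { blockStart = prev; prev = now; }
            if (++cnt % interval == 0) index.add(new Entry(k, blockStart));
        }
        return index;
    }

    /** Seeks to the last index entry at or before the key, then scans forward. */
    static boolean contains(List<Entry> index, String key) {
        long offset = STARTS[0];
        for (Entry e : index) {
            if (e.key.compareTo(key) <= 0) offset = e.offset;
        }
        Reader r = new Reader();
        r.seek(offset);
        for (String k = r.next(); k != null && k.compareTo(key) <= 0; k = r.next()) {
            if (k.equals(key)) return true;
        }
        return false;
    }

    /** Counts how many of the stored keys a lookup through the index finds. */
    static int found(List<Entry> index) {
        int n = 0;
        for (String[] block : BLOCKS) {
            for (String k : block) if (contains(index, k)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("naive:    " + found(naiveIndex(2)) + " of 9 keys found");
        System.out.println("watching: " + found(watchingIndex(2)) + " of 9 keys found");
    }
}
```

In this model the naive index points 'b' at offset 200 although 'b' lives in 
the block starting at 100, so lookups for 'b', 'c', 'f', 'h' and 'i' fail. 
Only keys that are found without an overshooting index entry ('a', 'd', 'e', 
'g') succeed. The offset-watching variant finds all nine keys here, but the 
model does not cover the edge cases mentioned above.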

The method should probably throw an exception when the 'data' file is block 
compressed instead of silently creating invalid files. A workaround for block 
compressed files is to read the sequence file and write the entries to a new 
mapfile and then replace the old file. This also avoids the problems mentioned 
below.

A few side notes: 

1. The 'index' files created by the fix-method are not block compressed (which 
the 'index' files created by MapFile Writer always are, since the 'index' file 
is read completely anyway).

2. The fix method does not index the first entry; the Writer does.

3. The header offset is not used.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
