MapFile.Reader.getClosest() seeks to behind keys

Markus.Breuer Wed, 26 Jul 2017 07:28:24 -0700

I am using MapFile.Reader from hadoop-common 2.6.5 (which is packaged with 
Apache Spark 2.2.0). Methods like seek() or getClosest() return keys that are 
very different from, and much larger than, the key I searched for.


After a long debugging session I found the code that is responsible for my 
problem:

    private synchronized int seekInternal(WritableComparable key,
        final boolean before)
      throws IOException {
      readIndex();                                // make sure index is read

      if (seekIndex != -1                         // seeked before
          && seekIndex+1 < count
          && comparator.compare(key, keys[seekIndex+1])<0 // before next indexed
          && comparator.compare(key, nextKey)
          >= 0) {                                 // but after last seeked
        // do nothing
      } else {
        seekIndex = binarySearch(key);
        if (seekIndex < 0)                        // decode insertion point
          seekIndex = -seekIndex-2;

        if (seekIndex == -1)                      // belongs before first entry
          seekPosition = firstPosition;           // use beginning of file
        else
          seekPosition = positions[seekIndex];    // else use index
      }
      data.seek(seekPosition);

readIndex() builds an in-memory index from the contents of the index file. With 
my example data I see about 300 entries, each with a position, but only 3 
distinct positions: roughly 300k, 600k and 900k. Because these positions are so 
high, I assume the index stores references to the second through the last block 
of the underlying sequence file. firstPosition, in contrast, is 203, which is a 
position at the very beginning of the data file.

seekIndex is always -1 at this point, so the else block is executed. 
binarySearch() appears to be a standard binary search and returns an offset 
into the in-memory index (from readIndex()). In my example I am searching for a 
key between the very first and the second key, and binarySearch() returns -4. 
In all my tests seekPosition is taken from the positions[] array and 
firstPosition is never used. As a result the requested key is not found.
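
To make the decoding step concrete, here is a minimal, self-contained sketch of 
the else branch quoted above. It assumes that binarySearch() follows the same 
convention as java.util.Arrays.binarySearch, returning -(insertion point) - 1 
for a missing key. The class name, the sample keys and the sample positions are 
made up for illustration; only 203 and the 300k/600k/900k offsets echo my data.

```java
import java.util.Arrays;

// Hypothetical stand-in for the index lookup in seekInternal(); not Hadoop code.
final class SparseIndexSketch {
    // Assumed offset of the first record in the data file (203 in my example).
    static final long FIRST_POSITION = 203L;

    /** Decodes a binary-search result the same way the quoted else branch does. */
    static int decode(int searchResult) {
        // A negative result encodes the insertion point as -(insertionPoint) - 1,
        // so -result - 2 is the index of the last entry that sorts before the key.
        return searchResult < 0 ? -searchResult - 2 : searchResult;
    }

    /** Returns the data-file offset from which a scan for the key would start. */
    static long seekPosition(String[] keys, long[] positions, String key) {
        int seekIndex = decode(Arrays.binarySearch(keys, key));
        // -1 means the key belongs before the first index entry.
        return seekIndex == -1 ? FIRST_POSITION : positions[seekIndex];
    }

    public static void main(String[] args) {
        // Made-up sparse index: one key per indexed record, sorted ascending.
        String[] keys = {"b", "d", "f", "h"};
        long[] positions = {203L, 300_000L, 600_000L, 900_000L};

        System.out.println(decode(-1));                         // -1: before first entry
        System.out.println(decode(-4));                         // 2: after three entries
        System.out.println(seekPosition(keys, positions, "a")); // 203
        System.out.println(seekPosition(keys, positions, "g")); // 600000
    }
}
```

Under that assumption a return value of -4 decodes to seekIndex 2, meaning the 
searched key was sorted after three index entries by the comparator. 
firstPosition is only used when binarySearch() returns -1.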

While debugging I set seekPosition = firstPosition by hand, and the correct key 
was found. I have worked with several other MapFiles and never had such issues. 
Does anyone have an idea what is wrong here?


-    I rebuilt the index file with the fix() method (the files are identical).

-    I wrote all keys to a text file. The entries are in the correct order and 
look fine.

-    I checked the configuration settings, but there seems to be no setting 
that affects MapFiles in this way. All settings are at their system defaults.

-    Tests with other keys show the same effect: the closest key returned is 
always larger than the requested one, i.e. it lies behind it.
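
Regarding the order check: a binary search only works if the keys are ascending 
under the same comparator the reader uses, and that order can differ from what 
looks sorted in a text file. The following is a small sketch of such a check. 
The class and method names are hypothetical, and the two comparators only 
illustrate the idea; they are not the comparator of my actual key class.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Hadoop.
final class KeyOrderCheck {
    /**
     * Returns the index of the first key that is not strictly greater than
     * its predecessor under the given comparator, or -1 if all are ascending.
     */
    static <T> int firstOutOfOrder(List<T> keys, Comparator<? super T> comparator) {
        for (int i = 1; i < keys.size(); i++) {
            if (comparator.compare(keys.get(i - 1), keys.get(i)) >= 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("9", "10", "11");

        // Lexical string order: "9" sorts after "10".
        Comparator<String> lexical = Comparator.naturalOrder();
        // Numeric order over the same strings.
        Comparator<String> numeric =
            (a, b) -> Integer.compare(Integer.parseInt(a), Integer.parseInt(b));

        System.out.println(firstOutOfOrder(keys, lexical)); // 1
        System.out.println(firstOutOfOrder(keys, numeric)); // -1
    }
}
```

The same list passes under one comparator and fails under the other, so the 
check is only meaningful with the comparator that was used to write the file.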

Any ideas?
