I am using MapFile.Reader from hadoop-common 2.6.5 (which is packaged with 
Apache Spark 2.2.0). Methods like seek() and getClosest() return keys that are 
very different from, and much larger than, the key I searched for.

After a long debugging session I found the code that seems to be responsible 
for my problem:

    private synchronized int seekInternal(WritableComparable key,
        final boolean before)
      throws IOException {
      readIndex();                                // make sure index is read

      if (seekIndex != -1                         // seeked before
          && seekIndex+1 < count
          && comparator.compare(key, keys[seekIndex+1])<0 // before next indexed
          && comparator.compare(key, nextKey)
          >= 0) {                                 // but after last seeked
        // do nothing
      } else {
        seekIndex = binarySearch(key);
        if (seekIndex < 0)                        // decode insertion point
          seekIndex = -seekIndex-2;

        if (seekIndex == -1)                      // belongs before first entry
          seekPosition = firstPosition;           // use beginning of file
        else
          seekPosition = positions[seekIndex];    // else use index
      }
      data.seek(seekPosition);

readIndex() builds an in-memory index from the contents of the index file. With 
my example data, I see about 300 entries with positions. There are only 3 
distinct positions, at roughly 300k, 600k and 900k. Because of these high 
positions I assume the index stores references to the second through the last 
block of the underlying sequence file. firstPosition is 203, which is a 
position at the very beginning of the data file.

Variable seekIndex is always -1 on entry, so the else-block is executed. 
Method binarySearch() appears to be an ordinary binary search and returns an 
offset into the in-memory index (from readIndex()). In my example I am 
searching for a key between the very first and the second key, and 
binarySearch() returns -4. In all my tests seekPosition is taken from the 
positions[] array and firstPosition is never used. As a result the requested 
key is not found.
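
To make the decoding step concrete, here is a minimal sketch of the else-branch in plain Java. It is not the Hadoop code: java.util.Arrays.binarySearch stands in for MapFile's private binarySearch(), on the assumption that both encode a missing key as -(insertion point) - 1, which is what the "-seekIndex-2" line suggests. The class name, the long keys and the index contents are hypothetical and only loosely modelled on my data.

```java
import java.util.Arrays;

public class SeekPositionSketch {
    // Hypothetical index contents; the real keys[] and positions[] are
    // filled by readIndex() from the index file.
    static final long[] KEYS = {200, 300, 400};
    static final long[] POSITIONS = {300_000, 600_000, 900_000};
    static final long FIRST_POSITION = 203;

    // Same decoding as in seekInternal(): a negative search result becomes
    // the index of the last entry that sorts before the key (-1 if none).
    static int decode(int raw) {
        return raw < 0 ? -raw - 2 : raw;
    }

    // Chooses the position in the data file from which scanning would start.
    static long seekPositionFor(long key) {
        int seekIndex = decode(Arrays.binarySearch(KEYS, key));
        return seekIndex == -1 ? FIRST_POSITION : POSITIONS[seekIndex];
    }

    public static void main(String[] args) {
        System.out.println(decode(-4));            // 2
        System.out.println(seekPositionFor(150));  // 203 (before first index entry)
        System.out.println(seekPositionFor(250));  // 300000
        System.out.println(seekPositionFor(450));  // 900000
    }
}
```

Under that assumption, a raw result of -4 decodes to seekIndex 2, i.e. the key is treated as sorting after the first three index entries, and firstPosition is only used when the raw result is -1.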

While debugging I set seekPosition = firstPosition by hand, and then the 
correct key was found. I have worked with several other MapFiles and never had 
such issues. Does anyone have an idea what's wrong here?


What I have tried so far:

-    I rebuilt the index file with the fix() method (the resulting file is 
identical to the original)

-    Wrote all keys to a text file. The entries are in the correct order and 
look fine.

-    Checked the configuration settings, but there seems to be no setting that 
affects MapFiles in this way. All settings are at their system defaults.

-    Tests with other keys show the same effect: the closest key returned is 
always larger than the requested one. The returned keys lie behind the 
requested key in the file.
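
For reference, the result I would expect can be modelled with a TreeMap. This is only a sketch, assuming getClosest(key, val) returns the key itself or the first entry after it, and getClosest(key, val, true) returns the key itself or the last entry before it, as I read the Javadoc. The class name and the sample keys are hypothetical.

```java
import java.util.TreeMap;

public class ClosestKeySketch {
    // Hypothetical sorted keys standing in for the keys of a MapFile.
    static TreeMap<Long, String> sample() {
        TreeMap<Long, String> m = new TreeMap<>();
        m.put(10L, "a");
        m.put(20L, "b");
        m.put(30L, "c");
        return m;
    }

    // Models getClosest(key, val): the key itself, or the first key after it.
    static Long closestAfter(TreeMap<Long, String> m, long key) {
        return m.ceilingKey(key);
    }

    // Models getClosest(key, val, true): the key itself, or the last key before it.
    static Long closestBefore(TreeMap<Long, String> m, long key) {
        return m.floorKey(key);
    }

    public static void main(String[] args) {
        TreeMap<Long, String> m = sample();
        System.out.println(closestAfter(m, 15));   // 20
        System.out.println(closestBefore(m, 15));  // 10
        System.out.println(closestAfter(m, 20));   // 20
        System.out.println(closestAfter(m, 35));   // null
    }
}
```

So a slightly larger key is what I would expect from the default call. What I actually get is a key far beyond the next one.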

Any ideas?
