I am using MapFile.Reader from hadoop-common 2.6.5 (which is packaged with Apache Spark 2.2.0). When I use methods like seek() or getClosest(), they return very different keys, which are much larger than the key I searched for.
In a long debugging session I found the code that is responsible for my problem:

    private synchronized int seekInternal(WritableComparable key, final boolean before)
        throws IOException {
      readIndex();                                // make sure index is read

      if (seekIndex != -1                         // seeked before
          && seekIndex+1 < count
          && comparator.compare(key, keys[seekIndex+1])<0 // before next indexed
          && comparator.compare(key, nextKey) >= 0) { // but after last seeked
        // do nothing
      } else {
        seekIndex = binarySearch(key);
        if (seekIndex < 0)                        // decode insertion point
          seekIndex = -seekIndex-2;

        if (seekIndex == -1)                      // belongs before first entry
          seekPosition = firstPosition;           // use beginning of file
        else
          seekPosition = positions[seekIndex];    // else use index
      }
      data.seek(seekPosition);

readIndex() builds an in-memory map from the contents of the index file. With my example data I see about 300 entries with positions, but only 3 distinct positions: around 300k, 600k and 900k. Because these positions are so high, I assume the map stores references to the second through the last block of the underlying sequence file. firstPosition, in contrast, is 203, which is a position at the very beginning of the data file.

The variable seekPosition is always -1, so the else block is executed. binarySearch() seems to be a standard binary search and returns an offset into the in-memory map (from readIndex()). In my example I am searching for a key between the very first and the second key, and binarySearch() returns a negative value of -4. In all my tests seekPosition is taken from the positions[] array and firstPosition is never used. As a result, the requested key is not found.

While debugging I manually set seekPosition = firstPosition and a miracle happened: now the correct key is found.

I have worked with several other MapFiles and never had such issues. Does anyone have an idea what is wrong here?

What I have tried so far:

- I rebuilt the index file with the fix() method (the files are identical).
- I wrote all keys to a text file. The entries are in the correct order and look fine.
- I checked the configuration settings, but there seems to be no setting that affects MapFiles in this way. All settings are at their system defaults.
- Tests with other keys show the same effect: the closest keys returned are always larger than the requested one. They lie behind it.

Any ideas?
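
For reference, this is how I understand the decoding step in the else branch of seekInternal(). It is only a minimal sketch, not Hadoop code: it assumes binarySearch() follows the same convention as java.util.Arrays.binarySearch (a miss returns -(insertion point) - 1), it uses plain long keys instead of WritableComparable, and the class name, method names and sample values are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical stand-in for the in-memory index built by readIndex().
final class SeekIndexSketch {

    // Assumed convention: a negative search result encodes -(insertionPoint) - 1.
    // Decoding with -result - 2 yields the index entry just before the key,
    // or -1 if the key sorts before the first index entry.
    static int decodeSeekIndex(int searchResult) {
        return searchResult < 0 ? -searchResult - 2 : searchResult;
    }

    // Mirrors the else branch quoted above: choose the file offset to seek to.
    static long seekPosition(long[] keys, long[] positions,
                             long firstPosition, long key) {
        int seekIndex = decodeSeekIndex(Arrays.binarySearch(keys, key));
        if (seekIndex == -1) {
            return firstPosition;        // key belongs before the first index entry
        }
        return positions[seekIndex];     // otherwise start at the indexed offset
    }
}
```

Under that assumption, a return value of -4 decodes to seekIndex 2, so the key is treated as sorting after the third index entry, and only a return value of -1 would select firstPosition. With made-up arrays keys = {10, 20, 30, 40} and positions = {300000, 300000, 600000, 900000}, seekPosition(keys, positions, 203, 35) gives 600000, while a key of 5 gives 203.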