Yiqun Zhang created ORC-1030:
--------------------------------

             Summary: Java Tools Recover File command does not accurately find OrcFile.MAGIC
                 Key: ORC-1030
                 URL: https://issues.apache.org/jira/browse/ORC-1030
             Project: ORC
          Issue Type: Bug
          Components: Java, tools
    Affects Versions: 1.6.11, 1.7.0, 1.8.0
            Reporter: Yiqun Zhang


{code:java}
        while (remaining > 0) {
          int toRead = (int) Math.min(DEFAULT_BLOCK_SIZE, remaining);
          byte[] data = new byte[toRead];
          long startPos = corruptFileLen - remaining;
          fdis.readFully(startPos, data, 0, toRead);

          // find all MAGIC string and see if the file is readable from there
          int index = 0;
          long nextFooterOffset;
          byte[] magicBytes = OrcFile.MAGIC.getBytes(StandardCharsets.UTF_8);
          while (index != -1) {
            index = indexOf(data, magicBytes, index + 1);
            if (index != -1) {
              nextFooterOffset = startPos + index + magicBytes.length + 1;
              if (isReadable(corruptPath, conf, nextFooterOffset)) {
                footerOffsets.add(nextFooterOffset);
              }
            }
          }

          System.err.println("Scanning for valid footers - startPos: " + startPos +
              " toRead: " + toRead + " remaining: " + remaining);
          remaining = remaining - toRead;
        }
{code}
An occurrence of OrcFile.MAGIC may be split exactly across two adjacent reads. 
Because the current implementation only matches within a single read, that 
occurrence is never found and the corresponding recovery position is missed.
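
One possible way to handle the boundary case is to extend each read by 
magicBytes.length - 1 bytes so that a MAGIC starting near the end of one block 
is still fully contained in the buffer. The following is only a minimal sketch 
of that idea, not the actual fix: the class OverlapScanSketch and the methods 
findAll and matchesAt are hypothetical names, and the file is simulated with an 
in-memory byte array instead of fdis.readFully.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the ORC code base.
final class OverlapScanSketch {

  // Returns the absolute offset of every occurrence of pattern in file,
  // scanning blockSize bytes at a time (blockSize must be >= 1).
  static List<Long> findAll(byte[] file, byte[] pattern, int blockSize) {
    List<Long> offsets = new ArrayList<>();
    int overlap = pattern.length - 1;
    long startPos = 0;
    while (startPos < file.length) {
      // Read `overlap` extra bytes past the block so that a pattern which
      // begins in this block but ends in the next one is fully inside `data`.
      int toRead = (int) Math.min((long) blockSize + overlap, file.length - startPos);
      byte[] data = Arrays.copyOfRange(file, (int) startPos, (int) startPos + toRead);

      // Match starts are limited to the first blockSize positions of `data`,
      // so each occurrence is reported exactly once.
      for (int i = 0; i + pattern.length <= data.length; i++) {
        if (matchesAt(data, i, pattern)) {
          offsets.add(startPos + i);
        }
      }
      // Advance by the block size only; the overlap is re-read next time.
      startPos += blockSize;
    }
    return offsets;
  }

  // True if pattern occurs in data starting at position pos.
  static boolean matchesAt(byte[] data, int pos, byte[] pattern) {
    for (int j = 0; j < pattern.length; j++) {
      if (data[pos + j] != pattern[j]) {
        return false;
      }
    }
    return true;
  }
}
```

With a block size of 3, the input xxORCyy is read as xxO, RCy, y, so the 
pattern ORC straddles the first two blocks. The overlapping read still sees it 
at offset 2.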



{code:java}
  private static int indexOf(final byte[] data, final byte[] pattern, final int index) {
    if (data == null || data.length == 0 || pattern == null || pattern.length == 0 ||
        index > data.length || index < 0) {
      return -1;
    }

    int j = 0;
    for (int i = index; i < data.length; i++) {
      if (pattern[j] == data[i]) {
        j++;
      } else {
        j = 0;
      }

      if (j == pattern.length) {
        return i - pattern.length + 1;
      }
    }

    return -1;
  }
{code}
This matching algorithm is wrong because i does not backtrack after a partial 
match fails. As a simple example, with data = OOORC, pattern = ORC and 
index = 1, this algorithm returns -1, although ORC occurs at index 2.
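
For comparison, a straightforward search that restarts the comparison at every 
candidate position does find the match in this example. This is only a sketch 
of one possible correction, not necessarily the fix that will be applied. The 
class name IndexOfSketch is hypothetical, and the signature simply mirrors the 
method above.

```java
// Hypothetical illustration only; not part of the ORC code base.
final class IndexOfSketch {

  // Returns the first position >= index at which pattern occurs in data,
  // or -1 if there is no such position.
  static int indexOf(final byte[] data, final byte[] pattern, final int index) {
    if (data == null || pattern == null || pattern.length == 0 || index < 0) {
      return -1;
    }
    // Try every candidate start position. The inner comparison always begins
    // at pattern[0], so a failed partial match cannot hide a later match.
    for (int i = index; i <= data.length - pattern.length; i++) {
      int j = 0;
      while (j < pattern.length && data[i + j] == pattern[j]) {
        j++;
      }
      if (j == pattern.length) {
        return i;
      }
    }
    return -1;
  }
}
```

With data = OOORC, pattern = ORC and index = 1, the candidate at position 1 
fails on the second byte, and the search then retries from position 2, where 
the full pattern matches.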






--
This message was sent by Atlassian Jira
(v8.3.4#803005)
