Yiqun Zhang created ORC-1030:
--------------------------------
Summary: Java Tools Recover File command does not accurately find
OrcFile.MAGIC
Key: ORC-1030
URL: https://issues.apache.org/jira/browse/ORC-1030
Project: ORC
Issue Type: Bug
Components: Java, tools
Affects Versions: 1.6.11, 1.7.0, 1.8.0
Reporter: Yiqun Zhang
{code:java}
while (remaining > 0) {
  int toRead = (int) Math.min(DEFAULT_BLOCK_SIZE, remaining);
  byte[] data = new byte[toRead];
  long startPos = corruptFileLen - remaining;
  fdis.readFully(startPos, data, 0, toRead);

  // find all MAGIC string and see if the file is readable from there
  int index = 0;
  long nextFooterOffset;
  byte[] magicBytes = OrcFile.MAGIC.getBytes(StandardCharsets.UTF_8);
  while (index != -1) {
    index = indexOf(data, magicBytes, index + 1);
    if (index != -1) {
      nextFooterOffset = startPos + index + magicBytes.length + 1;
      if (isReadable(corruptPath, conf, nextFooterOffset)) {
        footerOffsets.add(nextFooterOffset);
      }
    }
  }

  System.err.println("Scanning for valid footers - startPos: " + startPos +
      " toRead: " + toRead + " remaining: " + remaining);
  remaining = remaining - toRead;
}
{code}
OrcFile.MAGIC may straddle the boundary between two adjacent reads. Because the
current implementation only matches within a single read, such an occurrence is
never found, and the recovery position it marks is missed.
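
One possible way to avoid missing a match at a read boundary is to start each
window pattern.length - 1 bytes before the current block, so that a pattern
split across two reads falls entirely inside one window. The following is only
a minimal in-memory sketch of that idea, not the ORC tool's code: the class
BoundaryScan, the methods findAll and matchesAt, and the blockSize parameter
are hypothetical names, and a byte array stands in for the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of ORC.
final class BoundaryScan {
  private BoundaryScan() {}

  /** Returns the absolute offsets of every occurrence of pattern in file. */
  static List<Long> findAll(byte[] file, byte[] pattern, int blockSize) {
    List<Long> offsets = new ArrayList<>();
    if (pattern.length == 0 || blockSize <= 0) {
      return offsets;
    }
    // A match that begins in the previous block can start at most
    // pattern.length - 1 bytes before the current block.
    int overlap = pattern.length - 1;
    for (long pos = 0; pos < file.length; pos += blockSize) {
      long windowStart = Math.max(0, pos - overlap);
      int windowEnd = (int) Math.min(file.length, pos + blockSize);
      byte[] window = Arrays.copyOfRange(file, (int) windowStart, windowEnd);
      // Every match in this window covers at least one byte at or after pos,
      // so no occurrence is reported twice.
      for (int i = 0; i + pattern.length <= window.length; i++) {
        if (matchesAt(window, pattern, i)) {
          offsets.add(windowStart + i);
        }
      }
    }
    return offsets;
  }

  /** True if pattern occurs in data starting exactly at offset start. */
  private static boolean matchesAt(byte[] data, byte[] pattern, int start) {
    for (int j = 0; j < pattern.length; j++) {
      if (data[start + j] != pattern[j]) {
        return false;
      }
    }
    return true;
  }
}
```

With a block size of 3, the input "xxORCyyORC" splits both occurrences of "ORC"
across block boundaries, and the overlapping windows still report both of them.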
{code:java}
private static int indexOf(final byte[] data, final byte[] pattern, final int index) {
  if (data == null || data.length == 0 || pattern == null || pattern.length == 0 ||
      index > data.length || index < 0) {
    return -1;
  }

  int j = 0;
  for (int i = index; i < data.length; i++) {
    if (pattern[j] == data[i]) {
      j++;
    } else {
      j = 0;
    }
    if (j == pattern.length) {
      return i - pattern.length + 1;
    }
  }
  return -1;
}
{code}
This matching algorithm is wrong because i does not backtrack after a partial
match fails. As a simple example, with data = OOORC, pattern = ORC, and
index = 1, the algorithm returns -1 even though the pattern occurs at offset 2.
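
A straightforward way to avoid the problem is to restart the comparison at
every candidate start position, so that a failed partial match never consumes
bytes belonging to the next candidate. The sketch below is one such naive
search, offered as an illustration rather than as the fix adopted in ORC; the
class name MagicSearch and the parameter name fromIndex are hypothetical.

```java
// Hypothetical helper for illustration; not part of ORC.
final class MagicSearch {
  private MagicSearch() {}

  /**
   * Returns the first offset at or after fromIndex where pattern occurs
   * in data, or -1 if there is no such occurrence.
   */
  static int indexOf(byte[] data, byte[] pattern, int fromIndex) {
    if (data == null || pattern == null || pattern.length == 0
        || fromIndex < 0 || fromIndex > data.length - pattern.length) {
      return -1;
    }
    // Try every start position; the inner loop always compares from
    // pattern[0], which is the backtracking the original loop lacks.
    for (int start = fromIndex; start <= data.length - pattern.length; start++) {
      int j = 0;
      while (j < pattern.length && data[start + j] == pattern[j]) {
        j++;
      }
      if (j == pattern.length) {
        return start;
      }
    }
    return -1;
  }
}
```

For the example above (data = OOORC, pattern = ORC, index = 1), this version
returns 2. It runs in O(n * m) time in the worst case, which should be
acceptable for a short pattern such as OrcFile.MAGIC.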