xinqiu.hu created HADOOP-18395:
----------------------------------
Summary: Performance improvement in org.apache.hadoop.io.Text.find()
Key: HADOOP-18395
URL: https://issues.apache.org/jira/browse/HADOOP-18395
Project: Hadoop Common
Issue Type: Improvement
Components: io
Reporter: xinqiu.hu
The current implementation resets src and tgt to their marks and continues
searching when tgt still has bytes remaining but src is exhausted first. This
is probably unnecessary, because every later start position leaves even fewer
bytes in src.
{code:java}
public int find(String what, int start) {
  try {
    ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
    ByteBuffer tgt = encode(what);
    byte b = tgt.get();
    src.position(start);
    while (src.hasRemaining()) {
      if (b == src.get()) { // matching first byte
        src.mark(); // save position in loop
        tgt.mark(); // save position in target
        boolean found = true;
        int pos = src.position()-1;
        while (tgt.hasRemaining()) {
          if (!src.hasRemaining()) { // src expired first
            tgt.reset();
            src.reset();
            found = false;
            break;
          }
          if (!(tgt.get() == src.get())) {
            tgt.reset();
            src.reset();
            found = false;
            break; // no match
          }
        }
        if (found) return pos;
      }
    }
    return -1; // not found
  } catch (CharacterCodingException e) {
    throw new RuntimeException("Should not have happened", e);
  }
}
{code}
For example, in the test below, when the search reaches q, src has no bytes
remaining, so src is reset to d and the search continues. But from that point
on, the remaining length of src is always smaller than the length of tgt, so
we can return -1 directly.
{code:java}
@Test
public void testFind() throws Exception {
  Text text = new Text("abcd\u20acbdcd\u20ac");
  assertThat(text.find("cd\u20acq")).isEqualTo(-1);
}
{code}
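To make the byte arithmetic in this example concrete, here is a small
standalone sketch. It assumes UTF-8 encoding, in which \u20ac occupies three
bytes; the class FindTraceSketch and the helper lastIndexOf are hypothetical
names used only for illustration and are not part of Hadoop.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Hadoop.
final class FindTraceSketch {

  /** Returns the last offset of byte b in data, or -1 if it is absent. */
  static int lastIndexOf(byte[] data, byte b) {
    for (int i = data.length - 1; i >= 0; i--) {
      if (data[i] == b) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    byte[] src = "abcd\u20acbdcd\u20ac".getBytes(StandardCharsets.UTF_8);
    byte[] tgt = "cd\u20acq".getBytes(StandardCharsets.UTF_8);

    // The last candidate start is the last 'c' in src.
    int lastC = lastIndexOf(src, (byte) 'c');
    int left = src.length - lastC;

    System.out.println("src length  = " + src.length); // 14
    System.out.println("tgt length  = " + tgt.length); // 6
    System.out.println("last 'c' at = " + lastC);      // 9
    System.out.println("bytes left  = " + left);       // 5
  }
}
```

Only 5 bytes remain from the last candidate start while the target needs 6,
so src is exhausted with q still unmatched, and any later start position has
even fewer bytes available.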
Perhaps it could be:
{code:java}
public int find(String what, int start) {
  try {
    ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
    ByteBuffer tgt = encode(what);
    byte b = tgt.get();
    src.position(start);
    while (src.hasRemaining()) {
      if (b == src.get()) { // matching first byte
        src.mark(); // save position in loop
        tgt.mark(); // save position in target
        boolean found = true;
        int pos = src.position()-1;
        while (tgt.hasRemaining()) {
          if (!src.hasRemaining()) { // src expired first
            return -1;
          }
          if (!(tgt.get() == src.get())) {
            tgt.reset();
            src.reset();
            found = false;
            break; // no match
          }
        }
        if (found) return pos;
      }
    }
    return -1; // not found
  } catch (CharacterCodingException e) {
    throw new RuntimeException("Should not have happened", e);
  }
}
{code}
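One way to check that the early return does not change results is a
standalone harness that runs both variants on the same input. The sketch below
is only an approximation of Text.find(): it works on a raw byte array, swaps
in String.getBytes(UTF_8) for Text's encode(), assumes a non-empty search
string, and uses a hypothetical class name (TextFindSketch) and a hypothetical
earlyExit flag to switch between the two behaviors.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone harness for illustration; not part of Hadoop.
final class TextFindSketch {

  /**
   * Searches bytes for the UTF-8 encoding of what, beginning at start.
   * With earlyExit set, returns -1 as soon as src is exhausted mid-match;
   * otherwise resets to the marks and keeps scanning, like the current code.
   * Assumes what is non-empty.
   */
  static int find(byte[] bytes, String what, int start, boolean earlyExit) {
    ByteBuffer src = ByteBuffer.wrap(bytes);
    ByteBuffer tgt = ByteBuffer.wrap(what.getBytes(StandardCharsets.UTF_8));
    byte b = tgt.get();
    src.position(start);
    while (src.hasRemaining()) {
      if (b == src.get()) { // matching first byte
        src.mark(); // save position in source
        tgt.mark(); // save position in target
        boolean found = true;
        int pos = src.position() - 1;
        while (tgt.hasRemaining()) {
          if (!src.hasRemaining()) { // src expired first
            if (earlyExit) {
              return -1; // no later start can fit the target
            }
            tgt.reset();
            src.reset();
            found = false;
            break;
          }
          if (tgt.get() != src.get()) { // no match at this start
            tgt.reset();
            src.reset();
            found = false;
            break;
          }
        }
        if (found) {
          return pos;
        }
      }
    }
    return -1; // not found
  }

  public static void main(String[] args) {
    byte[] text = "abcd\u20acbdcd\u20ac".getBytes(StandardCharsets.UTF_8);
    String[] targets = {"cd\u20acq", "cd\u20ac", "bd", "\u20ac", "d\u20acx"};
    for (String t : targets) {
      int withReset = find(text, t, 0, false);
      int withExit = find(text, t, 0, true);
      // Print both results and whether they agree.
      System.out.println(withReset + " " + withExit + " " + (withReset == withExit));
    }
  }
}
```

Both variants should agree on every input, since a match that runs past the
end of src at one start position cannot succeed at any later one; the early
return only skips the remaining scan.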