[ https://issues.apache.org/jira/browse/HADOOP-18395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617488#comment-17617488 ]
ASF GitHub Bot commented on HADOOP-18395:
-----------------------------------------

huxinqiu commented on PR #4714:
URL: https://github.com/apache/hadoop/pull/4714#issuecomment-1278521721

   @ZanderXu Thanks for helping to review the code. Could you help merge this PR into the trunk branch?

> Performance improvement in org.apache.hadoop.io.Text#find
> ---------------------------------------------------------
>
>                 Key: HADOOP-18395
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18395
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: xinqiu.hu
>            Priority: Trivial
>              Labels: pull-request-available
>         Attachments: 0001-add-UT-with-timeout-for-Text-find-and-fix-comments.patch
>
>
> When tgt still has bytes remaining but src is exhausted first, the current
> implementation resets src and tgt to their marks and continues searching,
> which is probably unnecessary.
> {code:java}
> public int find(String what, int start) {
>   try {
>     ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
>     ByteBuffer tgt = encode(what);
>     byte b = tgt.get();
>     src.position(start);
>     while (src.hasRemaining()) {
>       if (b == src.get()) { // matching first byte
>         src.mark(); // save position in loop
>         tgt.mark(); // save position in target
>         boolean found = true;
>         int pos = src.position()-1;
>         while (tgt.hasRemaining()) {
>           if (!src.hasRemaining()) { // src expired first
>             tgt.reset();
>             src.reset();
>             found = false;
>             break;
>           }
>           if (!(tgt.get() == src.get())) {
>             tgt.reset();
>             src.reset();
>             found = false;
>             break; // no match
>           }
>         }
>         if (found) return pos;
>       }
>     }
>     return -1; // not found
>   } catch (CharacterCodingException e) {
>     throw new RuntimeException("Should not have happened", e);
>   }
> } {code}
> For example, in the test below, when the q in the target is reached, src has
> no bytes remaining, so src is reset to d and the search continues. But from
> that point on the remaining length of src is always smaller than tgt, so we
> can return -1 directly.
> {code:java}
> @Test
> public void testFind() throws Exception {
>   Text text = new Text("abcd\u20acbdcd\u20ac");
>   assertThat(text.find("cd\u20acq")).isEqualTo(-1);
> } {code}
> Perhaps it could be:
> {code:java}
> public int find(String what, int start) {
>   try {
>     ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
>     ByteBuffer tgt = encode(what);
>     byte b = tgt.get();
>     src.position(start);
>     while (src.hasRemaining()) {
>       if (b == src.get()) { // matching first byte
>         src.mark(); // save position in loop
>         tgt.mark(); // save position in target
>         boolean found = true;
>         int pos = src.position()-1;
>         while (tgt.hasRemaining()) {
>           if (!src.hasRemaining()) { // src expired first
>             return -1;
>           }
>           if (!(tgt.get() == src.get())) {
>             tgt.reset();
>             src.reset();
>             found = false;
>             break; // no match
>           }
>         }
>         if (found) return pos;
>       }
>     }
>     return -1; // not found
>   } catch (CharacterCodingException e) {
>     throw new RuntimeException("Should not have happened", e);
>   }
> }{code}
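
To make the early-exit argument easy to check outside Hadoop, here is a minimal standalone sketch of the same idea on plain byte arrays. It is not the Text#find code from the PR: the class name ByteFindSketch, the method signature, and the handling of an empty target are assumptions made for illustration only. The point it shows is that once a candidate start has fewer bytes left than the target needs, every later start has even fewer, so the search can stop.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical standalone sketch; not part of org.apache.hadoop.io.Text.
final class ByteFindSketch {

    /**
     * Returns the byte offset of the first occurrence of tgt within
     * src[0, length) at or after start, or -1 if there is none.
     * An empty target returns -1 here; that is a simplification of this
     * sketch, not a statement about Hadoop's behavior.
     */
    static int find(byte[] src, int length, byte[] tgt, int start) {
        if (tgt.length == 0 || start < 0 || start > length) {
            return -1;
        }
        for (int pos = start; pos < length; pos++) {
            if (src[pos] != tgt[0]) {
                continue; // first byte does not match, try the next offset
            }
            if (length - pos < tgt.length) {
                // src would expire before tgt is fully compared. Every later
                // offset has even fewer bytes left, so no match is possible.
                return -1;
            }
            boolean found = true;
            for (int i = 1; i < tgt.length; i++) {
                if (src[pos + i] != tgt[i]) {
                    found = false; // mismatch, resume scanning after pos
                    break;
                }
            }
            if (found) {
                return pos;
            }
        }
        return -1; // not found
    }

    public static void main(String[] args) {
        // Same input as the unit test in the issue description.
        byte[] src = "abcd\u20acbdcd\u20ac".getBytes(StandardCharsets.UTF_8);
        byte[] miss = "cd\u20acq".getBytes(StandardCharsets.UTF_8);
        byte[] hit = "cd\u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(find(src, src.length, miss, 0)); // prints -1
        System.out.println(find(src, src.length, hit, 0));  // prints 2
        System.out.println(find(src, src.length, hit, 3));  // prints 9
    }
}
```

Offsets are byte offsets into the UTF-8 encoding, so the euro sign counts as three bytes. With the input from the issue, the second "c" sits at byte offset 9 with only five bytes left, while the target "cd\u20acq" needs six, which is where the sketch returns -1 without scanning further.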