[ 
https://issues.apache.org/jira/browse/HADOOP-18395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xinqiu.hu updated HADOOP-18395:
-------------------------------
    Attachment:     (was: 
0001-retrun-1-when-tgt-has-remaining-and-src-expired-firs.patch)

> Performance improvement in org.apache.hadoop.io.Text#find
> ---------------------------------------------------------
>
>                 Key: HADOOP-18395
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18395
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: xinqiu.hu
>            Priority: Trivial
>              Labels: pull-request-available
>         Attachments: 
> 0001-add-UT-with-timeout-for-Text-find-and-fix-comments.patch
>
>
> The current implementation resets src and tgt to their marks and continues 
> searching when tgt still has bytes remaining but src is exhausted first. This 
> is probably unnecessary.
> {code:java}
> public int find(String what, int start) {
>   try {
>     ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
>     ByteBuffer tgt = encode(what);
>     byte b = tgt.get();
>     src.position(start);
>     while (src.hasRemaining()) {
>       if (b == src.get()) { // matching first byte
>         src.mark(); // save position in loop
>         tgt.mark(); // save position in target
>         boolean found = true;
>         int pos = src.position()-1;
>         while (tgt.hasRemaining()) {
>           if (!src.hasRemaining()) { // src expired first
>             tgt.reset();
>             src.reset();
>             found = false;
>             break;
>           }
>           if (!(tgt.get() == src.get())) {
>             tgt.reset();
>             src.reset();
>             found = false;
>             break; // no match
>           }
>         }
>         if (found) return pos;
>       }
>     }
>     return -1; // not found
>   } catch (CharacterCodingException e) {
>     throw new RuntimeException("Should not have happened", e);
>   }
> } {code}
> For example, in the test below, src runs out of bytes while the final byte q 
> of tgt is still unmatched, and src is reset to the d after the matched c so 
> that the search continues. But from that point on src always has fewer bytes 
> remaining than tgt needs, so no match is possible and we can return -1 
> directly.
> {code:java}
> @Test
> public void testFind() throws Exception {
>   Text text = new Text("abcd\u20acbdcd\u20ac");
>   assertThat(text.find("cd\u20acq")).isEqualTo(-1);
> } {code}
> Perhaps it could be:
> {code:java}
> public int find(String what, int start) {
>   try {
>     ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
>     ByteBuffer tgt = encode(what);
>     byte b = tgt.get();
>     src.position(start);
>     while (src.hasRemaining()) {
>       if (b == src.get()) { // matching first byte
>         src.mark(); // save position in loop
>         tgt.mark(); // save position in target
>         boolean found = true;
>         int pos = src.position()-1;
>         while (tgt.hasRemaining()) {
>           if (!src.hasRemaining()) { // src expired first
>             return -1;
>           }
>           if (!(tgt.get() == src.get())) {
>             tgt.reset();
>             src.reset();
>             found = false;
>             break; // no match
>           }
>         }
>         if (found) return pos;
>       }
>     }
>     return -1; // not found
>   } catch (CharacterCodingException e) {
>     throw new RuntimeException("Should not have happened", e);
>   }
> }{code}
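
The early return proposed above can be tried outside Hadoop with a minimal standalone sketch. The class name FindSketch and both helper signatures are hypothetical and are not part of org.apache.hadoop.io.Text. The sketch assumes UTF-8 input and a non-empty search string, and it encodes with String#getBytes rather than Text's own encode method.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone sketch; not the Hadoop implementation.
class FindSketch {

  // Convenience overload (an assumption of this sketch): encode the text as UTF-8.
  static int find(String text, String what, int start) {
    byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
    return find(bytes, bytes.length, what, start);
  }

  // Returns the byte offset of the first occurrence of `what` at or after
  // `start`, or -1 if there is none. Assumes `what` is non-empty.
  static int find(byte[] bytes, int length, String what, int start) {
    ByteBuffer src = ByteBuffer.wrap(bytes, 0, length);
    ByteBuffer tgt = ByteBuffer.wrap(what.getBytes(StandardCharsets.UTF_8));
    byte b = tgt.get();              // first byte of the target
    src.position(start);
    while (src.hasRemaining()) {
      if (b == src.get()) {          // matching first byte
        src.mark();                  // resume point in src if this candidate fails
        tgt.mark();                  // position just after the first target byte
        boolean found = true;
        int pos = src.position() - 1;
        while (tgt.hasRemaining()) {
          if (!src.hasRemaining()) {
            // src ran out first: every later candidate starts further right
            // and has even fewer bytes available, so none can match.
            return -1;
          }
          if (tgt.get() != src.get()) {
            tgt.reset();
            src.reset();
            found = false;
            break;                   // no match at this candidate
          }
        }
        if (found) {
          return pos;
        }
      }
    }
    return -1;                       // not found
  }

  public static void main(String[] args) {
    System.out.println(find("abcd\u20acbdcd\u20ac", "cd\u20acq", 0));
    System.out.println(find("abcd\u20acbdcd\u20ac", "cd\u20ac", 0));
  }
}
```

Returning early is safe because a candidate that exhausts src leaves fewer bytes than the target needs, and every later candidate starts further right with even fewer bytes available.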



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
