[ 
https://issues.apache.org/jira/browse/HIVE-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13619267#comment-13619267
 ] 

Ashutosh Chauhan commented on HIVE-3992:
----------------------------------------

I have a question related to 1) Does not do sync if ExecMapper.getDone() == true
Outside of Hive (meaning using RCFile independent of Hive), getDone() will 
always be false, so this if block will always be skipped and part of this 
optimization will not fire. That is alright, but want to make sure this will 
not result in wrong results for that scenario. [~gopalv] Is that correct?

                
> Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks
> -------------------------------------------------------------------------
>
>                 Key: HIVE-3992
>                 URL: https://issues.apache.org/jira/browse/HIVE-3992
>             Project: Hive
>          Issue Type: Bug
>         Environment: Ubuntu x86_64/java-1.6/hadoop-2.0.3
>            Reporter: Gopal V
>            Assignee: Gopal V
>         Attachments: HIVE-3992.patch, select-join-limit.html
>
>
> The following function does some bad I/O
> {code}
> public synchronized void sync(long position) throws IOException {
>   ...
>       try {
>         seek(position + 4); // skip escape
>         in.readFully(syncCheck);
>         int syncLen = sync.length;
>         for (int i = 0; in.getPos() < end; i++) {
>           int j = 0;
>           for (; j < syncLen; j++) {
>             if (sync[j] != syncCheck[(i + j) % syncLen]) {
>               break;
>             }
>           }
>           if (j == syncLen) {
>             in.seek(in.getPos() - SYNC_SIZE); // position before
>             // sync
>             return;
>           }
>           syncCheck[i % syncLen] = in.readByte();
>         }
>       }
> ...
>     }
> {code}
> This causes a rather large number of readByte() calls which are passed onto a 
> ByteBuffer via a single byte array.
> This results in rather a large amount of CPU being burnt in a the linear 
> search for the sync pattern in the input RCFile (upto 92% for a skewed 
> example - a trivial map-join + limit 100).
> This behaviour should be avoided at best or at least replaced by a rolling 
> hash for efficient comparison, since it has a known byte-width of 16 bytes.
> Attached the stack trace from a Yourkit profile.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to