[ 
https://issues.apache.org/jira/browse/HIVE-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622699#comment-13622699
 ] 

Gopal V commented on HIVE-3992:
-------------------------------

I think the ExecMapper.getDone() is breaking layering & control. Once I thought 
about it more, I couldn't retain it as there is no way to really turn it off 
once a mapper runs (say, you want to read an RCFile in the cleanup code for the 
mapper).

The GC overhead errors were due to the fact that the cache grows without bounds 
for a long running process. The weak hashmap works as a solution for that & 
will invalidate the cache as the weak-refs are collected by the gc.
                
> Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks
> -------------------------------------------------------------------------
>
>                 Key: HIVE-3992
>                 URL: https://issues.apache.org/jira/browse/HIVE-3992
>             Project: Hive
>          Issue Type: Bug
>         Environment: Ubuntu x86_64/java-1.6/hadoop-2.0.3
>            Reporter: Gopal V
>            Assignee: Gopal V
>         Attachments: HIVE-3992.2.patch, HIVE-3992.patch, 
> select-join-limit.html
>
>
> The following function does some bad I/O
> {code}
> public synchronized void sync(long position) throws IOException {
>   ...
>       try {
>         seek(position + 4); // skip escape
>         in.readFully(syncCheck);
>         int syncLen = sync.length;
>         for (int i = 0; in.getPos() < end; i++) {
>           int j = 0;
>           for (; j < syncLen; j++) {
>             if (sync[j] != syncCheck[(i + j) % syncLen]) {
>               break;
>             }
>           }
>           if (j == syncLen) {
>             in.seek(in.getPos() - SYNC_SIZE); // position before
>             // sync
>             return;
>           }
>           syncCheck[i % syncLen] = in.readByte();
>         }
>       }
> ...
>     }
> {code}
> This causes a rather large number of readByte() calls which are passed onto a 
> ByteBuffer via a single byte array.
> This results in rather a large amount of CPU being burnt in a the linear 
> search for the sync pattern in the input RCFile (upto 92% for a skewed 
> example - a trivial map-join + limit 100).
> This behaviour should be avoided at best or at least replaced by a rolling 
> hash for efficient comparison, since it has a known byte-width of 16 bytes.
> Attached the stack trace from a Yourkit profile.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to