[jira] [Commented] (HIVE-3992) Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks

Gopal V (JIRA) Thu, 07 Feb 2013 03:37:19 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573402#comment-13573402
 ]


Gopal V commented on HIVE-3992:
-------------------------------

Testing a dummy query (simulating a "col in (select ...)" style query) at 
SCALE=10:

select /*+MAPJOIN(time_dim)*/ store_sales_rc.ss_item_sk from store_sales_rc 
join time_dim on (store_sales_rc.ss_sold_time_sk = time_dim.t_time_sk) limit 
100;

Before
{code}
2013-02-07 06:32:02,164 Stage-1 map = 0%,  reduce = 0%
2013-02-07 06:32:20,082 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 53.9 sec
2013-02-07 06:32:21,127 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 61.59 sec
Job 0: Map: 8   Cumulative CPU: 61.59 sec   HDFS Read: 104763092 HDFS Write: 4749 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 1 seconds 590 msec
Time taken: 34.572 seconds, Fetched: 100 row(s)
{code}

After
{code}
2013-02-07 06:35:29,413 Stage-1 map = 0%,  reduce = 0%
2013-02-07 06:35:43,200 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 9.31 sec
2013-02-07 06:35:44,247 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 39.45 sec
MapReduce Total cumulative CPU time: 39 seconds 450 msec
Ended Job = job_1359695160319_0164
MapReduce Jobs Launched: 
Job 0: Map: 8   Cumulative CPU: 39.45 sec   HDFS Read: 25416952 HDFS Write: 4749 SUCCESS
Total MapReduce CPU Time Spent: 39 seconds 450 msec
Time taken: 31.351 seconds, Fetched: 100 row(s)
{code}

Now the interesting bit is that even though we cut the cumulative CPU cost by 
about 36% (61.59 sec to 39.45 sec), the overall latency drops by only about 
3 secs (34.572 sec to 31.351 sec).
                
> Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks
> -------------------------------------------------------------------------
>
>                 Key: HIVE-3992
>                 URL: https://issues.apache.org/jira/browse/HIVE-3992
>             Project: Hive
>          Issue Type: Bug
>         Environment: Ubuntu x86_64/java-1.6/hadoop-2.0.3
>            Reporter: Gopal V
>         Attachments: HIVE-3992.patch, select-join-limit.html
>
>
> The following function does some bad I/O
> {code}
> public synchronized void sync(long position) throws IOException {
>   ...
>       try {
>         seek(position + 4); // skip escape
>         in.readFully(syncCheck);
>         int syncLen = sync.length;
>         for (int i = 0; in.getPos() < end; i++) {
>           int j = 0;
>           for (; j < syncLen; j++) {
>             if (sync[j] != syncCheck[(i + j) % syncLen]) {
>               break;
>             }
>           }
>           if (j == syncLen) {
>             in.seek(in.getPos() - SYNC_SIZE); // position before
>             // sync
>             return;
>           }
>           syncCheck[i % syncLen] = in.readByte();
>         }
>       }
> ...
>     }
> {code}
> This causes a rather large number of readByte() calls, each of which is 
> passed on to a ByteBuffer via a single-byte array.
> This results in a rather large amount of CPU being burnt in the linear 
> search for the sync pattern in the input RCFile (up to 92% for a skewed 
> example - a trivial map-join + limit 100).
> This behaviour should ideally be avoided, or at least replaced by a rolling 
> hash for efficient comparison, since the sync marker has a known width of 
> 16 bytes.
> Attached the stack trace from a YourKit profile.
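
To illustrate the per-byte read cost described above, here is a minimal sketch 
of a chunked search for a fixed-width sync marker. It is not the attached 
HIVE-3992.patch and does not use RCFile or Hadoop APIs; the class name 
SyncScanner, the method findSync and the chunkSize parameter are hypothetical 
names chosen for this example. It uses a plain bulk read() into a buffer 
instead of one readByte() per position, and keeps the last sync.length - 1 
bytes between reads so that a marker straddling two chunks is still found. A 
rolling hash, as suggested in the description, would be a further refinement 
that this sketch does not attempt.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of Hive or Hadoop.
class SyncScanner {

    /**
     * Returns the stream offset of the first occurrence of sync,
     * or -1 if the stream ends before a full marker is seen.
     */
    static long findSync(InputStream in, byte[] sync, int chunkSize) throws IOException {
        if (sync.length == 0 || chunkSize < 1) {
            throw new IllegalArgumentException("sync and chunkSize must be non-empty/positive");
        }
        int overlap = sync.length - 1;           // bytes carried over between reads
        byte[] buf = new byte[overlap + chunkSize];
        int filled = 0;                          // number of valid bytes in buf
        long bufStart = 0;                       // stream offset of buf[0]

        while (true) {
            // One bulk read replaces many single-byte reads.
            int n = in.read(buf, filled, buf.length - filled);
            if (n < 0) {
                return -1;
            }
            filled += n;

            // Check every position where a whole marker fits in the buffer.
            for (int i = 0; i + sync.length <= filled; i++) {
                if (matches(buf, i, sync)) {
                    return bufStart + i;
                }
            }

            // Keep the unscanned tail so a marker split across reads is not missed.
            int keep = Math.min(overlap, filled);
            System.arraycopy(buf, filled - keep, buf, 0, keep);
            bufStart += filled - keep;
            filled = keep;
        }
    }

    private static boolean matches(byte[] buf, int from, byte[] sync) {
        for (int j = 0; j < sync.length; j++) {
            if (buf[from + j] != sync[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Made-up 16-byte marker placed inside otherwise zero-filled data.
        byte[] sync = new byte[16];
        for (int i = 0; i < sync.length; i++) {
            sync[i] = (byte) (0xA0 + i);
        }
        byte[] data = new byte[100];
        System.arraycopy(sync, 0, data, 37, sync.length);

        long pos = findSync(new ByteArrayInputStream(data), sync, 8);
        System.out.println("sync found at offset " + pos);
    }
}
```

In a reader like the one quoted above, the returned offset would be added to 
the position where scanning started, followed by a seek to that point; that 
wiring is left out here because it depends on code elided from the excerpt.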
