[ 
https://issues.apache.org/jira/browse/HIVE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744635#action_12744635
 ] 

Ning Zhang commented on HIVE-756:
---------------------------------

The ret.set(i, BytesRefWritable.ZeroBytesRefWritable); call at RCFile.java:1273 
seems unnecessary, because when the BytesRefArrayWritable is constructed each 
member is already initialized to BytesRefWritable.ZeroBytesRefWritable. So as 
long as the list of projected columns does not change while the table scan 
iterator RCFileRecord.next() runs, we don't need to set these values.

The reason I'm picky about this small thing is that the CPU cost can differ 
hugely depending on whether we maintain reasonable invariants (assertions) 
across the two nested loops (over rows and over columns) and remove unnecessary 
code or reduce the number of loops. The code inside the loop/iterator should be 
really lean and do only what is absolutely necessary. In my test, these simple 
changes reduced the iterator fetch time from 5 sec to less than 1 sec and 
improved overall query performance by about 15% - 20%.

In this case the invariant is that the projected columns do not change during 
the table scan. Please let me know if you think there are cases that break this 
invariant; if so, I'll revert the changes.
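
To make the argument concrete, here is a minimal sketch of the idea, not the 
actual RCFile code. The class ProjectionInvariantSketch, the EMPTY constant 
(standing in for BytesRefWritable.ZeroBytesRefWritable), and the byte[][] row 
buffer (standing in for BytesRefArrayWritable) are all hypothetical names used 
only for illustration. It assumes the row buffer is reused across calls and that 
the projected column list is fixed for the whole scan.

```java
import java.util.Arrays;

final class ProjectionInvariantSketch {
    // Hypothetical stand-in for BytesRefWritable.ZeroBytesRefWritable.
    static final byte[] EMPTY = new byte[0];

    // Build the reusable row buffer once; every slot starts out as EMPTY.
    static byte[][] newRow(int columnCount) {
        byte[][] row = new byte[columnCount][];
        Arrays.fill(row, EMPTY);
        return row;
    }

    // Per-row work: touch only the projected columns. Skipped slots are never
    // written, so they still hold EMPTY from construction and there is no
    // need to reset them on every row.
    static void next(byte[][] row, int[] projected, byte[][] source) {
        for (int col : projected) {
            row[col] = source[col];
        }
    }
}
```

The sketch only holds if the same projected array is passed to every next() call 
on a given row buffer; if the projection could change mid-scan, a previously 
filled slot would keep stale data and the reset would be needed again.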

> performance improvement for RCFile and ColumnarSerDe in Hive
> ------------------------------------------------------------
>
>                 Key: HIVE-756
>                 URL: https://issues.apache.org/jira/browse/HIVE-756
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>         Attachments: hive-756.patch, hive-756_2.patch
>
>
> There are some easy performance improvements in the columnar storage in Hive 
> I found during Hackathon. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
