[ https://issues.apache.org/jira/browse/HIVE-16198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214720#comment-16214720 ]
Colin Ma commented on HIVE-16198:
---------------------------------

Hi [~teddy.choi], [~mmccline], because of the problem in HIVE-17133, I rebased the patch onto Hive 2.3.0 with some minor changes. To evaluate the performance improvement, I used the following table:
{code}
hive> describe temperature_orc_5g;
t_date                  string
city                    string
temperatures            array<double>
hive> show tblproperties temperature_orc_5g;
COLUMN_STATS_ACCURATE   {"BASIC_STATS":"true"}
numFiles                20
numRows                 100000000
rawDataSize             24100000000
totalSize               1793960785
{code}
I tested with Hive on Spark, using the query {color:#59afe1}select city, avg(temperatures\[0\]), avg(temperatures\[5\]) from temperature_orc_5g where temperatures\[2\] > 20 group by city limit 10{color}. The results are as follows:
|| ||Vectorization disabled||Vectorization enabled||
|Execution time|{color:#d04437}34s{color}|{color:#14892c}26s{color}|

The detailed time cost for the same task, which processes 15154763 rows, is shown in the following table:
|| ||Vectorization disabled||Vectorization enabled||
|Time in RecordReader|{color:#d04437}8.9s{color}|{color:#14892c}5.9s{color}|
|Time in filter operator|{color:#d04437}3.1s{color}|{color:#14892c}0.1s{color}|
|Time in groupBy and follow-up operators|10.8s|11.5s|

I think the improvement is obvious. Do you know why the patch hasn't been committed yet? Thanks.

> Vectorize GenericUDFIndex for ARRAY
> -----------------------------------
>
>                 Key: HIVE-16198
>                 URL: https://issues.apache.org/jira/browse/HIVE-16198
>             Project: Hive
>          Issue Type: Sub-task
>          Components: UDF, Vectorization
>            Reporter: Teddy Choi
>            Assignee: Teddy Choi
>         Attachments: HIVE-16198.1.patch, HIVE-16198.2.patch, HIVE-16198.3.patch
>
>
> Vectorize GenericUDFIndex for array data type.