[ https://issues.apache.org/jira/browse/HIVE-4701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Hanson updated HIVE-4701: ------------------------------ Description: OR filters have been optimized to run with vectorized query execution. IN filters of the form "Column IN (list-of-constants)" are a special case of OR. However, IN does not vectorized currently. E.g. select ddate, count\(\*\) from factsqlengineam_vec_orc where ddate = "2012-05-19 00:00:00" OR ddate = "2012-05-20 00:00:00" or ddate = "2012-05-21 00:00:00" group by ddate; takes about 23 seconds of CPU and select ddate, count\(\*\) from factsqlengineam_vec_orc where ddate IN ("2012-05-19 00:00:00", "2012-05-20 00:00:00", "2012-05-21 00:00:00") group by ddate; takes about 153 seconds of CPU. A simple fix may be that for short IN lists (say <= 64 elements) we turn them into OR by manipulating the query tree before planning whether vectorization can be used. A more complex fix that covers more cases would be to turn longer IN lists into a join so when we eventually support vectorized joins it will be fast. An intermediate approach might be to implement a special IN filter operator that stores the constant values in a sorted array or high-performance hash table (like Cuckoo hashing). was: OR filters have been optimized to run with vectorized query execution. IN filters of the form "Column IN (list-of-constants)" are a special case of OR. However, IN does not vectorized currently. E.g. select ddate, count(*) from factsqlengineam_vec_orc where ddate = "2012-05-19 00:00:00" OR ddate = "2012-05-20 00:00:00" or ddate = "2012-05-21 00:00:00" group by ddate; takes about 23 seconds of CPU and select ddate, count(*) from factsqlengineam_vec_orc where ddate IN ("2012-05-19 00:00:00", "2012-05-20 00:00:00", "2012-05-21 00:00:00") group by ddate; takes about 153 seconds of CPU. A simple fix may be that for short IN lists (say <= 64 elements) we turn them into OR by manipulating the query tree before planning whether vectorization can be used. A more complex fix that covers more cases would be to turn longer IN lists into a join so when we eventually support vectorized joins it will be fast. An intermediate approach might be to implement a special IN filter operator that stores the constant values in a sorted array or high-performance hash table (like Cuckoo hashing). > Optimize filter Column IN ( list-of-constants ) for vectorized execution > ------------------------------------------------------------------------ > > Key: HIVE-4701 > URL: https://issues.apache.org/jira/browse/HIVE-4701 > Project: Hive > Issue Type: Sub-task > Reporter: Eric Hanson > > OR filters have been optimized to run with vectorized query execution. IN > filters of the form "Column IN (list-of-constants)" are a special case of OR. > However, IN does not vectorized currently. > E.g. > select ddate, count\(\*\) from factsqlengineam_vec_orc where ddate = > "2012-05-19 00:00:00" OR ddate = "2012-05-20 00:00:00" or ddate = "2012-05-21 > 00:00:00" group by ddate; > takes about 23 seconds of CPU and > select ddate, count\(\*\) from factsqlengineam_vec_orc where ddate IN > ("2012-05-19 00:00:00", "2012-05-20 00:00:00", "2012-05-21 00:00:00") group > by ddate; > takes about 153 seconds of CPU. > A simple fix may be that for short IN lists (say <= 64 elements) we turn them > into OR by manipulating the query tree before planning whether vectorization > can be used. > A more complex fix that covers more cases would be to turn longer IN lists > into a join so when we eventually support vectorized joins it will be fast. > An intermediate approach might be to implement a special IN filter operator > that stores the constant values in a sorted array or high-performance hash > table (like Cuckoo hashing). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira