[ https://issues.apache.org/jira/browse/HIVE-19889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Janaki Lahorani updated HIVE-19889: ----------------------------------- Fix Version/s: 4.0.0 Status: Patch Available (was: Open) > Wrong results due to PPD of non deterministic functions with CBO > ---------------------------------------------------------------- > > Key: HIVE-19889 > URL: https://issues.apache.org/jira/browse/HIVE-19889 > Project: Hive > Issue Type: Bug > Reporter: Janaki Lahorani > Assignee: Janaki Lahorani > Priority: Major > Fix For: 4.0.0 > > > The following query can give wrong results when CBO is on: > select * from ( > select part1,randum123 > from (SELECT *, cast(rand() as double) AS randum123 FROM testA where > part1='CA' and part2 = 'ABC') a > where randum123 <= 0.5) s where s.randum123 > 0.25 limit 20; > The plan of the query is as follows: > STAGE PLANS: > Stage: Stage-1 > Map Reduce > Map Operator Tree: > TableScan > alias: testa > Statistics: Num rows: 2 Data size: 4580 Basic stats: COMPLETE > Column stats: NONE > Filter Operator > predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: > boolean) > Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: 'CA' (type: string), rand() (type: double) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE > Column stats: NONE > Limit > Number of rows: 20 > Statistics: Num rows: 1 Data size: 2290 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 1 Data size: 2290 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.SequenceFileInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: 20 > Processor Tree: > ListSink > The relevant part in the plan is the filter: > Filter Operator > predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: > boolean) > The predicates s.randum123 > 0.25 and s.randum123 > 0.25 were pushed down. > And randum123 was resolved to rand(). This is bad because it will result in > invocation of rand() two times and rand() UDF is non-deterministic. Both the > rand calls can generate values that can satisfy the predicates independently, > but not together, whereas the original intention of the query is to give > results when rand falls between 0.25 and 0.5. > A sample result: > CA 0.9191984370369802 > CA 0.397933021566812 > where the condition was not satisfied. -- This message was sent by Atlassian JIRA (v7.6.3#76005)