ramitg254 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3605348137

   Hi,
   this is regarding the added unit tests:
   1. Why I went with a driver-based approach instead of simply adding a q file under an already present driver:
      -> `hive.stats.fetch.bitvector` and `hive.metastore.direct.sql.batch.size` cannot be set at query time inside a q file; they need to be set in `hive-site.xml` (see the sketch after this list).
      
    2. Why I went with a docker image instead of a q file that loads the data into the table:
      -> These changes are for the scenario where statistics have already been computed and only then does someone set `hive.metastore.direct.sql.batch.size` and restart HMS; in that case the results are different. Replicating this in a unit test by defining the table and loading the data in a q file is not possible, because it gives an ORM-related exception when a debug breakpoint is placed, even though the statements do get executed. So what I did is create a `test_stats` table with data in 13 partitions, compute statistics for its columns, build a docker image with the stats dump, and use that docker image (a sketch of this setup also follows the list).
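   To make point 1 concrete, the `hive-site.xml` for the new driver would contain entries roughly like the sketch below; the values are illustrative (in particular, whether `hive.stats.fetch.bitvector` is true or false depends on the q file), not a copy of the actual file in the PR:
   ```xml
   <!-- illustrative sketch only: values are assumptions, not the PR's actual hive-site.xml -->
   <property>
     <name>hive.metastore.direct.sql.batch.size</name>
     <value>5</value>
   </property>
   <property>
     <name>hive.stats.fetch.bitvector</name>
     <value>true</value>
   </property>
   ```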
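   And for point 2, a rough sketch of how the stats baked into the docker image could have been produced; the column names are made up, and only the table name, the 13 partitions, and the column-stats computation come from the description above:
   ```sql
   -- illustrative sketch: column names are assumptions; the table name, the 13
   -- partitions and the column statistics computation are what is described above
   CREATE TABLE test_stats (id INT, val STRING) PARTITIONED BY (p INT);
   -- ... load data into 13 partitions (p=1..13) ..., then compute column stats:
   ANALYZE TABLE test_stats PARTITION (p) COMPUTE STATISTICS FOR COLUMNS;
   ```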
     
   The added q files perform a self join with `hive.metastore.direct.sql.batch.size=5`, and `hive.stats.fetch.bitvector` is set according to the q file name.
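   For reference, since the q files do a self join on that table, their shape is roughly the sketch below (the actual query and projected columns in the q files may differ):
   ```sql
   -- rough shape only; the actual query in the q files may differ
   EXPLAIN
   SELECT a.id, b.val
   FROM test_stats a
   JOIN test_stats b ON a.id = b.id;
   ```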
   
   You can validate the results by cherry-picking the commit that adds the unit tests on top of the current master branch and running:
   `mvn test -pl itests/qtest -Pitests -Dtest=TestTezBatchedStatsCliDriver -Dqfile=sketch_query.q -Dtest.output.overwrite`
   (and similarly for `no_sketch_query.q`).
   You will see results that differ from the expected output added in the commit. If you then just drop the `hive.metastore.direct.sql.batch.size` property from the `hive-site.xml` added for this driver, you will see the expected results, which validates that the corrected expected output with these changes is the right one for the batching case.
   
   FYI: the stats are added in such a way that overlapping column values fall across batches of size 5, so simply taking the maximum of the per-batch NDVs is not enough to reproduce the NDV computed without batching. Only in that case does the test actually fail; otherwise it would pass even when redundant per-column stats from different batches are merely merged at the higher level with a simple function like max.
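   To illustrate with made-up numbers: if one batch of 5 partitions sees the distinct values 1..7 for a column (NDV 7) and the next batch sees 5..12 (NDV 8), merging the batches with a plain max gives 8, whereas the NDV computed over all partitions at once is 12; that divergence is exactly what the overlapping values are placed to force.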
     

