ramitg254 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3605348137
Hi,
this is regarding the unit tests added:
1. Why I went with a driver-based approach instead of simply adding a q
file under an existing driver:
-> `hive.stats.fetch.bitvector` and
`hive.metastore.direct.sql.batch.size` cannot be set at query time inside
a q file; they need to be set in `hive-site.xml`.
2. Why I used a docker image instead of a q file that loads the data into
the table:
-> these changes target the scenario where statistics have already been
computed, someone then sets `hive.metastore.direct.sql.batch.size` and
restarts HMS, and the results come out different. Replicating this in a
unit test by defining the table and loading the data in a q file is not
possible, because that path throws an ORM-related exception when a debug
breakpoint is placed, even though it otherwise executes. So what I did is
create a `test_stats` table with data in 13 partitions, compute
statistics for its columns, build a docker image containing the stats
dump, and use that docker image.
The q files added perform a self join when
`hive.metastore.direct.sql.batch.size=5`, and the
`hive.metastore.direct.sql.batch.size` property is set according to the q file name.
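For point 1, the driver's `hive-site.xml` would carry these properties. A minimal sketch only (the value for `hive.stats.fetch.bitvector` is an assumption on my part; the actual file added in this PR is authoritative):

```xml
<!-- Sketch: properties a q file cannot set at query time.
     Batch size 5 matches the q files; the bitvector value is assumed. -->
<property>
  <name>hive.metastore.direct.sql.batch.size</name>
  <value>5</value>
</property>
<property>
  <name>hive.stats.fetch.bitvector</name>
  <value>true</value>
</property>
```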
You can validate the results by cherry-picking the commit that adds the
unit tests onto the current master branch and running:
`mvn test -pl itests/qtest -Pitests -Dtest=TestTezBatchedStatsCliDriver
-Dqfile=sketch_query.q -Dtest.output.overwrite`
(and similarly for `no_sketch_query.q`).
You will see results that differ from the expected output added in the
commit. If you then drop the `hive.metastore.direct.sql.batch.size`
property from the `hive-site.xml` added for this driver, you will see the
expected results, which validates that the corrected expected output is
right for the batched case.
FYI: the stats are constructed so that overlapping values fall across the
batches of size 5. As a result, simply taking the maximum of the
per-batch NDVs is not enough to reproduce the NDV computed without
batching; this is the only arrangement that fails. Otherwise the test
would pass even with redundant per-column elements, since those just get
merged at the higher level by a simple function like maximum.
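To illustrate that point with a minimal, self-contained sketch (plain Python, not Hive code; the partition layout is made up so that distinct values overlap across neighbouring partitions): when distinct values span batch boundaries, the exact NDV of the union exceeds the maximum of the per-batch NDVs, so a max-merge across batches underestimates the unbatched result.

```python
# 13 partitions, each holding a small set of distinct column values.
# Sliding windows make values overlap across partitions and batches.
partitions = [set(range(i, i + 3)) for i in range(13)]

# Exact NDV over all partitions, as an unbatched computation would see it.
exact = len(set().union(*partitions))

# Batched estimate: aggregate 5 partitions at a time, then merge the
# per-batch NDVs by taking the maximum.
batch_size = 5
batch_ndvs = [
    len(set().union(*partitions[i:i + batch_size]))
    for i in range(0, len(partitions), batch_size)
]
max_merged = max(batch_ndvs)

print(exact, max_merged)  # prints: 15 7
```

Here the max-merge reports 7 distinct values while the true union has 15, which is why stats laid out this way catch the batching bug that a max-style merge would otherwise hide.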
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]