Hey all! Over the past few months I have been working with Jesus Camacho Rodriguez on integrating DataSketches more tightly with Hive [1].
So, starting with Hive 4.0, almost all of the DataSketches functions will be available by default. To do this, I had to come up with a naming convention (ds_{sketchType}_{functionName}) to register all the ds functions. I will contribute some of these changes back, although so far I have been able to avoid changing datasketches-hive at all. I've noticed that some "simple" functions are missing, and they should be there just for completeness (iirc mostly the toString function and probably a few more).
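To make the naming convention concrete, here is a minimal sketch of how a registered name is put together from the ds_{sketchType}_{functionName} pattern. The helper name and the sample sketch type / function pairs are only illustrative assumptions, not the actual registration code in Hive.

```python
def ds_function_name(sketch_type: str, function_name: str) -> str:
    """Build a name following the ds_{sketchType}_{functionName} pattern.

    Hypothetical helper for illustration; it is not part of Hive
    or datasketches-hive.
    """
    return f"ds_{sketch_type}_{function_name}"


if __name__ == "__main__":
    # The (sketch type, function) pairs below are assumed examples.
    for sketch_type, function_name in [("hll", "sketch"), ("hll", "estimate"), ("kll", "rank")]:
        print(ds_function_name(sketch_type, function_name))
```

So an "estimate" function of an "hll" sketch type would end up as ds_hll_estimate, and so on for the other sketch families.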
Probably the most interesting part for you is that, by utilizing Calcite, a set of rules can transparently rewrite COUNT(DISTINCT)/PERCENTILE_DISC/CUME_DIST/RANK/NTILE to use sketch functions! :)
Materialized views are also supported, so sketches can be stored precomputed (and rolled up). If you would like a quick look at what it does, the test for rewriting rank [2] shows a few statements.

Thank you for this great library!

cheers,
Zoltan

[1] https://issues.apache.org/jira/browse/HIVE-22939
[2] https://github.com/apache/hive/blob/e4256fc91fe2c123428400f3737883a83208d29e/ql/src/test/queries/clientpositive/sketches_rewrite_rank.q#L15
