[ https://issues.apache.org/jira/browse/HIVE-26243?focusedWorklogId=779005&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-779005 ]
ASF GitHub Bot logged work on HIVE-26243: ----------------------------------------- Author: ASF GitHub Bot Created on: 07/Jun/22 11:41 Start Date: 07/Jun/22 11:41 Worklog Time Spent: 10m Work Description: asolimando commented on PR #3317: URL: https://github.com/apache/hive/pull/3317#issuecomment-1148553006 > PR-1824 has nothing to do with datasketches; I don't know how you followed it's conventions but you might end up in trouble...because DS also has a HLL implementation...wouldn't that conflict with the existing one? HyperLogLog is one of the few examples of vectorized functions over a "complex" data type, it seemed like a good candidate to me. > note: PR-1824 named the file `VectorUDAFComputeBitVector.txt` and internally named the method `compute_bit_vector_hll` ; I think the class name should have contained the `Hll` keyword > > I think that the file name `VectorUDAFComputeKLL.txt` is not connected at all to the `ds_kll_sketch` function its about to vectorize...and as such its a bit confusing.... I agree on that, maybe `VectorUDAF_ds_kll_sketch` would be clearer. > The current implementation doesn't really look forward: I think we have 20 _sketch_ function from datasketches already exposed as inside Hive which could be vectorized; I think they are behind the same api cover...so just vectorizing the KLL one without any sight forward and taking "ideas" from the old hll codepath doesn't seem the best idea to me... > > ``` > grep ^ds_ ql/src/test/results/clientpositive/llap/show_functions.q.out|grep _sketch$ > ``` > > no need to do everything in 1 patch - but this is pretty much just copy-pasting the existing hll txtfile substituted to kll here and there...so we should do that 20 times? Sketches are behind the same api for methods like `update()` and `serialize()`, but they also have differences, like the supported constructors, it seems hard to have a uniform UDAF for all of them. I have omitted the constructor for supporting the K param of Kll sketches in the current PR, but it will be needed later on (I have implemented it as suggested here: https://github.com/apache/hive/pull/1824#issuecomment-756588127), so we will have to have a specialized version of the KLL UDAF anyway, and the same for all the other 9 sketches, I don't see a way out of this. > > For instance, you seem to be suggesting to remove all helper classes/methods etc > > I don't think those changes neccessary in the _metastore_ for a vectorization of this function? > > [HIVE-26221](https://issues.apache.org/jira/browse/HIVE-26221) is something which have changes - but has no real end-user accessible value - and as such I don't think its ready. [HIVE-26221](https://issues.apache.org/jira/browse/HIVE-26221) already sensibly improves range filter estimation for runtime statistics that can be used by Tez, of course exploiting statistics in query planning is the next obvious step, but I am a fan of iterative improvements, especially which such a big patch spanning few modules. Let's have this discussion offline as it is not key here, I just wanted to highlight the needs behind the current patch by linking to this ticket. I am trying to figure out a better place to put an interface plus the concrete implementation(s) of the "complex" object incapsulating the different sketches: the only examples I could find are `HyperLogLog` (stored in the `metastore-server` module) and `BloomFilter`/`BloomKFilter` (stored in the `storage-api` module). As of now I have followed the `HyperLogLog` example but you seem to disagree, `storage-api` seems pretty arbitrary too, so I am not sure how to proceed here, do you have any suggestions? Issue Time Tracking ------------------- Worklog Id: (was: 779005) Time Spent: 40m (was: 0.5h) > Add vectorized implementation of the 'ds_kll_sketch' UDAF > --------------------------------------------------------- > > Key: HIVE-26243 > URL: https://issues.apache.org/jira/browse/HIVE-26243 > Project: Hive > Issue Type: Improvement > Components: UDF, Vectorization > Affects Versions: 4.0.0-alpha-2 > Reporter: Alessandro Solimando > Assignee: Alessandro Solimando > Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > _ds_kll_sketch_ UDAF does not have a vectorized implementation at the moment, > the present ticket aims at bridging this gap. > This is particularly important because vectorization has an "all or nothing" > approach, so if this function is used at the side of vectorized functions, > they won't be able to benefit from vectorized execution. -- This message was sent by Atlassian Jira (v8.20.7#820007)