[ 
https://issues.apache.org/jira/browse/HIVE-26243?focusedWorklogId=779005&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-779005
 ]

ASF GitHub Bot logged work on HIVE-26243:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Jun/22 11:41
            Start Date: 07/Jun/22 11:41
    Worklog Time Spent: 10m 
      Work Description: asolimando commented on PR #3317:
URL: https://github.com/apache/hive/pull/3317#issuecomment-1148553006

   > PR-1824 has nothing to do with datasketches; I don't know how you followed 
it's conventions but you might end up in trouble...because DS also has a HLL 
implementation...wouldn't that conflict with the existing one?
   
   HyperLogLog is one of the few examples of vectorized functions over a 
"complex" data type, it seemed like a good candidate to me.
   
   > note: PR-1824 named the file `VectorUDAFComputeBitVector.txt` and 
internally named the method `compute_bit_vector_hll` ; I think the class name 
should have contained the `Hll` keyword
   > 
   > I think that the file name `VectorUDAFComputeKLL.txt` is not connected at 
all to the `ds_kll_sketch` function its about to vectorize...and as such its a 
bit confusing....
   
   I agree on that, maybe `VectorUDAF_ds_kll_sketch` would be clearer.
   
   > The current implementation doesn't really look forward: I think we have 20 
_sketch_ function from datasketches already exposed as inside Hive which could 
be vectorized; I think they are behind the same api cover...so just vectorizing 
the KLL one without any sight forward and taking "ideas" from the old hll 
codepath doesn't seem the best idea to me...
   > 
   > ```
   > grep ^ds_ 
ql/src/test/results/clientpositive/llap/show_functions.q.out|grep _sketch$
   > ```
   > 
   > no need to do everything in 1 patch - but this is pretty much just 
copy-pasting the existing hll txtfile substituted to kll here and there...so we 
should do that 20 times?
   
   Sketches are behind the same api for methods like `update()` and 
`serialize()`, but they also have differences, like the supported constructors, 
it seems hard to have a uniform UDAF for all of them. I have omitted the 
constructor for supporting the K param of Kll sketches in the current PR, but 
it will be needed later on (I have implemented it as suggested here: 
https://github.com/apache/hive/pull/1824#issuecomment-756588127), so we will 
have to have a specialized version of the KLL UDAF anyway, and the same for all 
the other 9 sketches, I don't see a way out of this.
   
   > > For instance, you seem to be suggesting to remove all helper 
classes/methods etc
   > > I don't think those changes neccessary in the _metastore_ for a 
vectorization of this function?
   > 
   > [HIVE-26221](https://issues.apache.org/jira/browse/HIVE-26221) is 
something which have changes - but has no real end-user accessible value - and 
as such I don't think its ready.
   
   [HIVE-26221](https://issues.apache.org/jira/browse/HIVE-26221) already 
sensibly improves range filter estimation for runtime statistics that can be 
used by Tez, of course exploiting statistics in query planning is the next 
obvious step, but I am a fan of iterative improvements, especially which such a 
big patch spanning few modules. Let's have this discussion offline as it is not 
key here, I just wanted to highlight the needs behind the current patch by 
linking to this ticket.
   
   I am trying to figure out a better place to put an interface plus the 
concrete implementation(s) of the "complex" object incapsulating the different 
sketches: the only examples I could find are `HyperLogLog` (stored in the 
`metastore-server` module) and `BloomFilter`/`BloomKFilter` (stored in the 
`storage-api` module). As of now I have followed the `HyperLogLog` example but 
you seem to disagree, `storage-api` seems pretty arbitrary too, so I am not 
sure how to proceed here, do you have any suggestions?




Issue Time Tracking
-------------------

    Worklog Id:     (was: 779005)
    Time Spent: 40m  (was: 0.5h)

> Add vectorized implementation of the 'ds_kll_sketch' UDAF
> ---------------------------------------------------------
>
>                 Key: HIVE-26243
>                 URL: https://issues.apache.org/jira/browse/HIVE-26243
>             Project: Hive
>          Issue Type: Improvement
>          Components: UDF, Vectorization
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> _ds_kll_sketch_ UDAF does not have a vectorized implementation at the moment, 
> the present ticket aims at bridging this gap.
> This is particularly important because vectorization has an "all or nothing" 
> approach, so if this function is used at the side of vectorized functions, 
> they won't be able to benefit from vectorized execution.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to