[
https://issues.apache.org/jira/browse/HIVE-26243?focusedWorklogId=779005&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-779005
]
ASF GitHub Bot logged work on HIVE-26243:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 07/Jun/22 11:41
Start Date: 07/Jun/22 11:41
Worklog Time Spent: 10m
Work Description: asolimando commented on PR #3317:
URL: https://github.com/apache/hive/pull/3317#issuecomment-1148553006
> PR-1824 has nothing to do with datasketches; I don't know how you followed
it's conventions but you might end up in trouble...because DS also has a HLL
implementation...wouldn't that conflict with the existing one?
HyperLogLog is one of the few examples of vectorized functions over a
"complex" data type, it seemed like a good candidate to me.
> note: PR-1824 named the file `VectorUDAFComputeBitVector.txt` and
internally named the method `compute_bit_vector_hll` ; I think the class name
should have contained the `Hll` keyword
>
> I think that the file name `VectorUDAFComputeKLL.txt` is not connected at
all to the `ds_kll_sketch` function its about to vectorize...and as such its a
bit confusing....
I agree on that, maybe `VectorUDAF_ds_kll_sketch` would be clearer.
> The current implementation doesn't really look forward: I think we have 20
_sketch_ function from datasketches already exposed as inside Hive which could
be vectorized; I think they are behind the same api cover...so just vectorizing
the KLL one without any sight forward and taking "ideas" from the old hll
codepath doesn't seem the best idea to me...
>
> ```
> grep ^ds_
ql/src/test/results/clientpositive/llap/show_functions.q.out|grep _sketch$
> ```
>
> no need to do everything in 1 patch - but this is pretty much just
copy-pasting the existing hll txtfile substituted to kll here and there...so we
should do that 20 times?
Sketches are behind the same api for methods like `update()` and
`serialize()`, but they also have differences, like the supported constructors,
it seems hard to have a uniform UDAF for all of them. I have omitted the
constructor for supporting the K param of Kll sketches in the current PR, but
it will be needed later on (I have implemented it as suggested here:
https://github.com/apache/hive/pull/1824#issuecomment-756588127), so we will
have to have a specialized version of the KLL UDAF anyway, and the same for all
the other 9 sketches, I don't see a way out of this.
> > For instance, you seem to be suggesting to remove all helper
classes/methods etc
> > I don't think those changes neccessary in the _metastore_ for a
vectorization of this function?
>
> [HIVE-26221](https://issues.apache.org/jira/browse/HIVE-26221) is
something which have changes - but has no real end-user accessible value - and
as such I don't think its ready.
[HIVE-26221](https://issues.apache.org/jira/browse/HIVE-26221) already
sensibly improves range filter estimation for runtime statistics that can be
used by Tez, of course exploiting statistics in query planning is the next
obvious step, but I am a fan of iterative improvements, especially which such a
big patch spanning few modules. Let's have this discussion offline as it is not
key here, I just wanted to highlight the needs behind the current patch by
linking to this ticket.
I am trying to figure out a better place to put an interface plus the
concrete implementation(s) of the "complex" object incapsulating the different
sketches: the only examples I could find are `HyperLogLog` (stored in the
`metastore-server` module) and `BloomFilter`/`BloomKFilter` (stored in the
`storage-api` module). As of now I have followed the `HyperLogLog` example but
you seem to disagree, `storage-api` seems pretty arbitrary too, so I am not
sure how to proceed here, do you have any suggestions?
Issue Time Tracking
-------------------
Worklog Id: (was: 779005)
Time Spent: 40m (was: 0.5h)
> Add vectorized implementation of the 'ds_kll_sketch' UDAF
> ---------------------------------------------------------
>
> Key: HIVE-26243
> URL: https://issues.apache.org/jira/browse/HIVE-26243
> Project: Hive
> Issue Type: Improvement
> Components: UDF, Vectorization
> Affects Versions: 4.0.0-alpha-2
> Reporter: Alessandro Solimando
> Assignee: Alessandro Solimando
> Priority: Major
> Labels: pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h
>
> _ds_kll_sketch_ UDAF does not have a vectorized implementation at the moment,
> the present ticket aims at bridging this gap.
> This is particularly important because vectorization has an "all or nothing"
> approach, so if this function is used at the side of vectorized functions,
> they won't be able to benefit from vectorized execution.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)