>>>>>> Torenwacht 98
>>>>>> 2353 DC Leiderdorp
>>>>>> hvanhov...@questtec.nl
>>>>>> +599 9 521 4402
>>>>>>
>>>>>> 2015-09-12 10:07 GMT+02:00 Nick Pentreath <nick.pentre...@gmail.com>:
>>>>>>
>>>>>>> [...] aggregate operators. Are there any opinions on this?
>>>>>>>
>>>>>>> Kind regards,
>>>>>>>
>>>>>>> Herman van Hövell tot Westerflier
>>>>> Inspired by this post:
>>>>> http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/,
>>>>> [...] based on the Spark 1.5 UDAF
>>>>> interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141
>>>>>
>>>>> Some questions -
>>>>>
>>>>> 1. How do I get the UDAF to accept input arguments of different types?
>>>>> Int, Long, String, Object, raw bytes etc. Right now it seems we'd need
>>>>> to build a new UDAF for each input type, which seems strange - I should
>>>>> be able to use one UDAF that can handle raw input of different types,
>>>>> as well as handle existing HLLs that can be merged/aggregated (e.g. for
>>>>> grouped data)
>>>>> 2. @Reynold, how would I ensure this works for Tungsten (ie against raw
>>>>> bytes in memory)? Or does the new Aggregate2 stuff automatically do
>>>>> that? Where should I look for examples on how this works internally?
>>>>> 3. I've based this on the Sum and Avg examples for the new UDAF
>>>>> interface - any suggestions or issues please advise. Is the
>>>>> intermediate buffer efficient?
>>>>> 4. The current HyperLogLogUDT is private - so I've had to make my own
>>>>> one which is a bit pointless as it's copy-pasted. Any thoughts on
>>>>> exposing [...]
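For what it's worth, the merge requirement behind question 1 (handling existing HLLs that can be merged/aggregated for grouped data) comes down to the sketch being a commutative monoid: registers can be combined by element-wise max. A minimal, self-contained toy illustration in plain Scala (hypothetical names, not the gist's code and not Spark's implementation):

```scala
import scala.util.hashing.MurmurHash3

// Toy fixed-size sketch in the spirit of HyperLogLog. Each register keeps
// the maximum "rank" (leading zeros + 1) of hashes routed to it, so two
// sketches built over different partitions merge by element-wise max --
// exactly the property a UDAF's intermediate buffer needs for grouped data.
class ToySketch(val p: Int = 12) {
  val m: Int = 1 << p                          // number of registers
  val registers: Array[Byte] = new Array[Byte](m)

  def add(value: String): Unit = {
    val h = MurmurHash3.stringHash(value)
    val idx = h >>> (32 - p)                   // top p bits pick the register
    val rest = h << p                          // remaining bits give the rank
    val rank = (Integer.numberOfLeadingZeros(rest) + 1).min(32 - p) // capped for the toy
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  def merge(other: ToySketch): ToySketch = {
    val out = new ToySketch(p)
    var i = 0
    while (i < m) {
      out.registers(i) = math.max(registers(i), other.registers(i)).toByte
      i += 1
    }
    out
  }
}
```

Because merge is just max over byte arrays, the same buffer layout works whether the input column is raw values (hash then update) or already-serialized sketches (deserialize then merge), which is what a single type-flexible UDAF would need.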
[...] a cardinality field and a binary field containing the serialized HLL.
I was wondering if there would be interest in something like this? I am not
so clear on how UDTs work with regards to SerDe - so could one adapt the
HyperLogLogUDT to be a Struct with the serialized HLL as a field as well as
count as a field? Then I assume this would automatically play nicely with
DataFrame I/O etc. The gotcha is one needs to then call
approx_count_field.count (or is there a concept of a default field for a
Struct?). Also [...]
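To make the Struct idea concrete: in Spark SQL terms this would presumably be a StructType with a LongType cardinality field and a BinaryType field for the serialized registers. Sketched below as a plain Scala case class (hypothetical names, not a working UDT) to show the field-access gotcha mentioned above:

```scala
// Hypothetical shape for the proposed Struct-backed HLL value: a long
// cardinality alongside the serialized registers. In Spark SQL this would
// map to StructType(StructField(count, LongType), StructField(hll, BinaryType)).
case class ApproxCountStruct(count: Long, hll: Array[Byte])

// Callers must reach through a named field -- the "approx_count_field.count"
// gotcha from the thread, since a Struct has no notion of a default field.
val v = ApproxCountStruct(42L, Array[Byte](1, 2, 3))
val estimate = v.count
```

The upside is that both the estimate and the mergeable binary payload survive DataFrame I/O together; the downside is exactly the extra `.count` projection every consumer has to write.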