Re: HyperLogLogUDT

2015-09-13 Thread Nick Pentreath
Torenwacht 98, 2353 DC Leiderdorp, hvanhov...@questtec.nl, +599 9 521 4402 2015-09-12 10:07 GMT+02:00 Nick Pentreath <nick.pentre...@gmail.co

Re: HyperLogLogUDT

2015-09-13 Thread Yin Huai
egate operators. Are there any opinions on this? Kind regards, Herman van Hövell tot Westerflier

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
sed on the Spark 1.5 UDAF interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141 Some questions - 1. How do I get the UDAF to accept input arguments of different type?

Re: HyperLogLogUDT

2015-09-12 Thread Yin Huai
15-09-12 10:07 GMT+02:00 Nick Pentreath <nick.pentre...@gmail.com>: Inspired by this post: http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/,

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
. I've based this on the Sum and Avg examples for the new UDAF interface - any suggestions or issue please advise. Is the intermediate buffer efficient? 4. The current HyperLogLogUDT is private - so I've had to make my own one which is a bit pointless as it's copy-pasted. Any thoughts on exposing

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
2. @Reynold, how would I ensure this works for Tungsten (ie against raw bytes in memory)? Or does the new Aggregate2 stuff automatically do that? Where should I look for examples on how this works internally? 3. I've based this on the Sum and Avg examples for the

Re: HyperLogLogUDT

2015-09-12 Thread Herman van Hövell tot Westerflier
I look for examples on how this works internally? 3. I've based this on the Sum and Avg examples for the new UDAF interface - any suggestions or issue please advise. Is the intermediate buffer efficient? 4. The current HyperLogLogUDT is private - so I've had to make my ow

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
an be merged/aggregated (e.g. for grouped data) 2. @Reynold, how would I ensure this works for Tungsten (ie against raw bytes in memory)? Or does the new Aggregate2 stuff automatically do that? Where should I look for examples on how this works internally?

Re: HyperLogLogUDT

2015-09-12 Thread Herman van Hövell tot Westerflier
Int, Long, String, Object, raw bytes etc. Right now it seems we'd need to build a new UDAF for each input type, which seems strange - I should be able to use one UDAF that can handle raw input of different types, as well as handle existing HLLs that

Re: HyperLogLogUDT

2015-07-02 Thread Reynold Xin
- so could one adapt the HyperLogLogUDT to be a Struct with the serialized HLL as a field as well as count as a field? Then I assume this would automatically play nicely with DataFrame I/O etc. The gotcha is one needs to then call approx_count_field.count (or is there a concept of a default

Re: HyperLogLogUDT

2015-07-02 Thread Nick Pentreath
a cardinality field and a binary field containing the serialized HLL. I was wondering if there would be interest in something like this? I am not so clear on how UDTs work with regards to SerDe - so could one adapt the HyperLogLogUDT to be a Struct with the serialized HLL as a field as well as count

Re: HyperLogLogUDT

2015-07-01 Thread Nick Pentreath
the HyperLogLogUDT to be a Struct with the serialized HLL as a field as well as count as a field? Then I assume this would automatically play nicely with DataFrame I/O etc. The gotcha is one needs to then call approx_count_field.count (or is there a concept of a default field for a Struct?). Also

Re: HyperLogLogUDT

2015-07-01 Thread Daniel Darabos
work with regards to SerDe - so could one adapt the HyperLogLogUDT to be a Struct with the serialized HLL as a field as well as count as a field? Then I assume this would automatically play nicely with DataFrame I/O etc. The gotcha is one needs to then call approx_count_field.count

HyperLogLogUDT

2015-06-23 Thread Nick Pentreath
the serialized HLL. I was wondering if there would be interest in something like this? I am not so clear on how UDTs work with regards to SerDe - so could one adapt the HyperLogLogUDT to be a Struct with the serialized HLL as a field as well as count as a field? Then I assume this would automatically