Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-148306575
we can create a hash expression, and codegen that. And then just use
hyperloglog(hash(field)).
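A minimal sketch of what the `hyperloglog(hash(field))` composition suggested above amounts to. This is plain Python for illustration, not Spark code; the names (`P`, `hll_update`) are made up:

```python
# Sketch of hyperloglog(hash(field)): a 64-bit hash is split into P index
# bits (which register to update) and the remaining bits, whose count of
# leading zeros (+1) becomes the candidate register value.
P = 14                                      # precision: m = 2^P registers
M = 1 << P

def hll_update(registers, h64):
    idx = h64 >> (64 - P)                   # top P bits select the register
    rest = h64 & ((1 << (64 - P)) - 1)      # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1 # leading zeros in rest, plus one
    registers[idx] = max(registers[idx], rank)

registers = [0] * M
hll_update(registers, 0x8000000000000000)   # only the top bit set
# -> register 8192 (= 2^13) now holds rank 51 (all 50 low bits are zero)
```

This is why a generic, well-mixed hash is all the aggregate itself needs to see.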
Github user hvanhovell commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-148305817
Another thought on hashing. The ClearSpring hash is a generic hash
function. We could use very specialized (hopefully fast) hashing functions,
because we know the type
Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-148227211
Thanks for the pointer. Looks like we only have 32-bit Murmur3 in Spark's
unsafe module
(https://github.com/apache/spark/blob/master/unsafe/src/main/java/org/apache/spark/
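For reference, the usual mixing step a 64-bit Murmur3 variant builds on is the `fmix64` finalizer from Austin Appleby's MurmurHash3. A plain-Python transcription for illustration (not Spark's API, which at the time of this thread exposed only 32-bit Murmur3):

```python
# The 64-bit finalizer ("fmix64") from MurmurHash3: xor-shift and multiply
# rounds that diffuse every input bit across the full 64-bit output.
MASK64 = (1 << 64) - 1

def fmix64(k):
    k ^= k >> 33
    k = (k * 0xff51afd7ed558ccd) & MASK64
    k ^= k >> 33
    k = (k * 0xc4ceb9fe1a85ec53) & MASK64
    k ^= k >> 33
    return k
```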
Github user hvanhovell commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-148209543
A good article on HLL++ and the hashcode:
http://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll
Github user hvanhovell commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-148209375
@yhuai It doesn't. A 64-bit hashcode is recommended though, especially when
you would want to approximate a billion or more unique values. I have used the
ClearSpring hash
Github user yhuai commented on a diff in the pull request:
https://github.com/apache/spark/pull/8362#discussion_r42054243
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala
---
@@ -302,3 +307,397 @@ case class Sum(child: Expre
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/8362
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enab
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-144479867
LGTM, merging this into master, thanks!
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-144479889
mer
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-11597
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-11434
[Test build #43132 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43132/console)
for PR 8362 at commit
[`a5fdd07`](https://github.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-11593
Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-144391587
[Test build #43132 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43132/consoleFull)
for PR 8362 at commit
[`a5fdd07`](https://gith
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-144390880
Merged build triggered.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-144390907
Merged build started.
Github user hvanhovell commented on a diff in the pull request:
https://github.com/apache/spark/pull/8362#discussion_r40782166
--- Diff:
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlusSuite.scala
---
@@ -0,0 +1,125 @@
+/*
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-144136673
We can work on improving the aggregate operator.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-144118463
@hvanhovell @rxin Just realized that the tungsten aggregation does not
support var-length types in the aggregation buffer, so we can't have a sparse
version without aggregation
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/8362#discussion_r40696819
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala
---
@@ -302,3 +307,393 @@ case class Sum(child: Expr
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/8362#discussion_r40697020
--- Diff:
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlusSuite.scala
---
@@ -0,0 +1,125 @@
+/*
+ *
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-143945761
@yhuai can we make the non-codegen path use tungsten aggregate as well?
Otherwise we would need to maintain two entirely separate codepaths.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-143902631
I took a pass; this looks pretty good to me overall.
Currently, each grouping key needs 200 bytes (b=8, by default), so the
sparse version could help to reduce
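The ~200-byte figure can be reconstructed with a bit of arithmetic, under assumptions of mine that the thread does not state: b is the HLL precision, and the 2^b registers are 6 bits each (enough for ranks from a 64-bit hash), packed into 64-bit words.

```python
# Back-of-the-envelope for the ~200 bytes per grouping key quoted above.
# Assumptions (mine, not from the thread): b is the HLL precision, registers
# are 6 bits each, packed into 64-bit words.
b = 8
m = 2 ** b                                  # number of registers
bits_per_register = 6
words = -(-m * bits_per_register // 64)     # ceil(m * 6 / 64) 64-bit words
buffer_bytes = words * 8
print(buffer_bytes)                         # 192, i.e. roughly the quoted 200
```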
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/8362#discussion_r40610794
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala
---
@@ -302,3 +307,393 @@ case class Sum(child: Expr
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/8362#discussion_r40610006
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala
---
@@ -302,3 +307,393 @@ case class Sum(child: Expr
Github user hvanhovell commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-142323666
@MLnick I am in the process of moving house, so I am a bit slow/late with
my response :(...
I think it is very useful to be able to return the HLL registers
Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-140141999
@MLnick This one will replace the existing implementation. For now, we will
do conversion as shown at
https://github.com/apache/spark/pull/8362/files#diff-78b9b210b8cee72e
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-140019691
@hvanhovell @rxin is it intended that this replace the existing
`approxCountDistinct` implementation? And I assume this will happen
automatically due to extending `Aggre
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-140019207
@hvanhovell as discussed on the `dev` mailing list, perhaps it would be
interesting to allow the return type to include the aggregated HLL registers.
This could be (for e
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-139406671
One quick note: https://github.com/twitter/algebird/pull/491/files
anything we can learn from the above pr?
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-138699022
Ah ok. will add this to our sprint backlog and get somebody to review it
soon.
Github user hvanhovell commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-138493630
@rxin the dense version of HLL++ is ready. We could also add this, and add
the sparse logic in a follow-up PR. Let me know what you think. I'll close if
you'd rather
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-138384295
@hvanhovell do you mind closing this pull request, and re-open when you
feel it is ready for review again?
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-135639349
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-135639351
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-135639256
[Test build #41719 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41719/console)
for PR 8362 at commit
[`1ea722b`](https://github.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-135619890
[Test build #41719 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41719/consoleFull)
for PR 8362 at commit
[`1ea722b`](https://gith
Github user hvanhovell commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-135619245
Implemented initial non-sparse HLL++. I am going to take a look at the
sparse version next week. The results are still equal to the ClearSpring HLL++
implementation in
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-135617993
Merged build started.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-135617983
Merged build triggered.
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-133567281
Thanks - I think blackbox testing is fine. But it would be great to apply
that at the "unit" testing level, i.e. running directly against the aggregate
function, rather tha
Github user hvanhovell commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-133566608
Thanks.
I was aiming for compatibility with the existing approxCountDistinct, but
we can also implement HLL++. HLL++ introduces three (orthogonal) refinements
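The accuracy side of the trade-off being discussed is governed by HyperLogLog's standard error, approximately 1.04 / sqrt(m) with m = 2^p registers (from Flajolet et al.'s HyperLogLog paper). A quick sketch of the accuracy/memory relationship:

```python
# HLL's relative standard error is approximately 1.04 / sqrt(m) with
# m = 2^p registers -- the curve the refinements under discussion trade
# against memory.
import math

def rel_error(p):
    return 1.04 / math.sqrt(2 ** p)

print(round(rel_error(14), 4))   # 0.0081 -> about 0.8% with 16384 registers
```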
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-133562453
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-133562454
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-133562302
[Test build #41377 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41377/console)
for PR 8362 at commit
[`e178d9e`](https://github.
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-133544919
This made my day. The approach is super cool.
Couple suggestions:
1. Can we use HyperLogLogPlus? It's also in streamlib:
https://github.com/addthis/stream-
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-133525793
[Test build #41377 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41377/consoleFull)
for PR 8362 at commit
[`e178d9e`](https://gith
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-133523850
Merged build started.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-133523827
Merged build triggered.
Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-133523761
ok to test
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8362#issuecomment-133515988
Can one of the admins verify this patch?
GitHub user hvanhovell opened a pull request:
https://github.com/apache/spark/pull/8362
[SPARK-9741][SQL] Approximate Count Distinct using the new UDAF interface.
This PR implements a HyperLogLog based Approximate Count Distinct function
using the new UDAF interface.
The im
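As a rough sketch of what a HyperLogLog aggregate needs beyond per-row updates, under illustrative names that are not Spark's actual classes: merging partial aggregates is an elementwise max over registers, and the raw estimate follows the standard HLL formula.

```python
# Sketch of the two aggregate-side operations beyond per-row update:
# merging partial aggregates (elementwise max of registers) and the raw
# estimate alpha * m^2 / sum(2^-M[j]). Names are illustrative, not Spark's.
# A full implementation would add HLL's small/large-range corrections.
def merge(regs_a, regs_b):
    return [max(a, b) for a, b in zip(regs_a, regs_b)]

def raw_estimate(regs):
    m = len(regs)
    alpha = 0.7213 / (1 + 1.079 / m)   # alpha_m approximation for large m
    return alpha * m * m / sum(2.0 ** -r for r in regs)
```

Merging being a commutative, idempotent max is what makes the sketch safe to combine across partitions in any order.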