how does isDistinct work on expressions

2016-11-13 Thread assaf.mendelson
Hi, I am trying to understand how aggregate functions are implemented internally. I see that the expression is wrapped using toAggregateExpression using isDistinct. I can't figure out where the code that makes the data distinct is located. I am trying to figure out how the input data is converted

Converting spark types and standard scala types

2016-11-13 Thread assaf.mendelson
Hi, I am trying to write a new aggregate function (https://issues.apache.org/jira/browse/SPARK-17691) and I wanted it to support all ordered types. I have several issues though: 1. How to convert the type of the child expression to a Scala standard type (e.g. I need an Array[Int] for Int

On the use of catalyst.dsl package and deserialize vs CatalystSerde.deserialize

2016-11-13 Thread Jacek Laskowski
Hi, It's just a (minor?) example of how to use the catalyst.dsl package [1], but I am currently reviewing deserialize [2] and have a question. CatalystSerde.deserialize [3] is exactly the deserialize operator (referred to above) and since CatalystSerde.deserialize is used in a few places like Dataset.rdd [4]

Re: how does isDistinct work on expressions

2016-11-13 Thread Jacek Laskowski
Hi, I might not have been there yet, but since I'm with the code every day I might be close... When you say "aggregate functions", are you asking about typed or untyped ones? Just today I reviewed the typed ones and honestly it took me some time to figure out what belongs where. Are you creating a new U

Re: Component naming in the PR title

2016-11-13 Thread Jacek Laskowski
Hi Hyukjin, What has worked for me, and what the Spark committers have accepted, is to use the first group: SQL, MLlib, Core, Python, Scheduler, Build, Docs, Streaming, Mesos, Web UI, YARN, GraphX, R, with all the letters uppercase. It's less to remember so I'd vote for keeping it in use (or be acceptable

Re: Component naming in the PR title

2016-11-13 Thread Sean Owen
Yes, they really correspond to, if anything, the categories at spark-prs.appspot.com. They aren't that consistently used, however, and there isn't really a definite list. It is really mostly of use for the fact that it tags emails in a way people can filter semi-effectively. So I think we have left i

Re: how does isDistinct work on expressions

2016-11-13 Thread Herman van Hövell tot Westerflier
Hi, You should take a look at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala Spark SQL does not directly support the aggregation of multiple distinct groups. For example select count(distinct a), coun
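
As a rough illustration of the idea behind that rule (a conceptual sketch in plain Python, not Spark's code), a query with several distinct aggregates can be evaluated by expanding each input row into one tagged copy per distinct column, de-duplicating the tagged values, and then counting the non-null survivors per tag. The function name `count_distinct_multi`, the sample rows, and the column names are hypothetical, and the three-step structure is my reading of the rule rather than something stated in the message above.

```python
def count_distinct_multi(rows, columns):
    """Count distinct non-null values for several columns in one pass."""
    # Step 1 (expand): one (group id, value) pair per row and distinct column.
    expanded = [(gid, row[col]) for row in rows for gid, col in enumerate(columns)]

    # Step 2 (first aggregate): grouping by (group id, value) removes duplicates.
    deduped = set(expanded)

    # Step 3 (second aggregate): count the non-null values left in each group.
    counts = {col: 0 for col in columns}
    for gid, value in deduped:
        if value is not None:
            counts[columns[gid]] += 1
    return counts


# Hypothetical sample data with a NULL in column "a".
rows = [
    {"a": 1, "b": "x"},
    {"a": 1, "b": "y"},
    {"a": 2, "b": "x"},
    {"a": None, "b": "x"},
]
result = count_distinct_multi(rows, ["a", "b"])
print(result)
```

The group id plays the role of a tag that keeps values from different distinct columns from being merged in the de-duplication step.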

Re: does The Design of spark consider the scala parallelize collections?

2016-11-13 Thread Reynold Xin
Some places in Spark do use it: > git grep "\\.par\\." mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala: val models = Range(0, numClasses).par.map { index => sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala: (0 until 10).par.fo

Re: Component naming in the PR title

2016-11-13 Thread Hyukjin Kwon
I see. I was just curious as I find myself hesitating when I open a PR from time to time. Thank you both for echoing! On 14 Nov 2016 5:02 a.m., "Sean Owen" wrote: > Yes they really correspond to, if anything, the categories at > spark-prs.appspot.com. They aren't that consistently used however and

statistics collection and propagation for cost-based optimizer

2016-11-13 Thread Reynold Xin
I want to bring this discussion to the dev list to gather broader feedback, as there have been some discussions spread over multiple JIRA tickets (SPARK-16026, etc.) and GitHub pull requests about what statistics to collect and how to use th

Re: statistics collection and propagation for cost-based optimizer

2016-11-13 Thread Reynold Xin
One additional note: in terms of size, the size of a count-min sketch with eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes. To look up what that means, see http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html On Sun, Nov 13, 2016 at 5:30 PM,
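
The 48k figure can be reproduced with a short back-of-the-envelope calculation. This is a sketch under stated assumptions: the sizing rule used here (width = ceil(2 / eps), depth = ceil(-ln(1 - confidence) / ln 2), one 8-byte counter per cell) is assumed rather than taken from the message, and the helper name `count_min_sketch_size` is hypothetical.

```python
import math


def count_min_sketch_size(eps, confidence, bytes_per_counter=8):
    """Return (width, depth, size_in_bytes) for an uncompressed counter table."""
    # Assumed sizing rule: width bounds the relative error, depth the confidence.
    width = math.ceil(2 / eps)
    depth = math.ceil(-math.log1p(-confidence) / math.log(2))
    return width, depth, width * depth * bytes_per_counter


# eps = 0.1% and confidence = 0.87, as in the message above.
width, depth, size = count_min_sketch_size(eps=0.001, confidence=0.87)
print(width, depth, size)
```

Under these assumptions the table has 2000 columns and 3 rows, which gives 48,000 bytes and agrees with the number quoted above.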

RE: how does isDistinct work on expressions

2016-11-13 Thread assaf.mendelson
Thanks for the pointer. It makes more sense now. Assaf. From: Herman van Hövell tot Westerflier-2 [via Apache Spark Developers List] [mailto:ml-node+s1001551n19842...@n3.nabble.com] Sent: Sunday, November 13, 2016 10:03 PM To: Mendelson, Assaf Subject: Re: how does isDistinct work on expressions