[GitHub] spark issue #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-07-03 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/21699 Using either `Column` or `String` type was actually in my original PR: https://github.com/apache/spark/pull/7841 @rxin later modified the api to only take a `String` prior to the release as part

[GitHub] spark issue #21187: [SPARK-24035][SQL] SQL syntax for Pivot

2018-04-30 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/21187 LGTM thanks for doing this! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark issue #19629: [SPARK-22408][SQL] RelationalGroupedDataset's distinct p...

2017-11-01 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/19629 diff LGTM

[GitHub] spark pull request #18306: [SPARK-21029][SS] All StreamingQuery should be st...

2017-10-07 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18306#discussion_r143345713 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -562,6 +563,8 @@ class SparkContext(config: SparkConf) extends Logging

[GitHub] spark pull request #19226: [SPARK-21985][PySpark] PairDeserializer is broken...

2017-09-14 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/19226#discussion_r138917350 --- Diff: python/pyspark/serializers.py --- @@ -343,6 +343,8 @@ def _load_stream_without_unbatching(self, stream): key_batch_stream

[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...

2017-09-13 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/19226 @holdenk I'm not going to be able to solve this tonight (short of just removing the failing test).

[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...

2017-09-13 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/19226 It's actually this one that is failing https://github.com/aray/spark/blob/0d64a6d11237383c2a6ea21275dc9daa5cc8d634/python/pyspark/tests.py#L964

[GitHub] spark pull request #19226: [SPARK-21985][PySpark] PairDeserializer is broken...

2017-09-13 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/19226 [SPARK-21985][PySpark] PairDeserializer is broken for double-zipped RDDs ## What changes were proposed in this pull request? This removes the mostly unnecessary test that each individual

[GitHub] spark issue #16121: [SPARK-16589][PYTHON] Chained cartesian produces incorre...

2017-09-13 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16121 I'll take a look, sorry about that.

[GitHub] spark pull request #18306: [SPARK-21029][SS] All StreamingQuery should be st...

2017-08-31 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18306#discussion_r136436631 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -562,6 +563,8 @@ class SparkContext(config: SparkConf) extends Logging

[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...

2017-08-31 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18818#discussion_r136421644 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -582,6 +582,7 @@ class CodegenContext

[GitHub] spark pull request #19080: [SPARK-21865][SQL] simplify the distribution sema...

2017-08-31 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/19080#discussion_r136419947 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala --- @@ -284,24 +241,17 @@ case class RangePartitioning

[GitHub] spark issue #18306: [SPARK-21029][SS] All StreamingQuery should be stopped w...

2017-08-31 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18306 ping @zsxwing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark issue #18818: [SPARK-21110][SQL] Structs, arrays, and other orderable ...

2017-08-31 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18818 ping @viirya @gatorsmile

[GitHub] spark pull request #18786: [SPARK-21584][SQL][SparkR] Update R method for su...

2017-08-17 Thread aray
GitHub user aray reopened a pull request: https://github.com/apache/spark/pull/18786 [SPARK-21584][SQL][SparkR] Update R method for summary to call new implementation ## What changes were proposed in this pull request? SPARK-21100 introduced a new `summary` method

[GitHub] spark pull request #18786: [SPARK-21584][SQL][SparkR] Update R method for su...

2017-08-17 Thread aray
Github user aray closed the pull request at: https://github.com/apache/spark/pull/18786

[GitHub] spark issue #18786: [SPARK-21584][SQL][SparkR] Update R method for summary t...

2017-08-17 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18786 closing and reopening to trigger AppVeyor test that timed out

[GitHub] spark issue #18818: [SPARK-21110][SQL] Structs, arrays, and other orderable ...

2017-08-14 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18818 @viirya @gatorsmile I have addressed your comments, could you take another look.

[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...

2017-08-14 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18818#discussion_r133116720 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -465,7 +475,7 @@ abstract class BinaryComparison

[GitHub] spark issue #18818: [SPARK-21110][SQL] Structs, arrays, and other orderable ...

2017-08-14 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18818 retest this please

[GitHub] spark issue #18786: [SPARK-21584][SQL][SparkR] Update R method for summary t...

2017-08-09 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18786 I'm pushing for it to stay as is because it's the more logical layout of the data: min=0%, 25%, 50%, 75%, max=100%. It's also more consistent with summary of native R dataframes (and for Python

[GitHub] spark issue #18306: [SPARK-21029][SS] All StreamingQuery should be stopped w...

2017-08-08 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18306 @zsxwing can you take another look at this?

[GitHub] spark issue #18786: [SPARK-21584][SQL][SparkR] Update R method for summary t...

2017-08-08 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18786 @rxin Any thoughts on whether it's ok to change the output of `summary` in R in a non "additive" way?

[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...

2017-08-08 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18818#discussion_r131913185 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -453,6 +453,14 @@ case class Or(left: Expression

[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...

2017-08-07 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18818#discussion_r131808912 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -453,6 +453,14 @@ case class Or(left: Expression

[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...

2017-08-07 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18818#discussion_r131656840 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala --- @@ -79,18 +79,6 @@ private[sql] class TypeCollection(private val

[GitHub] spark issue #18835: [SPARK-21628][BUILD] Explicitly specify Java version in ...

2017-08-03 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18835 Thanks, I see it now.

[GitHub] spark pull request #18835: [SPARK-21628][BUILD] Explicitly specify Java vers...

2017-08-03 Thread aray
Github user aray closed the pull request at: https://github.com/apache/spark/pull/18835

[GitHub] spark pull request #18835: [SPARK-21628][BUILD] Explicitly specify Java vers...

2017-08-03 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/18835 [SPARK-21628][BUILD] Explicitly specify Java version in maven compiler plugin so IntelliJ imports project correctly ## What changes were proposed in this pull request? Explicitly specify

[GitHub] spark issue #18786: [SPARK-21584][SQL][SparkR] Update R method for summary t...

2017-08-02 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18786 No, the changes to `summary` are not additive: it inserts the 25%, 50%, and 75% percentiles before max (the last row). People who want the previous behavior can use `describe`. Or if they are trying
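
To illustrate why the change is not additive, here is a minimal plain-Python sketch (not Spark code) of the two row layouts described in the comment above. The helper names `describe_rows`, `summary_rows`, and `nearest_rank` are hypothetical, and the nearest-rank percentile rule is an assumption made for illustration; it is not necessarily how Spark computes percentiles.

```python
import math
import statistics


def nearest_rank(sorted_values, p):
    """Nearest-rank percentile (an assumed rule, chosen for simplicity)."""
    k = max(1, math.ceil(p * len(sorted_values)))
    return sorted_values[k - 1]


def describe_rows(values):
    """Rows in the older layout: the last row is max."""
    return [
        ("count", len(values)),
        ("mean", statistics.mean(values)),
        ("stddev", statistics.stdev(values)),
        ("min", min(values)),
        ("max", max(values)),
    ]


def summary_rows(values):
    """Same rows, with the quartiles inserted before max (the last row)."""
    ordered = sorted(values)
    rows = describe_rows(values)
    quartiles = [
        (label, nearest_rank(ordered, p))
        for label, p in (("25%", 0.25), ("50%", 0.50), ("75%", 0.75))
    ]
    return rows[:-1] + quartiles + rows[-1:]


data = [1, 2, 3, 4, 5, 6, 7, 8]
describe_labels = [label for label, _ in describe_rows(data)]
summary_labels = [label for label, _ in summary_rows(data)]
```

Code that reads the last row by position still finds max in both layouts, but code that reads the fifth row finds max in one layout and the 25% percentile in the other, which is the sense in which the change is not purely additive.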

[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...

2017-08-02 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/18818 [SPARK-21110][SQL] Structs, arrays, and other orderable datatypes should be usable in inequalities ## What changes were proposed in this pull request? Allows `BinaryComparison` operators

[GitHub] spark pull request #18786: [SPARK-21584][SQL][SparkR] Update R method for su...

2017-08-02 Thread aray
GitHub user aray reopened a pull request: https://github.com/apache/spark/pull/18786 [SPARK-21584][SQL][SparkR] Update R method for summary to call new implementation ## What changes were proposed in this pull request? SPARK-21100 introduced a new `summary` method

[GitHub] spark pull request #18786: [SPARK-21584][SQL][SparkR] Update R method for su...

2017-08-02 Thread aray
Github user aray closed the pull request at: https://github.com/apache/spark/pull/18786

[GitHub] spark pull request #18800: [SPARK-21330][SQL] Bad partitioning does not allo...

2017-08-01 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/18800 [SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC table with extreme values on the partition column ## What changes were proposed in this pull request? An overflow
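
The description above is cut off after "An overflow", so the following is only a sketch of the general failure mode, not the PR's actual change. It simulates signed 64-bit arithmetic in plain Python to show how a partition stride computed as `(upper - lower) / numPartitions` can wrap around when the bounds are extreme, and how dividing each bound first avoids the wrap. The function names are hypothetical.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def wrap_int64(x):
    """Wrap an arbitrary Python int to signed 64-bit two's complement."""
    return (x + 2 ** 63) % 2 ** 64 - 2 ** 63


def trunc_div(a, b):
    """Integer division that truncates toward zero, as on the JVM."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def naive_stride(lower, upper, num_partitions):
    """Subtract first: the difference can overflow a 64-bit long."""
    return trunc_div(wrap_int64(upper - lower), num_partitions)


def safe_stride(lower, upper, num_partitions):
    """Divide each bound first, so the intermediate values stay in range."""
    return wrap_int64(
        trunc_div(upper, num_partitions) - trunc_div(lower, num_partitions)
    )


naive = naive_stride(INT64_MIN, INT64_MAX, 4)
safe = safe_stride(INT64_MIN, INT64_MAX, 4)
```

With the full 64-bit range as bounds and four partitions, the subtract-first form wraps to a stride of zero, while the divide-first form yields a usable stride. Dividing first can shift the stride by a small rounding amount compared with exact arithmetic, which is normally harmless for choosing partition boundaries.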

[GitHub] spark pull request #18786: [SPARK-21584][SQL][SparkR] Update R method for su...

2017-08-01 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18786#discussion_r130620399 --- Diff: R/pkg/R/DataFrame.R --- @@ -2973,15 +2974,51 @@ setMethod("describe", dataFrame(sdf) }) +

[GitHub] spark pull request #18786: [SPARK-21584][SQL][SparkR] Update R method for su...

2017-08-01 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18786#discussion_r130618566 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -2500,8 +2500,15 @@ test_that("describe() and summarize() on a DataFrame", { ex

[GitHub] spark pull request #18786: [SPARK-21584][SQL][SparkR] Update R method for su...

2017-07-31 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/18786 [SPARK-21584][SQL][SparkR] Update R method for summary to call new implementation ## What changes were proposed in this pull request? SPARK-21100 introduced a new `summary` method

[GitHub] spark issue #18697: [SPARK-16683][SQL] Repeated joins to same table can leak...

2017-07-31 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18697 @viirya We could certainly make that improvement. I believe it would be a fairly trivial change to this PR if we were just considering expressions that have the same canonical representation. However

[GitHub] spark pull request #18697: [SPARK-16683][SQL] Repeated joins to same table c...

2017-07-31 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18697#discussion_r130396904 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala --- @@ -65,6 +65,10 @@ abstract class SparkPlan extends QueryPlan[SparkPlan

[GitHub] spark issue #18762: [SPARK-21566][SQL][Python] Python method for summary

2017-07-28 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18762 retest this please

[GitHub] spark pull request #18762: [SPARK-21566][SQL][Python] Python method for summ...

2017-07-28 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/18762 [SPARK-21566][SQL][Python] Python method for summary ## What changes were proposed in this pull request? Adds the recently added `summary` method to the python dataframe interface

[GitHub] spark issue #18697: [SPARK-16683][SQL] Repeated joins to same table can leak...

2017-07-25 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18697 ping @rxin can someone look at this correctness fix?

[GitHub] spark issue #16577: [SPARK-19214][SQL] Typed aggregate count output field na...

2017-07-24 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16577 Closing since it does not look like there is any interest in changing this. Thanks everyone!

[GitHub] spark pull request #16577: [SPARK-19214][SQL] Typed aggregate count output f...

2017-07-24 Thread aray
Github user aray closed the pull request at: https://github.com/apache/spark/pull/16577

[GitHub] spark issue #18697: [SPARK-16683][SQL] Repeated joins to same table can leak...

2017-07-21 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18697 retest this please

[GitHub] spark issue #18697: [SPARK-16683][SQL] Repeated joins to same table can leak...

2017-07-20 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18697 Plan for the example query before the patch (with partitioning as suffix): ``` *HashAggregate(keys=[parent#228], functions=[], output=[level2#274]) hashpartitioning(parent#228, 5

[GitHub] spark pull request #18697: [SPARK-16683][SQL] Repeated joins to same table c...

2017-07-20 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/18697 [SPARK-16683][SQL] Repeated joins to same table can leak attributes via partitioning ## What changes were proposed in this pull request? In some complex queries where the same table

[GitHub] spark issue #18306: [SPARK-21029][SS] All StreamingQuery should be stopped w...

2017-07-19 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18306 ping @rxin @marmbrus @zsxwing @felixcheung can anyone look at this?

[GitHub] spark pull request #18307: [SPARK-21100][SQL] describe should give quartiles...

2017-06-30 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18307#discussion_r125053122 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -2205,37 +2205,170 @@ class Dataset[T] private[sql]( * // max 92.0

[GitHub] spark issue #18307: [SPARK-21100][SQL] describe should give quartiles simila...

2017-06-29 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18307 @rxin @felixcheung Thanks for the feedback, I revamped this PR to leave `describe` unchanged and added two new methods `describeExtended` and `describeAdvanced` (the latter is used to implement all

[GitHub] spark issue #18306: [SPARK-21029][SS] All StreamingQuery should be stopped w...

2017-06-28 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18306 @zsxwing @marmbrus @rxin This is ready for review. I have changed the approach so that queries from all SparkSession's are stopped. I was not able to use a SparkListener as @zsxwing suggested or even

[GitHub] spark issue #18307: [SPARK-21100][SQL] describe should give quartiles simila...

2017-06-15 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18307 @rxin Yes it slows things down quite a bit. Informal testing on 10M row 2 column synthetic data puts this implementation at around 10s vs 0.5s in 2.2-rc4. I can speed it up some by doing only a single

[GitHub] spark pull request #18306: [SPARK-21029][SS] All StreamingQuery should be st...

2017-06-14 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18306#discussion_r122112073 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala --- @@ -690,6 +690,7 @@ class SparkSession private( * @since 2.0.0

[GitHub] spark pull request #18307: [SPARK-21100][SQL] describe should give quartiles...

2017-06-14 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/18307 [SPARK-21100][SQL] describe should give quartiles similar to Pandas ## What changes were proposed in this pull request? Modify the describe method to include quartiles (25th, 50th, and 75th

[GitHub] spark pull request #18306: [SPARK-21029][SS] All StreamingQuery should be st...

2017-06-14 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/18306 [SPARK-21029][SS] All StreamingQuery should be stopped when the SparkSession is stopped ## What changes were proposed in this pull request? Adds method to `StreamingQueryManager` that stops

[GitHub] spark issue #18001: [SPARK-20769][Doc] Incorrect documentation for using Jup...

2017-05-16 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/18001 Yes, it does not work without ``` Andrews-MacBook-Pro:spark-2.1.1-bin-hadoop2.7 andrew$ jupyter --version 4.0.6 Andrews-MacBook-Pro:spark-2.1.1-bin-hadoop2.7 andrew

[GitHub] spark pull request #18001: [SPARK-20769][Doc] Incorrect documentation for us...

2017-05-16 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/18001 [SPARK-20769][Doc] Incorrect documentation for using Jupyter notebook ## What changes were proposed in this pull request? SPARK-13973 incorrectly removed the required

[GitHub] spark issue #17348: [SPARK-20018][SQL] Pivot with timestamp and count should...

2017-03-19 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/17348 LGTM

[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-03-16 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16483 @thunterdb The extra step -- as implemented -- is only at the end as that gives the same result as doing it after every iteration but without the extra overhead.
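
The comment does not spell out what the extra step is, so the sketch below is an assumption-laden illustration rather than the GraphX implementation. It uses a textbook PageRank in plain Python, with ranks that sum to 1, on a small made-up graph with one sink (a vertex with no outgoing edges). It compares two ways of handling the rank that a sink would otherwise leak: redistributing it uniformly on every iteration, or letting it leak and normalizing the ranks once at the end.

```python
def pagerank(edges, n, damping=0.85, iterations=200, redistribute_sinks=False):
    """Power iteration; optionally spread sink rank uniformly each round."""
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        incoming = [0.0] * n
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        # Rank held by sinks has nowhere to go unless it is redistributed.
        sink_mass = sum(r for r, deg in zip(ranks, out_degree) if deg == 0)
        extra = sink_mass / n if redistribute_sinks else 0.0
        ranks = [(1.0 - damping) / n + damping * (inc + extra) for inc in incoming]
    return ranks


def normalize(ranks):
    """Rescale ranks so that they sum to 1."""
    total = sum(ranks)
    return [r / total for r in ranks]


# Hypothetical example graph: vertex 3 is a sink.
SINK_EDGES = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)]
leaky = pagerank(SINK_EDGES, 4)
normalized_at_end = normalize(leaky)
redistributed = pagerank(SINK_EDGES, 4, redistribute_sinks=True)
```

After enough iterations the two approaches agree to within floating-point tolerance on this graph, while the single normalization avoids computing the sink mass on every round. The sketch only checks the converged result; it makes no claim about intermediate iterations.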

[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-16 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106548090 --- Diff: graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala --- @@ -68,26 +69,34 @@ class PageRankSuite extends SparkFunSuite

[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-16 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106546448 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -322,13 +335,12 @@ object PageRank extends Logging { def

[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-03-16 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16483 @rxin can anyone else review this? It would be nice to get this correctness fix into 2.2.

[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/17226 @HyukjinKwon There is an inconsistency/regression but it's not being introduced in this PR, it's already there. Take an example without null as a pivot column value like below. The only difference

[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/17226 @HyukjinKwon we're not introducing a regression in this PR by fixing the NPE, the answer given by 1.6 was incorrect under any interpretation. Again, there is a completely separate issue of what

[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/17226 BTW for 3 above if we decide it should be 0, we can add an initial value for `PivotFirst` to make the fix.

[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/17226 There are three things going on here in your one example. 1. Spark 1.6 [first version with pivot] (and Spark 2.0+ with an aggregate output type unsupported by PivotFirst) gives incorrect

[GitHub] spark pull request #17226: [SPARK-19882][SQL] Pivot with null as a distinct ...

2017-03-09 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/17226#discussion_r105324124 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -522,7 +522,7 @@ class Analyzer( } else

[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/17226 @HyukjinKwon As stated in 17226#discussion_r105322758 I think we should open a second JIRA to have the discussion on whether or not count(1) of no values in a pivot should be filled with 0's

[GitHub] spark pull request #17226: [SPARK-19882][SQL] Pivot with null as a distinct ...

2017-03-09 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/17226#discussion_r105322758 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala --- @@ -216,4 +216,10 @@ class DataFramePivotSuite extends QueryTest

[GitHub] spark pull request #17226: [SPARK-19882][SQL] Pivot with null as a distinct ...

2017-03-09 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/17226 [SPARK-19882][SQL] Pivot with null as a distinct pivot value throws NPE ## What changes were proposed in this pull request? Allows null values of the pivot column to be included in the pivot

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-20 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r97168170 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,251 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-20 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r97162464 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,251 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-20 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r97168311 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,251 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-20 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r97166816 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,251 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #16539: [SPARK-8855][MLlib][PySpark] Python API for Assoc...

2017-01-20 Thread aray
Github user aray closed the pull request at: https://github.com/apache/spark/pull/16539

[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-01-17 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16483 @rxin can you take a look?

[GitHub] spark pull request #16577: [SPARK-19214][SQL] Typed aggregate count output f...

2017-01-13 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16577 [SPARK-19214][SQL] Typed aggregate count output field name should be "count" ## What changes were proposed in this pull request? Changes the output field name of typed aggreg

[GitHub] spark issue #16559: [WIP] Add expression index and test cases

2017-01-12 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16559 It can already be done with the `posexplode` UDTF like ``` with t as (values (array(1,2,3)), (array(4,5,6)) as (a)) select col from t lateral view posexplode(a) tt where pos = 2
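
For readers unfamiliar with `posexplode`, the visible part of the query above can be mirrored in plain Python (this is an analogy, not Spark code). `posexplode` emits one `(pos, col)` pair per array element with zero-based positions, so filtering on `pos = 2` selects the third element of each array. The local `posexplode` helper below is a hypothetical stand-in for the SQL function.

```python
def posexplode(array):
    """Return (pos, col) pairs with zero-based positions, like the SQL UDTF."""
    return list(enumerate(array))


# The two rows from the `values` clause in the query.
rows = [[1, 2, 3], [4, 5, 6]]

# Equivalent of: select col ... lateral view posexplode(a) ... where pos = 2
selected = [col for a in rows for pos, col in posexplode(a) if pos == 2]
```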

[GitHub] spark issue #16555: [SPARK-19180][SQL] the offset of short should be 4 in Of...

2017-01-12 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16555 The title should say 2.

[GitHub] spark pull request #16539: [SPARK-8855][MLlib][PySpark] Python API for Assoc...

2017-01-10 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16539 [SPARK-8855][MLlib][PySpark] Python API for Association Rules ## What changes were proposed in this pull request? This patch adds a `generateAssociationRules(confidence)` method

[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-01-06 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16483 ping @srowen @ankurdave can you take a look at this?

[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-01-05 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16483 [SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks ## What changes were proposed in this pull request? Graphs with sinks (vertices with no outgoing edges) don't have

[GitHub] spark issue #16271: [SPARK-18845][GraphX] PageRank has incorrect initializat...

2016-12-15 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16271 Yes the improvement is from the sum of magnitudes of initial values being closer to the (known) sum of the solution. Fiddling with resetProb controls a completely different thing. The current
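
The reasoning in this comment can be checked with a small plain-Python sketch (not the GraphX code). It uses the unnormalized convention in which the converged ranks sum to the number of vertices, on a made-up three-vertex graph without sinks. The two starting values compared, 0.15 and 1.0 per vertex, are chosen for illustration and are not taken from the truncated text; with 1.0 the initial ranks already sum to the known total, with 0.15 they do not.

```python
def iterations_to_converge(edges, n, initial, reset_prob=0.15, tol=1e-6,
                           max_iter=1000):
    """Count power-iteration steps until no rank moves by more than tol."""
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1
    ranks = [initial] * n
    for step in range(1, max_iter + 1):
        incoming = [0.0] * n
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        updated = [reset_prob + (1.0 - reset_prob) * inc for inc in incoming]
        if max(abs(a - b) for a, b in zip(updated, ranks)) < tol:
            return step
        ranks = updated
    return max_iter


# Hypothetical example graph: every vertex has at least one outgoing edge.
INIT_EDGES = [(0, 1), (0, 2), (1, 2), (2, 0)]
steps_from_small = iterations_to_converge(INIT_EDGES, 3, initial=0.15)
steps_from_one = iterations_to_converge(INIT_EDGES, 3, initial=1.0)
```

On this graph the run that starts at 1.0 reaches the tolerance in fewer steps, because its initial error has no component along the total rank sum, which is the slowest-decaying direction. How large the gap is on real graphs depends on their structure.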

[GitHub] spark pull request #16271: [SPARK-18845][GraphX] PageRank has incorrect init...

2016-12-15 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/16271#discussion_r92621591 --- Diff: graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala --- @@ -70,10 +70,10 @@ class PageRankSuite extends SparkFunSuite

[GitHub] spark pull request #16240: [SPARK-16792][SQL] Dataset containing a Case Clas...

2016-12-14 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/16240#discussion_r92546082 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala --- @@ -100,31 +100,76 @@ abstract class SQLImplicits { // Seqs

[GitHub] spark issue #16271: [SPARK-18845][GraphX] PageRank has incorrect initializat...

2016-12-14 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16271 **References** [Pagerank paper](http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf) > We need to make an initial assignment of the ranks. This assignment can be made by one of several strateg

[GitHub] spark issue #16271: [SPARK-18845][GraphX] PageRank has incorrect initializat...

2016-12-14 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16271 ping @srowen @dbtsai @rxin @ankurdave @jegonzal

[GitHub] spark issue #16271: [SPARK-18845][GraphX] PageRank has incorrect initializat...

2016-12-14 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16271 Updated the above benchmark code to use a log-normal random graph on 10,000 vertices; the difference is much more drastic. ![](http://i.imgur.com/Zo56dEO.png) (take the very bottom of the graph

[GitHub] spark pull request #16271: [SPARK-18845][GraphX] PageRank has incorrect init...

2016-12-13 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16271 [SPARK-18845][GraphX] PageRank has incorrect initialization value that leads to slow convergence ## What changes were proposed in this pull request? Change the initial value in all PageRank

[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...

2016-12-07 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16161 I would be happy to create a separate PR for adding support for `mutable.Map` (and `List`) if that is wanted. But there is no _generic_ solution as there is no type that is assignable to both

[GitHub] spark pull request #16197: [SPARK-17760][SQL][Backport] AnalysisException wi...

2016-12-07 Thread aray
Github user aray closed the pull request at: https://github.com/apache/spark/pull/16197

[GitHub] spark pull request #16197: [SPARK-17760][SQL][Backport] AnalysisException wi...

2016-12-07 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16197 [SPARK-17760][SQL][Backport] AnalysisException with dataframe pivot when groupBy column is not attribute ## What changes were proposed in this pull request? Backport of #16177 to branch-2.0

[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...

2016-12-06 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16161 Right now it's not supported to have the following: ``` case class Foo(a: Map[Int, Int]) ``` (using the scala Predef version of Map) The [documented](http://spark.apache.org

[GitHub] spark pull request #16177: [SPARK-17760][SQL] AnalysisException with datafra...

2016-12-06 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16177 [SPARK-17760][SQL] AnalysisException with dataframe pivot when groupBy column is not attribute ## What changes were proposed in this pull request? Fixes AnalysisException for pivot queries

[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...

2016-12-06 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16161 The approach is to change the deserializer (via `ScalaReflection#deserializerFor`) to return the more specific type `scala.collection.immutable.Map` instead of `scala.collection.Map` as it does now

[GitHub] spark pull request #16161: [SPARK-18717][SQL] Make code generation for Scala...

2016-12-05 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16161 [SPARK-18717][SQL] Make code generation for Scala Map work with immutable.Map also ## What changes were proposed in this pull request? Fixes compile errors in generated code when user has

[GitHub] spark issue #16121: [SPARK-16589][PYTHON] Chained cartesian produces incorre...

2016-12-05 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16121 @davies, @zero323, and @holdenk this is in a good place for review if you want to take a look.

[GitHub] spark issue #16121: [SPARK-16589][PYTHON] Chained cartesian produces incorre...

2016-12-02 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16121 @davies I was trying to make minimal changes to `PairDeserializer`, but you are right, it needs to be changed as well. I'll update the PR shortly.

[GitHub] spark pull request #16121: [SPARK-16589][PYTHON] Chained cartesian produces ...

2016-12-02 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16121 [SPARK-16589][PYTHON] Chained cartesian produces incorrect number of records ## What changes were proposed in this pull request? Fixes a bug in the python implementation of rdd cartesian
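
As a reference point for what "correct number of records" means here, the following plain-Python sketch models a chained cartesian product without using PySpark at all. The helper names `cartesian` and `chained_cartesian` are hypothetical, and the left-nested pair shape `((x, y), z)` is an assumption about how chained results are structured; the only claim being illustrated is that the record count should equal the product of the input sizes.

```python
def cartesian(left, right):
    """All (x, y) pairs with x from left and y from right."""
    return [(x, y) for x in left for y in right]


def chained_cartesian(first, *rest):
    """Left-nested chain: cartesian(cartesian(first, b), c), and so on."""
    result = list(first)
    for other in rest:
        result = cartesian(result, list(other))
    return result


# Inputs of size 2, 3 and 4 should yield 2 * 3 * 4 records.
records = chained_cartesian(range(2), range(3), range(4))
print(len(records))
```

A model like this can serve as an oracle in a regression test: run the same inputs through the real implementation and compare the sorted results.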

[GitHub] spark issue #15898: [SPARK-18457][SQL] ORC and other columnar formats using ...

2016-11-15 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/15898 @tejasapatil yes that is the use case where this applies. It's only tested against whatever version is included in the hadoop2.7+hive build configuration listed above. Is there anything in particular
