[GitHub] spark pull request #18306: [SPARK-21029][SS] All StreamingQuery should be st...

2017-10-07 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/18306#discussion_r143345713
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -562,6 +563,8 @@ class SparkContext(config: SparkConf) extends Logging {
   }
 _cleaner.foreach(_.start())
 
+_stopHooks = new SparkShutdownHookManager()
--- End diff --

@felixcheung I don't get the objection. This is not adding a new shutdown hook, just reusing the class to add "stop hooks" that are called by `sc.stop()`. Furthermore, this is needed because `sc.stop()` cannot simply call `stopAllQueries` itself without creating a circular package dependency.
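
A minimal self-contained sketch of the "stop hook" idea (hypothetical names, not the actual `SparkContext`/`SparkShutdownHookManager` code): code in another package registers a callback, and `stop()` runs the callbacks before tearing anything down, which avoids a direct (circular) dependency from core on the streaming package.

```scala
import scala.collection.mutable.ArrayBuffer

class StopHooks {
  private val hooks = ArrayBuffer.empty[() => Unit]
  def add(hook: () => Unit): Unit = synchronized { hooks += hook }
  def runAll(): Unit = synchronized { hooks.foreach(h => h()) }
}

class MiniContext {
  val stopHooks = new StopHooks        // e.g. the streaming side registers its stopAllQueries here
  def stop(): Unit = {
    stopHooks.runAll()                 // run registered "stop hooks" before tearing anything down
    println("context stopped")
  }
}

val ctx = new MiniContext
ctx.stopHooks.add(() => println("stopping all streaming queries"))
ctx.stop()
```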





[GitHub] spark issue #19629: [SPARK-22408][SQL] RelationalGroupedDataset's distinct p...

2017-11-01 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/19629
  
diff LGTM





[GitHub] spark issue #18818: [SPARK-21110][SQL] Structs, arrays, and other orderable ...

2017-08-31 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/18818
  
ping @viirya @gatorsmile





[GitHub] spark issue #18306: [SPARK-21029][SS] All StreamingQuery should be stopped w...

2017-08-31 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/18306
  
ping @zsxwing





[GitHub] spark pull request #19080: [SPARK-21865][SQL] simplify the distribution sema...

2017-08-31 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/19080#discussion_r136419947
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
@@ -284,24 +241,17 @@ case class RangePartitioning(ordering: Seq[SortOrder], numPartitions: Int)
   override def nullable: Boolean = false
   override def dataType: DataType = IntegerType
 
-  override def satisfies(required: Distribution): Boolean = required match {
-    case UnspecifiedDistribution => true
-    case OrderedDistribution(requiredOrdering) =>
-      val minSize = Seq(requiredOrdering.size, ordering.size).min
-      requiredOrdering.take(minSize) == ordering.take(minSize)
-    case ClusteredDistribution(requiredClustering) =>
-      ordering.map(_.child).forall(x => requiredClustering.exists(_.semanticEquals(x)))
-    case _ => false
-  }
-
-  override def compatibleWith(other: Partitioning): Boolean = other match {
-    case o: RangePartitioning => this.semanticEquals(o)
-    case _ => false
-  }
-
-  override def guarantees(other: Partitioning): Boolean = other match {
-    case o: RangePartitioning => this.semanticEquals(o)
-    case _ => false
+  override def satisfies(required: Distribution): Boolean = {
+    super.satisfies(required) || {
+      required match {
+        case OrderedDistribution(requiredOrdering) =>
+          val minSize = Seq(requiredOrdering.size, ordering.size).min
+          requiredOrdering.take(minSize) == ordering.take(minSize)
--- End diff --

While we are cleaning things up, this needs to be fixed: `RangePartitioning(a+,b+)` does not satisfy `OrderedDistribution(a+)`. It violates the requirement that all rows with the same value of `a` be in the same partition.
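
A minimal, hypothetical illustration of the point (plain Scala, not Spark internals): with a range split point chosen on the composite key `(a, b)`, rows sharing the same `a` can straddle a partition boundary.

```scala
object RangePartitioningCounterexample extends App {
  // hypothetical split point on the composite sort key (a, b):
  // partition 0 holds rows lexicographically <= (1, 10), partition 1 holds the rest
  val splitPoint = (1, 10)

  def partition(row: (Int, Int)): Int = {
    val ((a, b), (sa, sb)) = (row, splitPoint)
    if (a < sa || (a == sa && b <= sb)) 0 else 1
  }

  val rows = Seq((1, 5), (1, 10), (1, 11), (2, 1))
  rows.foreach(r => println(s"row $r -> partition ${partition(r)}"))
  // (1,5) and (1,10) land in partition 0 while (1,11) lands in partition 1,
  // even though all three share a = 1
}
```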





[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...

2017-08-31 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/18818#discussion_r136421644
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -582,6 +582,7 @@ class CodegenContext {
 case array: ArrayType => genComp(array, c1, c2) + " == 0"
 case struct: StructType => genComp(struct, c1, c2) + " == 0"
 case udt: UserDefinedType[_] => genEqual(udt.sqlType, c1, c2)
+case NullType => "true"
--- End diff --

Yeah, codegen fails without this. I had originally made the value `false`, but when I noticed that the codegen for comparison (https://github.com/aray/spark/blob/cc2f3eca28ee6b9faa87853568205307567827cc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L606) returned `0`, I changed it to be consistent. Happy to change it back though.
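
A small sketch of the consistency argument (simplified stand-ins, not the real `CodegenContext` methods): for `NullType` every value is null, so comparison code yields `0` and equality code yields `true`.

```scala
sealed trait SimpleType
case object NullType extends SimpleType
case object IntType extends SimpleType

// simplified analogues of genComp / genEqual: both emit Java source fragments as strings
def genComp(t: SimpleType, c1: String, c2: String): String = t match {
  case NullType => "0"                                      // two NullType values always compare equal
  case IntType  => s"java.lang.Integer.compare($c1, $c2)"
}

def genEqual(t: SimpleType, c1: String, c2: String): String = t match {
  case NullType => "true"                                   // consistent with genComp returning 0
  case IntType  => s"($c1 == $c2)"
}

println(genComp(NullType, "a", "b"))    // 0
println(genEqual(NullType, "a", "b"))   // true
```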





[GitHub] spark pull request #18306: [SPARK-21029][SS] All StreamingQuery should be st...

2017-08-31 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/18306#discussion_r136436631
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -562,6 +563,8 @@ class SparkContext(config: SparkConf) extends Logging {
   }
 _cleaner.foreach(_.start())
 
+_stopHooks = new SparkShutdownHookManager()
--- End diff --

The queries also need to be gracefully stopped if someone calls `sc.stop()` 
without shutting down the JVM.





[GitHub] spark issue #16121: [SPARK-16589][PYTHON] Chained cartesian produces incorre...

2017-09-13 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16121
  
I'll take a look, sorry about that. 





[GitHub] spark pull request #19226: [SPARK-21985][PySpark] PairDeserializer is broken...

2017-09-13 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/19226

[SPARK-21985][PySpark] PairDeserializer is broken for double-zipped RDDs

## What changes were proposed in this pull request?

This removes the mostly unnecessary check that each individual batch from the key and value serializers is of the same size. We already enforce that the batch sizes are the same in rdd.zip (see: https://github.com/apache/spark/blob/c06f3f5ac500b02d38ca7ec5fcb33085e07f2f75/python/pyspark/rdd.py#L2118 ), which is the only place it is used in a non-trivial manner. This also adds a comment to the PairDeserializer documentation about this requirement.
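
For context, a spark-shell style sketch of the invariant relied on here (assumes an existing `SparkContext` as `sc`, like the other shell examples in this digest): `RDD.zip` already requires the same number of partitions and the same number of elements per partition on both sides.

```scala
val left  = sc.parallelize(1 to 6, numSlices = 3)
val right = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"), numSlices = 3)
left.zip(right).collect()        // Array((1,a), (2,b), ..., (6,f))

val uneven = sc.parallelize(1 to 5, numSlices = 3)   // partition sizes 1/2/2 instead of 2/2/2
// uneven.zip(right).collect()   // fails at runtime: zip needs equal counts in each partition
```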

## How was this patch tested?

Additional unit test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark SPARK-21985

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19226.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19226


commit 4a9eb935b8438a159c9f12239135eedd59b25fd3
Author: Andrew Ray 
Date:   2017-09-14T01:26:15Z

remove check and add tests







[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...

2017-09-13 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/19226
  
It's actually this one that is failing 
https://github.com/aray/spark/blob/0d64a6d11237383c2a6ea21275dc9daa5cc8d634/python/pyspark/tests.py#L964






[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...

2017-09-13 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/19226
  
@holdenk I'm not going to be able to solve this tonight (short of just 
removing the failing test). 





[GitHub] spark pull request #19226: [SPARK-21985][PySpark] PairDeserializer is broken...

2017-09-14 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/19226#discussion_r138917350
  
--- Diff: python/pyspark/serializers.py ---
@@ -343,6 +343,8 @@ def _load_stream_without_unbatching(self, stream):
         key_batch_stream = self.key_ser._load_stream_without_unbatching(stream)
         val_batch_stream = self.val_ser._load_stream_without_unbatching(stream)
         for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
+            key_batch = list(key_batch)
+            val_batch = list(val_batch)
--- End diff --

fixed in 66477f8






[GitHub] spark issue #21187: [SPARK-24035][SQL] SQL syntax for Pivot

2018-04-30 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/21187
  
LGTM thanks for doing this!





[GitHub] spark pull request #15111: [SPARK-17458][SQL] Alias specified for aggregates...

2016-09-15 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/15111

[SPARK-17458][SQL] Alias specified for aggregates in a pivot are not honored

## What changes were proposed in this pull request?

This change preserves aliases that are given for pivot aggregations
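
A hedged usage sketch of what this enables (spark-shell style; assumes the `A`/`B`/`C`/`D` example DataFrame used in the pivot discussion later in this digest and `import org.apache.spark.sql.functions._`): the alias given to the pivot aggregate should now be reflected in the generated column names instead of being dropped.

```scala
df.groupBy("A", "B")
  .pivot("C", Seq("small", "large"))
  .agg(sum($"D").as("total"))    // with this change the "total" alias carries through
  .show()
```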


## How was this patch tested?

New unit test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark SPARK-17458

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15111.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15111


commit a6cf8e4986ceabfaa37e7f7039991eb4cc66eda6
Author: Andrew Ray 
Date:   2016-09-15T16:03:36Z

fix SPARK-17458







[GitHub] spark pull request #15898: [SPARK-18457][SQL] ORC and other columnar formats...

2016-11-15 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/15898

[SPARK-18457][SQL] ORC and other columnar formats using HiveShim read all 
columns when doing a simple count 

## What changes were proposed in this pull request?

When reading zero columns (e.g., count(*)) from ORC or any other format 
that uses HiveShim, actually set the read column list to empty for Hive to use.

## How was this patch tested?

Query correctness is handled by existing unit tests. I'm happy to add more 
if anyone can point out some case that is not covered.

Reduction in data read can be verified in the UI when built with a recent 
version of Hadoop say:
```
build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive -DskipTests 
clean package
```
However, the default Hadoop 2.2 that is used for unit tests does not report actual bytes read and instead just reports full file sizes (see FileScanRDD.scala line 80). Therefore I don't think there is a good way to add a unit test for this.

I tested with the following setup using above build options
```
case class OrcData(intField: Long, stringField: String)
spark.range(1,100).map(i => OrcData(i, s"part-$i")).toDF().write.format("orc").save("orc_test")

sql(
  s"""CREATE EXTERNAL TABLE orc_test(
     |  intField LONG,
     |  stringField STRING
     |)
     |STORED AS ORC
     |LOCATION '${System.getProperty("user.dir") + "/orc_test"}'
   """.stripMargin)
```

## Results

query | Spark 2.0.2 | this PR
---|---|---
`sql("select count(*) from orc_test").collect`|4.4 MB|199.4 KB
`sql("select intField from orc_test").collect`|743.4 KB|743.4 KB
`sql("select * from orc_test").collect`|4.4 MB|4.4 MB


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark sql-orc-no-col

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15898.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15898


commit d6dbd479ed382049ab80fe92558550a26277431e
Author: Andrew Ray 
Date:   2016-03-29T12:33:12Z

meh

commit a4d1ce3e1e7b8602860d890ff3266ef464899a9b
Author: Andrew Ray 
Date:   2016-11-14T15:57:46Z

Merge branch 'master' of https://github.com/apache/spark into sql-orc-no-col

commit 037ca1d7d2765c5104b90cb3fa623ca1bb24480d
Author: Andrew Ray 
Date:   2016-11-16T03:57:00Z

update comment to be consistent with logic







[GitHub] spark issue #15898: [SPARK-18457][SQL] ORC and other columnar formats using ...

2016-11-15 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/15898
  
The code that is being changed originated 2 years ago with the addition of 
Hive 0.13 support by @zhzhan, see 

https://github.com/apache/spark/commit/7c89a8f0c81ecf91dba34c1f44393f45845d438c#diff-77621c6c599cfdd24363967a500be1c5R109
Before that there was not even a null check.

I will add the test you are suggesting @HyukjinKwon 





[GitHub] spark issue #15898: [SPARK-18457][SQL] ORC and other columnar formats using ...

2016-11-15 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/15898
  
@tejasapatil yes, that is the use case where this applies. It's only tested against whatever version is included in the hadoop2.7+hive build configuration listed above. Is there anything in particular you were concerned about compatibility-wise?





[GitHub] spark pull request #16121: [SPARK-16589][PYTHON] Chained cartesian produces ...

2016-12-02 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/16121

[SPARK-16589][PYTHON] Chained cartesian produces incorrect number of records

## What changes were proposed in this pull request?

Fixes a bug in the Python implementation of RDD cartesian product, related to batching, that showed up in repeated cartesian products with seemingly random results. The root cause was multiple iterators pulling from the same stream in the wrong order because of incorrect logic around batching.

`CartesianDeserializer` was changed to implement `_load_stream_without_unbatching` and borrow the one-line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it could be used without additional checks.

`PairDeserializer` was minorly modified to remove inheritance from `CartesianDeserializer`, as that inheritance was not really proper and no longer worked.

Both `CartesianDeserializer` and `PairDeserializer` now only extend `Serializer` (which has no `dump_stream` implementation) since they are only meant for *de*serialization.
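
A conceptual sketch of the batching constraint (plain Scala with hypothetical names, not the PySpark serializer code): key and value batches arrive interleaved in one stream, so they have to be consumed in strict alternation and zipped batch by batch.

```scala
def pairBatches[A](stream: Iterator[Seq[A]]): Iterator[(A, A)] =
  stream.grouped(2).flatMap {
    case Seq(keyBatch, valBatch) => keyBatch.zip(valBatch)   // one key batch, then its value batch
    case _                       => Seq.empty                // ignore a trailing unpaired batch
  }

// interleaved key/value batches as they would appear in a single serialized stream
val interleaved = Iterator(Seq(1, 2), Seq(10, 20), Seq(3, 4), Seq(30, 40))
println(pairBatches(interleaved).toList)   // List((1,10), (2,20), (3,30), (4,40))
```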

## How was this patch tested?

Additional unit tests (sourced from #14248)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark fix-cartesian

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16121.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16121


commit a73c1a2afb0d9ae3838cff8f83bc4c13010a9e66
Author: Andrew Ray 
Date:   2016-12-01T20:20:56Z

unit test

commit 4ed8c388a9077b89341d57301c241b6b2d2d
Author: Andrew Ray 
Date:   2016-12-02T15:23:11Z

working

commit a0e36522175bed10a6309b2fe2d37793d746584b
Author: Andrew Ray 
Date:   2016-12-02T15:32:39Z

remove unneeded debug vars and add comment







[GitHub] spark issue #16121: [SPARK-16589][PYTHON] Chained cartesian produces incorre...

2016-12-02 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16121
  
@davies I was trying to make minimal changes to `PairDeserializer`, but you are right, it needs to be changed as well. I'll update the PR shortly.





[GitHub] spark issue #16121: [SPARK-16589][PYTHON] Chained cartesian produces incorre...

2016-12-05 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16121
  
@davies, @zero323, and @holdenk this is in a good place for review if you 
want to take a look.





[GitHub] spark pull request #16161: [SPARK-18717][SQL] Make code generation for Scala...

2016-12-05 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/16161

[SPARK-18717][SQL] Make code generation for Scala Map work with 
immutable.Map also

## What changes were proposed in this pull request?

Fixes compile errors in generated code when a user has a case class with a `scala.collection.immutable.Map` instead of a `scala.collection.Map`. Since `ArrayBasedMapData.toScalaMap` returns the immutable version, we can make it work with both.

## How was this patch tested?

Additional unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark fix-map-codegen

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16161.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16161


commit 3aa737160c27bfb8845a9d6c880451db88baaff2
Author: Andrew Ray 
Date:   2016-12-05T21:41:28Z

the fix

commit c0705094151ebefc7a43299bc6d07116a425aa35
Author: Andrew Ray 
Date:   2016-12-05T23:02:48Z

unit test

commit 699cc75f8e5d0d53225bd7a8003eed949a8250e0
Author: Andrew Ray 
Date:   2016-12-05T23:07:23Z

unused import







[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...

2016-12-06 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16161
  
The approach is to change the deserializer (via `ScalaReflection#deserializerFor`) to return the more specific type `scala.collection.immutable.Map` instead of `scala.collection.Map` as it does now (no change to `ArrayBasedMapData#toScalaMap` is needed as it already returns an immutable map). This allows the generated code for case classes using either of those Maps to work, as both are assignable from `scala.collection.immutable.Map`. This PR does not address `scala.collection.mutable.Map`; that still won't work.
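
A minimal sketch of the assignability argument (plain Scala, no Spark): a `scala.collection.immutable.Map` value satisfies a field declared as either `immutable.Map` or the general `scala.collection.Map`, but not one declared as `mutable.Map`.

```scala
val produced: scala.collection.immutable.Map[Int, Int] = Map(1 -> 2)

val asImmutable: scala.collection.immutable.Map[Int, Int] = produced   // compiles
val asGeneral: scala.collection.Map[Int, Int] = produced               // compiles
// val asMutable: scala.collection.mutable.Map[Int, Int] = produced    // does not compile
```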





[GitHub] spark pull request #16177: [SPARK-17760][SQL] AnalysisException with datafra...

2016-12-06 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/16177

[SPARK-17760][SQL] AnalysisException with dataframe pivot when groupBy 
column is not attribute

## What changes were proposed in this pull request?

Fixes the AnalysisException for pivot queries whose group-by columns are expressions rather than attributes, by substituting the expression's output attribute in the second aggregation and the final projection.
## How was this patch tested?

existing and additional unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark SPARK-17760

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16177.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16177


commit 0924765dd939bb2a75c43265fa3d43b14916747c
Author: Andrew Ray 
Date:   2016-12-06T20:34:59Z

fix and unit test







[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...

2016-12-06 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16161
  
Right now it's not supported to have the following:
```
case class Foo(a: Map[Int, Int])
```
(using the scala Predef version of Map)

The [documented](http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types) way to do this is:
```
case class Foo(a: scala.collection.Map[Int, Int])
```
and it works fine.

However, if someone did not read the documentation carefully and used the former, instead of a reasonable error they get a compile error in the code Spark generates. Therefore IMHO there are two options:

1. Make the generated code work with either version of Map (this PR)
2. Intercept definitions that use the wrong version of Map and issue a 
proper error.

If the consensus is that this PR is not worth it then I'll be happy to work 
on option 2. But in my opinion as a Spark user option 1 is better.





[GitHub] spark pull request #16197: [SPARK-17760][SQL][Backport] AnalysisException wi...

2016-12-07 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/16197

[SPARK-17760][SQL][Backport] AnalysisException with dataframe pivot when 
groupBy column is not attribute

## What changes were proposed in this pull request?

Backport of #16177 to branch-2.0

## How was this patch tested?

existing and additional unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark SPARK-17760-2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16197.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16197


commit 05bb22533bb65dfc1314d067b9b78bcf786dbc72
Author: Andrew Ray 
Date:   2016-12-07T14:46:04Z

backport of patch to 2.0







[GitHub] spark pull request #16197: [SPARK-17760][SQL][Backport] AnalysisException wi...

2016-12-07 Thread aray
Github user aray closed the pull request at:

https://github.com/apache/spark/pull/16197





[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...

2016-12-07 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16161
  
I would be happy to create a separate PR adding support for `mutable.Map` (and `List`) if that is wanted. But there is no _generic_ solution, as there is no type that is assignable to both `mutable.Map` and `immutable.Map`.





[GitHub] spark pull request #16271: [SPARK-18845][GraphX] PageRank has incorrect init...

2016-12-13 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/16271

[SPARK-18845][GraphX] PageRank has incorrect initialization value that 
leads to slow convergence

## What changes were proposed in this pull request?

Change the initial value in all PageRank implementations to be `1.0` 
instead of `resetProb` (default `0.15`) and use `outerJoinVertices` instead of 
`joinVertices` so that source vertices get updated in each iteration. 

This seems to have been introduced a long time ago in 
https://github.com/apache/spark/commit/15a564598fe63003652b1e24527c432080b5976c#diff-b2bf3f97dcd2f19d61c921836159cda9L90

With the exception of graphs with sinks (which currently give incorrect results, see SPARK-18847) this gives faster convergence, as the sum of ranks is already correct (the sum of ranks should be the number of vertices).

Convergence comparison benchmark for a small graph: http://imgur.com/a/HkkZf
Code for benchmark: https://gist.github.com/aray/a7de1f3801a810f8b1fa00c271a1fefd
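
A hedged sketch of the convergence argument (plain Scala, not the GraphX code): in the sums-to-N variant used here the fixed point satisfies sum(ranks) == numVertices, so starting every rank at 1.0 starts at the correct total mass, while starting at resetProb (0.15) does not.

```scala
def pageRankStep(ranks: Map[Int, Double], links: Map[Int, Seq[Int]], resetProb: Double): Map[Int, Double] = {
  val contribs = links.toSeq.flatMap { case (src, dsts) =>
    dsts.map(dst => dst -> ranks(src) / dsts.size)           // spread rank over out-edges
  }.groupBy(_._1).map { case (v, cs) => v -> cs.map(_._2).sum }
  ranks.keys.map(v => v -> (resetProb + (1 - resetProb) * contribs.getOrElse(v, 0.0))).toMap
}

val links = Map(1 -> Seq(2), 2 -> Seq(3), 3 -> Seq(1))       // a 3-cycle, no sinks
val fromOnes  = Iterator.iterate(Map(1 -> 1.0,  2 -> 1.0,  3 -> 1.0 ))(pageRankStep(_, links, 0.15))
val fromReset = Iterator.iterate(Map(1 -> 0.15, 2 -> 0.15, 3 -> 0.15))(pageRankStep(_, links, 0.15))
println(fromOnes.drop(1).next())    // already at the fixed point: all ranks 1.0
println(fromReset.drop(1).next())   // all ranks 0.2775, still far from the fixed point
```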

## How was this patch tested?

(corrected) existing unit tests and additional test that verifies against 
result of igraph and NetworkX on a loop with a source.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark pagerank-initial-value

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16271.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16271


commit 9149ca233722e3d70aefb223dfd6a16ee8dbf924
Author: Andrew Ray 
Date:   2016-12-12T18:34:49Z

fix

commit b145376d88b6f5e58e2b9d051d9c268a36b9f939
Author: Andrew Ray 
Date:   2016-12-13T15:51:06Z

fix initial value for grid graph independent calculation

commit d39d2f07ab1a1aadb24dbd67bbbe37400beaadb4
Author: Andrew Ray 
Date:   2016-12-13T16:25:15Z

use outer join so that sources are updated and fix reset probability for 
personalized

commit 7ea03a88a3d9caa0ab7a7e6e681b8bf00b5cc128
Author: Andrew Ray 
Date:   2016-12-13T16:36:10Z

fix star page rank test to account for sources getting updated in the first 
iteration which then changes the center in the second iteration







[GitHub] spark issue #16271: [SPARK-18845][GraphX] PageRank has incorrect initializat...

2016-12-14 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16271
  
Updated the above benchmark code with a log-normal random graph on 10,000 vertices; the difference is much more drastic.
![](http://i.imgur.com/Zo56dEO.png)
(take the very bottom of the graph with a grain of salt, as it's in comparison to `g.pageRank(0.1)`; the actual error continues to drop)





[GitHub] spark issue #16271: [SPARK-18845][GraphX] PageRank has incorrect initializat...

2016-12-14 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16271
  
ping @srowen @dbtsai @rxin @ankurdave @jegonzal





[GitHub] spark issue #16271: [SPARK-18845][GraphX] PageRank has incorrect initializat...

2016-12-14 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16271
  
**References**
[Pagerank paper](http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf)
> We need to make an initial assignment of the ranks. This assignment can 
be made by one of several strategies. If it is going to iterate until 
convergence, in general the initial values will not affect final values, just 
the rate of convergence. But we can speed
up convergence by choosing a good initial assignment.

Since they are more focused on updating values for one evolving graph (the internet) they don't really talk about starting from scratch. But this does emphasize that there is no change to the answers, just the rate of convergence.

A more direct statement would be 
[Wikipedia](https://en.wikipedia.org/wiki/PageRank)
> PageRank is initialized to the same value for all pages. In the original 
form of PageRank, the sum of PageRank over all pages was the total number of 
pages on the web at that time, so each page in this example would have an 
initial value of 1.

Note that there are two variants of PageRank whose outputs differ by a constant multiple, depending on how the damping-factor term is applied; we use the version that sums to N (most other implementations use the other). More from Wikipedia:
>The difference between them is that the PageRank values in the first 
formula sum to one, while in the second formula each PageRank is multiplied by 
N and the sum becomes N.

Essentially, starting with the correct sum is closer to the actual fixed point and thus gets you faster convergence.
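
For reference, the two formulations being contrasted here (my summary of the Wikipedia description quoted above; d is the damping factor, N the number of vertices, In(u) the in-neighbors of u):

```latex
\text{sums to 1 (NetworkX):}\quad PR(u) = \frac{1-d}{N} + d \sum_{v \in \mathrm{In}(u)} \frac{PR(v)}{\mathrm{outdeg}(v)}

\text{sums to } N \text{ (this implementation):}\quad PR(u) = (1-d) + d \sum_{v \in \mathrm{In}(u)} \frac{PR(v)}{\mathrm{outdeg}(v)}
```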

The [NetworkX 
implementation](https://github.com/networkx/networkx/blob/master/networkx/algorithms/link_analysis/pagerank_alg.py#L122)
 uses the variant that sums to 1 hence their initialization values are all 1/N. 

igraph is unfortunately not comparable as they use a [more complex linear 
solver 
approach](https://github.com/igraph/igraph/blob/master/src/prpack/prpack_solver.cpp)

Additional credentials (if it matters): PhD Mathematics with dissertation 
in Graph Theory





[GitHub] spark pull request #16240: [SPARK-16792][SQL] Dataset containing a Case Clas...

2016-12-14 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/16240#discussion_r92546082
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala ---
@@ -100,31 +100,76 @@ abstract class SQLImplicits {
   // Seqs
 
   /** @since 1.6.1 */
-  implicit def newIntSeqEncoder: Encoder[Seq[Int]] = ExpressionEncoder()
+  implicit def newIntSeqEncoder[T <: Seq[Int] : TypeTag]: Encoder[T] = ExpressionEncoder()
 
   /** @since 1.6.1 */
-  implicit def newLongSeqEncoder: Encoder[Seq[Long]] = ExpressionEncoder()
+  implicit def newLongSeqEncoder[T <: Seq[Long] : TypeTag]: Encoder[T] = ExpressionEncoder()
 
   /** @since 1.6.1 */
-  implicit def newDoubleSeqEncoder: Encoder[Seq[Double]] = ExpressionEncoder()
+  implicit def newDoubleSeqEncoder[T <: Seq[Double] : TypeTag]: Encoder[T] = ExpressionEncoder()
 
   /** @since 1.6.1 */
-  implicit def newFloatSeqEncoder: Encoder[Seq[Float]] = ExpressionEncoder()
+  implicit def newFloatSeqEncoder[T <: Seq[Float] : TypeTag]: Encoder[T] = ExpressionEncoder()
 
   /** @since 1.6.1 */
-  implicit def newByteSeqEncoder: Encoder[Seq[Byte]] = ExpressionEncoder()
+  implicit def newByteSeqEncoder[T <: Seq[Byte] : TypeTag]: Encoder[T] = ExpressionEncoder()
 
   /** @since 1.6.1 */
-  implicit def newShortSeqEncoder: Encoder[Seq[Short]] = ExpressionEncoder()
+  implicit def newShortSeqEncoder[T <: Seq[Short] : TypeTag]: Encoder[T] = ExpressionEncoder()
 
   /** @since 1.6.1 */
-  implicit def newBooleanSeqEncoder: Encoder[Seq[Boolean]] = ExpressionEncoder()
+  implicit def newBooleanSeqEncoder[T <: Seq[Boolean] : TypeTag]: Encoder[T] = ExpressionEncoder()
 
   /** @since 1.6.1 */
-  implicit def newStringSeqEncoder: Encoder[Seq[String]] = ExpressionEncoder()
+  implicit def newStringSeqEncoder[T <: Seq[String] : TypeTag]: Encoder[T] = ExpressionEncoder()
 
   /** @since 1.6.1 */
-  implicit def newProductSeqEncoder[A <: Product : TypeTag]: Encoder[Seq[A]] = ExpressionEncoder()
+  implicit def newProductSeqEncoder[A <: Product : TypeTag, T <: Seq[A] : TypeTag]: Encoder[T] =
+    ExpressionEncoder()
+
+  // Seqs with product (List) disambiguation
+
+  /** @since 2.2.0 */
+  implicit def newIntSeqWithProductEncoder[T <: Seq[Int] with Product : TypeTag]: Encoder[T] =
+    newIntSeqEncoder
+
+  /** @since 2.2.0 */
+  implicit def newLongSeqWithProductEncoder[T <: Seq[Long] with Product : TypeTag]: Encoder[T] =
+    newLongSeqEncoder
+
+  /** @since 2.2.0 */
+  implicit def newDoubleListEncoder[T <: Seq[Double] with Product : TypeTag]: Encoder[T] =
--- End diff --

Should this be `newDoubleSeqWithProductEncoder`?





[GitHub] spark pull request #16271: [SPARK-18845][GraphX] PageRank has incorrect init...

2016-12-15 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/16271#discussion_r92621591
  
--- Diff: graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala ---
@@ -70,10 +70,10 @@ class PageRankSuite extends SparkFunSuite with 
LocalSparkContext {
   val resetProb = 0.15
   val errorTol = 1.0e-5
 
-  val staticRanks1 = starGraph.staticPageRank(numIter = 1, resetProb).vertices
-  val staticRanks2 = starGraph.staticPageRank(numIter = 2, resetProb).vertices.cache()
+  val staticRanks1 = starGraph.staticPageRank(numIter = 2, resetProb).vertices
--- End diff --

Not really more robust, since it has a sink and thus is still wrong pending SPARK-18847. But it is needed with this change in order to fully propagate the change in rank of source vertices in the first iteration, as explained above.





[GitHub] spark issue #16271: [SPARK-18845][GraphX] PageRank has incorrect initializat...

2016-12-15 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16271
  
Yes the improvement is from the sum of magnitudes of initial values being 
closer to the (known) sum of the solution. Fiddling with resetProb controls a 
completely different thing. The current implementation has no advantage 
(excluding finding the incorrect solution to a star graph one iteration faster).





[GitHub] spark pull request: [SPARK-13221] [SQL] Fixing GroupingSets when A...

2016-02-11 Thread aray
Github user aray commented on the pull request:

https://github.com/apache/spark/pull/11100#issuecomment-182891207
  
LGTM





[GitHub] spark pull request: [SPARK-8718] [GRAPHX] Improve EdgePartition2D ...

2015-06-29 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/7104

[SPARK-8718] [GRAPHX] Improve EdgePartition2D for non perfect square number 
of partitions

See https://github.com/aray/e2d/blob/master/EdgePartition2D.ipynb

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark edge-partition-2d-improvement

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7104.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7104


commit cfa2c5ece11fa379e9c82d4e48d45192387c4859
Author: Andrew Ray 
Date:   2015-05-12T04:11:32Z

Modifications to EdgePartition2D so that it works for non perfect squares.

commit f006364bde37d0b87667227a68ab5e3567276fe6
Author: Andrew Ray 
Date:   2015-05-12T04:29:35Z

remove unneeded comments

commit 3560084686753dac11bb22babb32c8918770fe5a
Author: Andrew Ray 
Date:   2015-06-01T19:07:19Z

Merge branch 'master' into edge-partition-2d-improvement

commit 5d42105891acb37c228d4cc4d42e29021645da63
Author: Andrew Ray 
Date:   2015-06-01T21:16:42Z

% -> /

commit 001bfd09d186bf9647ae49b9cb71a3e21eee1579
Author: Andrew Ray 
Date:   2015-06-02T16:13:38Z

Refactor PartitionStrategy so that we can return a prtition function for a 
given number of parts.
To keep compatibility we define default methods that translate between the 
two implementation options.
Made EdgePartition2D use old strategy when we have a perfect square and 
implement new interface.

commit 925fd2cf0fabbe897f6cda4c9625aa14a6915e4f
Author: Andrew Ray 
Date:   2015-06-02T21:08:50Z

use new interface for partitioning

commit 5141ab450c1eb25553fde06a8afa7a3212737482
Author: Andrew Ray 
Date:   2015-06-27T03:52:58Z

Merge branch 'master' into edge-partition-2d-improvement

commit 97f8464a4055b262a6449bf9932ed7f8f407bd07
Author: Andrew Ray 
Date:   2015-06-27T05:37:14Z

change less

commit 3729f840849d26459edd5bcbb94d865338170142
Author: Andrew Ray 
Date:   2015-06-27T12:51:22Z

correct bounds and remove unneeded comments







[GitHub] spark pull request: [SPARK-8718] [GRAPHX] Improve EdgePartition2D ...

2015-06-29 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/7104#discussion_r33529689
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/PartitionStrategy.scala ---
@@ -32,7 +32,7 @@ trait PartitionStrategy extends Serializable {
 object PartitionStrategy {
   /**
* Assigns edges to partitions using a 2D partitioning of the sparse 
edge adjacency matrix,
-   * guaranteeing a `2 * sqrt(numParts) - 1` bound on vertex replication.
+   * guaranteeing a `2 * sqrt(numParts)` bound on vertex replication.
--- End diff --

Note I'm changing the bound given in the doc since it was never correct for non-perfect squares.
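
For context, a hedged sketch of where a bound of this shape comes from (the generic 2D-partitioning argument, not necessarily the exact proof for this implementation): with partitions arranged in an r x c grid and each edge assigned by its source's row and destination's column, a vertex is only ever replicated across one row plus one column, so

```latex
\text{replication}(v) \;\le\; r + c - 1, \qquad r = c = \sqrt{\text{numParts}} \;\Rightarrow\; 2\sqrt{\text{numParts}} - 1 .
```

For a non-perfect-square numParts the grid sides cannot both equal sqrt(numParts), which is why the old `2 * sqrt(numParts) - 1` figure was never correct there and the doc now states the slightly looser `2 * sqrt(numParts)` bound.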





[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-07-31 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/7841

[SPARK-8992] [SQL] Add pivot to dataframe api

This adds a pivot method to the DataFrame API.

Following the lead of cube and rollup this adds a Pivot operator that is 
translated into an Aggregate by the analyzer.

Currently the syntax is like:

courseSales.pivot(Seq($"year"), $"course", Seq("dotNET", "Java"), sum($"earnings"))

Would we be interested in the following syntax also/alternatively?

courseSales.groupBy($"year").pivot($"course", "dotNET", "Java").agg(sum($"earnings"))

Later we can add it to `SQLParser`, but as Hive doesn't support it we can't add it there, right?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark sql-pivot

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7841.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7841


commit 599e9e0b9bd46be798da1274e9ba9839151b2aaf
Author: Andrew Ray 
Date:   2015-07-29T21:05:21Z

Add pivot to dataframe api







[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-10-22 Thread aray
Github user aray commented on the pull request:

https://github.com/apache/spark/pull/7841#issuecomment-150464038
  
@rxin and @JoshRosen, this is ready for review now.





[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-10-23 Thread aray
Github user aray commented on the pull request:

https://github.com/apache/spark/pull/7841#issuecomment-150620321
  
@rxin here is my summary of other frameworks' APIs.

I'm going to use an example dataset from the pandas docs for all the examples (as df)

|A|B|C|D|
|---|---|---|---|
|foo|one|small|1|
|foo|one|large|2|
|foo|one|large|2|
|foo|two|small|3|
|foo|two|small|3|
|bar|one|large|4|
|bar|one|small|5|
|bar|two|small|6|
|bar|two|large|7|

This API


```scala
scala> df.groupBy("A", "B").pivot("C", "small", "large").sum("D").show
+---+---+-----+-----+
|  A|  B|small|large|
+---+---+-----+-----+
|foo|two|    6| null|
|bar|two|    6|    7|
|foo|one|    1|    4|
|bar|one|    5|    4|
+---+---+-----+-----+

scala> df.groupBy("A", "B").pivot("C", "small", "large").agg(sum("D"), avg("D")).show
+---+---+------------+------------+------------+------------+
|  A|  B|small sum(D)|small avg(D)|large sum(D)|large avg(D)|
+---+---+------------+------------+------------+------------+
|foo|two|           6|         3.0|        null|        null|
|bar|two|           6|         6.0|           7|         7.0|
|foo|one|           1|         1.0|           4|         2.0|
|bar|one|           5|         5.0|           4|         4.0|
+---+---+------------+------------+------------+------------+

scala> df.pivot(Seq($"A", $"B"), $"C", Seq("small", "large"), sum($"D")).show
+---+---+-----+-----+
|  A|  B|small|large|
+---+---+-----+-----+
|foo|two|    6| null|
|bar|two|    6|    7|
|foo|one|    1|    4|
|bar|one|    5|    4|
+---+---+-----+-----+
```

We require a list of values for the pivot column as we are required to know 
the output columns of the operator ahead of time. Pandas and reshape2 do not 
require this but the comparable SQL operators do. We also allow multiple 
aggregations which not all implementations allow.

pandas
--

The comparable method for pandas is `pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True)`

Example

```python
>>> pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=np.sum)
          small  large
foo  one      1      4
     two      6    NaN
bar  one      5      4
     two      6      7
```

Pandas also allows multiple aggregations:

```python
>>> pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=[np.sum, np.average])
          sum         average
C       large small   large small
A   B
bar one     4     5       4     5
    two     7     6       7     6
foo one     4     1       2     1
    two   NaN     6     NaN     3
```

References

- http://pandas.pydata.org/pandas-docs/stable/reshaping.html
- 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html

See also: `pivot`, `stack`, `unstack`.

reshape2 (R)

The comparable method for reshape2 is `dcast(data, formula, fun.aggregate = 
NULL, ..., margins = NULL, subset = NULL, fill = NULL, drop = TRUE, value.var = 
guess_value(data))`

```r
> dcast(df, A + B ~ C, sum)
Using D as value column: use value.var to override.
    A   B large small
1 bar one     4     5
2 bar two     7     6
3 foo one     4     1
4 foo two     0     6
```

Note that by default cast fills with the value from applying fun.aggregate to a zero-length vector.

References

- https://cran.r-project.org/web/packages/reshape2/reshape2.pdf
- http://seananderson.ca/2013/10/19/reshape.html
- http://www.inside-r.org/packages/cran/reshape2/docs/cast

See also: `melt`.

MS SQL Server
--

```sql
SELECT *
FROM df
pivot (sum(D) for C in ([small], [large])) p
```

http://sqlfiddle.com/#!3/cf887/3/0

References

- http://sqlhints.com/2014/03/10/pivot-and-unpivot-in-sql-server/


Oracle 11g
--

```sql
SELECT *
FROM df
pivot (sum(D) for C in ('small', 'large')) p
```
http://sqlfiddle.com/#!4/29bc5/3/0

Oracle also allows multiple aggregations, with output similar to this API

```sql
SELECT *
FROM df
pivot (sum(D) as sum, avg(D) as avg for C in ('small', 'large')) p
```
ht

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-10-23 Thread aray
Github user aray commented on the pull request:

https://github.com/apache/spark/pull/7841#issuecomment-150745807
  
@rxin, not requiring the values would necessitate doing a separate query for the distinct values of the column before the pivot query. It looks like at least some DF operations (e.g., drop) would need the result, so even if we made Pivot.output lazy we would be running an unexpected job.

If a user really didn't want to specify the values, they can explicitly do 
the query:

```scala
df.groupBy("A", "B").pivot("C", 
df.select("C").distinct.collect.map(_.getString(0)): _*).sum("D")
```

Needing to know the output columns of an operator for analysis/planning is 
probably why the other SQL implementations require the values also (technically 
Oracle supports omitting it but only in XML mode where you essentially just get 
one column).





[GitHub] spark pull request: [SPARK-11275][SQL] Reimplement Expand as a Gen...

2015-11-02 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/9429

[SPARK-11275][SQL] Reimplement Expand as a Generator and fix existing 
implementation bugs

This is an alternative to https://github.com/apache/spark/pull/9419

I got tired of fighting/fixing the bugs in the existing implementation of cube/rollup/grouping sets, specifically around the Expand operator, so I reimplemented it as a Generator. I think this makes for a cleaner implementation. I also added unit tests that show this implementation solves SPARK-11275.

I look forward to your comments!

cc: @rxin @marmbrus @gatorsmile @rick-ibm @hvanhovell @chenghao-intel 
@holdenk 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark SPARK-11275

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9429.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9429


commit fad28d6126187f88b473fc35692248ad2cb00748
Author: Andrew Ray 
Date:   2015-11-02T18:10:02Z

Reimplement Expand as a Generator
- added unit tests for cube and rollup that actualy check the result
- fixed bugs present in previous implementation of cube/rollup/groupingsets 
(SPARK-11275)

commit e4636791da3367ee6fcb371f2fce029f3b2e8a3e
Author: Andrew Ray 
Date:   2015-11-03T06:06:41Z

newline at end of generators.scala







[GitHub] spark pull request: [SPARK-11275][SQL] Reimplement Expand as a Gen...

2015-11-03 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/9429#discussion_r43810369
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -205,45 +205,30 @@ class Analyzer(
 GroupingSets(bitmasks(a), a.groupByExprs, a.child, a.aggregations)
   case x: GroupingSets =>
 val gid = AttributeReference(VirtualColumn.groupingIdName, 
IntegerType, false)()
-// We will insert another Projection if the GROUP BY keys contains 
the
-// non-attribute expressions. And the top operators can references 
those
-// expressions by its alias.
-// e.g. SELECT key%5 as c1 FROM src GROUP BY key%5 ==>
-//  SELECT a as c1 FROM (SELECT key%5 AS a FROM src) GROUP BY a
-
-// find all of the non-attribute expressions in the GROUP BY keys
-val nonAttributeGroupByExpressions = new ArrayBuffer[Alias]()
-
-// The pair of (the original GROUP BY key, associated attribute)
-val groupByExprPairs = x.groupByExprs.map(_ match {
-  case e: NamedExpression => (e, e.toAttribute)
-  case other => {
-val alias = Alias(other, other.toString)()
-nonAttributeGroupByExpressions += alias // add the 
non-attributes expression alias
-(other, alias.toAttribute)
-  }
-})
 
-// substitute the non-attribute expressions for aggregations.
-val aggregation = x.aggregations.map(expr => expr.transformDown {
-  case e => 
groupByExprPairs.find(_._1.semanticEquals(e)).map(_._2).getOrElse(e)
-}.asInstanceOf[NamedExpression])
+val aliasedGroupByExprPairs = x.groupByExprs.map{
+  case a @ Alias(expr, _) => (expr, a)
+  case expr: NamedExpression => (expr, Alias(expr, expr.name)())
+  case expr => (expr, Alias(expr, expr.prettyString)())
+}
 
-// substitute the group by expressions.
-val newGroupByExprs = groupByExprPairs.map(_._2)
+val aliasedGroupByExprs = aliasedGroupByExprPairs.map(_._2)
+val aliasedGroupByAttr = aliasedGroupByExprs.map(_.toAttribute)
 
-val child = if (nonAttributeGroupByExpressions.length > 0) {
-  // insert additional projection if contains the
-  // non-attribute expressions in the GROUP BY keys
-  Project(x.child.output ++ nonAttributeGroupByExpressions, 
x.child)
-} else {
-  x.child
+// substitute group by expressions in aggregation list with 
appropriate attribute
+val aggregations = x.aggregations.map{
--- End diff --

@chenghao-intel actually that change would bring back the bug in question, 
since it would do the substitutions in situations like the one below, and the 
aggregations would then be computed off the manipulated (nulls-inserted) values.
```sql
select a + b, c, sum(a+b) + count(c)
from t1
group by a + b, c with rollup
```
In general, anything below an AggregateExpression we don't want to 
transform, but anything above it we do, so really I need a transformDownUntil 
method. BTW, making this change does fix the `groupby_grouping_sets1` test, so 
I really do need to do something.
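
As a sketch of the `transformDownUntil` idea mentioned above, something like the following hypothetical helper (not an actual Catalyst API) captures the intent: rewrite top-down, but stop descending once a boundary predicate matches.

```scala
// Hypothetical helper, not part of Catalyst: apply the rewrite rule top-down,
// but stop descending as soon as `stop` matches (e.g. at an
// AggregateExpression), leaving that subtree untouched.
import org.apache.spark.sql.catalyst.expressions.Expression

def transformDownUntil(expr: Expression)(stop: Expression => Boolean)(
    rule: PartialFunction[Expression, Expression]): Expression = {
  if (stop(expr)) {
    expr // do not rewrite anything at or below the boundary
  } else {
    val rewritten = rule.applyOrElse(expr, (e: Expression) => e)
    rewritten.mapChildren(transformDownUntil(_)(stop)(rule))
  }
}
```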





[GitHub] spark pull request: [SPARK-11275][SQL] Reimplement Expand as a Gen...

2015-11-03 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/9429#discussion_r43811041
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -205,45 +205,30 @@ class Analyzer(
 GroupingSets(bitmasks(a), a.groupByExprs, a.child, a.aggregations)
   case x: GroupingSets =>
 val gid = AttributeReference(VirtualColumn.groupingIdName, 
IntegerType, false)()
-// We will insert another Projection if the GROUP BY keys contains 
the
-// non-attribute expressions. And the top operators can references 
those
-// expressions by its alias.
-// e.g. SELECT key%5 as c1 FROM src GROUP BY key%5 ==>
-//  SELECT a as c1 FROM (SELECT key%5 AS a FROM src) GROUP BY a
-
-// find all of the non-attribute expressions in the GROUP BY keys
-val nonAttributeGroupByExpressions = new ArrayBuffer[Alias]()
-
-// The pair of (the original GROUP BY key, associated attribute)
-val groupByExprPairs = x.groupByExprs.map(_ match {
-  case e: NamedExpression => (e, e.toAttribute)
-  case other => {
-val alias = Alias(other, other.toString)()
-nonAttributeGroupByExpressions += alias // add the 
non-attributes expression alias
-(other, alias.toAttribute)
-  }
-})
 
-// substitute the non-attribute expressions for aggregations.
-val aggregation = x.aggregations.map(expr => expr.transformDown {
-  case e => 
groupByExprPairs.find(_._1.semanticEquals(e)).map(_._2).getOrElse(e)
-}.asInstanceOf[NamedExpression])
+val aliasedGroupByExprPairs = x.groupByExprs.map{
+  case a @ Alias(expr, _) => (expr, a)
+  case expr: NamedExpression => (expr, Alias(expr, expr.name)())
--- End diff --

I believe I need a new Alias here since we really have two versions of the 
expression -- the original and the version manipulated by the Generator with 
nulls inserted per the bitmask. In the Aggregate 'aggregation' list the 
grouping columns need to refer to the manipulated version and 'real' aggregates 
need to refer to the original version.





[GitHub] spark issue #21699: [SPARK-24722][SQL] pivot() with Column type argument

2018-07-03 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/21699
  
Using either `Column` or `String` type was actually in my original PR: 
https://github.com/apache/spark/pull/7841
@rxin later modified the API to only take a `String` prior to the release, 
as part of an API audit: https://github.com/apache/spark/pull/9929

Considering you can just make a call to `withColumn` first, I'm not really 
convinced of the utility of this PR.
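
A minimal sketch of the `withColumn` workaround referred to above; the column names and the `lower(...)` expression are purely illustrative.

```scala
// Materialize the desired Column expression under a name first, then pivot on
// that (string-named) column.
import org.apache.spark.sql.functions.{col, lower}

df.withColumn("c_normalized", lower(col("c")))
  .groupBy("a")
  .pivot("c_normalized")
  .count()
```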





[GitHub] spark pull request #17226: [SPARK-19882][SQL] Pivot with null as a distinct ...

2017-03-09 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/17226

[SPARK-19882][SQL] Pivot with null as a distinct pivot value throws NPE

## What changes were proposed in this pull request?

Allows null values of the pivot column to be included in the pivot values 
list without throwing an NPE.

Note this PR was made as an alternative to #17224 but preserves the two-phase 
aggregate operation that is needed for good performance.

## How was this patch tested?

Additional unit test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark pivot-null

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17226.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17226


commit 18b74a7520ac3a019cfb6ceb993a284677178c31
Author: Andrew Ray 
Date:   2017-03-09T18:56:25Z

fix







[GitHub] spark pull request #17226: [SPARK-19882][SQL] Pivot with null as a distinct ...

2017-03-09 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/17226#discussion_r105322758
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala ---
@@ -216,4 +216,10 @@ class DataFramePivotSuite extends QueryTest with 
SharedSQLContext{
   Row("d", 15000.0, 48000.0) :: Row("J", 2.0, 3.0) :: Nil
 )
   }
+
+  test("pivot with null should not throw NPE") {
+checkAnswer(
+  Seq(Tuple1(None), 
Tuple1(Some(1))).toDF("a").groupBy($"a").pivot("a").count(),
+  Row(null, 1, null) :: Row(1, null, 1) :: Nil)
--- End diff --

Right, the non-optimized codepath should have been doing a null-safe equals 
in the if statement. I have fixed that in a81c062 and added a unit test.

As to whether an aggregate function of count(1) in a pivot should fill 0's 
for null, I think that is an orthogonal issue. First, note that it will 
always* follow the optimized codepath, as the choice is based on the return type 
of the aggregate. Second, it's not clear that that is the expected result; for 
instance pandas leaves those values as null and Oracle 11g gives 0 (I still need 
to check R/reshape2 and MS SQL Server). I think it would be best to open 
another JIRA ticket to discuss this further.

* unless there are multiple aggregates and one of them is not supported, 
which is a consistency problem.





[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/17226
  
@HyukjinKwon As stated in 17226#discussion_r105322758 I think we should 
open a second JIRA to have the discussion on whether or not count(1) of no 
values in a pivot should be filled with 0's.





[GitHub] spark pull request #17226: [SPARK-19882][SQL] Pivot with null as a distinct ...

2017-03-09 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/17226#discussion_r105324124
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -522,7 +522,7 @@ class Analyzer(
 } else {
   val pivotAggregates: Seq[NamedExpression] = pivotValues.flatMap 
{ value =>
 def ifExpr(expr: Expression) = {
-  If(EqualTo(pivotColumn, value), expr, Literal(null))
+  If(EqualNullSafe(pivotColumn, value), expr, Literal(null))
--- End diff --

We need to use null-safe equality regardless of the aggregate, as this is for 
comparison to each of the pivot values (one of which may be null).
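
A small illustration of the point above (assuming a SparkSession named `spark` with its implicits imported; the data mirrors the test in this PR): plain equality against null yields null for every row, while the null-safe `<=>` is true exactly where the column itself is null.

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{col, lit}

val df = Seq(Tuple1(None), Tuple1(Some(1))).toDF("a")
df.select(
  (col("a") === lit(null)).as("equal_to_null"),   // null for every row
  (col("a") <=> lit(null)).as("null_safe_equal")  // true only where a is null
).show()
```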





[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/17226
  
There are three things going on here in your one example.

1. Spark 1.6 [first version with pivot] (and Spark 2.0+ with an aggregate 
output type unsupported by PivotFirst) gives incorrect answers when one of 
the pivot column values is null (this only affects the 'null' column); it is 
fixed by doing a null-safe equals in the injected if statement: 
https://github.com/apache/spark/pull/17226/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R525

2. Spark 2.0+ with PivotFirst gives an NPE when one of the pivot column 
values is null. This is the main thing fixed in this PR.

3. There is an inconsistency between Spark 1.6 and 2.0+ on the result of a 
pivot with a `count(1)` aggregate when no values are aggregated for a cell. 
This is separate from the issues above and it's not clear which version is 
naturally correct (pandas leaves those values as null, Oracle 11g gives 0, and 
I need to test others).





[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/17226
  
BTW for 3 above if we decide it should be 0, we can add an initial value 
for `PivotFirst` to make the fix.





[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/17226
  
@HyukjinKwon we're not introducing a regression in this PR by fixing the 
NPE; the answer given by 1.6 was incorrect under any interpretation. Again, 
there is a completely separate issue of what the proper value of count(1) over 
no values should be in a pivot, and that issue does not depend at all on nulls 
in the pivot column.





[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/17226
  
@HyukjinKwon There is an inconsistency/regression but its not being 
introduced in this PR, its already there. Take an example without null as a 
pivot column value like below. The only difference is for the `count(1)` 
aggregate on cells with no values aggregated in the pivot table. Again I don't 
think it's clear which is "correct" here.

**Spark 2.1.0** (and presumably 2.0.x/master)
```
scala> Seq(1,2).toDF("a").groupBy("a").pivot("a").count().show
+---+++
|  a|   1|   2|
+---+++
|  1|   1|null|
|  2|null|   1|
+---+++
scala> Seq(1,2).toDF("a").groupBy("a").pivot("a").sum("a").show
+---+++
|  a|   1|   2|
+---+++
|  1|   1|null|
|  2|null|   2|
+---+++
```

**Spark 1.6.0**
```
scala> 
sc.parallelize(Seq(1,2)).toDF("a").groupBy("a").pivot("a").count().show
+---+---+---+
|  a|  1|  2|
+---+---+---+
|  1|  1|  0|
|  2|  0|  1|
+---+---+---+
scala> 
sc.parallelize(Seq(1,2)).toDF("a").groupBy("a").pivot("a").sum("a").show
+---+++
|  a|   1|   2|
+---+++
|  1|   1|null|
|  2|null|   2|
+---+++
```





[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-03-16 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16483
  
@rxin can anyone else review this? It would be nice to get this correctness 
fix into 2.2.





[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-16 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/16483#discussion_r106546448
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala 
---
@@ -322,13 +335,12 @@ object PageRank extends Logging {
 def personalizedVertexProgram(id: VertexId, attr: (Double, Double),
   msgSum: Double): (Double, Double) = {
   val (oldPR, lastDelta) = attr
-  var teleport = oldPR
-  val delta = if (src==id) resetProb else 0.0
-  teleport = oldPR*delta
-
-  val newPR = teleport + (1.0 - resetProb) * msgSum
-  val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else 
newPR - oldPR
-  (newPR, newDelta)
+  val newPR = if (lastDelta == Double.NegativeInfinity) {
--- End diff --

I'm guessing you mean the `if (src==id)` check? I'm honestly not sure what 
was going on with this code; it's just wrong. The results do not match up with 
igraph/networkx at all. Furthermore, the code is nonsensical: it defines 
`var teleport = oldPR` and then unconditionally overwrites it two lines down 
with `teleport = oldPR*delta`, without the initial value ever being used.

This revised implementation is much easier to follow and is now tested 
against 3 sets of reference values computed by igraph/networkx. Please let me 
know if you think I'm missing something.
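
A conceptual sketch of the update this comment argues for (not the PR's exact code): in personalized PageRank only the personalization source receives the teleport mass.

```scala
// id: the vertex being updated, src: the personalization source,
// msgSum: the incoming rank mass for this iteration.
def personalizedRank(id: Long, src: Long, msgSum: Double, resetProb: Double): Double = {
  val teleport = if (id == src) resetProb else 0.0
  teleport + (1.0 - resetProb) * msgSum
}
```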





[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-16 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/16483#discussion_r106548090
  
--- Diff: 
graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala ---
@@ -68,26 +69,34 @@ class PageRankSuite extends SparkFunSuite with 
LocalSparkContext {
   val nVertices = 100
   val starGraph = GraphGenerators.starGraph(sc, nVertices).cache()
   val resetProb = 0.15
+  val tol = 0.0001
+  val numIter = 2
   val errorTol = 1.0e-5
 
-  val staticRanks1 = starGraph.staticPageRank(numIter = 2, 
resetProb).vertices
-  val staticRanks2 = starGraph.staticPageRank(numIter = 3, 
resetProb).vertices.cache()
+  val staticRanks = starGraph.staticPageRank(numIter, 
resetProb).vertices.cache()
+  val staticRanks2 = starGraph.staticPageRank(numIter + 1, 
resetProb).vertices
 
-  // Static PageRank should only take 3 iterations to converge
-  val notMatching = staticRanks1.innerZipJoin(staticRanks2) { (vid, 
pr1, pr2) =>
+  // Static PageRank should only take 2 iterations to converge
--- End diff --

It didn't change; we're still comparing the output of the 2nd and 3rd 
iterations.





[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-03-16 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16483
  
@thunterdb The extra step -- as implemented -- is only at the end as that 
gives the same result as doing it after every iteration but without the extra 
overhead.





[GitHub] spark issue #17348: [SPARK-20018][SQL] Pivot with timestamp and count should...

2017-03-19 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/17348
  
LGTM





[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-01-05 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/16483

[SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks

## What changes were proposed in this pull request?

Graphs with sinks (vertices with no outgoing edges) don't have the expected 
rank sum of n (or 1 for personalized). We fix this by normalizing to the 
expected sum at the end of each implementation.
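
A minimal sketch of that final normalization step; `ranks` (the VertexRDD produced by an implementation) and `expectedSum` (n, or 1.0 for the personalized variant) are assumed inputs, not names from the patch itself.

```scala
import org.apache.spark.graphx.VertexRDD

def normalize(ranks: VertexRDD[Double], expectedSum: Double): VertexRDD[Double] = {
  val rankSum = ranks.values.sum()              // actual total rank, reduced by sinks
  ranks.mapValues(r => r * expectedSum / rankSum)
}
```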

Additionally, this fixes the dynamic version of personalized PageRank, which 
gave incorrect answers that were not detected by existing unit tests.

## How was this patch tested?

Revamped the existing unit tests and added new ones, with reference values 
(and reproduction code) from igraph and NetworkX.

Note that for comparison on personalized PageRank we use the arpack algorithm 
in igraph, since prpack (the current default) redistributes rank to all vertices 
uniformly instead of just to the personalization source. We could take the 
alternative convention (redistribute rank to all vertices uniformly), but that 
would involve more extensive changes to the algorithms (the dynamic version 
would no longer be able to use Pregel).


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark pagerank-sink2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16483.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16483


commit 41178a3f7310eb30b2eebbac1fac532196ad3432
Author: Andrew Ray 
Date:   2017-01-06T06:24:21Z

page rank sink fixes and unit tests







[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-01-06 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16483
  
ping @srowen @ankurdave can you take a look at this?





[GitHub] spark pull request #16539: [SPARK-8855][MLlib][PySpark] Python API for Assoc...

2017-01-10 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/16539

[SPARK-8855][MLlib][PySpark] Python API for Association Rules

## What changes were proposed in this pull request?

This patch adds a `generateAssociationRules(confidence)` method to 
`FPGrowthModel` for feature parity with the Scala and Java API.
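
For reference, a minimal sketch of the Scala counterpart the Python wrapper mirrors, assuming a SparkContext named `sc` and a toy transaction RDD (the items and thresholds are illustrative).

```scala
import org.apache.spark.mllib.fpm.FPGrowth

val transactions = sc.parallelize(Seq(
  Array("a", "b"),
  Array("a", "b", "c"),
  Array("a", "c")))
val model = new FPGrowth().setMinSupport(0.5).run(transactions)
// Rules whose confidence is at least 0.8
model.generateAssociationRules(0.8).collect().foreach { rule =>
  println(s"${rule.antecedent.mkString(",")} => ${rule.consequent.mkString(",")}: ${rule.confidence}")
}
```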

I will do a follow up for website documentation.

## How was this patch tested?

Doctest


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark py-association-rules

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16539.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16539


commit 9b783cf2ea21c086d6ebf3316e300d859788a723
Author: Andrew Ray 
Date:   2017-01-10T22:46:25Z

python wrapper for association rules







[GitHub] spark issue #16555: [SPARK-19180][SQL] the offset of short should be 4 in Of...

2017-01-12 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16555
  
The title should say 2.





[GitHub] spark issue #16559: [WIP] Add expression index and test cases

2017-01-12 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16559
  
It can already be done with the `posexplode` UDTF, like this:
```sql
with t as (values (array(1,2,3)), (array(4,5,6)) as (a))
select col from t lateral view posexplode(a) tt where pos = 2
```





[GitHub] spark pull request #16577: [SPARK-19214][SQL] Typed aggregate count output f...

2017-01-13 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/16577

[SPARK-19214][SQL] Typed aggregate count output field name should be "count"

## What changes were proposed in this pull request?

Changes the output field name of typed aggregate counts to be `count` 
(instead of `count(1)`) for consistency with the DataFrame API.
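
A minimal sketch of the behavior this targets (assuming a SparkSession named `spark` with `spark.implicits._` available; data is illustrative): after the change the aggregate column should be named "count" rather than "count(1)".

```scala
import spark.implicits._
import org.apache.spark.sql.expressions.scalalang.typed

val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()
val counts = ds.groupByKey(_._1).agg(typed.count(_._2))
counts.printSchema()  // count column: "count(1)" before this patch, "count" after
```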

## How was this patch tested?

unit test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark typed-count-name

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16577.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16577


commit 37319fe312b37f67078e41e3621171893aa20a92
Author: Andrew Ray 
Date:   2017-01-13T15:20:25Z

typed aggregate count output field name should be "count"







[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-01-17 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/16483
  
@rxin can you take a look?





[GitHub] spark pull request #16539: [SPARK-8855][MLlib][PySpark] Python API for Assoc...

2017-01-20 Thread aray
Github user aray closed the pull request at:

https://github.com/apache/spark/pull/16539





[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-20 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/15415#discussion_r97166816
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
@@ -0,0 +1,251 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.fpm.{FPGrowth => MLlibFPGrowth, 
FPGrowthModel => MLlibFPGrowthModel}
+import org.apache.spark.sql.{DataFrame, _}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
+
+/**
+ * Common params for FPGrowth and FPGrowthModel
+ */
+private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with 
HasPredictionCol {
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new 
ArrayType(StringType, false))
+SchemaUtils.appendColumn(schema, $(predictionCol), new 
ArrayType(StringType, false))
+  }
+
+  /**
+   * Minimal support level of the frequent pattern. [0.0, 1.0]. Any 
pattern that appears
+   * more than (minSupport * size-of-the-dataset) times will be output
+   * Default: 0.3
+   * @group param
+   */
+  @Since("2.2.0")
+  val minSupport: DoubleParam = new DoubleParam(this, "minSupport",
+"the minimal support level of the frequent pattern (Default: 0.3)",
+ParamValidators.inRange(0.0, 1.0))
+  setDefault(minSupport -> 0.3)
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getMinSupport: Double = $(minSupport)
+
+  /**
+   * Number of partitions used by parallel FP-growth
+   * @group param
+   */
+  @Since("2.2.0")
+  val numPartitions: IntParam = new IntParam(this, "numPartitions",
+"Number of partitions used by parallel FP-growth", 
ParamValidators.gtEq[Int](1))
+
+  /** @group getParam */
+  @Since("2.2.0")
+  def getNumPartitions: Int = $(numPartitions)
+
+}
+
+/**
+ * :: Experimental ::
+ * A parallel FP-growth algorithm to mine frequent itemsets.
+ *
+ * @see [[http://dx.doi.org/10.1145/1454008.1454027 Li et al., PFP: 
Parallel FP-Growth for Query
+ *  Recommendation]]
+ */
+@Since("2.2.0")
+@Experimental
+class FPGrowth @Since("2.2.0") (
+@Since("2.2.0") override val uid: String)
+  extends Estimator[FPGrowthModel] with FPGrowthParams with 
DefaultParamsWritable {
+
+  @Since("2.2.0")
+  def this() = this(Identifiable.randomUID("FPGrowth"))
+
+  /** @group setParam */
+  @Since("2.2.0")
+  def setMinSupport(value: Double): this.type = set(minSupport, value)
+
+  /** @group setParam */
+  @Since("2.2.0")
+  def setNumPartitions(value: Int): this.type = set(numPartitions, value)
+
+  /** @group setParam */
+  @Since("2.2.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /** @group setParam */
+  @Since("2.2.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
+
+  def fit(dataset: Dataset[_]): FPGrowthModel = {
+val data = dataset.select($(featuresCol)).rdd.map(r => 
r.getSeq[String](0).toArray)
+val parentModel = new 
MLlibFPGrowth().setMinSupport($(minSupport)).run(data)
+copyValues(new FPGrowthModel(uid, paren

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-20 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/15415#discussion_r97168311
  

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-20 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/15415#discussion_r97168170
  

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-20 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/15415#discussion_r97162464
  

[GitHub] spark issue #18697: [SPARK-16683][SQL] Repeated joins to same table can leak...

2017-07-25 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/18697
  
ping @rxin can someone look at this correctness fix?





[GitHub] spark pull request #18762: [SPARK-21566][SQL][Python] Python method for summ...

2017-07-28 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/18762

[SPARK-21566][SQL][Python] Python method for summary

## What changes were proposed in this pull request?

Adds the recently added `summary` method to the python dataframe interface.

## How was this patch tested?

Additional inline doctests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark summary-py

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18762.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18762


commit cb9d617aff3135143affbc7f0f6431c403a7503b
Author: Andrew Ray 
Date:   2017-07-24T20:59:00Z

python summary







[GitHub] spark issue #18762: [SPARK-21566][SQL][Python] Python method for summary

2017-07-28 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/18762
  
retest this please





[GitHub] spark pull request #18697: [SPARK-16683][SQL] Repeated joins to same table c...

2017-07-31 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/18697#discussion_r130396904
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala ---
@@ -65,6 +65,10 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] 
with Logging with Serializ
 false
   }
 
+  override def verboseStringWithSuffix: String = {
+s"$verboseString $outputPartitioning"
+  }
--- End diff --

This doesn't change anything that is in common use; one has to call 
`plan.treeString(verbose = true, addSuffix = true)` to get it. I would argue 
for keeping it for any future debugging.





[GitHub] spark issue #18697: [SPARK-16683][SQL] Repeated joins to same table can leak...

2017-07-31 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/18697
  
@viirya We could certainly make that improvement. I believe it would be a 
fairly trivial change to this PR if we were just considering expressions that 
have the same canonical representation. However, for reasons that are not clear 
to me, an alias does not automatically have the same canonical representation, 
as the `exprId` is not copied 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L984).
 Can anyone enlighten me as to why this is the case?





[GitHub] spark pull request #18786: [SPARK-21584][SQL][SparkR] Update R method for su...

2017-07-31 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/18786

[SPARK-21584][SQL][SparkR] Update R method for summary to call new 
implementation

## What changes were proposed in this pull request?

SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API 
that includes expanded statistics (vs `describe`) and control over which 
statistics to compute. Currently in the R API `summary` acts as an alias for 
`describe`. This patch updates the R API to call the new `summary` method in 
the JVM, which includes the additional statistics and the ability to select 
which ones to compute.
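
A minimal sketch of the JVM-side method the R API is being redirected to (assuming a SparkSession named `spark` with its implicits imported; data is illustrative). Statistics can also be selected explicitly.

```scala
import spark.implicits._

val df = Seq(("Andy", 30), ("Justin", 19)).toDF("name", "age")
df.summary().show()                                   // count, mean, stddev, min, 25%, 50%, 75%, max
df.summary("count", "min", "25%", "75%", "max").show()
```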

This does not break the current interface as the present `summary` method 
does not take additional arguments like `describe` and the output was never 
meant to be used programmatically. 

## How was this patch tested?

Modified and additional unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark summary-r

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18786.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18786


commit 210601ab422b7c24159b4042d9c3481f740739bc
Author: Andrew Ray 
Date:   2017-07-28T19:16:11Z

R lang summary

commit e05cdbac7cb23951f0323a4aa4fe9a04cc09b5b7
Author: Andrew Ray 
Date:   2017-07-31T19:43:27Z

doc

commit b8784b476ce36eddb054835724b2ef5426f55387
Author: Andrew Ray 
Date:   2017-07-31T20:06:01Z

remove comment







[GitHub] spark pull request #18786: [SPARK-21584][SQL][SparkR] Update R method for su...

2017-08-01 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/18786#discussion_r130618566
  
--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -2500,8 +2500,15 @@ test_that("describe() and summarize() on a 
DataFrame", {
   expect_equal(collect(stats)[5, "age"], "30")
 
   stats2 <- summary(df)
-  expect_equal(collect(stats2)[4, "summary"], "min")
-  expect_equal(collect(stats2)[5, "age"], "30")
+  expect_equal(collect(stats2)[5, "summary"], "25%")
+  expect_equal(collect(stats2)[5, "age"], "30.0")
--- End diff --

Yes





[GitHub] spark pull request #18786: [SPARK-21584][SQL][SparkR] Update R method for su...

2017-08-01 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/18786#discussion_r130620399
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2973,15 +2974,51 @@ setMethod("describe",
 dataFrame(sdf)
   })
 
+#' summary
+#'
+#' Computes specified statistics for numeric and string columns.
--- End diff --

This is unchanged from before. The stats are only computed for 
`NumericType` and `StringType` columns. Of course, the only statistics that are 
non-null for strings are count, min, and max.





[GitHub] spark pull request #18800: [SPARK-21330][SQL] Bad partitioning does not allo...

2017-08-01 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/18800

[SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC table 
with extreme values on the partition column

## What changes were proposed in this pull request?

An overflow of the difference of bounds on the partitioning column leads to 
no data being read. This 
patch checks for this overflow. 
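
A minimal arithmetic sketch of the failure mode (illustrative values only, not 
the actual `JDBCRelation` code):

```scala
// With extreme bounds on the partition column the Long subtraction wraps around,
// so any stride derived from it collapses and the generated predicates match no rows.
val lowerBound = Long.MinValue
val upperBound = Long.MaxValue
val numPartitions = 4

val naiveDiff = upperBound - lowerBound                  // wraps to -1
val naiveStride = naiveDiff / numPartitions              // 0: useless partition ranges

// Widening the arithmetic (or detecting the overflow, as this patch does) keeps
// the partition ranges meaningful.
val safeStride = ((BigInt(upperBound) - BigInt(lowerBound)) / numPartitions).toLong
```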

## How was this patch tested?

New unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark SPARK-21330

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18800.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18800


commit 7de8ccc05158f5b60af96c8cf22a2b9e20675817
Author: Andrew Ray 
Date:   2017-08-01T19:12:00Z

the fix







[GitHub] spark pull request #18786: [SPARK-21584][SQL][SparkR] Update R method for su...

2017-08-02 Thread aray
Github user aray closed the pull request at:

https://github.com/apache/spark/pull/18786





[GitHub] spark pull request #18786: [SPARK-21584][SQL][SparkR] Update R method for su...

2017-08-02 Thread aray
GitHub user aray reopened a pull request:

https://github.com/apache/spark/pull/18786

[SPARK-21584][SQL][SparkR] Update R method for summary to call new 
implementation

## What changes were proposed in this pull request?

SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API 
that included expanded statistics (vs `describe`) and control over which 
statistics to compute. Currently in the R API `summary` acts as an alias for 
`describe`. This patch updates the R API to call the new `summary` method in 
the JVM, which includes additional statistics and the ability to select which 
to compute.

This does not break the current interface as the present `summary` method 
does not take additional arguments like `describe` and the output was never 
meant to be used programmatically. 

## How was this patch tested?

Modified and additional unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark summary-r

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18786.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18786


commit 210601ab422b7c24159b4042d9c3481f740739bc
Author: Andrew Ray 
Date:   2017-07-28T19:16:11Z

R lang summary

commit e05cdbac7cb23951f0323a4aa4fe9a04cc09b5b7
Author: Andrew Ray 
Date:   2017-07-31T19:43:27Z

doc

commit b8784b476ce36eddb054835724b2ef5426f55387
Author: Andrew Ray 
Date:   2017-07-31T20:06:01Z

remove comment

commit 08f3cf82632b93e27f4fd3d537d7c36cbd288004
Author: Andrew Ray 
Date:   2017-08-01T14:32:07Z

address doc formatting comments







[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...

2017-08-02 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/18818

[SPARK-21110][SQL] Structs, arrays, and other orderable datatypes should be 
usable in inequalities

## What changes were proposed in this pull request?

Allows `BinaryComparison` operators to work on any data type that actually 
supports ordering as verified by `TypeUtils.checkForOrderingExpr` instead of 
relying on the incomplete list `TypeCollection.Ordered` (which is removed by 
this PR).
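
A short sketch of what this enables (assumes a `SparkSession` named `spark` with 
this patch applied):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.struct

val df = Seq((1, 10), (5, 2)).toDF("a", "b")

// Inequalities on struct columns now pass the type check instead of being rejected
// because StructType was not in TypeCollection.Ordered.
df.filter(struct($"a") < struct($"b")).show()

// The same comparison in SQL
df.createOrReplaceTempView("t")
spark.sql("SELECT * FROM t WHERE struct(a) < struct(b)").show()
```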

## How was this patch tested?

Updated unit tests to cover structs and arrays.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark SPARK-21110

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18818.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18818


commit cf29fc5f931e2a650eaeb6b4c08ed6e457f1b073
Author: Andrew Ray 
Date:   2017-08-01T20:52:03Z

should work needs tests

commit 0d1fd568f2af298bfb72ed8ca2f2560fa935b6f6
Author: Andrew Ray 
Date:   2017-08-02T14:57:54Z

update unit test with array and struct types







[GitHub] spark issue #18786: [SPARK-21584][SQL][SparkR] Update R method for summary t...

2017-08-02 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/18786
  
No, the changes to `summary` are not additive; it inserts the 25%, 50%, and 75% 
percentiles before max (the last row). People who want the previous behavior 
can use `describe`, and anyone trying to access these fields programmatically 
should really be specifying an explicit aggregation. If you recall, we 
discussed using the `summary` name in the original PR: 
https://github.com/apache/spark/pull/18307#issuecomment-312347948
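
For concreteness, a hedged sketch of the difference and of the explicit-aggregation 
alternative (assumes `spark.implicits._`; row order is from memory):

```scala
val df = Seq(1, 2, 3, 4).toDF("x")

df.describe().show()  // rows: count, mean, stddev, min, max
df.summary().show()   // rows: count, mean, stddev, min, 25%, 50%, 75%, max

// Programmatic access is better served by an explicit aggregation than by row index,
// e.g. an approximate median:
val Array(median) = df.stat.approxQuantile("x", Array(0.5), 0.0)
```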





[GitHub] spark pull request #18835: [SPARK-21628][BUILD] Explicitly specify Java vers...

2017-08-03 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/18835

[SPARK-21628][BUILD] Explicitly specify Java version in maven compiler 
plugin so IntelliJ imports project correctly

## What changes were proposed in this pull request?

Explicitly specify Java version in maven compiler plugin so IntelliJ 
imports project correctly. See 
https://stackoverflow.com/questions/27037657/stop-intellij-idea-to-switch-java-language-level-every-time-the-pom-is-reloaded
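
A sketch of the kind of `pom.xml` change involved (element placement and values 
are assumptions, not the exact diff):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <!-- Pinning these explicitly keeps IntelliJ from resetting the module
         language level every time the pom is reloaded -->
    <source>1.8</source>
    <target>1.8</target>
  </configuration>
</plugin>
```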

## How was this patch tested?

Manually tested reload of pom in IntelliJ.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark pom-1.8

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18835.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18835


commit 3c8e473fc921977fa02bd5ad883716c2a0ccb0fa
Author: Andrew Ray 
Date:   2017-08-03T15:26:10Z

the change







[GitHub] spark pull request #18835: [SPARK-21628][BUILD] Explicitly specify Java vers...

2017-08-03 Thread aray
Github user aray closed the pull request at:

https://github.com/apache/spark/pull/18835





[GitHub] spark issue #18835: [SPARK-21628][BUILD] Explicitly specify Java version in ...

2017-08-03 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/18835
  
Thanks, I see it now.





[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...

2017-08-07 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/18818#discussion_r131656840
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala 
---
@@ -79,18 +79,6 @@ private[sql] class TypeCollection(private val types: 
Seq[AbstractDataType])
 private[sql] object TypeCollection {
 
   /**
-   * Types that can be ordered/compared. In the long run we should 
probably make this a trait
-   * that can be mixed into each data type, and perhaps create an 
`AbstractDataType`.
-   */
-  // TODO: Should we consolidate this with RowOrdering.isOrderable?
--- End diff --

Nope, `RowOrdering.isOrderable` (which is used by 
`TypeUtils.checkForOrderingExpr`) returns true on a strict superset of this 
type collection as it works for complex types that need recursive checks.
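
Roughly, the recursive check looks like the following (a paraphrase from memory, 
not the verbatim Catalyst source):

```scala
import org.apache.spark.sql.types._

// Atomic types are orderable; a complex type is orderable exactly when all of its
// components are, which is what makes this a strict superset of TypeCollection.Ordered.
def isOrderableSketch(dt: DataType): Boolean = dt match {
  case NullType                              => true
  case _: NumericType                        => true
  case StringType | BinaryType | BooleanType => true
  case DateType | TimestampType              => true
  case s: StructType => s.fields.forall(f => isOrderableSketch(f.dataType))
  case a: ArrayType  => isOrderableSketch(a.elementType)
  case _             => false
}
```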





[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...

2017-08-07 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/18818#discussion_r131808912
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -453,6 +453,14 @@ case class Or(left: Expression, right: Expression) 
extends BinaryOperator with P
 
 abstract class BinaryComparison extends BinaryOperator with Predicate {
 
+  override def inputType: AbstractDataType = AnyDataType
--- End diff --

Right, but the real type check is below in `checkInputDataTypes`





[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...

2017-08-08 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/18818#discussion_r131913185
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -453,6 +453,14 @@ case class Or(left: Expression, right: Expression) 
extends BinaryOperator with P
 
 abstract class BinaryComparison extends BinaryOperator with Predicate {
 
+  override def inputType: AbstractDataType = AnyDataType
--- End diff --

We have to define `inputType` because it extends `BinaryOperator`. 
Previously, the `inputType` that the `LessThan`-like operators defined was a 
subset of what they could actually support. This PR fixes that, but since the 
supported types cannot be finitely specified as a type collection (there are 
countably infinitely many legal `StructType`s), we need to give a superset of 
what is actually supported for `inputType` and then do the real recursive check 
in `checkInputDataTypes`. This is much like how the `EqualTo` and 
`EqualNullSafe` operators were previously implemented. In this PR we just move 
that logic up to `BinaryComparison`, as it's really the same for equality and 
inequality operators.

Did that answer your concerns?
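
In code, the pattern described above is roughly the following (a sketch against 
Catalyst-internal classes, not the exact Spark source):

```scala
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.expressions.{BinaryOperator, Predicate}
import org.apache.spark.sql.catalyst.util.TypeUtils
import org.apache.spark.sql.types.{AbstractDataType, AnyDataType}

// Advertise AnyDataType so BinaryOperator's generic check only enforces that both
// sides have the same type, then do the real recursive orderability check here.
abstract class OrderedComparisonSketch extends BinaryOperator with Predicate {

  override def inputType: AbstractDataType = AnyDataType

  override def checkInputDataTypes(): TypeCheckResult = super.checkInputDataTypes() match {
    case TypeCheckResult.TypeCheckSuccess =>
      // Accepts structs, arrays, etc., as long as they are recursively orderable.
      TypeUtils.checkForOrderingExpr(left.dataType, s"operator $symbol")
    case failure => failure
  }
}
```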





[GitHub] spark issue #18786: [SPARK-21584][SQL][SparkR] Update R method for summary t...

2017-08-08 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/18786
  
@rxin Any thoughts on whether it's ok to change the output of `summary` in 
R in a non "additive" way?





[GitHub] spark issue #18306: [SPARK-21029][SS] All StreamingQuery should be stopped w...

2017-08-08 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/18306
  
@zsxwing can you take another look at this?





[GitHub] spark issue #18786: [SPARK-21584][SQL][SparkR] Update R method for summary t...

2017-08-09 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/18786
  
I'm pushing for it to stay as is because it's the more logical layout of 
the data: min=0%, 25%, 50%, 75%, max=100%. It's also more consistent with 
`summary` of native R data frames (and, for Python, the Pandas `describe` method).

Anyone who is accessing these fields blindly by index should know they are 
taking a risk. Furthermore, we already cast everything to strings for the output 
of `summary`, so it should be obvious that it's not meant for reuse.

@felixcheung If you still feel strongly, a compromise might be to print a 
warning when `summary` is called from R with no additional arguments.





[GitHub] spark issue #18818: [SPARK-21110][SQL] Structs, arrays, and other orderable ...

2017-08-14 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/18818
  
retest this please





[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...

2017-08-14 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/18818#discussion_r133116720
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -465,7 +475,7 @@ abstract class BinaryComparison extends BinaryOperator 
with Predicate {
 }
   }
 
-  protected lazy val ordering = 
TypeUtils.getInterpretedOrdering(left.dataType)
+  protected lazy val ordering: Ordering[Any] = 
TypeUtils.getInterpretedOrdering(left.dataType)
--- End diff --

addressed in cc2f3ec





[GitHub] spark issue #18818: [SPARK-21110][SQL] Structs, arrays, and other orderable ...

2017-08-14 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/18818
  
@viirya @gatorsmile I have addressed your comments, could you take another 
look.





[GitHub] spark pull request #18786: [SPARK-21584][SQL][SparkR] Update R method for su...

2017-08-17 Thread aray
Github user aray closed the pull request at:

https://github.com/apache/spark/pull/18786





[GitHub] spark issue #18786: [SPARK-21584][SQL][SparkR] Update R method for summary t...

2017-08-17 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/18786
  
closing and reopening to trigger AppVeyor test that timed out 





[GitHub] spark pull request #18786: [SPARK-21584][SQL][SparkR] Update R method for su...

2017-08-17 Thread aray
GitHub user aray reopened a pull request:

https://github.com/apache/spark/pull/18786

[SPARK-21584][SQL][SparkR] Update R method for summary to call new 
implementation

## What changes were proposed in this pull request?

SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API 
that included expanded statistics (vs `describe`) and control over which 
statistics to compute. Currently in the R API `summary` acts as an alias for 
`describe`. This patch updates the R API to call the new `summary` method in 
the JVM, which includes additional statistics and the ability to select which 
to compute.

This does not break the current interface as the present `summary` method 
does not take additional arguments like `describe` and the output was never 
meant to be used programmatically. 

## How was this patch tested?

Modified and additional unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark summary-r

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18786.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18786


commit 210601ab422b7c24159b4042d9c3481f740739bc
Author: Andrew Ray 
Date:   2017-07-28T19:16:11Z

R lang summary

commit e05cdbac7cb23951f0323a4aa4fe9a04cc09b5b7
Author: Andrew Ray 
Date:   2017-07-31T19:43:27Z

doc

commit b8784b476ce36eddb054835724b2ef5426f55387
Author: Andrew Ray 
Date:   2017-07-31T20:06:01Z

remove comment

commit 08f3cf82632b93e27f4fd3d537d7c36cbd288004
Author: Andrew Ray 
Date:   2017-08-01T14:32:07Z

address doc formatting comments

commit 9c9f0f60a5bb93f21b89f9048eb7185d43c609e9
Author: Andrew Ray 
Date:   2017-08-15T14:02:16Z

clean up doc per review







[GitHub] spark pull request #18001: [SPARK-20769][Doc] Incorrect documentation for us...

2017-05-16 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/18001

[SPARK-20769][Doc] Incorrect documentation for using Jupyter notebook

## What changes were proposed in this pull request?

SPARK-13973 incorrectly removed the required 
PYSPARK_DRIVER_PYTHON_OPTS="notebook" from documentation to use pyspark with 
Jupyter notebook. This patch corrects the documentation error.

## How was this patch tested?

Tested invocation locally with 
```bash
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook 
./bin/pyspark
```


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18001.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18001


commit 0598fafc7bcfe20afbcb95e7924769d600858547
Author: Andrew Ray 
Date:   2017-05-16T14:56:23Z

[SPARK-20769][Doc] Incorrect documentation for using Jupyter notebook

SPARK-13973 incorrectly removed the required 
PYSPARK_DRIVER_PYTHON_OPTS="notebook" from documentation to use pyspark with 
Jupyter notebook. This patch corrects the documentation error.

Tested manually







[GitHub] spark issue #18001: [SPARK-20769][Doc] Incorrect documentation for using Jup...

2017-05-16 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/18001
  
Yes, it does not work without it:

```
Andrews-MacBook-Pro:spark-2.1.1-bin-hadoop2.7 andrew$ jupyter --version
4.0.6
Andrews-MacBook-Pro:spark-2.1.1-bin-hadoop2.7 andrew$ 
PYSPARK_DRIVER_PYTHON=jupyter ./bin/pyspark
usage: jupyter [-h] [--version] [--config-dir] [--data-dir] [--runtime-dir]
   [--paths] [--json]
   [subcommand]
jupyter: error: one of the arguments --version subcommand --config-dir 
--data-dir --runtime-dir --paths is required
```




