[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs

2016-04-28 Thread kevincox
Github user kevincox commented on the pull request:

https://github.com/apache/spark/pull/12335#issuecomment-215525434
  
@davies You mean to support non-null return values? I don't think I know 
enough Scala to automatically infer that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs

2016-04-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12335#issuecomment-215524881
  
**[Test build #2903 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2903/consoleFull)**
 for PR 12335 at commit 
[`efbdc26`](https://github.com/apache/spark/commit/efbdc26759fc8654a389db7920403ac4f760e186).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs

2016-04-28 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/12335#issuecomment-215524585
  
@kevincox Could you also update the Scala UDF?





[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs

2016-04-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12335#issuecomment-215524403
  
**[Test build #2903 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2903/consoleFull)**
 for PR 12335 at commit 
[`efbdc26`](https://github.com/apache/spark/commit/efbdc26759fc8654a389db7920403ac4f760e186).





[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs

2016-04-28 Thread kevincox
Github user kevincox commented on the pull request:

https://github.com/apache/spark/pull/12335#issuecomment-215498477
  
I've added some tests but I'm having trouble getting the test suite to run 
locally before or after my changes. So I'm kinda just praying that everything 
works.





[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs

2016-04-18 Thread kevincox
Github user kevincox commented on the pull request:

https://github.com/apache/spark/pull/12335#issuecomment-211574940
  
Sure thing. It'll be a while until I get around to it but I will make sure 
to do that.





[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs

2016-04-18 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/12335#issuecomment-211556065
  
@kevincox Could you add some tests for this?

Jenkins, OK to test





[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs

2016-04-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12335#issuecomment-209106353
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs

2016-04-12 Thread kevincox
GitHub user kevincox opened a pull request:

https://github.com/apache/spark/pull/12335

[SPARK-11321] [SQL] Python non null udfs

## What changes were proposed in this pull request?

This patch allows Python UDFs to return non-nullable values.
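
As a rough illustration of what "non-nullable" means for a UDF's result column, here is a minimal pure-Python sketch. It does not use PySpark and is not the API this patch adds: `ReturnField` and `checked_udf` are hypothetical names invented for the example, and the runtime check is an assumption, since the thread does not say whether Spark verifies the declaration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ReturnField:
    """Hypothetical description of a UDF's result column (not a Spark type)."""
    type_name: str
    nullable: bool = True  # assumption: results are nullable unless declared otherwise


def checked_udf(fn: Callable[..., Any], field: ReturnField) -> Callable[..., Any]:
    """Wrap fn so that returning None breaks a non-nullable declaration."""
    def wrapper(*args: Any) -> Any:
        result = fn(*args)
        if result is None and not field.nullable:
            raise ValueError("UDF declared non-nullable returned None")
        return result
    return wrapper


# Declared non-nullable: callers may rely on never seeing None.
str_len = checked_udf(lambda s: len(s), ReturnField("int", nullable=False))

# Left nullable: None is a legal result.
maybe_len = checked_udf(lambda s: len(s) if s else None, ReturnField("int"))
```

The point of such a declaration is that the consumer of the column can skip null handling, which is only safe if the function really never returns `None`.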

## How was this patch tested?

This was tested by running PySpark jobs.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kevincox/spark python-non-null-udfs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12335.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12335


commit 2ddd10486b91619117b0c236c86e4e0f39869cfa
Author: anabranch 
Date:   2015-12-11T20:55:56Z

[SPARK-11964][DOCS][ML] Add in Pipeline Import/Export Documentation

Adding in Pipeline Import and Export Documentation.

Author: anabranch 
Author: Bill Chambers 

Closes #10179 from anabranch/master.

(cherry picked from commit aa305dcaf5b4148aba9e669e081d0b9235f50857)
Signed-off-by: Joseph K. Bradley 

commit bfcc8cfee7219e63d2f53fc36627f95dc60428eb
Author: Mike Dusenberry 
Date:   2015-12-11T22:21:33Z

[SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure 
Issue

As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our 
PySpark `RowMatrix` constructor.  As discussed on the dev list 
[here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html),
 there appears to be an issue with type erasure with RDDs coming from Java, and 
by extension from PySpark.  Although we are attempting to construct a 
`RowMatrix` from an `RDD[Vector]` in 
[PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115),
 the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when 
calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` 
in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the 
aforementioned dev list thread, this issue was also encountered with 
`DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a 
`Vector` type.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear 
to have this issue likely due to their related helper 
functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with 
pattern matching, thus preserving the types.

This PR currently contains that retagging fix applied to the 
`createRowMatrix` helper function in `PythonMLlibAPI`.  This PR blocks #9441, 
so once this is merged, the other can be rebased.

cc holdenk

Author: Mike Dusenberry 

Closes #9458 from 
dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.

(cherry picked from commit 1b8220387e6903564f765fabb54be0420c3e99d7)
Signed-off-by: Joseph K. Bradley 

commit 75531c77e85073c7be18985a54c623710894d861
Author: BenFradet 
Date:   2015-12-11T23:43:00Z

[SPARK-12217][ML] Document invalid handling for StringIndexer

Added a paragraph regarding StringIndexer#setHandleInvalid to the 
ml-features documentation.

I wonder if I should also add a snippet to the code example, input welcome.

Author: BenFradet 

Closes #10257 from BenFradet/SPARK-12217.

(cherry picked from commit aea676ca2d07c72b1a752e9308c961118e5bfc3c)
Signed-off-by: Joseph K. Bradley 

commit c2f20469d5b53a027b022e3c4a9bea57452c5ba6
Author: Yanbo Liang 
Date:   2015-12-12T02:02:24Z

[SPARK-11978][ML] Move dataset_example.py to examples/ml and rename to 
dataframe_example.py

Since ```Dataset``` has a new meaning in Spark 1.6, we should rename it to 
avoid confusion.
#9873 finished the work of Scala example, here we focus on the Python one.
Move dataset_example.py to ```examples/ml``` and rename to 
```dataframe_example.py```.
BTW, fix minor missing issues of #9873.
cc mengxr

Author: Yanbo Liang 

Closes #9957 from yanboliang/SPARK-11978.

(cherry picked from commit a0ff6d16ef4bcc1b6ff7282e82a9b345d8449454)
Signed-off-by: Joseph K. Bradley 

commit 03d801587936fe92d4e7541711f1f41965e64956
Author: Ankur Dave 
Date:   2015-12-12T03:07:48Z

[SPARK-12298][SQL] Fix infinite loop in DataFrame.sortWithinPartitions

Modifies the String overload to call the Column overload and ensures this 
is called in a test.

Author: Ankur Dave 

Closes #10271 from ankurdave/SPARK-12298.

(cherry picked from commit 1e799d617a28cd0eaa8f22d103ea8248c4655ae5)
Signed-off-by: Yin Huai 
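
The SPARK-12298 commit message above describes making the String overload call the Column overload. A minimal Python sketch of that delegation pattern follows. `Column`, `Frame`, and both method names are hypothetical stand-ins, not Spark's classes, and Python has no overloads, so the two entry points get distinct names here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a column expression."""
    name: str


class Frame:
    """Hypothetical stand-in for a DataFrame; it only records the sort keys."""

    def __init__(self) -> None:
        self.sort_keys = []

    def sort_within_partitions(self, *cols: Column) -> "Frame":
        # The Column-based entry point does the actual work.
        self.sort_keys = [c.name for c in cols]
        return self

    def sort_within_partitions_by_name(self, first: str, *rest: str) -> "Frame":
        # Buggy shape: calling self.sort_within_partitions_by_name(first, *rest)
        # here would never reach the real implementation (in Python it would
        # end in a RecursionError). Converting the names to Column objects and
        # delegating to the Column-based method terminates.
        return self.sort_within_partitions(*(Column(n) for n in (first, *rest)))


frame = Frame().sort_within_partitions_by_name("a", "b")
```

A test that actually invokes the name-based entry point, as the commit message mentions, is what catches this kind of self-call.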

commit 47461fea7c079819de6add308f823c7a8294f891
Author: gatorsmile 
Date:   2015-12-12T04:55:16Z

[SPARK-12158][SPARKR][SQL] Fix 'sample' function