[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/12335#issuecomment-215525434 @davies You mean to support non-null return values? I don't think I know enough Scala to automatically infer that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12335#issuecomment-215524881 **[Test build #2903 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2903/consoleFull)** for PR 12335 at commit [`efbdc26`](https://github.com/apache/spark/commit/efbdc26759fc8654a389db7920403ac4f760e186). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs
Github user davies commented on the pull request: https://github.com/apache/spark/pull/12335#issuecomment-215524585 @kevincox Could you also update the Scala UDF?
[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12335#issuecomment-215524403 **[Test build #2903 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2903/consoleFull)** for PR 12335 at commit [`efbdc26`](https://github.com/apache/spark/commit/efbdc26759fc8654a389db7920403ac4f760e186).
[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/12335#issuecomment-215498477 I've added some tests but I'm having trouble getting the test suite to run locally before or after my changes. So I'm kinda just praying that everything works.
[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs
Github user kevincox commented on the pull request: https://github.com/apache/spark/pull/12335#issuecomment-211574940 Sure thing. It'll be a while until I get around to it but I will make sure to do that.
[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs
Github user davies commented on the pull request: https://github.com/apache/spark/pull/12335#issuecomment-211556065 @kevincox Could you add some tests for this? Jenkins, OK to test
[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12335#issuecomment-209106353 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-11321] [SQL] Python non null udfs
GitHub user kevincox opened a pull request:

https://github.com/apache/spark/pull/12335

[SPARK-11321] [SQL] Python non null udfs

## What changes were proposed in this pull request?

This patch allows Python UDFs to return non-nullable values.

## How was this patch tested?

This was tested by running PySpark jobs.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kevincox/spark python-non-null-udfs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12335.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #12335

commit 2ddd10486b91619117b0c236c86e4e0f39869cfa
Author: anabranch
Date: 2015-12-11T20:55:56Z

[SPARK-11964][DOCS][ML] Add in Pipeline Import/Export Documentation

Adding in Pipeline Import and Export Documentation.

Author: anabranch
Author: Bill Chambers

Closes #10179 from anabranch/master.

(cherry picked from commit aa305dcaf5b4148aba9e669e081d0b9235f50857)
Signed-off-by: Joseph K. Bradley

commit bfcc8cfee7219e63d2f53fc36627f95dc60428eb
Author: Mike Dusenberry
Date: 2015-12-11T22:21:33Z

[SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue

As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.

This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased.

cc holdenk

Author: Mike Dusenberry

Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.

(cherry picked from commit 1b8220387e6903564f765fabb54be0420c3e99d7)
Signed-off-by: Joseph K. Bradley

commit 75531c77e85073c7be18985a54c623710894d861
Author: BenFradet
Date: 2015-12-11T23:43:00Z

[SPARK-12217][ML] Document invalid handling for StringIndexer

Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation. I wonder if I should also add a snippet to the code example, input welcome.

Author: BenFradet

Closes #10257 from BenFradet/SPARK-12217.

(cherry picked from commit aea676ca2d07c72b1a752e9308c961118e5bfc3c)
Signed-off-by: Joseph K. Bradley

commit c2f20469d5b53a027b022e3c4a9bea57452c5ba6
Author: Yanbo Liang
Date: 2015-12-12T02:02:24Z

[SPARK-11978][ML] Move dataset_example.py to examples/ml and rename to dataframe_example.py

Since ```Dataset``` has a new meaning in Spark 1.6, we should rename it to avoid confusion. #9873 finished the work of Scala example, here we focus on the Python one. Move dataset_example.py to ```examples/ml``` and rename to ```dataframe_example.py```. BTW, fix minor missing issues of #9873.

cc mengxr

Author: Yanbo Liang

Closes #9957 from yanboliang/SPARK-11978.

(cherry picked from commit a0ff6d16ef4bcc1b6ff7282e82a9b345d8449454)
Signed-off-by: Joseph K. Bradley

commit 03d801587936fe92d4e7541711f1f41965e64956
Author: Ankur Dave
Date: 2015-12-12T03:07:48Z

[SPARK-12298][SQL] Fix infinite loop in DataFrame.sortWithinPartitions

Modifies the String overload to call the Column overload and ensures this is called in a test.

Author: Ankur Dave

Closes #10271 from ankurdave/SPARK-12298.

(cherry picked from commit 1e799d617a28cd0eaa8f22d103ea8248c4655ae5)
Signed-off-by: Yin Huai

commit 47461fea7c079819de6add308f823c7a8294f891
Author: gatorsmile
Date: 2015-12-12T04:55:16Z

[SPARK-12158][SPARKR][SQL] Fix 'sample' function