[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/16537 Thanks @HyukjinKwon --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18759: [SPARK-20601][ML] Python API for Constrained Logistic Re...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18759 Thanks @yanboliang
[GitHub] spark pull request #16822: [SPARK-19475][PYTHON][ML][MLLIB] Support (ml|mlli...
Github user zero323 closed the pull request at: https://github.com/apache/spark/pull/16822
[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__...
Github user zero323 closed the pull request at: https://github.com/apache/spark/pull/16537
[GitHub] spark pull request #18052: [SPARK-20347][PYSPARK][WIP] Provide AsyncRDDActio...
Github user zero323 closed the pull request at: https://github.com/apache/spark/pull/18052
[GitHub] spark issue #17922: [SPARK-20601][PYTHON][ML] Python API Changes for Constra...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17922 @BryanCutler @yanboliang @nchammas Thanks for all the comments. Unfortunately I don't have access to hardware I can use for development at the moment, and most likely I won't in the upcoming weeks. I'm going to close this PR, but I'd really appreciate it if one of you could pick it up from here. TIA
[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...
Github user zero323 closed the pull request at: https://github.com/apache/spark/pull/17922
[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/16537#discussion_r125755849

--- Diff: python/pyspark/sql/functions.py ---
@@ -1949,6 +1949,14 @@ def _create_judf(self):
         return judf

     def __call__(self, *cols):
+        for c in cols:
+            if not isinstance(c, (Column, str)):
--- End diff --

@HyukjinKwon Sorry for the delayed response, I am seldom online these days. You're right, it looks like an issue. I'll take a look at this when I have more time.
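The validation sketched in the diff above can be shown standalone. This is an illustrative sketch only: `Column` and `UserDefinedFunction` below are minimal stand-ins for the PySpark classes, not the merged implementation.

```python
class Column:
    """Stand-in for pyspark.sql.column.Column."""


class UserDefinedFunction:
    def __call__(self, *cols):
        # Reject anything that is neither a Column nor a column-name
        # string before it reaches Py4J and fails cryptically there.
        for c in cols:
            if not isinstance(c, (Column, str)):
                raise TypeError(
                    "All arguments should be Columns or strings "
                    "representing column names, got %r of type %s"
                    % (c, type(c)))
```

Calling such a UDF with, say, an integer then fails fast with a readable `TypeError` instead of a Py4J stack trace.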
[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r123766355

--- Diff: python/pyspark/ml/tests.py ---
@@ -832,6 +860,96 @@ def test_logistic_regression(self):
         except OSError:
             pass

+    def logistic_regression_check_thresholds(self):
+        self.assertIsInstance(
+            LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]),
+            LogisticRegressionModel
+        )
+
+        self.assertRaisesRegexp(
+            ValueError,
+            "Logistic Regression getThreshold found inconsistent.*$",
+            LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
+        )
+
+    def test_binomial_logistic_regression_bounds(self):
--- End diff --

Example datasets are not that good for checking constraints, and a generator seems like a better idea than creating a large enough example by hand. I can of course remove it if this is an issue.
[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r123765079

--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -170,6 +170,15 @@ def toVector(value):
         raise TypeError("Could not convert %s to vector" % value)

     @staticmethod
+    def toMatrix(value):
+        """
+        Convert a value to ML Matrix, if possible
--- End diff --

While I am aware of this, the distinction between `ml.linalg` and `mllib.linalg` is a common source of confusion for PySpark users. Of course we could be more forgiving, and automatically convert objects to the required class.
[GitHub] spark issue #18049: [SPARK-20830][PYSPARK][SQL] Add posexplode and posexplod...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18049 Thanks @ueshin!
[GitHub] spark pull request #18049: [SPARK-20830][PYSPARK][SQL] Add posexplode and po...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/18049#discussion_r123362792

--- Diff: python/pyspark/sql/tests.py ---
@@ -272,6 +276,11 @@ def test_explode(self):
         self.assertEqual(result[0][0], "a")
         self.assertEqual(result[0][1], "b")

+        self.assertEqual(data.select(posexplode_outer("intlist")).count(), 5)
--- End diff --

@ueshin Of course, is this enough?
[GitHub] spark issue #18052: [SPARK-20347][PYSPARK][WIP] Provide AsyncRDDActions in P...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18052 IMHO it is, but this feature is hardly essential. Arguably we wouldn't need the Scala API in the first place if the built-in `Future` supported cancellation. It is possible I am overthinking the latter point, but I don't see much point in adding an API which doesn't integrate with existing language features.
[GitHub] spark issue #18052: [SPARK-20347][PYSPARK][WIP] Provide AsyncRDDActions in P...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18052

Personally I would prefer not including this at all over using a JVM implementation with callbacks:

- The Py4J gateway is already pretty slow, and can be unstable under high load. Putting more pressure on it doesn't seem like a good approach.
- To "wrap" the JVM side we would have to re-implement a full-featured future API, at least partially compatible with `asyncio.Future` or `concurrent.futures.Future`. That is a much higher maintenance burden, especially when both APIs are actively developed.
[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/16537 I cannot reproduce this locally, but do we really use `pypy-2.0.2`?
[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__...
GitHub user zero323 reopened a pull request: https://github.com/apache/spark/pull/16537

[SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ should validate input types

## What changes were proposed in this pull request?

Adds basic input validation for `UserDefinedFunction.__call__` to avoid failing with cryptic `Py4J` errors.

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-19165

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16537.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16537

commit d476faf7a9912e4ff93fcb9c567ffc91f21c0512
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-06-20T20:42:57Z

    Validate types in UserDefinedFunction.__call__
[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__...
Github user zero323 closed the pull request at: https://github.com/apache/spark/pull/16537
[GitHub] spark issue #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in Spark ML...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17969 @felixcheung Feel free to ping me if you think this is worth revisiting.
[GitHub] spark issue #18052: [SPARK-20347][PYSPARK][WIP] Provide AsyncRDDActions in P...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18052 @davies It is. Monkey patching the context, `RDD` and some classes not covered by Scala `AsyncRDDFunctions` [takes around 100 LOCs](https://github.com/zero323/pyspark-asyncactions) (excluding tests, comments, and package boilerplate). Without implicit Spark requirements (thread safety) one could also use `asyncio` and skip the thread pool altogether.
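A hedged sketch of the monkey-patching approach mentioned above, modeled loosely on the linked pyspark-asyncactions package but using a plain list-backed stand-in instead of a real PySpark `RDD`:

```python
from concurrent.futures import ThreadPoolExecutor

# Shared executor for all async actions, as the external package does.
_pool = ThreadPoolExecutor(max_workers=4)


class RDD:
    """Stand-in for pyspark.rdd.RDD, exposing a blocking collect()."""
    def __init__(self, data):
        self._data = list(data)

    def collect(self):
        return list(self._data)


def collectAsync(self):
    """Submit the blocking action to a worker thread; return a Future."""
    return _pool.submit(self.collect)


# Patch the async action onto the class after the fact.
RDD.collectAsync = collectAsync
```

Callers then get a standard `concurrent.futures.Future` (`.result()`, `.add_done_callback()`, …) rather than a bespoke future type, which is the integration point argued for in the comments above.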
[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/16537

@holdenk I'll try to reproduce this problem, but it looks a bit awkward:

> AttributeError: 'function' object has no attribute '__closure__'

Doesn't look like something related to this PR at all.
[GitHub] spark issue #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in Spark ML...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17969 Not a problem. It is just easier to reopen this in the future than to keep resolving ongoing conflicts. This is mostly deletions, but it covers a large part of the API, and even with recursive + patience git doesn't handle that well.
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user zero323 closed the pull request at: https://github.com/apache/spark/pull/17969
[GitHub] spark issue #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in Spark ML...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17969 @felixcheung I assume there is no interest in that. We can revisit this some other time I guess.
[GitHub] spark issue #18052: [SPARK-20347][PYSPARK][WIP] Provide AsyncRDDActions in P...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18052 __Note__: [Waiting for some feedback](https://twitter.com/holdenkarau/status/866672579318337537).
[GitHub] spark issue #17922: [SPARK-20601][PYTHON][ML] Python API Changes for Constra...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17922 Sure @yanboliang. Give me a sec.
[GitHub] spark issue #18116: [SPARK-20892][SparkR] Add SQL trunc function to SparkR
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18116 It is manually edited. We don't manage it with `roxygen`.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17938 Thanks @gatorsmile
[GitHub] spark pull request #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplic...
Github user zero323 closed the pull request at: https://github.com/apache/spark/pull/18051
[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18051 Exactly my point. Run examples internally ([it is not hard to patch knitr](https://github.com/zero323/knitr/commit/7a0d8f9ddb9d77a9c235f25aca26131e83c1f6cc) or even `tools::Rd2ex`) to validate examples and improve online docs. #18025 looks great - I'll try to review it when I have a spare moment.
[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18051 To be honest, I thought mostly about the online docs here. Duplicate links in the bundled documentation never bothered me before (in SparkR, or any other package for that matter) and I don't think these have to be fixed. Maybe just close this PR, mark the upstream ticket as won't fix, and focus on bigger issues? Just saying...
[GitHub] spark issue #18085: [SPARK-20631][FOLLOW-UP] Fix incorrect tests.
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18085 The root problem was the lack of a `test` prefix in the method name, so it was never executed during the tests.
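A minimal illustration of the pitfall described above: `unittest` only discovers methods whose names start with `test`, so a method missing that prefix is silently skipped. The class and method names here are hypothetical, not Spark's own suite.

```python
import unittest


class ExampleSuite(unittest.TestCase):
    def test_runs(self):
        # Discovered and executed by default discovery.
        self.assertTrue(True)

    def check_thresholds(self):
        # NOT discovered: the name lacks the "test" prefix, so this
        # assertion would never fire and the suite would still pass.
        raise AssertionError("never runs under default discovery")


discovered = unittest.TestLoader().getTestCaseNames(ExampleSuite)
```

Only `test_runs` ends up in `discovered`, which is exactly how a broken test can sit unnoticed in a green suite.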
[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18051 If we consider improvement of the online documentation to be a separate problem, then I fully agree with @actuaryzhang.
[GitHub] spark issue #18089: [SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrow...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18089 Thanks @yanboliang
[GitHub] spark issue #17891: [SPARK-20631][PYTHON][ML] LogisticRegression._checkThres...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17891 @jkbradley It shouldn't. It is not a correct test; see #18085
[GitHub] spark pull request #18085: [SPARK-20631][FOLLOW-UP] Fix incorrect tests.
GitHub user zero323 opened a pull request: https://github.com/apache/spark/pull/18085

[SPARK-20631][FOLLOW-UP] Fix incorrect tests.

## What changes were proposed in this pull request?

- Fix incorrect tests for `_check_thresholds`.
- Move test to `ParamTests`.

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-20631-FOLLOW-UP

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18085.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18085

commit 59494f7e851523cc9038b3e06258148885a6ae34
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-24T09:52:22Z

    Fix incorrect test

commit b780da2fc30f91fbe386a81c59975245c0f0f058
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-24T10:02:36Z

    Move test to ParamTests
[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18051

I think there are two different problems here:

- Quality of the internal R documentation. I think that fixing this is a non-goal. It is not only the normal state of R packages, but also impossible to fix without hacks or serious trade-offs.
- Quality of the online documentation. This is subjective, but I think there is a lot to do there including, but not limited to:
  - Removing this IFrame nonsense. It doesn't serve any real purpose and is completely useless.
  - Cleaning duplicate links. Since almost everything here is S4, we duplicate the size of the index with each addition, making it only harder to use. It also affects _See also_ sections.
  - Trying to clean _See also_. Something like this (example SQL function): ![image](https://cloud.githubusercontent.com/assets/1554276/26349341/b8dbceba-3faf-11e7-8a1f-4c51dd7fa818.png) is just useless.
  - Adding some kind of search functionality.
  - Running all examples as a part of the internal docs build process. Having readable, highlighted examples, with actual output, would be awesome.
[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18051

For a moment I thought I found another solution, but I was wrong. I don't think there is a conflict between this and an installable package. It won't help with the packaged version (but other packages depending on S4 suffer from the same issue), but we can have an improved online version.

There is one possible alternative - converting all `names` to the long version:

```r
#' abs
#'
#' Computes the absolute value.
#'
#' @param x Column to compute on.
#'
#' @rdname abs
#' @name abs-method
#' @family non-aggregate functions
#' @export
#' @examples \dontrun{abs(df$c)}
#' @aliases abs,Column-method
```

This would keep CRAN checks happy and remove duplicates, but at the cost of having docs like this:

![image](https://cloud.githubusercontent.com/assets/1554276/26330751/b4b4c812-3f4d-11e7-8c98-992d7a2318cc.png)

and making help unusable from an R session, requiring:

    ?SparkR::`abs-method`

instead of:

    ?SparkR::abs

I am not sure if you agree, but IMHO this just makes things worse.
[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18051 OK, take two. Instead of modifying `00Index.html`, let's process the `Rd` files. This will remove `-method` aliases before the HTML version is created.
[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/18051 It doesn't, but `R CMD build pkg` doesn't generate the HTML index. That happens somewhere in `R CMD INSTALL`, so even if we create a custom build script (with `devtools`), it won't help us here.
[GitHub] spark pull request #18052: [SPARK-20347][PYSPARK][WIP] Provide AsyncRDDActio...
GitHub user zero323 opened a pull request: https://github.com/apache/spark/pull/18052 [SPARK-20347][PYSPARK][WIP] Provide AsyncRDDActions in Python ## What changes were proposed in this pull request? Adds asynchronous RDD actions (`collectAsync`, `countAsync`, `foreach(Partition)Async` and `takeAsync`) using `concurrent.futures` with `ThreadPoolExecutor`. In Python < 3.2 it requires a backported [`futures` package](https://pypi.python.org/pypi/futures) installed on the driver. ## How was this patch tested? Unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zero323/spark SPARK-20347 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18052.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18052 commit 72bd097896aca042944d8e20282617e4864d9dd0 Author: zero323 <zero...@users.noreply.github.com> Date: 2017-05-21T21:08:09Z Initial commit
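The idea behind the PR - submitting a blocking action to a `ThreadPoolExecutor` and returning the resulting `Future` - can be sketched in plain Python. This is a hypothetical illustration of the pattern, not the PR's actual implementation; `FakeRDD` and its methods are made up, with a plain list standing in for a real RDD:

```python
from concurrent.futures import Future, ThreadPoolExecutor


class FakeRDD:
    """Stand-in for an RDD, to show how *Async variants wrap blocking actions."""

    def __init__(self, data):
        self._data = list(data)
        self._executor = ThreadPoolExecutor(max_workers=4)

    def collect(self):
        # Blocking action.
        return list(self._data)

    def count(self):
        # Blocking action.
        return len(self._data)

    def collectAsync(self) -> Future:
        # Returns immediately; result() blocks until the action completes.
        return self._executor.submit(self.collect)

    def countAsync(self) -> Future:
        return self._executor.submit(self.count)


rdd = FakeRDD(range(5))
fut = rdd.countAsync()      # returns a Future immediately
print(fut.result())         # blocks until the count is done
```

On Python < 3.2 the same code works with the backported `futures` package the PR mentions, since it provides the identical `concurrent.futures` module.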
[GitHub] spark pull request #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplic...
GitHub user zero323 opened a pull request: https://github.com/apache/spark/pull/18051 [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate links in SparkR API doc index

## What changes were proposed in this pull request?

Duplicate links come from the `00Index.html` created during package installation. The file has a regular structure, where each link is a table row, split across two lines:

```
atan-method
atan
```

This PR adds additional steps to `R/create-docs.sh`:

- Copies `00Index.html` to the current working directory:

  ```r
  index_path = file.path(libDir, "SparkR", "html", "00Index.html")
  invisible(file.copy(index_path, "00Index.html.bck"))
  ```

- Reads the file and removes the problematic lines:

  ```r
  txt = readLines(index_path)
  method_lines = grep("-method", txt, fixed = TRUE)
  txt = txt[-c(method_lines, method_lines + 1)]
  ```

- Writes the file back:

  ```r
  writeLines(txt, index_path)
  ```

- Executes the current pipeline.
- Restores the original content:

  ```r
  invisible(file.rename("00Index.html.bck", index_path))
  ```

Arguably this is not the most reliable approach, but it doesn't require any parser and can be embedded in the current `create-docs.sh`.

## How was this patch tested?

Manual inspection of the docs.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/zero323/spark SPARK-18825 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18051.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18051 ---- commit b28ec94b0c0589736b4e3377d160642b0b6181a6 Author: zero323 <zero...@users.noreply.github.com> Date: 2017-05-21T18:19:22Z Initial implementation
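The index-cleaning step in the PR is simple line filtering, which can be sketched in Python for clarity. This mirrors the R snippets above (find lines containing `-method`, drop each together with the line that follows it); the sample HTML rows are invented for illustration:

```python
def strip_method_rows(lines):
    """Drop every line containing '-method' plus the line after it,
    mirroring the R logic: grep(...) then txt[-c(i, i + 1)]."""
    drop = set()
    for i, line in enumerate(lines):
        if "-method" in line:
            drop.add(i)
            drop.add(i + 1)
    return [line for i, line in enumerate(lines) if i not in drop]


# Hypothetical 00Index.html fragment: each link is a two-line table row.
index = [
    '<td><a href="atan-method.html">atan-method</a></td>',
    "<td>atan</td>",
    '<td><a href="column.html">column</a></td>',
    "<td>Column class</td>",
]
print(strip_method_rows(index))  # only the two "column" rows survive
```

As the PR notes, this is pattern matching rather than real HTML parsing, so it relies on the two-lines-per-row structure of the generated index staying stable.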
[GitHub] spark pull request #18049: [SPARK-20830][PYSPARK][SQL] Add posexplode and po...
GitHub user zero323 opened a pull request: https://github.com/apache/spark/pull/18049 [SPARK-20830][PYSPARK][SQL] Add posexplode and posexplode_outer ## What changes were proposed in this pull request? Add Python wrappers for `o.a.s.sql.functions.explode_outer` and `o.a.s.sql.functions.posexplode_outer`. ## How was this patch tested? Unit tests, doctests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zero323/spark SPARK-20830 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18049.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18049 commit 2fc576d74d0c6d0c7f7e4916407876f39727ce85 Author: zero323 <zero...@users.noreply.github.com> Date: 2017-05-21T17:08:25Z Add posexplode and posexplode_outer
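For readers unfamiliar with the `_outer` variants, their row semantics can be modeled in plain Python. This is only a sketch of the behaviour (one `(pos, value)` row per element; the `_outer` form emits a null row instead of dropping empty/null input), not the PySpark API itself:

```python
def posexplode(values):
    """One (pos, value) row per element; empty or None input yields no rows."""
    if not values:
        return []
    return list(enumerate(values))


def posexplode_outer(values):
    """Like posexplode, but empty or None input yields a single null row,
    so the parent row is kept rather than dropped."""
    rows = posexplode(values)
    return rows if rows else [(None, None)]


print(posexplode(["a", "b"]))   # [(0, 'a'), (1, 'b')]
print(posexplode([]))           # [] - the row disappears
print(posexplode_outer([]))     # [(None, None)] - the row is preserved
```

The same keep-the-row-on-null contrast is what distinguishes `explode_outer` from `explode`, minus the position column.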
[GitHub] spark issue #17988: [SPARKR][DOCS][MINOR] Use consistent names in rollup and...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17988 I took another look and I think it is OK as it is. If we were to [actually run the examples](https://issues.apache.org/jira/browse/SPARK-18825?focusedCommentId=16011504&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16011504) we'd need a bigger clean-up, but that is a different topic.
[GitHub] spark issue #17988: [SPARKR][DOCS][MINOR] Use consistent names in rollup and...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17988 Let me take another look :)
[GitHub] spark pull request #17988: [DOCS][MINOR] Use consistent names in rollup and ...
GitHub user zero323 opened a pull request: https://github.com/apache/spark/pull/17988 [DOCS][MINOR] Use consistent names in rollup and cube examples ## What changes were proposed in this pull request? Rename `carsDF` to `df` in SparkR `rollup` and `cube` examples. ## How was this patch tested? Manual tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zero323/spark cube-docs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17988.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17988 commit c8ed4e08ca4ed6ff88ae98f234d7fed8bbd0faf7 Author: zero323 <zero...@users.noreply.github.com> Date: 2017-05-15T23:35:58Z Rename carsDF to df
[GitHub] spark issue #17672: [SPARK-20371][R] Add wrappers for collect_list and colle...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17672 @felixcheung Do you know by any chance what the policy is on adding new datasets to Spark? License restrictions, file size and such?
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17969#discussion_r116393810 --- Diff: R/pkg/R/mllib_wrapper.R --- @@ -0,0 +1,61 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +#' S4 class that represents a Java ML model +#' +#' @param jobj a Java object reference to the backing Scala model +#' @export +#' @note JavaModel since 2.3.0 +setClass("JavaModel", representation(jobj = "jobj")) + +#' Makes predictions from a Java ML model +#' +#' @param object a Spark ML model. +#' @param newData a SparkDataFrame for testing. +#' @return \code{predict} returns a SparkDataFrame containing predicted value. +#' @rdname spark.predict +#' @aliases predict,JavaModel-method --- End diff -- I believe there is no conflict here. If you find this useful you can use templates to include additional information about generic operations. Very simple example https://github.com/zero323/spark/commit/64a3e854792181e159d39b9e747170b707f2711d which would create section like this: ![image](https://cloud.githubusercontent.com/assets/1554276/26038702/72b70280-390e-11e7-922c-0d1dece4816e.png) This can be further parametrized if needed. 
[GitHub] spark issue #17976: [DOCS][SPARKR] Use verbose names for family annotations ...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17976 Thanks Felix!
[GitHub] spark issue #17976: [DOCS][SPARKR] Use verbose names for family annotations ...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17976 Note: if multiple functions use the same `@rdname`, there is only one `@family` annotation, to avoid a duplicated _See also_ section.
[GitHub] spark pull request #17976: [DOCS][SPARKR] Use verbose names for family annot...
GitHub user zero323 opened a pull request: https://github.com/apache/spark/pull/17976 [DOCS][SPARKR] Use verbose names for family annotations in functions.R ## What changes were proposed in this pull request? - Change current short annotations (same as Scala `@group`) to verbose names (same as Scala `@groupname`). Before: ![image](https://cloud.githubusercontent.com/assets/1554276/26033909/9a98b596-38b4-11e7-961e-15fd9ea7440d.png) After: ![image](https://cloud.githubusercontent.com/assets/1554276/26033903/727a9944-38b4-11e7-8873-b09c553f4ec3.png) - Add missing `@family` annotations. ## How was this patch tested? `check-cran.R` (skipping tests), manual inspection. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zero323/spark SPARKR-FUNCTIONS-DOCSTRINGS Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17976.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17976 commit 70723f5ae0662bde6b5454da07394cae240d46a5 Author: zero323 <zero...@users.noreply.github.com> Date: 2017-05-13T20:04:31Z Use verbose family names commit a006f320a18fe46abf608cb400cc542762a4d2ac Author: zero323 <zero...@users.noreply.github.com> Date: 2017-05-14T12:19:26Z Use lowercase family names
[GitHub] spark issue #17848: [SPARK-20586] [SQL] Add deterministic and distinctLike t...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17848 My concern is that people trying non-deterministic UDFs get tripped up by repeated computations at least as often as by internal optimizations, and the `nonDeterministic` flag might send the wrong message. In particular, let's say we have this fan-out / fan-in workflow depending on a non-deterministic source: ![image](https://cloud.githubusercontent.com/assets/1554276/26033144/64395fa0-38a5-11e7-9d0f-b2d6dbe51850.png) where dotted edges represent an arbitrary chain of transformations. Can we ensure that the state of each `foo` descendant in `sink` will be consistent (`x` hasn't been recomputed)? I hope my point here is clear.
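The recomputation concern is easy to demonstrate outside Spark: if a non-deterministic source is evaluated once per branch instead of once overall, the fan-in sees inconsistent values. A plain-Python sketch (all names invented; a call counter stands in for the non-deterministic source):

```python
import itertools

_counter = itertools.count()


def foo():
    # Non-deterministic "source": a different value on every call.
    return next(_counter)


def fan_in_recomputed():
    # Fan-out without materializing x: each branch re-evaluates the source,
    # so the fan-in compares values derived from *different* x.
    branch_a = foo() + 1
    branch_b = foo() + 1
    return branch_a == branch_b   # False: x was recomputed per branch


def fan_in_cached():
    # Materializing (caching) the source once before fanning out keeps
    # every descendant consistent.
    x = foo()
    return (x + 1) == (x + 1)     # True: both branches see the same x


print(fan_in_recomputed(), fan_in_cached())
```

In Spark terms, whether a plan node is evaluated once or per branch is an execution detail (caching, shuffle boundaries, optimizer rewrites), which is exactly why a `nonDeterministic` flag alone may suggest stronger guarantees than the engine provides.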
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17965#discussion_r116366145 --- Diff: R/pkg/R/generics.R --- @@ -799,6 +799,10 @@ setGeneric("write.df", function(df, path = NULL, ...) { standardGeneric("write.d #' @export setGeneric("randomSplit", function(x, weights, seed) { standardGeneric("randomSplit") }) +#' @rdname broadcast +#' @export +setGeneric("broadcast", function(x) { standardGeneric("broadcast") }) --- End diff -- > this list is sorted alphabetically within this section Looks like it used to be at some point, but these days are long gone. I can reorder it right now, but this means rearranging a whole section.
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
GitHub user zero323 reopened a pull request: https://github.com/apache/spark/pull/17965 [SPARK-20726][SPARKR] wrapper for SQL broadcast ## What changes were proposed in this pull request? - Adds R wrapper for `o.a.s.sql.functions.broadcast`. - Renames `broadcast` to `broadcast_`. ## How was this patch tested? Unit tests, check `check-cran.sh`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zero323/spark SPARK-20726 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17965.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17965 commit f190d62460829dcfb84ff1a8e6dd6fe9cbd25719 Author: zero323 <zero...@users.noreply.github.com> Date: 2017-05-12T15:54:46Z Initial implementation commit 397ab1f7b4b4e2b9e51b697c92e3be197fed4554 Author: zero323 <zero...@users.noreply.github.com> Date: 2017-05-12T17:38:31Z Fix style commit 246b91f8af84115af8f6283fb783000c9cc613ec Author: zero323 <zero...@users.noreply.github.com> Date: 2017-05-13T10:08:08Z Style commit 1530785f7469830446cd95717d524eb42d88e4ab Author: zero323 <zero...@users.noreply.github.com> Date: 2017-05-13T10:38:50Z Rename broadcast_ to broadcastRDD
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 closed the pull request at: https://github.com/apache/spark/pull/17965
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17938 @cloud-fan Thanks for the clarification. Just a thought - shouldn't we either support it consistently or not support it at all? The current behaviour is quite confusing, and I don't think that documentation alone will cut it.
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17965#discussion_r116355839 --- Diff: R/pkg/R/DataFrame.R --- @@ -3769,3 +3769,33 @@ setMethod("alias", sdf <- callJMethod(object@sdf, "alias", data) dataFrame(sdf) }) + --- End diff -- Done.
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17965#discussion_r116355836 --- Diff: R/pkg/R/generics.R --- @@ -799,6 +799,10 @@ setGeneric("write.df", function(df, path = NULL, ...) { standardGeneric("write.d #' @export setGeneric("randomSplit", function(x, weights, seed) { standardGeneric("randomSplit") }) +#' @rdname broadcast +#' @export +setGeneric("broadcast", function(x) { standardGeneric("broadcast") }) --- End diff -- It doesn't seem to affect the docs so I don't think we have to touch this for now: ![image](https://cloud.githubusercontent.com/assets/1554276/26024791/88a39940-37d9-11e7-9f11-ac1510b59215.png)
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17969#discussion_r116355659 --- Diff: R/pkg/R/mllib_wrapper.R --- @@ -0,0 +1,61 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +#' S4 class that represents a Java ML model +#' +#' @param jobj a Java object reference to the backing Scala model +#' @export +#' @note JavaModel since 2.3.0 +setClass("JavaModel", representation(jobj = "jobj")) + +#' Makes predictions from a Java ML model +#' +#' @param object a Spark ML model. +#' @param newData a SparkDataFrame for testing. +#' @return \code{predict} returns a SparkDataFrame containing predicted value. +#' @rdname spark.predict +#' @aliases predict,JavaModel-method --- End diff -- I am biased here, but I'll argue that it doesn't. Both `predict` and `write.ml` (same as `read.ml`) are extremely generic and in general we don't provide any useful information about these. And the usage is already covered by class `examples`. 
Finally, we can use `@seealso` to provide a bit more R-ish experience if you think it is not enough. Something along the lines of the `lm` docs: ![image](https://cloud.githubusercontent.com/assets/1554276/26024731/2214f012-37d8-11e7-9afb-b750e9c647ff.png) Moreover, using this approach significantly reduces the amount of clutter in the generated docs. They are shorter, and the argument list is focused on the important parts, same as `value`. See for example the GLM docs below. So IMHO this is actually a significant improvement. Personally I would do the same with all the `print`s and `summaries` as well, although it wouldn't reduce the codebase (for now :)). This would further shorten the docs and remove awkward descriptions like this: ![image](https://cloud.githubusercontent.com/assets/1554276/26024707/567b2020-37d7-11e7-8c21-260404d7767d.png) And from the developer side it is a clear win. No mindless copy / paste / replace cycle, and more time to provide useful examples. __Before__: ![image](https://cloud.githubusercontent.com/assets/1554276/26024648/1c36253c-37d6-11e7-9411-72c0c14c54a8.png) __After__: ![image](https://cloud.githubusercontent.com/assets/1554276/26024653/2643bd64-37d6-11e7-8463-08662611cd37.png)
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17965#discussion_r116355102 --- Diff: R/pkg/R/DataFrame.R --- @@ -3769,3 +3769,33 @@ setMethod("alias", sdf <- callJMethod(object@sdf, "alias", data) dataFrame(sdf) }) + + +#' broadcast +#' +#' Return a new SparkDataFrame marked as small enough for use in broadcast joins. +#' +#' Equivalent to hint(x, "broadcast"). --- End diff -- I'll double check this, but for some reason `\code` here made `roxygen` unhappy when I tried it last time.
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17938 @gatorsmile Huh... in that case it looks like the parser (?) needs a little bit of work, unless of course the following are features.

- Omitting `USING` doesn't work:

```sql
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
CLUSTERED BY(user_id) INTO 256 BUCKETS
```

with:

```
Error in query: Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 1, pos 0)

== SQL ==
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
^^^
CLUSTERED BY(user_id) INTO 256 BUCKETS
```

- Omitting `USING` and adding `PARTITIONED BY` with a column not present in the main clause (valid Hive DDL) doesn't work:

```sql
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
PARTITIONED BY (department STRING)
CLUSTERED BY(user_id) INTO 256 BUCKETS
```

with:

```
Error in query: Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 1, pos 2)

== SQL ==
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
--^^^
PARTITIONED BY (department STRING)
CLUSTERED BY(user_id) INTO 256 BUCKETS
```

- `PARTITIONED BY` alone works:

```sql
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
PARTITIONED BY (department STRING)
```

- `PARTITIONED BY` with `USING`, when the partition column is in the main spec, works:

```sql
CREATE TABLE user_info_bucketed(
    user_id BIGINT, firstname STRING, lastname STRING, department STRING)
USING parquet
PARTITIONED BY (department)
```

- `CLUSTERED BY` + `PARTITIONED BY` with `USING`, when the partition column is in the main spec, works:

```sql
CREATE TABLE user_info_bucketed(
    user_id BIGINT, firstname STRING, lastname STRING, department STRING)
USING parquet
PARTITIONED BY (department)
CLUSTERED BY(user_id) INTO 256 BUCKETS
```

- `PARTITIONED BY` when the partition column is in the main spec, `USING` omitted, doesn't work:

```sql
CREATE TABLE user_info_bucketed(
    user_id BIGINT, firstname STRING, lastname STRING, department STRING)
PARTITIONED BY (department)
```

with:

```
Error in query: mismatched input ')' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IGNORE', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 3, pos 30)

== SQL ==
CREATE TABLE user_info_bucketed(
    user_id BIGINT, firstname STRING, lastname STRING, department STRING)
PARTITIONED
```
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116346161

--- Diff: R/pkg/R/generics.R ---

```
@@ -1535,9 +1535,7 @@ setGeneric("spark.freqItemsets", function(object) { standardGeneric("spark.freqI
 #' @export
 setGeneric("spark.associationRules", function(object) { standardGeneric("spark.associationRules") })

-#' @param object a fitted ML model object.
```

--- End diff --

I think it makes more sense to keep param annotations with the concrete implementation; keeping both would violate style by duplicating Rd entries. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116346059

--- Diff: R/pkg/R/mllib_wrapper.R ---

```
@@ -0,0 +1,61 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+#' S4 class that represents a Java ML model
+#'
+#' @param jobj a Java object reference to the backing Scala model
```

--- End diff --

We use "backing" all over the docs. I am not sure if "backend" is really better or not, but changing this only here doesn't make sense.
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116345958

--- Diff: R/pkg/DESCRIPTION ---

```
@@ -42,6 +42,7 @@ Collate:
     'functions.R'
     'install.R'
     'jvm.R'
+    'mllib_wrapper.R'
```

--- End diff --

No. Even if it wasn't automatically generated by `roxygen`, we have to enforce loading base case classes first.
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
GitHub user zero323 reopened a pull request:

https://github.com/apache/spark/pull/17965

[SPARK-20726][SPARKR] wrapper for SQL broadcast

## What changes were proposed in this pull request?

- Adds R wrapper for `o.a.s.sql.functions.broadcast`.
- Renames `broadcast` to `broadcast_`.

## How was this patch tested?

Unit tests, `check-cran.sh`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-20726

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17965.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17965

commit f190d62460829dcfb84ff1a8e6dd6fe9cbd25719
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T15:54:46Z
    Initial implementation

commit 397ab1f7b4b4e2b9e51b697c92e3be197fed4554
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T17:38:31Z
    Fix style
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 closed the pull request at: https://github.com/apache/spark/pull/17965
[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/17969

[SPARK-20729][SPARKR][ML] Reduce boilerplate in Spark ML models

## What changes were proposed in this pull request?

- Add `JavaModel` and `JavaMLWritable` S4 classes and mix them with existing ML wrappers.
- Remove individual implementations of `predict` and `write.ml`.

## How was this patch tested?

Unit tests, `check_cran.sh`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-20729

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17969.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17969

commit 8f76158762d74dcf7fa58a9e3f78683a5712e7ad
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T21:49:01Z
    Add JavaModel class

commit a77a714f284fe33e425065eed13ae748ef4bf16b
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T22:13:43Z
    Remove predict impls from mllib_regression.R

commit 31d60bc422be9b59f37c6ee2b4a2852625d56620
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T22:20:01Z
    Remove predict impls from mllib_classification.R

commit 6e7bfdc672140ccee23649273c2d622f7ae78e7d
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T22:22:06Z
    Remove predict impls from mllib_clustering.R

commit 95207fdfd6eebbe0374ed6c241b57adb24666d42
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T22:23:32Z
    Remove predict impls from mllib_fpm.R

commit 93eefc4e6bc346e50a70a87114f7c51cfe0865b6
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T22:24:29Z
    Remove predict impls from mllib_recommendation.R

commit a060dc76473b6cd9dfcf72ba73bd9eb34031b078
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T22:27:15Z
    Remove predict impls from mllib_tree.R

commit 7be99929cc3391b075150b65e7daae21c1e97c63
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T22:51:23Z
    Add JavaMLWritable

commit 322be5d511b01cf6dc4516a7799e945391db5c47
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T22:55:42Z
    Remove write.ml impls from mllib_tree.R

commit 7e16a53a671380fd79c2b4e50ac0c78c4aa8b390
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T22:56:38Z
    Remove write.ml impls from mllib_recommendation.R

commit dfbf2f94675114269a37991a83ece2c9644b546c
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T22:57:59Z
    Remove write.ml impls from mllib_regression.R

commit 58ef13061d58caaba91b23221763418d78c918f6
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T22:59:50Z
    Remove write.ml impls from mllib_classification.R

commit 50056a79cc25ae951ac788769680fa016f471406
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T23:01:01Z
    Remove write.ml impls from mllib_clustering.R

commit 0f67137d7f1976d4e497964542bbe1f97d30401e
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T23:02:09Z
    Remove write.ml impls from mllib_fpm.R

commit b29d0e21bca5cc12bb604dae4a60be93879bbf9c
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T23:02:49Z
    Add seealso to write.ml

commit 1759cf7613385e68d43da4646dbcb1e0ef1b4a87
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T23:04:49Z
    Change rdname to write.ml

commit 72f8bcaabeb9150d5ce209a7f8fab36eefd7e4c3
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T23:06:16Z
    Correct since annotation

commit 95ec108ae7664c23d268facec0af1c37c6899ff3
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T23:11:40Z
    Remove param annotations from generics

commit d7d9d4960132ccc985423b607357d7e56b6f5375
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T23:16:38Z
    Annotate object in mllib_tree.R

commit 42c372d62b4c33b778f2ccdde030faea300e5159
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T23:34:42Z
    Add ... annotation
[GitHub] spark issue #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17965

Points to discuss:

- Do we really need this? It gives us full API parity but is not strictly necessary; `hint(df, "broadcast")` should be equivalent.
- Is this the best implementation? Some alternatives:
  - Use generics for both, with `signature(x = "SparkDataFrame", "missing")` for the `DataFrame` version and `signature(x = "jobj", object = "Any")` for the general version. This would keep the internal API intact, but is hard to document without leaking internal details.
  - Use a different name for the `DataFrame` version, for example `broadcast_table`. This is a bit verbose, and slightly harder to port for users.
- Is `dataframe.R` the best location? It is generic on `SparkDataFrame`, so `functions.R` doesn't feel like the right choice.
[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/17965

[SPARK-20726][SPARKR] wrapper for SQL broadcast

## What changes were proposed in this pull request?

Adds R wrapper for `o.a.s.sql.functions.broadcast`.

## How was this patch tested?

Unit tests, `check-cran.sh`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-20726

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17965.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17965

commit f190d62460829dcfb84ff1a8e6dd6fe9cbd25719
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-12T15:54:46Z
    Initial implementation
[GitHub] spark issue #17932: [SPARK-20689][PYSPARK] python doctest leaking bucketed t...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17932 I see I am the one to blame here. Sorry for that @felixcheung
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17938 @gatorsmile Sure, but I assume you mean only `PARTITION BY`, right? I don't think `CLUSTER BY` or `SORT BY` is supported in SQL (should they be supported after #17644 is resolved?).
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17938#discussion_r116072807

--- Diff: docs/sql-programming-guide.md ---

```
@@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
 Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.

+### Bucketing, Sorting and Partitioning
```

--- End diff --

Oh, I thought you were implying there are some known issues. This actually behaves sensibly: all supported options seem to work independent of the order, and unsupported ones (`partitionBy` + `sortBy` without `bucketBy`, or overlapping `bucketBy` and `partitionBy` columns) give enough feedback to diagnose the issue. I haven't tested this with large datasets though, so there could be hidden issues.
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17938#discussion_r116030963

--- Diff: docs/sql-programming-guide.md ---

```
@@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
 Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.

+### Bucketing, Sorting and Partitioning
```

--- End diff --

@cloud-fan I think we can redirect to partition discovery here. But explaining the difference and possible applications (low vs. high cardinality) could be a good idea.
[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17938#discussion_r116029940

--- Diff: docs/sql-programming-guide.md ---

```
@@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadat
 Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.

+### Bucketing, Sorting and Partitioning
```

--- End diff --

@tejasapatil

> There could be multiple possible orderings of `partitionBy`, `bucketBy` and `sortBy` calls. Not all of them are supported, not all of them would produce correct outputs.

Shouldn't the output be the same no matter the order? `sortBy` is not applicable for `partitionBy` and takes precedence over `bucketBy`, if both are present. This is Hive's behaviour if I am not mistaken, and at first glance Spark is doing the same thing. Is there any gotcha here?
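The layering described above (partition directories first, then hash buckets, then the in-bucket sort) can be sketched in plain Python. This is a simplified, illustrative model only: Spark actually uses Murmur3 hashing and writes one set of bucket files per partition directory, and the function and names below are hypothetical.

```python
from collections import defaultdict

def layout(rows, partition_col, bucket_col, sort_col, num_buckets):
    """Model how rows are grouped by partitionBy/bucketBy/sortBy:
    one directory per partition value, num_buckets files per directory,
    rows sorted within each bucket. Uses Python's hash(), not Murmur3."""
    dirs = defaultdict(lambda: defaultdict(list))
    for row in rows:
        part = row[partition_col]                      # directory, e.g. department=IT/
        bucket = hash(row[bucket_col]) % num_buckets   # bucket file within the directory
        dirs[part][bucket].append(row)
    for part in dirs:                                  # sortBy applies inside each bucket
        for bucket in dirs[part]:
            dirs[part][bucket].sort(key=lambda r: r[sort_col])
    return dirs

rows = [
    {"user_id": 3, "department": "IT"},
    {"user_id": 1, "department": "IT"},
    {"user_id": 2, "department": "HR"},
]
out = layout(rows, "department", "user_id", "user_id", num_buckets=4)
```

In this model the bucket assignment is independent of the call order, which matches the observation above that supported option combinations behave the same regardless of ordering.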
[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17938 @HyukjinKwon Sounds good. [SPARK-20694](https://issues.apache.org/jira/browse/SPARK-20694). Should we document the difference between buckets (metastore based) and partitions (file system based)? The latter could be done by referencing [Partition Discovery](https://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery).
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17077 @gatorsmile #17938
[GitHub] spark pull request #17938: [DOCS][SQL] Document bucketing and partitioning i...
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/17938

[DOCS][SQL] Document bucketing and partitioning in SQL guide

## What changes were proposed in this pull request?

- Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy`.
- Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide

## How was this patch tested?

Manual tests, docs build.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark DOCS-BUCKETING-AND-PARTITIONING

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17938.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17938

commit 560fd7978c2a18c8c216604eeea4563bcc4f7c5c
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-10T09:56:28Z
    Add Scala examples

commit c0b037b302b10c20b2dadcc32048f3ee370d1864
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-10T09:56:50Z
    Add Python examples

commit b2f45efcb883508e906232582e4a9e89b7f706d0
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-10T10:22:27Z
    Add Java examples

commit 0af67cea0f1a1644139115274f14dab76732b5b5
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-10T10:32:47Z
    Add examples to sql guide
[GitHub] spark issue #17891: [SPARK-20631][PYTHON][ML] LogisticRegression._checkThres...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17891 Thanks @yanboliang!
[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r11267

--- Diff: python/pyspark/ml/classification.py ---

```
@@ -374,6 +415,48 @@ def getFamily(self):
         """
         return self.getOrDefault(self.family)

+    @since("2.2.0")
```

--- End diff --

Probably. I've seen that the Scala version has been targeted for 2.2.1, so who knows? But let's make it 2.3.
[GitHub] spark issue #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17825 Thanks @felixcheung
[GitHub] spark pull request #17922: [SPARK-2060][PYTHON][ML] Python API Changes for C...
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/17922

[SPARK-2060][PYTHON][ML] Python API Changes for Constrained Logistic Regression Params

## What changes were proposed in this pull request?

- Add new `Params` to `pyspark.ml.classification.LogisticRegression`.
- Add `toMatrix` method to `pyspark.ml.param.TypeConverters`.
- Add `generate_multinomial_logistic_input` helper to `pyspark.ml.tests`.

## How was this patch tested?

Unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-20601

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17922.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17922
[GitHub] spark issue #17891: [SPARK-20631][PYTHON][ML] LogisticRegression._checkThres...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17891 cc @jkbradley
[GitHub] spark pull request #17891: [SPARK-20631][PYTHON][ML] LogisticRegression._che...
GitHub user zero323 reopened a pull request:

https://github.com/apache/spark/pull/17891

[SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params

## What changes were proposed in this pull request?

- Replace `getParam` calls with `getOrDefault` calls.
- Fix exception message to avoid unintended `TypeError`.
- Add unit tests

## How was this patch tested?

New unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-20631

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17891.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17891

commit 098e26202bfed089efad057b3eead593ffda08b3
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-07T19:36:40Z
    Use getOrDefault to access values
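The bug class being fixed here is easy to reproduce outside Spark: `getParam` returns the `Param` descriptor object itself, while `getOrDefault` returns its value, and comparing a descriptor against a float raises `TypeError` in Python 3. The classes below are illustrative stand-ins, not the actual `pyspark.ml.param` implementation.

```python
class Param:
    """Descriptor-like object that identifies a parameter, not its value."""
    def __init__(self, name, default):
        self.name = name
        self.default = default

class Params:
    """Minimal stand-in for a Params-style container."""
    def __init__(self, **values):
        self._params = {"threshold": Param("threshold", 0.5)}
        self._values = values

    def getParam(self, name):
        # Returns the Param object itself -- the source of the original bug.
        return self._params[name]

    def getOrDefault(self, name):
        # Returns the actual value (or the declared default).
        return self._values.get(name, self._params[name].default)

m = Params(threshold=0.7)
value = m.getOrDefault("threshold")   # 0.7 -- safe to compare numerically

# Comparing the Param object to a float raises TypeError in Python 3,
# which is the confusing failure mode this PR removes:
try:
    m.getParam("threshold") < 0.6
    comparable = True
except TypeError:
    comparable = False
```

This mirrors why `_checkThresholdConsistency` had to switch from `getParam` to `getOrDefault` before doing numeric checks.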
[GitHub] spark pull request #17891: [SPARK-20631][PYTHON][ML] LogisticRegression._che...
Github user zero323 closed the pull request at: https://github.com/apache/spark/pull/17891
[GitHub] spark pull request #17891: [SPARK-11834][PYTHON][ML] LogisticRegression._che...
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/17891

[SPARK-11834][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params

## What changes were proposed in this pull request?

- Replace `getParam` calls with `getOrDefault` calls.
- Fix exception message to avoid unintended `TypeError`.
- Add unit tests

## How was this patch tested?

New unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-20631

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17891.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17891

commit 098e26202bfed089efad057b3eead593ffda08b3
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-07T19:36:40Z
    Use getOrDefault to access values
[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r115151299

--- Diff: python/pyspark/sql/readwriter.py ---

```
@@ -563,6 +563,63 @@ def partitionBy(self, *cols):
         self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
         return self

+    @since(2.3)
+    def bucketBy(self, numBuckets, col, *cols):
+        """Buckets the output by the given columns. If specified,
+        the output is laid out on the file system similar to Hive's bucketing scheme.
+
+        :param numBuckets: the number of buckets to save
+        :param col: a name of a column, or a list of names.
+        :param cols: additional names (optional). If `col` is a list it should be empty.
+
+        .. note:: Applicable for file-based data sources in combination with
+            :py:meth:`DataFrameWriter.saveAsTable`.
```

--- End diff --

@gatorsmile Can we?

```
➜  spark git:(master) git rev-parse HEAD
2cf83c47838115f71419ba5b9296c69ec1d746cd
➜  spark git:(master) bin/spark-shell
Spark context Web UI available at http://192.168.1.101:4041
Spark context available as 'sc' (master = local[*], app id = local-1494184109262).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.

scala> Seq(("a", 1, 3)).toDF("x", "y", "z").write.bucketBy(3, "x", "y").format("parquet").save("/tmp/foo")
org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now;
  at org.apache.spark.sql.DataFrameWriter.assertNotBucketed(DataFrameWriter.scala:305)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:231)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  ... 48 elided
```
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17077

@gatorsmile

> Could you also update the SQL document?

Sure, but I'll need some guidance here. Somewhere in the [Generic Load/Save Functions](https://spark.apache.org/docs/latest/sql-programming-guide.html#generic-loadsave-functions) section, right? But I guess we'll need a separate section for that. And we should probably document `partitionBy` as well.
[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17077#discussion_r115138060

--- Diff: python/pyspark/sql/readwriter.py ---
@@ -563,6 +563,60 @@ def partitionBy(self, *cols):
         self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
         return self

+    @since(2.3)
+    def bucketBy(self, numBuckets, *cols):
+        """Buckets the output by the given columns on the file system.
+
+        :param numBuckets: the number of buckets to save
+        :param cols: name of columns
+
+        .. note:: Applicable for file-based data sources in combination with
+            :py:meth:`DataFrameWriter.saveAsTable`.
+
+        >>> (df.write.format('parquet')
+        ...     .bucketBy(100, 'year', 'month')
+        ...     .mode("overwrite")
+        ...     .saveAsTable('bucketed_table'))
+        """
+        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
+            cols = cols[0]
+
+        if not isinstance(numBuckets, int):
+            raise TypeError("numBuckets should be an int, got {0}.".format(type(numBuckets)))
+
+        if not all(isinstance(c, basestring) for c in cols):
+            raise TypeError("cols argument should be a string or a sequence of strings.")

--- End diff --

Or we could just replace the error message with:

```
"cols argument should be a string, List[str] or Tuple[str, ...]"
```
[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17077#discussion_r115138021

--- Diff: python/pyspark/sql/readwriter.py ---
@@ -563,6 +563,60 @@ def partitionBy(self, *cols):
         self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
         return self

+    @since(2.3)
+    def bucketBy(self, numBuckets, *cols):
+        """Buckets the output by the given columns on the file system.
+
+        :param numBuckets: the number of buckets to save
+        :param cols: name of columns
+
+        .. note:: Applicable for file-based data sources in combination with
+            :py:meth:`DataFrameWriter.saveAsTable`.
+
+        >>> (df.write.format('parquet')
+        ...     .bucketBy(100, 'year', 'month')
+        ...     .mode("overwrite")
+        ...     .saveAsTable('bucketed_table'))
+        """
+        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):

--- End diff --

Why do you say that? `cols` is variadic, so it should always be `Sized`.
[GitHub] spark issue #17831: [SPARK-18777][PYTHON][SQL] Return UDF from udf.register
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17831 Thanks everyone.
[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17077#discussion_r115133626

--- Diff: python/pyspark/sql/readwriter.py ---
@@ -563,6 +563,60 @@ def partitionBy(self, *cols):
         self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
         return self

+    @since(2.3)
+    def bucketBy(self, numBuckets, *cols):
+        """Buckets the output by the given columns on the file system.
+
+        :param numBuckets: the number of buckets to save
+        :param cols: name of columns
+
+        .. note:: Applicable for file-based data sources in combination with
+            :py:meth:`DataFrameWriter.saveAsTable`.
+
+        >>> (df.write.format('parquet')
+        ...     .bucketBy(100, 'year', 'month')
+        ...     .mode("overwrite")
+        ...     .saveAsTable('bucketed_table'))
+        """
+        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
+            cols = cols[0]
+
+        if not isinstance(numBuckets, int):
+            raise TypeError("numBuckets should be an int, got {0}.".format(type(numBuckets)))
+
+        if not all(isinstance(c, basestring) for c in cols):
+            raise TypeError("cols argument should be a string or a sequence of strings.")

--- End diff --

Good point. We could support arbitrary `Iterable[str]`, though:

```python
if len(cols) == 1 and isinstance(cols[0], collections.abc.Iterable):
    cols = list(cols[0])
```

The caveat is that we don't allow this anywhere else.
[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17825#discussion_r115113723

--- Diff: R/pkg/R/DataFrame.R ---
@@ -3745,3 +3745,26 @@
         jdf <- callJMethod(x@sdf, "hint", name, parameters)
         dataFrame(jdf)
       })
+
+#' alias
+#'
+#' @aliases alias,SparkDataFrame-method
+#' @family SparkDataFrame functions
+#' @rdname alias
+#' @name alias
+#' @examples

--- End diff --

Done, but do we actually need this? We don't use roxygen to maintain `NAMESPACE`, and (I believe I mentioned this before) we `@export` objects which are not really exported. Just saying...
[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17825#discussion_r115085302

--- Diff: R/pkg/R/generics.R ---
@@ -387,6 +387,16 @@ setGeneric("value", function(bcast) { standardGeneric("value") })
 #' @export
 setGeneric("agg", function (x, ...) { standardGeneric("agg") })
+
+#' alias
+#'
+#' Set a new name for a Column or a SparkDataFrame. Equivalent to SQL "AS" keyword.
+#'
+#' @name alias
+#' @rdname alias
+#' @param object x a Column or a SparkDataFrame
+#' @param data new name to use

--- End diff --

On the bright side, it looks like matching `@rdname` and `@aliases` like:

```r
#' alias
#'
#' @aliases alias,SparkDataFrame-method
#' @family SparkDataFrame functions
#' @rdname alias,SparkDataFrame-method
#' @name alias
...
```

and the analogous annotations for the other method (I hope this is what you mean) indeed solves SPARK-18825. But it doesn't generate any docs for these two and makes the CRAN checker unhappy:

```
Undocumented S4 methods:
  generic 'alias' and siglist 'Column'
  generic 'alias' and siglist 'SparkDataFrame'
```

Docs for the generic are created, but that doesn't help us here. Even if we bring `@examples` there, we still have to deal with CRAN. There is also my favorite, `\name must exist and be unique in Rd files`, which doesn't give us much room here, does it? I'm open to suggestions, but personally I am out of ideas. I've been digging through the `roxygen` docs, but between CRAN, S4 requirements, `roxygen` limitations and our own rules there is not much room left.
[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17825#discussion_r114931344

--- Diff: R/pkg/R/generics.R ---
@@ -387,6 +387,16 @@ setGeneric("value", function(bcast) { standardGeneric("value") })
 #' @export
 setGeneric("agg", function (x, ...) { standardGeneric("agg") })
+
+#' alias
+#'
+#' Set a new name for a Column or a SparkDataFrame. Equivalent to SQL "AS" keyword.

--- End diff --

I still believe that AS is applicable to both. Essentially what we do is:

```
SELECT column AS new_column FROM table
```

and

```
(SELECT * FROM table) AS new_table
```
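The equivalence claimed here can be checked against any SQL engine; below is a minimal sketch using Python's built-in `sqlite3` (the table name `t` and column name `x` are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")

# Column alias: SELECT column AS new_column
cur = conn.execute("SELECT x AS new_x FROM t")
col_name = cur.description[0][0]  # the result column is named "new_x"
row = cur.fetchone()

# Table (subquery) alias: (SELECT ...) AS new_table
row2 = conn.execute(
    "SELECT new_t.x FROM (SELECT * FROM t) AS new_t").fetchone()
```

Both forms of `AS` work here, mirroring how `alias` applies to a `Column` and to a `SparkDataFrame` respectively.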
[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17825#discussion_r114931185

--- Diff: R/pkg/R/generics.R ---
@@ -387,6 +387,16 @@ setGeneric("value", function(bcast) { standardGeneric("value") })
 #' @export
 setGeneric("agg", function (x, ...) { standardGeneric("agg") })
+
+#' alias
+#'
+#' Set a new name for a Column or a SparkDataFrame. Equivalent to SQL "AS" keyword.
+#'
+#' @name alias
+#' @rdname alias
+#' @param object x a Column or a SparkDataFrame
+#' @param data new name to use

--- End diff --

To be honest I find both equally confusing, so if you think that a single annotation is better, I am happy to oblige.
[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17825#discussion_r114929528

--- Diff: R/pkg/R/generics.R ---
@@ -387,6 +387,16 @@ setGeneric("value", function(bcast) { standardGeneric("value") })
 #' @export
 setGeneric("agg", function (x, ...) { standardGeneric("agg") })
+
+#' alias
+#'
+#' Set a new name for a Column or a SparkDataFrame. Equivalent to SQL "AS" keyword.
+#'
+#' @name alias
+#' @rdname alias
+#' @param object x a Column or a SparkDataFrame
+#' @param data new name to use

--- End diff --

Wouldn't it be better to annotate the actual implementations? To get something like this:

![image](https://cloud.githubusercontent.com/assets/1554276/25733425/295f465e-3159-11e7-87b7-d959c9bf3352.png)
[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias
Github user zero323 closed the pull request at: https://github.com/apache/spark/pull/17825
[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias
GitHub user zero323 reopened a pull request: https://github.com/apache/spark/pull/17825

[SPARK-20550][SPARKR] R wrapper for Dataset.alias

## What changes were proposed in this pull request?

- Add SparkR wrapper for `Dataset.alias`.
- Adjust roxygen annotations for `functions.alias` (including example usage).

## How was this patch tested?

Unit tests, `check_cran.sh`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark SPARK-20550

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17825.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17825

commit 944a3ec791a8f103093e24511e895a4ce60970d8
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-01T08:59:24Z
    Initial implementation

commit 5e9f8da45c432e0752e5e78556add33e0a6d0557
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-01T22:27:11Z
    Adjust argument annotations
    - Remove param annotations from dataframe.alias
    - Use generic annotations for column.alias

commit 73133f9442ad8317fb12b600221962bf47d8a95c
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-01T22:31:26Z
    Add usage examples to column.alias

commit 848eeefc1f18c6aabaf65e6efed259a2fa5c19c3
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-01T22:34:51Z
    Remove return type annotation

commit 05c0781110b42a940e06cc31650449a8715e85c9
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-02T02:00:13Z
    Fix typo

commit 22d7cf661bb54a8f7f9c660e1d914802f1eb4153
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-02T04:25:34Z
    Move dontruns to their own lines

commit 22e1292557f1a5597cde6337267a099bbcdc07aa
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-02T04:27:11Z
    Extend param description

commit 6bb3d914960d1cf63e582a7d732ca80ed321e9c5
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-02T04:33:34Z
    Add type annotations to since notes

commit b3c1a416a16a9d32649edda2b66fc9c3476358a5
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-02T04:38:51Z
    Attach alias test to select-with-column test case

commit 40fedcb8c41bc84deead205aad81e84c095045b5
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-02T04:44:45Z
    Extend description

commit 1e1ad443751fc3dc93487e5385cc934feb93f631
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-03T00:25:15Z
    Move alias documentation to generics

commit 2d5ace288f2443327696823c343c095f0d8d64ca
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-04T01:13:45Z
    Add family annotation

commit 5fe5495580eb3852ea5092a34dc2334c0e45c9b7
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-04T06:32:54Z
    Check that stats::alias is not masked

commit 09f9ccaf5e66a400d26b4ab6d600d951305d5fd3
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-04T07:04:52Z
    Fix style

commit f1c74f338b8df865a5e8b9a6e281211aa27af7d3
Author: zero323 <zero...@users.noreply.github.com>
Date: 2017-05-04T10:17:42Z
    vim
[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17825#discussion_r114925159

--- Diff: R/pkg/R/generics.R ---
@@ -387,6 +387,16 @@ setGeneric("value", function(bcast) { standardGeneric("value") })
 #' @export
 setGeneric("agg", function (x, ...) { standardGeneric("agg") })
+
+#' alias
+#'
+#' Set a new name for a Column or a SparkDataFrame. Equivalent to SQL "AS" keyword.

--- End diff --

How about:

```
#' Return a new Column or a SparkDataFrame with a name set. Equivalent to SQL "AS" keyword.
```

Is the `Column` new?
[GitHub] spark issue #17851: [SPARK-20585][SPARKR] R generic hint support
Github user zero323 commented on the issue: https://github.com/apache/spark/pull/17851 Thanks.
[GitHub] spark pull request #17851: [SPARK-20585][SPARKR] R generic hint support
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17851#discussion_r114709260

--- Diff: R/pkg/R/DataFrame.R ---
@@ -3715,3 +3715,34 @@
         sgd <- callJMethod(x@sdf, "rollup", jcol)
         groupedData(sgd)
       })
+
+#' hint
+#'
+#' Specifies execution plan hint on the current SparkDataFrame.
+#'
+#' @param x a SparkDataFrame.
+#' @param name a name of the hint.
+#' @param ... additional argument(s) passed to the method.
+#'
+#' @return A SparkDataFrame.
+#' @family SparkDataFrame functions
+#' @aliases hint,SparkDataFrame,character-method
+#' @rdname hint
+#' @name hint
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(mtcars)
+#' avg_mpg <- mean(groupBy(createDataFrame(mtcars), "cyl"), "mpg")

--- End diff --

Also, with the alias it will be quite dense:

```r
#' @examples
#' \dontrun{
#' # Set aliases to avoid ambiguity
#' df <- alias(createDataFrame(mtcars), "cars")
#' avg_mpg <- alias(mean(groupBy(createDataFrame(mtcars), "cyl"), "mpg"), "avg_mpg")
#'
#' head(join(
#'   df, hint(avg_mpg, "broadcast"),
#'   column("cars.cyl") == column("avg_mpg.cyl")
#' ))
#' }
```