[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

2017-08-29 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/16537
  
Thanks @HyukjinKwon 





[GitHub] spark issue #18759: [SPARK-20601][ML] Python API for Constrained Logistic Re...

2017-08-29 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18759
  
Thanks @yanboliang 





[GitHub] spark pull request #16822: [SPARK-19475][PYTHON][ML][MLLIB] Support (ml|mlli...

2017-07-13 Thread zero323
Github user zero323 closed the pull request at:

https://github.com/apache/spark/pull/16822





[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__...

2017-07-13 Thread zero323
Github user zero323 closed the pull request at:

https://github.com/apache/spark/pull/16537





[GitHub] spark pull request #18052: [SPARK-20347][PYSPARK][WIP] Provide AsyncRDDActio...

2017-07-13 Thread zero323
Github user zero323 closed the pull request at:

https://github.com/apache/spark/pull/18052





[GitHub] spark issue #17922: [SPARK-20601][PYTHON][ML] Python API Changes for Constra...

2017-07-13 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17922
  
@BryanCutler @yanboliang @nchammas Thanks for all the comments. 
Unfortunately I don't have access to hardware I can use for development at 
the moment, and most likely I won't for the upcoming weeks. I am going to 
close this PR, but I'd really appreciate it if one of you could pick it up from 
here. TIA





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-07-13 Thread zero323
Github user zero323 closed the pull request at:

https://github.com/apache/spark/pull/17922





[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__...

2017-07-05 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16537#discussion_r125755849
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1949,6 +1949,14 @@ def _create_judf(self):
         return judf
 
     def __call__(self, *cols):
+        for c in cols:
+            if not isinstance(c, (Column, str)):
--- End diff --

@HyukjinKwon Sorry for the delayed response, I am seldom online these days. 
You're right, it looks like an issue. I'll take a look at this when I have 
more time.





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-06-23 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r123766355
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -832,6 +860,96 @@ def test_logistic_regression(self):
         except OSError:
             pass
 
+    def logistic_regression_check_thresholds(self):
+        self.assertIsInstance(
+            LogisticRegression(threshold=0.5, thresholds=[0.5, 0.5]),
+            LogisticRegressionModel
+        )
+
+        self.assertRaisesRegexp(
+            ValueError,
+            "Logistic Regression getThreshold found inconsistent.*$",
+            LogisticRegression, threshold=0.42, thresholds=[0.5, 0.5]
+        )
+
+    def test_binomial_logistic_regression_bounds(self):
--- End diff --

Example datasets are not that good for checking constraints, and a generator 
seems like a better idea than creating a large enough example by hand. I can of 
course remove it if this is an issue.
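
A minimal sketch of what I mean by a generator (my own illustration with hypothetical helper names, not the code in this PR; it assumes the bound parameters exposed by this work):

```python
import random

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Matrices, Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
random.seed(42)

def make_row():
    # The label follows the sign of the single feature, with a little noise.
    x = random.gauss(0.0, 1.0)
    label = 1.0 if x + random.gauss(0.0, 0.1) > 0 else 0.0
    return (label, Vectors.dense([x]))

df = spark.createDataFrame([make_row() for _ in range(100)], ["label", "features"])

# Constrain the single coefficient to be non-negative and check the fit respects it.
lr = LogisticRegression(lowerBoundsOnCoefficients=Matrices.dense(1, 1, [0.0]))
model = lr.fit(df)
assert model.coefficients[0] >= 0.0
```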





[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-06-23 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r123765079
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -170,6 +170,15 @@ def toVector(value):
         raise TypeError("Could not convert %s to vector" % value)
 
     @staticmethod
+    def toMatrix(value):
+        """
+        Convert a value to ML Matrix, if possible
--- End diff --

While I am aware of this, the distinction between `ml.linalg` and 
`mllib.linalg` is a common source of confusion for PySpark users. Of 
course we could be more forgiving and automatically convert objects to the 
required class.
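
A rough sketch of the kind of coercion I have in mind (a hypothetical helper, not the code in this PR), accepting both `ml.linalg` and `mllib.linalg` matrices:

```python
from pyspark.ml.linalg import DenseMatrix, Matrix
from pyspark.mllib.linalg import Matrix as OldMatrix

def to_ml_matrix(value):
    """Coerce value to a pyspark.ml.linalg.Matrix, if possible."""
    if isinstance(value, Matrix):
        return value
    if isinstance(value, OldMatrix):
        # Both APIs expose numRows, numCols and toArray, so a dense copy is enough here.
        values = value.toArray().ravel(order="F").tolist()  # column-major, as DenseMatrix expects
        return DenseMatrix(value.numRows, value.numCols, values)
    raise TypeError("Could not convert %s to a ML Matrix" % (value,))
```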





[GitHub] spark issue #18049: [SPARK-20830][PYSPARK][SQL] Add posexplode and posexplod...

2017-06-21 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18049
  
Thanks @ueshin!





[GitHub] spark pull request #18049: [SPARK-20830][PYSPARK][SQL] Add posexplode and po...

2017-06-21 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18049#discussion_r123362792
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -272,6 +276,11 @@ def test_explode(self):
         self.assertEqual(result[0][0], "a")
         self.assertEqual(result[0][1], "b")
 
+        self.assertEqual(data.select(posexplode_outer("intlist")).count(), 5)
--- End diff --

@ueshin Of course, is this enough?





[GitHub] spark issue #18052: [SPARK-20347][PYSPARK][WIP] Provide AsyncRDDActions in P...

2017-06-20 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18052
  
IMHO it is, but this feature is hardly essential. Arguably we wouldn't need 
the Scala API in the first place if the built-in `Future` supported cancelling.

It is possible I am overthinking the latter one, but I don't see much point 
in adding an API which doesn't integrate with existing language features.





[GitHub] spark issue #18052: [SPARK-20347][PYSPARK][WIP] Provide AsyncRDDActions in P...

2017-06-20 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18052
  
Personally I would prefer not to include this at all rather than use a JVM 
implementation with callbacks:

- The Py4J gateway is already pretty slow, and can be unstable under high load. 
Putting more pressure on it doesn't seem like a good approach.
- To "wrap" the JVM side we would have to re-implement a full-featured future 
API, at least partially compatible with `asyncio.Future` or 
`concurrent.futures.Future`. That is a much higher maintenance burden, especially 
when both APIs are actively developed.





[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

2017-06-20 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/16537
  
I cannot reproduce this locally, but do we really use `pypy-2.0.2`? 





[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__...

2017-06-20 Thread zero323
GitHub user zero323 reopened a pull request:

https://github.com/apache/spark/pull/16537

[SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ should validate 
input types

## What changes were proposed in this pull request?

Adds basic input validation for `UserDefinedFunction.__call__` to avoid 
failing with cryptic `Py4J` errors.
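
For illustration, a simplified sketch (not this patch's exact code or error message) of the kind of early check being added, so users get a clear `TypeError` instead of an opaque Py4J stack trace:

```python
from pyspark.sql.column import Column

def _validate_udf_args(*cols):
    # Sketch: only Columns or column-name strings should reach the JVM side.
    for c in cols:
        if not isinstance(c, (Column, str)):
            raise TypeError(
                "All arguments should be Columns or names of columns, "
                "got %r of type %s" % (c, type(c)))
```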

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARK-19165

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16537.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16537


commit d476faf7a9912e4ff93fcb9c567ffc91f21c0512
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-06-20T20:42:57Z

Validate types in UserDefinedFunction.__call__







[GitHub] spark pull request #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__...

2017-06-20 Thread zero323
Github user zero323 closed the pull request at:

https://github.com/apache/spark/pull/16537





[GitHub] spark issue #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in Spark ML...

2017-06-20 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17969
  
@felixcheung Feel free to ping me if you think this is worth revisiting.





[GitHub] spark issue #18052: [SPARK-20347][PYSPARK][WIP] Provide AsyncRDDActions in P...

2017-06-20 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18052
  
@davies It is. Monkey patching the context, `RDD` and some classes not covered 
by Scala `AsyncRDDFunctions` [takes around 100 
LOCs](https://github.com/zero323/pyspark-asyncactions) (excluding tests, 
comments, and package boilerplate). Without the implicit Spark requirements (thread 
safety) one could also use `asyncio` and skip the thread pool altogether.
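
A minimal sketch of the monkey-patching approach (heavily simplified, not the linked package's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

from pyspark import SparkContext
from pyspark.rdd import RDD

_executor = ThreadPoolExecutor(max_workers=4)

def collect_async(self):
    # Submit the blocking action to a thread pool and hand back a standard Future.
    return _executor.submit(self.collect)

RDD.collectAsync = collect_async  # attach the async variant to the existing class

sc = SparkContext.getOrCreate()
future = sc.parallelize(range(10)).collectAsync()
print(future.result())
```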





[GitHub] spark issue #16537: [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ ...

2017-06-20 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/16537
  
@holdenk I'll try to reproduce this problem but it looks a bit awkward:

> AttributeError: 'function' object has no attribute '__closure__'

It doesn't look like something related to this PR at all 😕





[GitHub] spark issue #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in Spark ML...

2017-06-03 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17969
  
Not a problem. It is just easier to reopen this in the future than to keep resolving 
ongoing conflicts. This is mostly deletions, but it covers a large part of the API, 
and even with recursive + patience git doesn't handle that well.





[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...

2017-06-03 Thread zero323
Github user zero323 closed the pull request at:

https://github.com/apache/spark/pull/17969





[GitHub] spark issue #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in Spark ML...

2017-06-03 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17969
  
@felixcheung I assume there is no interest in that. We can revisit this 
some other time I guess.





[GitHub] spark issue #18052: [SPARK-20347][PYSPARK][WIP] Provide AsyncRDDActions in P...

2017-06-03 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18052
  
__Note__: [Waiting for some 
feedback](https://twitter.com/holdenkarau/status/866672579318337537).





[GitHub] spark issue #17922: [SPARK-20601][PYTHON][ML] Python API Changes for Constra...

2017-06-03 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17922
  
Sure @yanboliang. Give me a sec.





[GitHub] spark issue #18116: [SPARK-20892][SparkR] Add SQL trunc function to SparkR

2017-05-30 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18116
  
It is manually edited. We don't manage it with `roxygen`.





[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

2017-05-26 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17938
  
Thanks @gatorsmile 





[GitHub] spark pull request #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplic...

2017-05-25 Thread zero323
Github user zero323 closed the pull request at:

https://github.com/apache/spark/pull/18051





[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...

2017-05-25 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18051
  
Exactly my point. Run the examples internally ([it is not hard to patch 
knitr](https://github.com/zero323/knitr/commit/7a0d8f9ddb9d77a9c235f25aca26131e83c1f6cc)
 or even `tools::Rd2ex`) to validate the examples and improve the online docs. #18025 
looks great - I'll try to review it when I have a spare moment.





[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...

2017-05-25 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18051
  
To be honest I was thinking mostly about the online docs here. Duplicate links in 
the bundled documentation never bothered me before (in SparkR, or any other 
package for that matter) and I don't think they have to be fixed. Maybe just 
close this PR, mark the upstream ticket as won't fix, and focus on bigger issues? 
Just saying...





[GitHub] spark issue #18085: [SPARK-20631][FOLLOW-UP] Fix incorrect tests.

2017-05-24 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18085
  
The root problem was the lack of a `test` prefix in the method name, so it was never 
executed during the test runs.
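
For reference, a standalone illustration (not Spark-specific): `unittest` only collects methods whose names start with `test`, so a method like the first one below is silently skipped:

```python
import unittest

class ParamTests(unittest.TestCase):
    def logistic_regression_check_thresholds(self):
        # No "test" prefix, so the default loader never runs this method.
        self.fail("never executed by the test runner")

    def test_check_thresholds(self):
        # Picked up and executed because of the "test" prefix.
        self.assertTrue(True)

if __name__ == "__main__":
    unittest.main()  # reports 1 test, not 2
```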





[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...

2017-05-24 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18051
  
If we consider improvement of the online documentation to be a separate 
problem, then I fully agree with @actuaryzhang.





[GitHub] spark issue #18089: [SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrow...

2017-05-24 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18089
  
Thanks @yanboliang 





[GitHub] spark issue #17891: [SPARK-20631][PYTHON][ML] LogisticRegression._checkThres...

2017-05-24 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17891
  
@jkbradley It shouldn't. It is not a correct test; see #18085





[GitHub] spark pull request #18085: [SPARK-20631][FOLLOW-UP] Fix incorrect tests.

2017-05-24 Thread zero323
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/18085

[SPARK-20631][FOLLOW-UP] Fix incorrect tests.

## What changes were proposed in this pull request?

- Fix incorrect tests for `_check_thresholds`.
- Move test to `ParamTests`.

## How was this patch tested?

Unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARK-20631-FOLLOW-UP

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18085.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18085


commit 59494f7e851523cc9038b3e06258148885a6ae34
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-24T09:52:22Z

Fix incorrect test

commit b780da2fc30f91fbe386a81c59975245c0f0f058
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-24T10:02:36Z

Move test to ParamTests







[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...

2017-05-23 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18051
  
I think there are two different problems here:

- Quality of the internal R documentation. I think that fixing this is a 
non-goal. It is not only the normal state of R packages, but also impossible to 
fix without hacks or serious trade-offs.
- Quality of the online documentation. This is subjective, but I think there 
is a lot to do there, including but not limited to:

  - Removing this IFrame nonsense. It doesn't serve any real purpose and is 
completely useless.
  - Cleaning duplicate links. Since almost everything here is S4, we 
duplicate the size of the index with each addition, making it only harder to use. 
It also affects _See also_ sections.
  - Trying to clean _See also_. Something like this (example SQL function):


![image](https://cloud.githubusercontent.com/assets/1554276/26349341/b8dbceba-3faf-11e7-8a1f-4c51dd7fa818.png)

 is just useless.

  - Adding some kind of search functionality.
  - Running all examples as a part of the internal docs build process.  
Having readable, highlighted examples, with actual output, would be awesome.






[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...

2017-05-22 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18051
  
For a moment I thought I had found another solution, but I was wrong.

I don't think there is a conflict between this and an installable package. It 
won't help with the packaged version (other packages depending on S4 suffer 
from the same issue anyway), but we can still have an improved online version.

There is one possible alternative - converting all `names` to the long 
version:

```r
#' abs
#'
#' Computes the absolute value.
#'
#' @param x Column to compute on.
#'
#' @rdname abs
#' @name abs-method
#' @family non-aggregate functions
#' @export
#' @examples \dontrun{abs(df$c)}
#' @aliases abs,Column-method
```

This would keep CRAN checks happy and remove duplicates, but at the cost of 
having docs like this:


![image](https://cloud.githubusercontent.com/assets/1554276/26330751/b4b4c812-3f4d-11e7-8c98-992d7a2318cc.png)

and making help unusable from an R session, requiring:

?SparkR::`abs-method`

instead of

?SparkR::abs

I am not sure if you agree, but IMHO this just makes things worse.






[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...

2017-05-22 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18051
  
OK, take two. Instead of modifying `00Index.html`, let's process the `Rd` files. 
This will remove the `-method` aliases before the HTML version is created.





[GitHub] spark issue #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate lin...

2017-05-22 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/18051
  
It doesn't, but `R CMD build pkg` doesn't generate the HTML index. This 
happens somewhere in `R CMD INSTALL`, so even if we create a custom build 
script (with `devtools`), it won't help us here.





[GitHub] spark pull request #18052: [SPARK-20347][PYSPARK][WIP] Provide AsyncRDDActio...

2017-05-21 Thread zero323
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/18052

[SPARK-20347][PYSPARK][WIP] Provide AsyncRDDActions in Python

## What changes were proposed in this pull request?

Adds asynchronous RDD actions (`collectAsync`, `countAsync`, 
`foreach(Partition)Async` and `takeAsync`) using `concurrent.futures` with 
`ThreadPoolExecutor`.

In Python < 3.2 it requires the backported [`futures` 
package](https://pypi.python.org/pypi/futures) to be installed on the driver.
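
A short usage sketch of the proposed API (hedged: these methods exist only with this patch applied, and the PR was ultimately closed). Because each action returns a plain `concurrent.futures.Future`, the results compose with stock library tooling:

```python
from concurrent.futures import as_completed

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(100))

# collectAsync/countAsync would return standard Futures backed by a ThreadPoolExecutor.
futures = [rdd.countAsync(), rdd.filter(lambda x: x % 2 == 0).collectAsync()]
for f in as_completed(futures):
    print(f.result())
```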

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARK-20347

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18052.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18052


commit 72bd097896aca042944d8e20282617e4864d9dd0
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-21T21:08:09Z

Initial commit







[GitHub] spark pull request #18051: [SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplic...

2017-05-21 Thread zero323
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/18051

[SPARK-18825][SPARKR][DOCS][WIP] Eliminate duplicate links in SparkR API 
doc index

## What changes were proposed in this pull request?

Duplicate links come from the `00Index.html` created during package 
installation. The file has a regular structure, where each link is a table 
row, split into two lines:

```r
atan-method
atan
```


This PR adds additional steps to `R/create-docs.sh`:

- Copy `00Index.html` to the current working directory:

  index_path = file.path(libDir, "SparkR", "html", "00Index.html");
  invisible(file.copy(index_path, "00Index.html.bck"));

- Read the file and remove the problematic lines:

  txt = readLines(index_path);
  method_lines = grep("-method", txt, fixed = TRUE);
  txt = txt[-c(method_lines, method_lines + 1)];

- Write the file back:

  writeLines(txt, index_path)

- Execute the current pipeline.

- Restore the original content:

  invisible(file.rename("00Index.html.bck", index_path))

Arguably this is not the most reliable approach, but it doesn't require any 
parser and can be embedded in the current `create-docs.sh`.

## How was this patch tested?

Manual inspection of the docs.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARK-18825

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18051.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18051

----
commit b28ec94b0c0589736b4e3377d160642b0b6181a6
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-21T18:19:22Z

Initial implementation







[GitHub] spark pull request #18049: [SPARK-20830][PYSPARK][SQL] Add posexplode and po...

2017-05-21 Thread zero323
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/18049

[SPARK-20830][PYSPARK][SQL] Add posexplode and posexplode_outer

## What changes were proposed in this pull request?

Add Python wrappers for `o.a.s.sql.functions.explode_outer` and 
`o.a.s.sql.functions.posexplode_outer`.
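
A small usage sketch (assuming the wrappers added here are available), showing how the `*_outer` variants keep rows whose collections are null or empty:

```python
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import explode_outer, posexplode_outer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(id=1, xs=[1, 2]), Row(id=2, xs=None)])

df.select("id", explode_outer("xs")).show()      # id=2 is kept, col is null
df.select("id", posexplode_outer("xs")).show()   # adds a position column, also null for id=2
```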

## How was this patch tested?

Unit tests, doctests.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARK-20830

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18049.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18049


commit 2fc576d74d0c6d0c7f7e4916407876f39727ce85
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-21T17:08:25Z

Add posexplode and posexplode_outer







[GitHub] spark issue #17988: [SPARKR][DOCS][MINOR] Use consistent names in rollup and...

2017-05-17 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17988
  
I took another look and I think it is OK as it is. If we were to [actually 
run the 
examples](https://issues.apache.org/jira/browse/SPARK-18825?focusedCommentId=16011504=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16011504)
 we would need a bigger clean-up, but that is a different topic.





[GitHub] spark issue #17988: [SPARKR][DOCS][MINOR] Use consistent names in rollup and...

2017-05-16 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17988
  
Let me take another look :)





[GitHub] spark pull request #17988: [DOCS][MINOR] Use consistent names in rollup and ...

2017-05-15 Thread zero323
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/17988

[DOCS][MINOR] Use consistent names in rollup and cube examples

## What changes were proposed in this pull request?

Rename `carsDF` to `df` in SparkR `rollup` and `cube` examples.

## How was this patch tested?

Manual tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark cube-docs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17988.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17988


commit c8ed4e08ca4ed6ff88ae98f234d7fed8bbd0faf7
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-15T23:35:58Z

Rename carsDF to df







[GitHub] spark issue #17672: [SPARK-20371][R] Add wrappers for collect_list and colle...

2017-05-15 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17672
  
@felixcheung Do you know by any chance what the policy is on adding new 
datasets to Spark? License restrictions, file size and such?





[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...

2017-05-14 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116393810
  
--- Diff: R/pkg/R/mllib_wrapper.R ---
@@ -0,0 +1,61 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+#' S4 class that represents a Java ML model
+#'
+#' @param jobj a Java object reference to the backing Scala model
+#' @export
+#' @note JavaModel since 2.3.0
+setClass("JavaModel", representation(jobj = "jobj"))
+
+#' Makes predictions from a Java ML model
+#'
+#' @param object a Spark ML model.
+#' @param newData a SparkDataFrame for testing.
+#' @return \code{predict} returns a SparkDataFrame containing predicted value.
+#' @rdname spark.predict
+#' @aliases predict,JavaModel-method
--- End diff --

I believe there is no conflict here. If you find this useful you can use 
templates to include additional information about generic operations. A very 
simple example: 
https://github.com/zero323/spark/commit/64a3e854792181e159d39b9e747170b707f2711d

which would create a section like this:


![image](https://cloud.githubusercontent.com/assets/1554276/26038702/72b70280-390e-11e7-922c-0d1dece4816e.png)

This can be further parametrized if needed.





[GitHub] spark issue #17976: [DOCS][SPARKR] Use verbose names for family annotations ...

2017-05-14 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17976
  
Thanks Felix!





[GitHub] spark issue #17976: [DOCS][SPARKR] Use verbose names for family annotations ...

2017-05-14 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17976
  
Note: if multiple functions use the same `@rdname`, only one `@family` 
annotation is kept, to avoid a duplicated _See also_ section.





[GitHub] spark pull request #17976: [DOCS][SPARKR] Use verbose names for family annot...

2017-05-14 Thread zero323
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/17976

[DOCS][SPARKR] Use verbose names for family annotations in functions.R

## What changes were proposed in this pull request?

- Change current short annotations (same as Scala `@group`) to verbose 
names (same as Scala `@groupname`). 

Before:


![image](https://cloud.githubusercontent.com/assets/1554276/26033909/9a98b596-38b4-11e7-961e-15fd9ea7440d.png)

After:

![image](https://cloud.githubusercontent.com/assets/1554276/26033903/727a9944-38b4-11e7-8873-b09c553f4ec3.png)


- Add missing `@family` annotations.

## How was this patch tested?

`check-cran.R` (skipping tests), manual inspection.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARKR-FUNCTIONS-DOCSTRINGS

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17976.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17976


commit 70723f5ae0662bde6b5454da07394cae240d46a5
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-13T20:04:31Z

Use verbose family names

commit a006f320a18fe46abf608cb400cc542762a4d2ac
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-14T12:19:26Z

Use lowercase family names







[GitHub] spark issue #17848: [SPARK-20586] [SQL] Add deterministic and distinctLike t...

2017-05-14 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17848
  
My concern is that people trying non-deterministic UDFs get tripped up by 
repeated computations at least as often as by internal optimizations, and the 
`nonDeterministic` flag might send the wrong message.

In particular, let's say we have this fan-out - fan-in workflow depending on 
a non-deterministic source:


![image](https://cloud.githubusercontent.com/assets/1554276/26033144/64395fa0-38a5-11e7-9d0f-b2d6dbe51850.png)


where dotted edges represent an arbitrary chain of transformations. Can we 
ensure that the state of each `foo` descendant in `sink` will be consistent (that `x` 
hasn't been recomputed)? I hope my point here is clear.
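
A small, self-contained sketch of that concern (my own illustration, not from the PR): the two branches of the fan-out may observe different values of `x`, because nothing stops Spark from evaluating the non-deterministic UDF once per branch:

```python
import random

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

foo = F.udf(lambda: random.random(), DoubleType())  # non-deterministic source

source = spark.range(5).withColumn("x", foo())
left = source.select("id", F.col("x").alias("x_left"))
right = source.select("id", F.col("x").alias("x_right"))

# Fan-in: unless `source` is cached or checkpointed, x_left and x_right can disagree,
# since each branch may trigger an independent recomputation of foo().
left.join(right, "id").show()
```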





[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast

2017-05-13 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17965#discussion_r116366145
  
--- Diff: R/pkg/R/generics.R ---
@@ -799,6 +799,10 @@ setGeneric("write.df", function(df, path = NULL, ...) { standardGeneric("write.d
 #' @export
 setGeneric("randomSplit", function(x, weights, seed) { standardGeneric("randomSplit") })
 
+#' @rdname broadcast
+#' @export
+setGeneric("broadcast", function(x) { standardGeneric("broadcast") })
--- End diff --

> this list is sorted alphabetically within this section

Looks like it used to be at some point, but those days are long gone. I can 
reorder it right now, but that means rearranging the whole section.





[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast

2017-05-13 Thread zero323
GitHub user zero323 reopened a pull request:

https://github.com/apache/spark/pull/17965

 [SPARK-20726][SPARKR] wrapper for SQL broadcast

## What changes were proposed in this pull request?

- Adds R wrapper for `o.a.s.sql.functions.broadcast`.
- Renames `broadcast` to `broadcast_`.

## How was this patch tested?

Unit tests, check `check-cran.sh`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARK-20726

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17965.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17965


commit f190d62460829dcfb84ff1a8e6dd6fe9cbd25719
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T15:54:46Z

Initial implementation

commit 397ab1f7b4b4e2b9e51b697c92e3be197fed4554
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T17:38:31Z

Fix style

commit 246b91f8af84115af8f6283fb783000c9cc613ec
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-13T10:08:08Z

Style

commit 1530785f7469830446cd95717d524eb42d88e4ab
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-13T10:38:50Z

Rename broadcast_ to broadcastRDD







[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast

2017-05-13 Thread zero323
Github user zero323 closed the pull request at:

https://github.com/apache/spark/pull/17965





[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

2017-05-13 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17938
  
@cloud-fan Thanks for the clarification. Just a thought - shouldn't we 
either support it consistently or not support it at all? The current behaviour is 
quite confusing and I don't think that documentation alone will cut it.





[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast

2017-05-13 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17965#discussion_r116355839
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3769,3 +3769,33 @@ setMethod("alias",
 sdf <- callJMethod(object@sdf, "alias", data)
 dataFrame(sdf)
   })
+
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast

2017-05-13 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17965#discussion_r116355836
  
--- Diff: R/pkg/R/generics.R ---
@@ -799,6 +799,10 @@ setGeneric("write.df", function(df, path = NULL, ...) 
{ standardGeneric("write.d
 #' @export
 setGeneric("randomSplit", function(x, weights, seed) { 
standardGeneric("randomSplit") })
 
+#' @rdname broadcast
+#' @export
+setGeneric("broadcast", function(x) { standardGeneric("broadcast") })
--- End diff --

It doesn't seem to affect the docs so I don't think we have to touch this 
for now:


![image](https://cloud.githubusercontent.com/assets/1554276/26024791/88a39940-37d9-11e7-9f11-ac1510b59215.png)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...

2017-05-13 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116355659
  
--- Diff: R/pkg/R/mllib_wrapper.R ---
@@ -0,0 +1,61 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+#' S4 class that represents a Java ML model
+#'
+#' @param jobj a Java object reference to the backing Scala model
+#' @export
+#' @note JavaModel since 2.3.0
+setClass("JavaModel", representation(jobj = "jobj"))
+
+#' Makes predictions from a Java ML model
+#'
+#' @param object a Spark ML model.
+#' @param newData a SparkDataFrame for testing.
+#' @return \code{predict} returns a SparkDataFrame containing predicted 
value.
+#' @rdname spark.predict
+#' @aliases predict,JavaModel-method
--- End diff --

I am biased here, but I'll argue that it doesn't. Both `predict` and 
`write.ml` (same as `read.ml`) are extremely generic, and in general we don't 
provide any useful information about these. The usage is already covered by the 
class `examples`. Finally, we can use `@seealso` to provide a bit more of an 
R-ish experience if you think that is not enough - something along the lines of 
the `lm` docs:


![image](https://cloud.githubusercontent.com/assets/1554276/26024731/2214f012-37d8-11e7-9afb-b750e9c647ff.png)


Moreover, using this approach significantly reduces the amount of clutter in 
the generated docs. They are shorter, the argument list is focused on the 
important parts, and the same goes for `value`. See for example the GLM docs 
below. So IMHO this is actually a significant improvement.

Personally I would do the same with all the `print`s and `summary`s as well, 
although it wouldn't reduce the codebase (for now 😈). This would further 
shorten the docs and remove awkward descriptions like this:


![image](https://cloud.githubusercontent.com/assets/1554276/26024707/567b2020-37d7-11e7-8c21-260404d7767d.png)
 
And from the developer side it is a clear win: no mindless copy / paste / 
replace cycle, and more time to provide useful examples.

 __Before__:


![image](https://cloud.githubusercontent.com/assets/1554276/26024648/1c36253c-37d6-11e7-9411-72c0c14c54a8.png)

__After__:


![image](https://cloud.githubusercontent.com/assets/1554276/26024653/2643bd64-37d6-11e7-8463-08662611cd37.png)

 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast

2017-05-13 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17965#discussion_r116355102
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3769,3 +3769,33 @@ setMethod("alias",
 sdf <- callJMethod(object@sdf, "alias", data)
 dataFrame(sdf)
   })
+
+
+#' broadcast
+#' 
+#' Return a new SparkDataFrame marked as small enough for use in broadcast 
joins. 
+#' 
+#' Equivalent to hint(x, "broadcast).
--- End diff --

I'll double-check this, but for some reason `\code` here made `roxygen` 
unhappy when I tried it last time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

2017-05-13 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17938
  
@gatorsmile Huh... in that case it looks like the parser (?) needs a little 
bit of work, unless of course the following are features.

- Omitting `USING` doesn't work 

  ```sql
  CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, 
lastname STRING)
  CLUSTERED BY(user_id) INTO 256 BUCKETS
  ```
  with:

  ```
  Error in query: 
  Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 1, pos 0)
  
  == SQL ==
  CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, 
lastname STRING)
  ^^^
  CLUSTERED BY(user_id) INTO 256 BUCKETS
  ```

- Omitting `USING` and adding `PARTITION BY` with a column not present in the 
main clause (valid Hive DDL) doesn't work: 

  ```sql
  CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, 
lastname STRING)
  PARTITIONED BY (department STRING)
  CLUSTERED BY(user_id) INTO 256 BUCKETS
  ```
  with

  ```
  Error in query: 
  Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 1, pos 2)
  
  == SQL ==
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, 
lastname STRING)
  --^^^
PARTITIONED BY (department STRING)
CLUSTERED BY(user_id) INTO 256 BUCKETS
  ```

- `PARTITION BY` alone works:

  ```sql
  CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, 
lastname STRING)
  PARTITIONED BY (department STRING)
  ```

-   `PARTITION BY` with `USING` when partition column is in the main spec 
works:

 ```sql
CREATE TABLE user_info_bucketed(
  user_id BIGINT, firstname STRING, lastname STRING, department STRING)
USING parquet
PARTITIONED BY (department)
```

-  `CLUSTERED BY` +  `PARTITION BY` with `USING` when partition column is 
in the main spec works:

```sql
CREATE TABLE user_info_bucketed(
   user_id BIGINT, firstname STRING, lastname STRING, department STRING)
USING parquet
PARTITIONED BY (department)
CLUSTERED BY(user_id) INTO 256 BUCKETS 
```
- `PARTITION BY` when the partition column is in the main spec, `USING` omitted:

```sql
CREATE TABLE user_info_bucketed(
 user_id BIGINT, firstname STRING, lastname STRING, department STRING)
PARTITIONED BY (department)
```
 
with:

```
Error in query: 
mismatched input ')' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 
'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 
'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 'ROW', 'WITH', 
'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 
'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'COST', 'CAST', 'SHOW', 
'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 
'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 
'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 
'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IGNORE', 'IF', 'DIV', 'PERCENT', 
'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 
'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 
'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 
'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 
'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 
'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 
'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 
'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 
'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', 
DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 
'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 
'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 
'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 
'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, 
BACKQUOTED_IDENTIFIER}(line 3, pos 30)

== SQL ==
CREATE TABLE user_info_bucketed(
  user_id BIGINT, firstname STRING, lastname STRING, department 
STRING)
PARTITIONED BY (department)
```

[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...

2017-05-12 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116346161
  
--- Diff: R/pkg/R/generics.R ---
@@ -1535,9 +1535,7 @@ setGeneric("spark.freqItemsets", function(object) { 
standardGeneric("spark.freqI
 #' @export
 setGeneric("spark.associationRules", function(object) { 
standardGeneric("spark.associationRules") })
 
-#' @param object a fitted ML model object.
--- End diff --

I think it makes more sense to keep the param annotations with the concrete 
implementations; keeping both would violate style by duplicating Rd entries.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...

2017-05-12 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116346059
  
--- Diff: R/pkg/R/mllib_wrapper.R ---
@@ -0,0 +1,61 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+#' S4 class that represents a Java ML model
+#'
+#' @param jobj a Java object reference to the backing Scala model
--- End diff --

We use "backing" all over the docs. I am not sure if backend is really 
better or not, but changing this only here doesn't make sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...

2017-05-12 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17969#discussion_r116345958
  
--- Diff: R/pkg/DESCRIPTION ---
@@ -42,6 +42,7 @@ Collate:
 'functions.R'
 'install.R'
 'jvm.R'
+'mllib_wrapper.R'
--- End diff --

No. Even if it wasn't automatically generated by `roxygen`, we have to 
enforce loading the base classes first.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast

2017-05-12 Thread zero323
GitHub user zero323 reopened a pull request:

https://github.com/apache/spark/pull/17965

 [SPARK-20726][SPARKR] wrapper for SQL broadcast

## What changes were proposed in this pull request?

- Adds R wrapper for `o.a.s.sql.functions.broadcast`.
- Renames `broadcast` to `broadcast_`.

## How was this patch tested?

Unit tests, `check-cran.sh`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARK-20726

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17965.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17965


commit f190d62460829dcfb84ff1a8e6dd6fe9cbd25719
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T15:54:46Z

Initial implementation

commit 397ab1f7b4b4e2b9e51b697c92e3be197fed4554
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T17:38:31Z

Fix style




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast

2017-05-12 Thread zero323
Github user zero323 closed the pull request at:

https://github.com/apache/spark/pull/17965


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17969: [SPARK-20729][SPARKR][ML] Reduce boilerplate in S...

2017-05-12 Thread zero323
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/17969

[SPARK-20729][SPARKR][ML]  Reduce boilerplate in Spark ML models

## What changes were proposed in this pull request?

- Add `JavaModel` and `JavaMLWritable` S4 classes and mix them with 
existing ML wrappers.
- Remove individual implementations of `predict` and `write.ml`.

## How was this patch tested?

Unit tests, `check_cran.sh`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARK-20729

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17969.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17969


commit 8f76158762d74dcf7fa58a9e3f78683a5712e7ad
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T21:49:01Z

Add JavaModel class

commit a77a714f284fe33e425065eed13ae748ef4bf16b
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T22:13:43Z

Remove predict impls from mllib_regression.R

commit 31d60bc422be9b59f37c6ee2b4a2852625d56620
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T22:20:01Z

Remove predict impls from mllib_classification.R

commit 6e7bfdc672140ccee23649273c2d622f7ae78e7d
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T22:22:06Z

Remove predict impls from mllib_clustering.R

commit 95207fdfd6eebbe0374ed6c241b57adb24666d42
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T22:23:32Z

Remove predict impls from mllib_fpm.R

commit 93eefc4e6bc346e50a70a87114f7c51cfe0865b6
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T22:24:29Z

Remove predict impls from mllib_recommendation.R

commit a060dc76473b6cd9dfcf72ba73bd9eb34031b078
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T22:27:15Z

Remove predict impls from mllib_tree.R

commit 7be99929cc3391b075150b65e7daae21c1e97c63
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T22:51:23Z

Add JavaMLWritable

commit 322be5d511b01cf6dc4516a7799e945391db5c47
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T22:55:42Z

Remove write.ml impls from mllib_tree.R

commit 7e16a53a671380fd79c2b4e50ac0c78c4aa8b390
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T22:56:38Z

Remove write.ml impls from mllib_recommendation.R

commit dfbf2f94675114269a37991a83ece2c9644b546c
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T22:57:59Z

Remove write.ml impls from mllib_regression.R

commit 58ef13061d58caaba91b23221763418d78c918f6
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T22:59:50Z

Remove write.ml impls from mllib_classification.R

commit 50056a79cc25ae951ac788769680fa016f471406
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T23:01:01Z

Remove write.ml impls from mllib_clustering.R

commit 0f67137d7f1976d4e497964542bbe1f97d30401e
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T23:02:09Z

Remove write.ml impls from mllib_fpm.R

commit b29d0e21bca5cc12bb604dae4a60be93879bbf9c
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T23:02:49Z

    Add seealso to write.ml

commit 1759cf7613385e68d43da4646dbcb1e0ef1b4a87
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T23:04:49Z

Change rdname to write.ml

commit 72f8bcaabeb9150d5ce209a7f8fab36eefd7e4c3
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T23:06:16Z

Correct since annotation

commit 95ec108ae7664c23d268facec0af1c37c6899ff3
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T23:11:40Z

Remove param annotations from generics

commit d7d9d4960132ccc985423b607357d7e56b6f5375
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T23:16:38Z

Annotate object in mllib_tree.R

commit 42c372d62b4c33b778f2ccdde030faea300e5159
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T23:34:42Z

Add ... annotation




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast

2017-05-12 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17965
  
Points to discuss:

- Do we really need this? It gives us full API parity but is not strictly 
necessary; `hint(df, "broadcast")` should be equivalent (see the sketch after 
this list).
- Is this the best implementation? Some alternatives:

  - Use generics for both, with `signature(x = "SparkDataFrame", "missing")` 
for the `DataFrame` version and `signature(x = "jobj", object = "Any")` for the 
general version. This would keep the internal API intact, but it is hard to 
document without leaking internal details.

  - Use a different name for the `DataFrame` version, for example 
`broadcast_table`. This is a bit verbose, and slightly harder to port for 
users.

- Is `dataframe.R` the best location? It is generic on `SparkDataFrame`, so 
`functions.R` doesn't feel like the right choice.
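
For reference, a minimal sketch of that equivalence, shown in PySpark only 
because that API already exposes both forms (the data and column names are 
made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
large = spark.range(1 << 20).withColumnRenamed("id", "key")
small = spark.range(16).withColumnRenamed("id", "key")

# both forms should produce the same broadcast-hinted join plan
large.join(broadcast(small), "key").explain()
large.join(small.hint("broadcast"), "key").explain()
```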


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17965: [SPARK-20726][SPARKR] wrapper for SQL broadcast

2017-05-12 Thread zero323
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/17965

 [SPARK-20726][SPARKR] wrapper for SQL broadcast

## What changes were proposed in this pull request?

Adds R wrapper for `o.a.s.sql.functions.broadcast`.

## How was this patch tested?

Unit tests, `check-cran.sh`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARK-20726

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17965.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17965


commit f190d62460829dcfb84ff1a8e6dd6fe9cbd25719
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-12T15:54:46Z

Initial implementation




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17932: [SPARK-20689][PYSPARK] python doctest leaking bucketed t...

2017-05-12 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17932
  
I see I am the one to blame here. Sorry for that @felixcheung 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

2017-05-12 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17938
  
@gatorsmile Sure, but I assume you mean only `PARTITION BY`, right? I don't 
think that `CLUSTER BY` or `SORT BY` are supported in SQL (should they be 
supported after #17644 is resolved?).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

2017-05-11 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17938#discussion_r116072807
  
--- Diff: docs/sql-programming-guide.md ---
@@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables 
have per-partition metadat
 
 Note that partition information is not gathered by default when creating 
external datasource tables (those with a `path` option). To sync the partition 
information in the metastore, you can invoke `MSCK REPAIR TABLE`.
 
+### Bucketing, Sorting and Partitioning
--- End diff --

Oh, I thought you were implying there are some known issues. This actually 
behaves sensibly - all supported options seem to work independently of the 
order, and unsupported ones (`partitionBy` + `sortBy` without `bucketBy`, or 
overlapping `bucketBy` and `partitionBy` columns) give enough feedback to 
diagnose the issue.

I haven't tested this with large datasets though, so there may be hidden 
issues.
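
Roughly what I checked (an illustrative sketch only; the data, table names and 
session setup are made up, not part of the patch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "Alice", "eng"), (2, "Bob", "sales")],
    ["user_id", "name", "department"])

# a supported combination - the order of the writer calls does not seem to matter
(df.write.partitionBy("department").bucketBy(4, "user_id").sortBy("user_id")
   .saveAsTable("users_a"))
(df.write.sortBy("user_id").bucketBy(4, "user_id").partitionBy("department")
   .saveAsTable("users_b"))

# an unsupported combination - sortBy without bucketBy is rejected up front
try:
    df.write.partitionBy("department").sortBy("user_id").saveAsTable("users_c")
except Exception as e:
    print(e)  # roughly: sortBy must be used together with bucketBy
```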


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

2017-05-11 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17938#discussion_r116030963
  
--- Diff: docs/sql-programming-guide.md ---
@@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables 
have per-partition metadat
 
 Note that partition information is not gathered by default when creating 
external datasource tables (those with a `path` option). To sync the partition 
information in the metastore, you can invoke `MSCK REPAIR TABLE`.
 
+### Bucketing, Sorting and Partitioning
--- End diff --

@cloud-fan  I think we can redirect to partition discovery here. But 
explaining the difference and possible applications (low vs. high cardinality) 
could be a good idea.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter...

2017-05-11 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17938#discussion_r116029940
  
--- Diff: docs/sql-programming-guide.md ---
@@ -581,6 +581,46 @@ Starting from Spark 2.1, persistent datasource tables 
have per-partition metadat
 
 Note that partition information is not gathered by default when creating 
external datasource tables (those with a `path` option). To sync the partition 
information in the metastore, you can invoke `MSCK REPAIR TABLE`.
 
+### Bucketing, Sorting and Partitioning
--- End diff --

@tejasapatil 

> There could be multiple possible orderings of `partitionBy,` `bucketBy` 
and `sortBy` calls. Not all of them are supported, not all of them would 
produce correct outputs.

Shouldn't the output be the same no matter the order? `sortBy` is not 
applicable to `partitionBy` and takes precedence over `bucketBy` if both are 
present. This is Hive's behaviour if I am not mistaken, and at first glance 
Spark is doing the same thing. Is there any gotcha here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17938: [SPARK-20694][DOCS][SQL] Document DataFrameWriter partit...

2017-05-10 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17938
  
@HyukjinKwon Sounds good. 
[SPARK-20694](https://issues.apache.org/jira/browse/SPARK-20694). 

Should we document the difference between buckets (metastore based) and 
partitions (file system based)? The latter could be done by referencing 
[Partition 
Discovery](https://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery).
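
Something along these lines, perhaps (an illustrative sketch only; paths and 
table names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "eng"), (2, "sales")], ["user_id", "department"])

# partitioning is file-system based: it shapes the directory layout and works with save()
df.write.partitionBy("department").parquet("/tmp/users_by_department")
# -> /tmp/users_by_department/department=eng/..., .../department=sales/...

# bucketing is metastore based: the bucket spec lives in the catalog, so saveAsTable() is required
df.write.bucketBy(4, "user_id").sortBy("user_id").saveAsTable("users_bucketed")
```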


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-05-10 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17077
  
@gatorsmile #17938


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17938: [DOCS][SQL] Document bucketing and partitioning i...

2017-05-10 Thread zero323
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/17938

[DOCS][SQL] Document bucketing and partitioning in SQL guide

## What changes were proposed in this pull request?

- Add Scala, Python and Java examples for `partitionBy`, `sortBy` and 
`bucketBy`.
- Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide

## How was this patch tested?

Manual tests, docs build.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark DOCS-BUCKETING-AND-PARTITIONING

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17938.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17938


commit 560fd7978c2a18c8c216604eeea4563bcc4f7c5c
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-10T09:56:28Z

Add Scala examples

commit c0b037b302b10c20b2dadcc32048f3ee370d1864
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-10T09:56:50Z

Add Python examples

commit b2f45efcb883508e906232582e4a9e89b7f706d0
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-10T10:22:27Z

Add Java examples

commit 0af67cea0f1a1644139115274f14dab76732b5b5
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-10T10:32:47Z

Add examples to sql guide




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17891: [SPARK-20631][PYTHON][ML] LogisticRegression._checkThres...

2017-05-10 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17891
  
Thanks @yanboliang!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...

2017-05-09 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17922#discussion_r11267
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -374,6 +415,48 @@ def getFamily(self):
 """
 return self.getOrDefault(self.family)
 
+@since("2.2.0")
--- End diff --

Probably. I've seen that the Scala version has been targeted for 2.2.1, so who 
knows? But let's make it 2.3.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias

2017-05-09 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17825
  
Thanks @felixcheung


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for C...

2017-05-09 Thread zero323
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/17922

[SPARK-20601][PYTHON][ML] Python API Changes for Constrained Logistic 
Regression Params

## What changes were proposed in this pull request?

- Add new `Params` to `pyspark.ml.classification.LogisticRegression`.
- Add `toMatrix` method to `pyspark.ml.param.TypeConverters`.
- Add `generate_multinomial_logistic_input` helper to `pyspark.ml.tests`.

## How was this patch tested?

Unit tests
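
A minimal usage sketch (illustrative only - the parameter names follow the 
Scala API, and the data is made up):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Matrices, Vectors

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [(1.0, Vectors.dense(1.0, 0.5)), (0.0, Vectors.dense(0.2, 0.9))],
    ["label", "features"])

# box constraints for a binomial model with two features:
# coefficients are kept non-negative, the intercept is capped at 1.0
blr = LogisticRegression(
    maxIter=10,
    lowerBoundsOnCoefficients=Matrices.dense(1, 2, [0.0, 0.0]),
    upperBoundsOnIntercepts=Vectors.dense(1.0))
model = blr.fit(train)
```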

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARK-20601

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17922.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17922






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17891: [SPARK-20631][PYTHON][ML] LogisticRegression._checkThres...

2017-05-08 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17891
  
cc @jkbradley


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17891: [SPARK-20631][PYTHON][ML] LogisticRegression._che...

2017-05-07 Thread zero323
GitHub user zero323 reopened a pull request:

https://github.com/apache/spark/pull/17891

[SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency 
should use values not Params

## What changes were proposed in this pull request?

- Replace `getParam` calls with `getOrDefault` calls.
- Fix exception message to avoid unintended `TypeError`.
- Add unit tests

## How was this patch tested?

New unit tests.
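
For reference, a minimal sketch of the `getParam` / `getOrDefault` distinction 
this fixes (illustrative only; it needs an active SparkSession because 
`pyspark.ml` params are JVM-backed):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
lr = LogisticRegression(threshold=0.4, thresholds=[0.6, 0.4])

lr.getParam("threshold")        # the Param descriptor itself, not a number
lr.getOrDefault(lr.threshold)   # 0.4 - the value the consistency check needs
lr.getOrDefault(lr.thresholds)  # [0.6, 0.4]
```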


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARK-20631

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17891.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17891


commit 098e26202bfed089efad057b3eead593ffda08b3
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-07T19:36:40Z

Use getOrDefault to access values




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17891: [SPARK-20631][PYTHON][ML] LogisticRegression._che...

2017-05-07 Thread zero323
Github user zero323 closed the pull request at:

https://github.com/apache/spark/pull/17891


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17891: [SPARK-11834][PYTHON][ML] LogisticRegression._che...

2017-05-07 Thread zero323
GitHub user zero323 opened a pull request:

https://github.com/apache/spark/pull/17891

[SPARK-11834][PYTHON][ML] LogisticRegression._checkThresholdConsistency 
should use values not Params

## What changes were proposed in this pull request?

- Replace `getParam` calls with `getOrDefault` calls.
- Fix exception message to avoid unintended `TypeError`.
- Add unit tests

## How was this patch tested?

New unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARK-20631

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17891.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17891


commit 098e26202bfed089efad057b3eead593ffda08b3
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-07T19:36:40Z

Use getOrDefault to access values




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-05-07 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r115151299
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -563,6 +563,63 @@ def partitionBy(self, *cols):
 self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, 
cols))
 return self
 
+@since(2.3)
+def bucketBy(self, numBuckets, col, *cols):
+"""Buckets the output by the given columns.If specified,
+the output is laid out on the file system similar to Hive's 
bucketing scheme.
+
+:param numBuckets: the number of buckets to save
+:param col: a name of a column, or a list of names.
+:param cols: additional names (optional). If `col` is a list it 
should be empty.
+
+.. note:: Applicable for file-based data sources in combination 
with
+  :py:meth:`DataFrameWriter.saveAsTable`.
--- End diff --

@gatorsmile Can we?

```
➜  spark git:(master) git rev-parse HEAD   
2cf83c47838115f71419ba5b9296c69ec1d746cd
➜  spark git:(master) bin/spark-shell 
Spark context Web UI available at http://192.168.1.101:4041
Spark context available as 'sc' (master = local[*], app id = local-1494184109262).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/
 
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.

scala> Seq(("a", 1, 3)).toDF("x", "y", "z").write.bucketBy(3, "x", 
"y").format("parquet").save("/tmp/foo")
org.apache.spark.sql.AnalysisException: 'save' does not support bucketing 
right now;
  at 
org.apache.spark.sql.DataFrameWriter.assertNotBucketed(DataFrameWriter.scala:305)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:231)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  ... 48 elided
```
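
For contrast, the path that the note describes as supported goes through the 
metastore - a quick sketch, reusing the `spark` session from the shell output 
above (the table name is made up):

```python
# saveAsTable records the bucket spec in the catalog, which plain save() cannot do
(spark.createDataFrame([("a", 1, 3)], ["x", "y", "z"])
      .write.bucketBy(3, "x", "y").format("parquet")
      .saveAsTable("bucketed_foo"))
```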



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...

2017-05-07 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17077
  
@gatorsmile 

>  Could you also update the SQL document?

Sure, but I'll need some guidance here. Somewhere in the [Generic Load/Save 
Functions](https://spark.apache.org/docs/latest/sql-programming-guide.html#generic-loadsave-functions)
 section, right? But I guess we'll need a separate section for that. And we 
should probably document `partitionBy` as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-05-07 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r115138060
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -563,6 +563,60 @@ def partitionBy(self, *cols):
 self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, 
cols))
 return self
 
+@since(2.3)
+def bucketBy(self, numBuckets, *cols):
+"""Buckets the output by the given columns on the file system.
+
+:param numBuckets: the number of buckets to save
+:param cols: name of columns
+
+.. note:: Applicable for file-based data sources in combination 
with
+  :py:meth:`DataFrameWriter.saveAsTable`.
+
+>>> (df.write.format('parquet')
+... .bucketBy(100, 'year', 'month')
+... .mode("overwrite")
+... .saveAsTable('bucketed_table'))
+"""
+if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
+cols = cols[0]
+
+if not isinstance(numBuckets, int):
+raise TypeError("numBuckets should be an int, got 
{0}.".format(type(numBuckets)))
+
+if not all(isinstance(c, basestring) for c in cols):
+raise TypeError("cols argument should be a string or a 
sequence of strings.")
--- End diff --

Or we could just replace the error message with:

```
"cols argument should be a string, List[str] or Tuple[str, ...]"
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-05-07 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r115138021
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -563,6 +563,60 @@ def partitionBy(self, *cols):
 self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, 
cols))
 return self
 
+@since(2.3)
+def bucketBy(self, numBuckets, *cols):
+"""Buckets the output by the given columns on the file system.
+
+:param numBuckets: the number of buckets to save
+:param cols: name of columns
+
+.. note:: Applicable for file-based data sources in combination 
with
+  :py:meth:`DataFrameWriter.saveAsTable`.
+
+>>> (df.write.format('parquet')
+... .bucketBy(100, 'year', 'month')
+... .mode("overwrite")
+... .saveAsTable('bucketed_table'))
+"""
+if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
--- End diff --

Why do you say that? `cols` is variadic, so it should always be `Sized`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17831: [SPARK-18777][PYTHON][SQL] Return UDF from udf.register

2017-05-07 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17831
  
Thanks everyone.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...

2017-05-06 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17077#discussion_r115133626
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -563,6 +563,60 @@ def partitionBy(self, *cols):
 self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, 
cols))
 return self
 
+@since(2.3)
+def bucketBy(self, numBuckets, *cols):
+"""Buckets the output by the given columns on the file system.
+
+:param numBuckets: the number of buckets to save
+:param cols: name of columns
+
+.. note:: Applicable for file-based data sources in combination 
with
+  :py:meth:`DataFrameWriter.saveAsTable`.
+
+>>> (df.write.format('parquet')
+... .bucketBy(100, 'year', 'month')
+... .mode("overwrite")
+... .saveAsTable('bucketed_table'))
+"""
+if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
+cols = cols[0]
+
+if not isinstance(numBuckets, int):
+raise TypeError("numBuckets should be an int, got 
{0}.".format(type(numBuckets)))
+
+if not all(isinstance(c, basestring) for c in cols):
+raise TypeError("cols argument should be a string or a 
sequence of strings.")
--- End diff --

Good point. We can support arbitrary `Iterable[str]` though. 

```python
if len(cols) == 1 and isinstance(cols[0], collections.abc.Iterable):
    cols = list(cols[0])
```

The caveat is that we don't allow this anywhere else.
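
One thing to keep in mind with that check (a standalone sketch of my own, not 
the actual readwriter code - `normalize_cols` is a hypothetical helper): a bare 
string is itself `Iterable`, so it has to be excluded explicitly or a single 
column name would be split into characters.

```python
import collections.abc


def normalize_cols(*cols):
    # exclude plain strings before unpacking, since str is also Iterable
    if (len(cols) == 1
            and not isinstance(cols[0], str)
            and isinstance(cols[0], collections.abc.Iterable)):
        cols = tuple(cols[0])
    return cols


assert normalize_cols("year", "month") == ("year", "month")
assert normalize_cols(["year", "month"]) == ("year", "month")
assert normalize_cols("year") == ("year",)  # not ('y', 'e', 'a', 'r')
```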


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias

2017-05-05 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17825#discussion_r115113723
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3745,3 +3745,26 @@ setMethod("hint",
 jdf <- callJMethod(x@sdf, "hint", name, parameters)
 dataFrame(jdf)
   })
+
+#' alias
+#'
+#' @aliases alias,SparkDataFrame-method
+#' @family SparkDataFrame functions
+#' @rdname alias
+#' @name alias
+#' @examples
--- End diff --

Done, but do we actually need this? We don't use roxygen to maintain 
`NAMESPACE`, and (I believe I mentioned this before) we `@export` objects which 
are not really exported. Just saying...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias

2017-05-05 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17825#discussion_r115085302
  
--- Diff: R/pkg/R/generics.R ---
@@ -387,6 +387,16 @@ setGeneric("value", function(bcast) { 
standardGeneric("value") })
 #' @export
 setGeneric("agg", function (x, ...) { standardGeneric("agg") })
 
+#' alias
+#'
+#' Set a new name for a Column or a SparkDataFrame. Equivalent to SQL "AS" 
keyword.
+#'
+#' @name alias
+#' @rdname alias
+#' @param object x a Column or a SparkDataFrame
+#' @param data new name to use
--- End diff --

On the bright side it looks like matching `@rdname` and `@aliases` like:

```r
#' alias
#'
#' @aliases alias,SparkDataFrame-method
#' @family SparkDataFrame functions
#' @rdname alias,SparkDataFrame-method
#' @name alias
...
```
and

```r
#' alias
#'
#' @aliases alias,Column-method
#' @family Column functions
#' @rdname alias,Column-method
#' @name alias
...
```
(I hope this is what you mean) indeed solves SPARK-18825. But it doesn't 
generate any docs for these two and makes the CRAN checker unhappy:
```
Undocumented S4 methods:
  generic 'alias' and siglist 'Column'
  generic 'alias' and siglist 'SparkDataFrame'
```
Docs for the generic are created, but that doesn't help us here. Even if we 
bring `@examples` there, we still have to deal with CRAN.

There is also my favorite - `\name must exist and be unique in Rd files` - 
which doesn't give us much room here, does it?

I am open to suggestions, but personally I am out of ideas. I've been digging 
through the `roxygen` docs, but between CRAN, S4 requirements, `roxygen` 
limitations and our own rules there is not much room left.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias

2017-05-04 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17825#discussion_r114931344
  
--- Diff: R/pkg/R/generics.R ---
@@ -387,6 +387,16 @@ setGeneric("value", function(bcast) { 
standardGeneric("value") })
 #' @export
 setGeneric("agg", function (x, ...) { standardGeneric("agg") })
 
+#' alias
+#'
+#' Set a new name for a Column or a SparkDataFrame. Equivalent to SQL "AS" 
keyword.
--- End diff --

I still believe that AS is applicable to both. Essentially what we do is:

```
SELECT column AS new_column FROM table
```

and

```
(SELECT * FROM table) AS new_table
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias

2017-05-04 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17825#discussion_r114931185
  
--- Diff: R/pkg/R/generics.R ---
@@ -387,6 +387,16 @@ setGeneric("value", function(bcast) { standardGeneric("value") })
 #' @export
 setGeneric("agg", function (x, ...) { standardGeneric("agg") })
 
+#' alias
+#'
+#' Set a new name for a Column or a SparkDataFrame. Equivalent to SQL "AS" keyword.
+#'
+#' @name alias
+#' @rdname alias
+#' @param object x a Column or a SparkDataFrame
+#' @param data new name to use
--- End diff --

To be honest, I find both equally confusing, so if you think that a single
annotation is better, I am happy to oblige.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias

2017-05-04 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17825#discussion_r114929528
  
--- Diff: R/pkg/R/generics.R ---
@@ -387,6 +387,16 @@ setGeneric("value", function(bcast) { standardGeneric("value") })
 #' @export
 setGeneric("agg", function (x, ...) { standardGeneric("agg") })
 
+#' alias
+#'
+#' Set a new name for a Column or a SparkDataFrame. Equivalent to SQL "AS" keyword.
+#'
+#' @name alias
+#' @rdname alias
+#' @param object x a Column or a SparkDataFrame
+#' @param data new name to use
--- End diff --

Wouldn't it be better to annotate the actual implementations? To get something
like this:


![image](https://cloud.githubusercontent.com/assets/1554276/25733425/295f465e-3159-11e7-87b7-d959c9bf3352.png)
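
In plain text, what I mean by annotating the actual implementation is roughly the
sketch below (the parameter wording is mine, not a final proposal):

```r
#' @param object a SparkDataFrame or a Column to set the name on.
#' @param data the new name to use.
#' @rdname alias
#' @aliases alias,SparkDataFrame-method
#' @export
setMethod("alias",
          signature(object = "SparkDataFrame"),
          function(object, data) {
            dataFrame(callJMethod(object@sdf, "alias", data))
          })
```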



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias

2017-05-04 Thread zero323
Github user zero323 closed the pull request at:

https://github.com/apache/spark/pull/17825


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias

2017-05-04 Thread zero323
GitHub user zero323 reopened a pull request:

https://github.com/apache/spark/pull/17825

[SPARK-20550][SPARKR] R wrapper for Dataset.alias

## What changes were proposed in this pull request?

- Add SparkR wrapper for `Dataset.alias`.
- Adjust roxygen annotations for `functions.alias` (including example usage).

## How was this patch tested?

Unit tests, `check_cran.sh`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zero323/spark SPARK-20550

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17825.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17825


commit 944a3ec791a8f103093e24511e895a4ce60970d8
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-01T08:59:24Z

Initial implementation

commit 5e9f8da45c432e0752e5e78556add33e0a6d0557
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-01T22:27:11Z

Adjust argument annotations

- Remove param annotations from dataframe.alias
- Use generic annotations for column.alias

commit 73133f9442ad8317fb12b600221962bf47d8a95c
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-01T22:31:26Z

Add usage examples to column.alias

commit 848eeefc1f18c6aabaf65e6efed259a2fa5c19c3
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-01T22:34:51Z

Remove return type annotation

commit 05c0781110b42a940e06cc31650449a8715e85c9
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-02T02:00:13Z

Fix typo

commit 22d7cf661bb54a8f7f9c660e1d914802f1eb4153
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-02T04:25:34Z

Move dontruns to their own lines

commit 22e1292557f1a5597cde6337267a099bbcdc07aa
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-02T04:27:11Z

Extend param description

commit 6bb3d914960d1cf63e582a7d732ca80ed321e9c5
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-02T04:33:34Z

Add type annotations to since notes

commit b3c1a416a16a9d32649edda2b66fc9c3476358a5
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-02T04:38:51Z

Attach alias test to select-with-column test case

commit 40fedcb8c41bc84deead205aad81e84c095045b5
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-02T04:44:45Z

Extend description

commit 1e1ad443751fc3dc93487e5385cc934feb93f631
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-03T00:25:15Z

Move alias documentation to generics

commit 2d5ace288f2443327696823c343c095f0d8d64ca
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-04T01:13:45Z

Add family annotation

commit 5fe5495580eb3852ea5092a34dc2334c0e45c9b7
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-04T06:32:54Z

Check that stats::alias is not masked

commit 09f9ccaf5e66a400d26b4ab6d600d951305d5fd3
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-04T07:04:52Z

Fix style

commit f1c74f338b8df865a5e8b9a6e281211aa27af7d3
Author: zero323 <zero...@users.noreply.github.com>
Date:   2017-05-04T10:17:42Z

vim




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17825: [SPARK-20550][SPARKR] R wrapper for Dataset.alias

2017-05-04 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17825#discussion_r114925159
  
--- Diff: R/pkg/R/generics.R ---
@@ -387,6 +387,16 @@ setGeneric("value", function(bcast) { standardGeneric("value") })
 #' @export
 setGeneric("agg", function (x, ...) { standardGeneric("agg") })
 
+#' alias
+#'
+#' Set a new name for a Column or a SparkDataFrame. Equivalent to SQL "AS" keyword.
--- End diff --

How about?

```
#' Return a new Column or a SparkDataFrame with a name set. Equivalent to SQL "AS" keyword.
```
Is the `Column` new?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17851: [SPARK-20585][SPARKR] R generic hint support

2017-05-04 Thread zero323
Github user zero323 commented on the issue:

https://github.com/apache/spark/pull/17851
  
Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17851: [SPARK-20585][SPARKR] R generic hint support

2017-05-04 Thread zero323
Github user zero323 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17851#discussion_r114709260
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3715,3 +3715,34 @@ setMethod("rollup",
 sgd <- callJMethod(x@sdf, "rollup", jcol)
 groupedData(sgd)
   })
+
+#' hint
+#'
+#' Specifies execution plan hint on the current SparkDataFrame.
+#'
+#' @param x a SparkDataFrame.
+#' @param name a name of the hint.
+#' @param ... additional argument(s) passed to the method.
+#'
+#' @return A SparkDataFrame.
+#' @family SparkDataFrame functions
+#' @aliases hint,SparkDataFrame,character-method
+#' @rdname hint
+#' @name hint
+#' @export
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(mtcars)
+#' avg_mpg <- mean(groupBy(createDataFrame(mtcars), "cyl"), "mpg")
--- End diff --

Also with alias it will be quite dense:

```r
#' @examples
#' \dontrun{
#' # Set aliases to avoid ambiguity
#' df <- alias(createDataFrame(mtcars), "cars")
#' avg_mpg <- alias(mean(groupBy(createDataFrame(mtcars), "cyl"), "mpg"), "avg_mpg")
#'
#' head(join(
#'   df, hint(avg_mpg, "broadcast"), 
#'   column("cars.cyl") == column("avg_mpg.cyl")
#' ))
#' }
```
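
And a short follow-up sketch (same names as above) to check that the hint is
actually picked up by the planner; with the broadcast hint, the physical plan
printed by `explain` should show a broadcast join rather than a sort-merge join:

```r
joined <- join(
  df, hint(avg_mpg, "broadcast"),
  column("cars.cyl") == column("avg_mpg.cyl")
)
# Prints the parsed, analyzed, optimized and physical plans to stdout
explain(joined, extended = TRUE)
```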


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


