[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18270
  
retest this please


---




[GitHub] spark issue #19537: [SQL] Mark strategies with override for clarity.

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19537
  
Thanks! Merged to master.


---




[GitHub] spark pull request #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-a...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18933#discussion_r145888158
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1760,6 +1760,17 @@ def toPandas(self):
 
         for f, t in dtype.items():
             pdf[f] = pdf[f].astype(t, copy=False)
+
+        if self.sql_ctx.getConf("spark.sql.execution.pandas.timeZoneAware", "false").lower() \
--- End diff --

We still need a conf, even if it is a bug. This is just to avoid breaking 
any existing app. We can remove the conf in Spark 3.x.
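
As an illustration, here is a minimal sketch of how an app could pin the legacy behavior under the proposed conf; this assumes the conf name and default land exactly as in the diff above, which may change before the PR merges:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Opt out of timezone-aware timestamps to keep the legacy toPandas() behavior.
spark.conf.set("spark.sql.execution.pandas.timeZoneAware", "false")

pdf = spark.range(1).selectExpr("current_timestamp() AS ts").toPandas()
print(pdf.dtypes)  # naive datetime64[ns] while the conf is off
```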


---




[GitHub] spark pull request #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-a...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18933#discussion_r145888010
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -912,6 +912,14 @@ object SQLConf {
       .intConf
       .createWithDefault(1)
 
+  val PANDAS_TIMEZONE_AWARE =
+    buildConf("spark.sql.execution.pandas.timeZoneAware")
+      .internal()
+      .doc("When true, make Pandas DataFrame with timezone-aware timestamp type when converting " +
+        "by pyspark.sql.DataFrame.toPandas. The session local timezone is used for the timezone.")
+      .booleanConf
+      .createWithDefault(false)
--- End diff --

We can change the default to `true`, since we agree that this is a bug. 


---




[GitHub] spark issue #19519: [SPARK-21840][core] Add trait that allows conf to be dir...

2017-10-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19519
  
LGTM.


---




[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18664
  
BTW, we might need to resolve the comments in 
https://github.com/apache/spark/pull/18933 and merge that PR first.


---




[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18664
  
Before we merge this PR, could somebody submit a PR documenting this issue? Then we can get more feedback from the others.


---




[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19540
  
**[Test build #82924 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82924/testReport)**
 for PR 19540 at commit 
[`08240f3`](https://github.com/apache/spark/commit/08240f368fdde6cf3ce02b50b7e7405a00f15fa4).


---




[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19540
  
I think branch 2.2 also has a similar issue when fetching resources from remote secure HDFS.


---




[GitHub] spark issue #10949: [SPARK-12832][MESOS] mesos scheduler respect agent attri...

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/10949
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19540
  
ok to test.


---




[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19540
  
Thanks for the fix! I didn't test on a secure cluster when I did the glob path support, so I didn't realize this issue.


---




[GitHub] spark pull request #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] group...

2017-10-19 Thread ueshin
Github user ueshin closed the pull request at:

https://github.com/apache/spark/pull/19505


---




[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...

2017-10-19 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/19505
  
Sure, I'll close this.
@icexelloss Of course you can open a separate JIRA and another PR. Thanks!


---




[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19485
  
This is the API link you refer to:
`https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader@csv(paths:String*):org.apache.spark.sql.DataFrame`

I just quickly scanned them. The option descriptions are pretty rough. They are made for advanced devs who read the API docs and play with them. In the long term, we should follow what the mainstream RDBMS reference manuals do. Something like:
- https://dev.mysql.com/doc/refman/5.5/en/creating-tables.html
- https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/sqlref/src/tpc/db2z_sql_createtable.html
- https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables003.htm#ADMIN01503

I prefer having something more human-friendly. The whole SQL doc needs a complete re-org. cc @jiangxb1987 Maybe you are the right person to take it.


---




[GitHub] spark issue #19523: [SPARK-22301][SQL] Add rule to Optimizer for In with emp...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19523
  
ok to test


---




[GitHub] spark issue #19523: [SPARK-22301][SQL] Add rule to Optimizer for In with emp...

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19523
  
**[Test build #82923 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82923/testReport)**
 for PR 19523 at commit 
[`50c7af3`](https://github.com/apache/spark/commit/50c7af3d4fb9a23ccf460a1842d7e57a26ca582c).


---




[GitHub] spark issue #19523: [SPARK-22301][SQL] Add rule to Optimizer for In with emp...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19523
  
@mgaido91 Could you update the PR title?


---




[GitHub] spark issue #19508: [SPARK-20783][SQL][Follow-up] Create ColumnVector to abs...

2017-10-19 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19508
  
LGTM


---




[GitHub] spark issue #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19517
  
**[Test build #82922 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82922/testReport)**
 for PR 19517 at commit 
[`59d61a4`](https://github.com/apache/spark/commit/59d61a46a15b00f8af9ec8e2c6930853b7097b1c).


---




[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19540
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] group...

2017-10-19 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19517#discussion_r145878293
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala ---
@@ -22,17 +22,26 @@ import org.apache.spark.sql.Column
 import org.apache.spark.sql.catalyst.expressions.Expression
 import org.apache.spark.sql.types.DataType
 
+private[spark] object PythonUdfType {
+  // row-based UDFs
--- End diff --

Sure, I'll update it, too.


---




[GitHub] spark pull request #19540: [SPARK-22319][Core] call loginUserFromKeytab befo...

2017-10-19 Thread sjrand
GitHub user sjrand opened a pull request:

https://github.com/apache/spark/pull/19540

[SPARK-22319][Core] call loginUserFromKeytab before accessing hdfs

## What changes were proposed in this pull request?

In `SparkSubmit`, call `loginUserFromKeytab` before attempting to make RPC 
calls to the NameNode.

## How was this patch tested?

I manually tested this patch by:

1. Confirming that my Spark application failed to launch with the error 
reported in https://issues.apache.org/jira/browse/SPARK-22319.
2. Applying this patch and confirming that the app no longer fails to 
launch, even when I have not manually run `kinit` on the host.

Presumably we also want integration tests for secure clusters so that we 
catch this sort of thing. I'm happy to take a shot at this if it's feasible and 
someone can point me in the right direction. 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sjrand/spark SPARK-22319

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19540.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19540


commit 08240f368fdde6cf3ce02b50b7e7405a00f15fa4
Author: Steven Rand 
Date:   2017-10-19T04:51:26Z

call loginUserFromKeytab before accessing hdfs




---




[GitHub] spark pull request #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] group...

2017-10-19 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19517#discussion_r145878275
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2038,13 +2038,22 @@ def _wrap_function(sc, func, returnType):
                                   sc.pythonVer, broadcast_vars, sc._javaAccumulator)
 
 
+class PythonUdfType(object):
+    # row-based UDFs
--- End diff --

Sure, I'll update it.


---




[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19272
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82919/
Test PASSed.


---




[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19272
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19272
  
**[Test build #82919 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82919/testReport)**
 for PR 19272 at commit 
[`837157d`](https://github.com/apache/spark/commit/837157d1c76d1b9488025e8c294eb2839aeee5e3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

2017-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19485
  
I meant adding a new chapter describing the options and removing the duplication, for example here
https://github.com/apache/spark/blob/73d80ec49713605d6a589e688020f0fc2d6feab2/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L513
and then leaving a link to the new chapter there instead.


---




[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19505
  
@ueshin Maybe close this PR?


---




[GitHub] spark issue #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19517
  
LGTM. 

@ueshin Could you remove `[WIP]` from the title of this PR?


---




[GitHub] spark pull request #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] group...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19517#discussion_r145877519
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala ---
@@ -22,17 +22,26 @@ import org.apache.spark.sql.Column
 import org.apache.spark.sql.catalyst.expressions.Expression
 import org.apache.spark.sql.types.DataType
 
+private[spark] object PythonUdfType {
+  // row-based UDFs
--- End diff --

The same here.


---




[GitHub] spark pull request #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] group...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19517#discussion_r145877442
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2038,13 +2038,22 @@ def _wrap_function(sc, func, returnType):
                                   sc.pythonVer, broadcast_vars, sc._javaAccumulator)
 
 
+class PythonUdfType(object):
+    # row-based UDFs
--- End diff --

Nit: Please update all `row-based UDFs` to `row-at-a-time UDFs`


---




[GitHub] spark issue #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19517
  
**[Test build #82921 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82921/testReport)**
 for PR 19517 at commit 
[`7e43bb4`](https://github.com/apache/spark/commit/7e43bb44bf4267ef6a719108d3edbf545eaba23d).


---




[GitHub] spark issue #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19517
  
retest this please


---




[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19485
  
@HyukjinKwon I did not understand what your suggestion is.

@jomach Any reason you closed this PR, or do you plan to open a new one?


---




[GitHub] spark issue #19539: [WIP] [SQL] Remove unnecessary methods

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19539
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82920/
Test PASSed.


---




[GitHub] spark issue #19539: [WIP] [SQL] Remove unnecessary methods

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19539
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19539: [WIP] [SQL] Remove unnecessary methods

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19539
  
**[Test build #82920 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82920/testReport)**
 for PR 19539 at commit 
[`19bb867`](https://github.com/apache/spark/commit/19bb867234a3127727ef5500d49e93628c6c1ba3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...

2017-10-19 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/19505
  
@viirya @cloud-fan I updated my original summary. I think it answers the `group_transform` question. I also added more examples to each type.

@HyukjinKwon @viirya I agree we can move this to a separate JIRA and merge the current PR of @ueshin. Maybe I can open another PR with just the proposal design doc? Not sure what the best way is.

 




---




[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19524
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82918/
Test PASSed.


---




[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19524
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19524
  
**[Test build #82918 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82918/testReport)**
 for PR 19524 at commit 
[`de2a70b`](https://github.com/apache/spark/commit/de2a70b2279d214022b1d575270ff7a38764d26f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19528: [SPARK-20393][WEBU UI][1.6] Strengthen Spark to prevent ...

2017-10-19 Thread ambauma
Github user ambauma commented on the issue:

https://github.com/apache/spark/pull/19528
  
Believed fixed. Hard to say for sure without knowing the precise Python and numpy versions the build is using.


---




[GitHub] spark issue #19469: [SPARK-22243][DStreams]spark.yarn.jars reload from confi...

2017-10-19 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/19469
  
Ah, I didn't realize there was a change in that PR. I agree we need a better solution.





---




[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19471
  
> No behavior change if there is no overlapped columns in data and 
partition schema.

> The schema changed(partition columns go to the end) when reading file 
format data source with partition columns in data files.

@cloud-fan Could you check why so many test cases failed? 



---




[GitHub] spark issue #19528: [SPARK-20393][WEBU UI][1.6] Strengthen Spark to prevent ...

2017-10-19 Thread ambauma
Github user ambauma commented on the issue:

https://github.com/apache/spark/pull/19528
  
Able to duplicate. Working theory is that this is related to numpy 1.12.1. Here is my conda env:

(spark-1.6) andrew@andrew-Inspiron-7559:~/git/spark$ conda list
# packages in environment at /home/andrew/.conda/envs/spark-1.6:
#
ca-certificates           2017.08.26           h1d4fec5_0
certifi                   2016.2.28                py34_0
intel-openmp              2018.0.0             h15fc484_7
libedit                   3.1                  heed3624_0
libffi                    3.2.1                h4deb6c0_3
libgcc-ng                 7.2.0                h7cc24e2_2
libgfortran               1.0                           0
libstdcxx-ng              7.2.0                h7a57d05_2
mkl                       2017.0.3                      0
ncurses                   6.0                  h06874d7_1
numpy                     1.12.1                   py34_0
openblas                  0.2.19                        0
openssl                   1.0.2l               h077ae2c_5
pip                       9.0.1                    py34_1
python                    3.4.5                         0
readline                  6.2                           2
setuptools                27.2.0                   py34_0
sqlite                    3.13.0                        0
tk                        8.5.18                        0
wheel                     0.29.0                   py34_0
xz                        5.2.3                h2bcbf08_1
zlib                      1.2.11               hfbfcf68_1


---




[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path

2017-10-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19269


---




[GitHub] spark issue #19269: [SPARK-22026][SQL] data source v2 write path

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19269
  
Thanks! Merged to master. 

This is just the first commit of the data source v2 write protocol. More 
PRs are coming to further improve it. 


---




[GitHub] spark issue #19269: [SPARK-22026][SQL] data source v2 write path

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19269
  
It sounds like the initialization stages are missing in the current 
protocol API design. We can do it later. 

LGTM


---




[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...

2017-10-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19505
  
The group_transform udfs look a bit weird to me. @icexelloss Can you explain the use case for them?


---




[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path

2017-10-19 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19269#discussion_r145870061
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.spark.{SparkException, TaskContext}
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.sources.v2.writer._
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.Utils
+
+/**
+ * The logical plan for writing data into data source v2.
+ */
+case class WriteToDataSourceV2(writer: DataSourceV2Writer, query: LogicalPlan) extends LogicalPlan {
+  override def children: Seq[LogicalPlan] = Seq(query)
+  override def output: Seq[Attribute] = Nil
+}
+
+/**
+ * The physical plan for writing data into data source v2.
+ */
+case class WriteToDataSourceV2Exec(writer: DataSourceV2Writer, query: SparkPlan) extends SparkPlan {
+  override def children: Seq[SparkPlan] = Seq(query)
+  override def output: Seq[Attribute] = Nil
+
+  override protected def doExecute(): RDD[InternalRow] = {
+    val writeTask = writer match {
+      case w: SupportsWriteInternalRow => w.createInternalRowWriterFactory()
+      case _ => new RowToInternalRowDataWriterFactory(writer.createWriterFactory(), query.schema)
+    }
+
--- End diff --

Do we need to add a function call here for initialization or setup?


---




[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...

2017-10-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19534
  
@sitalkedia I have a very old similar PR, #11205; maybe you can refer to it.


---




[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r145865969
  
--- Diff: python/pyspark/sql/session.py ---
@@ -510,6 +578,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
         except Exception:
             has_pandas = False
         if has_pandas and isinstance(data, pandas.DataFrame):
+            if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \
+                    and len(data) > 0:
+                df = self._createFromPandasWithArrow(data, schema)
--- End diff --

As of https://github.com/apache/spark/pull/19459#issuecomment-337674952, `schema` from `_parse_datatype_string` could be something other than a `StructType`:

https://github.com/apache/spark/blob/bfc7e1fe1ad5f9777126f2941e29bbe51ea5da7c/python/pyspark/sql/tests.py#L1325

although I don't think we have supported this case with `pd.DataFrame`, as the `int` case resembles a `Dataset` of primitive types, to my knowledge:

```
spark.createDataFrame(["a", "b"], "string").show()
+-----+
|value|
+-----+
|    a|
|    b|
+-----+
```

For the `pd.DataFrame` case, it looks like we always have a list of lists.

https://github.com/apache/spark/blob/d492cc5a21cd67b3999b85d97f5c41c3734b1ba3/python/pyspark/sql/session.py#L515

So, I think we should only support a list of strings, maybe with a proper exception for the `int` case.

Of course, this case should work:

```
>>> spark.createDataFrame(pd.DataFrame([1]), "struct<a:int>").show()
+---+
|  a|
+---+
|  1|
+---+
```
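
A rough sketch of that check (a hypothetical helper, not code from this PR, applied after schema strings have been parsed to a `DataType`) might look like:

```python
from pyspark.sql.types import StructType

def check_pandas_schema(schema):
    """Hypothetical guard for the pandas path: keep a StructType or a list of
    column names; reject e.g. the IntegerType parsed from "int"."""
    if schema is None or isinstance(schema, StructType):
        return schema
    if isinstance(schema, list) and all(isinstance(name, str) for name in schema):
        return schema
    raise TypeError("pandas.DataFrame requires a StructType schema or a list "
                    "of column names, got %r" % (schema,))
```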


---




[GitHub] spark issue #19528: [SPARK-20393][WEBU UI][1.6] Strengthen Spark to prevent ...

2017-10-19 Thread ambauma
Github user ambauma commented on the issue:

https://github.com/apache/spark/pull/19528
  
Working on duplicating PySpark failures...


---




[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r145863796
  
--- Diff: python/pyspark/sql/session.py ---
@@ -414,6 +415,73 @@ def _createFromLocal(self, data, schema):
             data = [schema.toInternal(row) for row in data]
         return self._sc.parallelize(data), schema
 
+    def _createFromPandasWithArrow(self, pdf, schema):
+        """
+        Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting
+        to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the
+        data types will be used to coerce the data in Pandas to Arrow conversion.
+        """
+        from pyspark.serializers import ArrowSerializer
+        from pyspark.sql.types import from_arrow_schema, to_arrow_type, _cast_pandas_series_type
+        import pyarrow as pa
+
+        # Slice the DataFrame into batches
+        step = -(-len(pdf) // self.sparkContext.defaultParallelism)  # round int up
+        pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step))
+
+        if schema is None or isinstance(schema, list):
+            batches = [pa.RecordBatch.from_pandas(pdf_slice, preserve_index=False)
+                       for pdf_slice in pdf_slices]
+
+            # There will be at least 1 batch after slicing the pandas.DataFrame
+            schema_from_arrow = from_arrow_schema(batches[0].schema)
+
+            # If passed schema as a list of names then rename fields
+            if isinstance(schema, list):
+                fields = []
+                for i, field in enumerate(schema_from_arrow):
+                    field.name = schema[i]
+                    fields.append(field)
+                schema = StructType(fields)
+            else:
+                schema = schema_from_arrow
+        else:
+            batches = []
+            for i, pdf_slice in enumerate(pdf_slices):
+
+                # convert to series to pyarrow.Arrays to use mask when creating Arrow batches
+                arrs = []
+                names = []
+                for c, (_, series) in enumerate(pdf_slice.iteritems()):
+                    field = schema[c]
+                    names.append(field.name)
+                    t = to_arrow_type(field.dataType)
+                    try:
+                        # NOTE: casting is not necessary with Arrow >= 0.7
+                        arrs.append(pa.Array.from_pandas(_cast_pandas_series_type(series, t),
+                                                         mask=series.isnull(), type=t))
+                    except ValueError as e:
--- End diff --

I think this guard only works to prevent casting like:
```python
>>> s = pd.Series(["abc", "2", "10001"])
>>> s.astype(np.object_)
0      abc
1        2
2    10001
dtype: object
>>> s
0      abc
1        2
2    10001
dtype: object
>>> s.astype(np.int8)
...
ValueError: invalid literal for long() with base 10: 'abc'
```
For casting that can cause overflow, this doesn't seem to work.


---




[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19527
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82917/
Test PASSed.


---




[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19527
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19527
  
**[Test build #82917 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82917/testReport)**
 for PR 19527 at commit 
[`b42d175`](https://github.com/apache/spark/commit/b42d175ddc4928ec36718177702059ccf0bfbfea).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r145862488
  
--- Diff: python/pyspark/sql/session.py ---
@@ -414,6 +415,73 @@ def _createFromLocal(self, data, schema):
             data = [schema.toInternal(row) for row in data]
         return self._sc.parallelize(data), schema
 
+    def _createFromPandasWithArrow(self, pdf, schema):
+        """
+        Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting
+        to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the
+        data types will be used to coerce the data in Pandas to Arrow conversion.
+        """
+        from pyspark.serializers import ArrowSerializer
+        from pyspark.sql.types import from_arrow_schema, to_arrow_type, _cast_pandas_series_type
+        import pyarrow as pa
+
+        # Slice the DataFrame into batches
+        step = -(-len(pdf) // self.sparkContext.defaultParallelism)  # round int up
+        pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step))
+
+        if schema is None or isinstance(schema, list):
+            batches = [pa.RecordBatch.from_pandas(pdf_slice, preserve_index=False)
+                       for pdf_slice in pdf_slices]
+
+            # There will be at least 1 batch after slicing the pandas.DataFrame
+            schema_from_arrow = from_arrow_schema(batches[0].schema)
+
+            # If passed schema as a list of names then rename fields
+            if isinstance(schema, list):
+                fields = []
+                for i, field in enumerate(schema_from_arrow):
+                    field.name = schema[i]
+                    fields.append(field)
+                schema = StructType(fields)
+            else:
+                schema = schema_from_arrow
+        else:
+            batches = []
+            for i, pdf_slice in enumerate(pdf_slices):
+
+                # convert to series to pyarrow.Arrays to use mask when creating Arrow batches
+                arrs = []
+                names = []
+                for c, (_, series) in enumerate(pdf_slice.iteritems()):
+                    field = schema[c]
+                    names.append(field.name)
+                    t = to_arrow_type(field.dataType)
+                    try:
+                        # NOTE: casting is not necessary with Arrow >= 0.7
+                        arrs.append(pa.Array.from_pandas(_cast_pandas_series_type(series, t),
--- End diff --

Is there any chance that the given data type for the field is not correct? With a wrong data type for a field, what is the behavior of this casting?

For example, for a Series of `numpy.int16`, if the given data type is `ByteType`, this casting will turn it into `numpy.int8`, so we will get:
```python
>>> s = pd.Series([1, 2, 10001], dtype=np.int16)
>>> s
0        1
1        2
2    10001
dtype: int16
>>> s.astype(np.int8)
0     1
1     2
2    17
dtype: int8
```

`createDataFrame` will check the data type of the input data when converting to a DataFrame. This implicit type casting seems inconsistent with the original behavior.
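
As an aside, one way to surface the overflow instead of silently wrapping would be an explicit range check before the cast; a minimal sketch (the `safe_astype` helper is hypothetical, not code from this PR):

```python
import numpy as np
import pandas as pd

def safe_astype(series, dtype):
    """Hypothetical helper: refuse integer down-casts that would silently wrap."""
    info = np.iinfo(dtype)
    if series.min() < info.min or series.max() > info.max:
        raise ValueError("value out of range for %s" % np.dtype(dtype).name)
    return series.astype(dtype)

s = pd.Series([1, 2, 10001], dtype=np.int16)
safe_astype(s, np.int8)  # raises ValueError instead of turning 10001 into 17
```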





---




[GitHub] spark issue #19539: [WIP] [SQL] Remove unnecessary methods

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19539
  
**[Test build #82920 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82920/testReport)**
 for PR 19539 at commit 
[`19bb867`](https://github.com/apache/spark/commit/19bb867234a3127727ef5500d49e93628c6c1ba3).


---




[GitHub] spark pull request #19539: [WIP] [SQL] Remove unnecessary methods

2017-10-19 Thread wzhfy
GitHub user wzhfy opened a pull request:

https://github.com/apache/spark/pull/19539

[WIP] [SQL] Remove unnecessary methods

## What changes were proposed in this pull request?

Remove unnecessary methods.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wzhfy/spark remove_equals

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19539.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19539


commit 19bb867234a3127727ef5500d49e93628c6c1ba3
Author: Zhenhua Wang 
Date:   2017-10-20T01:31:27Z

remove necessary methods




---




[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19272
  
**[Test build #82919 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82919/testReport)**
 for PR 19272 at commit 
[`837157d`](https://github.com/apache/spark/commit/837157d1c76d1b9488025e8c294eb2839aeee5e3).


---




[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...

2017-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19505
  
+1 for a separate JIRA to clarify the proposal, and +0 for option 3 out of those three, too.


---




[GitHub] spark pull request #19272: [Spark-21842][Mesos] Support Kerberos ticket rene...

2017-10-19 Thread ArtRand
Github user ArtRand commented on a diff in the pull request:

https://github.com/apache/spark/pull/19272#discussion_r145861062
  
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
@@ -194,6 +198,27 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
       sc.conf.getOption("spark.mesos.driver.frameworkId").map(_ + suffix)
     )
 
+    // check that the credentials are defined, even though it's likely that auth would have failed
+    // already if you've made it this far
+    if (principal != null && hadoopDelegationCreds.isDefined) {
+      logDebug(s"Principal found ($principal) starting token renewer")
+      val credentialRenewerThread = new Thread {
+        setName("MesosCredentialRenewer")
+        override def run(): Unit = {
+          val rt = MesosCredentialRenewer.getTokenRenewalTime(hadoopDelegationCreds.get, conf)
+          val credentialRenewer =
+            new MesosCredentialRenewer(
+              conf,
+              hadoopDelegationTokenManager.get,
+              MesosCredentialRenewer.getNextRenewalTime(rt),
+              driverEndpoint)
+          credentialRenewer.scheduleTokenRenewal()
+        }
+      }
+
+      credentialRenewerThread.start()
+      credentialRenewerThread.join()
--- End diff --

Yes, sorry, for some reason I understood you to mean the credential renewer itself. I added a comment to the same effect as the YARN analogue.
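
For readers unfamiliar with the pattern in the diff above, here is a generic sketch of a self-rescheduling renewal thread (an illustration only, not Spark's Mesos implementation; the function names are hypothetical):

```python
import threading
import time

def start_renewer(renew, get_next_renewal_time):
    """Run renew() each time the clock passes get_next_renewal_time()."""
    def loop():
        while True:
            # Sleep until the next scheduled renewal, then renew the tokens.
            time.sleep(max(0.0, get_next_renewal_time() - time.time()))
            renew()
    t = threading.Thread(target=loop, name="CredentialRenewer", daemon=True)
    t.start()
    return t
```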


---




[GitHub] spark issue #19469: [SPARK-22243][DStreams]spark.yarn.jars reload from confi...

2017-10-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19469
  
@felixcheung As you can see, there's a bunch of configurations that need to be added here in https://github.com/apache-spark-on-k8s/spark/pull/516; that's why I'm asking for a general solution to such issues.

I'm OK to merge this PR. But I suspect similar PRs will still be created in the future, since these issues are quite scenario-specific: users have different scenarios and can hit different issues related to this. So I'm just wondering if we could have a better solution for this.


---




[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r145859471
  
--- Diff: python/pyspark/sql/session.py ---
@@ -510,6 +578,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
         except Exception:
             has_pandas = False
         if has_pandas and isinstance(data, pandas.DataFrame):
+            if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \
+                    and len(data) > 0:
+                df = self._createFromPandasWithArrow(data, schema)
+                # Fallback to create DataFrame without arrow if return None
+                if df is not None:
--- End diff --

Shall we show some log message to users in this case?
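
Something along these lines could work; a minimal sketch only (the helper name and the fallback wiring are assumptions, not the PR's actual code):

```python
import warnings

def create_dataframe_with_fallback(create_with_arrow, create_without_arrow, data, schema):
    """Hypothetical wrapper: try the Arrow path, warn and fall back if it bails out."""
    df = create_with_arrow(data, schema)
    if df is None:
        warnings.warn("Arrow will not be used in createDataFrame; "
                      "falling back to the non-Arrow code path.")
        df = create_without_arrow(data, schema)
    return df
```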


---




[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...

2017-10-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19505
  
@icexelloss The summary and proposal 3 look great. To prevent confusion, can you also put the usage of each function type in proposal 3? E.g., group_map is for `groupby().apply()`, transform is for `withColumn`, etc.? Thanks.


---




[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...

2017-10-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19505
  
Btw, I think the scope of this change is more than just a follow-up. Should 
we create another JIRA for it?


---




[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r145858362
  
--- Diff: python/pyspark/sql/session.py ---
@@ -414,6 +415,73 @@ def _createFromLocal(self, data, schema):
             data = [schema.toInternal(row) for row in data]
         return self._sc.parallelize(data), schema
 
+    def _createFromPandasWithArrow(self, pdf, schema):
+        """
+        Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting
+        to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the
+        data types will be used to coerce the data in Pandas to Arrow conversion.
+        """
+        from pyspark.serializers import ArrowSerializer
+        from pyspark.sql.types import from_arrow_schema, to_arrow_type, _cast_pandas_series_type
+        import pyarrow as pa
+
+        # Slice the DataFrame into batches
+        step = -(-len(pdf) // self.sparkContext.defaultParallelism)  # round int up
+        pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step))
+
+        if schema is None or isinstance(schema, list):
+            batches = [pa.RecordBatch.from_pandas(pdf_slice, preserve_index=False)
+                       for pdf_slice in pdf_slices]
+
+            # There will be at least 1 batch after slicing the pandas.DataFrame
+            schema_from_arrow = from_arrow_schema(batches[0].schema)
+
+            # If passed schema as a list of names then rename fields
+            if isinstance(schema, list):
+                fields = []
+                for i, field in enumerate(schema_from_arrow):
+                    field.name = schema[i]
+                    fields.append(field)
+                schema = StructType(fields)
+            else:
+                schema = schema_from_arrow
+        else:
+            batches = []
+            for i, pdf_slice in enumerate(pdf_slices):
+
+                # convert to series to pyarrow.Arrays to use mask when creating Arrow batches
+                arrs = []
+                names = []
+                for c, (_, series) in enumerate(pdf_slice.iteritems()):
+                    field = schema[c]
+                    names.append(field.name)
+                    t = to_arrow_type(field.dataType)
+                    try:
+                        # NOTE: casting is not necessary with Arrow >= 0.7
+                        arrs.append(pa.Array.from_pandas(_cast_pandas_series_type(series, t),
+                                                         mask=series.isnull(), type=t))
+                    except ValueError as e:
+                        warnings.warn("Arrow will not be used in createDataFrame: %s" % str(e))
+                        return None
+                batches.append(pa.RecordBatch.from_arrays(arrs, names))
+
+                # Verify schema of first batch, return None if not equal and fallback without Arrow
+                if i == 0:
+                    schema_from_arrow = from_arrow_schema(batches[i].schema)
+                    if schema != schema_from_arrow:
+                        warnings.warn("Arrow will not be used in createDataFrame.\n" +
--- End diff --

OK by me too, but let's keep our eyes on the mailing list and JIRAs that complain about it in the future, and improve it next time if this turns out to be more important than we think here.


---




[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19524
  
**[Test build #82918 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82918/testReport)**
 for PR 19524 at commit 
[`de2a70b`](https://github.com/apache/spark/commit/de2a70b2279d214022b1d575270ff7a38764d26f).


---




[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...

2017-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19524
  
Thanks for your review @shaneknapp.


---




[GitHub] spark pull request #19486: [SPARK-22268][BUILD] Fix lint-java

2017-10-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19486


---




[GitHub] spark issue #19486: [SPARK-22268][BUILD] Fix lint-java

2017-10-19 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19486
  
Merged to master.


---




[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19527
  
**[Test build #82917 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82917/testReport)**
 for PR 19527 at commit 
[`b42d175`](https://github.com/apache/spark/commit/b42d175ddc4928ec36718177702059ccf0bfbfea).


---




[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r145856074
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala ---
@@ -0,0 +1,439 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model, Transformer}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType}
+
+/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */
+private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid
+    with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data.
+   * Options are 'skip' (filter out rows with invalid data) or 'error' (throw an error).
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
+  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+    "How to handle invalid data " +
+    "Options are 'skip' (filter out rows with invalid data) or error (throw an error).",
+    ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids))
+
+  setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID)
+
+  /**
+   * Whether to drop the last category in the encoded vector (default: true)
+   * @group param
+   */
+  @Since("2.3.0")
+  final val dropLast: BooleanParam =
+    new BooleanParam(this, "dropLast", "whether to drop the last category")
+  setDefault(dropLast -> true)
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getDropLast: Boolean = $(dropLast)
+}
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of binary vectors, with
+ * at most a single one-value per row that indicates the input category index.
+ * For example with 5 categories, an input value of 2.0 would map to an output vector of
+ * `[0.0, 0.0, 1.0, 0.0]`.
+ * The last category is not included by default (configurable via `dropLast`),
+ * because it makes the vector entries sum up to one, and hence linearly dependent.
+ * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`.
+ *
+ * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories.
+ * The output vectors are sparse.
+ *
+ * @see `StringIndexer` for converting categorical values into category indices
+ */
+@Since("2.3.0")
+class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
+    extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with DefaultParamsWritable {
+
+  @Since("2.3.0")
+  def this() = this(Identifiable.randomUID("oneHotEncoder"))
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setInputCols(values: Array[String]): this.type = set(inputCols, values)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setOutputCols(values: Array[String]): this.type = set(outputCols, values)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setDropLast(value: Boolean): this.type = set(dropLast, value)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
+
+  @Since("2.3.0")
+  override def transformSchema(schema: StructType): S

[GitHub] spark issue #19486: [SPARK-22268][BUILD] Fix lint-java

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19486
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19486: [SPARK-22268][BUILD] Fix lint-java

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19486
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82916/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19486: [SPARK-22268][BUILD] Fix lint-java

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19486
  
**[Test build #82916 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82916/testReport)**
 for PR 19486 at commit 
[`95a2d9e`](https://github.com/apache/spark/commit/95a2d9ef6e07c53fca9d37e526abc2ac2f178c67).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-19 Thread MrBago
Github user MrBago commented on the issue:

https://github.com/apache/spark/pull/19439
  
@imatiach-msft just a few more comments. When I was looking over this I 
realized that the Python and Scala namespaces are going to be a little 
different, e.g. `pyspark.ml.image.readImages` vs 
`spark.ml.image.ImageSchema.readImages`. Should we use an `ImageSchema` class 
in Python as a namespace, as sketched below? Does Spark do that in other 
PySpark modules?
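
For reference, a class used purely as a namespace could look like this (a 
sketch; it assumes the module-level `readImages` defined elsewhere in 
image.py, which is not shown in this thread):

```python
class ImageSchema(object):
    """Hypothetical: a class used purely as a namespace, so Python call
    sites read ImageSchema.readImages(...), mirroring the Scala API."""

    @staticmethod
    def readImages(path, **kwargs):
        # Delegate to the module-level readImages (assumed to exist).
        return readImages(path, **kwargs)
```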


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-19 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r145843289
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,122 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import pyspark
+from pyspark import SparkContext
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame
+from pyspark.ml.param.shared import *
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+ocvTypes = {
+undefinedImageType: -1,
+"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24,
+"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25,
+"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, 
"CV_16UC4": 26,
+"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, 
"CV_16SC4": 27,
+"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, 
"CV_32SC4": 28,
+"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, 
"CV_32FC4": 29,
+"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, 
"CV_64FC4": 30
+}
+
+# DataFrame with a single column of images named "image" (nullable)
+imageSchema = StructType([StructField("image", StructType([
+    StructField(imageFields[0], StringType(),  True),
+    StructField(imageFields[1], IntegerType(), False),
+    StructField(imageFields[2], IntegerType(), False),
+    StructField(imageFields[3], IntegerType(), False),
+    # OpenCV-compatible type: CV_8UC3 in most cases
+    StructField(imageFields[4], StringType(), False),
+    # bytes in OpenCV-compatible order: row-wise BGR in most cases
+    StructField(imageFields[5], BinaryType(), False)]), True)])
+
+
+def toNDArray(image):
+    """
+    Converts an image to a (height, width, nChannels) numpy array.
+
+    :param image (object): The image to be converted
+    :rtype array: The image as a numpy array of shape (height, width, nChannels)
+
+    .. versionadded:: 2.3.0
+    """
+    height = image.height
+    width = image.width
+    nChannels = image.nChannels
+    return np.ndarray(
+        shape=(height, width, nChannels),
+        dtype=np.uint8,
+        buffer=image.data,
+        strides=(width * nChannels, nChannels, 1))
+
+
+def toImage(array, origin="", mode=ocvTypes["CV_8UC3"]):
--- End diff --

The interaction between mode and dtype is unclear here. I think there 
are two sensible things to do:

1) determine the dtype from the mode, and use that to cast the image, ie:
  `toImage(a, mode="CV_8UC3") => 
bytearray(array.astype(dtype=np.uint8).ravel())`
  `toImage(a, mode="CV_32FC3") => 
bytearray(array.astype(dtype=np.float32).ravel())`

2) Remove the mode argument to the function and use `array.dtype` to infer 
the mode.

Also, we don't currently check that nChannels is consistent with the mode 
provided. A (H, W, 1) image will currently get assigned mode "CV_8UC3".

I think it's fine to restrict the set of modes/dtypes we support, but again 
we should be careful not to return a malformed image row.
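
A minimal sketch of option 2, using only dtype/channel combinations already 
named in `ocvTypes` (the helper name `_inferMode` is hypothetical):

```python
import numpy as np

# Map numpy dtype names to OpenCV depth prefixes (a subset of ocvTypes above).
_dtypeToDepth = {"uint8": "CV_8U", "int8": "CV_8S",
                 "uint16": "CV_16U", "int16": "CV_16S", "int32": "CV_32S",
                 "float32": "CV_32F", "float64": "CV_64F"}


def _inferMode(array):
    # Infer the OpenCV mode from array.dtype and the channel count, and
    # reject anything else so we never emit a malformed image row.
    depth = _dtypeToDepth.get(array.dtype.name)
    nChannels = array.shape[2] if array.ndim == 3 else 1
    if depth is None or array.ndim not in (2, 3) or not 1 <= nChannels <= 4:
        raise ValueError("unsupported dtype/shape for an image: %s %s"
                         % (array.dtype, array.shape))
    return "%sC%d" % (depth, nChannels)


# e.g. _inferMode(np.zeros((4, 3, 3), dtype=np.uint8)) == "CV_8UC3"
```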


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-19 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r145842379
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,122 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import pyspark
+from pyspark import SparkContext
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame
+from pyspark.ml.param.shared import *
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+ocvTypes = {
+undefinedImageType: -1,
+"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24,
+"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25,
+"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, 
"CV_16UC4": 26,
+"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, 
"CV_16SC4": 27,
+"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, 
"CV_32SC4": 28,
+"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, 
"CV_32FC4": 29,
+"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, 
"CV_64FC4": 30
+}
+
+# DataFrame with a single column of images named "image" (nullable)
+imageSchema = StructType([StructField("image", StructType([
+    StructField(imageFields[0], StringType(),  True),
+    StructField(imageFields[1], IntegerType(), False),
+    StructField(imageFields[2], IntegerType(), False),
+    StructField(imageFields[3], IntegerType(), False),
+    # OpenCV-compatible type: CV_8UC3 in most cases
+    StructField(imageFields[4], StringType(), False),
+    # bytes in OpenCV-compatible order: row-wise BGR in most cases
+    StructField(imageFields[5], BinaryType(), False)]), True)])
+
+
+def toNDArray(image):
+    """
+    Converts an image to a (height, width, nChannels) numpy array.
+
+    :param image (object): The image to be converted
+    :rtype array: The image as a numpy array of shape (height, width, nChannels)
+
+    .. versionadded:: 2.3.0
+    """
+    height = image.height
+    width = image.width
+    nChannels = image.nChannels
+    return np.ndarray(
+        shape=(height, width, nChannels),
+        dtype=np.uint8,
--- End diff --

`dtype` and `strides` should depend on `image.mode`. I think it's fine to 
only support `uint8` modes for now, but can we add a check here to make sure 
we don't return a malformed array?
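
A sketch of such a check, assuming the string-typed `mode` field shown in 
this diff and supporting only the `uint8` modes for now:

```python
import numpy as np

def toNDArray(image):
    # Fail loudly on modes we don't handle yet, rather than reinterpreting
    # e.g. float32 pixel bytes as uint8 and returning a malformed array.
    if not image.mode.startswith("CV_8U"):
        raise ValueError("unsupported image mode: %s" % image.mode)
    height, width, nChannels = image.height, image.width, image.nChannels
    return np.ndarray(
        shape=(height, width, nChannels),
        dtype=np.uint8,
        buffer=image.data,
        strides=(width * nChannels, nChannels, 1))
```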


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...

2017-10-19 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/19439#discussion_r145845879
  
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,122 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import pyspark
+from pyspark import SparkContext
+from pyspark.sql.types import *
+from pyspark.sql.types import Row, _create_row
+from pyspark.sql import DataFrame
+from pyspark.ml.param.shared import *
+import numpy as np
+
+undefinedImageType = "Undefined"
+
+imageFields = ["origin", "height", "width", "nChannels", "mode", "data"]
+
+ocvTypes = {
+undefinedImageType: -1,
+"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24,
+"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25,
+"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, 
"CV_16UC4": 26,
+"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, 
"CV_16SC4": 27,
+"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, 
"CV_32SC4": 28,
+"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, 
"CV_32FC4": 29,
+"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, 
"CV_64FC4": 30
+}
+
+# DataFrame with a single column of images named "image" (nullable)
+imageSchema = StructType([StructField("image", StructType([
+    StructField(imageFields[0], StringType(),  True),
+    StructField(imageFields[1], IntegerType(), False),
+    StructField(imageFields[2], IntegerType(), False),
+    StructField(imageFields[3], IntegerType(), False),
+    # OpenCV-compatible type: CV_8UC3 in most cases
+    StructField(imageFields[4], StringType(), False),
--- End diff --

I believe this was changed to IntegerType in Scala. Is it possible to 
import this from Scala so we don't need to define it in two places?
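
One possibility (a sketch; the JVM-side class and method names here are 
assumptions, since the Scala counterpart is not shown in this thread) is to 
fetch the schema from the JVM as JSON instead of redefining it:

```python
from pyspark import SparkContext
from pyspark.sql.types import _parse_datatype_json_string

def _imageSchemaFromJvm():
    # Ask the JVM for the canonical schema and parse its JSON form, so the
    # Python definition can never drift from the Scala one.
    sc = SparkContext.getOrCreate()
    jschema = sc._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema()
    return _parse_datatype_json_string(jschema.json())
```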


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r145848280
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,439 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model, Transformer}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, 
HasInputCols, HasOutputCol, HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params for OneHotEncoderEstimator and 
OneHotEncoderModel */
+private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid
--- End diff --

Should we? If we don't plan to add `HasHandleInvalid`, `HasInputCols` and 
`HasOutputCols` to the existing `OneHotEncoder`, I think we can keep it as it 
is now.

With this one-hot encoder estimator added, we may want to deprecate the 
existing `OneHotEncoder`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r145847823
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,439 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model, Transformer}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, 
HasInputCols, HasOutputCol, HasOutputCols}
--- End diff --

Yes, thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-10-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19527
  
Benchmark against the existing one-hot encoder.

Because the existing encoder only needs to run `transform`, there is no 
fitting time.

Transforming (seconds, averaged over 10 iterations):

numColumns | Existing one-hot encoder
-- | --
1 | 0.2516055188
100 | 20.29175892115
1000 | 26242.039411932*

* Because ten iterations would take too long to finish, I ran only one 
iteration for 1000 columns, but it already shows the scale.

Benchmark code:

```scala
import org.apache.spark.ml.feature._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import spark.implicits._
import scala.util.Random

val seed = 123L
val random = new Random(seed)
val n = 1
val m = 1000
val rows = sc.parallelize(1 to n).map(i => Row(Array.fill(m)(random.nextInt(1000)): _*))
val struct = new StructType(Array.range(0, m, 1).map(i => StructField(s"c$i", IntegerType, true)))
val df = spark.createDataFrame(rows, struct)
df.persist()
df.count()

val inputCols = Array.range(0, m, 1).map(i => s"c$i")
val outputCols = Array.range(0, m, 1).map(i => s"c${i}_encoded")

val encoders = Array.range(0, m, 1).map(i => new OneHotEncoder().setInputCol(s"c$i").setOutputCol(s"c${i}_encoded"))
var duration = 0.0
for (i <- 0 until 10) {
  var encoded = df
  val start = System.nanoTime()
  encoders.foreach { encoder =>
    encoded = encoder.transform(encoded)
  }
  encoded.count
  val end = System.nanoTime()
  duration += (end - start) / 1e9
}
println(s"duration: ${duration / 10}")
```



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...

2017-10-19 Thread yssharma
Github user yssharma commented on the issue:

https://github.com/apache/spark/pull/18029
  
@brkyvz Please have a look once you have time. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

2017-10-19 Thread ron8hu
Github user ron8hu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19479#discussion_r145840719
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala
 ---
@@ -216,65 +218,61 @@ object ColumnStat extends Logging {
 }
   }
 
-  /**
-   * Constructs an expression to compute column statistics for a given 
column.
-   *
-   * The expression should create a single struct column with the 
following schema:
-   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, 
maxLen: Long
-   *
-   * Together with [[rowToColumnStat]], this function is used to create 
[[ColumnStat]] and
-   * as a result should stay in sync with it.
-   */
-  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
-def struct(exprs: Expression*): CreateNamedStruct = 
CreateStruct(exprs.map { expr =>
-  expr.transformUp { case af: AggregateFunction => 
af.toAggregateExpression() }
-})
-val one = Literal(1, LongType)
+  private def convertToHistogram(s: String): EquiHeightHistogram = {
+val idx = s.indexOf(",")
+if (idx <= 0) {
+  throw new AnalysisException("Failed to parse histogram.")
+}
+val height = s.substring(0, idx).toDouble
+val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r
+val buckets = pattern.findAllMatchIn(s).map { m =>
+  EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, 
m.group(3).toLong)
+}.toSeq
+EquiHeightHistogram(height, buckets)
+  }
 
-// the approximate ndv (num distinct value) should never be larger 
than the number of rows
-val numNonNulls = if (col.nullable) Count(col) else Count(one)
-val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
-val numNulls = Subtract(Count(one), numNonNulls)
-val defaultSize = Literal(col.dataType.defaultSize, LongType)
+}
 
-def fixedLenTypeStruct(castType: DataType) = {
-  // For fixed width types, avg size should be the same as max size.
-  struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), 
numNulls, defaultSize,
-defaultSize)
-}
+/**
+ * There are a few types of histograms in state-of-the-art estimation 
methods. E.g. equi-width
+ * histogram, equi-height histogram, frequency histogram (value-frequency 
pairs) and hybrid
+ * histogram, etc.
+ * Currently in Spark, we support equi-height histogram since it is good 
at handling skew
+ * distribution, and also provides reasonable accuracy in other cases.
+ * We can add other histograms in the future, which will make estimation 
logic more complicated.
+ * Because we will have to deal with computation between different types 
of histograms in some
+ * cases, e.g. for join columns.
+ */
+trait Histogram
 
-col.dataType match {
-  case dt: IntegralType => fixedLenTypeStruct(dt)
-  case _: DecimalType => fixedLenTypeStruct(col.dataType)
-  case dt @ (DoubleType | FloatType) => fixedLenTypeStruct(dt)
-  case BooleanType => fixedLenTypeStruct(col.dataType)
-  case DateType => fixedLenTypeStruct(col.dataType)
-  case TimestampType => fixedLenTypeStruct(col.dataType)
-  case BinaryType | StringType =>
-// For string and binary type, we don't store min/max.
-val nullLit = Literal(null, col.dataType)
-struct(
-  ndv, nullLit, nullLit, numNulls,
-  // Set avg/max size to default size if all the values are null 
or there is no value.
-  Coalesce(Seq(Ceil(Average(Length(col))), defaultSize)),
-  Coalesce(Seq(Cast(Max(Length(col)), LongType), defaultSize)))
-  case _ =>
-throw new AnalysisException("Analyzing column statistics is not 
supported for column " +
-s"${col.name} of data type: ${col.dataType}.")
-}
-  }
+/**
+ * Equi-height histogram represents column value distribution by a 
sequence of buckets. Each bucket
+ * has a value range and contains approximately the same number of rows.
+ * @param height number of rows in each bucket
+ * @param ehBuckets equi-height histogram buckets
+ */
+case class EquiHeightHistogram(height: Double, ehBuckets: 
Seq[EquiHeightBucket]) extends Histogram {
 
-  /** Convert a struct for column stats (defined in statExprs) into 
[[ColumnStat]]. */
-  def rowToColumnStat(row: InternalRow, attr: Attribute): ColumnStat = {
-ColumnStat(
-  distinctCount = BigInt(row.getLong(0)),
-  // for string/binary min/max, get should return null
-  min = Option(row.get(1, attr.dataType)),
-  max = Option(row.get(2, attr.dataType)),
-  nullCount = BigInt(row.getLong(3)

[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-19 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r145839008
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,439 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model, Transformer}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, 
HasInputCols, HasOutputCol, HasOutputCols}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, 
StructType}
+
+/** Private trait for params for OneHotEncoderEstimator and 
OneHotEncoderModel */
+private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid
--- End diff --

Can the old `OneHotEncoder` also inherit this trait?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-10-19 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r145834490
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala 
---
@@ -0,0 +1,439 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model, Transformer}
+import org.apache.spark.ml.attribute._
+import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, 
HasInputCols, HasOutputCol, HasOutputCols}
--- End diff --

`HasInputCol` and `HasOutputCol` are not needed; I think `Transformer` in the imports above is unused too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18664
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18664
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82915/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18664
  
**[Test build #82915 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82915/testReport)**
 for PR 18664 at commit 
[`f512deb`](https://github.com/apache/spark/commit/f512deb97f458043a6825bac8a44c1392f6c910b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19534
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82914/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...

2017-10-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19534
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19534
  
**[Test build #82914 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82914/testReport)**
 for PR 19534 at commit 
[`f8fcc35`](https://github.com/apache/spark/commit/f8fcc3560e087440c7618b33cc892f3feafd4a3a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

2017-10-19 Thread ron8hu
Github user ron8hu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19479#discussion_r145828713
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala
 ---
@@ -216,65 +218,61 @@ object ColumnStat extends Logging {
 }
   }
 
-  /**
-   * Constructs an expression to compute column statistics for a given 
column.
-   *
-   * The expression should create a single struct column with the 
following schema:
-   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, 
maxLen: Long
-   *
-   * Together with [[rowToColumnStat]], this function is used to create 
[[ColumnStat]] and
-   * as a result should stay in sync with it.
-   */
-  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
-def struct(exprs: Expression*): CreateNamedStruct = 
CreateStruct(exprs.map { expr =>
-  expr.transformUp { case af: AggregateFunction => 
af.toAggregateExpression() }
-})
-val one = Literal(1, LongType)
+  private def convertToHistogram(s: String): EquiHeightHistogram = {
+val idx = s.indexOf(",")
+if (idx <= 0) {
+  throw new AnalysisException("Failed to parse histogram.")
+}
+val height = s.substring(0, idx).toDouble
+val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r
+val buckets = pattern.findAllMatchIn(s).map { m =>
+  EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, 
m.group(3).toLong)
+}.toSeq
+EquiHeightHistogram(height, buckets)
+  }
 
-// the approximate ndv (num distinct value) should never be larger 
than the number of rows
-val numNonNulls = if (col.nullable) Count(col) else Count(one)
-val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
-val numNulls = Subtract(Count(one), numNonNulls)
-val defaultSize = Literal(col.dataType.defaultSize, LongType)
+}
 
-def fixedLenTypeStruct(castType: DataType) = {
-  // For fixed width types, avg size should be the same as max size.
-  struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), 
numNulls, defaultSize,
-defaultSize)
-}
+/**
+ * There are a few types of histograms in state-of-the-art estimation 
methods. E.g. equi-width
+ * histogram, equi-height histogram, frequency histogram (value-frequency 
pairs) and hybrid
+ * histogram, etc.
+ * Currently in Spark, we support equi-height histogram since it is good 
at handling skew
+ * distribution, and also provides reasonable accuracy in other cases.
+ * We can add other histograms in the future, which will make estimation 
logic more complicated.
+ * Because we will have to deal with computation between different types 
of histograms in some
--- End diff --

This is because we will have to 
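
(For context, the serialized form that `convertToHistogram` in this diff 
parses, as inferred from the regex above, is the bucket height before the 
first comma followed by one `Bucket(lower, upper, ndv)` per bucket, e.g. 
`25.0,Bucket(1.0, 10.0, 8)Bucket(10.0, 40.0, 20)`. The exact format is an 
inference, not confirmed in this thread.)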


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19530: [SPARK-22309][ML] Remove unused param in `LDAModel.getTo...

2017-10-19 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/19530
  
FYI, it looks like this is cleanup from removing a broadcast in #18152.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19486: [SPARK-22268][BUILD] Fix lint-java

2017-10-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19486
  
**[Test build #82916 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82916/testReport)**
 for PR 19486 at commit 
[`95a2d9e`](https://github.com/apache/spark/commit/95a2d9ef6e07c53fca9d37e526abc2ac2f178c67).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19530: [SPARK-22309][ML] Remove unused param in `LDAModel.getTo...

2017-10-19 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/19530
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19486: [SPARK-22268][BUILD] Fix lint-java

2017-10-19 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/19486
  
Updated


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


