[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18270 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19537: [SQL] Mark strategies with override for clarity.
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19537 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-a...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18933#discussion_r145888158 --- Diff: python/pyspark/sql/dataframe.py --- @@ -1760,6 +1760,17 @@ def toPandas(self): for f, t in dtype.items(): pdf[f] = pdf[f].astype(t, copy=False) + +if self.sql_ctx.getConf("spark.sql.execution.pandas.timeZoneAware", "false").lower() \ --- End diff -- We still need a conf, even if it is a bug. This is just to avoid breaking any existing app. We can remove the conf in Spark 3.x. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-a...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18933#discussion_r145888010 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -912,6 +912,14 @@ object SQLConf { .intConf .createWithDefault(1) + val PANDAS_TIMEZONE_AWARE = +buildConf("spark.sql.execution.pandas.timeZoneAware") + .internal() + .doc("When true, make Pandas DataFrame with timezone-aware timestamp type when converting " + +"by pyspark.sql.DataFrame.toPandas. The session local timezone is used for the timezone.") + .booleanConf + .createWithDefault(false) --- End diff -- We can change the default to `true`, since we agree that this is a bug. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
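As a concrete illustration of the conf being discussed (a minimal sketch, not the actual patch): the idea is that `toPandas()` only localizes timestamp columns to the session time zone when `spark.sql.execution.pandas.timeZoneAware` is enabled. The helper name and the assumption that the naive values represent UTC instants are mine, not from the PR.

```python
import pandas as pd

def localize_timestamps(pdf, spark):
    # gate the conversion behind the proposed internal conf (default stays "false")
    enabled = spark.conf.get("spark.sql.execution.pandas.timeZoneAware", "false").lower() == "true"
    if not enabled:
        return pdf
    tz = spark.conf.get("spark.sql.session.timeZone", "UTC")  # session-local time zone
    for col, dtype in pdf.dtypes.items():
        if str(dtype) == "datetime64[ns]":
            # assumption for this sketch: the naive timestamps are UTC instants
            pdf[col] = pdf[col].dt.tz_localize("UTC").dt.tz_convert(tz)
    return pdf
```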
[GitHub] spark issue #19519: [SPARK-21840][core] Add trait that allows conf to be dir...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19519 LGTM. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18664 BTW, we might need to resolve the comments in https://github.com/apache/spark/pull/18933 and merge that PR first. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18664 Before we merge this PR, could anybody submit a PR documenting this issue? Then we can get more feedback from the others. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19540 **[Test build #82924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82924/testReport)** for PR 19540 at commit [`08240f3`](https://github.com/apache/spark/commit/08240f368fdde6cf3ce02b50b7e7405a00f15fa4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19540 I think branch 2.2 also has similar issue when fetching resources from remote secure HDFS. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #10949: [SPARK-12832][MESOS] mesos scheduler respect agent attri...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/10949 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19540 ok to test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19540 Thanks for the fix! I didn't test on a secure cluster when I did the glob path support, so I didn't realize such an issue. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] group...
Github user ueshin closed the pull request at: https://github.com/apache/spark/pull/19505 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19505 Sure, I'll close this. @icexelloss Of course you can open a separate JIRA and another PR. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19485 This is the API link you referred to: `https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader@csv(paths:String*):org.apache.spark.sql.DataFrame` I just quickly scanned them. The option descriptions are pretty rough. They are made for advanced devs who read the API docs and play with them. In the long term, we should follow what the mainstream RDBMS reference manuals do. Something like - https://dev.mysql.com/doc/refman/5.5/en/creating-tables.html - https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/sqlref/src/tpc/db2z_sql_createtable.html - https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables003.htm#ADMIN01503 I prefer having something more human-friendly. The whole SQL doc needs a complete re-org. cc @jiangxb1987 Maybe you are the right person to take it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19523: [SPARK-22301][SQL] Add rule to Optimizer for In with emp...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19523 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19523: [SPARK-22301][SQL] Add rule to Optimizer for In with emp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19523 **[Test build #82923 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82923/testReport)** for PR 19523 at commit [`50c7af3`](https://github.com/apache/spark/commit/50c7af3d4fb9a23ccf460a1842d7e57a26ca582c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19523: [SPARK-22301][SQL] Add rule to Optimizer for In with emp...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19523 @mgaido91 Could you update the PR title? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19508: [SPARK-20783][SQL][Follow-up] Create ColumnVector to abs...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19508 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19517 **[Test build #82922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82922/testReport)** for PR 19517 at commit [`59d61a4`](https://github.com/apache/spark/commit/59d61a46a15b00f8af9ec8e2c6930853b7097b1c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19540: [SPARK-22319][Core] call loginUserFromKeytab before acce...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19540 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] group...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19517#discussion_r145878293 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala --- @@ -22,17 +22,26 @@ import org.apache.spark.sql.Column import org.apache.spark.sql.catalyst.expressions.Expression import org.apache.spark.sql.types.DataType +private[spark] object PythonUdfType { + // row-based UDFs --- End diff -- Sure, I'll update it, too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19540: [SPARK-22319][Core] call loginUserFromKeytab befo...
GitHub user sjrand opened a pull request: https://github.com/apache/spark/pull/19540 [SPARK-22319][Core] call loginUserFromKeytab before accessing hdfs ## What changes were proposed in this pull request? In `SparkSubmit`, call `loginUserFromKeytab` before attempting to make RPC calls to the NameNode. ## How was this patch tested? I manually tested this patch by: 1. Confirming that my Spark application failed to launch with the error reported in https://issues.apache.org/jira/browse/SPARK-22319. 2. Applying this patch and confirming that the app no longer fails to launch, even when I have not manually run `kinit` on the host. Presumably we also want integration tests for secure clusters so that we catch this sort of thing. I'm happy to take a shot at this if it's feasible and someone can point me in the right direction. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sjrand/spark SPARK-22319 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19540.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19540 commit 08240f368fdde6cf3ce02b50b7e7405a00f15fa4 Author: Steven Rand Date: 2017-10-19T04:51:26Z call loginUserFromKeytab before accessing hdfs --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] group...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19517#discussion_r145878275 --- Diff: python/pyspark/sql/functions.py --- @@ -2038,13 +2038,22 @@ def _wrap_function(sc, func, returnType): sc.pythonVer, broadcast_vars, sc._javaAccumulator) +class PythonUdfType(object): +# row-based UDFs --- End diff -- Sure, I'll update it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19272 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82919/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19272 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19272 **[Test build #82919 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82919/testReport)** for PR 19272 at commit [`837157d`](https://github.com/apache/spark/commit/837157d1c76d1b9488025e8c294eb2839aeee5e3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19485 I meant adding a new chapter describing options, removing duplication, for example here https://github.com/apache/spark/blob/73d80ec49713605d6a589e688020f0fc2d6feab2/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L513 and then leaving a link to the new chapter instead. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19505 @ueshin Maybe close this PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19517 LGTM. @ueshin Could you remove `[WIP]` from the title of this PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] group...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19517#discussion_r145877519 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala --- @@ -22,17 +22,26 @@ import org.apache.spark.sql.Column import org.apache.spark.sql.catalyst.expressions.Expression import org.apache.spark.sql.types.DataType +private[spark] object PythonUdfType { + // row-based UDFs --- End diff -- The same here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] group...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19517#discussion_r145877442 --- Diff: python/pyspark/sql/functions.py --- @@ -2038,13 +2038,22 @@ def _wrap_function(sc, func, returnType): sc.pythonVer, broadcast_vars, sc._javaAccumulator) +class PythonUdfType(object): +# row-based UDFs --- End diff -- Nit: Please update all `row-based UDFs` to `row-at-a-time UDFs` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19517 **[Test build #82921 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82921/testReport)** for PR 19517 at commit [`7e43bb4`](https://github.com/apache/spark/commit/7e43bb44bf4267ef6a719108d3edbf545eaba23d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19517: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19517 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19485 @HyukjinKwon I did not understand what your suggestion is. @jomach Any reason you closed this PR, or do you plan to open a new one? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19539: [WIP] [SQL] Remove unnecessary methods
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19539 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82920/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19539: [WIP] [SQL] Remove unnecessary methods
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19539 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19539: [WIP] [SQL] Remove unnecessary methods
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19539 **[Test build #82920 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82920/testReport)** for PR 19539 at commit [`19bb867`](https://github.com/apache/spark/commit/19bb867234a3127727ef5500d49e93628c6c1ba3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/19505 @viirya @cloud-fan I updated my original summary. I think it answers the `group_transform` question. I also added more examples to each type. @HyukjinKwon @viirya I agree we can move this to a separate JIRA and merge the current PR of @ueshin. Maybe I can open another PR with just the proposal design doc? Not sure what the best way is. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19524 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82918/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19524 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19524 **[Test build #82918 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82918/testReport)** for PR 19524 at commit [`de2a70b`](https://github.com/apache/spark/commit/de2a70b2279d214022b1d575270ff7a38764d26f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19528: [SPARK-20393][WEBU UI][1.6] Strengthen Spark to prevent ...
Github user ambauma commented on the issue: https://github.com/apache/spark/pull/19528 Believed fixed. Hard to say for sure without knowing the precise python and numpy versions the build is using. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19469: [SPARK-22243][DStreams]spark.yarn.jars reload from confi...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19469 Ah, I didn't realize there was a change in that PR. I agree we need a better solution. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19471 > No behavior change if there is no overlapped columns in data and partition schema. > The schema changed(partition columns go to the end) when reading file format data source with partition columns in data files. @cloud-fan Could you check why so many test cases failed? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
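For readers following the quoted summary, here is a minimal sketch (not a test from the PR) of where partition columns end up when reading back a partitioned data source; the PR's behavior change only concerns the case where the partition columns also appear inside the data files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# write a small partitioned dataset, then read it back
spark.range(4).selectExpr("id", "id % 2 AS p") \
    .write.mode("overwrite").partitionBy("p").parquet("/tmp/partition_demo")

df = spark.read.parquet("/tmp/partition_demo")
print(df.columns)  # partition column 'p' is appended after the data columns: ['id', 'p']
```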
[GitHub] spark issue #19528: [SPARK-20393][WEBU UI][1.6] Strengthen Spark to prevent ...
Github user ambauma commented on the issue: https://github.com/apache/spark/pull/19528 Able to duplicate. Working theory is that this is related to numpy 1.12.1. Here is my conda env: (spark-1.6) andrew@andrew-Inspiron-7559:~/git/spark$ conda list # packages in environment at /home/andrew/.conda/envs/spark-1.6: # ca-certificates 2017.08.26 h1d4fec5_0 certifi 2016.2.28 py34_0 intel-openmp 2018.0.0 h15fc484_7 libedit 3.1 heed3624_0 libffi 3.2.1 h4deb6c0_3 libgcc-ng 7.2.0 h7cc24e2_2 libgfortran 1.0 0 libstdcxx-ng 7.2.0 h7a57d05_2 mkl 2017.0.3 0 ncurses 6.0 h06874d7_1 numpy 1.12.1 py34_0 openblas 0.2.19 0 openssl 1.0.2l h077ae2c_5 pip 9.0.1 py34_1 python 3.4.5 0 readline 6.2 2 setuptools 27.2.0 py34_0 sqlite 3.13.0 0 tk 8.5.18 0 wheel 0.29.0 py34_0 xz 5.2.3 h2bcbf08_1 zlib 1.2.11 hfbfcf68_1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
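Since the exact interpreter and NumPy versions on the build machine are the open question here, a two-line check run inside the same environment (nothing PR-specific) pins them down:

```python
import sys
import numpy as np

print(sys.version)      # exact Python version running the tests
print(np.__version__)   # e.g. '1.12.1', the version suspected above
```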
[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19269 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19269: [SPARK-22026][SQL] data source v2 write path
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19269 Thanks! Merged to master. This is just the first commit of the data source v2 write protocol. More PRs are coming to further improve it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19269: [SPARK-22026][SQL] data source v2 write path
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19269 It sounds like the initialization stages are missing in the current protocol API design. We can do it later. LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19505 The group_transform UDFs look a bit weird to me. @icexelloss Can you explain the use case for them? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r145870061 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.v2 + +import org.apache.spark.{SparkException, TaskContext} +import org.apache.spark.internal.Logging +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} +import org.apache.spark.sql.catalyst.expressions.Attribute +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.SparkPlan +import org.apache.spark.sql.sources.v2.writer._ +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.Utils + +/** + * The logical plan for writing data into data source v2. + */ +case class WriteToDataSourceV2(writer: DataSourceV2Writer, query: LogicalPlan) extends LogicalPlan { + override def children: Seq[LogicalPlan] = Seq(query) + override def output: Seq[Attribute] = Nil +} + +/** + * The physical plan for writing data into data source v2. + */ +case class WriteToDataSourceV2Exec(writer: DataSourceV2Writer, query: SparkPlan) extends SparkPlan { + override def children: Seq[SparkPlan] = Seq(query) + override def output: Seq[Attribute] = Nil + + override protected def doExecute(): RDD[InternalRow] = { +val writeTask = writer match { + case w: SupportsWriteInternalRow => w.createInternalRowWriterFactory() + case _ => new RowToInternalRowDataWriterFactory(writer.createWriterFactory(), query.schema) +} + --- End diff -- Do we need to add a function call here for initialization or setup? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19534 @sitalkedia I have a very old similar PR #11205 , maybe you can refer to it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145865969 --- Diff: python/pyspark/sql/session.py --- @@ -510,6 +578,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr except Exception: has_pandas = False if has_pandas and isinstance(data, pandas.DataFrame): +if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \ +and len(data) > 0: +df = self._createFromPandasWithArrow(data, schema) --- End diff -- As of https://github.com/apache/spark/pull/19459#issuecomment-337674952, `schema` from `_parse_datatype_string` may not be a `StructType`: https://github.com/apache/spark/blob/bfc7e1fe1ad5f9777126f2941e29bbe51ea5da7c/python/pyspark/sql/tests.py#L1325 although, to my knowledge, I don't think we have supported this case with `pd.DataFrame`, as the `int` case resembles a `Dataset` with primitive types: ``` spark.createDataFrame(["a", "b"], "string").show() +-----+ |value| +-----+ |    a| |    b| +-----+ ``` For the `pd.DataFrame` case, it looks like we always have a list of lists. https://github.com/apache/spark/blob/d492cc5a21cd67b3999b85d97f5c41c3734b1ba3/python/pyspark/sql/session.py#L515 So I think we should only support a list of strings, maybe with a proper exception for the `int` case. Of course, this case should work: ``` >>> spark.createDataFrame(pd.DataFrame([1]), "struct").show() +---+ |  a| +---+ |  1| +---+ ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
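A sketch of the guard this comment is suggesting (helper name and exact rule are assumptions, not the final patch): take the Arrow path only when the schema is `None`, a list of column names, or something that parses to a `StructType`, and fall back otherwise.

```python
from pyspark.sql.types import StructType, _parse_datatype_string

def can_use_arrow_schema(schema):
    if schema is None or isinstance(schema, (list, tuple)):
        return True
    if isinstance(schema, str):
        # a DDL string like "a: bigint" parses to a StructType, but a string like
        # "string" parses to an atomic type, which the Arrow path should not handle
        schema = _parse_datatype_string(schema)
    return isinstance(schema, StructType)
```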
[GitHub] spark issue #19528: [SPARK-20393][WEBU UI][1.6] Strengthen Spark to prevent ...
Github user ambauma commented on the issue: https://github.com/apache/spark/pull/19528 Working on duplicating PySpark failures... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145863796 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,73 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_type, _cast_pandas_series_type +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) + +if schema is None or isinstance(schema, list): +batches = [pa.RecordBatch.from_pandas(pdf_slice, preserve_index=False) + for pdf_slice in pdf_slices] + +# There will be at least 1 batch after slicing the pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) + +# If passed schema as a list of names then rename fields +if isinstance(schema, list): +fields = [] +for i, field in enumerate(schema_from_arrow): +field.name = schema[i] +fields.append(field) +schema = StructType(fields) +else: +schema = schema_from_arrow +else: +batches = [] +for i, pdf_slice in enumerate(pdf_slices): + +# convert to series to pyarrow.Arrays to use mask when creating Arrow batches +arrs = [] +names = [] +for c, (_, series) in enumerate(pdf_slice.iteritems()): +field = schema[c] +names.append(field.name) +t = to_arrow_type(field.dataType) +try: +# NOTE: casting is not necessary with Arrow >= 0.7 + arrs.append(pa.Array.from_pandas(_cast_pandas_series_type(series, t), + mask=series.isnull(), type=t)) +except ValueError as e: --- End diff -- I think this guard only works to prevent casting like: ```python >>> s = pd.Series(["abc", "2", "10001"]) >>> s.astype(np.object_) 0 abc 12 210001 dtype: object >>> s 0 abc 12 210001 dtype: object >>> s.astype(np.int8) ... ValueError: invalid literal for long() with base 10: 'abc' ``` For the casting that can cause overflow, this seems don't work. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
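To make the overflow point concrete, a small standalone illustration (plus an explicit bounds check that would catch it; neither snippet is from the PR):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 10001], dtype=np.int16)
s.astype(np.int8)          # silently wraps to [1, 2, 17]; no ValueError is raised

# an explicit range check before casting would surface the problem instead
info = np.iinfo(np.int8)
if (s < info.min).any() or (s > info.max).any():
    raise ValueError("values out of range for the requested integer type")
```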
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19527 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82917/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19527 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19527 **[Test build #82917 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82917/testReport)** for PR 19527 at commit [`b42d175`](https://github.com/apache/spark/commit/b42d175ddc4928ec36718177702059ccf0bfbfea). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145862488 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,73 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_type, _cast_pandas_series_type +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) + +if schema is None or isinstance(schema, list): +batches = [pa.RecordBatch.from_pandas(pdf_slice, preserve_index=False) + for pdf_slice in pdf_slices] + +# There will be at least 1 batch after slicing the pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) + +# If passed schema as a list of names then rename fields +if isinstance(schema, list): +fields = [] +for i, field in enumerate(schema_from_arrow): +field.name = schema[i] +fields.append(field) +schema = StructType(fields) +else: +schema = schema_from_arrow +else: +batches = [] +for i, pdf_slice in enumerate(pdf_slices): + +# convert to series to pyarrow.Arrays to use mask when creating Arrow batches +arrs = [] +names = [] +for c, (_, series) in enumerate(pdf_slice.iteritems()): +field = schema[c] +names.append(field.name) +t = to_arrow_type(field.dataType) +try: +# NOTE: casting is not necessary with Arrow >= 0.7 + arrs.append(pa.Array.from_pandas(_cast_pandas_series_type(series, t), --- End diff -- Any chance that the given data type for the field is not correct? By a wrong data type of a field, what the behavior of this casting? For example, for a Series of numpy.int16`, if the given data type is bytetype, this casting will turn it to numpy.int8, so we will get: ```python >>> s = pd.Series([1, 2, 10001], dtype=np.int16) >>> s 01 12 210001 dtype: int16 >>> s.astype(np.int8) 0 1 1 2 217 dtype: int8 ``` `createDataFrame` will check the data type of input data when converting to DataFrame. This implicit type casting seems inconsistent with original behavior. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19539: [WIP] [SQL] Remove unnecessary methods
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19539 **[Test build #82920 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82920/testReport)** for PR 19539 at commit [`19bb867`](https://github.com/apache/spark/commit/19bb867234a3127727ef5500d49e93628c6c1ba3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19539: [WIP] [SQL] Remove unnecessary methods
GitHub user wzhfy opened a pull request: https://github.com/apache/spark/pull/19539 [WIP] [SQL] Remove unnecessary methods ## What changes were proposed in this pull request? Remove unnecessary methods. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wzhfy/spark remove_equals Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19539.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19539 commit 19bb867234a3127727ef5500d49e93628c6c1ba3 Author: Zhenhua Wang Date: 2017-10-20T01:31:27Z remove necessary methods --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19272 **[Test build #82919 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82919/testReport)** for PR 19272 at commit [`837157d`](https://github.com/apache/spark/commit/837157d1c76d1b9488025e8c294eb2839aeee5e3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19505 +1 for separate JIRA to clarify the proposal and +0 for 3. out of those three, too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19272: [Spark-21842][Mesos] Support Kerberos ticket rene...
Github user ArtRand commented on a diff in the pull request: https://github.com/apache/spark/pull/19272#discussion_r145861062 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala --- @@ -194,6 +198,27 @@ private[spark] class MesosCoarseGrainedSchedulerBackend( sc.conf.getOption("spark.mesos.driver.frameworkId").map(_ + suffix) ) +// check that the credentials are defined, even though it's likely that auth would have failed +// already if you've made it this far +if (principal != null && hadoopDelegationCreds.isDefined) { + logDebug(s"Principal found ($principal) starting token renewer") + val credentialRenewerThread = new Thread { +setName("MesosCredentialRenewer") +override def run(): Unit = { + val rt = MesosCredentialRenewer.getTokenRenewalTime(hadoopDelegationCreds.get, conf) + val credentialRenewer = +new MesosCredentialRenewer( + conf, + hadoopDelegationTokenManager.get, + MesosCredentialRenewer.getNextRenewalTime(rt), + driverEndpoint) + credentialRenewer.scheduleTokenRenewal() +} + } + + credentialRenewerThread.start() + credentialRenewerThread.join() --- End diff -- Yes, sorry, for some reason I understood you to mean the credential renewer itself. I added a comment to the same effect as the YARN analogue. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19469: [SPARK-22243][DStreams]spark.yarn.jars reload from confi...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19469 @felixcheung As you can see, there's a bunch of configurations that need to be added here in https://github.com/apache-spark-on-k8s/spark/pull/516, which is why I'm asking for a general solution to this kind of issue. I'm OK to merge this PR, but I suspect similar PRs will still be created in the future, since these issues are quite scenario-specific: users may have different scenarios and hit different issues regarding this. So I'm just wondering if we could have a better solution for this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145859471 --- Diff: python/pyspark/sql/session.py --- @@ -510,6 +578,12 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr except Exception: has_pandas = False if has_pandas and isinstance(data, pandas.DataFrame): +if self.conf.get("spark.sql.execution.arrow.enabled", "false").lower() == "true" \ +and len(data) > 0: +df = self._createFromPandasWithArrow(data, schema) +# Fallback to create DataFrame without arrow if return None +if df is not None: --- End diff -- Shall we show some log message to users in this case? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19505 @icexelloss The summary and proposal 3 look great. To prevent confusion, can you also put the usage of each function type in proposal 3? E.g., group_map is for `groupby().apply()`, transform is for `withColumn`, etc.? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
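To make that mapping concrete, here is a plain-pandas illustration of the semantics behind the two names (local pandas only, not the PySpark API under review; the names come from the linked proposal):

```python
import pandas as pd

pdf = pd.DataFrame({"id": [1, 1, 2, 2], "v": [1.0, 2.0, 3.0, 4.0]})

# "group_map"-style: the function sees one whole group as a DataFrame and returns a
# DataFrame, which is what groupby().apply() would feed it
demeaned = pdf.groupby("id").apply(lambda g: g.assign(v=g.v - g.v.mean()))

# "transform"-style: one output value per input value, shape preserved, so the result
# can be attached as a new column (the withColumn use case)
pdf["v_centered"] = pdf.groupby("id")["v"].transform(lambda s: s - s.mean())
```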
[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19505 Btw, I think the scope of this change is more than just a follow-up. Should we create another JIRA for it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145858362 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,73 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_type, _cast_pandas_series_type +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) + +if schema is None or isinstance(schema, list): +batches = [pa.RecordBatch.from_pandas(pdf_slice, preserve_index=False) + for pdf_slice in pdf_slices] + +# There will be at least 1 batch after slicing the pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) + +# If passed schema as a list of names then rename fields +if isinstance(schema, list): +fields = [] +for i, field in enumerate(schema_from_arrow): +field.name = schema[i] +fields.append(field) +schema = StructType(fields) +else: +schema = schema_from_arrow +else: +batches = [] +for i, pdf_slice in enumerate(pdf_slices): + +# convert to series to pyarrow.Arrays to use mask when creating Arrow batches +arrs = [] +names = [] +for c, (_, series) in enumerate(pdf_slice.iteritems()): +field = schema[c] +names.append(field.name) +t = to_arrow_type(field.dataType) +try: +# NOTE: casting is not necessary with Arrow >= 0.7 + arrs.append(pa.Array.from_pandas(_cast_pandas_series_type(series, t), + mask=series.isnull(), type=t)) +except ValueError as e: +warnings.warn("Arrow will not be used in createDataFrame: %s" % str(e)) +return None +batches.append(pa.RecordBatch.from_arrays(arrs, names)) + +# Verify schema of first batch, return None if not equal and fallback without Arrow +if i == 0: +schema_from_arrow = from_arrow_schema(batches[i].schema) +if schema != schema_from_arrow: +warnings.warn("Arrow will not be used in createDataFrame.\n" + --- End diff -- OK by me too but let's keep on our eyes on mailing list and JIRAs that complains about it in the future, and improve it next time if this sounds more important than we think here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19524 **[Test build #82918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82918/testReport)** for PR 19524 at commit [`de2a70b`](https://github.com/apache/spark/commit/de2a70b2279d214022b1d575270ff7a38764d26f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19524 Thanks for your review @shaneknapp. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19486: [SPARK-22268][BUILD] Fix lint-java
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19486 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19486: [SPARK-22268][BUILD] Fix lint-java
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19486 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19527 **[Test build #82917 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82917/testReport)** for PR 19527 at commit [`b42d175`](https://github.com/apache/spark/commit/b42d175ddc4928ec36718177702059ccf0bfbfea). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145856074 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data. + * Options are 'skip' (filter out rows with invalid data) or 'error' (throw an error). + * Default: "error" + * @group param + */ + @Since("2.3.0") + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data " + +"Options are 'skip' (filter out rows with invalid data) or error (throw an error).", + ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids)) + + setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID) + + /** + * Whether to drop the last category in the encoded vector (default: true) + * @group param + */ + @Since("2.3.0") + final val dropLast: BooleanParam = +new BooleanParam(this, "dropLast", "whether to drop the last category") + setDefault(dropLast -> true) + + /** @group getParam */ + @Since("2.3.0") + def getDropLast: Boolean = $(dropLast) +} + +/** + * A one-hot encoder that maps a column of category indices to a column of binary vectors, with + * at most a single one-value per row that indicates the input category index. + * For example with 5 categories, an input value of 2.0 would map to an output vector of + * `[0.0, 0.0, 1.0, 0.0]`. + * The last category is not included by default (configurable via `dropLast`), + * because it makes the vector entries sum up to one, and hence linearly dependent. + * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`. + * + * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories. + * The output vectors are sparse. 
+ * + * @see `StringIndexer` for converting categorical values into category indices + */ +@Since("2.3.0") +class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String) +extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("oneHotEncoder")) + + /** @group setParam */ + @Since("2.3.0") + def setInputCols(values: Array[String]): this.type = set(inputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setOutputCols(values: Array[String]): this.type = set(outputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setDropLast(value: Boolean): this.type = set(dropLast, value) + + /** @group setParam */ + @Since("2.3.0") + def setHandleInvalid(value: String): this.type = set(handleInvalid, value) + + @Since("2.3.0") + override def transformSchema(schema: StructType): S
[GitHub] spark issue #19486: [SPARK-22268][BUILD] Fix lint-java
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19486 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19486: [SPARK-22268][BUILD] Fix lint-java
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19486 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82916/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19486: [SPARK-22268][BUILD] Fix lint-java
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19486 **[Test build #82916 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82916/testReport)** for PR 19486 at commit [`95a2d9e`](https://github.com/apache/spark/commit/95a2d9ef6e07c53fca9d37e526abc2ac2f178c67). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user MrBago commented on the issue: https://github.com/apache/spark/pull/19439 @imatiach-msft just a few more comments. When I was looking over this, I realized that the Python and Scala namespaces are going to be a little different, e.g. `pyspark.ml.image.readImages` vs. `spark.ml.image.ImageSchema.readImages`. Should we use an `ImageSchema` class in Python as a namespace? Does Spark do that in other PySpark modules? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user MrBago commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r145843289 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,122 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import pyspark +from pyspark import SparkContext +from pyspark.sql.types import * +from pyspark.sql.types import Row, _create_row +from pyspark.sql import DataFrame +from pyspark.ml.param.shared import * +import numpy as np + +undefinedImageType = "Undefined" + +imageFields = ["origin", "height", "width", "nChannels", "mode", "data"] + +ocvTypes = { +undefinedImageType: -1, +"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24, +"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25, +"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, "CV_16UC4": 26, +"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, "CV_16SC4": 27, +"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, "CV_32SC4": 28, +"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, "CV_32FC4": 29, +"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, "CV_64FC4": 30 +} + +# DataFrame with a single column of images named "image" (nullable) +imageSchema = StructType(StructField("image", StructType([ +StructField(imageFields[0], StringType(), True), +StructField(imageFields[1], IntegerType(), False), +StructField(imageFields[2], IntegerType(), False), +StructField(imageFields[3], IntegerType(), False), +# OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields[4], StringType(), False), +# bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields[5], BinaryType(), False)]), True)) + + +def toNDArray(image): +""" +Converts an image to a one-dimensional array. + +:param image (object): The image to be converted +:rtype array: The image as a one-dimensional array + +.. versionadded:: 2.3.0 +""" +height = image.height +width = image.width +nChannels = image.nChannels +return np.ndarray( +shape=(height, width, nChannels), +dtype=np.uint8, +buffer=image.data, +strides=(width * nChannels, nChannels, 1)) + + +def toImage(array, origin="", mode=ocvTypes["CV_8UC3"]): --- End diff -- The interaction between mode and dtype here is unclear here. I think there are two sensible things to do. 1) determine the dtype from the mode, and use that to cast the image, ie: `toImage(a, mode="CV_8UC3") => bytearray(array.astype(dtype=np.uint8).ravel())` `toImage(a, mode="CV_32FC3") => bytearray(array.astype(dtype=np.float32).ravel())` 2) Remove the mode argument to the function and use `array.dtype` to infer the mode. Also we don't currently check that nChannels is consistent with the mode proved. A (H, W, 1) image will currently get assigned mode "CV_8UC3". 
I think it's fine to restrict the set of modes/dtypes we support, but again we should be careful not to return a malformed image row. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
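To make the first option above concrete, here is a minimal sketch, for illustration only: the `_modeToDtype` table, the `_toImageOption1` name, and the returned tuple are assumptions rather than the PR's code. The mode decides the numpy dtype used for the cast, and nChannels is validated against the mode before any bytes are produced, so a malformed row is rejected up front.

```python
import numpy as np

# Illustrative subset of OpenCV mode names -> numpy dtypes; the names mirror the
# ocvTypes keys quoted above, but this table is an assumption, not part of the PR.
_modeToDtype = {
    "CV_8UC1": np.uint8, "CV_8UC3": np.uint8, "CV_8UC4": np.uint8,
    "CV_32FC1": np.float32, "CV_32FC3": np.float32, "CV_32FC4": np.float32,
}


def _toImageOption1(array, origin="", mode="CV_8UC3"):
    """Option 1: derive the dtype from the mode and cast the array before serializing."""
    if mode not in _modeToDtype:
        raise ValueError("Unsupported OpenCV mode: %r" % mode)
    if array.ndim != 3:
        raise ValueError("Expected an array of shape (height, width, nChannels)")
    height, width, nChannels = array.shape
    # The supported mode names all end in the channel count, so check consistency.
    if nChannels != int(mode[-1]):
        raise ValueError("nChannels=%d is inconsistent with mode %r" % (nChannels, mode))
    data = bytearray(array.astype(_modeToDtype[mode]).ravel())
    # Field order follows imageFields; a real implementation would build a Row here.
    return (origin, height, width, nChannels, mode, data)
```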
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user MrBago commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r145842379 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,122 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import pyspark +from pyspark import SparkContext +from pyspark.sql.types import * +from pyspark.sql.types import Row, _create_row +from pyspark.sql import DataFrame +from pyspark.ml.param.shared import * +import numpy as np + +undefinedImageType = "Undefined" + +imageFields = ["origin", "height", "width", "nChannels", "mode", "data"] + +ocvTypes = { +undefinedImageType: -1, +"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24, +"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25, +"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, "CV_16UC4": 26, +"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, "CV_16SC4": 27, +"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, "CV_32SC4": 28, +"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, "CV_32FC4": 29, +"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, "CV_64FC4": 30 +} + +# DataFrame with a single column of images named "image" (nullable) +imageSchema = StructType(StructField("image", StructType([ +StructField(imageFields[0], StringType(), True), +StructField(imageFields[1], IntegerType(), False), +StructField(imageFields[2], IntegerType(), False), +StructField(imageFields[3], IntegerType(), False), +# OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields[4], StringType(), False), +# bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields[5], BinaryType(), False)]), True)) + + +def toNDArray(image): +""" +Converts an image to a one-dimensional array. + +:param image (object): The image to be converted +:rtype array: The image as a one-dimensional array + +.. versionadded:: 2.3.0 +""" +height = image.height +width = image.width +nChannels = image.nChannels +return np.ndarray( +shape=(height, width, nChannels), +dtype=np.uint8, --- End diff -- `dtype` and `strides` should depend on `image.mode`. I think it's fine to only support `uint8` modes for now, but can we add a check here to make sure we don't return a malformed array? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
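A rough sketch of the requested guard, assuming only the uint8 modes are supported for now and that `mode` is the string form used in this version of image.py; the `_uint8Modes` set, the helper name, and the data-length check are illustrative additions, not the PR's code.

```python
import numpy as np

# Assumed set of modes backed by uint8 data; anything else is rejected rather
# than silently reinterpreted as uint8.
_uint8Modes = {"CV_8U", "CV_8UC1", "CV_8UC2", "CV_8UC3", "CV_8UC4"}


def _toNDArrayChecked(image):
    if image.mode not in _uint8Modes:
        raise ValueError("Unsupported image mode %r; only uint8 modes are handled" % image.mode)
    height, width, nChannels = image.height, image.width, image.nChannels
    # Guard against a malformed row whose byte buffer cannot back the requested shape.
    if len(image.data) != height * width * nChannels:
        raise ValueError("data length does not match height * width * nChannels")
    return np.ndarray(
        shape=(height, width, nChannels),
        dtype=np.uint8,
        buffer=image.data,
        strides=(width * nChannels, nChannels, 1))
```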
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user MrBago commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r145845879 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,122 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import pyspark +from pyspark import SparkContext +from pyspark.sql.types import * +from pyspark.sql.types import Row, _create_row +from pyspark.sql import DataFrame +from pyspark.ml.param.shared import * +import numpy as np + +undefinedImageType = "Undefined" + +imageFields = ["origin", "height", "width", "nChannels", "mode", "data"] + +ocvTypes = { +undefinedImageType: -1, +"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24, +"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25, +"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, "CV_16UC4": 26, +"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, "CV_16SC4": 27, +"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, "CV_32SC4": 28, +"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, "CV_32FC4": 29, +"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, "CV_64FC4": 30 +} + +# DataFrame with a single column of images named "image" (nullable) +imageSchema = StructType(StructField("image", StructType([ +StructField(imageFields[0], StringType(), True), +StructField(imageFields[1], IntegerType(), False), +StructField(imageFields[2], IntegerType(), False), +StructField(imageFields[3], IntegerType(), False), +# OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields[4], StringType(), False), --- End diff -- I believe this was changed to IntegerType in scala. Is it possible to import this from scala so we don't need to define it in two places? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
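If the schema is defined only on the Scala side, one possible way to avoid the duplicate definition is to fetch it through the JVM gateway. This is a sketch under the assumption that the Scala object is `org.apache.spark.ml.image.ImageSchema` and exposes an `imageSchema` StructType; both names, and the helper `_image_schema_from_jvm`, are assumptions here.

```python
from pyspark import SparkContext
from pyspark.sql.types import _parse_datatype_json_string


def _image_schema_from_jvm():
    """Build the Python StructType from the (assumed) Scala-side definition."""
    sc = SparkContext._active_spark_context
    # Scala vals are exposed as no-arg methods through Py4J.
    jschema = sc._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema()
    return _parse_datatype_json_string(jschema.json())
```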
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145848280 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid --- End diff -- Should we? If we don't plan to add `HasHandleInvalid`, `HasInputCols` and `HasOutputCols` into existing `OneHotEncoder`, I think we can keep it as it is now. With this one hot encoder estimator added, we may want to deprecate the existing `OneHotEncoder`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145847823 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} --- End diff -- Yes, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19527 Benchmark against the existing one hot encoder. Because the existing encoder only needs to run `transform`, there is no fitting time.

Transforming (time in seconds):

numColumns | Existing one hot encoder
-- | --
1 | 0.2516055188
100 | 20.29175892115
1000 | 26242.039411932*

* Because ten iterations take too long to finish, I ran only one iteration for 1000 columns, but it already shows the scale.

Benchmark code:

```scala
import org.apache.spark.ml.feature._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import spark.implicits._

import scala.util.Random

val seed = 123L
val random = new Random(seed)
val n = 1
val m = 1000

val rows = sc.parallelize(1 to n).map(i => Row(Array.fill(m)(random.nextInt(1000)): _*))
val struct = new StructType(Array.range(0, m, 1).map(i => StructField(s"c$i", IntegerType, true)))
val df = spark.createDataFrame(rows, struct)
df.persist()
df.count()

val inputCols = Array.range(0, m, 1).map(i => s"c$i")
val outputCols = Array.range(0, m, 1).map(i => s"c${i}_encoded")
val encoders = Array.range(0, m, 1).map(i =>
  new OneHotEncoder().setInputCol(s"c$i").setOutputCol(s"c${i}_encoded"))

var duration = 0.0
for (i <- 0 until 10) {
  var encoded = df
  val start = System.nanoTime()
  encoders.foreach { encoder =>
    encoded = encoder.transform(encoded)
  }
  encoded.count
  val end = System.nanoTime()
  duration += (end - start) / 1e9
}
println(s"duration: ${duration / 10}")
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...
Github user yssharma commented on the issue: https://github.com/apache/spark/pull/18029 @brkyvz Please have a look once you have time. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19479#discussion_r145840719 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala --- @@ -216,65 +218,61 @@ object ColumnStat extends Logging { } } - /** - * Constructs an expression to compute column statistics for a given column. - * - * The expression should create a single struct column with the following schema: - * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long - * - * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and - * as a result should stay in sync with it. - */ - def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = { -def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr => - expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() } -}) -val one = Literal(1, LongType) + private def convertToHistogram(s: String): EquiHeightHistogram = { +val idx = s.indexOf(",") +if (idx <= 0) { + throw new AnalysisException("Failed to parse histogram.") +} +val height = s.substring(0, idx).toDouble +val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r +val buckets = pattern.findAllMatchIn(s).map { m => + EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, m.group(3).toLong) +}.toSeq +EquiHeightHistogram(height, buckets) + } -// the approximate ndv (num distinct value) should never be larger than the number of rows -val numNonNulls = if (col.nullable) Count(col) else Count(one) -val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls)) -val numNulls = Subtract(Count(one), numNonNulls) -val defaultSize = Literal(col.dataType.defaultSize, LongType) +} -def fixedLenTypeStruct(castType: DataType) = { - // For fixed width types, avg size should be the same as max size. - struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize, -defaultSize) -} +/** + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid + * histogram, etc. + * Currently in Spark, we support equi-height histogram since it is good at handling skew + * distribution, and also provides reasonable accuracy in other cases. + * We can add other histograms in the future, which will make estimation logic more complicated. + * Because we will have to deal with computation between different types of histograms in some + * cases, e.g. for join columns. + */ +trait Histogram -col.dataType match { - case dt: IntegralType => fixedLenTypeStruct(dt) - case _: DecimalType => fixedLenTypeStruct(col.dataType) - case dt @ (DoubleType | FloatType) => fixedLenTypeStruct(dt) - case BooleanType => fixedLenTypeStruct(col.dataType) - case DateType => fixedLenTypeStruct(col.dataType) - case TimestampType => fixedLenTypeStruct(col.dataType) - case BinaryType | StringType => -// For string and binary type, we don't store min/max. -val nullLit = Literal(null, col.dataType) -struct( - ndv, nullLit, nullLit, numNulls, - // Set avg/max size to default size if all the values are null or there is no value. 
- Coalesce(Seq(Ceil(Average(Length(col))), defaultSize)), - Coalesce(Seq(Cast(Max(Length(col)), LongType), defaultSize))) - case _ => -throw new AnalysisException("Analyzing column statistics is not supported for column " + -s"${col.name} of data type: ${col.dataType}.") -} - } +/** + * Equi-height histogram represents column value distribution by a sequence of buckets. Each bucket + * has a value range and contains approximately the same number of rows. + * @param height number of rows in each bucket + * @param ehBuckets equi-height histogram buckets + */ +case class EquiHeightHistogram(height: Double, ehBuckets: Seq[EquiHeightBucket]) extends Histogram { - /** Convert a struct for column stats (defined in statExprs) into [[ColumnStat]]. */ - def rowToColumnStat(row: InternalRow, attr: Attribute): ColumnStat = { -ColumnStat( - distinctCount = BigInt(row.getLong(0)), - // for string/binary min/max, get should return null - min = Option(row.get(1, attr.dataType)), - max = Option(row.get(2, attr.dataType)), - nullCount = BigInt(row.getLong(3)
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145839008 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid --- End diff -- Can the old `OneHotEncoder` also inherit this trait? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145834490 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} --- End diff -- `HasInputCol`, `HasOutputCol` not needed, also I think `Transformer` above --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18664 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18664 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82915/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18664 **[Test build #82915 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82915/testReport)** for PR 18664 at commit [`f512deb`](https://github.com/apache/spark/commit/f512deb97f458043a6825bac8a44c1392f6c910b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19534 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82914/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19534 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19534 **[Test build #82914 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82914/testReport)** for PR 19534 at commit [`f8fcc35`](https://github.com/apache/spark/commit/f8fcc3560e087440c7618b33cc892f3feafd4a3a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19479#discussion_r145828713 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala --- @@ -216,65 +218,61 @@ object ColumnStat extends Logging { } } - /** - * Constructs an expression to compute column statistics for a given column. - * - * The expression should create a single struct column with the following schema: - * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long - * - * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and - * as a result should stay in sync with it. - */ - def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = { -def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr => - expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() } -}) -val one = Literal(1, LongType) + private def convertToHistogram(s: String): EquiHeightHistogram = { +val idx = s.indexOf(",") +if (idx <= 0) { + throw new AnalysisException("Failed to parse histogram.") +} +val height = s.substring(0, idx).toDouble +val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r +val buckets = pattern.findAllMatchIn(s).map { m => + EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, m.group(3).toLong) +}.toSeq +EquiHeightHistogram(height, buckets) + } -// the approximate ndv (num distinct value) should never be larger than the number of rows -val numNonNulls = if (col.nullable) Count(col) else Count(one) -val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls)) -val numNulls = Subtract(Count(one), numNonNulls) -val defaultSize = Literal(col.dataType.defaultSize, LongType) +} -def fixedLenTypeStruct(castType: DataType) = { - // For fixed width types, avg size should be the same as max size. - struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize, -defaultSize) -} +/** + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid + * histogram, etc. + * Currently in Spark, we support equi-height histogram since it is good at handling skew + * distribution, and also provides reasonable accuracy in other cases. + * We can add other histograms in the future, which will make estimation logic more complicated. + * Because we will have to deal with computation between different types of histograms in some --- End diff -- This is because we will have to --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
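For readers less familiar with the term used in the quoted doc comment, here is a small Python-only illustration (not the Scala implementation above, and the function name and tuple layout are assumptions): an equi-height histogram cuts the sorted column values into buckets that each hold roughly the same number of rows, so heavily skewed regions end up with more, narrower buckets.

```python
# Conceptual sketch of building equi-height buckets from a column of values.
def equi_height_buckets(values, num_buckets):
    values = sorted(values)
    height = len(values) / num_buckets          # approximate rows per bucket
    buckets = []
    for i in range(num_buckets):
        lo = values[int(i * height)]
        hi = values[min(int((i + 1) * height) - 1, len(values) - 1)]
        ndv = len(set(values[int(i * height):int((i + 1) * height)]))
        buckets.append((lo, hi, ndv))           # (lower bound, upper bound, distinct count)
    return height, buckets

# Example:
# equi_height_buckets([1, 1, 1, 2, 5, 9, 9, 10, 50, 100], 5)
# -> (2.0, [(1, 1, 1), (1, 2, 2), (5, 9, 2), (9, 10, 2), (50, 100, 2)])
```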
[GitHub] spark issue #19530: [SPARK-22309][ML] Remove unused param in `LDAModel.getTo...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/19530 fyi it looks like this is cleanup from removing a broadcast in #18152 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19486: [SPARK-22268][BUILD] Fix lint-java
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19486 **[Test build #82916 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82916/testReport)** for PR 19486 at commit [`95a2d9e`](https://github.com/apache/spark/commit/95a2d9ef6e07c53fca9d37e526abc2ac2f178c67). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19530: [SPARK-22309][ML] Remove unused param in `LDAModel.getTo...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/19530 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19486: [SPARK-22268][BUILD] Fix lint-java
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/19486 Updated --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org