[GitHub] spark issue #20421: [SPARK-23112][DOC] Update ML migration guide with breaki...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20421 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20421: [SPARK-23112][DOC] Update ML migration guide with breaki...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86802/
[GitHub] spark issue #20421: [SPARK-23112][DOC] Update ML migration guide with breaki...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20421

**[Test build #86802 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86802/testReport)** for PR 20421 at commit [`4433d9c`](https://github.com/apache/spark/commit/4433d9cb70bd7a3257aef4e23f8c85f57c7999a6).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkSession ...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20404 LGTM. I'd like to leave this to @felixcheung to confirm setting the default session is okay or not (https://github.com/apache/spark/pull/20404#discussion_r164362178).
[GitHub] spark pull request #20430: [SPARK-23263][SQL] Create table stored as parquet...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20430#discussion_r164662154

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala ---
@@ -34,16 +34,12 @@ object CommandUtils extends Logging {

   /** Change statistics after changing data by commands. */
   def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
-    if (table.stats.nonEmpty) {
+    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
       val catalog = sparkSession.sessionState.catalog
-      if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
-        val newTable = catalog.getTableMetadata(table.identifier)
-        val newSize = CommandUtils.calculateTotalSize(sparkSession.sessionState, newTable)
-        val newStats = CatalogStatistics(sizeInBytes = newSize)
-        catalog.alterTableStats(table.identifier, Some(newStats))
-      } else {
-        catalog.alterTableStats(table.identifier, None)
--- End diff --

This seems to be the way table stats were previously cleared out. Don't we need that?
[GitHub] spark issue #20404: [SPARK-23228][PYSPARK] Add Python Created jsparkSession ...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/20404 Hi all, can you please review again, thanks!
[GitHub] spark issue #20428: [SPARK-23261] [PySpark] Rename Pandas UDFs
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20428 Let's update the PR description too.
[GitHub] spark issue #20431: [SPARK-23222][SQL] Make DataFrameRangeSuite not flaky
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20431 I didn't notice a significant difference. It is about 450~500 milliseconds.
[GitHub] spark issue #20421: [SPARK-23112][DOC] Update ML migration guide with breaki...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20421 **[Test build #86802 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86802/testReport)** for PR 20421 at commit [`4433d9c`](https://github.com/apache/spark/commit/4433d9cb70bd7a3257aef4e23f8c85f57c7999a6).
[GitHub] spark issue #20421: [SPARK-23112][DOC] Update ML migration guide with breaki...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/366/
[GitHub] spark issue #20421: [SPARK-23112][DOC] Update ML migration guide with breaki...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20421 Merged build finished. Test PASSed.
[GitHub] spark pull request #20427: [SPARK-23260][SPARK-23262][SQL] several data sour...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/20427#discussion_r164660069

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -23,7 +23,7 @@ import org.apache.spark.sql.sources.v2.reader._

 case class DataSourceV2Relation(
--- End diff --

Consider removing V2 in `DataSourceV2Relation` and `StreamingDataSourceV2Relation`?
[GitHub] spark issue #20432: [SPARK-23174][BUILD][PYTHON][FOLLOWUP] Add pycodestyle*....
Github user rekhajoshm commented on the issue: https://github.com/apache/spark/pull/20432 👍 LGTM @ueshin
[GitHub] spark issue #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer.commit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20386 **[Test build #86801 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86801/testReport)** for PR 20386 at commit [`42dc690`](https://github.com/apache/spark/commit/42dc69004ad37a5c4a5d8c96478a875ff4baed4e).
[GitHub] spark issue #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer.commit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20386 Merged build finished. Test PASSed.
[GitHub] spark issue #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer.commit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20386 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/365/
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20422 Merged build finished. Test FAILed.
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20422 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86792/
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20422

**[Test build #86792 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86792/testReport)** for PR 20422 at commit [`98ea6a7`](https://github.com/apache/spark/commit/98ea6a742143da803eb728c352e7424f504fabba).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20430: [SPARK-23263][SQL] Create table stored as parquet should...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20430 Merged build finished. Test PASSed.
[GitHub] spark issue #20430: [SPARK-23263][SQL] Create table stored as parquet should...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20430

**[Test build #86790 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86790/testReport)** for PR 20430 at commit [`08d31c0`](https://github.com/apache/spark/commit/08d31c0823e5f6c257b0917362c8e07b04702af2).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20430: [SPARK-23263][SQL] Create table stored as parquet should...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20430 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86790/
[GitHub] spark issue #20177: [SPARK-22954][SQL] Fix the exception thrown by Analyze c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20177 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86791/
[GitHub] spark issue #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immut...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20387 Don't we already have `table` in `DataFrameReader`? http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframereader#pyspark.sql.DataFrameReader.table http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader@table(tableName:String):org.apache.spark.sql.DataFrame
[GitHub] spark issue #20177: [SPARK-22954][SQL] Fix the exception thrown by Analyze c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20177 Merged build finished. Test PASSed.
[GitHub] spark issue #20177: [SPARK-22954][SQL] Fix the exception thrown by Analyze c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20177

**[Test build #86791 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86791/testReport)** for PR 20177 at commit [`4c86456`](https://github.com/apache/spark/commit/4c8645623f3b89c9f7b1bc7809c6b9f5a95d2389).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20332: [SPARK-23138][ML][DOC] Multiclass logistic regres...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20332
[GitHub] spark issue #20332: [SPARK-23138][ML][DOC] Multiclass logistic regression su...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/20332 Merged to master / branch-2.3. Thanks @sethah, and @WeichenXu123 for review.
[GitHub] spark pull request #20332: [SPARK-23138][ML][DOC] Multiclass logistic regres...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/20332#discussion_r164654897

--- Diff: docs/ml-classification-regression.md ---
@@ -111,10 +110,9 @@ Continuing the earlier example:

 [`LogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html)
 provides a summary for a
 [`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html).
-Currently, only binary classification is supported and the
-summary must be explicitly cast to
-[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
-Support for multiclass model summaries will be added in the future.
+In the case of binary classification, certain additional metrics are
--- End diff --

I'm ambivalent - I think it is fairly clear through the phrasing "additional metrics are available...", and in the API doc link provided.
[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20378 I like this idea, too, but it seems there are too many violating files, so we can't enable this for now. I'm wondering how we can encourage contributors to follow the style, hopefully automatically. Should we make a blacklist of the currently violating files and remove files from it as their style is fixed, or something?
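One hedged sketch of the blacklist idea discussed above (the file name and helper names are hypothetical, not Spark's actual tooling): lint only files that are not on the known-violations list, so new files must comply from day one while legacy files are grandfathered in until cleaned up.

```python
# Illustrative sketch of a lint blacklist; the entries below are hypothetical.
# In practice the blacklist would live in a checked-in text file and only shrink.
BLACKLIST = {
    "python/pyspark/legacy_module.py",
}

def files_to_check(all_files):
    """Return only the files that must pass the style check."""
    return [f for f in all_files if f not in BLACKLIST]

def prune_blacklist(blacklist, now_passing):
    """Drop files from the blacklist once their style has been fixed."""
    return blacklist - set(now_passing)
```

The gate then runs the checker (e.g. pydocstyle) over `files_to_check(...)` and fails the build on any violation, which nudges contributors toward compliance without blocking on the legacy backlog.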
[GitHub] spark issue #20343: [SPARK-23167][SQL] Add TPCDS queries v2.7 in TPCDSQueryS...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20343 I checked all the queries again and found that some queries (q6, q11, q20, q22, q24, q34, q35, q47, q49, q57, q64, q72, q74, q75, q78, q98) have only minor changes (see the comments pointing out the changes). So, how about directly applying these changes in `sql/core/src/test/resources/tpcds`?
[GitHub] spark issue #20343: [SPARK-23167][SQL] Add TPCDS queries v2.7 in TPCDSQueryS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20343 **[Test build #86800 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86800/testReport)** for PR 20343 at commit [`d04b087`](https://github.com/apache/spark/commit/d04b0872bcc02b5eadd309c560cda77ff1b8da0a).
[GitHub] spark issue #20343: [SPARK-23167][SQL] Add TPCDS queries v2.7 in TPCDSQueryS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20343 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/364/
[GitHub] spark issue #20343: [SPARK-23167][SQL] Add TPCDS queries v2.7 in TPCDSQueryS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20343 Merged build finished. Test PASSed.
[GitHub] spark issue #20433: [SPARK-23264][SQL] Support interval values without INTER...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20433 **[Test build #86799 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86799/testReport)** for PR 20433 at commit [`830cf8d`](https://github.com/apache/spark/commit/830cf8d014ae17ade5fd771ca98c8c846c93).
[GitHub] spark issue #20433: [SPARK-23264][SQL] Support interval values without INTER...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20433 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/363/
[GitHub] spark issue #20433: [SPARK-23264][SQL] Support interval values without INTER...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20433 Merged build finished. Test PASSed.
[GitHub] spark issue #20343: [SPARK-23167][SQL] Add TPCDS queries v2.7 in TPCDSQueryS...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20343 I opened a new pr to support `[date] + 14 days`: https://github.com/apache/spark/pull/20433
[GitHub] spark pull request #20361: [SPARK-23188][SQL] Make vectorized columar reader...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20361#discussion_r164650445

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -377,6 +377,12 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)

+  val PARQUET_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.parquet.batchSize")
--- End diff --

Still a question: is it possible to use the estimated memory size instead of the number of rows?
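A hedged sketch of the reviewer's suggestion above: derive the batch size in rows from a memory budget and an estimated per-row size, instead of configuring a fixed row count. All names and bounds here are illustrative assumptions, not Spark's actual configuration logic.

```python
def batch_rows(memory_budget_bytes, estimated_row_bytes,
               min_rows=128, max_rows=65536):
    """Rows per batch that fit the memory budget, clamped to a sane range.

    Illustrative only: the clamp bounds and the idea of estimating row width
    are assumptions for the sketch, not values taken from Spark.
    """
    if estimated_row_bytes <= 0:
        raise ValueError("estimated_row_bytes must be positive")
    rows = memory_budget_bytes // estimated_row_bytes
    # Clamp so a tiny budget still makes progress and a huge one stays bounded.
    return max(min_rows, min(rows, max_rows))
```

The trade-off is that a per-row byte estimate is needed up front (e.g. from the schema), which is why a plain row-count config is simpler even if less memory-aware.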
[GitHub] spark pull request #20433: [SPARK-23264][SQL] Support interval values withou...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/20433

[SPARK-23264][SQL] Support interval values without INTERVAL clauses

## What changes were proposed in this pull request?

This pr updates the parsing rules in `SqlBase.g4` to support a SQL query like the one below:

```
SELECT CAST('2017-08-04' AS DATE) + 1 days;
```

The current master cannot parse it, though other DBMS-like systems (e.g., Hive and MySQL) support the syntax. The syntax is also frequently used in the official TPC-DS queries.

## How was this patch tested?

Added tests in `SQLQuerySuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark SPARK-23264

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20433.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20433

commit 830cf8d014ae17ade5fd771ca98c8c846c93
Author: Takeshi Yamamuro
Date: 2018-01-30T06:15:35Z

    Fix
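For illustration only, here is what the proposed SQL expression computes, mirrored with Python's `datetime` (the PR itself changes Spark's ANTLR grammar, not any Python code; `add_days` is a hypothetical helper for the sketch):

```python
from datetime import date, timedelta

def add_days(d, n):
    """Rough Python analogue of the SQL expression CAST(<d> AS DATE) + <n> days."""
    return d + timedelta(days=n)

# SELECT CAST('2017-08-04' AS DATE) + 1 days  ->  DATE '2017-08-05'
result = add_days(date(2017, 8, 4), 1)
```

The PR's change is purely syntactic sugar: `+ 1 days` parses to the same interval arithmetic that `+ INTERVAL 1 DAY` already performs.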
[GitHub] spark pull request #20429: [SPARK-23157][SQL] Explain restriction on column ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20429
[GitHub] spark issue #20431: [SPARK-23222][SQL] Make DataFrameRangeSuite not flaky
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20431 does this significantly increase the test runtime?
[GitHub] spark issue #20428: [SPARK-23261] [PySpark] Rename Pandas UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20428 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/362/
[GitHub] spark issue #20428: [SPARK-23261] [PySpark] Rename Pandas UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20428 Merged build finished. Test PASSed.
[GitHub] spark issue #20428: [SPARK-23261] [PySpark] Rename Pandas UDFs
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20428 LGTM
[GitHub] spark issue #20429: [SPARK-23157][SQL] Explain restriction on column express...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20429 Thanks! Merged to master/2.3
[GitHub] spark issue #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer.commit ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20386 CC @rdblue @zsxwing @jose-torres @sameeragarwal
[GitHub] spark issue #20432: [SPARK-23174][BUILD][PYTHON][FOLLOWUP] Add pycodestyle*....
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20432 cc @rekhajoshm @HyukjinKwon
[GitHub] spark pull request #20424: [Spark-23240][python] Better error message when e...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20424#discussion_r164649608

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -191,7 +191,20 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String

       daemon = pb.start()

       val in = new DataInputStream(daemon.getInputStream)
-      daemonPort = in.readInt()
+      try {
+        daemonPort = in.readInt()
+      } catch {
+        case exc: EOFException =>
+          throw new IOException(s"No port number in $daemonModule's stdout")
+      }
+
+      // test that the returned port number is within a valid range.
+      // note: this does not cover the case where the port number
+      // is arbitrary data but is also coincidentally within range
+      if (daemonPort < 1 || daemonPort > 0xFFFF) {
--- End diff --

I mean, I left my sign-off because what we do is basically move the _same_ check (`java.net.InetSocketAddress.checkPort`) ahead, and the other change simply wraps an exception, `EOFException`. I think we are safe here in theory. I got your point about reserved ports, and now the condition has become narrower. I should check other things, like which error this case produced before and whether the current error message is nicer. Also, this does not seem to completely address the concerns about it. I was wondering if this is worth doing. If you strongly prefer this, I won't stand against it, but I may request a few more investigations.
[GitHub] spark issue #20428: [SPARK-23261] [PySpark] Rename Pandas UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20428 **[Test build #86798 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86798/testReport)** for PR 20428 at commit [`7a71c5a`](https://github.com/apache/spark/commit/7a71c5a294da230faf19965dc1d068adc3678411).
[GitHub] spark issue #20432: [SPARK-23174][BUILD][PYTHON][FOLLOWUP] Add pycodestyle*....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20432 Merged build finished. Test PASSed.
[GitHub] spark issue #20432: [SPARK-23174][BUILD][PYTHON][FOLLOWUP] Add pycodestyle*....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20432 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/361/
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r164649253

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochCoordinator.scala ---
@@ -148,7 +148,8 @@ private[continuous] class EpochCoordinator(
         logDebug(s"Epoch $epoch has received commits from all partitions. Committing globally.")
         // Sequencing is important here. We must commit to the writer before recording the commit
         // in the query, or we will end up dropping the commit if we restart in the middle.
-        writer.commit(epoch, thisEpochCommits.toArray)
+        thisEpochCommits.foreach(writer.add(_))
--- End diff --

is it possible to call `add` once the commit message arrives?
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r164648934

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
@@ -63,32 +65,30 @@
   DataWriterFactory createWriterFactory();

   /**
-   * Commits this writing job with a list of commit messages. The commit messages are collected from
-   * successful data writers and are produced by {@link DataWriter#commit()}.
+   * Handles a commit message produced by {@link DataWriter#commit()}.
--- End diff --

nit: `..., which is collected from a successful data writer on the executor side.`
[GitHub] spark issue #20432: [SPARK-23174][BUILD][PYTHON][FOLLOWUP] Add pycodestyle*....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20432 **[Test build #86797 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86797/testReport)** for PR 20432 at commit [`3fb3d78`](https://github.com/apache/spark/commit/3fb3d785a9b2497b6ec3b9ac9329db776568197c).
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r164648815

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
@@ -63,32 +65,30 @@
   DataWriterFactory createWriterFactory();

   /**
-   * Commits this writing job with a list of commit messages. The commit messages are collected from
-   * successful data writers and are produced by {@link DataWriter#commit()}.
+   * Handles a commit message produced by {@link DataWriter#commit()}.
    *
    * If this method fails (by throwing an exception), this writing job is considered to have been
-   * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination
-   * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it.
+   * failed, and {@link #abort()} would be called. The state of the destination
+   * is undefined and @{@link #abort()} may not be able to deal with it.
--- End diff --

Add some more comments saying that implementations should probably cache the commit messages and do the final step in #commit.
[GitHub] spark pull request #20432: [SPARK-23174][BUILD][PYTHON][FOLLOWUP] Add pycode...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/20432

[SPARK-23174][BUILD][PYTHON][FOLLOWUP] Add pycodestyle*.py to .gitignore file.

## What changes were proposed in this pull request?

This is a follow-up pr of #20338, which changed the downloaded file name of the python code style checker, but the name is not in the .gitignore file, so the file remains untracked by git after running the checker. This pr adds the file name to the .gitignore file.

## How was this patch tested?

Tested manually.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-23174/fup1

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20432.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20432

commit 3fb3d785a9b2497b6ec3b9ac9329db776568197c
Author: Takuya UESHIN
Date: 2018-01-30T06:03:19Z

    Add pycodestyle*.py to .gitignore file.
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r164648645

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
@@ -40,11 +40,13 @@
 * 1. Create a writer factory by {@link #createWriterFactory()}, serialize and send it to all the
 *    partitions of the input data(RDD).
 * 2. For each partition, create the data writer, and write the data of the partition with this
-*    writer. If all the data are written successfully, call {@link DataWriter#commit()}. If
-*    exception happens during the writing, call {@link DataWriter#abort()}.
-* 3. If all writers are successfully committed, call {@link #commit(WriterCommitMessage[])}. If
+*    writer. If all the data are written successfully, call {@link DataWriter#commit()}.
+*    On a writer being successfully committed, call {@link #add(WriterCommitMessage)} to
+*    handle its commit message.
+*    If exception happens during the writing, call {@link DataWriter#abort()}.
+* 3. If all writers are successfully committed, call {@link #commit()}. If

--- End diff --

If all the data writers finish successfully, and #add is successfully called for all the commit messages, Spark will call #commit. If any of the data writers failed, or any of the #add calls failed, or the job failed with an unknown reason, call #abort.
[GitHub] spark issue #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), unbounded...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20400 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86795/ Test PASSed.
[GitHub] spark issue #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), unbounded...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20400 Merged build finished. Test PASSed.
[GitHub] spark issue #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), unbounded...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20400 **[Test build #86795 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86795/testReport)** for PR 20400 at commit [`bbf8778`](https://github.com/apache/spark/commit/bbf8778a963a5e0b8de1b5ab1fddf4cafe13c180).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r164648356

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java ---
@@ -40,11 +40,13 @@
 * 1. Create a writer factory by {@link #createWriterFactory()}, serialize and send it to all the
 *    partitions of the input data(RDD).
 * 2. For each partition, create the data writer, and write the data of the partition with this
-*    writer. If all the data are written successfully, call {@link DataWriter#commit()}. If
-*    exception happens during the writing, call {@link DataWriter#abort()}.
-* 3. If all writers are successfully committed, call {@link #commit(WriterCommitMessage[])}. If
+*    writer. If all the data are written successfully, call {@link DataWriter#commit()}.

--- End diff --

If one data writer finishes successfully, the commit message will be sent back to the driver side and Spark will call #add.
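[Editor's note] Taken together, the review comments above describe a driver-side protocol: each successful data writer's commit message is passed to #add, implementations cache those messages, and the final job-level step happens only in #commit (with #abort for cleanup on any failure). The sketch below is a minimal Python model of that flow for illustration only; the real API is the Java `DataSourceV2Writer` interface under review, and these names merely mirror it.

```python
class CachingWriter:
    """Illustrative stand-in for a DataSourceV2Writer implementation.

    Commit messages are cached in add() and acted on only in commit(),
    as the review comments suggest documenting.
    """

    def __init__(self):
        self._messages = []
        self.committed = None  # set once commit() succeeds

    def add(self, message):
        # Called on the driver once per successfully committed data writer.
        self._messages.append(message)

    def commit(self):
        # Final step: act on all cached commit messages at once.
        self.committed = list(self._messages)

    def abort(self):
        # Best-effort cleanup; the destination state is undefined here.
        self._messages.clear()


def run_job(writer, partition_messages):
    """Drive the protocol: add each message, then commit; abort on any failure."""
    try:
        for msg in partition_messages:
            writer.add(msg)
        writer.commit()
    except Exception:
        writer.abort()
        raise
```

The point of the caching design is that #commit sees every writer's message together, so the sink can be published atomically instead of piecemeal as messages arrive.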
[GitHub] spark issue #20431: [SPARK-23222][SQL] Make DataFrameRangeSuite not flaky
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20431 **[Test build #86796 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86796/testReport)** for PR 20431 at commit [`9a4a484`](https://github.com/apache/spark/commit/9a4a4842b3f8281e73e564f4dfdad92017630760).
[GitHub] spark issue #20431: [SPARK-23222][SQL] Make DataFrameRangeSuite not flaky
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20431 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/360/ Test PASSed.
[GitHub] spark issue #20431: [SPARK-23222][SQL] Make DataFrameRangeSuite not flaky
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20431 Merged build finished. Test PASSed.
[GitHub] spark issue #20431: [SPARK-23222][SQL] Make DataFrameRangeSuite not flaky
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20431 cc @vanzin @cloud-fan
[GitHub] spark pull request #20431: [SPARK-23222][SQL] Make DataFrameRangeSuite not f...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/20431

[SPARK-23222][SQL] Make DataFrameRangeSuite not flaky

## What changes were proposed in this pull request?

It is reported that the test `Cancelling stage in a query with Range` in `DataFrameRangeSuite` fails a few times in unrelated PRs. I personally saw it too in my PR. This test is actually not very flaky; it only fails occasionally. Based on how the test works, my guess is that `range` finishes before the listener calls `cancelStage`.

I increase the range number from `10L` to `1000L` and count the range in one partition. I also reduce the `interval` of checking the stage id. Hopefully this makes the test no longer flaky.

## How was this patch tested?

The modified tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-23222

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20431.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20431

commit 9a4a4842b3f8281e73e564f4dfdad92017630760
Author: Liang-Chi Hsieh
Date: 2018-01-30T05:49:01Z

    Make DataFrameRangeSuite not flaky.
[GitHub] spark pull request #20424: [Spark-23240][python] Better error message when e...
Github user bersprockets commented on a diff in the pull request: https://github.com/apache/spark/pull/20424#discussion_r164646512

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -191,7 +191,20 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String

         daemon = pb.start()

         val in = new DataInputStream(daemon.getInputStream)
-        daemonPort = in.readInt()
+        try {
+          daemonPort = in.readInt()
+        } catch {
+          case exc: EOFException =>
+            throw new IOException(s"No port number in $daemonModule's stdout")
+        }
+
+        // test that the returned port number is within a valid range.
+        // note: this does not cover the case where the port number
+        // is arbitrary data but is also coincidentally within range
+        if (daemonPort < 1 || daemonPort > 0xffff) {

--- End diff --

Sorry @HyukjinKwon, I didn't quite get the last point. Could you rephrase?
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r164646157

--- Diff: core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+private[spark] class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)
+  extends Logging {
+
+  private var virtualEnvType = conf.get("spark.pyspark.virtualenv.type", "native")
+  private var virtualEnvPath = conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by launcher when user want to use virtualenv in pyspark shell. Launcher need this class
+  // to create virtualenv for driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: java.lang.Boolean) {

--- End diff --

It is used by the launcher module, which doesn't depend on Scala.
[GitHub] spark issue #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), unbounded...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20400 **[Test build #86795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86795/testReport)** for PR 20400 at commit [`bbf8778`](https://github.com/apache/spark/commit/bbf8778a963a5e0b8de1b5ab1fddf4cafe13c180).
[GitHub] spark issue #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), unbounded...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20400 Merged build finished. Test PASSed.
[GitHub] spark issue #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), unbounded...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20400 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/359/ Test PASSed.
[GitHub] spark pull request #20424: [Spark-23240][python] Better error message when e...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20424#discussion_r164643325

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -191,7 +191,20 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String

         daemon = pb.start()

         val in = new DataInputStream(daemon.getInputStream)
-        daemonPort = in.readInt()
+        try {
+          daemonPort = in.readInt()
+        } catch {
+          case exc: EOFException =>
+            throw new IOException(s"No port number in $daemonModule's stdout")
+        }
+
+        // test that the returned port number is within a valid range.
+        // note: this does not cover the case where the port number
+        // is arbitrary data but is also coincidentally within range
+        if (daemonPort < 1 || daemonPort > 0xffff) {

--- End diff --

I am saying we can just safely move the same check and fail fast, which is simple and theoretically safe.
[GitHub] spark pull request #20424: [Spark-23240][python] Better error message when e...
Github user bersprockets commented on a diff in the pull request: https://github.com/apache/spark/pull/20424#discussion_r164642503

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -191,7 +191,20 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String

         daemon = pb.start()

         val in = new DataInputStream(daemon.getInputStream)
-        daemonPort = in.readInt()
+        try {
+          daemonPort = in.readInt()
+        } catch {
+          case exc: EOFException =>

--- End diff --

I see, `exc` is not used.
[GitHub] spark issue #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer.commit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20386 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86788/ Test PASSed.
[GitHub] spark issue #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer.commit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20386 Merged build finished. Test PASSed.
[GitHub] spark issue #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer.commit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20386 **[Test build #86788 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86788/testReport)** for PR 20386 at commit [`7a677fd`](https://github.com/apache/spark/commit/7a677fd63338cdfca4f1406ee9a5a7c45df42521).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20424: [Spark-23240][python] Better error message when extraneo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20424 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86787/ Test PASSed.
[GitHub] spark issue #20424: [Spark-23240][python] Better error message when extraneo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20424 Merged build finished. Test PASSed.
[GitHub] spark issue #20428: [SPARK-23261] [PySpark] Rename Pandas UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20428 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86794/ Test FAILed.
[GitHub] spark issue #20428: [SPARK-23261] [PySpark] Rename Pandas UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20428 **[Test build #86794 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86794/testReport)** for PR 20428 at commit [`9a4aada`](https://github.com/apache/spark/commit/9a4aada3aafc0fcb06f06a39ce996ec9751ae0ac).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20424: [Spark-23240][python] Better error message when extraneo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20424 **[Test build #86787 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86787/testReport)** for PR 20424 at commit [`a1cb1a8`](https://github.com/apache/spark/commit/a1cb1a89d679840845142facf58f15e870f7c81d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20428: [SPARK-23261] [PySpark] Rename Pandas UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20428 Merged build finished. Test FAILed.
[GitHub] spark issue #20428: [SPARK-23261] [PySpark] Rename Pandas UDFs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20428 **[Test build #86794 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86794/testReport)** for PR 20428 at commit [`9a4aada`](https://github.com/apache/spark/commit/9a4aada3aafc0fcb06f06a39ce996ec9751ae0ac).
[GitHub] spark issue #20428: [SPARK-23261] [PySpark] Rename Pandas UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20428 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/358/ Test PASSed.
[GitHub] spark pull request #20424: [Spark-23240][python] Better error message when e...
Github user bersprockets commented on a diff in the pull request: https://github.com/apache/spark/pull/20424#discussion_r164641641

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -191,7 +191,20 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String

         daemon = pb.start()

         val in = new DataInputStream(daemon.getInputStream)
-        daemonPort = in.readInt()
+        try {
+          daemonPort = in.readInt()
+        } catch {
+          case exc: EOFException =>
+            throw new IOException(s"No port number in $daemonModule's stdout")
+        }
+
+        // test that the returned port number is within a valid range.
+        // note: this does not cover the case where the port number
+        // is arbitrary data but is also coincidentally within range
+        if (daemonPort < 1 || daemonPort > 0xffff) {

--- End diff --

Port 0 has a special meaning: a program passes port 0 when it wants the system to choose an unused port on the program's behalf. So the daemon should not return 0. It's valid to pass port 0 to InetSocketAddress, since you might be asking the system to assign a port for you.

However, following my own logic, the code in my pull request really should be checking for the range 49152-65535 (the ephemeral range) instead of 1-65535, but I didn't have the nerve to make it that restrictive.
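[Editor's note] The validation being reviewed can be mirrored in a few lines of Python. The sketch below is illustrative only (the actual code is the Scala in the diff above; `read_daemon_port` is a hypothetical helper name): it reads a 4-byte big-endian integer the way Java's `DataInputStream.readInt` does, then applies the same 1 to 65535 range check, rejecting port 0 because that value means "let the OS pick".

```python
import io
import struct


def read_daemon_port(stream):
    """Read and range-check the port number a daemon wrote to its stdout."""
    data = stream.read(4)
    if len(data) < 4:
        # Mirrors the EOFException -> IOException translation in the diff.
        raise IOError("No port number in daemon's stdout")
    # Big-endian signed 32-bit int, like java.io.DataInputStream.readInt.
    (port,) = struct.unpack(">i", data)
    # Port 0 means "ask the OS to assign a port", so a daemon must never
    # report it; anything above 0xffff cannot be a TCP port at all.
    if port < 1 or port > 0xFFFF:
        raise IOError("Bad port number in daemon's stdout: 0x%08x"
                      % (port & 0xFFFFFFFF))
    return port


# Example: a daemon that wrote port 5000 followed by other output.
port = read_daemon_port(io.BytesIO(struct.pack(">i", 5000) + b"extra"))
```

As the review note says, this cannot catch arbitrary junk that happens to decode to an in-range integer; it only fails fast on the common corruption cases.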
[GitHub] spark issue #20428: [SPARK-23261] [PySpark] Rename Pandas UDFs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20428 Merged build finished. Test PASSed.
[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...
Github user rekhajoshm commented on the issue: https://github.com/apache/spark/pull/20378

Thanks @HyukjinKwon for your update. @HyukjinKwon @holdenk @ueshin @viirya @icexelloss @felixcheung @BryanCutler and @MrBago - while you are thinking on it, below is my analysis. As I understand, there are two things the jira "seems" to be calling out. Please validate:

1. Doctest strings must be correctly formatted.
2. Doctests must NOT be included in docs?

Working on it, I found that docstring style itself was not enforced at all, and that includes doctest style. Another aspect seems to be excluding doctests from the documentation (_build/html, once generated). I am not certain of the reasoning behind this exclusion, or whether it is indeed what is additionally intended in SPARK-11222. The jira subject and description say two different things, so maybe validate that understanding?

Meanwhile I had a look into/tested different configurations of epytext/Sphinx extensions to see if we can suppress doctests in docs via them. I played with the RULES set in epytext.py. As per the epytext manual, http://epydoc.sourceforge.net/manual-epytext.html, it ensures only correctly formatted doctests are rendered, which means doctests would fail to appear in docs only if they were incorrectly formatted. This seemed wrong, and even contrary to the goal. So I played around with the Sphinx extensions 'sphinx.ext.doctest' and 'sphinx.ext.napoleon', and in conf.py:

    # Napoleon settings
    napoleon_google_docstring = True
    # Doctest settings
    doctest_test_doctest_blocks = ''
    trim_doctest_flags = True

None of the options tried got me to suppress doctests in docs (_build/html) once the build is done. Thanks for thinking this over.
[GitHub] spark issue #20427: [SPARK-23260][SPARK-23262][SQL] several data source v2 n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20427 **[Test build #86793 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86793/testReport)** for PR 20427 at commit [`b4fdbbe`](https://github.com/apache/spark/commit/b4fdbbe265943012093fbc0f54e8b22184fa2987).
[GitHub] spark issue #20427: [SPARK-23260][SPARK-23262][SQL] several data source v2 n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20427 Merged build finished. Test PASSed.
[GitHub] spark issue #20427: [SPARK-23260][SPARK-23262][SQL] several data source v2 n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20427 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/357/ Test PASSed.
[GitHub] spark issue #20427: [SPARK-23260][SPARK-23262][SQL] several data source v2 n...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20427 Retest this please
[GitHub] spark pull request #20424: [Spark-23240][python] Better error message when e...
Github user bersprockets commented on a diff in the pull request: https://github.com/apache/spark/pull/20424#discussion_r164640285

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -191,7 +191,20 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String

         daemon = pb.start()

         val in = new DataInputStream(daemon.getInputStream)
-        daemonPort = in.readInt()
+        try {
+          daemonPort = in.readInt()
+        } catch {
+          case exc: EOFException =>
+            throw new IOException(s"No port number in $daemonModule's stdout")
+        }
+
+        // test that the returned port number is within a valid range.
+        // note: this does not cover the case where the port number
+        // is arbitrary data but is also coincidentally within range
+        if (daemonPort < 1 || daemonPort > 0xffff) {
+          throw new IOException(s"Bad port number in $daemonModule's stdout: " +
+            f"0x$daemonPort%08x")

--- End diff --

Yes, that makes sense. I might not be able to get it as clear as the exact path to the file, since PythonWorkerFactory sets a PYTHONPATH environment variable and then lets Python itself figure out where on those paths the module actually lives. But I could tell the user how the PYTHONPATH was set up (in a generic sense, without using any shell's syntax) and then how the python command was subsequently run.
[GitHub] spark issue #20427: [SPARK-23260][SPARK-23262][SQL] several data source v2 n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20427 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86789/ Test FAILed.
[GitHub] spark issue #20427: [SPARK-23260][SPARK-23262][SQL] several data source v2 n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20427 Merged build finished. Test FAILed.
[GitHub] spark issue #20427: [SPARK-23260][SPARK-23262][SQL] several data source v2 n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20427 **[Test build #86789 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86789/testReport)** for PR 20427 at commit [`b4fdbbe`](https://github.com/apache/spark/commit/b4fdbbe265943012093fbc0f54e8b22184fa2987).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20424: [Spark-23240][python] Better error message when e...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/20424#discussion_r164637553

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -191,7 +191,20 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String

         daemon = pb.start()

         val in = new DataInputStream(daemon.getInputStream)
-        daemonPort = in.readInt()
+        try {
+          daemonPort = in.readInt()
+        } catch {
+          case exc: EOFException =>
+            throw new IOException(s"No port number in $daemonModule's stdout")
+        }
+
+        // test that the returned port number is within a valid range.
+        // note: this does not cover the case where the port number
+        // is arbitrary data but is also coincidentally within range
+        if (daemonPort < 1 || daemonPort > 0xffff) {
+          throw new IOException(s"Bad port number in $daemonModule's stdout: " +
+            f"0x$daemonPort%08x")

--- End diff --

Just a thought: this error message won't mean much to the typical user. Would it be sensible to tell the user exactly what Python command to run themselves to figure out the problem? E.g. "unexpected stdout from /foo/bar/some/path/to/python -m /path/to/daemon.py". That's what would help with that sitecustomization.py case. Or is it not useful in general?
[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20378

Hey @holdenk, @ueshin, @viirya, @icexelloss, @felixcheung, @BryanCutler and @MrBago. What do you guys think about checking docstrings against the list above? I think this could prevent nitpicking, and the idea itself seems good. One vague concern is that it might make backporting super hard.
[GitHub] spark pull request #20422: [SPARK-23253][Core][Shuffle]Only write shuffle te...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/20422#discussion_r164635886

--- Diff: core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala ---
@@ -166,8 +153,20 @@ private[spark] class IndexShuffleBlockResolver(
         if (dataTmp != null && dataTmp.exists()) {
           dataTmp.delete()
         }
-        indexTmp.delete()
       } else {
+        val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexTmp)))

--- End diff --

Move this below the comment "This is the first successful attempt". I'd also include a comment about why we write to a temporary file even though we're always going to rename it (because in case the task dies somehow, we'd prefer not to leave a half-written index file in the final location).
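[Editor's note] The rationale squito asks to document, write to a temporary file and rename so that a dying task never leaves a half-written index file at the final location, is the standard atomic-publish pattern. A minimal Python sketch of the same idea follows (the function and file names are hypothetical; the real code is the Scala resolver in the diff above):

```python
import os
import tempfile


def write_index_atomically(final_path, payload):
    """Write payload to a temp file in the same directory, then rename it.

    os.replace is atomic on POSIX when source and destination are on the
    same filesystem, so readers observe either the old complete file or
    the new complete file -- never a half-written one.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    # The temp file must live in the same directory (hence same filesystem)
    # for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(payload)
        os.replace(tmp_path, final_path)  # atomic publish
    except Exception:
        # If anything failed before the rename, clean up the temp file so
        # we don't leak partial output.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process dies mid-write, only the `.tmp` file is left behind; the final path is never observed in a partial state, which is exactly the property the review comment wants called out in the code.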
[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20378

So, it seems we got:

```
First line should end with a period.                          293
Multiline docstring should end with 1 blank line.             279
Blank line missing after one-line summary.                    265
Return value type should be mentioned.                        141
All modules should have docstrings.                           109
One-liner docstrings should fit on one line with quotes.       91
First line should be in imperative mood ('Do', not 'Does').    87
Exported definitions should have docstrings.                   61
Class docstring should have 1 blank line around them.          35
Use r\"\"\" if any backslashes in your docstrings.             19
The entire docstring should be indented same as code.           6
Exported classes should have docstrings.                        1
No blank line before docstring in definitions.                  1
```

I think we can take in:

```
First line should end with a period.                          293
Multiline docstring should end with 1 blank line.             279
Blank line missing after one-line summary.                    265
The entire docstring should be indented same as code.           6
Use \"\"\"triple double quotes\"\"\".                           3  # this seems only in heapq3.py where we ignore pep8.
No blank line before docstring in definitions.                  1
```

Not sure on:

```
Exported definitions should have docstrings.                   61
Exported classes should have docstrings.                        1
```

and take out:

```
Return value type should be mentioned.                        141
All modules should have docstrings.                           109
One-liner docstrings should fit on one line with quotes.       91
First line should be in imperative mood ('Do', not 'Does').    87
Class docstring should have 1 blank line around them.          35
Use r\"\"\" if any backslashes in your docstrings.             19
```

Also, I think we can take out `cloudpickle.py`, `heapq3.py`, `shared.py`, `python/docs/conf.py`, `work/*/*.py` and `python/.eggs/*`, as we do in pep8.
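[Editor's note] Tallies like the ones above can be produced mechanically by counting a docstring checker's messages. A small illustrative sketch (the message strings are taken from the comment; the checker run that produced them is assumed to have happened upstream and is not reproduced here):

```python
from collections import Counter

# Messages as a docstring checker might emit them, one per violation.
# In practice this list would come from parsing the checker's output.
reported = [
    "First line should end with a period.",
    "Multiline docstring should end with 1 blank line.",
    "First line should end with a period.",
    "Blank line missing after one-line summary.",
    "First line should end with a period.",
]

counts = Counter(reported)

# Print a frequency summary, most common first, like the one in the comment.
for message, n in counts.most_common():
    print("%-55s %4d" % (message, n))
```

Sorting by frequency makes it easy to decide which checks to enable first: the top entries are the rules whose adoption touches the most code.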