[GitHub] spark issue #21618: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21618 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2953/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21618: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21618 Merged build finished. Test PASSed.
[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21618#discussion_r216147915

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -656,6 +656,25 @@ object SQLConf {
           .intConf
           .createWithDefault(1)
     
    +  val PARALLEL_GET_GLOBBED_PATH_THRESHOLD =
    +    buildConf("spark.sql.sources.parallelGetGlobbedPath.threshold")
    +      .doc("The maximum number of subfiles or directories allowed after a globbed path " +
    +        "expansion.")
    +      .intConf
    +      .checkValue(threshold => threshold >= 0, "The maximum number of subfiles or directories " +
--- End diff --

Maybe we should keep this public? Parallel expansion is only enabled when the thread number is > 0.
[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21618#discussion_r216147921

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -724,4 +726,37 @@ object DataSource extends Logging {
              """.stripMargin)
         }
       }
    +
    +  /**
    +   * Return all paths represented by the wildcard string.
    +   * This will be done in main thread by default while the value of config
    +   * `spark.sql.sources.parallelGetGlobbedPath.numThreads` > 0, a local thread
    +   * pool will expand the globbed paths.
--- End diff --

Thanks, done in 1319cd3.
[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21618#discussion_r216147919

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -724,4 +726,37 @@ object DataSource extends Logging {
              """.stripMargin)
         }
       }
    +
    +  /**
    +   * Return all paths represented by the wildcard string.
    +   * This will be done in main thread by default while the value of config
    +   * `spark.sql.sources.parallelGetGlobbedPath.numThreads` > 0, a local thread
    +   * pool will expand the globbed paths.
    +   */
    +  private def getGlobbedPaths(
--- End diff --

Thanks, that's clearer, done in 1319cd3.
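The behavior described in the doc comment above (expand on the calling thread by default, use a local thread pool when the thread count is > 0) can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in the pull request: `get_globbed_paths` and `num_threads` are hypothetical names, and the standard-library `glob` module stands in for Hadoop's glob expansion.

```python
import glob
from concurrent.futures import ThreadPoolExecutor


def get_globbed_paths(patterns, num_threads=0):
    """Expand each glob pattern (hypothetical helper, not Spark's API).

    A local thread pool is used only when num_threads > 0; otherwise the
    expansion runs sequentially on the calling thread.
    """
    if num_threads > 0:
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            # pool.map preserves the order of the input patterns.
            expanded = list(pool.map(glob.glob, patterns))
    else:
        expanded = [glob.glob(p) for p in patterns]
    # Flatten the per-pattern results and sort for a deterministic order.
    return sorted(path for paths in expanded for path in paths)
```

Both modes should return the same set of paths; the pool only changes how many patterns are expanded at the same time, which matters most when each expansion is a slow remote listing.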
[GitHub] spark issue #21618: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21618 **[Test build #95843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95843/testReport)** for PR 21618 at commit [`1319cd3`](https://github.com/apache/spark/commit/1319cd38d24c24adccc1c318001814da58a3b22b).
[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21618#discussion_r216147889

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -724,4 +726,37 @@ object DataSource extends Logging {
              """.stripMargin)
         }
       }
    +
    +  /**
    +   * Return all paths represented by the wildcard string.
    +   * This will be done in main thread by default while the value of config
    +   * `spark.sql.sources.parallelGetGlobbedPath.numThreads` > 0, a local thread
    +   * pool will expand the globbed paths.
    +   */
    +  private def getGlobbedPaths(
    +      sparkSession: SparkSession,
--- End diff --

Thanks for the advice, done in 1319cd3.
[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21618#discussion_r216147887

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -1557,6 +1576,15 @@ class SQLConf extends Serializable with Logging {
       def parallelPartitionDiscoveryParallelism: Int =
         getConf(SQLConf.PARALLEL_PARTITION_DISCOVERY_PARALLELISM)
     
    +  def parallelGetGlobbedPathThreshold: Int =
    +    getConf(SQLConf.PARALLEL_GET_GLOBBED_PATH_THRESHOLD)
    +
    +  def parallelGetGlobbedPathNumThreads: Int =
    +    getConf(SQLConf.PARALLEL_GET_GLOBBED_PATH_NUM_THREADS)
    +
    +  def parallelGetGlobbedPathEnabled: Boolean =
--- End diff --

Thanks, done in 1319cd3.
[GitHub] spark pull request #22369: [SPARK-25072][DOC] Update migration guide for beh...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/22369#discussion_r216147674

--- Diff: docs/sql-programming-guide.md ---
    @@ -1901,6 +1901,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
     ## Upgrading From Spark SQL 2.3.0 to 2.3.1 and above
     
       - As of version 2.3.1 Arrow functionality, including `pandas_udf` and `toPandas()`/`createDataFrame()` with `spark.sql.execution.arrow.enabled` set to `True`, has been marked as experimental. These are still evolving and not currently recommended for use in production.
    +  - In version 2.3.1 and earlier, it is possible for PySpark to create a Row object by providing more value than column number through the customized Row class. Since Spark 2.3.3, Spark will confirm value length is less or equal than column length in PySpark. See [SPARK-25072](https://issues.apache.org/jira/browse/SPARK-25072) for details.
--- End diff --

Maybe say `..by providing more values than number of fields through a customized Row class. As of Spark 2.3.3, PySpark will raise a ValueError if the number of values is more than the number of fields. See...`
[GitHub] spark pull request #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.mem...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22298
[GitHub] spark issue #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.memory lim...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/22298 Merged to master (i.e. 3). It's not a bug fix, but I _think_ we should consider this for backport to 2.4 since it's arguably the second half of a feature that's in 2.4. It doesn't backport cleanly as is, though, so maybe another PR just for the 2.4 branch.
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/22140

@gatorsmile it seemed like a straightforward bug to me. Rows with extra values lead to incorrect output and exceptions when used in `DataFrames`, so it did not seem like there was any possibility this would break existing code. For example:

```
In [1]: MyRow = Row('a','b')

In [2]: print(MyRow(1,2,3))
Row(a=1, b=2)

In [3]: spark.createDataFrame([MyRow(1,2,3)])
Out[3]: DataFrame[a: bigint, b: bigint]

In [4]: spark.createDataFrame([MyRow(1,2,3)]).show()
18/09/08 21:55:48 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 2 fields are required while 3 values are provided.

In [5]: spark.createDataFrame([MyRow(1,2,3)], schema="x: int, y: int").show()
ValueError: Length of object (3) does not match with length of fields (2)
```

Maybe I was too hasty with backporting and this needed some discussion. Do you know of a use case that this change would break?
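The check under discussion (reject a row that carries more values than it has fields) can be illustrated without Spark. The sketch below is a plain-Python stand-in, not PySpark's `Row` implementation: `make_row_class` is a hypothetical helper and the error message wording is an assumption.

```python
def make_row_class(*fields):
    """Return a row factory for the given field names (hypothetical helper)."""
    def build(*values):
        # Refuse to silently drop values that have no matching field.
        if len(values) > len(fields):
            raise ValueError(
                "expected at most %d values for fields %r but got %d"
                % (len(fields), fields, len(values))
            )
        return dict(zip(fields, values))
    return build


MyRow = make_row_class("a", "b")
print(MyRow(1, 2))  # {'a': 1, 'b': 2}
try:
    MyRow(1, 2, 3)
except ValueError as err:
    print("rejected:", err)
```

Failing at construction time surfaces the mismatch where the row is built, rather than later when the row reaches a DataFrame, as in steps `In [4]` and `In [5]` of the session above.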
[GitHub] spark issue #22370: don't link to deprecated function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22370 Can one of the admins verify this patch?
[GitHub] spark pull request #22370: don't link to deprecated function
GitHub user MichaelChirico opened a pull request:

    https://github.com/apache/spark/pull/22370

    don't link to deprecated function

    Seems misleading to (without qualification) link to a deprecated function

    ## What changes were proposed in this pull request?

    (Please fill in changes proposed in this fix)

    ## How was this patch tested?

    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

    Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MichaelChirico/spark patch-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22370.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22370

----
commit e8b0d6333a1c09787e1c37a6f91eb895dee8fa72
Author: Michael Chirico
Date:   2018-09-09T05:12:27Z

    don't link to deprecated function

    Seems misleading to (without qualification) link to a deprecated function
[GitHub] spark issue #22369: [SPARK-25072][DOC] Update migration guide for behavior c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22369 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95842/ Test PASSed.
[GitHub] spark issue #22369: [SPARK-25072][DOC] Update migration guide for behavior c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22369 Merged build finished. Test PASSed.
[GitHub] spark issue #22369: [SPARK-25072][DOC] Update migration guide for behavior c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22369 **[Test build #95842 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95842/testReport)** for PR 22369 at commit [`d257a38`](https://github.com/apache/spark/commit/d257a38c647b45a9e83a2bdbbd2814f1b3fc5d56). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22369: [SPARK-25072][DOC] Update migration guide for behavior c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22369 **[Test build #95842 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95842/testReport)** for PR 22369 at commit [`d257a38`](https://github.com/apache/spark/commit/d257a38c647b45a9e83a2bdbbd2814f1b3fc5d56).
[GitHub] spark issue #22368: [SPARK-25368][SQL] Incorrect predicate pushdown returns ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22368 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2952/ Test PASSed.
[GitHub] spark issue #22369: [SPARK-25072][DOC] Update migration guide for behavior c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22369 Merged build finished. Test PASSed.
[GitHub] spark issue #22369: [SPARK-25072][DOC] Update migration guide for behavior c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22369 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2951/ Test PASSed.
[GitHub] spark issue #22368: [SPARK-25368][SQL] Incorrect predicate pushdown returns ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22368 Merged build finished. Test PASSed.
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/22140

```
@xuanyuanking Could you please update the document?
```
#22369 Thanks for the reminder, I'll pay attention to this in future work.
[GitHub] spark pull request #22369: [SPARK-25072][DOC] Update migration guide for beh...
GitHub user xuanyuanking opened a pull request:

    https://github.com/apache/spark/pull/22369

    [SPARK-25072][DOC] Update migration guide for behavior change

    ## What changes were proposed in this pull request?

    Update the document for the behavior change in PySpark Row creation.

    ## How was this patch tested?

    Existing UT.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuanyuanking/spark SPARK-25072-DOC

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22369.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22369

----
commit d257a38c647b45a9e83a2bdbbd2814f1b3fc5d56
Author: Yuanjian Li
Date:   2018-09-09T04:26:23Z

    Update doc for SPARK-25072
[GitHub] spark issue #22368: [SPARK-25368][SQL] Incorrect predicate pushdown returns ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22368 **[Test build #95841 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95841/testReport)** for PR 22368 at commit [`865e0af`](https://github.com/apache/spark/commit/865e0af572edad7fd775c25e317055ffa0df2a08).
[GitHub] spark pull request #22368: [SPARK-25368][SQL] Incorrect predicate pushdown r...
GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/22368

    [SPARK-25368][SQL] Incorrect predicate pushdown returns wrong result

## What changes were proposed in this pull request?

How to reproduce:

```scala
val df1 = spark.createDataFrame(Seq(
  (1, 1)
)).toDF("a", "b").withColumn("c", lit(null).cast("int"))
val df2 = df1.union(df1).withColumn("d", spark_partition_id).filter($"c".isNotNull)
df2.show

+---+---+----+---+
|  a|  b|   c|  d|
+---+---+----+---+
|  1|  1|null|  0|
|  1|  1|null|  1|
+---+---+----+---+
```

`filter($"c".isNotNull)` changed to `(null <=> c#10)` before https://github.com/apache/spark/pull/19201, but it changed to `(c#10 = null)` since https://github.com/apache/spark/pull/20155. This PR reverts it to `(null <=> c#10)` to fix this issue.

## How was this patch tested?

unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25368

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22368.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22368

----
commit 86b9b7892c94be68145453f9519e35a3574fe568
Author: Yuming Wang
Date:   2018-09-09T03:46:18Z

    Fix SPARK-25368

commit 865e0af572edad7fd775c25e317055ffa0df2a08
Author: Yuming Wang
Date:   2018-09-09T04:22:29Z

    Fix InferFiltersFromConstraintsSuite test error
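The two predicate forms named in the description, `=` and `<=>`, treat NULL differently under standard SQL semantics. The sketch below models that difference in plain Python, with `None` standing in for NULL. The function names are hypothetical, and this illustrates the operators only, not Spark's optimizer rule.

```python
def sql_eq(a, b):
    """SQL `=`: the result is NULL (None) when either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def sql_null_safe_eq(a, b):
    """SQL `<=>`: always True or False; NULL <=> NULL is True."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def sql_filter(rows, predicate):
    """A filter keeps a row only when the predicate is exactly True."""
    return [row for row in rows if predicate(row) is True]


rows = [{"c": None}, {"c": 1}]
print(sql_filter(rows, lambda r: sql_eq(r["c"], None)))            # []
print(sql_filter(rows, lambda r: sql_null_safe_eq(None, r["c"])))  # [{'c': None}]
```

Because `c = null` never evaluates to true, a filter built from it keeps no rows, while `null <=> c` keeps exactly the rows where `c` is NULL, so the two forms are not interchangeable.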
[GitHub] spark issue #22010: [SPARK-21436][CORE] Take advantage of known partitioner ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22010

Actually @holdenk is this change even correct? RDD.distinct is not key based. It is based on the value of the elements in RDD. Even if `numPartitions == partitions.length`, it doesn't mean the RDD is hash partitioned this way.

Consider this RDD:

Partition 1: 1, 2, 3
Partition 2: 1, 2, 3

rdd.distinct() should return 1, 2, 3; with your change it'd still return 1, 2, 3, 1, 2, 3.
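The counterexample above can be reproduced with plain lists standing in for partitions. This is a minimal sketch with hypothetical function names, not Spark code: the first function deduplicates inside each partition only, the second routes every value to a bucket by hash before deduplicating, which is what a shuffle achieves.

```python
def distinct_within_partitions(partitions):
    """Deduplicate inside each partition only; no data moves between them."""
    return [x for part in partitions for x in dict.fromkeys(part)]


def distinct_with_shuffle(partitions, num_partitions):
    """Route each value to a bucket by hash, then deduplicate per bucket."""
    buckets = [[] for _ in range(num_partitions)]
    for part in partitions:
        for x in part:
            buckets[hash(x) % num_partitions].append(x)
    return sorted(x for bucket in buckets for x in dict.fromkeys(bucket))


partitions = [[1, 2, 3], [1, 2, 3]]
print(distinct_within_partitions(partitions))  # [1, 2, 3, 1, 2, 3]
print(distinct_with_shuffle(partitions, 2))    # [1, 2, 3]
```

Per-partition deduplication is only correct when equal values are already guaranteed to live in the same partition, which a matching partition count alone does not establish.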
[GitHub] spark pull request #22010: [SPARK-21436][CORE] Take advantage of known parti...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/22010#discussion_r216145892

--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -396,7 +396,26 @@ abstract class RDD[T: ClassTag](
        * Return a new RDD containing the distinct elements in this RDD.
        */
       def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    -    map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
    +    partitioner match {
--- End diff --

You can just create a new MapPartitionsRDD with preservesPartitioning set to true, can't you?
[GitHub] spark issue #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNu...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22366 Is it better to add a description to `docs/sql-programming-guide.md`?
[GitHub] spark issue #22360: [MINOR][ML] Remove `BisectingKMeansModel.setDistanceMeas...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/22360 Do we need to set `distanceMeasure` again for the parent model? When the parent model is created, it will use the same `distanceMeasure` as the one used in training.
[GitHub] spark issue #22367: [SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix empty stri...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22367 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95840/ Test FAILed.
[GitHub] spark issue #22367: [SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix empty stri...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22367 Merged build finished. Test FAILed.
[GitHub] spark issue #22367: [SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix empty stri...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22367 **[Test build #95840 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95840/testReport)** for PR 22367 at commit [`7eac385`](https://github.com/apache/spark/commit/7eac385568c78735bb7743cfcfa234c4bea97fb0). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/22357 Thanks! @mallman For the first query, I think the query plan produced by your WIP patch is not correct. We don't need to read the `company:struct` from `employer:struct`. For the second, your WIP patch doesn't push down the `IsNotNull(employer)` predicate into the `FileScan` node. Those are the important differences I have noticed so far.
[GitHub] spark issue #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22366 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95839/ Test PASSed.
[GitHub] spark issue #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22366 Merged build finished. Test PASSed.
[GitHub] spark issue #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22366 **[Test build #95839 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95839/testReport)** for PR 22366 at commit [`f950845`](https://github.com/apache/spark/commit/f9508458d1963e83c7fc23106dc4cb2f1f491524). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22367: [SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix empty stri...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22367 Can one of the admins verify this patch?
[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/22234 @gatorsmile @HyukjinKwon Please, take a look at #22367
[GitHub] spark issue #22367: [SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix empty stri...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22367 **[Test build #95840 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95840/testReport)** for PR 22367 at commit [`7eac385`](https://github.com/apache/spark/commit/7eac385568c78735bb7743cfcfa234c4bea97fb0).
[GitHub] spark pull request #22367: [SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix emp...
GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/22367

    [SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix empty string being parsed as null when nullValue is set.

    ## What changes were proposed in this pull request?

    In the PR, I propose a new CSV option `emptyValue` and an update in the SQL Migration Guide which describes how to revert to the previous behavior, when empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s.

    ## How was this patch tested?

    It was tested by `CSVSuite` and new tests added in the PR #22234

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 csv-empty-value-2.4

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22367.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22367

----
commit 465ed7a6011bd0437c7f88cb4c18ecea68cb60ac
Author: Mario Molina
Date:   2018-08-25T17:42:03Z

    Configurable empty values when reading/writing CSV files

commit 48e143d43a876afc4f0099bf7079130d74ebe855
Author: Mario Molina
Date:   2018-08-26T23:29:32Z

    Adding tests

commit 70e217146962186a391227f1417cf79c5e81c380
Author: Mario Molina
Date:   2018-08-26T23:33:55Z

    Changing emptyValue order arg in streaming.py

commit 8665f93c442915dc23a40ffb3c958a097dec34c5
Author: Mario Molina
Date:   2018-08-27T02:03:41Z

    Changing emptyValue order arg in set_opts

commit 867c6de34673bbc877e0e26e8c0d662e038e2946
Author: Maxim Gekk
Date:   2018-09-08T20:40:41Z

    Added comments for parameters

commit e0cb879f3bc28f66e19d049ed0ee6dc33fc5922c
Author: Maxim Gekk
Date:   2018-09-08T21:02:21Z

    Updating the migration guide

commit e23098c5a6322ab3cff851b37889163c9bd09491
Author: Mario Molina
Date:   2018-08-26T23:28:34Z

    Changing order in args for emptyValue

commit 732ec78c8d376bad0cc8897b1da48a56448590fb
Author: Maxim Gekk
Date:   2018-09-08T21:11:56Z

    Revert some checking

commit 7eac385568c78735bb7743cfcfa234c4bea97fb0
Author: Maxim Gekk
Date:   2018-09-08T21:14:13Z

    Revert unneeded changes
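The distinction this pull request describes (an empty string written as `""` so that it can be told apart from a saved `null`) can be sketched with a hand-rolled encoder. This is an illustration only, not Spark's CSV writer: `encode_field` and `encode_row` are hypothetical, the parameter names merely echo the `emptyValue` and `nullValue` options, and quoting of fields that contain commas is deliberately left out.

```python
def encode_field(value, empty_value='""', null_value=""):
    """Encode one field, keeping empty strings distinct from nulls."""
    if value is None:
        return null_value
    if value == "":
        return empty_value
    return str(value)


def encode_row(values, **opts):
    """Join encoded fields with commas (no escaping; sketch only)."""
    return ",".join(encode_field(v, **opts) for v in values)


print(encode_row(["a", "", None]))                  # a,"",
print(encode_row(["a", "", None], empty_value=""))  # a,,
```

In this sketch, passing `empty_value=""` makes empty strings and nulls indistinguishable in the output again, which corresponds to the earlier behavior the description refers to.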
[GitHub] spark issue #22353: [SPARK-25357][SQL] Abbreviated simpleString in DataSourc...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22353

@LantaoJin, please check the following example in the Spark UI; see the hover text on `Scan parquet`.

```scala
scala> spark.range(2).repartition(1).write.mode("overwrite").parquet("/tmp/1")
scala> spark.read.parquet("/tmp/1/*").count
```
[GitHub] spark issue #22363: [SPARK-25375][SQL][TEST] Reenable qualified perm. functi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22363 Thank you, @gatorsmile !
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22365 Merged build finished. Test PASSed.
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22365 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95836/ Test PASSed.
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22365 **[Test build #95836 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95836/testReport)** for PR 22365 at commit [`2845bca`](https://github.com/apache/spark/commit/2845bca09797a34e930e6aca42f198ec5cbd95e3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22337 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95833/ Test PASSed.
[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22337 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22337 **[Test build #95833 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95833/testReport)** for PR 22337 at commit [`309e265`](https://github.com/apache/spark/commit/309e265f64a856f46c10d5310a07417e0abd0dab). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22366 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22366 **[Test build #95839 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95839/testReport)** for PR 22366 at commit [`f950845`](https://github.com/apache/spark/commit/f9508458d1963e83c7fc23106dc4cb2f1f491524). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJson...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/22366

[SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNullableSchema

## What changes were proposed in this pull request?

In the PR, I propose to remove the `spark.sql.fromJsonForceNullableSchema` flag since disabling it can cause corrupted output. The flag was introduced only for backward compatibility in minor versions. The PR targets Spark 3.0 in which the flag can be removed.

## How was this patch tested?

It was tested by `JsonExpressionsSuite`, `JsonFunctionsSuite` and `JsonSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 json-remove-non-nullable-schema

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22366.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22366

commit 0de349bc59143eeb26cd422fdfe945037f8353ac
Author: Maxim Gekk
Date: 2018-09-08T19:33:20Z

    Removing the spark.sql.fromJsonForceNullableSchema flag

commit f9508458d1963e83c7fc23106dc4cb2f1f491524
Author: Maxim Gekk
Date: 2018-09-08T19:46:05Z

    Bug fix - missing field must not nullable
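For readers unfamiliar with the flag, the following is a minimal sketch of what "forcing a nullable schema" means, assuming the flag's effect is to mark every field (including nested ones) as nullable. The types `DType`, `Field`, `StructT` and `ArrayT` are hypothetical, simplified stand-ins and not Spark's `StructType` API.

```scala
object NullableSchema {
  // Hypothetical, simplified schema model (not Spark's real classes).
  sealed trait DType
  case object StringT extends DType
  case object IntT extends DType
  final case class ArrayT(element: DType, containsNull: Boolean) extends DType
  final case class Field(name: String, dtype: DType, nullable: Boolean)
  final case class StructT(fields: List[Field]) extends DType

  // Recursively mark every struct field and array element as nullable.
  def forceNullable(t: DType): DType = t match {
    case StructT(fields) =>
      StructT(fields.map(f => f.copy(dtype = forceNullable(f.dtype), nullable = true)))
    case ArrayT(elem, _) =>
      ArrayT(forceNullable(elem), containsNull = true)
    case other =>
      other
  }
}
```

The intuition, under the same assumption, is that parsing JSON can always yield a missing or malformed field, so a schema that claims a field is never null may not match the data actually produced.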
[GitHub] spark issue #22360: [MINOR][ML] Remove `BisectingKMeansModel.setDistanceMeas...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22360 Yes, I think the point here is that the parameter is part of `BisectingKMeansParams`, which defines the getter method as final. I think `KMeans` has the same issue. We can probably remove this and set the distanceMeasure from the parent model at creation time.
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22364 cc @gatorsmile @maropu --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21273: [SPARK-17916][SQL] Fix empty string being parsed ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21273#discussion_r216138533

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
@@ -164,7 +164,7 @@ class CSVOptions(
     writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
     writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
     writerSettings.setNullValue(nullValue)
-    writerSettings.setEmptyValue(nullValue)
+    writerSettings.setEmptyValue("\"\"")
--- End diff --

This needs an update in the migration guide.
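To make the behavior change in the diff above concrete, here is a small sketch of how a writer might render null versus empty-string fields. `WriterSettings`, `renderField` and `renderRow` are hypothetical names for illustration and are not the univocity or Spark API; the sketch only assumes that nulls are written as `nullValue` and empty strings as `emptyValue`.

```scala
object CsvEmptyValue {
  // Hypothetical settings holder; field names mirror the options in the diff.
  final case class WriterSettings(nullValue: String, emptyValue: String)

  // Render one field: None models SQL NULL, Some("") models an empty string.
  def renderField(value: Option[String], s: WriterSettings): String = value match {
    case None     => s.nullValue
    case Some("") => s.emptyValue
    case Some(v)  => v
  }

  // Join the rendered fields with commas (no quoting or escaping in this sketch).
  def renderRow(row: Seq[Option[String]], s: WriterSettings): String =
    row.map(renderField(_, s)).mkString(",")
}
```

With `emptyValue` equal to `nullValue`, a null and an empty string are indistinguishable in the output; with `emptyValue` set to two double quotes they differ, which is why the change is visible to users and belongs in the migration guide.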
[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22234 @MaxGekk Could you take this PR over? I think we need to merge this to Spark 2.4. Users can restore the previous behavior with the new `emptyValue` conf, if needed. Also update the migration guide about the behavior change and explain how to set `emptyValue`.
[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22337 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22337 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95832/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22337 **[Test build #95832 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95832/testReport)** for PR 22337 at commit [`a314776`](https://github.com/apache/spark/commit/a3147760b025e6592dd80d858ae4757bd907a72c). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22140 @BryanCutler What is the reason to backport this PR? This sounds like a behavior change. @xuanyuanking Could you please update the document?
[GitHub] spark pull request #17899: [SPARK-20636] Add new optimization rule to transp...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17899 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19045 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19045 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95838/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19045 **[Test build #95838 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95838/testReport)** for PR 19045 at commit [`5877c16`](https://github.com/apache/spark/commit/5877c16e20559122847ed5ea21c74214fc024c9d). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19045 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95837/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19045 **[Test build #95837 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95837/testReport)** for PR 19045 at commit [`0ba0ca5`](https://github.com/apache/spark/commit/0ba0ca5551d106cd621097b510fa8fb373f171f9). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19045 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19045 **[Test build #95838 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95838/testReport)** for PR 19045 at commit [`5877c16`](https://github.com/apache/spark/commit/5877c16e20559122847ed5ea21c74214fc024c9d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19045 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2950/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19045 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19045 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2949/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/19045 cc @ifilonenko it's super WIP but since you joined me on the stream where I was working on reviving this I thought it would be good to get your early comments (especially if you have any suggestions around making effective integration tests for this). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19045 **[Test build #95837 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95837/testReport)** for PR 19045 at commit [`0ba0ca5`](https://github.com/apache/spark/commit/0ba0ca5551d106cd621097b510fa8fb373f171f9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22010: [SPARK-21436][CORE] Take advantage of known partitioner ...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/22010 Hey @rxin & @cloud-fan I'd really appreciate your input on the tricks I did to keep the partitioner information present -- is this the right approach?
[GitHub] spark issue #21654: [SPARK-24671][PySpark] DataFrame length using a dunder/m...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/21654 cc @rgbkrk --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22363: [SPARK-25375][SQL][TEST] Reenable qualified perm....
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22363 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22365 **[Test build #95836 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95836/testReport)** for PR 22365 at commit [`2845bca`](https://github.com/apache/spark/commit/2845bca09797a34e930e6aca42f198ec5cbd95e3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22365 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22365 **[Test build #95835 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95835/testReport)** for PR 22365 at commit [`7e77941`](https://github.com/apache/spark/commit/7e7794153924b824dc5fe5f05375c8b9950ef539). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22365 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95835/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22365 **[Test build #95835 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95835/testReport)** for PR 22365 at commit [`7e77941`](https://github.com/apache/spark/commit/7e7794153924b824dc5fe5f05375c8b9950ef539). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22365 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22365 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95834/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22365 **[Test build #95834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95834/testReport)** for PR 22365 at commit [`e2e6149`](https://github.com/apache/spark/commit/e2e61498c47da9d7b36d2e0727ce8642d5d71472). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22365 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22365 **[Test build #95834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95834/testReport)** for PR 22365 at commit [`e2e6149`](https://github.com/apache/spark/commit/e2e61498c47da9d7b36d2e0727ce8642d5d71472). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22365 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
GitHub user MaxGekk opened a pull request: https://github.com/apache/spark/pull/22365

[SPARK-25381][SQL] Stratified sampling by Column argument

## What changes were proposed in this pull request?

In the PR, I propose to add an overloaded method for `sampleBy` which accepts the first argument of the `Column` type. This will allow sampling by any complex column as well as by multiple columns. For example:

```Scala
spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10))).toDF("name", "age")
  .stat
  .sampleBy(struct($"name", $"age"), Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0), 36L)
  .show()
+-----+---+
| name|age|
+-----+---+
| Nico|  8|
|Alice| 10|
+-----+---+
```

## How was this patch tested?

Added a new test for sampling by multiple columns for Scala, and tests for Java and Python to check that `sampleBy` is able to sample by a `Column` type argument.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 sample-by-column

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22365.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22365

commit 3832f2137676a76d6d06a0bb6dbcedcba801910b
Author: Maxim Gekk
Date: 2018-09-08T13:30:49Z

    Adding overloaded sampleBy with Column type

commit 5cd3229ce8bfe894dac8ebc097109da237d95401
Author: Maxim Gekk
Date: 2018-09-08T13:39:30Z

    Adding overloaded sampleBy with Column type for Java

commit e2e61498c47da9d7b36d2e0727ce8642d5d71472
Author: Maxim Gekk
Date: 2018-09-08T14:56:36Z

    Adding overloaded sampleBy with Column type for Python
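The idea behind stratified sampling by a composite key can be sketched in plain Scala without Spark. This is an illustration only, not the PR's implementation: `StratifiedSample.sampleBy` is a hypothetical helper, it draws from a single `scala.util.Random`, and it assumes that strata missing from `fractions` get fraction 0.0, so it will not reproduce the exact rows Spark selects for a given seed.

```scala
import scala.util.Random

object StratifiedSample {
  // Keep each row with the probability assigned to its stratum key.
  // Keys absent from `fractions` are treated as fraction 0.0 (assumption).
  def sampleBy[T, K](rows: Seq[T], key: T => K, fractions: Map[K, Double], seed: Long): Seq[T] = {
    val rng = new Random(seed)
    rows.toList.filter { row =>
      val fraction = fractions.getOrElse(key(row), 0.0)
      rng.nextDouble() < fraction // nextDouble() is in [0, 1), so 1.0 always keeps
    }
  }
}
```

Because the key is an arbitrary function of the row, a tuple such as `(name, age)` works as the stratum, which mirrors sampling by `struct($"name", $"age")` in the example above.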
[GitHub] spark pull request #22349: [SPARK-25345][ML] Deprecate public APIs from Imag...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22349 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22349: [SPARK-25345][ML] Deprecate public APIs from ImageSchema
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/22349 LGTM. Merged into master and branch-2.4. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22353: [SPARK-25357][SQL] Abbreviated simpleString in Da...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22353#discussion_r216134032

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -54,7 +54,7 @@ trait DataSourceScanExec extends LeafExecNode with CodegenSupport {
   override def simpleString: String = {
     val metadataEntries = metadata.toSeq.sorted.map {
       case (key, value) =>
-        key + ": " + StringUtils.abbreviate(redact(value), 100)
--- End diff --

This seems to cause a regression in the Spark Web UI. Could you check that, @LantaoJin? In fact, the abbreviation was introduced intentionally for the UI over two years ago, in Spark 2.0, via [[SPARK-14476][SQL] Improve the physical plan visualization by adding meta info like table name and file path for data source](https://github.com/apache/spark/pull/12947). At the least, we had better update the PR and JIRA descriptions.
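As a rough illustration of what the removed line does, the sketch below truncates each metadata value before joining the entries. The `abbreviate` helper here is a hypothetical plain-Scala stand-in modelled loosely on the two-argument `StringUtils.abbreviate`, and redaction is left out entirely; it is not the code from `DataSourceScanExec`.

```scala
object Abbreviate {
  // Truncate `s` to at most `maxWidth` characters, ending in "..." when shortened.
  def abbreviate(s: String, maxWidth: Int): String = {
    require(maxWidth >= 4, "maxWidth must be at least 4")
    if (s.length <= maxWidth) s else s.take(maxWidth - 3) + "..."
  }

  // Build a one-line summary from sorted metadata entries, abbreviating each value.
  def simpleString(metadata: Map[String, String], maxWidth: Int = 100): String =
    metadata.toSeq.sorted
      .map { case (key, value) => key + ": " + abbreviate(value, maxWidth) }
      .mkString(", ")
}
```

Without the truncation step, a long value such as a list of file paths would be rendered in full, which is the kind of Web UI regression the comment asks to be checked.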
[GitHub] spark issue #21618: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/21618 @kiszk @maropu Many thanks for your review and advice! I'll address the comments and resolve the conflicts ASAP.
[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21618#discussion_r216133261

--- Diff: core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala ---
@@ -77,6 +80,51 @@ class SparkHadoopUtilSuite extends SparkFunSuite with Matchers {
     })
   }

+  test("test expanding glob path") {
--- End diff --

```
IIUC, the new feature is disabled as default since spark.sql.sources.parallelGetGlobbedPath.numThreads is 0.
```
Yes, that's right.
```
I am afraid these test causes are executed only with disabling the new feature.
```
These mainly test the correctness of `sparkHadoopUtil.expandGlobPath`, so it may be worth keeping them.
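The gating discussed here (serial by default, a local thread pool only when `numThreads` > 0) could look roughly like the sketch below. It is not the PR's code: `ParallelGlob.expandAll` is a hypothetical helper, and the `expand` argument is a placeholder for whatever per-pattern glob call is used.

```scala
import java.util.concurrent.{Callable, Executors}

object ParallelGlob {
  // Expand every pattern; use a local thread pool only when numThreads > 0.
  // `expand` is a hypothetical stand-in for the per-pattern glob call.
  def expandAll(patterns: Seq[String], numThreads: Int)(expand: String => Seq[String]): Seq[String] =
    if (numThreads <= 0) {
      // Default path: expand in the calling thread.
      patterns.flatMap(expand)
    } else {
      val pool = Executors.newFixedThreadPool(numThreads)
      try {
        // Submit one task per pattern, then collect results in submission order.
        val futures = patterns.toList.map { p =>
          pool.submit(new Callable[Seq[String]] {
            override def call(): Seq[String] = expand(p)
          })
        }
        futures.flatMap(_.get())
      } finally {
        pool.shutdown()
      }
    }
}
```

A test of the expansion logic alone exercises only the `expand` function, so a second test with `numThreads` set above zero would be needed to cover the pooled branch, which is the concern raised in the quoted review comment.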
[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22359 Since this is related to Parquet behavior only, can we add `in Parquet` at the end of the title?
[GitHub] spark issue #22363: [SPARK-25375][SQL][TEST] Reenable qualified perm. functi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22363 cc @cloud-fan and @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org