[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22194 Merged build finished. Test PASSed.
[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22194 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95133/
[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22194 **[Test build #95133 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95133/testReport)** for PR 22194 at commit [`967360a`](https://github.com/apache/spark/commit/967360a1a417739cdada3b7c7334c8ca87ede6a6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22165: [SPARK-25017][Core] Add test suite for BarrierCoordinato...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/22165

@jiangxb1987 Many thanks for your comment!
```
One general idea is that we don't need to rely on the RPC framework to test ContextBarrierState, just mock RpcCallContexts should be enough.
```
Actually, I wanted to implement it this way at first, as you asked in the JIRA, but `ContextBarrierState` is a private inner class of `BarrierCoordinator`. Could I refactor `ContextBarrierState` out of `BarrierCoordinator`? If that is permitted, I think we can just mock RpcCallContext to achieve this (a sketch follows below).
```
We shall cover the following scenarios:
```
Thanks for the list; the first five scenarios are covered by the current implementation. I'll add the final check, `Make sure we clear all the internal data under each case.`, once we reach an agreement.
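A minimal sketch of the proposed mocking approach, assuming `ContextBarrierState` has been refactored out of `BarrierCoordinator` so it can be constructed directly; the constructor and `handleRequest` names here are hypothetical, not the real API:

```scala
import org.mockito.Mockito.{mock, verify}
import org.apache.spark.rpc.RpcCallContext

// Build the state for a 4-task barrier stage and drive it with mocked contexts.
val numTasks = 4
val state = new ContextBarrierState(numTasks) // hypothetical refactored constructor

val contexts = Seq.fill(numTasks)(mock(classOf[RpcCallContext]))
contexts.zipWithIndex.foreach { case (rpcContext, taskId) =>
  state.handleRequest(rpcContext, taskId) // hypothetical per-task sync entry point
}

// Once all tasks have called in, every blocked context should receive a reply.
contexts.foreach(rpcContext => verify(rpcContext).reply(()))
```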
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20146 seems like this was a thumbs-up from @WeichenXu123 @jkbradley? @dbtsai?
[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22121
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22121 thanks, merging to master!
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 To confirm: is everyone OK with merging this PR, or are we just OK with the direction and need more time to review it?
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22121 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95140/
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22121 Merged build finished. Test PASSed.
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22121 **[Test build #95140 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95140/testReport)** for PR 22121 at commit [`1f253bf`](https://github.com/apache/spark/commit/1f253bf536c3a7bd1c07ba5ea5600f661c8e106e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22121 **[Test build #95139 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95139/testReport)** for PR 22121 at commit [`8245806`](https://github.com/apache/spark/commit/824580684c05c2a3c1654517b77864ca5d504ee0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22121 Merged build finished. Test PASSed.
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22121 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95139/
[GitHub] spark issue #22187: [SPARK-25178][SQL] Directly ship the StructType objects ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22187 **[Test build #95141 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95141/testReport)** for PR 22187 at commit [`81ef75a`](https://github.com/apache/spark/commit/81ef75a36a1c9dcd6922d2ec77393bc35389efd0).
[GitHub] spark issue #22187: [SPARK-25178][SQL] Directly ship the StructType objects ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22187 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2475/
[GitHub] spark issue #22187: [SPARK-25178][SQL] Directly ship the StructType objects ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22187 Merged build finished. Test PASSed.
[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22163 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95130/
[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22163 Merged build finished. Test PASSed.
[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22163 **[Test build #95130 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95130/testReport)** for PR 22163 at commit [`f91e18c`](https://github.com/apache/spark/commit/f91e18c7d4b8eab53c4983320a0eab0403c37a48). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22121 The preview doc (zip file in the PR description) has been updated to the latest version.
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22121 **[Test build #95140 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95140/testReport)** for PR 22121 at commit [`1f253bf`](https://github.com/apache/spark/commit/1f253bf536c3a7bd1c07ba5ea5600f661c8e106e).
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112

@tgravescs:
> The shuffle simply transfers the bytes its supposed to. Sparks shuffle of those bytes is not consistent in that the order it fetches from can change and without the sort happening on that data the order can be different on rerun. I guess maybe you mean the ShuffledRDD as a whole or do you mean something else here?

By shuffle, I am referring to the output of the shuffle, which is consumed by an RDD with a `ShuffleDependency` as input. More specifically, the output of `SparkEnv.get.shuffleManager.getReader(...).read()`, which RDDs (both user code and Spark's own implementations) use to fetch the output of the shuffle machinery. This output is not just deserialized shuffle bytes, but has aggregation applied (if specified) and ordering imposed (if specified). ShuffledRDD is one such usage within Spark core, but others exist within Spark core and in user code.

> All I'm saying is zip is just another variant of this, you could document it as such and do nothing internal to spark to "fix it".

I agree; repartition + shuffle, zip, sample, and the mllib usages are all variants of the same problem: shuffle output order being inconsistent. A toy illustration of that class of nondeterminism is sketched below.

> I guess we can separate out these 2 discussions. I think the point of this pr is to temporarily workaround the data loss/corruption issue with repartition by failing. So if everyone agrees on that lets move the discussion to a jira about what to do with the rest of the operators and fix repartition here. thoughts?

Sounds good to me.
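A toy sketch (not from the PR) of the class of problem being discussed: `groupByKey` does not guarantee the order of values within a group, so the sequence feeding a round-robin `repartition` can differ between runs, and a partial rerun after a fetch failure can place records in different output partitions than the surviving, non-recomputed output assumes:

```scala
import org.apache.spark.sql.SparkSession

object ShuffleOrderDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]").appName("shuffle-order-demo").getOrCreate()
    val sc = spark.sparkContext

    // Values within each group arrive in whatever order the shuffle fetches them,
    // so the flattened sequence below is not guaranteed to be stable across reruns.
    val grouped = sc.parallelize(1 to 1000, numSlices = 8)
      .map(i => (i % 4, i))
      .groupByKey()
      .flatMap { case (_, values) => values }

    // Round-robin repartition assigns records to output partitions by arrival order;
    // if an upstream task is rerun and its order changes, rows can be dropped or
    // duplicated relative to output partitions that were not recomputed.
    val repartitioned = grouped.repartition(5)
    repartitioned.count()

    spark.stop()
  }
}
```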
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22121 Merged build finished. Test PASSed.
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22121 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2474/
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22121 **[Test build #95139 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95139/testReport)** for PR 22121 at commit [`8245806`](https://github.com/apache/spark/commit/824580684c05c2a3c1654517b77864ca5d504ee0).
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22192 **[Test build #95138 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95138/testReport)** for PR 22192 at commit [`7c86fc5`](https://github.com/apache/spark/commit/7c86fc54c36954f1345eccc066873f7f90832657).
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22121 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2473/
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22121 Merged build finished. Test PASSed.
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed.
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95129/
[GitHub] spark pull request #22152: [SPARK-25159][SQL] json schema inference should o...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22152#discussion_r212183703

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala ---
@@ -69,10 +70,17 @@ private[sql] object JsonInferSchema {
      }.reduceOption(typeMerger).toIterator
    }

-    // Here we get RDD local iterator then fold, instead of calling `RDD.fold` directly, because
-    // `RDD.fold` will run the fold function in DAGScheduler event loop thread, which may not have
-    // active SparkSession and `SQLConf.get` may point to the wrong configs.
-    val rootType = mergedTypesFromPartitions.toLocalIterator.fold(StructType(Nil))(typeMerger)
+    // Here we manually submit a fold-like Spark job, so that we can set the SQLConf when running
+    // the fold functions in the scheduler event loop thread.
+    val existingConf = SQLConf.get
+    var rootType: DataType = StructType(Nil)
+    val foldPartition = (iter: Iterator[DataType]) => iter.fold(StructType(Nil))(typeMerger)
+    val mergeResult = (index: Int, taskResult: DataType) => {
+      rootType = SQLConf.withExistingConf(existingConf) {
--- End diff --

The same question was on my mind. Thanks for the clarification.
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95129 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95129/testReport)** for PR 22112 at commit [`93f37fa`](https://github.com/apache/spark/commit/93f37fa585462b9ee2fb9e179eab736fbc416d3e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22171: [SPARK-25177][SQL] When dataframe decimal type co...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22171#discussion_r212180992

--- Diff: sql/core/src/test/resources/sql-tests/results/literals.sql.out ---
@@ -197,7 +197,7 @@ select .e3
 -- !query 20
 select 1E309, -1E309
 -- !query 20 schema
-struct<1E+309:decimal(1,-309),-1E+309:decimal(1,-309)>
+struct<10:decimal(1,-309),-10:decimal(1,-309)>
--- End diff --

@vinodkc how does it show in Postgres?
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20345 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95131/
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20345 Merged build finished. Test PASSed.
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20345 **[Test build #95131 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95131/testReport)** for PR 20345 at commit [`39462fb`](https://github.com/apache/spark/commit/39462fbee952ec574b4c04d7718fd73bb5f56d9d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21546 **[Test build #95137 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95137/testReport)** for PR 21546 at commit [`5549644`](https://github.com/apache/spark/commit/554964465dbcb99cc313620fafb0fc41acfd4304).
[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/Hive t...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22153 LGTM
[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21546 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2472/
[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21546 Merged build finished. Test PASSed.
[GitHub] spark pull request #22157: [SPARK-25126][SQL] Avoid creating Reader for all ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22157#discussion_r212178321

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala ---
@@ -562,20 +562,57 @@ abstract class OrcQueryTest extends OrcTest {
      }
    }

+    def testAllCorruptFiles(): Unit = {
+      withTempDir { dir =>
+        val basePath = dir.getCanonicalPath
+        spark.range(1).toDF("a").write.json(new Path(basePath, "first").toString)
+        spark.range(1, 2).toDF("a").write.json(new Path(basePath, "second").toString)
+        val df = spark.read.orc(
+          new Path(basePath, "first").toString,
+          new Path(basePath, "second").toString)
+        assert(df.count() == 0)
+      }
+    }
+
+    def testAllCorruptFilesWithoutSchemaInfer(): Unit = {
+      withTempDir { dir =>
+        val basePath = dir.getCanonicalPath
+        spark.range(1).toDF("a").write.json(new Path(basePath, "first").toString)
+        spark.range(1, 2).toDF("a").write.json(new Path(basePath, "second").toString)
+        val df = spark.read.schema("a long").orc(
+          new Path(basePath, "first").toString,
+          new Path(basePath, "second").toString)
+        assert(df.count() == 0)
+      }
+    }
+
    withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "true") {
      testIgnoreCorruptFiles()
      testIgnoreCorruptFilesWithoutSchemaInfer()
+      val m1 = intercept[AnalysisException] {
+        testAllCorruptFiles()
+      }.getMessage
+      assert(m1.contains("Unable to infer schema for ORC"))
+      testAllCorruptFilesWithoutSchemaInfer()
    }

    withSQLConf(SQLConf.IGNORE_CORRUPT_FILES.key -> "false") {
      val m1 = intercept[SparkException] {
        testIgnoreCorruptFiles()
      }.getMessage
-      assert(m1.contains("Could not read footer for file"))
+      assert(m1.contains("Malformed ORC file"))
--- End diff --

Why did the error message change?
[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21546 LGTM
[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21546#discussion_r212178291

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala ---
@@ -183,34 +178,106 @@ private[sql] object ArrowConverters {
  }

  /**
-   * Convert a byte array to an ArrowRecordBatch.
+   * Load a serialized ArrowRecordBatch.
   */
-  private[arrow] def byteArrayToBatch(
+  private[arrow] def loadBatch(
      batchBytes: Array[Byte],
      allocator: BufferAllocator): ArrowRecordBatch = {
-    val in = new ByteArrayReadableSeekableByteChannel(batchBytes)
-    val reader = new ArrowFileReader(in, allocator)
-
-    // Read a batch from a byte stream, ensure the reader is closed
-    Utils.tryWithSafeFinally {
-      val root = reader.getVectorSchemaRoot // throws IOException
-      val unloader = new VectorUnloader(root)
-      reader.loadNextBatch() // throws IOException
-      unloader.getRecordBatch
-    } {
-      reader.close()
-    }
+    val in = new ByteArrayInputStream(batchBytes)
+    MessageSerializer.deserializeRecordBatch(
+      new ReadChannel(Channels.newChannel(in)), allocator) // throws IOException
  }

+  /**
+   * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches.
+   */
  private[sql] def toDataFrame(
-      payloadRDD: JavaRDD[Array[Byte]],
+      arrowBatchRDD: JavaRDD[Array[Byte]],
      schemaString: String,
      sqlContext: SQLContext): DataFrame = {
-    val rdd = payloadRDD.rdd.mapPartitions { iter =>
+    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
+    val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone
+    val rdd = arrowBatchRDD.rdd.mapPartitions { iter =>
      val context = TaskContext.get()
-      ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), context)
+      ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context)
    }
-    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
    sqlContext.internalCreateDataFrame(rdd, schema)
  }
+
+  /**
+   * Read a file as an Arrow stream and parallelize as an RDD of serialized ArrowRecordBatches.
+   */
+  private[sql] def readArrowStreamFromFile(
+      sqlContext: SQLContext,
+      filename: String): JavaRDD[Array[Byte]] = {
+    val fileStream = new FileInputStream(filename)
+    try {
+      // Create array so that we can safely close the file
+      val batches = getBatchesFromStream(fileStream.getChannel).toArray
+      // Parallelize the record batches to create an RDD
+      JavaRDD.fromRDD(sqlContext.sparkContext.parallelize(batches, batches.length))
--- End diff --

Ah, sorry. You are right. I misread.
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22192 Merged build finished. Test FAILed.
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22192 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95136/
[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22192 **[Test build #95136 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95136/testReport)** for PR 22192 at commit [`44454dd`](https://github.com/apache/spark/commit/44454dd586e35bdf16492c4a8969494bd3b7f8f5). * This patch **fails Java style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` .doc(\"Comma-separated list of class names for \"plugins\" implementing \" +`
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95128/
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test FAILed.
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #95128 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95128/testReport)** for PR 22112 at commit [`097092b`](https://github.com/apache/spark/commit/097092be4b2967689082af62715ecc4f78086c30). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22192: [SPARK-24918] Executor Plugin API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22192 **[Test build #95136 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95136/testReport)** for PR 22192 at commit [`44454dd`](https://github.com/apache/spark/commit/44454dd586e35bdf16492c4a8969494bd3b7f8f5).
[GitHub] spark issue #22192: [SPARK-24918] Executor Plugin API
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22192 Jenkins, ok to test
[GitHub] spark issue #21923: [SPARK-24918][Core] Executor Plugin api
Github user squito commented on the issue: https://github.com/apache/spark/pull/21923 this is being continued in https://github.com/apache/spark/pull/22192
[GitHub] spark pull request #21923: [SPARK-24918][Core] Executor Plugin api
Github user squito closed the pull request at: https://github.com/apache/spark/pull/21923
[GitHub] spark issue #22195: [SPARK-25205][CORE] Fix typo in spark.network.crypto.key...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22195 **[Test build #95135 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95135/testReport)** for PR 22195 at commit [`b927b94`](https://github.com/apache/spark/commit/b927b94ea1312ba74d73c203adb7683b2fb42fed).
[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21546#discussion_r212171980

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala ---
@@ -183,34 +178,106 @@ private[sql] object ArrowConverters {
  }

  /**
-   * Convert a byte array to an ArrowRecordBatch.
+   * Load a serialized ArrowRecordBatch.
   */
-  private[arrow] def byteArrayToBatch(
+  private[arrow] def loadBatch(
      batchBytes: Array[Byte],
      allocator: BufferAllocator): ArrowRecordBatch = {
-    val in = new ByteArrayReadableSeekableByteChannel(batchBytes)
-    val reader = new ArrowFileReader(in, allocator)
-
-    // Read a batch from a byte stream, ensure the reader is closed
-    Utils.tryWithSafeFinally {
-      val root = reader.getVectorSchemaRoot // throws IOException
-      val unloader = new VectorUnloader(root)
-      reader.loadNextBatch() // throws IOException
-      unloader.getRecordBatch
-    } {
-      reader.close()
-    }
+    val in = new ByteArrayInputStream(batchBytes)
+    MessageSerializer.deserializeRecordBatch(
+      new ReadChannel(Channels.newChannel(in)), allocator) // throws IOException
  }

+  /**
+   * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches.
+   */
  private[sql] def toDataFrame(
-      payloadRDD: JavaRDD[Array[Byte]],
+      arrowBatchRDD: JavaRDD[Array[Byte]],
      schemaString: String,
      sqlContext: SQLContext): DataFrame = {
-    val rdd = payloadRDD.rdd.mapPartitions { iter =>
+    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
+    val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone
+    val rdd = arrowBatchRDD.rdd.mapPartitions { iter =>
      val context = TaskContext.get()
-      ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), context)
+      ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context)
    }
-    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
    sqlContext.internalCreateDataFrame(rdd, schema)
  }
+
+  /**
+   * Read a file as an Arrow stream and parallelize as an RDD of serialized ArrowRecordBatches.
+   */
+  private[sql] def readArrowStreamFromFile(
+      sqlContext: SQLContext,
+      filename: String): JavaRDD[Array[Byte]] = {
+    val fileStream = new FileInputStream(filename)
+    try {
+      // Create array so that we can safely close the file
+      val batches = getBatchesFromStream(fileStream.getChannel).toArray
+      // Parallelize the record batches to create an RDD
+      JavaRDD.fromRDD(sqlContext.sparkContext.parallelize(batches, batches.length))
--- End diff --

So this is the length of the array of batches, not the number of records in a batch. The input is split according to the default parallelism config, so if that is 32, we will have an array of 32 batches and then parallelize those to 32 partitions. `parallelize` usually takes one big array of primitives as the first arg, which you then partition by the number in the second arg, but this is a little different since we are using batches. Does that answer your question? (See the sketch below.)
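A small standalone sketch of the semantics described above, with dummy byte arrays standing in for serialized Arrow batches; with `numSlices == batches.length`, `parallelize` yields exactly one batch per partition:

```scala
import org.apache.spark.sql.SparkSession

object BatchParallelizeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("batch-parallelize-demo").getOrCreate()

    // Stand-ins for serialized ArrowRecordBatches: 32 opaque byte arrays. Each element
    // is one whole batch, however many records it happens to contain.
    val batches: Array[Array[Byte]] = Array.fill(32)(Array.fill[Byte](16)(0))

    // One partition per batch: the second argument counts batches, not records.
    val rdd = spark.sparkContext.parallelize(batches.toSeq, batches.length)
    assert(rdd.getNumPartitions == batches.length)
    assert(rdd.mapPartitions(it => Iterator(it.size)).collect().forall(_ == 1))

    spark.stop()
  }
}
```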
[GitHub] spark issue #22195: [SPARK-25205][CORE] Fix typo in spark.network.crypto.key...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22195 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2471/
[GitHub] spark issue #22195: [SPARK-25205][CORE] Fix typo in spark.network.crypto.key...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22195 Merged build finished. Test PASSed.
[GitHub] spark issue #22161: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes for R sql...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22161 Ah, it's okie. Yes, please. Not a big deal.
[GitHub] spark pull request #22195: [CORE] Fix typo in spark.network.crypto.keyFactor...
GitHub user squito opened a pull request: https://github.com/apache/spark/pull/22195

[CORE] Fix typo in spark.network.crypto.keyFactoryIterations

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark SPARK-25205

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22195.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22195

commit b927b94ea1312ba74d73c203adb7683b2fb42fed
Author: Imran Rashid
Date:   2018-08-23T03:11:40Z

    [CORE] Fix typo in spark.network.crypto.keyFactoryIterations
[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22194 Merged to master.
[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21546#discussion_r212170997

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -3268,13 +3268,49 @@ class Dataset[T] private[sql](
  }

  /**
-   * Collect a Dataset as ArrowPayload byte arrays and serve to PySpark.
+   * Collect a Dataset as Arrow batches and serve stream to PySpark.
   */
  private[sql] def collectAsArrowToPython(): Array[Any] = {
+    val timeZoneId = sparkSession.sessionState.conf.sessionLocalTimeZone
+
    withAction("collectAsArrowToPython", queryExecution) { plan =>
-      val iter: Iterator[Array[Byte]] =
-        toArrowPayload(plan).collect().iterator.map(_.asPythonSerializable)
-      PythonRDD.serveIterator(iter, "serve-Arrow")
+      PythonRDD.serveToStream("serve-Arrow") { out =>
+        val batchWriter = new ArrowBatchStreamWriter(schema, out, timeZoneId)
+        val arrowBatchRdd = toArrowBatchRdd(plan)
+        val numPartitions = arrowBatchRdd.partitions.length
+
+        // Store collection results for worst case of 1 to N-1 partitions
--- End diff --

It's not necessary to buffer the first partition because it can be sent to Python right away, so we only need an array of size N-1 (sketched below).
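A hypothetical sketch of that buffering scheme (names invented, not the real Dataset code): partition results can arrive in any order, partition 0 streams through immediately, and later partitions wait in an (N-1)-slot buffer until all earlier ones have been written:

```scala
// `results(i - 1)` holds partition i while it waits; partition 0 is never buffered.
val numPartitions = 8
val results = new Array[Array[Byte]](numPartitions - 1)
var lastWritten = -1 // index of the last partition written to the output stream

def handlePartitionResult(index: Int, data: Array[Byte], write: Array[Byte] => Unit): Unit = {
  if (index - 1 == lastWritten) {
    write(data) // arrived in order: stream it straight through
    lastWritten = index
    // Flush any buffered partitions that are now next in line.
    while (lastWritten < numPartitions - 1 && results(lastWritten) != null) {
      write(results(lastWritten))
      results(lastWritten) = null
      lastWritten += 1
    }
  } else {
    results(index - 1) = data // out of order: buffer until its turn (index >= 1 here)
  }
}
```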
[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21546#discussion_r212171051

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala ---
@@ -183,34 +178,106 @@ private[sql] object ArrowConverters {
  }

  /**
-   * Convert a byte array to an ArrowRecordBatch.
+   * Load a serialized ArrowRecordBatch.
   */
-  private[arrow] def byteArrayToBatch(
+  private[arrow] def loadBatch(
      batchBytes: Array[Byte],
      allocator: BufferAllocator): ArrowRecordBatch = {
-    val in = new ByteArrayReadableSeekableByteChannel(batchBytes)
-    val reader = new ArrowFileReader(in, allocator)
-
-    // Read a batch from a byte stream, ensure the reader is closed
-    Utils.tryWithSafeFinally {
-      val root = reader.getVectorSchemaRoot // throws IOException
-      val unloader = new VectorUnloader(root)
-      reader.loadNextBatch() // throws IOException
-      unloader.getRecordBatch
-    } {
-      reader.close()
-    }
+    val in = new ByteArrayInputStream(batchBytes)
+    MessageSerializer.deserializeRecordBatch(
+      new ReadChannel(Channels.newChannel(in)), allocator) // throws IOException
  }

+  /**
+   * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches.
+   */
  private[sql] def toDataFrame(
-      payloadRDD: JavaRDD[Array[Byte]],
+      arrowBatchRDD: JavaRDD[Array[Byte]],
      schemaString: String,
      sqlContext: SQLContext): DataFrame = {
-    val rdd = payloadRDD.rdd.mapPartitions { iter =>
+    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
+    val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone
+    val rdd = arrowBatchRDD.rdd.mapPartitions { iter =>
      val context = TaskContext.get()
-      ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), context)
+      ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context)
    }
-    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
    sqlContext.internalCreateDataFrame(rdd, schema)
  }
+
+  /**
+   * Read a file as an Arrow stream and parallelize as an RDD of serialized ArrowRecordBatches.
+   */
+  private[sql] def readArrowStreamFromFile(
+      sqlContext: SQLContext,
+      filename: String): JavaRDD[Array[Byte]] = {
+    val fileStream = new FileInputStream(filename)
--- End diff --

yup, thanks for catching that
[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21546#discussion_r212170606

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala ---
@@ -183,34 +178,106 @@ private[sql] object ArrowConverters {
  }

  /**
-   * Convert a byte array to an ArrowRecordBatch.
+   * Load a serialized ArrowRecordBatch.
   */
-  private[arrow] def byteArrayToBatch(
+  private[arrow] def loadBatch(
      batchBytes: Array[Byte],
      allocator: BufferAllocator): ArrowRecordBatch = {
-    val in = new ByteArrayReadableSeekableByteChannel(batchBytes)
-    val reader = new ArrowFileReader(in, allocator)
-
-    // Read a batch from a byte stream, ensure the reader is closed
-    Utils.tryWithSafeFinally {
-      val root = reader.getVectorSchemaRoot // throws IOException
-      val unloader = new VectorUnloader(root)
-      reader.loadNextBatch() // throws IOException
-      unloader.getRecordBatch
-    } {
-      reader.close()
-    }
+    val in = new ByteArrayInputStream(batchBytes)
+    MessageSerializer.deserializeRecordBatch(
+      new ReadChannel(Channels.newChannel(in)), allocator) // throws IOException
  }

+  /**
+   * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches.
+   */
  private[sql] def toDataFrame(
-      payloadRDD: JavaRDD[Array[Byte]],
+      arrowBatchRDD: JavaRDD[Array[Byte]],
      schemaString: String,
      sqlContext: SQLContext): DataFrame = {
-    val rdd = payloadRDD.rdd.mapPartitions { iter =>
+    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
+    val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone
+    val rdd = arrowBatchRDD.rdd.mapPartitions { iter =>
      val context = TaskContext.get()
-      ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), context)
+      ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context)
    }
-    val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
    sqlContext.internalCreateDataFrame(rdd, schema)
  }
+
+  /**
+   * Read a file as an Arrow stream and parallelize as an RDD of serialized ArrowRecordBatches.
+   */
+  private[sql] def readArrowStreamFromFile(
+      sqlContext: SQLContext,
+      filename: String): JavaRDD[Array[Byte]] = {
+    val fileStream = new FileInputStream(filename)
+    try {
+      // Create array so that we can safely close the file
+      val batches = getBatchesFromStream(fileStream.getChannel).toArray
+      // Parallelize the record batches to create an RDD
+      JavaRDD.fromRDD(sqlContext.sparkContext.parallelize(batches, batches.length))
+    } finally {
+      fileStream.close()
+    }
+  }
+
+  /**
+   * Read an Arrow stream input and return an iterator of serialized ArrowRecordBatches.
+   */
+  private[sql] def getBatchesFromStream(in: SeekableByteChannel): Iterator[Array[Byte]] = {
+
+    // Create an iterator to get each serialized ArrowRecordBatch from a stream
+    new Iterator[Array[Byte]] {
+      var batch: Array[Byte] = readNextBatch()
+
+      override def hasNext: Boolean = batch != null
+
+      override def next(): Array[Byte] = {
+        val prevBatch = batch
+        batch = readNextBatch()
+        prevBatch
+      }
+
+      def readNextBatch(): Array[Byte] = {
+        val msgMetadata = MessageSerializer.readMessage(new ReadChannel(in))
+        if (msgMetadata == null) {
+          return null
+        }
+
+        // Get the length of the body, which has not be read at this point
+        val bodyLength = msgMetadata.getMessageBodyLength.toInt
+
+        // Only care about RecordBatch data, skip Schema and unsupported Dictionary messages
+        if (msgMetadata.getMessage.headerType() == MessageHeader.RecordBatch) {
+
+          // Create output backed by buffer to hold msg length (int32), msg metadata, msg body
+          val bbout = new ByteBufferOutputStream(4 + msgMetadata.getMessageLength + bodyLength)
--- End diff --

I'll add some more details about what this is doing.
[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...
Github user techaddict commented on the issue: https://github.com/apache/spark/pull/22194 @ueshin LGTM
[GitHub] spark pull request #22161: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes fo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22161
[GitHub] spark issue #22161: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes for R sql...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22161 @HyukjinKwon Oh, thank you. I was going to fix the style. Shall I include it the next time I fix something?
[GitHub] spark issue #22161: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes for R sql...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22161 LGTM. Merged to master.
[GitHub] spark pull request #22161: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes fo...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22161#discussion_r212168906

--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -3613,11 +3613,11 @@ test_that("Collect on DataFrame when NAs exists at the top of a timestamp column
 test_that("catalog APIs, currentDatabase, setCurrentDatabase, listDatabases", {
   expect_equal(currentDatabase(), "default")
   expect_error(setCurrentDatabase("default"), NA)
-  expect_error(setCurrentDatabase("foo"),
-               "Error in setCurrentDatabase : analysis error - Database 'foo' does not exist")
+  expect_error(setCurrentDatabase("zxwtyswklpf"),
+               "Error in setCurrentDatabase : analysis error - Database 'zxwtyswklpf' does not exist")
--- End diff --

Not a big deal, but can we fix this style too?
[GitHub] spark issue #22085: [SPARK-25095][PySpark] Python support for BarrierTaskCon...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22085 Thanks again, @jiangxb1987 and @mengxr.
[GitHub] spark pull request #22085: [SPARK-25095][PySpark] Python support for Barrier...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22085#discussion_r212168597

--- Diff: python/pyspark/taskcontext.py ---
@@ -95,3 +99,143 @@ def getLocalProperty(self, key):
         Get a local property set upstream in the driver, or None if it is missing.
         """
         return self._localProperties.get(key, None)
+
+
+BARRIER_FUNCTION = 1
+
+
+def _load_from_socket(port, auth_secret):
+    """
+    Load data from a given socket, this is a blocking method thus only return when the socket
+    connection has been closed.
+
+    This is copied from context.py, while modified the message protocol.
--- End diff --

It would be nicer if we could deduplicate it later.
[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22163#discussion_r212168161

--- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -206,14 +211,21 @@ private void writeSortedFile(boolean isLastFile) {
      long recordReadPosition = recordOffsetInPage + uaoSize; // skip over record length
      while (dataRemaining > 0) {
        final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
-        Platform.copyMemory(
-          recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
-        writer.write(writeBuffer, 0, toTransfer);
+        if (bufferOffset > 0 && bufferOffset + toTransfer > DISK_WRITE_BUFFER_SIZE) {
--- End diff --

Yes, you are right.
[GitHub] spark issue #22189: Correct missing punctuation in the documentation
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22189 It's okay, but would you mind taking another look to see if there are more typos and fixing them while we are here? I am pretty sure there are more.
[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...
Github user Ngone51 commented on a diff in the pull request: https://github.com/apache/spark/pull/22163#discussion_r212167438

--- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -206,14 +211,21 @@ private void writeSortedFile(boolean isLastFile) {
      long recordReadPosition = recordOffsetInPage + uaoSize; // skip over record length
      while (dataRemaining > 0) {
        final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
-        Platform.copyMemory(
-          recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
-        writer.write(writeBuffer, 0, toTransfer);
+        if (bufferOffset > 0 && bufferOffset + toTransfer > DISK_WRITE_BUFFER_SIZE) {
--- End diff --

> The numRecordsWritten in DiskBlockObjectWriter is still correct during the process after this PR

The number is correct, but it is not consistent with what really happens compared to the current behaviour. But, as you said, we get the correct result at the end, so it may not be a big deal.
[GitHub] spark issue #22191: [SPARK-25204][SS] Fix race in rate source test.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22191 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95126/
[GitHub] spark issue #22191: [SPARK-25204][SS] Fix race in rate source test.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22191 Merged build finished. Test PASSed.
[GitHub] spark issue #22191: [SPARK-25204][SS] Fix race in rate source test.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22191 **[Test build #95126 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95126/testReport)** for PR 22191 at commit [`ade656a`](https://github.com/apache/spark/commit/ade656a3cf3edb32d80e66d03276584c0c86c4d0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22163#discussion_r212166322

--- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -206,14 +211,21 @@ private void writeSortedFile(boolean isLastFile) {
      long recordReadPosition = recordOffsetInPage + uaoSize; // skip over record length
      while (dataRemaining > 0) {
        final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
-        Platform.copyMemory(
-          recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
-        writer.write(writeBuffer, 0, toTransfer);
+        if (bufferOffset > 0 && bufferOffset + toTransfer > DISK_WRITE_BUFFER_SIZE) {
--- End diff --

The `numRecordsWritten` in `DiskBlockObjectWriter` is still correct during the process after this PR.
[GitHub] spark issue #22187: [SPARK-25178][SQL] Directly ship the StructType objects ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/22187 +1 LGTM
[GitHub] spark pull request #22171: [SPARK-25177][SQL] When dataframe decimal type co...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22171#discussion_r212165658

--- Diff: sql/core/src/test/resources/sql-tests/results/literals.sql.out ---
@@ -197,7 +197,7 @@ select .e3
 -- !query 20
 select 1E309, -1E309
 -- !query 20 schema
-struct<1E+309:decimal(1,-309),-1E+309:decimal(1,-309)>
+struct<10:decimal(1,-309),-10:decimal(1,-309)>
--- End diff --

Hmm, this seems like a bad representation.
[GitHub] spark pull request #22171: [SPARK-25177][SQL] When dataframe decimal type co...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22171#discussion_r212165521

--- Diff: sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out ---
@@ -201,6 +201,7 @@
 struct<>
 -- !query 20 output
+
--- End diff --

I think this was submitted by mistake.
[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/22163#discussion_r212165385

--- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -206,14 +211,21 @@ private void writeSortedFile(boolean isLastFile) {
      long recordReadPosition = recordOffsetInPage + uaoSize; // skip over record length
      while (dataRemaining > 0) {
        final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
-        Platform.copyMemory(
-          recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
-        writer.write(writeBuffer, 0, toTransfer);
+        if (bufferOffset > 0 && bufferOffset + toTransfer > DISK_WRITE_BUFFER_SIZE) {
--- End diff --

Yeah, this result of `_bytesWritten` was not updated synchronously before either; you can see this condition: `if (numRecordsWritten % 16384 == 0)`. But we do not need to worry: the final result is correct, because it will be updated in `commitAndGet`. (A sketch of the batching idea follows below.)
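For readers following the thread, a minimal sketch (in Scala, not the actual Java sorter code) of the write-batching idea this PR applies: copy chunks into the buffer and call the writer only when the next chunk would overflow it, rather than once per chunk. The buffer size and the `flush` callback stand in for the real sorter's `writeBuffer` and `DiskBlockObjectWriter.write`:

```scala
object BufferedChunkWriter {
  final val DiskWriteBufferSize = 1024 * 1024 // illustrative size
  private val buffer = new Array[Byte](DiskWriteBufferSize)
  private var bufferOffset = 0

  // Each incoming chunk is at most DiskWriteBufferSize bytes, mirroring
  // toTransfer = min(diskWriteBufferSize, dataRemaining) in the sorter.
  def writeChunk(chunk: Array[Byte], flush: (Array[Byte], Int) => Unit): Unit = {
    if (bufferOffset > 0 && bufferOffset + chunk.length > DiskWriteBufferSize) {
      flush(buffer, bufferOffset) // buffer would overflow: emit one batched write
      bufferOffset = 0
    }
    System.arraycopy(chunk, 0, buffer, bufferOffset, chunk.length)
    bufferOffset += chunk.length
  }

  // Emit whatever is still buffered; must be called before closing the writer.
  def finish(flush: (Array[Byte], Int) => Unit): Unit = {
    if (bufferOffset > 0) {
      flush(buffer, bufferOffset)
      bufferOffset = 0
    }
  }
}
```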
[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/Hive t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22153 **[Test build #95134 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95134/testReport)** for PR 22153 at commit [`82190ac`](https://github.com/apache/spark/commit/82190accc6182988dfae00a07a656b069aa7b708). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/Hive t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22153 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2470/ Test PASSed.
[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/Hive t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22153 Merged build finished. Test PASSed.
[GitHub] spark issue #22180: [SPARK-25174][YARN]Limit the size of diagnostic message ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22180 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95132/ Test PASSed.
[GitHub] spark issue #22180: [SPARK-25174][YARN]Limit the size of diagnostic message ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22180 Merged build finished. Test PASSed.
[GitHub] spark issue #22180: [SPARK-25174][YARN]Limit the size of diagnostic message ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22180 **[Test build #95132 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95132/testReport)** for PR 22180 at commit [`3271c3f`](https://github.com/apache/spark/commit/3271c3f4100e0d69fe30400a42ab35aaab1c7c48). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22187: [SPARK-25178][SQL] Directly ship the StructType o...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22187#discussion_r212164362

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/RowBasedHashMapGenerator.scala ---
@@ -44,31 +44,8 @@ class RowBasedHashMapGenerator(
     groupingKeySchema, bufferSchema) {

   override protected def initializeAggregateHashMap(): String = {
-    val generatedKeySchema: String =
-      s"new org.apache.spark.sql.types.StructType()" +
-        groupingKeySchema.map { key =>
-          val keyName = ctx.addReferenceObj("keyName", key.name)
-          key.dataType match {
-            case d: DecimalType =>
-              s""".add($keyName, org.apache.spark.sql.types.DataTypes.createDecimalType(
-                 |${d.precision}, ${d.scale}))""".stripMargin
-            case _ =>
-              s""".add($keyName, org.apache.spark.sql.types.DataTypes.${key.dataType})"""
-          }
-        }.mkString("\n").concat(";")
-
-    val generatedValueSchema: String =
-      s"new org.apache.spark.sql.types.StructType()" +
-        bufferSchema.map { key =>
-          val keyName = ctx.addReferenceObj("keyName", key.name)
-          key.dataType match {
-            case d: DecimalType =>
-              s""".add($keyName, org.apache.spark.sql.types.DataTypes.createDecimalType(
-                 |${d.precision}, ${d.scale}))""".stripMargin
-            case _ =>
-              s""".add($keyName, org.apache.spark.sql.types.DataTypes.${key.dataType})"""
-          }
-        }.mkString("\n").concat(";")
+    val generatedKeySchema = ctx.addReferenceObj("generatedKeySchemaTerm", groupingKeySchema)
--- End diff --

nit: the variable name sounds strange because the schema is no longer generated.
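For reference, a small sketch of what `addReferenceObj` is doing in the new one-liner (this requires a Spark dependency on the classpath, and the exact shape of the returned fragment is an assumption that may vary by version): the `StructType` instance is stored in the generated class's `references` array, and the returned string is a Java expression that reads it back, so no per-field `.add(...)` code has to be generated and compiled.

```scala
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
import org.apache.spark.sql.types.StructType

val ctx = new CodegenContext
val schema = StructType.fromDDL("k INT, v STRING")

// Register the object once; the result is a code fragment, not the object.
val term: String = ctx.addReferenceObj("keySchema", schema)

// term is a fragment along the lines of:
//   ((org.apache.spark.sql.types.StructType) references[0] /* keySchema */)
// and can be spliced into generated Java wherever the schema is needed.
println(term)
```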
[GitHub] spark issue #22171: [SPARK-25177][SQL] When dataframe decimal type column ha...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22171 @rxin, I recall https://github.com/apache/spark/pull/14560, where we used Postgres as a reference. WDYT?
[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22194 Merged build finished. Test PASSed.
[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22194 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2469/ Test PASSed.
[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22153

My bad, this PR doesn't affect cached tables in the web UI, so I'll drop those. It actually affects Hive tables and RDDs only:

```
// Hive table case
sql("CREATE TABLE t(c1 int) USING hive")
sql("INSERT INTO t VALUES(1)")
spark.table("t").show()

// RDD case
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val data = spark.sparkContext.parallelize(Row(1, "abc") :: Nil).setName("existing RDD1")
val df = spark.createDataFrame(data, StructType.fromDDL("c0 int, c1 string"))
df.show()
```

Screenshots:
- spark-v2.3.1 for Hive tables: https://user-images.githubusercontent.com/692303/44500677-cb55d180-a6c4-11e8-97e9-25b88b351b0a.png
- master w/ this PR for Hive tables: https://user-images.githubusercontent.com/692303/44500676-cb55d180-a6c4-11e8-9602-1cfbea6d8267.png
- spark-v2.3.1 for RDDs: https://user-images.githubusercontent.com/692303/44500731-05bf6e80-a6c5-11e8-83dd-ed7f1ab2d658.png
- master w/ this PR for RDDs: https://user-images.githubusercontent.com/692303/44500741-11ab3080-a6c5-11e8-8c18-e1cc66be0f09.png
[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22194 cc @techaddict
[GitHub] spark issue #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22194 **[Test build #95133 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95133/testReport)** for PR 22194 at commit [`967360a`](https://github.com/apache/spark/commit/967360a1a417739cdada3b7c7334c8ca87ede6a6).
[GitHub] spark issue #22185: [SPARK-25127] DataSourceV2: Remove SupportsPushDownCatal...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22185 +1
[GitHub] spark pull request #22194: [SPARK-23932][SQL][FOLLOW-UP] Fix an example of z...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/22194

[SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with function.

## What changes were proposed in this pull request?

This is a follow-up PR of #22031, which added the `zip_with` function, to fix one of its examples.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-23932/fix_examples

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22194.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22194

commit 967360a1a417739cdada3b7c7334c8ca87ede6a6
Author: Takuya UESHIN
Date: 2018-08-23T01:59:55Z

    Fix an example.
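For context, `zip_with` merges two arrays element-wise through a lambda. A minimal invocation, assuming a Spark 2.4+ session named `spark` (illustrative only; not necessarily the example this PR fixes):

```scala
spark.sql(
  "SELECT zip_with(array(1, 2, 3), array(4, 5, 6), (x, y) -> x + y) AS sums"
).show()
// +---------+
// |     sums|
// +---------+
// |[5, 7, 9]|
// +---------+
```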
[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...
Github user Ngone51 commented on a diff in the pull request: https://github.com/apache/spark/pull/22163#discussion_r212163785

--- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -206,14 +211,21 @@ private void writeSortedFile(boolean isLastFile) {
       long recordReadPosition = recordOffsetInPage + uaoSize; // skip over record length
       while (dataRemaining > 0) {
         final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
-        Platform.copyMemory(
-          recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
-        writer.write(writeBuffer, 0, toTransfer);
+        if (bufferOffset > 0 && bufferOffset + toTransfer > DISK_WRITE_BUFFER_SIZE) {
--- End diff --

Yeah, I agree there's no difference in the final result. But `writeMetrics` in `DiskBlockObjectWriter` would be incorrect during the process: not only `numRecordsWritten`, but also `_bytesWritten` (which is only counted correctly when `writer.write()` is called; see `recordWritten#updateBytesWritten` for details).
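A stripped-down sketch of the metric behavior described in this thread, paraphrasing the sampling condition quoted above rather than the verbatim `DiskBlockObjectWriter` source: bytes-written is derived from the stream position, which is relatively expensive to query, so it is only refreshed every 16384 records; in between, the metric can lag, and the commit path does one final refresh so the total comes out correct.

```scala
class MetricsSketch(position: () => Long) {
  private var numRecordsWritten = 0L
  private var bytesWritten = 0L

  def recordWritten(): Unit = {
    numRecordsWritten += 1
    if (numRecordsWritten % 16384 == 0) {
      bytesWritten = position()   // periodic refresh only; may be stale between samples
    }
  }

  def commitAndGet(): Long = {
    bytesWritten = position()     // final refresh makes the reported total accurate
    bytesWritten
  }
}
```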
[GitHub] spark issue #22164: [SPARK-23679][YARN] Fix AmIpFilter cannot work in RM HA ...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/22164 Gently ping again @vanzin @tgravescs. Thanks!
[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21546 LGTM otherwise.