[GitHub] spark issue #21942: [SPARK-24283][ML] Make ml.StandardScaler skip conversion...
Github user sujithjay commented on the issue: https://github.com/apache/spark/pull/21942 OK to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21439 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94666/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21439 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21439 **[Test build #94666 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94666/testReport)** for PR 21439 at commit [`74a7799`](https://github.com/apache/spark/commit/74a779964b666b36b36a65b2cdd4b47d9df1e04c). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...
Github user sddyljsx commented on the issue: https://github.com/apache/spark/pull/21859 I think I need another retest . Please help. @viirya --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22078 Number files | Delete directory elapsed time(milliseconds) | Truncate directory elapsed time(milliseconds) -- | -- | -- 200 | 7 | 319 5000 | 48 | 7810 1 | 90 | 16427 10 | 736 | 194470 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22079 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94665/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22079 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22079 **[Test build #94665 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94665/testReport)** for PR 22079 at commit [`8d2d558`](https://github.com/apache/spark/commit/8d2d5585b2c2832cd4d88b3851607ce15180cca5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22001: [SPARK-24819][CORE] Fail fast when no enough slots to la...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22001 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22001: [SPARK-24819][CORE] Fail fast when no enough slots to la...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22001 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94658/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22001: [SPARK-24819][CORE] Fail fast when no enough slots to la...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22001 **[Test build #94658 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94658/testReport)** for PR 22001 at commit [`9d4e232`](https://github.com/apache/spark/commit/9d4e232a13d5e9098c9cbc1c1d9004eff32dd6e5). * This patch **fails from timeout after a configured wait of \`340m\`**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22087: [SPARK-25097][ML] Support prediction on single instance ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22087 **[Test build #94671 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94671/testReport)** for PR 22087 at commit [`f6cbb46`](https://github.com/apache/spark/commit/f6cbb46f46f039d6168de382a18018a37e0e3ee7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22001: [SPARK-24819][CORE] Fail fast when no enough slots to la...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22001 **[Test build #94672 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94672/testReport)** for PR 22001 at commit [`9d4e232`](https://github.com/apache/spark/commit/9d4e232a13d5e9098c9cbc1c1d9004eff32dd6e5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22087: [SPARK-25097][ML] Support prediction on single instance ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22087 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2117/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22001: [SPARK-24819][CORE] Fail fast when no enough slots to la...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22001 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22087: [SPARK-25097][ML] Support prediction on single instance ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22087 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22001: [SPARK-24819][CORE] Fail fast when no enough slots to la...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22001 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2118/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22001: [SPARK-24819][CORE] Fail fast when no enough slots to la...
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/22001 test this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22087: [SPARK-25097][Support prediction on single instan...
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/22087 [SPARK-25097][Support prediction on single instance in KMeans/BiKMeans/GMM] Support prediction on single instance in KMeans/BiKMeans/GMM ## What changes were proposed in this pull request? expose method `predict` in KMeans/BiKMeans/GMM ## How was this patch tested? NA You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhengruifeng/spark clu_pre_instance Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22087.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22087 commit f6cbb46f46f039d6168de382a18018a37e0e3ee7 Author: zhengruifeng Date: 2018-08-13T06:10:35Z init --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks w...
Github user eyalfa commented on a diff in the pull request: https://github.com/apache/spark/pull/21369#discussion_r209499005 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala --- @@ -585,17 +592,15 @@ class ExternalAppendOnlyMap[K, V, C]( } else { logInfo(s"Task ${context.taskAttemptId} force spilling in-memory map to disk and " + s"it will release ${org.apache.spark.util.Utils.bytesToString(getUsed())} memory") -nextUpstream = spillMemoryIteratorToDisk(upstream) +val nextUpstream = spillMemoryIteratorToDisk(upstream) +assert(!upstream.hasNext) hasSpilled = true +upstream = nextUpstream --- End diff -- @cloud-fan , do you think this is worth doing, I'm referring to the CompletionIterator delaying GC of the sub iterator and cleanup function (usually a closure referring to a larger collection). if so, I'd open a separate JIRA+PR for this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21859 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21859 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94662/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21859 **[Test build #94662 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94662/testReport)** for PR 21859 at commit [`46bab16`](https://github.com/apache/spark/commit/46bab165af68c1ef2dd1fc57e7f27f5d27c72015). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22082: [SPARK-24420][Build][FOLLOW-UP] Upgrade ASM6 APIs
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22082 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks when spi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21369 **[Test build #94670 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94670/testReport)** for PR 21369 at commit [`855854a`](https://github.com/apache/spark/commit/855854af4762057438b996c58639a38c773ac0bb). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks when spi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21369 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks when spi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21369 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2116/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/21561#discussion_r209498032 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- @@ -151,13 +152,9 @@ class BisectingKMeans private ( this } - /** - * Runs the bisecting k-means algorithm. - * @param input RDD of vectors - * @return model for the bisecting kmeans - */ - @Since("1.6.0") - def run(input: RDD[Vector]): BisectingKMeansModel = { + + private[spark] def run(input: RDD[Vector], + instr: Option[Instrumentation]): BisectingKMeansModel = { --- End diff -- `instrumented` will create a new `Instrumentation`, and `instrumented` is only used in ml When mllib's impls is called, the `Instrumentation` will be passed as a parameters, like what KMeans does (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L362). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22082: [SPARK-24420][Build][FOLLOW-UP] Upgrade ASM6 APIs
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22082 Merged into master. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22077: [SPARK-25084][SQL][BACKPORT-2.3] "distribute by" ...
Github user LantaoJin closed the pull request at: https://github.com/apache/spark/pull/22077 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21561 **[Test build #94669 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94669/testReport)** for PR 21561 at commit [`fb3ff2b`](https://github.com/apache/spark/commit/fb3ff2b1f6cfa3936f2aa3901be844555d33887e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21561 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2115/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/GMM/AFT/...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21561 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/21561#discussion_r209496789 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala --- @@ -157,11 +157,15 @@ class NaiveBayes @Since("1.5.0") ( instr.logNumFeatures(numFeatures) val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)) +val countAccum = dataset.sparkSession.sparkContext.longAccumulator + // Aggregates term frequencies per label. // TODO: Calling aggregateByKey and collect creates two stages, we can implement something // TODO: similar to reduceByKeyLocally to save one stage. val aggregated = dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd - .map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2))) + .map { row => +countAccum.add(1L) --- End diff -- This should work correctly, however, to guarantee the correctness, I update the pr to compute the number without Accumulator --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21933: [SPARK-24917][CORE] make chunk size configurable
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21933#discussion_r209496412 --- Diff: core/src/main/scala/org/apache/spark/serializer/SerializerManager.scala --- @@ -70,6 +70,8 @@ private[spark] class SerializerManager( private[this] val compressRdds = conf.getBoolean("spark.rdd.compress", false) // Whether to compress shuffle output temporarily spilled to disk private[this] val compressShuffleSpill = conf.getBoolean("spark.shuffle.spill.compress", true) + // Size of the chunks to be used in the ChunkedByteBuffer + private[this] val chunkSizeMb = conf.getSizeAsMb("spark.memory.serializer.chunkSize", "4m").toInt --- End diff -- @vincent-grosbois WDTY about this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22025: SPARK-25043: print master and appId from spark-sql on st...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22025 @abellina, how is the output like? mind if I ask put it in the PR description just to make the review easier if that's not hard? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22082: [SPARK-24420][Build][FOLLOW-UP] Upgrade ASM6 APIs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22082 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94660/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22082: [SPARK-24420][Build][FOLLOW-UP] Upgrade ASM6 APIs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22082 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22082: [SPARK-24420][Build][FOLLOW-UP] Upgrade ASM6 APIs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22082 **[Test build #94660 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94660/testReport)** for PR 22082 at commit [`2666500`](https://github.com/apache/spark/commit/266650006ed1f5d19d6eaf24d7058ca341457039). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22075 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22075 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94663/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22075 **[Test build #94663 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94663/testReport)** for PR 22075 at commit [`4b48a39`](https://github.com/apache/spark/commit/4b48a39238ab80f2bd1ebb36fd653ecc6495e492). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22085: [SPARK-25095][PySpark] Python support for Barrier...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22085#discussion_r209491244 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala --- @@ -180,7 +183,42 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( dataOut.writeInt(partitionIndex) // Python version of driver PythonRDD.writeUTF(pythonVer, dataOut) +// Init a GatewayServer to port current BarrierTaskContext to Python side. +val isBarrier = context.isInstanceOf[BarrierTaskContext] +val secret = if (isBarrier) { + Utils.createSecret(env.conf) +} else { + "" +} +val gatewayServer: Option[GatewayServer] = if (isBarrier) { + Some(new GatewayServer.GatewayServerBuilder() +.entryPoint(context.asInstanceOf[BarrierTaskContext]) +.authToken(secret) +.javaPort(0) +.callbackClient(GatewayServer.DEFAULT_PYTHON_PORT, GatewayServer.defaultAddress(), + secret) +.build()) --- End diff -- If this should necessarily target 2.4.0, don't block by me since it's a new feature and probably we could consider another approach later but if we can avoid, I would suggest to avoid for now. Let me try to track the design doc and changes about this. I think I need more time to check why it happened like this and if there's another way. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22085: [SPARK-25095][PySpark] Python support for Barrier...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22085#discussion_r209491060 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala --- @@ -180,7 +183,42 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( dataOut.writeInt(partitionIndex) // Python version of driver PythonRDD.writeUTF(pythonVer, dataOut) +// Init a GatewayServer to port current BarrierTaskContext to Python side. +val isBarrier = context.isInstanceOf[BarrierTaskContext] +val secret = if (isBarrier) { + Utils.createSecret(env.conf) +} else { + "" +} +val gatewayServer: Option[GatewayServer] = if (isBarrier) { + Some(new GatewayServer.GatewayServerBuilder() +.entryPoint(context.asInstanceOf[BarrierTaskContext]) +.authToken(secret) +.javaPort(0) +.callbackClient(GatewayServer.DEFAULT_PYTHON_PORT, GatewayServer.defaultAddress(), + secret) +.build()) --- End diff -- Yea, I read and understood if this is only initialised when the context is a `BarrierTaskContext` but this is super weird we start another Java gateway here. If it's a hard requirement, then I suspect the design issue. Should this be targeted to 2.4.0, @mengxr? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22085: [SPARK-25095][PySpark] Python support for Barrier...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/22085#discussion_r209490553 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala --- @@ -180,7 +183,42 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( dataOut.writeInt(partitionIndex) // Python version of driver PythonRDD.writeUTF(pythonVer, dataOut) +// Init a GatewayServer to port current BarrierTaskContext to Python side. +val isBarrier = context.isInstanceOf[BarrierTaskContext] +val secret = if (isBarrier) { + Utils.createSecret(env.conf) +} else { + "" +} +val gatewayServer: Option[GatewayServer] = if (isBarrier) { + Some(new GatewayServer.GatewayServerBuilder() +.entryPoint(context.asInstanceOf[BarrierTaskContext]) +.authToken(secret) +.javaPort(0) +.callbackClient(GatewayServer.DEFAULT_PYTHON_PORT, GatewayServer.defaultAddress(), + secret) +.build()) --- End diff -- We have to port `BarrierTaskContext` from java to python side, otherwise there is no way to call `BarrierTaskContext.barrier()` from python side. Thus, of course, the JavaGateway is only initiated when the context is a `BarrierTaskContext`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to ar...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22075#discussion_r209489692 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -90,6 +90,20 @@ trait CheckAnalysis extends PredicateHelper { u.failAnalysis(s"Table or view not found: ${u.tableIdentifier}") case operator: LogicalPlan => +// Check argument data types of higher-order functions downwards first. +// If the arguments of the higher-order functions are resolved but the type check fails, +// the argument functions will not get resolved, but we should report the argument type --- End diff -- ah i see, makes sense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22075 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22075 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94657/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22075 **[Test build #94657 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94657/testReport)** for PR 22075 at commit [`388c2d3`](https://github.com/apache/spark/commit/388c2d3d812bf749ddf9de029432eab729bcc932). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait HigherOrderFunction extends Expression with ExpectsInputTypes ` * `trait SimpleHigherOrderFunction extends HigherOrderFunction ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to ar...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22075#discussion_r209488846 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -90,6 +90,20 @@ trait CheckAnalysis extends PredicateHelper { u.failAnalysis(s"Table or view not found: ${u.tableIdentifier}") case operator: LogicalPlan => +// Check argument data types of higher-order functions downwards first. +// If the arguments of the higher-order functions are resolved but the type check fails, +// the argument functions will not get resolved, but we should report the argument type --- End diff -- I'm worried that if we say only `functions`, we might be confused whether the "function" means the higher-order function or the function as an argument. WDYT? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to ar...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22075#discussion_r209488516 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -90,6 +90,20 @@ trait CheckAnalysis extends PredicateHelper { u.failAnalysis(s"Table or view not found: ${u.tableIdentifier}") case operator: LogicalPlan => +// Check argument data types of higher-order functions downwards first. +// If the arguments of the higher-order functions are resolved but the type check fails, +// the argument functions will not get resolved, but we should report the argument type --- End diff -- actually, I think it's clearer to say `functions`, instead of `argument functions`. Sorry for the back and forth. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17520 Build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17520 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94661/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17520 **[Test build #94661 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94661/testReport)** for PR 17520 at commit [`b923bd5`](https://github.com/apache/spark/commit/b923bd5b2c79a84ada834a32b756ad0da80f12c6). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds the following public classes _(experimental)_: * `case class SparkListenerBlockManagerAdded(` * `class StorageStatus(` * `public final class JavaStructuredSessionization ` * ` public static class LineWithTimestamp implements Serializable ` * ` public static class Event implements Serializable ` * ` public static class SessionInfo implements Serializable ` * ` public static class SessionUpdate implements Serializable ` * `case class Event(sessionId: String, timestamp: Timestamp)` * `case class SessionInfo(` * `case class SessionUpdate(` * `class Correlation(object):` * `case class UnresolvedMapObjects(` * `case class AssertNotNull(child: Expression, walkedTypePath: Seq[String] = Nil)` * `case class StarSchemaDetection(conf: SQLConf) extends PredicateHelper ` * ` * Helper case class to hold (plan, rowCount) pairs.` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22009 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2114/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22009 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22075 **[Test build #94668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94668/testReport)** for PR 22075 at commit [`deee1dc`](https://github.com/apache/spark/commit/deee1dcef6d7cbde516fa082e4210261ff89b8ff). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22086: [SPARK-25096][SQL] Loosen nullability if the cast is for...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22086 good catch! LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22075 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2113/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22075 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22009 **[Test build #94667 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94667/testReport)** for PR 22009 at commit [`f4f85a8`](https://github.com/apache/spark/commit/f4f85a833ef319a6860134e12655574aca081ed6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22009 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22075 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to ar...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22075#discussion_r209487687 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -90,9 +90,10 @@ trait CheckAnalysis extends PredicateHelper { u.failAnalysis(s"Table or view not found: ${u.tableIdentifier}") case operator: LogicalPlan => -// Check argument data types of higher-order functions downwards first because function -// arguments of the higher-order functions might be unresolved due to the unresolved -// argument data types, otherwise always claims the function arguments are unresolved. +// Check argument data types of higher-order functions downwards first. +// If the arguments of the higher-order functions are resolved but the type check fails, +// the argument functions will not get resolved, but we should report the argument type +// check failure instead of claiming the function arguments are unresolved. --- End diff -- "function arguments" -> "argument functions"? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21859: [SPARK-24900][SQL]Speed up sort when the dataset ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21859#discussion_r209486478 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala --- @@ -294,7 +296,12 @@ object ShuffleExchangeExec { sorter.sort(iter.asInstanceOf[Iterator[UnsafeRow]]) } } else { -rdd +part match { + case partitioner: RangePartitioner[InternalRow @unchecked, _] --- End diff -- Ok. I see. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21439 LGTM too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21859: [SPARK-24900][SQL]Speed up sort when the dataset ...
Github user sddyljsx commented on a diff in the pull request: https://github.com/apache/spark/pull/21859#discussion_r209486199 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala --- @@ -294,7 +296,12 @@ object ShuffleExchangeExec { sorter.sort(iter.asInstanceOf[Iterator[UnsafeRow]]) } } else { -rdd +part match { + case partitioner: RangePartitioner[InternalRow @unchecked, _] --- End diff -- yes, but ``` def parallelize[T: ClassTag]( seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] = withScope { assertNotStopped() new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]()) } ``` the parallelize function needs this ClassTag, so we must match it here. I tried to match RangePartitioner[_, _], but there is an error: ``` Error:(302, 37) No ClassTag available for _ Error occurred in an application involving default arguments. sparkContext.parallelize(partitioner.getSampledArray.toSeq, rdd.getNumPartitions) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21439 **[Test build #94666 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94666/testReport)** for PR 21439 at commit [`74a7799`](https://github.com/apache/spark/commit/74a779964b666b36b36a65b2cdd4b47d9df1e04c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21439 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22079 **[Test build #94665 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94665/testReport)** for PR 22079 at commit [`8d2d558`](https://github.com/apache/spark/commit/8d2d5585b2c2832cd4d88b3851607ce15180cca5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...
Github user bersprockets commented on the issue: https://github.com/apache/spark/pull/22079 Hmmm... I somehow managed to break SparkR tests but fixing a comment. It seems to have auto-retried and broke the second time too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22017: [SPARK-23938][SQL] Add map_zip_with function
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22017#discussion_r209485487 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -231,6 +231,15 @@ object TypeCoercion { }) } + /** + * Similar to [[findTightestCommonType]] but with string promotion. + */ + def findWiderTypeForTwoExceptDecimals(t1: DataType, t2: DataType): Option[DataType] = { --- End diff -- I submitted a pr #22086. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...
Github user bersprockets commented on the issue: https://github.com/apache/spark/pull/22079 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22086: [SPARK-25096][SQL] Loosen nullability if the cast is for...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22086 cc @cloud-fan @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22086: [SPARK-25096][SQL] Loosen nullability if the cast is for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22086 **[Test build #94664 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94664/testReport)** for PR 22086 at commit [`f480354`](https://github.com/apache/spark/commit/f48035479c248b85519f1bbc7f5b7661c66c0e67). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22086: [SPARK-25096][SQL] Loosen nullability if the cast is for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22086 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22086: [SPARK-25096][SQL] Loosen nullability if the cast is for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22086 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2112/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22086: [SPARK-25096][SQL] Loosen nullability if the cast...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/22086 [SPARK-25096][SQL] Loosen nullability if the cast is force-nullable. ## What changes were proposed in this pull request? In type coercion for complex types, if the found type is force-nullable to cast, we should loosen the nullability to be able to cast. Also for map key type, we can't use the type. ## How was this patch tested? Added some test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ueshin/apache-spark issues/SPARK-25096/fix_type_coercion Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22086.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22086 commit f48035479c248b85519f1bbc7f5b7661c66c0e67 Author: Takuya UESHIN Date: 2018-08-13T03:28:21Z Loosen nullability if the cast is force-nullable. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22084: [SPARK-25026][BUILD] Binary releases should conta...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22084#discussion_r209483839 --- Diff: dev/make-distribution.sh --- @@ -188,6 +190,23 @@ if [ -f "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar cp "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar "$DISTDIR/yarn" fi +# Only copy external jars if built +if [ -f "$SPARK_HOME"/external/avro/target/spark-avro_${SCALA_VERSION}-${VERSION}.jar ]; then + cp "$SPARK_HOME"/external/avro/target/spark-avro_${SCALA_VERSION}-${VERSION}.jar "$DISTDIR/external/jars/" +fi +if [ -f "$SPARK_HOME"/external/kafka-0-10/target/spark-streaming-kafka-0-10_${SCALA_VERSION}-${VERSION}.jar ]; then + cp "$SPARK_HOME"/external/kafka-0-10/target/spark-streaming-kafka-0-10_${SCALA_VERSION}-${VERSION}.jar "$DISTDIR/external/jars/" --- End diff -- I didn't want to include kinesis or ganglia because those would entail including OSS with licenses we can't redistribute. The existence of the modules is already a gray area. Let me look into what is built into these JARs. Some things like kafka we don't want to include, but do want to include kafka-client, yeah. We don't want to include Spark either for example. Yeah it's a reasonable argument, that nobody would use these directly anyway. The same could be said of some other JARs in the distro. Really the purpose of jars/ here is to support running in standalone mode. That is, most vendor distros would have spark-streaming-kafka on the classpath for you anyway, but, standalone doesn't. Standalone still won't pick up these new JARs because they're in external/jars/, but at least they're there at all, to be moved into jars/ if you cared to, for local deployment. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22075 **[Test build #94663 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94663/testReport)** for PR 22075 at commit [`4b48a39`](https://github.com/apache/spark/commit/4b48a39238ab80f2bd1ebb36fd653ecc6495e492). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22075 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22075: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22075 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2111/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21439 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22079 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94656/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22079 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21439 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94659/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21439 **[Test build #94659 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94659/testReport)** for PR 21439 at commit [`74a7799`](https://github.com/apache/spark/commit/74a779964b666b36b36a65b2cdd4b47d9df1e04c). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22079: [SPARK-23207][SQL][BACKPORT-2.2] Shuffle+Repartition on ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22079 **[Test build #94656 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94656/testReport)** for PR 22079 at commit [`8d2d558`](https://github.com/apache/spark/commit/8d2d5585b2c2832cd4d88b3851607ce15180cca5). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22005: [SPARK-16817][CORE][WIP] Use Alluxio to improve stabilit...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/22005 I believe such kind of PR requires SPIP and community discussion first. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22036: [SPARK-25028][SQL] Avoid NPE when analyzing parti...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22036#discussion_r209481430 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala --- @@ -204,6 +204,24 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared } } + test("SPARK-25028: column stats collection for null partitioning columns") { +val table = "analyze_partition_with_null" +withTempDir { dir => + withTable(table) { +sql(s""" + |CREATE TABLE $table (name string, value string) + |USING PARQUET + |PARTITIONED BY (name) + |LOCATION '${dir.toURI}'""".stripMargin) +val df = Seq(("a", null), ("b", null)).toDF("value", "name") --- End diff -- when creating the table, we can put partition column at the end, to avoid this confusion. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22077: [SPARK-25084][SQL][BACKPORT-2.3] "distribute by" on mult...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/22077 This is already merged, @LantaoJin please close this PR, thanksï¼ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22084: [SPARK-25026][BUILD] Binary releases should conta...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/22084#discussion_r209481255 --- Diff: dev/make-distribution.sh --- @@ -188,6 +190,23 @@ if [ -f "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar cp "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar "$DISTDIR/yarn" fi +# Only copy external jars if built +if [ -f "$SPARK_HOME"/external/avro/target/spark-avro_${SCALA_VERSION}-${VERSION}.jar ]; then + cp "$SPARK_HOME"/external/avro/target/spark-avro_${SCALA_VERSION}-${VERSION}.jar "$DISTDIR/external/jars/" +fi +if [ -f "$SPARK_HOME"/external/kafka-0-10/target/spark-streaming-kafka-0-10_${SCALA_VERSION}-${VERSION}.jar ]; then + cp "$SPARK_HOME"/external/kafka-0-10/target/spark-streaming-kafka-0-10_${SCALA_VERSION}-${VERSION}.jar "$DISTDIR/external/jars/" --- End diff -- When building such external jar, assembly jar will also be built accordingly. And the assembly jar can be used directly. Jars provided here still not so useful because it lacks third-party dependencies like Kafka, so I'm not sure if it is more convenient compared to pull from maven repo directly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22036: [SPARK-25028][SQL] Avoid NPE when analyzing parti...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22036#discussion_r209481344 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala --- @@ -140,7 +140,13 @@ case class AnalyzePartitionCommand( val df = tableDf.filter(Column(filter)).groupBy(partitionColumns: _*).count() df.collect().map { r => - val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString) + val partitionColumnValues = partitionColumns.indices.map { i => +if (r.isNullAt(i)) { + ExternalCatalogUtils.DEFAULT_PARTITION_NAME --- End diff -- do we need to chang the read path? i.e. where we use these statistics. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22084: [SPARK-25026][BUILD] Binary releases should conta...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/22084#discussion_r209480817 --- Diff: dev/make-distribution.sh --- @@ -188,6 +190,23 @@ if [ -f "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar cp "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar "$DISTDIR/yarn" fi +# Only copy external jars if built +if [ -f "$SPARK_HOME"/external/avro/target/spark-avro_${SCALA_VERSION}-${VERSION}.jar ]; then + cp "$SPARK_HOME"/external/avro/target/spark-avro_${SCALA_VERSION}-${VERSION}.jar "$DISTDIR/external/jars/" +fi +if [ -f "$SPARK_HOME"/external/kafka-0-10/target/spark-streaming-kafka-0-10_${SCALA_VERSION}-${VERSION}.jar ]; then + cp "$SPARK_HOME"/external/kafka-0-10/target/spark-streaming-kafka-0-10_${SCALA_VERSION}-${VERSION}.jar "$DISTDIR/external/jars/" --- End diff -- Also what about kinesis? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22084: [SPARK-25026][BUILD] Binary releases should conta...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22084#discussion_r209480769 --- Diff: dev/make-distribution.sh --- @@ -188,6 +190,23 @@ if [ -f "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar cp "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar "$DISTDIR/yarn" fi +# Only copy external jars if built +if [ -f "$SPARK_HOME"/external/avro/target/spark-avro_${SCALA_VERSION}-${VERSION}.jar ]; then + cp "$SPARK_HOME"/external/avro/target/spark-avro_${SCALA_VERSION}-${VERSION}.jar "$DISTDIR/external/jars/" +fi +if [ -f "$SPARK_HOME"/external/kafka-0-10/target/spark-streaming-kafka-0-10_${SCALA_VERSION}-${VERSION}.jar ]; then + cp "$SPARK_HOME"/external/kafka-0-10/target/spark-streaming-kafka-0-10_${SCALA_VERSION}-${VERSION}.jar "$DISTDIR/external/jars/" --- End diff -- I didn't because the jars/ dir doesn't contain assemblies... well, it's kind of the contents of an assembly. I assume the end-user here is someone who wants to just consume the integration, to add to an app. The assembly would pull back in all of Spark (?) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22084: [SPARK-25026][BUILD] Binary releases should conta...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/22084#discussion_r209480628 --- Diff: dev/make-distribution.sh --- @@ -188,6 +190,23 @@ if [ -f "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar cp "$SPARK_HOME"/common/network-yarn/target/scala*/spark-*-yarn-shuffle.jar "$DISTDIR/yarn" fi +# Only copy external jars if built +if [ -f "$SPARK_HOME"/external/avro/target/spark-avro_${SCALA_VERSION}-${VERSION}.jar ]; then + cp "$SPARK_HOME"/external/avro/target/spark-avro_${SCALA_VERSION}-${VERSION}.jar "$DISTDIR/external/jars/" +fi +if [ -f "$SPARK_HOME"/external/kafka-0-10/target/spark-streaming-kafka-0-10_${SCALA_VERSION}-${VERSION}.jar ]; then + cp "$SPARK_HOME"/external/kafka-0-10/target/spark-streaming-kafka-0-10_${SCALA_VERSION}-${VERSION}.jar "$DISTDIR/external/jars/" --- End diff -- Shall we also copy assembly jar for Kafka and flume? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks when spi...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21369 LGTM except some code style issues. Thanks for improving the test! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks w...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21369#discussion_r209480323 --- Diff: core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala --- @@ -414,7 +424,112 @@ class ExternalAppendOnlyMapSuite extends SparkFunSuite with LocalSparkContext { sc.stop() } - test("external aggregation updates peak execution memory") { + test("SPARK-22713 spill during iteration leaks internal map") { +val size = 1000 +val conf = createSparkConf(loadDefaults = true) +sc = new SparkContext("local-cluster[1,1,1024]", "test", conf) +val map = createExternalMap[Int] + +map.insertAll((0 until size).iterator.map(i => (i / 10, i))) +assert(map.numSpills == 0, "map was not supposed to spill") + +val it = map.iterator +assert(it.isInstanceOf[CompletionIterator[_, _]]) +// org.apache.spark.util.collection.AppendOnlyMap.destructiveSortedIterator returns +// an instance of an annonymous Iterator class. + +val underlyingMapRef = WeakReference(map.currentMap) + +{ + // direct asserts introduced some macro generated code that held a reference to the map + val tmpIsNull = null == underlyingMapRef.get.orNull + assert(!tmpIsNull) +} + +val first50Keys = for ( _ <- 0 until 50) yield { + val (k, vs) = it.next + val sortedVs = vs.sorted + assert(sortedVs.seq == (0 until 10).map(10 * k + _)) + k +} +assert(map.numSpills == 0) +map.spill(Long.MaxValue, null) +// these asserts try to show that we're no longer holding references to the underlying map. +// it'd be nice to use something like +// https://github.com/scala/scala/blob/2.13.x/test/junit/scala/tools/testing/AssertUtil.scala +// (lines 69-89) +// assert(map.currentMap == null) +eventually{ + System.gc() + // direct asserts introduced some macro generated code that held a reference to the map + val tmpIsNull = null == underlyingMapRef.get.orNull + assert(tmpIsNull) +} + + +val next50Keys = for ( _ <- 0 until 50) yield { + val (k, vs) = it.next + val sortedVs = vs.sorted + assert(sortedVs.seq == (0 until 10).map(10 * k + _)) + k +} +assert(!it.hasNext) +val keys = (first50Keys ++ next50Keys).sorted +assert(keys == (0 until 100)) + } + + test("drop all references to the underlying map once the iterator is exhausted") { +val size = 1000 +val conf = createSparkConf(loadDefaults = true) +sc = new SparkContext("local-cluster[1,1,1024]", "test", conf) +val map = createExternalMap[Int] + +map.insertAll((0 until size).iterator.map(i => (i / 10, i))) +assert(map.numSpills == 0, "map was not supposed to spill") + +val underlyingMapRef = WeakReference(map.currentMap) + +{ + // direct asserts introduced some macro generated code that held a reference to the map + val tmpIsNull = null == underlyingMapRef.get.orNull + assert(!tmpIsNull) +} + +val it = map.iterator +assert( it.isInstanceOf[CompletionIterator[_, _]]) + + +val keys = it.map{ + case (k, vs) => +val sortedVs = vs.sorted +assert(sortedVs.seq == (0 until 10).map(10 * k + _)) +k +} +.toList +.sorted + +assert(it.isEmpty) +assert(keys == (0 until 100)) + +assert(map.numSpills == 0) +// these asserts try to show that we're no longer holding references to the underlying map. +// it'd be nice to use something like +// https://github.com/scala/scala/blob/2.13.x/test/junit/scala/tools/testing/AssertUtil.scala +// (lines 69-89) +assert(map.currentMap == null) + +eventually{ --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks w...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21369#discussion_r209480313 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala --- @@ -585,17 +591,24 @@ class ExternalAppendOnlyMap[K, V, C]( } else { logInfo(s"Task ${context.taskAttemptId} force spilling in-memory map to disk and " + s"it will release ${org.apache.spark.util.Utils.bytesToString(getUsed())} memory") -nextUpstream = spillMemoryIteratorToDisk(upstream) +val nextUpstream = spillMemoryIteratorToDisk(upstream) +assert(!upstream.hasNext) hasSpilled = true +upstream = nextUpstream true } } +private def destroy() : Unit = { + freeCurrentMap() + upstream = Iterator.empty +} + +def toCompletionIterator: CompletionIterator[(K, C), SpillableIterator] = { + CompletionIterator[(K, C), SpillableIterator](this, this.destroy ) --- End diff -- nit: no space before `)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org