[GitHub] spark issue #21341: Revert "[SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21341 **[Test build #90672 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90672/testReport)** for PR 21341 at commit [`0674301`](https://github.com/apache/spark/commit/06743015fbfca7060c800daedfd65bc9c52bf7b4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21341: Revert "[SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21341 LGTM
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20894 ping @gengliangwang
[GitHub] spark issue #21341: Revert "[SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21341 cc @gatorsmile @viirya @jiangxb1987
[GitHub] spark pull request #21341: Revert "[SPARK-22938][SQL][FOLLOWUP] Assert that ...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/21341
Revert "[SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf.get is accessed only on the driver"
This reverts commit a4206d58e05ab9ed6f01fee57e18dee65cbc4efc.
This is taken from https://github.com/apache/spark/pull/21299, to ease the review of it.
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/cloud-fan/spark revert
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/21341.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #21341
commit 06743015fbfca7060c800daedfd65bc9c52bf7b4
Author: Wenchen Fan
Date: 2018-05-16T06:54:08Z
    Revert "[SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf.get is accessed only on the driver"
    This reverts commit a4206d58e05ab9ed6f01fee57e18dee65cbc4efc.
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21291 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90666/ Test FAILed.
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21291 Merged build finished. Test FAILed.
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21291 **[Test build #90666 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90666/testReport)** for PR 21291 at commit [`f93738b`](https://github.com/apache/spark/commit/f93738be3a7509d70568b3060a0cc4dd3ff23da0).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21106: [SPARK-23711][SQL] Add fallback generator for UnsafeProj...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21106 Merged build finished. Test PASSed.
[GitHub] spark issue #21106: [SPARK-23711][SQL] Add fallback generator for UnsafeProj...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21106 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3251/ Test PASSed.
[GitHub] spark issue #21252: [SPARK-24193] Sort by disk when number of limit is big i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21252 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3250/ Test PASSed.
[GitHub] spark issue #21252: [SPARK-24193] Sort by disk when number of limit is big i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21252 Merged build finished. Test PASSed.
[GitHub] spark issue #21106: [SPARK-23711][SQL] Add fallback generator for UnsafeProj...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21106 **[Test build #90671 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90671/testReport)** for PR 21106 at commit [`f883c2b`](https://github.com/apache/spark/commit/f883c2b8f2b80b2d73e28d78fcaa6530143e0b66).
[GitHub] spark issue #21252: [SPARK-24193] Sort by disk when number of limit is big i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21252 **[Test build #90670 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90670/testReport)** for PR 21252 at commit [`6fa3e58`](https://github.com/apache/spark/commit/6fa3e582582fafffdc469943177e47272ba4c8a0).
[GitHub] spark issue #21258: [SPARK-23933][SQL] Add map_from_arrays function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21258 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90665/ Test FAILed.
[GitHub] spark issue #21258: [SPARK-23933][SQL] Add map_from_arrays function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21258 Merged build finished. Test FAILed.
[GitHub] spark issue #21258: [SPARK-23933][SQL] Add map_from_arrays function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21258 **[Test build #90665 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90665/testReport)** for PR 21258 at commit [`afd2ebb`](https://github.com/apache/spark/commit/afd2ebbb48f45f9763e0e602262f5b558f90077a).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21086: [SPARK-24002] [SQL] Task not serializable caused by org....
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21086 since people hit this issue, let's backport. cc @gatorsmile
[GitHub] spark pull request #21086: [SPARK-24002] [SQL] Task not serializable caused ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21086#discussion_r188504187
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -351,12 +338,26 @@ class ParquetFileFormat
     val timestampConversion: Boolean =
       sparkSession.sessionState.conf.isParquetINT96TimestampConversion
     val capacity = sqlConf.parquetVectorizedReaderBatchSize
+    val enableParquetFilterPushDown: Boolean =
+      sparkSession.sessionState.conf.parquetFilterPushDown
     // Whole stage codegen (PhysicalRDD) is able to deal with batches directly
     val returningBatch = supportBatch(sparkSession, resultSchema)
     (file: PartitionedFile) => {
       assert(file.partitionValues.numFields == partitionSchema.size)
+      // Try to push down filters when filter push-down is enabled.
--- End diff --
Now the code is inside the read function, which will be executed on the executor side. Thus we don't need to serialize `ParquetFilters`.
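The point about not serializing `ParquetFilters` can be illustrated outside Spark. The following is a minimal sketch in plain Scala, assuming Java serialization stands in for shipping a task to an executor; `FilterBuilder`, `CapturingReader`, `LocalReader` and `canSerialize` are hypothetical names for illustration only and are not Spark classes. It shows that a function object holding a non-serializable helper cannot be serialized, while one that holds only a `Boolean` flag and builds the helper inside its body can.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureSketch {
  // Hypothetical stand-in for a helper that does NOT implement Serializable.
  class FilterBuilder(val pushDown: Boolean) {
    def describe(path: String): String = s"$path pushDown=$pushDown"
  }

  // Holds a helper built "on the driver": serializing the reader must also
  // serialize the helper, which fails.
  class CapturingReader(builder: FilterBuilder)
      extends (String => String) with Serializable {
    def apply(path: String): String = builder.describe(path)
  }

  // Holds only a Boolean flag; the helper is created inside apply, that is,
  // wherever the function actually runs.
  class LocalReader(pushDown: Boolean)
      extends (String => String) with Serializable {
    def apply(path: String): String = new FilterBuilder(pushDown).describe(path)
  }

  // Returns true if Java serialization of obj succeeds.
  def canSerialize(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  def main(args: Array[String]): Unit = {
    println(canSerialize(new CapturingReader(new FilterBuilder(true))))
    println(canSerialize(new LocalReader(true)))
  }
}
```

In the diff above, `enableParquetFilterPushDown` plays the role of the captured `Boolean`; how Spark itself serializes the read function is not shown in this thread, so treat the sketch as an analogy only.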
[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20929 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90664/ Test FAILed.
[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20929 Merged build finished. Test FAILed.
[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20929 **[Test build #90664 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90664/testReport)** for PR 20929 at commit [`53b686d`](https://github.com/apache/spark/commit/53b686dede4e5fbcb2b3e39932602ae0c9974209).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20611: [SPARK-23425][SQL]Support wildcard in HDFS path for load...
Github user kevinyu98 commented on the issue: https://github.com/apache/spark/pull/20611 @sujith71955 Sorry for the delay. I just ran your test case with my fix only, and it ran successfully. Can you verify it? If that is the case, then my fix is much simpler. What do you think? Thanks
[GitHub] spark issue #21329: [SPARK-24277][SQL] Code clean up in SQL module: HadoopMa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21329 Merged build finished. Test PASSed.
[GitHub] spark issue #21329: [SPARK-24277][SQL] Code clean up in SQL module: HadoopMa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21329 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3249/ Test PASSed.
[GitHub] spark issue #21329: [SPARK-24277][SQL] Code clean up in SQL module: HadoopMa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21329 **[Test build #90669 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90669/testReport)** for PR 21329 at commit [`353606c`](https://github.com/apache/spark/commit/353606c919d1b61db22e9e9f47ab6ed06d78702e).
[GitHub] spark issue #21329: [SPARK-24277][SQL] Code clean up in SQL module: HadoopMa...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/21329 retest this please.
[GitHub] spark issue #20872: [SPARK-23264][SQL] Fix scala.MatchError in literals.sql....
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20872 @cloud-fan ok
[GitHub] spark issue #21069: [SPARK-23920][SQL]add array_remove to remove all element...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21069 Merged build finished. Test PASSed.
[GitHub] spark issue #21208: [SPARK-23925][SQL] Add array_repeat collection function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21208 Merged build finished. Test PASSed.
[GitHub] spark issue #21069: [SPARK-23920][SQL]add array_remove to remove all element...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21069 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3248/ Test PASSed.
[GitHub] spark issue #21208: [SPARK-23925][SQL] Add array_repeat collection function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90662/ Test PASSed.
[GitHub] spark issue #21208: [SPARK-23925][SQL] Add array_repeat collection function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21208 **[Test build #90662 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90662/testReport)** for PR 21208 at commit [`3bd11e2`](https://github.com/apache/spark/commit/3bd11e2e25cbc172791b9934279589d0cd459ba5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #21069: [SPARK-23920][SQL]add array_remove to remove all ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21069#discussion_r188494901
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala ---
@@ -280,4 +280,35 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
     checkEvaluation(Concat(Seq(aa0, aa1)), Seq(Seq("a", "b"), Seq("c"), Seq("d"), Seq("e", "f")))
   }
+
+  test("Array remove") {
+    val a0 = Literal.create(Seq(1, 2, 3, 2, 2, 5), ArrayType(IntegerType))
+    val a1 = Literal.create(Seq("b", "a", "a", "c", "b"), ArrayType(StringType))
+    val a2 = Literal.create(Seq[String](null, "", null, ""), ArrayType(StringType))
+    val a3 = Literal.create(Seq.empty[Integer], ArrayType(IntegerType))
+    val a4 = Literal.create(null, ArrayType(StringType))
+    val a5 = Literal.create(Seq(1, null, 8, 9, null), ArrayType(IntegerType))
+    val a6 = Literal.create(Seq(true, false, false, true), ArrayType(BooleanType))
+
+    checkEvaluation(ArrayRemove(a0, Literal(0)), Seq(1, 2, 3, 2, 2, 5))
+    checkEvaluation(ArrayRemove(a0, Literal(1)), Seq(2, 3, 2, 2, 5))
+    checkEvaluation(ArrayRemove(a0, Literal(2)), Seq(1, 3, 5))
+    checkEvaluation(ArrayRemove(a0, Literal(3)), Seq(1, 2, 2, 2, 5))
+    checkEvaluation(ArrayRemove(a0, Literal(5)), Seq(1, 2, 3, 2, 2))
--- End diff --
@ueshin Thank you very much for your comments. I am very sorry for the late reply. I corrected everything except this one. I kept `checkEvaluation(ArrayRemove(a0, Literal(0)), Seq(1, 2, 3, 2, 2, 5))` to check that no value is removed when the given value is not contained in the array.
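The behaviour these assertions pin down can be modelled with plain Scala collections. This is a minimal sketch, not the `ArrayRemove` expression from the PR: `arrayRemove` is a hypothetical helper, and null handling (the `a2`, `a4` and `a5` literals) is deliberately left out because the quoted diff does not show the expected results for those cases.

```scala
object ArrayRemoveSketch {
  // Plain-collections model: drop every element equal to `elem`,
  // keeping the remaining elements in their original order.
  def arrayRemove[T](arr: Seq[T], elem: T): Seq[T] = arr.filterNot(_ == elem)

  def main(args: Array[String]): Unit = {
    val a0 = Seq(1, 2, 3, 2, 2, 5)
    // Mirrors the checkEvaluation cases quoted in the diff.
    assert(arrayRemove(a0, 0) == Seq(1, 2, 3, 2, 2, 5)) // value not contained: unchanged
    assert(arrayRemove(a0, 1) == Seq(2, 3, 2, 2, 5))
    assert(arrayRemove(a0, 2) == Seq(1, 3, 5))          // all occurrences removed
    assert(arrayRemove(a0, 3) == Seq(1, 2, 2, 2, 5))
    assert(arrayRemove(a0, 5) == Seq(1, 2, 3, 2, 2))
  }
}
```

The first assertion is the case discussed in the comment: removing a value that does not occur leaves the array as it was.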
[GitHub] spark issue #21069: [SPARK-23920][SQL]add array_remove to remove all element...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21069 **[Test build #90668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90668/testReport)** for PR 21069 at commit [`7fd77d0`](https://github.com/apache/spark/commit/7fd77d01777e7b8bd8b34503cf4d4e7c77df9ecd).
[GitHub] spark issue #21069: [SPARK-23920][SQL]add array_remove to remove all element...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21069 **[Test build #90667 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90667/testReport)** for PR 21069 at commit [`8011aa9`](https://github.com/apache/spark/commit/8011aa91e0ef6bb13ee7b83532dc6fd236cdf792).
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21291 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3247/ Test PASSed.
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21291 Merged build finished. Test PASSed.
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20973#discussion_r188491670
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType}
+
+/**
+ * :: Experimental ::
+ * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+ * Efficiently by Prefix-Projected Pattern Growth
+ * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ *
+ * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ * (Wikipedia)</a>
+ */
+@Since("2.4.0")
+@Experimental
+object PrefixSpan {
+
+  /**
+   * :: Experimental ::
+   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+   *
+   * @param dataset A dataset or a dataframe containing a sequence column which is
+   *                {{{Seq[Seq[_]]}}} type
+   * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
+   *                    are ignored
+   * @param minSupport the minimal support level of the sequential pattern, any pattern that
+   *                   appears more than (minSupport * size-of-the-dataset) times will be output
+   *                   (recommended value: `0.1`).
+   * @param maxPatternLength the maximal length of the sequential pattern
+   *                         (recommended value: `10`).
+   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
+   *                           internal storage format) allowed in a projected database before
+   *                           local processing. If a projected database exceeds this size, another
+   *                           iteration of distributed prefix growth is run
+   *                           (recommended value: `3200`).
+   * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
+   *         The schema of it will be:
+   *          - `sequence: Seq[Seq[T]]` (T is the item type)
+   *          - `freq: Long`
+   */
+  @Since("2.4.0")
+  def findFrequentSequentialPatterns(
+      dataset: Dataset[_],
+      sequenceCol: String,
--- End diff --
This way, `final class PrefixSpan(override val uid: String) extends Params` seemingly breaks binary compatibility if we later change it into an estimator?
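The `minSupport` parameter described in the doc comment can be made concrete with a small, Spark-free sketch. Everything here is an assumption for illustration: `SupportSketch`, `containsPattern`, `support` and `isFrequent` are hypothetical names, items are plain `Int`s, and the strict "more than" comparison is taken literally from the doc comment wording; the quoted diff does not show how the implementation treats a pattern that sits exactly on the threshold.

```scala
object SupportSketch {
  type Itemset = Set[Int]
  type Sequence = Seq[Itemset]

  // A pattern occurs in a sequence if its itemsets can be matched, in order,
  // to itemsets of the sequence that contain them. Greedy left-to-right
  // matching is sufficient for this containment check.
  def containsPattern(sequence: Sequence, pattern: Sequence): Boolean = {
    val matched = sequence.foldLeft(0) { (i, itemset) =>
      if (i < pattern.length && pattern(i).subsetOf(itemset)) i + 1 else i
    }
    matched == pattern.length
  }

  // Number of sequences in the database that contain the pattern.
  def support(db: Seq[Sequence], pattern: Sequence): Long =
    db.count(containsPattern(_, pattern)).toLong

  // Threshold as worded in the doc comment: more than minSupport * size-of-the-dataset.
  def isFrequent(db: Seq[Sequence], pattern: Sequence, minSupport: Double): Boolean =
    support(db, pattern) > minSupport * db.size

  def main(args: Array[String]): Unit = {
    val db: Seq[Sequence] = Seq(
      Seq(Set(1, 2), Set(3)),
      Seq(Set(1), Set(3, 2), Set(1, 2)),
      Seq(Set(1, 2), Set(5)),
      Seq(Set(6)))
    println(support(db, Seq(Set(1))))         // sequences containing item 1
    println(support(db, Seq(Set(1), Set(3)))) // item 1 followed later by item 3
    println(isFrequent(db, Seq(Set(1)), 0.5))
  }
}
```

This only counts support for a given pattern; it does not enumerate patterns the way PrefixSpan does, and `maxPatternLength` and `maxLocalProjDBSize` are not modelled.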
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21291 **[Test build #90666 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90666/testReport)** for PR 21291 at commit [`f93738b`](https://github.com/apache/spark/commit/f93738be3a7509d70568b3060a0cc4dd3ff23da0).
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21291 retest this please.
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21092 Merged build finished. Test PASSed.
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21092 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90660/ Test PASSed.
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21092 **[Test build #90660 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90660/testReport)** for PR 21092 at commit [`72953a3`](https://github.com/apache/spark/commit/72953a3ef42ce0aa0d4b55c0f213198b4b468907).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21258: [SPARK-23933][SQL] Add map_from_arrays function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21258 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3246/ Test PASSed.
[GitHub] spark issue #21258: [SPARK-23933][SQL] Add map_from_arrays function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21258 Merged build finished. Test PASSed.
[GitHub] spark pull request #21325: [R][backport-2.2] backport lint fix
Github user felixcheung closed the pull request at: https://github.com/apache/spark/pull/21325
[GitHub] spark issue #21258: [SPARK-23933][SQL] Add map_from_arrays function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21258 **[Test build #90665 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90665/testReport)** for PR 21258 at commit [`afd2ebb`](https://github.com/apache/spark/commit/afd2ebbb48f45f9763e0e602262f5b558f90077a).
[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20929 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3245/ Test PASSed.
[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20929 Merged build finished. Test PASSed.
[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20929 **[Test build #90664 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90664/testReport)** for PR 20929 at commit [`53b686d`](https://github.com/apache/spark/commit/53b686dede4e5fbcb2b3e39932602ae0c9974209).
[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20929 OK, in this PR I'll focus on adding a new flag to do so. Just a sec for the update. Thanks!
[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20929 retest this please
[GitHub] spark issue #21340: [SPARK-24115] Have logging pass through instrumentation ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21340 **[Test build #90663 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90663/testReport)** for PR 21340 at commit [`1c83b32`](https://github.com/apache/spark/commit/1c83b329fb59bb357bcbf4ac14179fa55a8b4aad).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21340: [SPARK-24115] Have logging pass through instrumentation ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21340 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90663/ Test FAILed.
[GitHub] spark issue #21340: [SPARK-24115] Have logging pass through instrumentation ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21340 Merged build finished. Test FAILed.
[GitHub] spark issue #21340: [SPARK-24115] Have logging pass through instrumentation ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21340 Merged build finished. Test PASSed.
[GitHub] spark issue #21340: [SPARK-24115] Have logging pass through instrumentation ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21340 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3244/ Test PASSed.
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21291 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21291 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90661/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21340: [SPARK-24115] Have logging pass through instrumentation ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21340 **[Test build #90663 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90663/testReport)** for PR 21340 at commit [`1c83b32`](https://github.com/apache/spark/commit/1c83b329fb59bb357bcbf4ac14179fa55a8b4aad). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21291 **[Test build #90661 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90661/testReport)** for PR 21291 at commit [`f93738b`](https://github.com/apache/spark/commit/f93738be3a7509d70568b3060a0cc4dd3ff23da0). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21340: [SPARK-24115] Have logging pass through instrumen...
GitHub user MrBago opened a pull request: https://github.com/apache/spark/pull/21340 [SPARK-24115] Have logging pass through instrumentation class. ## What changes were proposed in this pull request? Fixes to tuning instrumentation. ## How was this patch tested? Existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MrBago/spark tunning-instrumentation Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21340.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21340 commit 1c83b329fb59bb357bcbf4ac14179fa55a8b4aad Author: Bago Amirbekian Date: 2018-05-16T01:39:31Z Have logging pass through instrumentation class. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21336: [SPARK-24286][Documentation] DataFrameReader.csv ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21336#discussion_r188476857
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -521,7 +521,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    *
    * You can set the following CSV-specific options to deal with CSV files:
    * <ul>
-   * <li>`sep` (default `,`): sets a single character as a separator for each
+   * <li>`sep` or `delimiter` (default `,`): sets a single character as a separator for each
--- End diff --
`sep` is preferred and `delimiter` is not documented on purpose.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
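A minimal sketch of the documented `sep` option, for readers who want to see it in use. It assumes a local `SparkSession` and a throwaway semicolon-separated file; the object name `CsvSepExample` and the sample data are made up for illustration and are not part of the PR.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the PR under review.
object CsvSepExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("csv-sep-example")
      .getOrCreate()
    try {
      // Write a small semicolon-separated file to a temporary location.
      val path = Files.createTempFile("people", ".csv")
      Files.write(path, "name;age\nalice;30\nbob;25\n".getBytes(StandardCharsets.UTF_8))

      // Use the documented `sep` option rather than the undocumented `delimiter` alias.
      val df = spark.read
        .option("header", "true")
        .option("sep", ";")
        .csv(path.toString)

      assert(df.columns.toSeq == Seq("name", "age"))
      assert(df.count() == 2)
    } finally {
      spark.stop()
    }
  }
}
```

Sticking to `sep` in user code follows the reviewer's stated preference above.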
[GitHub] spark issue #21338: [SPARK-23601][build][follow-up] Keep md5 checksums for n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21338 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21338: [SPARK-23601][build][follow-up] Keep md5 checksums for n...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21338 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90659/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21338: [SPARK-23601][build][follow-up] Keep md5 checksums for n...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21338 **[Test build #90659 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90659/testReport)** for PR 21338 at commit [`60d058e`](https://github.com/apache/spark/commit/60d058e02be7d2daf4d7c5f0abff3530c2349c00). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21086: [SPARK-24002] [SQL] Task not serializable caused ...
Github user ghoto commented on a diff in the pull request: https://github.com/apache/spark/pull/21086#discussion_r188473831
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -351,12 +338,26 @@ class ParquetFileFormat
     val timestampConversion: Boolean =
       sparkSession.sessionState.conf.isParquetINT96TimestampConversion
     val capacity = sqlConf.parquetVectorizedReaderBatchSize
+    val enableParquetFilterPushDown: Boolean =
+      sparkSession.sessionState.conf.parquetFilterPushDown
     // Whole stage codegen (PhysicalRDD) is able to deal with batches directly
     val returningBatch = supportBatch(sparkSession, resultSchema)

     (file: PartitionedFile) => {
       assert(file.partitionValues.numFields == partitionSchema.size)

+      // Try to push down filters when filter push-down is enabled.
--- End diff --
So this code is the same as before. How can this solve the bug described at the top of the conversation?
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21153: [SPARK-24058][ML][PySpark] Default Params in ML s...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21153 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21153: [SPARK-24058][ML][PySpark] Default Params in ML should b...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21153 Thanks @jkbradley @WeichenXu123 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21153: [SPARK-24058][ML][PySpark] Default Params in ML should b...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21153 OK thanks @viirya ! Merging with master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21208: [SPARK-23925][SQL] Add array_repeat collection function
Github user pepinoflo commented on the issue: https://github.com/apache/spark/pull/21208 I just changed my email address in the last 9 commits. Unfortunately, I wasn't able to rewrite the first commit, because the first merge could not be preserved even with `git rebase -i -p`. Is it OK to merge this anyway, or does it need to be fixed somehow (maybe by removing the 2 merges entirely and doing a new merge)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21208: [SPARK-23925][SQL] Add array_repeat collection function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21208 **[Test build #90662 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90662/testReport)** for PR 21208 at commit [`3bd11e2`](https://github.com/apache/spark/commit/3bd11e2e25cbc172791b9934279589d0cd459ba5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/20973#discussion_r188464083
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType}
+
+/**
+ * :: Experimental ::
+ * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
+ * Efficiently by Prefix-Projected Pattern Growth
+ * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ *
+ * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential Pattern Mining
+ * (Wikipedia)</a>
+ */
+@Since("2.4.0")
+@Experimental
+object PrefixSpan {
+
+  /**
+   * :: Experimental ::
+   * Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+   *
+   * @param dataset A dataset or a dataframe containing a sequence column which is
+   *                {{{Seq[Seq[_]]}}} type
+   * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column
+   *                    are ignored
+   * @param minSupport the minimal support level of the sequential pattern, any pattern that
+   *                   appears more than (minSupport * size-of-the-dataset) times will be output
+   *                   (recommended value: `0.1`).
+   * @param maxPatternLength the maximal length of the sequential pattern
+   *                         (recommended value: `10`).
+   * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the
+   *                           internal storage format) allowed in a projected database before
+   *                           local processing. If a projected database exceeds this size, another
+   *                           iteration of distributed prefix growth is run
+   *                           (recommended value: `3200`).
+   * @return A `DataFrame` that contains columns of sequence and corresponding frequency.
+   *         The schema of it will be:
+   *          - `sequence: Seq[Seq[T]]` (T is the item type)
+   *          - `freq: Long`
+   */
+  @Since("2.4.0")
+  def findFrequentSequentialPatterns(
+      dataset: Dataset[_],
+      sequenceCol: String,
--- End diff --
It should be easier to keep the `PrefixSpan` name and make it an `Estimator` later. For example:
~~~scala
final class PrefixSpan(override val uid: String) extends Params {
  // param, setters, getters
  def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame
}
~~~
Later we can add `Estimator.fit` and `PrefixSpanModel.transform`. Any issue with this approach?
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
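If the class-based shape suggested above were adopted, caller code might look roughly like the sketch below. This is an illustration only, not the diff under review: it assumes a Spark build in which `org.apache.spark.ml.fpm.PrefixSpan` is a class with a no-argument constructor and setters named after the Scaladoc parameters (`setSequenceCol`, `setMinSupport`, `setMaxPatternLength`); those names are assumptions here and should be checked against the release in use. The object name `PrefixSpanSketch` and the sample data are made up.

```scala
import org.apache.spark.ml.fpm.PrefixSpan
import org.apache.spark.sql.SparkSession

// Hypothetical caller; setter names are assumed, see the note above.
object PrefixSpanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("prefixspan-sketch")
      .getOrCreate()
    import spark.implicits._
    try {
      // Each row is one sequence of itemsets, i.e. Seq[Seq[Int]].
      val df = Seq(
        Seq(Seq(1, 2), Seq(3)),
        Seq(Seq(1), Seq(3, 2), Seq(1, 2)),
        Seq(Seq(1, 2), Seq(5)),
        Seq(Seq(6))
      ).toDF("sequence")

      // Parameters live on the instance instead of in a long argument list.
      val patterns = new PrefixSpan()
        .setSequenceCol("sequence")
        .setMinSupport(0.5)
        .setMaxPatternLength(5)
        .findFrequentSequentialPatterns(df)

      // The quoted Scaladoc describes a result with `sequence` and `freq` columns.
      assert(patterns.columns.toSeq == Seq("sequence", "freq"))
      patterns.show(truncate = false)
    } finally {
      spark.stop()
    }
  }
}
```

Keeping parameters on the instance is what would let `fit` and `transform` be added later without changing call sites, which is the point of the suggestion.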
[GitHub] spark issue #21086: [SPARK-24002] [SQL] Task not serializable caused by org....
Github user ghoto commented on the issue: https://github.com/apache/spark/pull/21086 I'm hitting this issue after upgrading from 2.0.2 to 2.3.0. Please backport this PR to Spark 2.3.0 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21291 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21291 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3243/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21092 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3146/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21337: [SPARK-24234][SS] Reader for continuous processin...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/21337#discussion_r188456856
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleReadRDD.scala ---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.continuous.shuffle
+
+import java.util.UUID
+
+import org.apache.spark.{Partition, SparkContext, SparkEnv, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow
+import org.apache.spark.util.NextIterator
+
+case class ContinuousShuffleReadPartition(index: Int) extends Partition {
+  // Initialized only on the executor, and only once even as we call compute() multiple times.
+  lazy val (receiver, endpoint) = {
+    val env = SparkEnv.get.rpcEnv
+    val receiver = new UnsafeRowReceiver(env)
+    val endpoint = env.setupEndpoint(UUID.randomUUID().toString, receiver)
+    TaskContext.get().addTaskCompletionListener { ctx =>
+      env.stop(endpoint)
+    }
+    (receiver, endpoint)
+  }
+}
+
+/**
+ * RDD at the bottom of each continuous processing shuffle task, reading from the
+ */
+class ContinuousShuffleReadRDD(sc: SparkContext, numPartitions: Int)
+    extends RDD[UnsafeRow](sc, Nil) {
+
+  override protected def getPartitions: Array[Partition] = {
+    (0 until numPartitions).map(ContinuousShuffleReadPartition).toArray
+  }
+
+  override def compute(split: Partition, context: TaskContext): Iterator[UnsafeRow] = {
+    val receiver = split.asInstanceOf[ContinuousShuffleReadPartition].receiver
+
+    new NextIterator[UnsafeRow] {
+      override def getNext(): UnsafeRow = receiver.poll() match {
+        case ReceiverRow(r) => r
+        case ReceiverEpochMarker() =>
--- End diff --
It should, but I think that's significant enough to justify its own PR. Added an explicit TODO to be safe.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21337: [SPARK-24234][SS] Reader for continuous processin...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/21337#discussion_r188456692
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleReadRDD.scala ---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.continuous.shuffle
+
+import java.util.UUID
+
+import org.apache.spark.{Partition, SparkContext, SparkEnv, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow
+import org.apache.spark.util.NextIterator
+
+case class ContinuousShuffleReadPartition(index: Int) extends Partition {
+  // Initialized only on the executor, and only once even as we call compute() multiple times.
+  lazy val (receiver, endpoint) = {
+    val env = SparkEnv.get.rpcEnv
+    val receiver = new UnsafeRowReceiver(env)
+    val endpoint = env.setupEndpoint(UUID.randomUUID().toString, receiver)
+    TaskContext.get().addTaskCompletionListener { ctx =>
+      env.stop(endpoint)
+    }
+    (receiver, endpoint)
+  }
+}
+
+/**
+ * RDD at the bottom of each continuous processing shuffle task, reading from the
--- End diff --
Well, ContinuousShuffleReadRDD is a bit self-documenting as a reader. Added that it's receiving shuffle data from upstream tasks.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
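The "only once even as we call compute() multiple times" comment in the quoted diff relies on how a pattern-bound `lazy val` behaves. A minimal Spark-free sketch of that behavior follows; `DemoPartition`, `initCount`, and the string values are hypothetical stand-ins, not the PR's types.

```scala
object LazyTupleDemo {
  // Counts how many times the lazy initializer body has run.
  var initCount = 0

  // Hypothetical stand-in for ContinuousShuffleReadPartition; no Spark types involved.
  final case class DemoPartition(index: Int) {
    // The block runs on first access to either name, and never again for this instance.
    lazy val (receiver, endpoint) = {
      initCount += 1
      (s"receiver-$index", s"endpoint-$index")
    }
  }

  def main(args: Array[String]): Unit = {
    val p = DemoPartition(0)
    assert(initCount == 0) // nothing has been initialized yet

    // Repeated accesses, like repeated compute() calls, reuse the same pair.
    val first = p.receiver
    val second = p.receiver
    val e = p.endpoint

    assert(initCount == 1)
    assert(first == second)
    assert(e == "endpoint-0")
  }
}
```

In the PR, the same mechanism is what would keep a second call to `compute()` on the same partition object from setting up a second RPC endpoint.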
[GitHub] spark issue #21338: [SPARK-23601][build][follow-up] Keep md5 checksums for n...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21338 I'll reply to the original e-mail on the PMC list. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21291 **[Test build #90661 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90661/testReport)** for PR 21291 at commit [`f93738b`](https://github.com/apache/spark/commit/f93738be3a7509d70568b3060a0cc4dd3ff23da0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21092 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3146/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21291 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21338: [SPARK-23601][build][follow-up] Keep md5 checksums for n...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/21338 Can we check this with the appropriate Apache group (is it infra?)? It seems odd that the policy would require removing them when Nexus requires them. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21092 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21092 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3242/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21322: [SPARK-24225][CORE] Support closing AutoClosable objects...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21322 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21322: [SPARK-24225][CORE] Support closing AutoClosable objects...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21322 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90652/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21092 **[Test build #90660 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90660/testReport)** for PR 21092 at commit [`72953a3`](https://github.com/apache/spark/commit/72953a3ef42ce0aa0d4b55c0f213198b4b468907). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user ifilonenko commented on the issue: https://github.com/apache/spark/pull/21092 jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21322: [SPARK-24225][CORE] Support closing AutoClosable objects...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21322 **[Test build #90652 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90652/testReport)** for PR 21322 at commit [`6a08c43`](https://github.com/apache/spark/commit/6a08c434cf967b939b8065bb23d64d0715e38a2c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21322: [SPARK-24225][CORE] Support closing AutoClosable objects...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21322 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90651/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21322: [SPARK-24225][CORE] Support closing AutoClosable objects...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21322 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21322: [SPARK-24225][CORE] Support closing AutoClosable objects...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21322 **[Test build #90651 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90651/testReport)** for PR 21322 at commit [`62d46d3`](https://github.com/apache/spark/commit/62d46d3bf49ef0393a916d3cafaae4947f374f36). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21338: [SPARK-23601][build][follow-up] Keep md5 checksums for n...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21338 Right. The new policy says we shouldn't use md5 files, but the nexus server requires them. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21338: [SPARK-23601][build][follow-up] Keep md5 checksums for n...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/21338 If I follow this correctly, this is a partial revert only for the Nexus artifacts? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21326: [SPARK-24275][SQL] Revise doc comments in InputPartition
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21326 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21326: [SPARK-24275][SQL] Revise doc comments in InputPartition
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21326 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90650/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21326: [SPARK-24275][SQL] Revise doc comments in InputPartition
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21326 **[Test build #90650 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90650/testReport)** for PR 21326 at commit [`f571750`](https://github.com/apache/spark/commit/f571750b26a7da936e48ba5e40528e6a16c43744). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org