[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20147 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20147 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85718/ Test PASSed.
[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20147

**[Test build #85718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85718/testReport)** for PR 20147 at commit [`7b58d99`](https://github.com/apache/spark/commit/7b58d994f485255ffad59ed9ffb480a64a00cea2).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEs...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20132 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85719/ Test PASSed.
[GitHub] spark issue #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEs...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20132 Merged build finished. Test PASSed.
[GitHub] spark issue #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEs...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20132

**[Test build #85719 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85719/testReport)** for PR 20132 at commit [`c547d0f`](https://github.com/apache/spark/commit/c547d0fa7c4e1bb61312219df4d273af4a5a9db9).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20160: [SPARK-22757][K8S] Enable spark.jars and spark.fi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20160
[GitHub] spark issue #20160: [SPARK-22757][K8S] Enable spark.jars and spark.files in ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20160 test passed, merged to master/2.3.
[GitHub] spark pull request #20151: [SPARK-22959][PYTHON] Configuration to select the...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20151#discussion_r159819670

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -34,17 +34,25 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String

   import PythonWorkerFactory._

-  // Because forking processes from Java is expensive, we prefer to launch a single Python daemon
-  // (pyspark/daemon.py) and tell it to fork new workers for our tasks. This daemon currently
-  // only works on UNIX-based systems now because it uses signals for child management, so we can
-  // also fall back to launching workers (pyspark/worker.py) directly.
+  // Because forking processes from Java is expensive, we prefer to launch a single Python daemon,
+  // pyspark/daemon.py (by default) and tell it to fork new workers for our tasks. This daemon
+  // currently only works on UNIX-based systems now because it uses signals for child management,
+  // so we can also fall back to launching workers, pyspark/worker.py (by default) directly.
   val useDaemon = {
     val useDaemonEnabled = SparkEnv.get.conf.getBoolean("spark.python.use.daemon", true)

     // This flag is ignored on Windows as it's unable to fork.
     !System.getProperty("os.name").startsWith("Windows") && useDaemonEnabled
   }

+  // This configuration indicates the module to run the daemon to execute its Python workers.
+  val daemonModule = SparkEnv.get.conf.get("spark.python.daemon.module", "pyspark.daemon")
--- End diff --

generally, I thought we use the name "command" as what we call the thing to execute
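For context, the two settings in the diff above interact like this; a minimal Python sketch of the decision logic (the plain-dict `conf` and the function name are illustrative stand-ins, not Spark API):

```python
# Sketch of how PythonWorkerFactory decides between the daemon and direct
# worker launch, per the diff above. `conf` here is just a dict of string
# settings, not a real SparkConf.

def resolve_worker_strategy(conf, os_name):
    """Return (use_daemon, daemon_module) the way the Scala code decides it."""
    use_daemon_enabled = conf.get("spark.python.use.daemon", "true") == "true"
    # The daemon relies on UNIX signals for child management, so it is
    # skipped on Windows regardless of the setting.
    use_daemon = (not os_name.startswith("Windows")) and use_daemon_enabled
    # The module run as the daemon is configurable; pyspark.daemon by default.
    daemon_module = conf.get("spark.python.daemon.module", "pyspark.daemon")
    return use_daemon, daemon_module
```

For example, `resolve_worker_strategy({}, "Linux")` yields the defaults, while overriding `spark.python.daemon.module` swaps in a custom daemon module.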
[GitHub] spark pull request #20143: [SPARK-22949][ML] Apply CrossValidator approach t...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20143
[GitHub] spark issue #20143: [SPARK-22949][ML] Apply CrossValidator approach to Drive...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20143 LGTM. I'm going to merge this with master and branch-2.3, backporting as a fairly important bug fix. Thanks @MrBago!
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20024 Thanks! I'll do that.
[GitHub] spark issue #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEs...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20132 **[Test build #85719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85719/testReport)** for PR 20132 at commit [`c547d0f`](https://github.com/apache/spark/commit/c547d0fa7c4e1bb61312219df4d273af4a5a9db9).
[GitHub] spark issue #20150: [SPARK-22956][SS] Bug fix for 2 streams union failover s...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/20150 cc @zsxwing
[GitHub] spark issue #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEs...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20132 Updated!
[GitHub] spark pull request #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEn...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20132#discussion_r159815032

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala ---
@@ -205,60 +210,58 @@ class OneHotEncoderModel private[ml] (

   import OneHotEncoderModel._

-  // Returns the category size for a given index with `dropLast` and `handleInvalid`
+  // Returns the category size for each index with `dropLast` and `handleInvalid`
   // taken into account.
-  private def configedCategorySize(orgCategorySize: Int, idx: Int): Int = {
+  private def getConfigedCategorySizes: Array[Int] = {
     val dropLast = getDropLast
     val keepInvalid = getHandleInvalid == OneHotEncoderEstimator.KEEP_INVALID

     if (!dropLast && keepInvalid) {
       // When `handleInvalid` is "keep", an extra category is added as last category
       // for invalid data.
-      orgCategorySize + 1
+      categorySizes.map(_ + 1)
     } else if (dropLast && !keepInvalid) {
       // When `dropLast` is true, the last category is removed.
-      orgCategorySize - 1
+      categorySizes.map(_ - 1)
     } else {
       // When `dropLast` is true and `handleInvalid` is "keep", the extra category for invalid
       // data is removed. Thus, it is the same as the plain number of categories.
-      orgCategorySize
+      categorySizes
     }
   }

   private def encoder: UserDefinedFunction = {
-    val oneValue = Array(1.0)
-    val emptyValues = Array.empty[Double]
-    val emptyIndices = Array.empty[Int]
-    val dropLast = getDropLast
-    val handleInvalid = getHandleInvalid
-    val keepInvalid = handleInvalid == OneHotEncoderEstimator.KEEP_INVALID
+    val keepInvalid = getHandleInvalid == OneHotEncoderEstimator.KEEP_INVALID
+    val configedSizes = getConfigedCategorySizes
+    val localCategorySizes = categorySizes

     // The udf performed on input data. The first parameter is the input value. The second
-    // parameter is the index of input.
-    udf { (label: Double, idx: Int) =>
-      val plainNumCategories = categorySizes(idx)
-      val size = configedCategorySize(plainNumCategories, idx)
-
-      if (label < 0) {
-        throw new SparkException(s"Negative value: $label. Input can't be negative.")
-      } else if (label == size && dropLast && !keepInvalid) {
-        // When `dropLast` is true and `handleInvalid` is not "keep",
-        // the last category is removed.
-        Vectors.sparse(size, emptyIndices, emptyValues)
-      } else if (label >= plainNumCategories && keepInvalid) {
-        // When `handleInvalid` is "keep", encodes invalid data to last category (and removed
-        // if `dropLast` is true)
-        if (dropLast) {
-          Vectors.sparse(size, emptyIndices, emptyValues)
+    // parameter is the index in inputCols of the column being encoded.
+    udf { (label: Double, colIdx: Int) =>
+      val origCategorySize = localCategorySizes(colIdx)
+      // idx: index in vector of the single 1-valued element
+      val idx = if (label >= 0 && label < origCategorySize) {
+        label
+      } else {
+        if (keepInvalid) {
+          origCategorySize
         } else {
-          Vectors.sparse(size, Array(size - 1), oneValue)
+          if (label < 0) {
+            throw new SparkException(s"Negative value: $label. Input can't be negative. " +
--- End diff --

Great point, I'll make the change so that negative values are treated just like any other invalid value. We could add null/NaN in the future too.
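The category-size adjustment being reviewed above fits in a few lines; a Python sketch of the same branching, assuming a per-column list of category counts and the two flags from the diff (the function name is illustrative):

```python
def configed_category_sizes(category_sizes, drop_last, keep_invalid):
    """Adjust each column's category count for dropLast/handleInvalid,
    mirroring the branches of getConfigedCategorySizes in the diff."""
    if not drop_last and keep_invalid:
        # handleInvalid == "keep" adds an extra last category for invalid data.
        return [s + 1 for s in category_sizes]
    elif drop_last and not keep_invalid:
        # dropLast removes the last category.
        return [s - 1 for s in category_sizes]
    else:
        # Either both apply (the added invalid category is the one dropped)
        # or neither does, so the plain category count is unchanged.
        return list(category_sizes)
```

The two symmetric cases cancelling out is why the final branch can simply return the sizes untouched.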
[GitHub] spark pull request #20024: [SPARK-22825][SQL] Fix incorrect results of Casti...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20024
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20024 as a follow-up, we should do the same thing for struct and map type too.
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20024 thanks, merging to master/2.3!
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20076 Merged build finished. Test PASSed.
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20076 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85716/ Test PASSed.
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20076

**[Test build #85716 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85716/testReport)** for PR 20076 at commit [`b5cd809`](https://github.com/apache/spark/commit/b5cd809c680d089d4fa8da9bd43cd64ed1a3b138).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19080 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85715/ Test PASSed.
[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20147 **[Test build #85718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85718/testReport)** for PR 20147 at commit [`7b58d99`](https://github.com/apache/spark/commit/7b58d994f485255ffad59ed9ffb480a64a00cea2).
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19080 Merged build finished. Test PASSed.
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19080

**[Test build #85715 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85715/testReport)** for PR 19080 at commit [`639e9b6`](https://github.com/apache/spark/commit/639e9b67ed55ea109d6a89a0319d83124bbd7d30).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `sealed trait Distribution `
  * `case class HashPartitionedDistribution(expressions: Seq[Expression]) extends Distribution `
  * `case class BroadcastDistribution(mode: BroadcastMode) extends Distribution `
  * `case class UnknownPartitioning(numPartitions: Int) extends Partitioning`
  * `case class RoundRobinPartitioning(numPartitions: Int) extends Partitioning`
[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20147 retest this please
[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...
Github user squito commented on the issue: https://github.com/apache/spark/pull/20082 sorry this is a little late, but lgtm too. agree with the points above about leaving the old name deprecated and moving to the new name
[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20147 Merged build finished. Test FAILed.
[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20147 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85717/ Test FAILed.
[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20147

**[Test build #85717 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85717/testReport)** for PR 20147 at commit [`7b58d99`](https://github.com/apache/spark/commit/7b58d994f485255ffad59ed9ffb480a64a00cea2).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19992: [SPARK-22805][CORE] Use StorageLevel aliases in event lo...
Github user squito commented on the issue: https://github.com/apache/spark/pull/19992 change is fine, but from discussion on the jira I'm unclear if this is really worth it -- gain seems pretty small after the other fix in 2.3.
[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/19992#discussion_r159812291

--- Diff: core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala ---
@@ -2022,12 +1947,7 @@ private[spark] object JsonProtocolSuite extends Assertions {
        |    "Port": 300
        |  },
        |  "Block ID": "rdd_0_0",
-       |  "Storage Level": {
--- End diff --

yup, I completely agree that off heap is not respected in the json format. can you file a bug? I think it's still relevant even after this goes in, for custom levels
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20024 Merged build finished. Test PASSed.
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20024 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85714/ Test PASSed.
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20024

**[Test build #85714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85714/testReport)** for PR 20024 at commit [`dc15b93`](https://github.com/apache/spark/commit/dc15b93fe76a675136dd1bf08ce25ad3c55959b3).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20153 Merged build finished. Test PASSed.
[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20153 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85712/ Test PASSed.
[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20153

**[Test build #85712 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85712/testReport)** for PR 20153 at commit [`a019886`](https://github.com/apache/spark/commit/a01988624d0cde682aa820e59c89019812c3ef73).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class DataSourceRDDPartition[T : ClassTag](val index: Int, val readTask: ReadTask[T])`
  * `class DataSourceRDD[T: ClassTag](`
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85713/ Test FAILed.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Merged build finished. Test FAILed.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096

**[Test build #85713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85713/testReport)** for PR 20096 at commit [`f9ad94e`](https://github.com/apache/spark/commit/f9ad94e8aa753892f40ee8fc069969563347764c).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20024 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85710/ Test PASSed.
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20024 Merged build finished. Test PASSed.
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20024

**[Test build #85710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85710/testReport)** for PR 20024 at commit [`449e2c9`](https://github.com/apache/spark/commit/449e2c9c8c5c48a14a9b2efec728b350463188bf).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20163 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85709/ Test PASSed.
[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20163 Merged build finished. Test PASSed.
[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20163

**[Test build #85709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85709/testReport)** for PR 20163 at commit [`ca026d3`](https://github.com/apache/spark/commit/ca026d31a489f1e0eb451fe85df97083659d0f67).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20163 Wait .. Isn't this because we failed to call `toInternal` by the return type? Please give me a few days .. will double-check tonight.
[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20147 **[Test build #85717 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85717/testReport)** for PR 20147 at commit [`7b58d99`](https://github.com/apache/spark/commit/7b58d994f485255ffad59ed9ffb480a64a00cea2).
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user squito commented on the issue: https://github.com/apache/spark/pull/19848 @steveloughran can you bring this up on dev@? we should move this discussion off of this PR. (sorry haven't had a chance to look yet, but I appreciate you doing this)
[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20147 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85711/ Test PASSed.
[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20147 Merged build finished. Test PASSed.
[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20147

**[Test build #85711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85711/testReport)** for PR 20147 at commit [`3dbfffd`](https://github.com/apache/spark/commit/3dbfffd9a764cb35e05b85fc5c691a7708f31a0e).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20163: [SPARK-22966][PySpark] Spark SQL should handle Py...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20163#discussion_r159805825

--- Diff: python/pyspark/sql/udf.py ---
@@ -26,6 +26,28 @@

 def _wrap_function(sc, func, returnType):

+def coerce_to_str(v):
+    import datetime
+    if type(v) == datetime.date or type(v) == datetime.datetime:
+        return str(v)
--- End diff --

I think it's weird that we have a cast here alone ... Can't we register a custom Pyrolite unpickler? Does it make things more complicated?
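The coercion under discussion converts Python date/datetime values to their string form and passes everything else through; a self-contained sketch of the helper from the diff (the pass-through return for non-date values is implied by the diff's truncation and assumed here):

```python
import datetime

def coerce_to_str(v):
    # Exact type checks, mirroring the diff: date and datetime become their
    # ISO-style string form; all other values pass through unchanged.
    if type(v) in (datetime.date, datetime.datetime):
        return str(v)
    return v
```

Note that `str()` on a `datetime.date` gives `YYYY-MM-DD`, which is what makes the coercion lossy only in type, not in content.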
[GitHub] spark issue #20098: [SPARK-22914][DEPLOY] Register history.ui.port
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20098 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85705/ Test PASSed.
[GitHub] spark issue #20098: [SPARK-22914][DEPLOY] Register history.ui.port
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20098 Merged build finished. Test PASSed.
[GitHub] spark issue #20098: [SPARK-22914][DEPLOY] Register history.ui.port
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20098

**[Test build #85705 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85705/testReport)** for PR 20098 at commit [`03e0e27`](https://github.com/apache/spark/commit/03e0e271c58e1a4fe7381b60b06e87e5fe2c3a77).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20163: [SPARK-22966][PySpark] Spark SQL should handle Py...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20163#discussion_r159804507 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala --- @@ -120,10 +121,18 @@ object EvaluatePython { case (c: java.math.BigDecimal, dt: DecimalType) => Decimal(c, dt.precision, dt.scale) case (c: Int, DateType) => c --- End diff -- Of course, separate change obviously.
[GitHub] spark issue #20162: [SPARK-22965] [PySpark] [SQL] Add deterministic paramete...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20162 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85708/ Test FAILed.
[GitHub] spark issue #20162: [SPARK-22965] [PySpark] [SQL] Add deterministic paramete...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20162 Merged build finished. Test FAILed.
[GitHub] spark issue #20162: [SPARK-22965] [PySpark] [SQL] Add deterministic paramete...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20162 **[Test build #85708 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85708/testReport)** for PR 20162 at commit [`7e4f3c0`](https://github.com/apache/spark/commit/7e4f3c0f0c4082bf166cec0e72f9f86f5d23aac8). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20163: [SPARK-22966][PySpark] Spark SQL should handle Py...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20163#discussion_r159804356 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala --- @@ -120,10 +121,18 @@ object EvaluatePython { case (c: java.math.BigDecimal, dt: DecimalType) => Decimal(c, dt.precision, dt.scale) case (c: Int, DateType) => c --- End diff -- BTW, as a side note, I think we can build a converter for the type once and then reuse it.
[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20163 I think scalar and grouped map UDFs expect a pandas Series of datetime64[ns] (the native pandas timestamp type) rather than a pandas Series of datetime.date or datetime.datetime objects. I don't think pandas UDFs need to work with a Series of datetime.date or datetime.datetime objects, since the standard timestamp type in pandas is datetime64[ns].
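The distinction the comment draws can be checked directly in pandas. The following is a small illustrative sketch (not part of the PR): a Series built from `datetime.date` objects keeps generic `object` dtype until it is explicitly converted, after which pandas stores it as its native `datetime64[ns]` type.

```python
import datetime

import pandas as pd

# A Series of datetime.date objects is stored with generic object dtype...
raw = pd.Series([datetime.date(2018, 1, 5), datetime.date(2018, 1, 6)])
assert raw.dtype == object

# ...while pandas' native timestamp representation is datetime64[ns],
# which is the type the comment says pandas UDFs should expect.
converted = pd.to_datetime(raw)
assert str(converted.dtype) == "datetime64[ns]"
```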
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20076 **[Test build #85716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85716/testReport)** for PR 20076 at commit [`b5cd809`](https://github.com/apache/spark/commit/b5cd809c680d089d4fa8da9bd43cd64ed1a3b138).
[GitHub] spark issue #20160: [SPARK-22757][K8S] Enable spark.jars and spark.files in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20160 Merged build finished. Test PASSed.
[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...
Github user fjh100456 commented on a diff in the pull request: https://github.com/apache/spark/pull/20076#discussion_r159802320 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala --- @@ -27,7 +28,7 @@ import org.apache.spark.sql.internal.SQLConf /** * Options for the Parquet data source. */ -private[parquet] class ParquetOptions( --- End diff -- Yes, it should be revived. Thanks.
[GitHub] spark issue #20160: [SPARK-22757][K8S] Enable spark.jars and spark.files in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20160 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85703/ Test PASSed.
[GitHub] spark issue #20160: [SPARK-22757][K8S] Enable spark.jars and spark.files in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20160 **[Test build #85703 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85703/testReport)** for PR 20160 at commit [`1075cfc`](https://github.com/apache/spark/commit/1075cfc9ce18f3717bbbeb9950eafdda219f7233). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSui...
Github user bersprockets commented on a diff in the pull request: https://github.com/apache/spark/pull/20147#discussion_r159802199 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala --- @@ -85,6 +93,34 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils { new File(tmpDataDir, name).getCanonicalPath } + private def getFileFromUrl(urlString: String, targetDir: String, filename: String): Unit = { +val conf = new SparkConf +// if the caller passes the name of an existing file, we want doFetchFile to write over it with +// the contents from the specified url. +conf.set("spark.files.overwrite", "true") +val securityManager = new SecurityManager(conf) +val hadoopConf = new Configuration + +val outDir = new File(targetDir) +if (!outDir.exists()) { + outDir.mkdirs() +} + +// propagate exceptions up to the caller of getFileFromUrl +Utils.doFetchFile(urlString, outDir, filename, conf, securityManager, hadoopConf) + } + + private def getStringFromUrl(urlString: String, encoding: String = "UTF-8"): String = { --- End diff -- Oops. I should have caught that. Will fix.
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19080 **[Test build #85715 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85715/testReport)** for PR 19080 at commit [`639e9b6`](https://github.com/apache/spark/commit/639e9b67ed55ea109d6a89a0319d83124bbd7d30).
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19080 retest this please
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096 **[Test build #85713 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85713/testReport)** for PR 20096 at commit [`f9ad94e`](https://github.com/apache/spark/commit/f9ad94e8aa753892f40ee8fc069969563347764c).
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20024 **[Test build #85714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85714/testReport)** for PR 20024 at commit [`dc15b93`](https://github.com/apache/spark/commit/dc15b93fe76a675136dd1bf08ce25ad3c55959b3).
[GitHub] spark pull request #20024: [SPARK-22825][SQL] Fix incorrect results of Casti...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/20024#discussion_r159800597 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala --- @@ -608,6 +665,17 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String val tz = ctx.addReferenceObj("timeZone", timeZone) (c, evPrim, evNull) => s"""$evPrim = UTF8String.fromString( org.apache.spark.sql.catalyst.util.DateTimeUtils.timestampToString($c, $tz));""" + case ArrayType(et, _) => +(c, evPrim, evNull) => { + val bufferTerm = ctx.freshName("bufferTerm") --- End diff -- ok
[GitHub] spark pull request #20152: [SPARK-22957] ApproxQuantile breaks if the number...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20152
[GitHub] spark pull request #20024: [SPARK-22825][SQL] Fix incorrect results of Casti...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20024#discussion_r159799552 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala --- @@ -608,6 +665,17 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String val tz = ctx.addReferenceObj("timeZone", timeZone) (c, evPrim, evNull) => s"""$evPrim = UTF8String.fromString( org.apache.spark.sql.catalyst.util.DateTimeUtils.timestampToString($c, $tz));""" + case ArrayType(et, _) => +(c, evPrim, evNull) => { + val bufferTerm = ctx.freshName("bufferTerm") --- End diff -- super nit: In codegen we usually don't add a `term` postfix, just call it `buffer`, `array`, etc.
[GitHub] spark issue #20152: [SPARK-22957] ApproxQuantile breaks if the number of row...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20152 LGTM, merging to master/2.3!
[GitHub] spark pull request #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSui...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20147#discussion_r159798210 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala --- @@ -85,6 +93,34 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils { new File(tmpDataDir, name).getCanonicalPath } + private def getFileFromUrl(urlString: String, targetDir: String, filename: String): Unit = { +val conf = new SparkConf +// if the caller passes the name of an existing file, we want doFetchFile to write over it with +// the contents from the specified url. +conf.set("spark.files.overwrite", "true") +val securityManager = new SecurityManager(conf) +val hadoopConf = new Configuration + +val outDir = new File(targetDir) +if (!outDir.exists()) { + outDir.mkdirs() +} + +// propagate exceptions up to the caller of getFileFromUrl +Utils.doFetchFile(urlString, outDir, filename, conf, securityManager, hadoopConf) + } + + private def getStringFromUrl(urlString: String, encoding: String = "UTF-8"): String = { --- End diff -- Seems `encoding: String = "UTF-8"` is not used.
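The bug being flagged, a default parameter that is declared but never used, is easy to reproduce in miniature. Below is a hypothetical Python analogue (not the Spark code) showing the fixed shape: the `encoding` parameter is actually threaded through to `decode`, so a non-UTF-8 payload round-trips correctly.

```python
def get_string_from_bytes(data: bytes, encoding: str = "UTF-8") -> str:
    # Unlike the reviewed Scala method, the encoding parameter is actually
    # used here; a defaulted-but-ignored parameter would silently
    # mis-decode any non-UTF-8 payload.
    return data.decode(encoding)

# default UTF-8 path
assert get_string_from_bytes("héllo".encode("utf-8")) == "héllo"
# explicit non-default encoding is honored
assert get_string_from_bytes("héllo".encode("latin-1"), encoding="latin-1") == "héllo"
```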
[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20153 **[Test build #85712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85712/testReport)** for PR 20153 at commit [`a019886`](https://github.com/apache/spark/commit/a01988624d0cde682aa820e59c89019812c3ef73).
[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20147 **[Test build #85711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85711/testReport)** for PR 20147 at commit [`3dbfffd`](https://github.com/apache/spark/commit/3dbfffd9a764cb35e05b85fc5c691a7708f31a0e).
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r159797164 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -418,11 +418,16 @@ abstract class StreamExecution( * Blocks the current thread until processing for data from the given `source` has reached at * least the given `Offset`. This method is intended for use primarily when writing tests. */ - private[sql] def awaitOffset(source: Source, newOffset: Offset): Unit = { + private[sql] def awaitOffset(sourceIndex: Int, newOffset: Offset): Unit = { assertAwaitThread() def notDone = { val localCommittedOffsets = committedOffsets - !localCommittedOffsets.contains(source) || localCommittedOffsets(source) != newOffset + if (sources.length <= sourceIndex) { +false --- End diff -- `sources` is a var that might not be populated yet. (This race condition showed up in AddKafkaData in my tests.)
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20024 **[Test build #85710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85710/testReport)** for PR 20024 at commit [`449e2c9`](https://github.com/apache/spark/commit/449e2c9c8c5c48a14a9b2efec728b350463188bf).
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r159796525 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSinkV2.scala --- @@ -0,0 +1,142 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.kafka010 + +import org.apache.kafka.clients.producer.{Callback, ProducerRecord, RecordMetadata} + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.{Row, SparkSession} +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, Literal, UnsafeProjection} +import org.apache.spark.sql.kafka010.KafkaSourceProvider.{kafkaParamsForProducer, TOPIC_OPTION_KEY} +import org.apache.spark.sql.sources.v2.streaming.writer.ContinuousWriter +import org.apache.spark.sql.sources.v2.writer._ +import org.apache.spark.sql.streaming.OutputMode +import org.apache.spark.sql.types.{BinaryType, StringType, StructType} + +class ContinuousKafkaWriter( +topic: Option[String], producerParams: Map[String, String], schema: StructType) + extends ContinuousWriter with SupportsWriteInternalRow { + + override def createInternalRowWriterFactory(): KafkaWriterFactory = +KafkaWriterFactory(topic, producerParams, schema) + + override def commit(epochId: Long, messages: Array[WriterCommitMessage]): Unit = {} + override def abort(messages: Array[WriterCommitMessage]): Unit = {} +} + +case class KafkaWriterFactory( +topic: Option[String], producerParams: Map[String, String], schema: StructType) + extends DataWriterFactory[InternalRow] { + + override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[InternalRow] = { +new KafkaDataWriter(topic, producerParams, schema.toAttributes) + } +} + +case class KafkaWriterCommitMessage() extends WriterCommitMessage {} + +class KafkaDataWriter( +topic: Option[String], producerParams: Map[String, String], inputSchema: Seq[Attribute]) + extends DataWriter[InternalRow] { + import scala.collection.JavaConverters._ + + @volatile private var failedWrite: Exception = _ + private val projection = createProjection + private lazy val producer = CachedKafkaProducer.getOrCreate( +new java.util.HashMap[String, Object](producerParams.asJava)) + + 
private val callback = new Callback() { +override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = { + if (failedWrite == null && e != null) { +failedWrite = e + } +} + } + + def write(row: InternalRow): Unit = { +if (failedWrite != null) return + +val projectedRow = projection(row) +val topic = projectedRow.getUTF8String(0) +val key = projectedRow.getBinary(1) +val value = projectedRow.getBinary(2) + +if (topic == null) { + throw new NullPointerException(s"null topic present in the data. Use the " + +s"${KafkaSourceProvider.TOPIC_OPTION_KEY} option for setting a default topic.") +} +val record = new ProducerRecord[Array[Byte], Array[Byte]](topic.toString, key, value) +producer.send(record, callback) + } + + def commit(): WriterCommitMessage = KafkaWriterCommitMessage() + def abort(): Unit = {} + + def close(): Unit = { +checkForErrors() +if (producer != null) { + producer.flush() + checkForErrors() +} --- End diff -- I think CachedKafkaProducer handles closing automatically, but since these are long lived I can do it explicitly too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
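The error-handling pattern in `KafkaDataWriter` above, remember the first asynchronous send failure in a shared field and surface it on the next write or on close, is a common shape for async producers. The following is an illustrative Python analogue of that pattern, not Spark code; the class and method names are invented for the sketch.

```python
class AsyncSendErrorCapture:
    """Analogue of KafkaDataWriter's completion callback: the first
    asynchronous failure is remembered and re-raised by check_for_errors()."""

    def __init__(self):
        self._failed = None  # plays the role of the @volatile failedWrite field

    def on_completion(self, error):
        # Keep only the first error; later failures are ignored, matching
        # the `if (failedWrite == null && e != null)` guard in the diff.
        if self._failed is None and error is not None:
            self._failed = error

    def check_for_errors(self):
        # Called before each write and again on close/flush.
        if self._failed is not None:
            raise self._failed


writer = AsyncSendErrorCapture()
writer.on_completion(None)                    # successful send: nothing recorded
writer.on_completion(ValueError("broker gone"))  # first failure is captured
writer.on_completion(ValueError("ignored"))      # subsequent failures are dropped
```

The design choice mirrored here is that send callbacks run on an I/O thread, so they cannot raise into the writer directly; the failure has to be stashed and re-raised from the caller's thread.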
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20024 retest this please
[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20163 LGTM. cc @ueshin @icexelloss: is this behavior consistent with pandas UDFs?
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r159796099 --- Diff: external/kafka-0-10-sql/src/main/scala/ContinuousKafkaReader.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.kafka010 + +import java.io._ +import java.nio.charset.StandardCharsets +import java.util.concurrent.atomic.AtomicBoolean + +import org.apache.commons.io.IOUtils +import org.apache.kafka.clients.consumer.ConsumerRecord +import org.apache.kafka.common.TopicPartition +import org.apache.kafka.common.errors.WakeupException + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.{DataFrame, Row, SparkSession, SQLContext} +import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, Literal, UnsafeProjection, UnsafeRow} +import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeRowWriter} +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.execution.streaming.{HDFSMetadataLog, SerializedOffset} +import org.apache.spark.sql.kafka010.KafkaSource.{INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_FALSE, INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_TRUE, VERSION} +import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.sources.v2.streaming.reader.{ContinuousDataReader, ContinuousReader, Offset, PartitionOffset} +import org.apache.spark.sql.types.StructType +import org.apache.spark.unsafe.types.UTF8String + +class ContinuousKafkaReader( +kafkaReader: KafkaOffsetReader, +executorKafkaParams: java.util.Map[String, Object], +sourceOptions: Map[String, String], +metadataPath: String, +initialOffsets: KafkaOffsetRangeLimit, +failOnDataLoss: Boolean) + extends ContinuousReader with SupportsScanUnsafeRow with Logging { + + override def mergeOffsets(offsets: Array[PartitionOffset]): Offset = { +val mergedMap = offsets.map { + case KafkaSourcePartitionOffset(p, o) => Map(p -> o) +}.reduce(_ ++ _) +KafkaSourceOffset(mergedMap) + } + + private lazy val session = SparkSession.getActiveSession.get + private lazy val sc = session.sparkContext + + private lazy val pollTimeoutMs = sourceOptions.getOrElse( +"kafkaConsumer.pollTimeoutMs", +sc.conf.getTimeAsMs("spark.network.timeout", 
"120s").toString + ).toLong + + private val maxOffsetsPerTrigger = +sourceOptions.get("maxOffsetsPerTrigger").map(_.toLong) + + /** + * Lazily initialize `initialPartitionOffsets` to make sure that `KafkaConsumer.poll` is only + * called in StreamExecutionThread. Otherwise, interrupting a thread while running + * `KafkaConsumer.poll` may hang forever (KAFKA-1894). + */ + private lazy val initialPartitionOffsets = { + val offsets = initialOffsets match { +case EarliestOffsetRangeLimit => KafkaSourceOffset(kafkaReader.fetchEarliestOffsets()) +case LatestOffsetRangeLimit => KafkaSourceOffset(kafkaReader.fetchLatestOffsets()) +case SpecificOffsetRangeLimit(p) => fetchAndVerify(p) + } + logInfo(s"Initial offsets: $offsets") + offsets.partitionToOffsets + } + + private def fetchAndVerify(specificOffsets: Map[TopicPartition, Long]) = { +val result = kafkaReader.fetchSpecificOffsets(specificOffsets) +specificOffsets.foreach { + case (tp, off) if off != KafkaOffsetRangeLimit.LATEST && +off != KafkaOffsetRangeLimit.EARLIEST => +if (result(tp) != off) { + reportDataLoss( +s"startingOffsets for $tp was $off but consumer reset to ${result(tp)}") +} + case _ => + // no real way to check that beginning or end is reasonable +} +KafkaSourceOffset(result) + } + + // Initialized when creating read tasks. If this diverges from the partitions a
[GitHub] spark pull request #20163: [SPARK-22966][PySpark] Spark SQL should handle Py...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20163#discussion_r159796028 --- Diff: python/pyspark/sql/udf.py --- @@ -26,6 +26,28 @@ def _wrap_function(sc, func, returnType): +def coerce_to_str(v): +import datetime +if type(v) == datetime.date or type(v) == datetime.datetime: +return str(v) +else: +return v + +# Pyrolite will unpickle both Python datetime.date and datetime.datetime objects +# into java.util.Calendar objects, so the type information on the Python side is lost. +# This is problematic when Spark SQL needs to cast such objects into Spark SQL string type, +# because the format of the string should be different, depending on the type of the input +# object. So for those two specific types we eagerly convert them to string here, where the +# Python type information is still intact. +if returnType == StringType(): --- End diff -- Is this to handle the case where a Python UDF returns `date` or `datetime` but marks the return type as string?
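For reference, the helper under discussion can be exercised standalone. This sketch restates the PR's `coerce_to_str` from the diff above (not the merged implementation) and shows the two different string formats the inline comment refers to:

```python
import datetime


def coerce_to_str(v):
    # Exact type checks rather than isinstance, since datetime.datetime is a
    # subclass of datetime.date and the two stringify differently.
    if type(v) == datetime.date or type(v) == datetime.datetime:
        return str(v)
    return v


assert coerce_to_str(datetime.date(2018, 1, 5)) == "2018-01-05"
assert coerce_to_str(datetime.datetime(2018, 1, 5, 12, 30, 0)) == "2018-01-05 12:30:00"
assert coerce_to_str(42) == 42  # non-temporal values pass through unchanged
```

The eager conversion happens on the Python side precisely because, as the comment in the diff says, once Pyrolite has unpickled both types into `java.util.Calendar` the JVM can no longer tell which format to produce.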
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20024 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85701/ Test FAILed.
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20024 Merged build finished. Test FAILed.
[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20024 **[Test build #85701 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85701/testReport)** for PR 20024 at commit [`449e2c9`](https://github.com/apache/spark/commit/449e2c9c8c5c48a14a9b2efec728b350463188bf). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r159795887 --- Diff: external/kafka-0-10-sql/src/main/scala/ContinuousKafkaReader.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.kafka010 + +import java.io._ +import java.nio.charset.StandardCharsets +import java.util.concurrent.atomic.AtomicBoolean + +import org.apache.commons.io.IOUtils +import org.apache.kafka.clients.consumer.ConsumerRecord +import org.apache.kafka.common.TopicPartition +import org.apache.kafka.common.errors.WakeupException + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.{DataFrame, Row, SparkSession, SQLContext} +import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, Literal, UnsafeProjection, UnsafeRow} +import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeRowWriter} +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.execution.streaming.{HDFSMetadataLog, SerializedOffset} +import org.apache.spark.sql.kafka010.KafkaSource.{INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_FALSE, INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_TRUE, VERSION} +import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.sources.v2.streaming.reader.{ContinuousDataReader, ContinuousReader, Offset, PartitionOffset} +import org.apache.spark.sql.types.StructType +import org.apache.spark.unsafe.types.UTF8String + +class ContinuousKafkaReader( +kafkaReader: KafkaOffsetReader, +executorKafkaParams: java.util.Map[String, Object], +sourceOptions: Map[String, String], +metadataPath: String, +initialOffsets: KafkaOffsetRangeLimit, +failOnDataLoss: Boolean) + extends ContinuousReader with SupportsScanUnsafeRow with Logging { + + override def mergeOffsets(offsets: Array[PartitionOffset]): Offset = { +val mergedMap = offsets.map { + case KafkaSourcePartitionOffset(p, o) => Map(p -> o) +}.reduce(_ ++ _) +KafkaSourceOffset(mergedMap) + } + + private lazy val session = SparkSession.getActiveSession.get + private lazy val sc = session.sparkContext + + private lazy val pollTimeoutMs = sourceOptions.getOrElse( +"kafkaConsumer.pollTimeoutMs", +sc.conf.getTimeAsMs("spark.network.timeout", 
"120s").toString + ).toLong + + private val maxOffsetsPerTrigger = +sourceOptions.get("maxOffsetsPerTrigger").map(_.toLong) + + /** + * Lazily initialize `initialPartitionOffsets` to make sure that `KafkaConsumer.poll` is only + * called in StreamExecutionThread. Otherwise, interrupting a thread while running + * `KafkaConsumer.poll` may hang forever (KAFKA-1894). + */ + private lazy val initialPartitionOffsets = { + val offsets = initialOffsets match { +case EarliestOffsetRangeLimit => KafkaSourceOffset(kafkaReader.fetchEarliestOffsets()) +case LatestOffsetRangeLimit => KafkaSourceOffset(kafkaReader.fetchLatestOffsets()) +case SpecificOffsetRangeLimit(p) => fetchAndVerify(p) + } + logInfo(s"Initial offsets: $offsets") + offsets.partitionToOffsets + } + + private def fetchAndVerify(specificOffsets: Map[TopicPartition, Long]) = { +val result = kafkaReader.fetchSpecificOffsets(specificOffsets) +specificOffsets.foreach { + case (tp, off) if off != KafkaOffsetRangeLimit.LATEST && +off != KafkaOffsetRangeLimit.EARLIEST => +if (result(tp) != off) { + reportDataLoss( +s"startingOffsets for $tp was $off but consumer reset to ${result(tp)}") +} + case _ => + // no real way to check that beginning or end is reasonable +} +KafkaSourceOffset(result) + } + + // Initialized when creating read tasks. If this diverges from the partitions a
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r159795747 --- Diff: external/kafka-0-10-sql/src/main/scala/ContinuousKafkaReader.scala --- @@ -0,0 +1,246 @@
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user jose-torres commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r159795735 --- Diff: external/kafka-0-10-sql/src/main/scala/ContinuousKafkaReader.scala --- @@ -0,0 +1,246 @@ `private lazy val initialPartitionOffsets = {` --- End diff -- oops, no --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
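The `lazy val` flagged in this comment exists so the blocking `KafkaConsumer.poll` only runs on the stream execution thread: interrupting a thread mid-poll can hang forever (KAFKA-1894). A rough Python analogue of that deferred, compute-once pattern (names are illustrative; this is not Spark or Kafka API):

```python
# Deferred, compute-once initialization, analogous to Scala's `lazy val`:
# the (potentially blocking) fetch runs only when first accessed, i.e. on
# the thread that actually consumes the offsets, and exactly once.
from functools import cached_property

class OffsetHolder:
    def __init__(self, fetch):
        self._fetch = fetch      # e.g. a call that would block on the broker
        self.fetch_count = 0     # instrumented to show single evaluation

    @cached_property
    def initial_partition_offsets(self):
        self.fetch_count += 1    # runs on the first accessor's thread only
        return self._fetch()

holder = OffsetHolder(lambda: {("t", 0): 42})
```

Until `holder.initial_partition_offsets` is read, nothing is fetched; after the first read, the cached value is returned without re-fetching.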
[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20163 **[Test build #85709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85709/testReport)** for PR 20163 at commit [`ca026d3`](https://github.com/apache/spark/commit/ca026d31a489f1e0eb451fe85df97083659d0f67).
[GitHub] spark pull request #20163: [SPARK-22966][PySpark] Spark SQL should handle Py...
GitHub user rednaxelafx opened a pull request: https://github.com/apache/spark/pull/20163 [SPARK-22966][PySpark] Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime

## What changes were proposed in this pull request?

Perform appropriate conversions for results coming from Python UDFs that return `datetime.date` or `datetime.datetime`. Before this PR, Pyrolite would unpickle both `datetime.date` and `datetime.datetime` into a `java.util.Calendar`, a type Spark SQL doesn't understand, which then leads to incorrect results. An example of such an incorrect result:

```
>>> py_date = udf(datetime.date)
>>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30))).show(truncate=False)
+--+
|date(2017, 10, 30) |
+--+
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2017,MONTH=9,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=30,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=?,ZONE_OFFSET=?,DST_OFFSET=?]|
+--+
```

After this PR, the same query gives correct results:

```
>>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30))).show(truncate=False)
+--+
|date(2017, 10, 30)|
+--+
|2017-10-30|
+--+
```

An explicit non-goal of this PR is to change the behavior of timezone awareness or time
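Spark SQL represents its DATE type internally as a count of days since the Unix epoch (1970-01-01), so the conversion described above amounts to mapping the unpickled `datetime.date` into that integer form rather than leaving it as a `Calendar`. A hedged sketch of the date case (`to_spark_date_internal` is an illustrative name, not Spark API; the actual change lives in Spark's Python/JVM serialization path):

```python
# Convert a Python datetime.date into Spark SQL's internal DATE
# representation: whole days since the Unix epoch (1970-01-01).
# `to_spark_date_internal` is a made-up helper name for illustration.
from datetime import date

EPOCH = date(1970, 1, 1)

def to_spark_date_internal(d: date) -> int:
    return (d - EPOCH).days

internal = to_spark_date_internal(date(2017, 10, 30))  # -> 17469
```

Rendering `2017-10-30` from the internal value 17469 is then a pure formatting step, which is why the corrected `show()` output above displays a plain date string.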
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Build finished. Test FAILed.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20096 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85699/ Test FAILed.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096 **[Test build #85699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85699/testReport)** for PR 20096 at commit [`dae3a09`](https://github.com/apache/spark/commit/dae3a09e48565439ed8c22dda857a7747e518b3b). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark issue #20161: [SPARK-21525][streaming] Check error code from superviso...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20161 Merged build finished. Test PASSed.