[GitHub] [spark] dilipbiswal commented on issue #24759: [SPARK-27395][SQL] Improve EXPLAIN command
dilipbiswal commented on issue #24759: [SPARK-27395][SQL] Improve EXPLAIN command URL: https://github.com/apache/spark/pull/24759#issuecomment-524910740 Thanks a LOT @cloud-fan @gatorsmile @maryannxue @maropu @ekoifman This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
AmplabJenkins removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524910226 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
SparkQA removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524899246 **[Test build #109741 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109741/testReport)** for PR 25583 at commit [`98c9d69`](https://github.com/apache/spark/commit/98c9d697b18d04cfa922636a05c751e4456b33aa).
[GitHub] [spark] AmplabJenkins removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
AmplabJenkins removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524910231 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109741/
[GitHub] [spark] AmplabJenkins commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
AmplabJenkins commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524910231 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109741/
[GitHub] [spark] AmplabJenkins commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
AmplabJenkins commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524910226 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
SparkQA commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524910083 **[Test build #109741 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109741/testReport)** for PR 25583 at commit [`98c9d69`](https://github.com/apache/spark/commit/98c9d697b18d04cfa922636a05c751e4456b33aa).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] srowen commented on a change in pull request #25423: [SPARK-28701][test-hadoop3.2][test-java11][k8s] adding java11 support for pull request builds
srowen commented on a change in pull request #25423: [SPARK-28701][test-hadoop3.2][test-java11][k8s] adding java11 support for pull request builds URL: https://github.com/apache/spark/pull/25423#discussion_r317655226

File path: dev/run-tests.py

```diff
@@ -404,6 +404,12 @@ def run_scala_tests(build_tool, hadoop_version, test_modules, excluded_tags):
     if excluded_tags:
         test_profiles += ['-Dtest.exclude.tags=' + ",".join(excluded_tags)]
 
+    # set up java11 env if this is a pull request build with 'test-java11' in the title
+    if "test-java11" in os.environ["ghprbPullTitle"].lower():
+        os.environ["JAVA_HOME"] = "/usr/java/jdk-11.0.1"
+        os.environ["PATH"] = "%s/bin:%s" % (os.environ["JAVA_HOME"], os.environ["PATH"])
+        test_profiles += ['-Djava.version=11']
```

Review comment: Hm, and why does https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/ pass then? It is doing the same thing in the Jenkins config. (OK, I think I answered my own question below.) EDIT: Oh, because it doesn't run Pyspark tests?
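The title-gated environment setup in the diff above can be sketched as a standalone function. This is an illustrative analogue, not the patch itself: `env` stands in for `os.environ`, and `ghprbPullTitle` is the Jenkins GitHub PR builder variable carrying the pull request title, as used in the diff.

```python
def java11_profiles(env, jdk_home="/usr/java/jdk-11.0.1"):
    """Return extra build profiles, mutating env, when the PR title opts in to JDK 11.

    `env` is a dict standing in for os.environ; `ghprbPullTitle` holds the
    pull request title (set by the Jenkins GitHub PR builder plugin).
    """
    profiles = []
    if "test-java11" in env.get("ghprbPullTitle", "").lower():
        env["JAVA_HOME"] = jdk_home
        env["PATH"] = "%s/bin:%s" % (env["JAVA_HOME"], env.get("PATH", ""))
        profiles.append("-Djava.version=11")
    return profiles
```

Isolating the logic this way also makes the title-matching behavior easy to unit-test outside Jenkins.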
[GitHub] [spark] srowen commented on a change in pull request #25423: [SPARK-28701][test-hadoop3.2][test-java11][k8s] adding java11 support for pull request builds
srowen commented on a change in pull request #25423: [SPARK-28701][test-hadoop3.2][test-java11][k8s] adding java11 support for pull request builds URL: https://github.com/apache/spark/pull/25423#discussion_r317650935

File path: dev/run-tests.py

```diff
@@ -404,6 +404,12 @@ def run_scala_tests(build_tool, hadoop_version, test_modules, excluded_tags):
     if excluded_tags:
         test_profiles += ['-Dtest.exclude.tags=' + ",".join(excluded_tags)]
 
+    # set up java11 env if this is a pull request build with 'test-java11' in the title
+    if "test-java11" in os.environ["ghprbPullTitle"].lower():
+        os.environ["JAVA_HOME"] = "/usr/java/jdk-11.0.1"
+        os.environ["PATH"] = "%s/bin:%s" % (os.environ["JAVA_HOME"], os.environ["PATH"])
+        test_profiles += ['-Djava.version=11']
```

Review comment: It should use Java 11 if the path provides Java 11 and the test harness that runs Python tests does too. At least I don't know how else one would tell pyspark what to use! In fact I'm pretty sure the test failure here shows that it is using JDK 11. From JPMML: `java.lang.ClassNotFoundException: com.sun.xml.internal.bind.v2.ContextFactory`. This would be caused by JDK 11 changes. However, I don't get why all the other non-Python tests don't fail. Given the weird problem in https://github.com/apache/spark/pull/24651 I am wondering if we have some subtle classpath issues with how the Pyspark tests are run. This one, however, might be more directly solvable by figuring out what is suggesting the use of this old Sun JAXB implementation. I'll start digging around META-INF.
[GitHub] [spark] srowen commented on issue #25423: [SPARK-28701][test-hadoop3.2][test-java11][k8s] adding java11 support for pull request builds
srowen commented on issue #25423: [SPARK-28701][test-hadoop3.2][test-java11][k8s] adding java11 support for pull request builds URL: https://github.com/apache/spark/pull/25423#issuecomment-524903212 I personally think this is OK to merge simply because we need a way to test JDK 11, and this seems to do that. The rest of the error is orthogonal. So, in order to use this in a JDK 11 Jenkins build, how would one configure the Jenkins job? Right now it only triggers off the PR title (which is also useful). OK if that's a future step.
[GitHub] [spark] cloud-fan commented on a change in pull request #25570: [SPARK-23519][SQL] create view should work from query with duplicate output columns
cloud-fan commented on a change in pull request #25570: [SPARK-23519][SQL] create view should work from query with duplicate output columns URL: https://github.com/apache/spark/pull/25570#discussion_r317653404

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala

```diff
@@ -360,13 +362,15 @@ object ViewHelper {
   def generateViewProperties(
       properties: Map[String, String],
       session: SparkSession,
-      analyzedPlan: LogicalPlan): Map[String, String] = {
+      analyzedPlan: LogicalPlan,
+      fieldNames: Array[String]): Map[String, String] = {
+    // for createViewCommand queryOutput may be different from fieldNames
     val queryOutput = analyzedPlan.schema.fieldNames
     // Generate the query column names, throw an AnalysisException if there exists duplicate column
     // names.
     SchemaUtils.checkColumnNameDuplication(
```

Review comment: we can make the diff much smaller by changing this line to
```
val colNamesToCheck = if (userSpecifiedColumns.nonEmpty) userSpecifiedColumns.map(_._1) else queryOutput
SchemaUtils.checkColumnNameDuplication(colNamesToCheck, ...)
```
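The shape of cloud-fan's suggestion — validate the user-specified view columns when present, otherwise the analyzed query's output names — can be sketched in Python. Both function names here are hypothetical stand-ins: Spark's actual `SchemaUtils.checkColumnNameDuplication` raises `AnalysisException` and takes additional context arguments.

```python
def check_column_name_duplication(names, case_sensitive=False):
    """Raise ValueError when `names` contains duplicates.

    A stand-in for Spark's SchemaUtils.checkColumnNameDuplication; by default
    the comparison is case-insensitive, mirroring Spark's default resolution.
    """
    seen = set()
    for name in names:
        key = name if case_sensitive else name.lower()
        if key in seen:
            raise ValueError("Found duplicate column(s): %s" % name)
        seen.add(key)


def names_to_check(user_specified_columns, query_output):
    # Prefer the user-specified (name, comment) column pairs of the view;
    # fall back to the analyzed query's output names when none were given.
    if user_specified_columns:
        return [name for name, _ in user_specified_columns]
    return list(query_output)
```

With this split, `CREATE VIEW v (x) AS SELECT a, a FROM t` checks only `["x"]` and succeeds, while the bare `CREATE VIEW v AS SELECT a, a FROM t` still fails on the duplicate `a`.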
[GitHub] [spark] AmplabJenkins removed a comment on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog
AmplabJenkins removed a comment on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog URL: https://github.com/apache/spark/pull/25104#issuecomment-524901494 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14788/
[GitHub] [spark] AmplabJenkins removed a comment on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog
AmplabJenkins removed a comment on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog URL: https://github.com/apache/spark/pull/25104#issuecomment-524901483 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog
AmplabJenkins commented on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog URL: https://github.com/apache/spark/pull/25104#issuecomment-524901483 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog
AmplabJenkins commented on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog URL: https://github.com/apache/spark/pull/25104#issuecomment-524901494 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14788/
[GitHub] [spark] tgravescs commented on a change in pull request #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone
tgravescs commented on a change in pull request #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone URL: https://github.com/apache/spark/pull/25409#discussion_r317651126

File path: core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala

```diff
@@ -345,4 +376,47 @@ private[spark] object StandaloneResourceUtils extends Logging {
   def needCoordinate(conf: SparkConf): Boolean = {
     conf.get(SPARK_RESOURCES_COORDINATE)
   }
+
+  def toMutable(immutableResources: Map[String, ResourceInformation])
+    : Map[String, MutableResourceInfo] = {
+    immutableResources.map { case (rName, rInfo) =>
+      val mutableAddress = new mutable.HashSet[String]()
+      rInfo.addresses.foreach(mutableAddress.add)
```

Review comment: this can just be: `mutableAddress ++= rInfo.addresses`
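The simplification tgravescs points out — one bulk collection copy instead of element-wise `add` calls — has a direct Python analogue, sketched here with plain dicts and sets (names are illustrative, not Spark's):

```python
def to_mutable(immutable_resources):
    """Copy a map of resource name -> tuple of addresses into mutable sets.

    Mirrors the suggested `mutableAddress ++= rInfo.addresses`: the whole
    address collection is added in one step via the set constructor,
    rather than looping and calling `add` per address.
    """
    return {name: set(addresses) for name, addresses in immutable_resources.items()}
```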
[GitHub] [spark] SparkQA commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
SparkQA commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524899246 **[Test build #109741 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109741/testReport)** for PR 25583 at commit [`98c9d69`](https://github.com/apache/spark/commit/98c9d697b18d04cfa922636a05c751e4456b33aa).
[GitHub] [spark] SparkQA commented on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog
SparkQA commented on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog URL: https://github.com/apache/spark/pull/25104#issuecomment-524899251 **[Test build #109742 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109742/testReport)** for PR 25104 at commit [`6847688`](https://github.com/apache/spark/commit/6847688f85bac472b6777de77bc7f8a9654fd6f6).
[GitHub] [spark] AmplabJenkins removed a comment on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions
AmplabJenkins removed a comment on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions URL: https://github.com/apache/spark/pull/20965#issuecomment-524898399 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109733/
[GitHub] [spark] AmplabJenkins removed a comment on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions
AmplabJenkins removed a comment on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions URL: https://github.com/apache/spark/pull/20965#issuecomment-524898391 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions
AmplabJenkins commented on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions URL: https://github.com/apache/spark/pull/20965#issuecomment-524898399 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109733/
[GitHub] [spark] AmplabJenkins commented on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions
AmplabJenkins commented on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions URL: https://github.com/apache/spark/pull/20965#issuecomment-524898391 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions
SparkQA removed a comment on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions URL: https://github.com/apache/spark/pull/20965#issuecomment-524840399 **[Test build #109733 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109733/testReport)** for PR 20965 at commit [`fee5edb`](https://github.com/apache/spark/commit/fee5edb6ba82149329e3452c1f4d8580ff251996).
[GitHub] [spark] SparkQA commented on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions
SparkQA commented on issue #20965: [SPARK-21870][SQL] Split aggregation code into small functions URL: https://github.com/apache/spark/pull/20965#issuecomment-524898067 **[Test build #109733 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109733/testReport)** for PR 20965 at commit [`fee5edb`](https://github.com/apache/spark/commit/fee5edb6ba82149329e3452c1f4d8580ff251996).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] Ngone51 commented on a change in pull request #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone
Ngone51 commented on a change in pull request #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone URL: https://github.com/apache/spark/pull/25409#discussion_r317646956

File path: core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala

```diff
@@ -64,12 +61,54 @@ private[spark] class WorkerInfo(
   def coresFree: Int = cores - coresUsed
   def memoryFree: Int = memory - memoryUsed
 
-  def resourcesFree: Map[String, Int] = {
+  def resourcesAmountFree: Map[String, Int] = {
     resources.map { case (rName, rInfo) =>
       rName -> rInfo.availableAddrs.length
     }
   }
+
+  def resourcesInfo[T: ClassTag]: Map[String, T] = {
+    resources.map { case (rName, rInfo) =>
+      rName -> createResourceInfo(rName, rInfo.addresses, implicitly[ClassTag[T]])
+    }
+  }
+
+  def resourcesInfoFree: Map[String, ResourceInformation] = {
+    resources.map { case (rName, rInfo) =>
+      rName -> createResourceInfo(rName, rInfo.availableAddrs,
+        implicitly[ClassTag[ResourceInformation]])
+    }
+  }
+
+  def resourcesInfoUsed[T: ClassTag]: Map[String, T] = {
+    resources.map { case (rName, rInfo) =>
+      rName -> createResourceInfo(rName, rInfo.assignedAddrs, implicitly[ClassTag[T]])
+    }
+  }
+
+  private def createResourceInfo[T](
+      name: String,
+      addresses: Seq[String],
+      ct: ClassTag[T]): T = {
+    val clazz = ct.runtimeClass
+    val rf = {
+      clazz match {
+        case _ if clazz.equals(classOf[MutableResourceInfo]) =>
```

Review comment: Thanks. I chose the alternative one, which seems simpler in WorkerInfo, while the other one actually needs 4 functions (combinations of [immutable, mutable] * [total, used]), which seems noisy in WorkerInfo from my view.
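The ClassTag-driven construction debated above — picking which resource-info class to build from the requested runtime class — can be sketched in Python with a class-keyed factory table. All names here are hypothetical stand-ins for the Scala types in the diff, not Spark's actual API:

```python
class ResourceInformation:
    """Immutable view of a resource: name plus a fixed tuple of addresses."""
    def __init__(self, name, addresses):
        self.name, self.addresses = name, tuple(addresses)


class MutableResourceInfo:
    """Mutable view of a resource: name plus a growable set of addresses."""
    def __init__(self, name, addresses):
        self.name, self.addresses = name, set(addresses)


# Class-keyed dispatch table, playing the role of matching on ct.runtimeClass.
_FACTORIES = {
    ResourceInformation: ResourceInformation,
    MutableResourceInfo: MutableResourceInfo,
}


def create_resource_info(cls, name, addresses):
    """Build the requested resource-info flavor for one (name, addresses) pair."""
    try:
        return _FACTORIES[cls](name, addresses)
    except KeyError:
        raise TypeError("unsupported resource info class: %r" % cls)
```

The trade-off Ngone51 describes is visible here too: one dispatching helper serves every [immutable, mutable] * [total, used] combination, at the cost of a runtime class check instead of four separate statically-typed functions.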
[GitHub] [spark] SparkQA commented on issue #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone
SparkQA commented on issue #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone URL: https://github.com/apache/spark/pull/25409#issuecomment-524896215 **[Test build #109740 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109740/testReport)** for PR 25409 at commit [`1e64a33`](https://github.com/apache/spark/commit/1e64a33dd0a30a8b816025e528622c5d9c9f99eb).
[GitHub] [spark] AmplabJenkins removed a comment on issue #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone
AmplabJenkins removed a comment on issue #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone URL: https://github.com/apache/spark/pull/25409#issuecomment-524895177 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone
AmplabJenkins removed a comment on issue #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone URL: https://github.com/apache/spark/pull/25409#issuecomment-524895186 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14787/
[GitHub] [spark] HeartSaVioR commented on a change in pull request #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
HeartSaVioR commented on a change in pull request #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#discussion_r317645581

## File path: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceInitialOffsetWriter.scala

@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.io._
+import java.nio.charset.StandardCharsets
+
+import org.apache.commons.io.IOUtils
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.execution.streaming.{HDFSMetadataLog, SerializedOffset}
+
+/** A version of [[HDFSMetadataLog]] specialized for saving the initial offsets. */
+private[kafka010] class KafkaSourceInitialOffsetWriter(
+    sparkSession: SparkSession,
+    metadataPath: String)
+  extends HDFSMetadataLog[KafkaSourceOffset](sparkSession, metadataPath) {
+
+  val VERSION = 1
+
+  override def serialize(metadata: KafkaSourceOffset, out: OutputStream): Unit = {
+    out.write(0) // A zero byte is written to support Spark 2.1.0 (SPARK-19517)
+    val writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))
+    writer.write("v" + VERSION + "\n")
+    writer.write(metadata.json)
+    writer.flush
+  }
+
+  override def deserialize(in: InputStream): KafkaSourceOffset = {
+    in.read() // A zero byte is read to support Spark 2.1.0 (SPARK-19517)
+    val content = IOUtils.toString(new InputStreamReader(in, StandardCharsets.UTF_8))
+    // HDFSMetadataLog guarantees that it never creates a partial file.
+    assert(content.length != 0)

Review comment: We want to assert here, but `require` would be better, since `assert` may simply be elided depending on JVM/compiler options. I've just changed it to `require(content.nonEmpty)`.
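The `assert` vs `require` point above generalizes beyond Scala: Scala's `assert` can be elided by the compiler, while `require` always throws on a violated precondition. As a hedged illustration in Python (where a bare `assert` is likewise stripped under `python -O`), an explicit check plays the role of Scala's `require`; the helper name here is ours, not Spark's:

```python
def require(condition, message="requirement failed"):
    """Analogue of Scala's require: raises even when assertions are disabled."""
    if not condition:
        raise ValueError(message)

# The offset log content must be non-empty, mirroring require(content.nonEmpty).
content = "v1\n"
require(len(content) > 0, "offset log content must not be empty")  # passes

# A bare `assert content` would be silently skipped under `python -O`,
# so it cannot be relied on to guard real invariants in production.
```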
[GitHub] [spark] AmplabJenkins commented on issue #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone
AmplabJenkins commented on issue #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone URL: https://github.com/apache/spark/pull/25409#issuecomment-524895177 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone
AmplabJenkins commented on issue #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone URL: https://github.com/apache/spark/pull/25409#issuecomment-524895186 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14787/ Test PASSed.
[GitHub] [spark] Ngone51 commented on a change in pull request #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone
Ngone51 commented on a change in pull request #25409: [SPARK-28414][WEBUI] UI updates to show resource info in Standalone URL: https://github.com/apache/spark/pull/25409#discussion_r317644596

## File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala

@@ -242,6 +243,22 @@ private[deploy] class Worker(
       System.exit(1)
     }
   }
+    resources.keys.foreach { rName =>
+      resourcesUsed(rName) = MutableResourceInfo(rName, new HashSet[String])
+    }
+  }
+
+  private def updateResourcesUsed(deltaInfo: Map[String, ResourceInformation], add: Boolean)

Review comment: good point!
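For readers following along, the `updateResourcesUsed(deltaInfo, add)` helper in the diff applies a delta of resource addresses to the worker's usage bookkeeping. A minimal Python sketch of that idea (the names and shapes here are illustrative only, not Spark's actual types):

```python
# Illustrative sketch: each resource name maps to the set of addresses in use,
# loosely mirroring Worker.resourcesUsed with MutableResourceInfo.
resources_used = {}

def update_resources_used(delta, add):
    """Apply (add=True) or release (add=False) a delta of {resource: addresses}."""
    for name, addresses in delta.items():
        used = resources_used.setdefault(name, set())
        if add:
            used.update(addresses)
        else:
            used.difference_update(addresses)

update_resources_used({"gpu": {"0", "1"}}, add=True)   # an executor is launched
update_resources_used({"gpu": {"1"}}, add=False)       # the executor exits
# resources_used is now {"gpu": {"0"}}
```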
[GitHub] [spark] HeartSaVioR commented on a change in pull request #25582: [SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used
HeartSaVioR commented on a change in pull request #25582: [SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used URL: https://github.com/apache/spark/pull/25582#discussion_r317642200

## File path: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaDataConsumerSuite.scala

@@ -60,23 +70,40 @@ class KafkaDataConsumerSuite extends SharedSparkSession with PrivateMethodTester
     assert(e.getCause === cause)
   }

+  test("new KafkaDataConsumer instance in case of Task retry") {
+    try {
+      KafkaDataConsumer.cache.clear()
+
+      val kafkaParams = getKafkaParams()
+      val key = new CacheKey(groupId, topicPartition)
+
+      val context1 = new TaskContextImpl(0, 0, 0, 0, 0, null, null, null)
+      TaskContext.setTaskContext(context1)
+      val consumer1 = KafkaDataConsumer.acquire(topicPartition, kafkaParams, true)
+      consumer1.release()
+
+      assert(KafkaDataConsumer.cache.size() == 1)
+      assert(KafkaDataConsumer.cache.get(key).eq(consumer1.internalConsumer))
+
+      val context2 = new TaskContextImpl(0, 0, 0, 0, 1, null, null, null)
+      TaskContext.setTaskContext(context2)
+      val consumer2 = KafkaDataConsumer.acquire(topicPartition, kafkaParams, true)
+      consumer2.release()
+
+      // The first consumer should be removed from cache and new non-cached should be returned

Review comment: I'd say consumer2 should be cached, since it's created after the invalidation, but this PR only touches the tests, so that's OK.
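The behavior under test: consumers are cached by (group id, partition), and a task retry (a higher attempt number for the same key) invalidates the entry created by the failed attempt. A rough Python sketch of the caching the reviewer suggests, where the post-invalidation consumer does get cached; the names are illustrative, not Spark's API:

```python
# Illustrative sketch: cache keyed by (group_id, partition), remembering the
# task attempt number that created each entry. A later attempt evicts the old
# consumer (it may be in a bad state after the task failure) and caches a new one.
cache = {}

class Consumer:
    def __init__(self, attempt):
        self.attempt = attempt

def acquire(key, attempt_number):
    cached = cache.get(key)
    if cached is not None and cached.attempt < attempt_number:
        del cache[key]  # invalidate the consumer from the failed attempt
        cached = None
    if cached is None:
        cached = Consumer(attempt_number)
        cache[key] = cached
    return cached

key = ("spark-kafka-source-group", 0)
c1 = acquire(key, attempt_number=0)
c2 = acquire(key, attempt_number=1)  # retry: c1's entry is evicted
# c2 is a fresh consumer and is now the only cached entry
```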
[GitHub] [spark] HeartSaVioR commented on a change in pull request #25582: [SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used
HeartSaVioR commented on a change in pull request #25582: [SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used URL: https://github.com/apache/spark/pull/25582#discussion_r317643184

## File path: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala

@@ -78,7 +78,7 @@ private[kafka010] sealed trait KafkaDataConsumer {
   def release(): Unit

   /** Reference to the internal implementation that this wrapper delegates to */
-  protected def internalConsumer: InternalKafkaConsumer
+  def internalConsumer: InternalKafkaConsumer

Review comment: That's still effectively `private[kafka010]`, since the enclosing class has that scope, so it seems OK.
[GitHub] [spark] SparkQA commented on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
SparkQA commented on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion URL: https://github.com/apache/spark/pull/25581#issuecomment-524886240 **[Test build #109739 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109739/testReport)** for PR 25581 at commit [`68da9cc`](https://github.com/apache/spark/commit/68da9cca6030ed2bc248ccbb6d7f8731f0bae59d).
[GitHub] [spark] AmplabJenkins removed a comment on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
AmplabJenkins removed a comment on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion URL: https://github.com/apache/spark/pull/25581#issuecomment-524885423 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
AmplabJenkins removed a comment on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion URL: https://github.com/apache/spark/pull/25581#issuecomment-524885431 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14786/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
AmplabJenkins commented on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion URL: https://github.com/apache/spark/pull/25581#issuecomment-524885431 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14786/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
AmplabJenkins commented on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion URL: https://github.com/apache/spark/pull/25581#issuecomment-524885423 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
AmplabJenkins removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524883602 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109736/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
AmplabJenkins removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524883588 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
SparkQA removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524873878 **[Test build #109736 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109736/testReport)** for PR 25583 at commit [`1daf77b`](https://github.com/apache/spark/commit/1daf77b5822260a7b5d1add98ba6b09d4151ffb3).
[GitHub] [spark] gengliangwang commented on a change in pull request #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
gengliangwang commented on a change in pull request #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion URL: https://github.com/apache/spark/pull/25581#discussion_r317631925

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala

@@ -371,12 +373,14 @@ object DataType {
       byName: Boolean,
       resolver: Resolver,
       context: String,
+      storeAssignmentPolicy: StoreAssignmentPolicy.Value,

Review comment: I think keeping the original policy is also fine. Otherwise, it is hard to tell that we are using ANSI mode if `isStrict` is `false`.
[GitHub] [spark] srowen closed pull request #25439: [SPARK-28709][DSTREAMS] Fix StreamingContext leak through Streaming
srowen closed pull request #25439: [SPARK-28709][DSTREAMS] Fix StreamingContext leak through Streaming URL: https://github.com/apache/spark/pull/25439
[GitHub] [spark] srowen commented on issue #25439: [SPARK-28709][DSTREAMS] Fix StreamingContext leak through Streaming
srowen commented on issue #25439: [SPARK-28709][DSTREAMS] Fix StreamingContext leak through Streaming URL: https://github.com/apache/spark/pull/25439#issuecomment-524883593 Merged to master
[GitHub] [spark] AmplabJenkins commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
AmplabJenkins commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524883588 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
SparkQA commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524883440 **[Test build #109736 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109736/testReport)** for PR 25583 at commit [`1daf77b`](https://github.com/apache/spark/commit/1daf77b5822260a7b5d1add98ba6b09d4151ffb3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
AmplabJenkins commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524883602 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109736/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25584: [SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table
AmplabJenkins removed a comment on issue #25584: [SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table URL: https://github.com/apache/spark/pull/25584#issuecomment-524882062 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14784/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25584: [SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table
AmplabJenkins removed a comment on issue #25584: [SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table URL: https://github.com/apache/spark/pull/25584#issuecomment-524882056 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #25584: [SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table
SparkQA commented on issue #25584: [SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table URL: https://github.com/apache/spark/pull/25584#issuecomment-524883020 **[Test build #109737 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109737/testReport)** for PR 25584 at commit [`a9eea2a`](https://github.com/apache/spark/commit/a9eea2adbe69bc105b239a1289d6c0d92189e7a2).
[GitHub] [spark] SparkQA commented on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
SparkQA commented on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion URL: https://github.com/apache/spark/pull/25581#issuecomment-524883025 **[Test build #109738 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109738/testReport)** for PR 25581 at commit [`e2b3754`](https://github.com/apache/spark/commit/e2b37544b4888008ab92c554bb2dfe51ecae4b35).
[GitHub] [spark] AmplabJenkins removed a comment on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
AmplabJenkins removed a comment on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion URL: https://github.com/apache/spark/pull/25581#issuecomment-524882189 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14785/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
AmplabJenkins removed a comment on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion URL: https://github.com/apache/spark/pull/25581#issuecomment-524882174 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
AmplabJenkins commented on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion URL: https://github.com/apache/spark/pull/25581#issuecomment-524882189 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14785/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
AmplabJenkins commented on issue #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion URL: https://github.com/apache/spark/pull/25581#issuecomment-524882174 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25584: [SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table
AmplabJenkins commented on issue #25584: [SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table URL: https://github.com/apache/spark/pull/25584#issuecomment-524882062 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/14784/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25584: [SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table
AmplabJenkins commented on issue #25584: [SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table URL: https://github.com/apache/spark/pull/25584#issuecomment-524882056 Merged build finished. Test PASSed.
[GitHub] [spark] wangyum opened a new pull request #25584: [SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table
wangyum opened a new pull request #25584: [SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table URL: https://github.com/apache/spark/pull/25584

### What changes were proposed in this pull request?
This PR makes `spark.sql.statistics.fallBackToHdfs` not support Hive partitioned tables.

### Why are the changes needed?
The current implementation is incorrect for external partitions, and supporting partitioned tables with external partitions would be expensive.

### Does this PR introduce any user-facing change?
Yes, but I do not think it will change the join strategy, because partitioned tables are usually very large.

### How was this patch tested?
Unit test.
[GitHub] [spark] srowen commented on a change in pull request #25545: [SPARK-28843][PYTHON] Set OMP_NUM_THREADS to executor cores for python
srowen commented on a change in pull request #25545: [SPARK-28843][PYTHON] Set OMP_NUM_THREADS to executor cores for python URL: https://github.com/apache/spark/pull/25545#discussion_r317627588

## File path: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala

@@ -106,6 +106,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
     val startTime = System.currentTimeMillis
     val env = SparkEnv.get
     val localdir = env.blockManager.diskBlockManager.localDirs.map(f => f.getPath()).mkString(",")
+    // if OMP_NUM_THREADS is not explicitly set, override it with the number of cores
+    if (conf.getOption("spark.executorEnv.OMP_NUM_THREADS").isEmpty) {
+      // SPARK-28843: limit the OpenMP thread pool to the number of cores assigned to this executor
+      // this avoids high memory consumption with pandas/numpy because of a large OpenMP thread pool
+      // see https://github.com/numpy/numpy/issues/10455
+      conf.getOption("spark.executor.cores").foreach(envVars.put("OMP_NUM_THREADS", _))

Review comment: 1 is just the default in YARN; the meaning is the same everywhere. Let me break it down further, to illustrate. All this is doing is ensuring that a process isn't using more cores than it should, which ought to always be a good thing. (And it saves memory along the way.) Suppose there's a 16-core machine.

Case 1: JVM Spark
a) `spark.executor.cores` = 16. There is one JVM using all cores.
b) `spark.executor.cores` = 4. There are (up to) 4 JVMs using 4 cores each.

Case 2: Pyspark
a) `spark.executor.cores` = 16. There are 16 Python processes.
b) `spark.executor.cores` = 4. There are still 16 Python processes.

In case 1a, imagine using MLlib that uses OpenBLAS or MKL. By default, OpenMP will use all 16 cores now. This is fine, and does not change with this change. In case 1b, each JVM will use 16 cores, so OpenMP will attempt to use 64 total threads (to my understanding here), right now. This change would make this use 16 cores (4 x 4). That's better.
In cases 2a and 2b, imagine using numpy. 256 threads will be used in total on the machine! That's bad: it's a little slower because of all the context switching, and it also uses more memory. This change does not, however, help case 2a. It does help 2b, where 'at least' only 64 threads are started. The more aggressive change would be to set the default to 1, always, for Pyspark as well, since that matches the execution model better. However, this is at least a more conservative step that merely caps it at the number of allocated executor cores. Yes, the situation isn't as bad if the executor isn't actually fully utilized, but I don't think we should optimize for that case; at least, this more conservative change still errs on the side of over-committing the cores at the cost of memory, just not nearly as extremely as the default.
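The conservative default described above boils down to: leave OMP_NUM_THREADS alone if the user set it, otherwise cap it at `spark.executor.cores`. A hedged Python-side sketch of that logic and of the thread arithmetic from the example (illustrative only; the actual change lives in `PythonRunner.scala` and sets the variable in the worker's environment):

```python
import os

def configure_omp(executor_cores, env=None):
    """Cap OpenMP's thread pool at the executor's core allotment unless the
    user has set OMP_NUM_THREADS explicitly. Must run before numpy (and its
    OpenMP-backed BLAS) is first imported to have any effect."""
    if env is None:
        env = os.environ
    env.setdefault("OMP_NUM_THREADS", str(executor_cores))
    return int(env["OMP_NUM_THREADS"])

# Case 2b on a 16-core box: 16 Python workers regardless of executor.cores = 4.
workers = 16
threads_uncapped = workers * 16                       # 256 threads machine-wide
threads_capped = workers * configure_omp(4, env={})   # 64 with the cap
```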
[GitHub] [spark] AmplabJenkins removed a comment on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog
AmplabJenkins removed a comment on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog URL: https://github.com/apache/spark/pull/25104#issuecomment-524879679 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109730/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog
AmplabJenkins removed a comment on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog URL: https://github.com/apache/spark/pull/25104#issuecomment-524879672 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog
AmplabJenkins commented on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog URL: https://github.com/apache/spark/pull/25104#issuecomment-524879672 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog
AmplabJenkins commented on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog URL: https://github.com/apache/spark/pull/25104#issuecomment-524879679 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109730/ Test FAILed.
[GitHub] [spark] SparkQA removed a comment on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog
SparkQA removed a comment on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog URL: https://github.com/apache/spark/pull/25104#issuecomment-524824809 **[Test build #109730 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109730/testReport)** for PR 25104 at commit [`e6a18ac`](https://github.com/apache/spark/commit/e6a18accc7cbc84ebf738ad23d4dd3864e8d4de5).
[GitHub] [spark] SparkQA commented on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog
SparkQA commented on issue #25104: [SPARK-28341][SQL] create a public API for V2SessionCatalog URL: https://github.com/apache/spark/pull/25104#issuecomment-524879351 **[Test build #109730 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109730/testReport)** for PR 25104 at commit [`e6a18ac`](https://github.com/apache/spark/commit/e6a18accc7cbc84ebf738ad23d4dd3864e8d4de5).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `public abstract class CatalogExtension implements TableCatalog`
  * `class CatalogManager(conf: SQLConf, sessionCatalog: TableCatalog) extends Logging`
  * `class V2SessionCatalog(catalog: SessionCatalog, conf: SQLConf) extends TableCatalog`
[GitHub] [spark] cloud-fan commented on issue #25347: [SPARK-28610][SQL] Allow having a decimal buffer for long sum
cloud-fan commented on issue #25347: [SPARK-28610][SQL] Allow having a decimal buffer for long sum URL: https://github.com/apache/spark/pull/25347#issuecomment-524876439 @mgaido91 can you do a simple microbenchmark? If the performance overhead is not significant, we can use decimal as the sum buffer and provide a legacy config to use long as the buffer.
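A rough sketch of the kind of microbenchmark being requested: compare summing into a `Long` buffer versus a `java.math.BigDecimal` buffer. This is illustrative only; a real Spark benchmark would go through Spark's `Benchmark` utility and the actual `Sum` aggregate, not a hand-rolled loop, and `SumBufferBench` and its helpers are names invented for this sketch.

```scala
// Illustrative micro-timing of long vs. decimal accumulation buffers.
object SumBufferBench {
  def time[A](label: String)(f: => A): A = {
    val start = System.nanoTime()
    val result = f
    println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val n = 10000000
    // Long buffer: cheap per-row add, but can silently overflow.
    val longSum = time("long buffer") {
      var acc = 0L
      var i = 0
      while (i < n) { acc += i; i += 1 }
      acc
    }
    // BigDecimal buffer: overflow-safe, but each add allocates a new object.
    val decSum = time("decimal buffer") {
      var acc = java.math.BigDecimal.ZERO
      var i = 0
      while (i < n) { acc = acc.add(java.math.BigDecimal.valueOf(i.toLong)); i += 1 }
      acc
    }
    assert(decSum.longValueExact() == longSum) // both compute n*(n-1)/2
  }
}
```

The interesting number is the ratio between the two timings: if the decimal buffer is only modestly slower, the overflow-safe default plus a legacy config becomes an easy sell.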
[GitHub] [spark] srowen commented on a change in pull request #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
srowen commented on a change in pull request #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
URL: https://github.com/apache/spark/pull/25583#discussion_r317621710

File path: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceInitialOffsetWriter.scala

```scala
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.io._
+import java.nio.charset.StandardCharsets
+
+import org.apache.commons.io.IOUtils
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.execution.streaming.{HDFSMetadataLog, SerializedOffset}
+
+/** A version of [[HDFSMetadataLog]] specialized for saving the initial offsets. */
+private[kafka010] class KafkaSourceInitialOffsetWriter(
+    sparkSession: SparkSession,
+    metadataPath: String)
+  extends HDFSMetadataLog[KafkaSourceOffset](sparkSession, metadataPath) {
+
+  val VERSION = 1
+
+  override def serialize(metadata: KafkaSourceOffset, out: OutputStream): Unit = {
+    out.write(0) // A zero byte is written to support Spark 2.1.0 (SPARK-19517)
+    val writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))
+    writer.write("v" + VERSION + "\n")
```

Review comment: And while we're here, you could use interpolation here.
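The interpolation suggestion above amounts to replacing the `"+"` concatenation with an `s`-interpolated string. A minimal sketch, with `VERSION` standing in for the class's field:

```scala
// Minimal sketch of the suggested change: Scala string interpolation instead
// of "+" concatenation. VERSION is a stand-in for the class's field.
object InterpolationExample {
  val VERSION = 1

  def main(args: Array[String]): Unit = {
    val concatenated = "v" + VERSION + "\n" // current form
    val interpolated = s"v$VERSION\n"       // suggested form
    assert(concatenated == interpolated)    // both produce "v1\n"
    print(interpolated)
  }
}
```

The two forms produce identical output; the interpolated form is just easier to read and is the idiomatic Scala style.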
[GitHub] [spark] srowen commented on a change in pull request #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
srowen commented on a change in pull request #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
URL: https://github.com/apache/spark/pull/25583#discussion_r317621165

File path: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceInitialOffsetWriter.scala

```scala
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.io._
+import java.nio.charset.StandardCharsets
+
+import org.apache.commons.io.IOUtils
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.execution.streaming.{HDFSMetadataLog, SerializedOffset}
+
+/** A version of [[HDFSMetadataLog]] specialized for saving the initial offsets. */
+private[kafka010] class KafkaSourceInitialOffsetWriter(
+    sparkSession: SparkSession,
+    metadataPath: String)
+  extends HDFSMetadataLog[KafkaSourceOffset](sparkSession, metadataPath) {
+
+  val VERSION = 1
+
+  override def serialize(metadata: KafkaSourceOffset, out: OutputStream): Unit = {
+    out.write(0) // A zero byte is written to support Spark 2.1.0 (SPARK-19517)
+    val writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))
+    writer.write("v" + VERSION + "\n")
+    writer.write(metadata.json)
+    writer.flush
+  }
+
+  override def deserialize(in: InputStream): KafkaSourceOffset = {
+    in.read() // A zero byte is read to support Spark 2.1.0 (SPARK-19517)
+    val content = IOUtils.toString(new InputStreamReader(in, StandardCharsets.UTF_8))
+    // HDFSMetadataLog guarantees that it never creates a partial file.
+    assert(content.length != 0)
+    if (content(0) == 'v') {
+      val indexOfNewLine = content.indexOf("\n")
+      if (indexOfNewLine > 0) {
+        validateVersion(content.substring(0, indexOfNewLine), VERSION)
+        KafkaSourceOffset(SerializedOffset(content.substring(indexOfNewLine + 1)))
+      } else {
+        throw new IllegalStateException(
+          s"Log file was malformed: failed to detect the log file version line.")
```

Review comment: Nit: no need for interpolation (the message contains no interpolated values, so the `s` prefix is unnecessary).
[GitHub] [spark] srowen commented on a change in pull request #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
srowen commented on a change in pull request #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
URL: https://github.com/apache/spark/pull/25583#discussion_r317621622

File path: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceInitialOffsetWriter.scala

```scala
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.io._
+import java.nio.charset.StandardCharsets
+
+import org.apache.commons.io.IOUtils
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.execution.streaming.{HDFSMetadataLog, SerializedOffset}
+
+/** A version of [[HDFSMetadataLog]] specialized for saving the initial offsets. */
+private[kafka010] class KafkaSourceInitialOffsetWriter(
+    sparkSession: SparkSession,
+    metadataPath: String)
+  extends HDFSMetadataLog[KafkaSourceOffset](sparkSession, metadataPath) {
+
+  val VERSION = 1
+
+  override def serialize(metadata: KafkaSourceOffset, out: OutputStream): Unit = {
+    out.write(0) // A zero byte is written to support Spark 2.1.0 (SPARK-19517)
+    val writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))
+    writer.write("v" + VERSION + "\n")
+    writer.write(metadata.json)
+    writer.flush
+  }
+
+  override def deserialize(in: InputStream): KafkaSourceOffset = {
+    in.read() // A zero byte is read to support Spark 2.1.0 (SPARK-19517)
+    val content = IOUtils.toString(new InputStreamReader(in, StandardCharsets.UTF_8))
+    // HDFSMetadataLog guarantees that it never creates a partial file.
+    assert(content.length != 0)
```

Review comment: Could be `.nonEmpty`, but I'm not even sure we want asserts here.
[GitHub] [spark] cloud-fan commented on a change in pull request #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
cloud-fan commented on a change in pull request #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion URL: https://github.com/apache/spark/pull/25581#discussion_r317619714 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala @@ -371,12 +373,14 @@ object DataType { byName: Boolean, resolver: Resolver, context: String, + storeAssignmentPolicy: StoreAssignmentPolicy.Value, Review comment: maybe pass in a boolean flag `isStrict`?
[GitHub] [spark] cloud-fan commented on a change in pull request #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
cloud-fan commented on a change in pull request #25581: [SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion
URL: https://github.com/apache/spark/pull/25581#discussion_r317619345

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala

```scala
@@ -158,6 +158,35 @@ object Cast {
     case _ => false
   }
+
+  def canANSIStoreAssign(from: DataType, to: DataType): Boolean = (from, to) match {
+    case _ if from == to => true
+    case (_: NumericType, _: NumericType) => true
+    case (_, StringType) => true
```

Review comment: shall we only allow AtomicType to string?
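The review question above suggests narrowing the catch-all string case so only atomic source types can be stored into a string column. A hypothetical sketch of the revised match, using simplified stand-in types rather than the real Catalyst `DataType` hierarchy:

```scala
// Hypothetical sketch of the reviewer's suggestion with simplified stand-in
// types; the real code pattern-matches on Catalyst's DataType hierarchy.
sealed trait DataType
trait AtomicType extends DataType
trait NumericType extends AtomicType
case object IntegerType extends NumericType
case object StringType extends AtomicType
case object ArrayType extends DataType // a complex, non-atomic type

object AnsiStoreAssign {
  def canANSIStoreAssign(from: DataType, to: DataType): Boolean = (from, to) match {
    case _ if from == to => true
    case (_: NumericType, _: NumericType) => true
    // Narrowed per the review comment: only atomic types may be assigned to
    // string, replacing the original catch-all `case (_, StringType) => true`.
    case (_: AtomicType, StringType) => true
    case _ => false
  }

  def main(args: Array[String]): Unit = {
    assert(canANSIStoreAssign(IntegerType, StringType))  // atomic -> string: allowed
    assert(!canANSIStoreAssign(ArrayType, StringType))   // complex -> string: rejected
  }
}
```

With the catch-all, a complex type such as an array would also be assignable to a string column; the narrowed case rejects it, which is the behavioral difference the comment is asking about.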
[GitHub] [spark] AmplabJenkins removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
AmplabJenkins removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524872836 Can one of the admins verify this patch?
[GitHub] [spark] SparkQA commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
SparkQA commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524873878 **[Test build #109736 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109736/testReport)** for PR 25583 at commit [`1daf77b`](https://github.com/apache/spark/commit/1daf77b5822260a7b5d1add98ba6b09d4151ffb3).
[GitHub] [spark] AmplabJenkins removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
AmplabJenkins removed a comment on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524872616 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
AmplabJenkins commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524872836 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
AmplabJenkins commented on issue #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate URL: https://github.com/apache/spark/pull/25583#issuecomment-524872616 Can one of the admins verify this patch?
[GitHub] [spark] HeartSaVioR opened a new pull request #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
HeartSaVioR opened a new pull request #25583: [MINOR][SS] Reuse KafkaSourceInitialOffsetWriter to deduplicate
URL: https://github.com/apache/spark/pull/25583

### What changes were proposed in this pull request?

This patch proposes to reuse KafkaSourceInitialOffsetWriter to remove identical code in KafkaSource. Credit to @jaceklaskowski for finding this. https://lists.apache.org/thread.html/7faa6ac29d871444eaeccefc520e3543a77f4362af4bb0f12a3f7cb2@%3Cdev.spark.apache.org%3E

### Why are the changes needed?

The code is duplicated verbatim, which means the two copies have to be maintained separately and a bug might end up being fixed on only one side.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing UTs, as it's a simple refactor.
[GitHub] [spark] attilapiros commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host
attilapiros commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host
URL: https://github.com/apache/spark/pull/25299#discussion_r317614716

File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala

```scala
@@ -995,6 +1003,19 @@ private[spark] class BlockManager(
     None
   }
+
+  private[spark] def getHostLocalDirs(executorIds: Array[String])
+    : scala.collection.Map[String, Array[String]] = {
+    val cachedItems = executorIdToLocalDirsCache.filterKeys(executorIds.contains(_))
+    if (cachedItems.size < executorIds.length) {
+      val notCachedItems = master
+        .getHostLocalDirs(executorIds.filterNot(executorIdToLocalDirsCache.contains))
+      executorIdToLocalDirsCache ++= notCachedItems
```

Review comment: Yes, it is unbounded, but this case is the very same as what we have at the external shuffle service: only those executors which are or were running on the same host are stored here. I assume this number is quite small, but we can use the same limit here if you think it is still worth doing.
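The "same limit" mentioned above would amount to capping the executorId-to-localDirs cache. A hedged sketch of one way to bound it, using Java's access-ordered `LinkedHashMap` for LRU eviction; the class name, method names, and cap are invented for this illustration and are not part of the PR:

```scala
// Hypothetical size-capped cache of executorId -> local dirs, evicting the
// least-recently-used entry once maxEntries is exceeded.
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

class BoundedLocalDirsCache(maxEntries: Int) {
  private val cache =
    new JLinkedHashMap[String, Array[String]](16, 0.75f, /* accessOrder = */ true) {
      override def removeEldestEntry(e: JMap.Entry[String, Array[String]]): Boolean =
        size() > maxEntries
    }

  def get(executorId: String): Option[Array[String]] = Option(cache.get(executorId))
  def put(executorId: String, dirs: Array[String]): Unit = cache.put(executorId, dirs)
  def size: Int = cache.size()
}

object BoundedLocalDirsCacheDemo {
  def main(args: Array[String]): Unit = {
    val c = new BoundedLocalDirsCache(maxEntries = 2)
    c.put("exec-1", Array("/tmp/a"))
    c.put("exec-2", Array("/tmp/b"))
    c.get("exec-1")                  // touch exec-1 so exec-2 becomes the eldest
    c.put("exec-3", Array("/tmp/c")) // evicts exec-2
    assert(c.get("exec-2").isEmpty && c.get("exec-1").isDefined && c.size == 2)
  }
}
```

Since only same-host executors are ever cached, a small cap would be a cheap safety net rather than a behavioral change in the common case.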
[GitHub] [spark] attilapiros commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host
attilapiros commented on a change in pull request #25299: [SPARK-27651][Core] Avoid the network when shuffle blocks are fetched from the same host
URL: https://github.com/apache/spark/pull/25299#discussion_r317614594

File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala

```scala
@@ -995,6 +1003,19 @@ private[spark] class BlockManager(
     None
   }
+
+  private[spark] def getHostLocalDirs(executorIds: Array[String])
+    : scala.collection.Map[String, Array[String]] = {
```

Review comment: The predef `Map` is the immutable one. With just `Map` as the return type I would get:

```
Error:(1012, 19) type mismatch;
 found   : Map[String,Array[String]] (in scala.collection)
 required: Map[String,Array[String]] (in scala.collection.immutable)
    cachedItems ++ notCachedItems
```

So, to avoid an unnecessary conversion (a copy of the items), I decided to use the `Map` trait. See the `toMap` implementation:

```scala
def toMap[T, U](implicit ev: A <:< (T, U)): immutable.Map[T, U] = {
  val b = immutable.Map.newBuilder[T, U]
  for (x <- self) b += x
  b.result()
}
```
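The type mismatch described above can be reproduced in miniature. A hedged sketch, unrelated to the actual BlockManager code (`combine` and its arguments are made-up names):

```scala
// Minimal reproduction of the issue: `++` on a mutable map yields a
// scala.collection.mutable.Map, which does not conform to Predef's Map
// (scala.collection.immutable.Map). Declaring the wider
// scala.collection.Map return type accepts it without a toMap copy.
import scala.collection.mutable

object MapReturnTypeDemo {
  def combine(
      cached: mutable.Map[String, Array[String]],
      fetched: Map[String, Array[String]]): scala.collection.Map[String, Array[String]] =
    cached ++ fetched // would not compile if the return type were plain Map

  def main(args: Array[String]): Unit = {
    val cached = mutable.Map("exec-1" -> Array("/tmp/a"))
    val result = combine(cached, Map("exec-2" -> Array("/tmp/b")))
    assert(result.keySet == Set("exec-1", "exec-2"))
  }
}
```

Returning the read-only `scala.collection.Map` trait keeps the caller from mutating the cache while skipping the element-by-element copy that `toMap` performs.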
[GitHub] [spark] AmplabJenkins removed a comment on issue #25403: [SPARK-28679][YARN] changes to setResourceInformation to handle empty resources and reflection error handling
AmplabJenkins removed a comment on issue #25403: [SPARK-28679][YARN] changes to setResourceInformation to handle empty resources and reflection error handling URL: https://github.com/apache/spark/pull/25403#issuecomment-524868589 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109735/ Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #25403: [SPARK-28679][YARN] changes to setResourceInformation to handle empty resources and reflection error handling
SparkQA removed a comment on issue #25403: [SPARK-28679][YARN] changes to setResourceInformation to handle empty resources and reflection error handling URL: https://github.com/apache/spark/pull/25403#issuecomment-524861704 **[Test build #109735 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109735/testReport)** for PR 25403 at commit [`298f89b`](https://github.com/apache/spark/commit/298f89b24768b2776a04309e641cb2758686234e).
[GitHub] [spark] AmplabJenkins removed a comment on issue #25403: [SPARK-28679][YARN] changes to setResourceInformation to handle empty resources and reflection error handling
AmplabJenkins removed a comment on issue #25403: [SPARK-28679][YARN] changes to setResourceInformation to handle empty resources and reflection error handling URL: https://github.com/apache/spark/pull/25403#issuecomment-524868574 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25403: [SPARK-28679][YARN] changes to setResourceInformation to handle empty resources and reflection error handling
AmplabJenkins commented on issue #25403: [SPARK-28679][YARN] changes to setResourceInformation to handle empty resources and reflection error handling URL: https://github.com/apache/spark/pull/25403#issuecomment-524868574 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25403: [SPARK-28679][YARN] changes to setResourceInformation to handle empty resources and reflection error handling
AmplabJenkins commented on issue #25403: [SPARK-28679][YARN] changes to setResourceInformation to handle empty resources and reflection error handling URL: https://github.com/apache/spark/pull/25403#issuecomment-524868589 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109735/ Test PASSed.
[GitHub] [spark] SparkQA commented on issue #25403: [SPARK-28679][YARN] changes to setResourceInformation to handle empty resources and reflection error handling
SparkQA commented on issue #25403: [SPARK-28679][YARN] changes to setResourceInformation to handle empty resources and reflection error handling URL: https://github.com/apache/spark/pull/25403#issuecomment-524868447 **[Test build #109735 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109735/testReport)** for PR 25403 at commit [`298f89b`](https://github.com/apache/spark/commit/298f89b24768b2776a04309e641cb2758686234e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] HeartSaVioR commented on a change in pull request #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer
HeartSaVioR commented on a change in pull request #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer
URL: https://github.com/apache/spark/pull/22138#discussion_r317612424

## File path: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaDataConsumerSuite.scala ##

@@ -121,4 +125,158 @@ class KafkaDataConsumerSuite extends SharedSparkSession with PrivateMethodTester
     threadpool.shutdown()
   }
 }
+
+  test("SPARK-25151 Handles multiple tasks in executor fetching same (topic, partition) pair") {

Review comment: Hmm... that's actually dealing with a limitation of the current pool, and I guess another reviewer was pointing this out too, so I feel we've flagged it as a soft kind of bug. I don't mind whether we remove the JIRA number or not.
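The limitation being discussed — the pool can only make a best-effort attempt to respect capacity when several tasks borrow the same (topic, partition) key at once — can be illustrated with a minimal keyed-pool sketch. This is a hypothetical stand-in written for illustration, not the Apache Commons Pool `GenericKeyedObjectPool` API that the PR actually uses:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical minimal keyed pool. When no idle instance exists for a key,
// borrow() creates a new one instead of blocking, so two tasks fetching the
// same (topic, partition) pair concurrently end up with two consumers:
// capacity is best-effort, not a hard cap.
class NaiveKeyedPool<K, V> {
    private final Map<K, ArrayDeque<V>> idle = new HashMap<>();
    private final Function<K, V> factory;
    private int created = 0;

    NaiveKeyedPool(Function<K, V> factory) {
        this.factory = factory;
    }

    synchronized V borrow(K key) {
        ArrayDeque<V> q = idle.get(key);
        if (q != null && !q.isEmpty()) {
            return q.poll();   // reuse an idle instance for this key
        }
        created++;             // nothing idle: create, even if over "capacity"
        return factory.apply(key);
    }

    synchronized void release(K key, V value) {
        idle.computeIfAbsent(key, k -> new ArrayDeque<>()).push(value);
    }

    synchronized int createdCount() {
        return created;
    }
}
```

Borrowing the same key twice before any release yields two distinct instances; only after a release does a later borrow reuse one, which is the behavior the test above exercises.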
[GitHub] [spark] viirya edited a comment on issue #25570: [SPARK-23519][SQL] create view should work from query with duplicate output columns
viirya edited a comment on issue #25570: [SPARK-23519][SQL] create view should work from query with duplicate output columns URL: https://github.com/apache/spark/pull/25570#issuecomment-524844659 Hmm, I am not sure whether ALTER VIEW AS should take the schema of the new query, or keep the original schema and just replace the old query. That's right, what you described is how ALTER VIEW AS works now. Hive's document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterViewAsSelect doesn't cover the details; it just says it works like CREATE OR REPLACE VIEW. But I can't find a Hive doc for CREATE OR REPLACE VIEW. The comment `Nothing we need to retain from the old view...` is specific to CREATE OR REPLACE VIEW; I think it does not necessarily indicate how ALTER VIEW AS works. PostgreSQL has CREATE OR REPLACE VIEW, but interestingly, the new query in CREATE OR REPLACE VIEW must generate the same column names, in the same order, with the same data types. There is no ALTER VIEW AS in PostgreSQL. I thought ALTER VIEW AS only replaced the old query and kept the original schema, but it seems that isn't how ALTER VIEW AS works now.
[GitHub] [spark] HeartSaVioR commented on a change in pull request #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer
HeartSaVioR commented on a change in pull request #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer
URL: https://github.com/apache/spark/pull/22138#discussion_r317610953

## File path: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala ##

@@ -445,197 +529,68 @@ private[kafka010] case class InternalKafkaConsumer(
    * Throw an exception or log a warning as per `failOnDataLoss`.
    */
   private def reportDataLoss(
+      topicPartition: TopicPartition,
+      groupId: String,
       failOnDataLoss: Boolean,
       message: String,
       cause: Throwable = null): Unit = {
-    val finalMessage = s"$message ${additionalMessage(failOnDataLoss)}"
+    val finalMessage = s"$message ${additionalMessage(topicPartition, groupId, failOnDataLoss)}"
     reportDataLoss0(failOnDataLoss, finalMessage, cause)
   }

-  def close(): Unit = consumer.close()
-
-  private def seek(offset: Long): Unit = {
-    logDebug(s"Seeking to $groupId $topicPartition $offset")
-    consumer.seek(topicPartition, offset)
-  }
-
-  /**
-   * Poll messages from Kafka starting from `offset` and update `fetchedData`. `fetchedData` may be
-   * empty if the Kafka consumer fetches some messages but all of them are not visible messages
-   * (either transaction messages, or aborted messages when `isolation.level` is `read_committed`).
-   *
-   * @throws OffsetOutOfRangeException if `offset` is out of range.
-   * @throws TimeoutException if the consumer position is not changed after polling. It means the
-   *                          consumer polls nothing before timeout.
-   */
-  private def fetchData(offset: Long, pollTimeoutMs: Long): Unit = {
-    // Seek to the offset because we may call seekToBeginning or seekToEnd before this.
-    seek(offset)
-    val p = consumer.poll(pollTimeoutMs)
-    val r = p.records(topicPartition)
-    logDebug(s"Polled $groupId ${p.partitions()} ${r.size}")
-    val offsetAfterPoll = consumer.position(topicPartition)
-    logDebug(s"Offset changed from $offset to $offsetAfterPoll after polling")
-    fetchedData.withNewPoll(r.listIterator, offsetAfterPoll)
-    if (!fetchedData.hasNext) {
-      // We cannot fetch anything after `poll`. Two possible cases:
-      // - `offset` is out of range so that Kafka returns nothing. `OffsetOutOfRangeException` will
-      //   be thrown.
-      // - Cannot fetch any data before timeout. `TimeoutException` will be thrown.
-      // - Fetched something but all of them are not invisible. This is a valid case and let the
-      //   caller handles this.
-      val range = getAvailableOffsetRange()
-      if (offset < range.earliest || offset >= range.latest) {
-        throw new OffsetOutOfRangeException(
-          Map(topicPartition -> java.lang.Long.valueOf(offset)).asJava)
-      } else if (offset == offsetAfterPoll) {
-        throw new TimeoutException(
-          s"Cannot fetch record for offset $offset in $pollTimeoutMs milliseconds")
-      }
-    }
+  private def runUninterruptiblyIfPossible[T](body: => T): T = Thread.currentThread match {
+    case ut: UninterruptibleThread =>
+      ut.runUninterruptibly(body)
+    case _ =>
+      logWarning("KafkaDataConsumer is not running in UninterruptibleThread. " +
+        "It may hang when KafkaDataConsumer's methods are interrupted because of KAFKA-1894")
+      body
   }
 }

-private[kafka010] object KafkaDataConsumer extends Logging {
+  val UNKNOWN_OFFSET = -2L

   case class AvailableOffsetRange(earliest: Long, latest: Long)

-  private case class CachedKafkaDataConsumer(internalConsumer: InternalKafkaConsumer)
-    extends KafkaDataConsumer {
-    assert(internalConsumer.inUse) // make sure this has been set to true
-    override def release(): Unit = { KafkaDataConsumer.release(internalConsumer) }
-  }
-
-  private case class NonCachedKafkaDataConsumer(internalConsumer: InternalKafkaConsumer)
-    extends KafkaDataConsumer {
-    override def release(): Unit = { internalConsumer.close() }
-  }
-
-  private case class CacheKey(groupId: String, topicPartition: TopicPartition) {
+  case class CacheKey(groupId: String, topicPartition: TopicPartition) {
     def this(topicPartition: TopicPartition, kafkaParams: ju.Map[String, Object]) =
       this(kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String], topicPartition)
   }

-  // This cache has the following important properties.
-  // - We make a best-effort attempt to maintain the max size of the cache as configured capacity.
-  //   The capacity is not guaranteed to be maintained, especially when there are more active
-  //   tasks simultaneously using consumers than the capacity.
-  private lazy val cache = {
-    val conf = SparkEnv.get.conf
-    val capacity = conf.get(CONSUMER_CACHE_CAPACITY)
-    new ju.LinkedHashMap[CacheKey, InternalKafkaConsumer](capacity, 0.75f, true) {
-      override def removeEldestEntry(
[GitHub] [spark] viirya edited a comment on issue #25570: [SPARK-23519][SQL] create view should work from query with duplicate output columns
viirya edited a comment on issue #25570: [SPARK-23519][SQL] create view should work from query with duplicate output columns URL: https://github.com/apache/spark/pull/25570#issuecomment-524844659 Hmm, I am not sure whether ALTER VIEW AS should take the schema of the new query, or keep the original schema and just replace the old query. That's right, what you described is how ALTER VIEW AS works now. Hive's document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterViewAsSelect doesn't cover the details; it just says it works like CREATE OR REPLACE VIEW. But I can't find a Hive doc for CREATE OR REPLACE VIEW. I assume it works like Spark SQL. PostgreSQL has CREATE OR REPLACE VIEW, but interestingly, the new query in CREATE OR REPLACE VIEW must generate the same column names, in the same order, with the same data types. There is no ALTER VIEW AS in PostgreSQL. I thought ALTER VIEW AS only replaced the old query and kept the original schema, but it seems that isn't how ALTER VIEW AS works now.
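The two candidate semantics being debated can be modeled abstractly. This is an illustrative sketch only, with hypothetical types rather than Spark's catalog API: CREATE OR REPLACE VIEW retains nothing from the old definition, while a keep-schema variant of ALTER VIEW AS would swap only the query text.

```java
// Hypothetical model of a view definition: a schema plus a defining query.
class ViewSemantics {
    static final class ViewDef {
        final String schema;
        final String query;
        ViewDef(String schema, String query) {
            this.schema = schema;
            this.query = query;
        }
    }

    // CREATE OR REPLACE VIEW semantics: nothing is retained from the old
    // view; the new definition's schema wins. Per the discussion, this is
    // also how Spark's ALTER VIEW AS currently behaves.
    static ViewDef createOrReplace(ViewDef oldView, String newSchema, String newQuery) {
        return new ViewDef(newSchema, newQuery);
    }

    // The alternative semantics considered above: keep the original schema
    // and only replace the defining query (roughly what PostgreSQL's
    // column-compatibility rule for CREATE OR REPLACE VIEW enforces).
    static ViewDef alterKeepingSchema(ViewDef oldView, String newQuery) {
        return new ViewDef(oldView.schema, newQuery);
    }
}
```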
[GitHub] [spark] SparkQA removed a comment on issue #25582: [SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used
SparkQA removed a comment on issue #25582: [SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used URL: https://github.com/apache/spark/pull/25582#issuecomment-524855919 **[Test build #109734 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109734/testReport)** for PR 25582 at commit [`d6602ec`](https://github.com/apache/spark/commit/d6602ece506469bc706f74d8b0e4eda4dc4d229b).
[GitHub] [spark] AmplabJenkins removed a comment on issue #25582: [SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used
AmplabJenkins removed a comment on issue #25582: [SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used URL: https://github.com/apache/spark/pull/25582#issuecomment-524865107 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #25582: [SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used
AmplabJenkins removed a comment on issue #25582: [SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used URL: https://github.com/apache/spark/pull/25582#issuecomment-524865113 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109734/
[GitHub] [spark] AmplabJenkins commented on issue #25582: [WIP][SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used
AmplabJenkins commented on issue #25582: [WIP][SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used URL: https://github.com/apache/spark/pull/25582#issuecomment-524865107 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #25582: [WIP][SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used
AmplabJenkins commented on issue #25582: [WIP][SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used URL: https://github.com/apache/spark/pull/25582#issuecomment-524865113 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/109734/
[GitHub] [spark] SparkQA commented on issue #25582: [WIP][SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used
SparkQA commented on issue #25582: [WIP][SPARK-28875][DSTREAMS][SS][TESTS] Add Task retry tests to make sure new consumer used URL: https://github.com/apache/spark/pull/25582#issuecomment-524865022 **[Test build #109734 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109734/testReport)** for PR 25582 at commit [`d6602ec`](https://github.com/apache/spark/commit/d6602ece506469bc706f74d8b0e4eda4dc4d229b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] cloud-fan commented on a change in pull request #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
cloud-fan commented on a change in pull request #25507: [SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog
URL: https://github.com/apache/spark/pull/25507#discussion_r317608095

## File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/V1WriteFallbackSuite.scala ##

@@ -68,13 +69,25 @@ class V1WriteFallbackSuite extends QueryTest with SharedSparkSession with Before
 }

 class V1WriteFallbackSessionCatalogSuite
-  extends SessionCatalogTest[InMemoryTableWithV1Fallback, V1FallbackTableCatalog] {
+  extends InsertIntoTests(supportsDynamicOverwrite = false, includeSQLTests = true)
+  with SessionCatalogTest[InMemoryTableWithV1Fallback, V1FallbackTableCatalog] {
+
   override protected val v2Format = classOf[InMemoryV1Provider].getName
   override protected val catalogClassName: String = classOf[V1FallbackTableCatalog].getName
+  override protected val catalogAndNamespace: String = ""

   override protected def verifyTable(tableName: String, expected: DataFrame): Unit = {
     checkAnswer(InMemoryV1Provider.getTableData(spark, s"default.$tableName"), expected)
   }
+
+  protected def doInsert(tableName: String, insert: DataFrame, mode: SaveMode): Unit = {
+    val tmpView = "tmp_view"
+    withTempView(tmpView) {
+      insert.createOrReplaceTempView(tmpView)
+      val overwrite = if (mode == SaveMode.Overwrite) "OVERWRITE" else "INTO"
+      sql(s"INSERT $overwrite TABLE $tableName SELECT * FROM $tmpView")

Review comment: OK, it seems we do want to run only the SQL test cases here. Shouldn't we extend `InsertIntoSQLTests` instead of `InsertIntoTests`?
[GitHub] [spark] HeartSaVioR commented on a change in pull request #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer
HeartSaVioR commented on a change in pull request #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer
URL: https://github.com/apache/spark/pull/22138#discussion_r317608101

## File path: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala ##

@@ -18,228 +18,253 @@ package org.apache.spark.sql.kafka010
 import java.{util => ju}
+import java.io.Closeable
 import java.util.concurrent.TimeoutException

 import scala.collection.JavaConverters._

 import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord, KafkaConsumer, OffsetOutOfRangeException}
 import org.apache.kafka.common.TopicPartition

-import org.apache.spark.{SparkEnv, SparkException, TaskContext}
+import org.apache.spark.TaskContext
 import org.apache.spark.internal.Logging
 import org.apache.spark.kafka010.KafkaConfigUpdater
-import org.apache.spark.sql.kafka010.KafkaDataConsumer.AvailableOffsetRange
+import org.apache.spark.sql.kafka010.KafkaDataConsumer.{AvailableOffsetRange, UNKNOWN_OFFSET}
 import org.apache.spark.sql.kafka010.KafkaSourceProvider._
-import org.apache.spark.util.UninterruptibleThread
+import org.apache.spark.util.{ShutdownHookManager, UninterruptibleThread}
+
+/**
+ * This class simplifies the usages of Kafka consumer in Spark SQL Kafka connector.
+ *
+ * NOTE: Like KafkaConsumer, this class is not thread-safe.
+ * NOTE for contributors: It is possible for the instance to be used from multiple callers,
+ * so all the methods should not rely on current cursor and use seek manually.
+ */
+private[kafka010] class InternalKafkaConsumer(
+    val topicPartition: TopicPartition,
+    val kafkaParams: ju.Map[String, Object]) extends Closeable with Logging {
+
+  val groupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String]
+
+  private val consumer = createConsumer

-private[kafka010] sealed trait KafkaDataConsumer {
   /**
-   * Get the record for the given offset if available.
-   *
-   * If the record is invisible (either a
-   * transaction message, or an aborted message when the consumer's `isolation.level` is
-   * `read_committed`), it will be skipped and this method will try to fetch next available record
-   * within [offset, untilOffset).
+   * Poll messages from Kafka starting from `offset` and returns a pair of "list of consumer record"
+   * and "offset after poll". The list of consumer record may be empty if the Kafka consumer fetches
+   * some messages but all of them are not visible messages (either transaction messages,
+   * or aborted messages when `isolation.level` is `read_committed`).
    *
-   * This method also will try its best to detect data loss. If `failOnDataLoss` is `true`, it will
-   * throw an exception when we detect an unavailable offset. If `failOnDataLoss` is `false`, this
-   * method will try to fetch next available record within [offset, untilOffset).
-   *
-   * When this method tries to skip offsets due to either invisible messages or data loss and
-   * reaches `untilOffset`, it will return `null`.
-   *
-   * @param offset         the offset to fetch.
-   * @param untilOffset    the max offset to fetch. Exclusive.
-   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
-   * @param failOnDataLoss When `failOnDataLoss` is `true`, this method will either return record at
-   *                       offset if available, or throw exception.when `failOnDataLoss` is `false`,
-   *                       this method will either return record at offset if available, or return
-   *                       the next earliest available record less than untilOffset, or null. It
-   *                       will not throw any exception.
+   * @throws OffsetOutOfRangeException if `offset` is out of range.
+   * @throws TimeoutException if the consumer position is not changed after polling. It means the
+   *                          consumer polls nothing before timeout.
    */
-  def get(
-      offset: Long,
-      untilOffset: Long,
-      pollTimeoutMs: Long,
-      failOnDataLoss: Boolean): ConsumerRecord[Array[Byte], Array[Byte]] = {
-    internalConsumer.get(offset, untilOffset, pollTimeoutMs, failOnDataLoss)
+  def fetch(offset: Long, pollTimeoutMs: Long)
+    : (ju.List[ConsumerRecord[Array[Byte], Array[Byte]]], Long) = {
+    // Seek to the offset because we may call seekToBeginning or seekToEnd before this.
+    seek(offset)
+    val p = consumer.poll(pollTimeoutMs)
+    val r = p.records(topicPartition)
+    logDebug(s"Polled $groupId ${p.partitions()} ${r.size}")
+    val offsetAfterPoll = consumer.position(topicPartition)
+    logDebug(s"Offset changed from $offset to $offsetAfterPoll after polling")
+    val fetchedData = (r, offsetAfterPoll)
+    if (r.isEmpty) {
+      // We cannot fetch anything after `poll`. Two possible cases:
+      // - `offset` is out of range so that Kafka returns nothing. `OffsetOutOfRangeException` wi
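The empty-poll handling in the hunk above distinguishes three outcomes. The branching can be distilled into a small sketch; this is a hypothetical helper that mirrors the conditions in `fetch`, with the exceptions reduced to string labels for illustration:

```java
// Hypothetical distillation of the empty-poll branching in fetch() above.
// earliest/latest describe the currently available offset range for the
// partition; offsetAfterPoll is the consumer position after poll() returned
// no visible records.
class EmptyPollClassifier {
    static String classify(long offset, long offsetAfterPoll,
                           long earliest, long latest) {
        if (offset < earliest || offset >= latest) {
            // Kafka has nothing at this offset: out-of-range (possible data loss).
            return "OffsetOutOfRangeException";
        } else if (offset == offsetAfterPoll) {
            // Position did not move: the poll timed out without fetching anything.
            return "TimeoutException";
        } else {
            // Records were fetched but all were invisible (transaction markers or
            // aborted messages under read_committed); the caller handles this case.
            return "empty-but-valid";
        }
    }
}
```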
[GitHub] [spark] HyukjinKwon closed pull request #25567: [SPARK-28527][SQL][TEST] Re-run all the tests in SQLQueryTestSuite via Thrift Server
HyukjinKwon closed pull request #25567: [SPARK-28527][SQL][TEST] Re-run all the tests in SQLQueryTestSuite via Thrift Server URL: https://github.com/apache/spark/pull/25567