[jira] [Comment Edited] (SPARK-32063) Spark native temporary table
[ https://issues.apache.org/jira/browse/SPARK-32063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143575#comment-17143575 ] Lantao Jin edited comment on SPARK-32063 at 6/24/20, 6:53 AM: -- [~viirya] For 1, even though RDD cache or table cache can improve performance, I still think they have totally different scopes. Besides, we can also cache a temporary table in memory for a further performance improvement. In production usage, I found that our data engineers and data scientists do not always remember to uncache cached tables or views. This situation becomes worse in the Spark thrift-server (which shares a Spark driver). For 2, we found that when Adaptive Query Execution is enabled, complex views easily get stuck in the optimization step, and caching the view does not help. For 3, the scenario comes from our migration of SQL from Teradata to Spark. Without temporary tables, TD users have to create permanent tables and drop them at the end of a script as an alternative to TD volatile tables; if the JDBC session closes or the script fails before cleanup, no mechanism guarantees that the intermediate data is dropped. If they use Spark temporary views instead, much of their logic does not work well. For example, they want to execute UPDATE/DELETE operations on intermediate tables, but we cannot convert a temporary view to a Delta table or Hudi table ... was (Author: cltlfcjin): For 1, even though RDD cache or table cache can improve performance, I still think they have totally different scopes. Besides, we can also cache a temporary table in memory for a further performance improvement. In production usage, I found that our data engineers and data scientists do not always remember to uncache cached tables or views. This situation becomes worse in the Spark thrift-server (which shares a Spark driver). For 2, we found that when Adaptive Query Execution is enabled, complex views easily get stuck in the optimization step, and caching the view does not help. For 3, the scenario comes from our migration of SQL from Teradata to Spark. Without temporary tables, TD users have to create permanent tables and drop them at the end of a script as an alternative to TD volatile tables; if the JDBC session closes or the script fails before cleanup, no mechanism guarantees that the intermediate data is dropped. If they use Spark temporary views instead, much of their logic does not work well. For example, they want to execute UPDATE/DELETE operations on intermediate tables, but we cannot convert a temporary view to a Delta table or Hudi table ... > Spark native temporary table > > > Key: SPARK-32063 > URL: https://issues.apache.org/jira/browse/SPARK-32063 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lantao Jin >Priority: Major > > Many databases and data warehouse SQL engines support temporary tables. A > temporary table, as its name implies, is a short-lived table whose lifetime is > limited to the current session. > In Spark, there is no temporary table. The DDL “CREATE TEMPORARY TABLE AS > SELECT” will create a temporary view. A temporary view is totally different > from a temporary table. > A temporary view is just a VIEW. It doesn’t materialize data in storage, so > it has the following shortcomings: > # A view will not give improved performance. Materializing intermediate data in > temporary tables for a complex query will accelerate queries, especially in an > ETL pipeline. > # A view which calls other views can cause severe performance issues. Executing > a very complex view may even fail in Spark. > # A temporary view has no database namespace. In some complex ETL pipelines or > data warehouse applications, working without a database prefix is not convenient. Such > applications need some tables which are only used in the current session. > > More details are described in [Design > Docs|https://docs.google.com/document/d/1RS4Q3VbxlZ_Yy0fdWgTJ-k0QxFd1dToCqpLAYvIJ34U/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
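As a concrete illustration of the gap described above, here is a minimal sketch of what is available today: a temporary view with an explicit cache, and the permanent "staging" table workaround the comment mentions, which nothing cleans up if the session dies. Table and database names are placeholders, and the proposed CREATE TEMPORARY TABLE syntax from the design doc is intentionally not shown here.

{code:scala}
// Today: a temporary view is just a named query; caching and uncaching are manual.
spark.sql("CREATE OR REPLACE TEMPORARY VIEW tmp_orders AS SELECT * FROM orders WHERE dt = '2020-06-24'")
spark.sql("CACHE TABLE tmp_orders")    // must be remembered...
spark.sql("UNCACHE TABLE tmp_orders")  // ...and explicitly released

// The Teradata-migration workaround: materialize into a permanent table and drop it
// at the end of the script. If the JDBC session closes first, the table is left behind.
spark.sql("CREATE TABLE etl_db.orders_stage USING parquet AS SELECT * FROM orders WHERE dt = '2020-06-24'")
// ... intermediate work against etl_db.orders_stage ...
spark.sql("DROP TABLE IF EXISTS etl_db.orders_stage")
{code}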
[jira] [Commented] (SPARK-32063) Spark native temporary table
[ https://issues.apache.org/jira/browse/SPARK-32063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143575#comment-17143575 ] Lantao Jin commented on SPARK-32063: For 1, even though RDD cache or table cache can improve performance, I still think they have totally different scopes. Besides, we can also cache a temporary table in memory for a further performance improvement. In production usage, I found that our data engineers and data scientists do not always remember to uncache cached tables or views. This situation becomes worse in the Spark thrift-server (which shares a Spark driver). For 2, we found that when Adaptive Query Execution is enabled, complex views easily get stuck in the optimization step, and caching the view does not help. For 3, the scenario comes from our migration of SQL from Teradata to Spark. Without temporary tables, TD users have to create permanent tables and drop them at the end of a script as an alternative to TD volatile tables; if the JDBC session closes or the script fails before cleanup, no mechanism guarantees that the intermediate data is dropped. If they use Spark temporary views instead, much of their logic does not work well. For example, they want to execute UPDATE/DELETE operations on intermediate tables, but we cannot convert a temporary view to a Delta table or Hudi table ... > Spark native temporary table > > > Key: SPARK-32063 > URL: https://issues.apache.org/jira/browse/SPARK-32063 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lantao Jin >Priority: Major > > Many databases and data warehouse SQL engines support temporary tables. A > temporary table, as its name implies, is a short-lived table whose lifetime is > limited to the current session. > In Spark, there is no temporary table. The DDL “CREATE TEMPORARY TABLE AS > SELECT” will create a temporary view. A temporary view is totally different > from a temporary table. > A temporary view is just a VIEW. It doesn’t materialize data in storage, so > it has the following shortcomings: > # A view will not give improved performance. Materializing intermediate data in > temporary tables for a complex query will accelerate queries, especially in an > ETL pipeline. > # A view which calls other views can cause severe performance issues. Executing > a very complex view may even fail in Spark. > # A temporary view has no database namespace. In some complex ETL pipelines or > data warehouse applications, working without a database prefix is not convenient. Such > applications need some tables which are only used in the current session. > > More details are described in [Design > Docs|https://docs.google.com/document/d/1RS4Q3VbxlZ_Yy0fdWgTJ-k0QxFd1dToCqpLAYvIJ34U/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30466) remove dependency on jackson-mapper-asl-1.9.13 and jackson-core-asl-1.9.13
[ https://issues.apache.org/jira/browse/SPARK-30466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143569#comment-17143569 ] Prashant Sharma commented on SPARK-30466: - I just saw, Hadoop 3.2.1 still uses these jars(jackson-mapper-asl-1.9.13 and jackson-core-asl-1.9.13), they are a transitive dependency on jersey-json. See below. {code:java} [INFO] org.apache.hadoop:hadoop-common:jar:3.2.1 [INFO] +- org.apache.hadoop:hadoop-annotations:jar:3.2.1:compile [INFO] | \- jdk.tools:jdk.tools:jar:1.8:system [INFO] +- com.google.guava:guava:jar:27.0-jre:compile [INFO] | +- com.google.guava:failureaccess:jar:1.0:compile [INFO] | +- com.google.guava:listenablefuture:jar:.0-empty-to-avoid-conflict-with-guava:compile [INFO] | +- org.checkerframework:checker-qual:jar:2.5.2:compile [INFO] | +- com.google.errorprone:error_prone_annotations:jar:2.2.0:compile [INFO] | +- com.google.j2objc:j2objc-annotations:jar:1.1:compile [INFO] | \- org.codehaus.mojo:animal-sniffer-annotations:jar:1.17:compile [INFO] +- commons-cli:commons-cli:jar:1.2:compile [INFO] +- org.apache.commons:commons-math3:jar:3.1.1:compile [INFO] +- org.apache.httpcomponents:httpclient:jar:4.5.6:compile [INFO] | \- org.apache.httpcomponents:httpcore:jar:4.4.10:compile [INFO] +- commons-codec:commons-codec:jar:1.11:compile [INFO] +- commons-io:commons-io:jar:2.5:compile [INFO] +- commons-net:commons-net:jar:3.6:compile [INFO] +- commons-collections:commons-collections:jar:3.2.2:compile [INFO] +- javax.servlet:javax.servlet-api:jar:3.1.0:compile [INFO] +- org.eclipse.jetty:jetty-server:jar:9.3.24.v20180605:compile [INFO] | +- org.eclipse.jetty:jetty-http:jar:9.3.24.v20180605:compile [INFO] | \- org.eclipse.jetty:jetty-io:jar:9.3.24.v20180605:compile [INFO] +- org.eclipse.jetty:jetty-util:jar:9.3.24.v20180605:compile [INFO] +- org.eclipse.jetty:jetty-servlet:jar:9.3.24.v20180605:compile [INFO] | \- org.eclipse.jetty:jetty-security:jar:9.3.24.v20180605:compile [INFO] +- org.eclipse.jetty:jetty-webapp:jar:9.3.24.v20180605:compile [INFO] | \- org.eclipse.jetty:jetty-xml:jar:9.3.24.v20180605:compile [INFO] +- org.eclipse.jetty:jetty-util-ajax:jar:9.3.24.v20180605:test [INFO] +- javax.servlet.jsp:jsp-api:jar:2.1:runtime [INFO] +- com.sun.jersey:jersey-core:jar:1.19:compile [INFO] | \- javax.ws.rs:jsr311-api:jar:1.1.1:compile [INFO] +- com.sun.jersey:jersey-servlet:jar:1.19:compile [INFO] +- com.sun.jersey:jersey-json:jar:1.19:compile [INFO] | +- org.codehaus.jettison:jettison:jar:1.1:compile [INFO] | +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile [INFO] | | \- javax.xml.bind:jaxb-api:jar:2.2.11:compile [INFO] | +- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile [INFO] | +- org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile [INFO] | +- org.codehaus.jackson:jackson-jaxrs:jar:1.9.13:compile [INFO] | \- org.codehaus.jackson:jackson-xc:jar:1.9.13:compile [INFO] +- com.sun.jersey:jersey-server:jar:1.19:compile {code} > remove dependency on jackson-mapper-asl-1.9.13 and jackson-core-asl-1.9.13 > -- > > Key: SPARK-30466 > URL: https://issues.apache.org/jira/browse/SPARK-30466 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4, 3.0.0 >Reporter: Michael Burgener >Priority: Major > Labels: security > > These 2 libraries are deprecated and replaced by the jackson-databind > libraries which are already included. These two libraries are flagged by our > vulnerability scanners as having the following security vulnerabilities. 
> I've set the priority to Major due to the Critical nature and hopefully they > can be addressed quickly. Please note, I'm not a developer but work in > InfoSec and this was flagged when we incorporated spark into our product. If > you feel the priority is not set correctly please change accordingly. I'll > watch the issue and flag our dev team to update once resolved. > jackson-mapper-asl-1.9.13 > CVE-2018-7489 (CVSS 3.0 Score 9.8 CRITICAL) > [https://nvd.nist.gov/vuln/detail/CVE-2018-7489] > > CVE-2017-7525 (CVSS 3.0 Score 9.8 CRITICAL) > [https://nvd.nist.gov/vuln/detail/CVE-2017-7525] > > CVE-2017-17485 (CVSS 3.0 Score 9.8 CRITICAL) > [https://nvd.nist.gov/vuln/detail/CVE-2017-17485] > > CVE-2017-15095 (CVSS 3.0 Score 9.8 CRITICAL) > [https://nvd.nist.gov/vuln/detail/CVE-2017-15095] > > CVE-2018-5968 (CVSS 3.0 Score 8.1 High) > [https://nvd.nist.gov/vuln/detail/CVE-2018-5968] > > jackson-core-asl-1.9.13 > CVE-2016-7051 (CVSS 3.0 Score 8.6 High) > https://nvd.nist.gov/vuln/detail/CVE-2016-7051 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For addi
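While the transitive dependency is sorted out upstream, a build-side exclusion is the usual stopgap for keeping the ASL Jackson jars off an application classpath. A minimal sbt sketch, using the hadoop-common 3.2.1 coordinates from the tree above (adapt to whichever Hadoop artifact your build actually pulls in; it assumes nothing left on your runtime path still needs the ASL Jackson classes):

{code:scala}
// build.sbt sketch: drop the legacy org.codehaus.jackson jars that arrive
// transitively via hadoop-common -> jersey-json.
libraryDependencies += ("org.apache.hadoop" % "hadoop-common" % "3.2.1")
  .exclude("org.codehaus.jackson", "jackson-mapper-asl")
  .exclude("org.codehaus.jackson", "jackson-core-asl")
{code}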
[jira] [Resolved] (SPARK-32074) Update AppVeyor R to 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32074. -- Fix Version/s: 3.1.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/28909 > Update AppVeyor R to 4.0.2 > -- > > Key: SPARK-32074 > URL: https://issues.apache.org/jira/browse/SPARK-32074 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > We should test R 4.0.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143566#comment-17143566 ] Hyukjin Kwon commented on SPARK-25244: -- This issue was closed because it marked the affected version as 2.3 which is EOL. Feel free to create new JIRA with a reproducer and analysis if the issue persists. > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-25244 > URL: https://issues.apache.org/jira/browse/SPARK-25244 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Anton Daitche >Priority: Major > Labels: bulk-closed > > The setting `spark.sql.session.timeZone` is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Pythons `datetime` > objects, its ignored and the systems timezone is used. > This can be checked by the following code snippet > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method `toPandas` respected the timezone setting (UTC), but the > method `collect` ignored it and converted the timestamp to my systems > timezone. > The cause for this behaviour is that the methods `toInternal` and > `fromInternal` of PySparks `TimestampType` class don't take into account the > setting `spark.sql.session.timeZone` and use the system timezone. > If the maintainers agree that this should be fixed, I would try to come up > with a patch. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31887) Date casting to string is giving wrong value
[ https://issues.apache.org/jira/browse/SPARK-31887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143564#comment-17143564 ] Hyukjin Kwon commented on SPARK-31887: -- The change that fixed this issue is most likely the calendar switching done in SPARK-26651, which is a very big and invasive change. It is unlikely to be backported. > Date casting to string is giving wrong value > > > Key: SPARK-31887 > URL: https://issues.apache.org/jira/browse/SPARK-31887 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5 > Environment: Spark is running in cluster mode on Mesos. > > Mesos agents are dockerised, running on Ubuntu 18. > > Timezone setting of docker instance: UTC > Timezone of server hosting docker: America/New_York > Timezone of driver machine: America/New_York >Reporter: Amit Gupta >Priority: Major > > The code converts the string to a date and then writes it to CSV. > {code:java} > val x = Seq(("2020-02-19", "2020-02-19 05:11:00")).toDF("a", > "b").select('a.cast("date"), 'b.cast("timestamp")) > x.show() > +--+---+ > | a| b| > +--+---+ > |2020-02-19|2020-02-19 05:11:00| > +--+---+ > x.write.mode("overwrite").option("header", true).csv("/tmp/test1.csv") > {code} > > The date written in the CSV file is different: > {code:java} > > snakebite cat "/tmp/test1.csv/*.csv" > a,b > 2020-02-18,2020-02-19T05:11:00.000Z{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
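Since the calendar rework will not be backported, a commonly suggested mitigation on 2.4.x is to keep the JVM default timezone and spark.sql.session.timeZone aligned (for example both UTC), so the DATE value is parsed and formatted in the same zone. The sketch below only restates the reporter's reproducer under that assumption; it works around the symptom and is not the fix:

{code:scala}
// Mitigation sketch for 2.4.x, not the actual fix (SPARK-26651).
// Assumes driver and executors are also started with -Duser.timezone=UTC
// (e.g. via spark.driver.extraJavaOptions / spark.executor.extraJavaOptions).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.session.timeZone", "UTC")
  .getOrCreate()
import spark.implicits._

val x = Seq(("2020-02-19", "2020-02-19 05:11:00")).toDF("a", "b")
  .select($"a".cast("date"), $"b".cast("timestamp"))

x.write.mode("overwrite").option("header", true).csv("/tmp/test1.csv")
{code}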
[jira] [Commented] (SPARK-27281) Wrong latest offsets returned by DirectKafkaInputDStream#latestOffsets
[ https://issues.apache.org/jira/browse/SPARK-27281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143562#comment-17143562 ] Yuanyuan Xia commented on SPARK-27281: -- In our environment, we encounter the same issue and the cause seems also related to [KAFKA-7703|https://issues.apache.org/jira/browse/KAFKA-7703] > Wrong latest offsets returned by DirectKafkaInputDStream#latestOffsets > -- > > Key: SPARK-27281 > URL: https://issues.apache.org/jira/browse/SPARK-27281 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.0 >Reporter: Viacheslav Krot >Priority: Major > > I have a very strange and hard to reproduce issue when using kafka direct > streaming, version 2.4.0 > From time to time, maybe once a day - once a week I get following error > {noformat} > java.lang.IllegalArgumentException: requirement failed: numRecords must not > be negative > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334) > at scala.Option.orElse(Option.scala:289) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247) > at > org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89) > at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > 19/01/29 13:10:00 ERROR apps.BusinessRuleEngine: Job failed. Stopping JVM > java.lang.IllegalArgumentException: requirement failed: numRecords must not > be negative > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.sc
[jira] [Issue Comment Deleted] (SPARK-27281) Wrong latest offsets returned by DirectKafkaInputDStream#latestOffsets
[ https://issues.apache.org/jira/browse/SPARK-27281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanyuan Xia updated SPARK-27281: - Comment: was deleted (was: In our environment, we encounter the same issue and the cause seems also related to [KAFKA-7703|https://issues.apache.org/jira/browse/KAFKA-7703]) > Wrong latest offsets returned by DirectKafkaInputDStream#latestOffsets > -- > > Key: SPARK-27281 > URL: https://issues.apache.org/jira/browse/SPARK-27281 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.0 >Reporter: Viacheslav Krot >Priority: Major > > I have a very strange and hard to reproduce issue when using kafka direct > streaming, version 2.4.0 > From time to time, maybe once a day - once a week I get following error > {noformat} > java.lang.IllegalArgumentException: requirement failed: numRecords must not > be negative > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334) > at scala.Option.orElse(Option.scala:289) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247) > at > org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89) > at > 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > 19/01/29 13:10:00 ERROR apps.BusinessRuleEngine: Job failed. Stopping JVM > java.lang.IllegalArgumentException: requirement failed: numRecords must not > be negative > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416) > at > org.apac
[jira] [Commented] (SPARK-27281) Wrong latest offsets returned by DirectKafkaInputDStream#latestOffsets
[ https://issues.apache.org/jira/browse/SPARK-27281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143561#comment-17143561 ] Yuanyuan Xia commented on SPARK-27281: -- In our environment, we encounter the same issue and the cause seems also related to [KAFKA-7703|https://issues.apache.org/jira/browse/KAFKA-7703] > Wrong latest offsets returned by DirectKafkaInputDStream#latestOffsets > -- > > Key: SPARK-27281 > URL: https://issues.apache.org/jira/browse/SPARK-27281 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.0 >Reporter: Viacheslav Krot >Priority: Major > > I have a very strange and hard to reproduce issue when using kafka direct > streaming, version 2.4.0 > From time to time, maybe once a day - once a week I get following error > {noformat} > java.lang.IllegalArgumentException: requirement failed: numRecords must not > be negative > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334) > at scala.Option.orElse(Option.scala:289) > at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247) > at > org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89) > at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) > 19/01/29 13:10:00 ERROR apps.BusinessRuleEngine: Job failed. Stopping JVM > java.lang.IllegalArgumentException: requirement failed: numRecords must not > be negative > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.sc
[jira] [Resolved] (SPARK-32050) GBTClassifier not working with OnevsRest
[ https://issues.apache.org/jira/browse/SPARK-32050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32050. -- Resolution: Duplicate > GBTClassifier not working with OnevsRest > > > Key: SPARK-32050 > URL: https://issues.apache.org/jira/browse/SPARK-32050 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 > Environment: spark 2.4.0 >Reporter: Raghuvarran V H >Priority: Minor > > I am trying to use GBT classifier for multi class classification using > OnevsRest > > {code:java} > from pyspark.ml.classification import > MultilayerPerceptronClassifier,OneVsRest,GBTClassifier > from pyspark.ml import Pipeline,PipelineModel > lr = GBTClassifier(featuresCol='features', labelCol='label', > predictionCol='prediction', maxDepth=5, > > maxBins=32,minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, > cacheNodeIds=False,checkpointInterval=10, lossType='logistic', > maxIter=20,stepSize=0.1, seed=None,subsamplingRate=1.0, > featureSubsetStrategy='auto') > classifier = OneVsRest(featuresCol='features', labelCol='label', > predictionCol='prediction', classifier=lr, weightCol=None,parallelism=1) > pipeline = Pipeline(stages=[str_indxr,ohe,vecAssembler,normalizer,classifier]) > model = pipeline.fit(train_data) > {code} > > > When I try this I get this error: > /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/ml/classification.py > in _fit(self, dataset) > 1800 classifier = self.getClassifier() > 1801 assert isinstance(classifier, HasRawPredictionCol),\ > -> 1802 "Classifier %s doesn't extend from HasRawPredictionCol." % > type(classifier) > 1803 > 1804 numClasses = int(dataset.agg(\{labelCol: > "max"}).head()["max("+labelCol+")"]) + 1 > AssertionError: Classifier > doesn't extend from HasRawPredictionCol. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
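For reference, the assertion quoted in the description only checks that the wrapped estimator exposes a raw prediction column. The Scala sketch below shows the shape OneVsRest expects, using LogisticRegression purely because it certainly carries rawPredictionCol; it is not a statement about GBTClassifier support in the 2.4 Python wrapper, where the reported error originates, and train_data is a placeholder DataFrame with "features" and "label" columns.

{code:scala}
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

// Base binary classifier that exposes rawPredictionCol, as OneVsRest requires.
val base = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")

val ovr = new OneVsRest()
  .setClassifier(base)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setPredictionCol("prediction")

// val model = ovr.fit(train_data)   // train_data: placeholder DataFrame
{code}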
[jira] [Updated] (SPARK-32051) Dataset.foreachPartition returns object
[ https://issues.apache.org/jira/browse/SPARK-32051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32051: - Priority: Major (was: Critical) > Dataset.foreachPartition returns object > --- > > Key: SPARK-32051 > URL: https://issues.apache.org/jira/browse/SPARK-32051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Oosterhuis >Priority: Major > > I'm trying to map values from the Dataset[Row], but since 3.0.0 this fails. > In 3.0.0 I'm dealing with an error: "Error:(28, 38) value map is not a member > of Object" > > This is the simplest code that works in 2.4.x, but fails in 3.0.0: > {code:scala} > spark.range(100) > .repartition(10) > .foreachPartition(part => println(part.toList)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
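The compile error in the description appears to come from overload resolution: Dataset has both a Scala `foreachPartition(f: Iterator[T] => Unit)` and a Java `foreachPartition(func: ForeachPartitionFunction[T])` variant, and with an untyped lambda the parameter is inferred as Object. A minimal sketch of the usual workaround, annotating the iterator's element type (spark.range produces java.lang.Long):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Annotating the parameter type selects the Scala overload, so `part` is a
// typed Iterator instead of Object.
spark.range(100)
  .repartition(10)
  .foreachPartition((part: Iterator[java.lang.Long]) => println(part.toList))
{code}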
[jira] [Resolved] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32053. -- Resolution: Incomplete > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Major > Attachments: image-2020-06-22-18-19-32-236.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143557#comment-17143557 ] Hyukjin Kwon commented on SPARK-32053: -- [~kaganesa] Spark 2.3.0 is EOL so we won't be able to land any fix. Can you see if this issue still persists in higher versions? Also, it would be great if you share the full reproducer and full error messages. > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Major > Attachments: image-2020-06-22-18-19-32-236.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32068) Spark 3 UI task launch time show in error time zone
[ https://issues.apache.org/jira/browse/SPARK-32068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143556#comment-17143556 ] Hyukjin Kwon commented on SPARK-32068: -- [~d87904488] can you attach the snapshots? > Spark 3 UI task launch time show in error time zone > --- > > Key: SPARK-32068 > URL: https://issues.apache.org/jira/browse/SPARK-32068 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Smith Cruise >Priority: Major > Labels: easyfix > > For example, > In this link: history/app-20200623133209-0015/stages/ , stage submit time is > correct (UTS) > > But in this link: > history/app-20200623133209-0015/stages/stage/?id=0&attempt=0 , task launch > time is incorrect(UTC) > > The same problem exists in port 4040 Web UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32068) Spark 3 UI task launch time show in error time zone
[ https://issues.apache.org/jira/browse/SPARK-32068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143556#comment-17143556 ] Hyukjin Kwon edited comment on SPARK-32068 at 6/24/20, 6:24 AM: [~d87904488] can you attach the screenshots? was (Author: hyukjin.kwon): [~d87904488] can you attach the snapshots? > Spark 3 UI task launch time show in error time zone > --- > > Key: SPARK-32068 > URL: https://issues.apache.org/jira/browse/SPARK-32068 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Smith Cruise >Priority: Major > Labels: easyfix > > For example, > In this link: history/app-20200623133209-0015/stages/ , stage submit time is > correct (UTS) > > But in this link: > history/app-20200623133209-0015/stages/stage/?id=0&attempt=0 , task launch > time is incorrect(UTC) > > The same problem exists in port 4040 Web UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32081) facing Invalid UTF-32 character v2.4.5 running pyspark
[ https://issues.apache.org/jira/browse/SPARK-32081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143554#comment-17143554 ] Hyukjin Kwon commented on SPARK-32081: -- Please just don't copy and paste the errors. The error message say the encoding of your file is wrong: {code} java.io.CharConversionException: Invalid UTF-32 character 0x100(above 10) at char #206, byte {code} > facing Invalid UTF-32 character v2.4.5 running pyspark > -- > > Key: SPARK-32081 > URL: https://issues.apache.org/jira/browse/SPARK-32081 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 2.4.5 >Reporter: Yaniv Kempler >Priority: Major > > facing Invalid UTF-32 character while reading json files > > Py4JJavaError Traceback (most recent call last) in > ~/.local/lib/python3.6/site-packages/pyspark/sql/readwriter.py in json(self, > path, schema, primitivesAsString, prefersDecimal, allowComments, > allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, > allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord, > dateFormat, timestampFormat, multiLine, allowUnquotedControlChars, lineSep, > samplingRatio, dropFieldIfAllNull, encoding) 284 keyed._bypass_serializer = > True 285 jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString()) --> 286 > return self._df(self._jreader.json(jrdd)) 287 else: 288 raise > TypeError("path can be only string, list or RDD") > ~/.local/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, > *args) 1255 answer = self.gateway_client.send_command(command) 1256 > return_value = get_return_value( -> 1257 answer, self.gateway_client, > self.target_id, self.name) 1258 1259 for temp_arg in temp_args: > ~/.local/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except > py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() > ~/.local/lib/python3.6/site-packages/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) 326 raise > Py4JJavaError( 327 "An error occurred while calling \{0}{1}\{2}.\n". --> 328 > format(target_id, ".", name), value) 329 else: 330 raise Py4JError( > Py4JJavaError: An error occurred while calling o67.json. 
: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 546 > in stage 0.0 failed 4 times, most recent failure: Lost task 546.3 in stage > 0.0 (TID 642, 172.31.30.196, executor 1): java.io.CharConversionException: > Invalid UTF-32 character 0x100(above 10) at char #206, byte #827) at > com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) > at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2017) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:577) > at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:56) > at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:55) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543) at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:55) > at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:53) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at > scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at > scala.collection.Iterator$class.foreach(Iterator.scala:891) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at > scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185) > at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1334) at > scala.collection.TraversableOnce$class.reduceLeftOption(TraversableOnce.scala:203) > at scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1334) > at > scala.collection.TraversableOnce$class.reduceOption(TraversableOnce.scala:210) > at scala.collection.AbstractIterator.reduceOption(Iterator.scala:1334) at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:70) > at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:50) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1
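If the input really is in a known charset, declaring it up front avoids Jackson's byte-order auto-detection, which is what raises "Invalid UTF-32 character" when it misreads a malformed or differently encoded file. A hedged sketch for reading the files directly (rather than via an RDD of strings); the charset and path are assumptions, and PERMISSIVE mode keeps bad records in a corrupt-record column instead of failing the job:

{code:scala}
val df = spark.read
  .option("encoding", "UTF-8")                       // assumed charset of the input files
  .option("mode", "PERMISSIVE")                      // do not fail the whole job on bad records
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("s3a://my-bucket/path/to/*.json")            // placeholder path
{code}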
[jira] [Resolved] (SPARK-32081) facing Invalid UTF-32 character v2.4.5 running pyspark
[ https://issues.apache.org/jira/browse/SPARK-32081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32081. -- Resolution: Cannot Reproduce > facing Invalid UTF-32 character v2.4.5 running pyspark > -- > > Key: SPARK-32081 > URL: https://issues.apache.org/jira/browse/SPARK-32081 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 2.4.5 >Reporter: Yaniv Kempler >Priority: Major > > facing Invalid UTF-32 character while reading json files > > Py4JJavaError Traceback (most recent call last) in > ~/.local/lib/python3.6/site-packages/pyspark/sql/readwriter.py in json(self, > path, schema, primitivesAsString, prefersDecimal, allowComments, > allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, > allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord, > dateFormat, timestampFormat, multiLine, allowUnquotedControlChars, lineSep, > samplingRatio, dropFieldIfAllNull, encoding) 284 keyed._bypass_serializer = > True 285 jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString()) --> 286 > return self._df(self._jreader.json(jrdd)) 287 else: 288 raise > TypeError("path can be only string, list or RDD") > ~/.local/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, > *args) 1255 answer = self.gateway_client.send_command(command) 1256 > return_value = get_return_value( -> 1257 answer, self.gateway_client, > self.target_id, self.name) 1258 1259 for temp_arg in temp_args: > ~/.local/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except > py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() > ~/.local/lib/python3.6/site-packages/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) 326 raise > Py4JJavaError( 327 "An error occurred while calling \{0}{1}\{2}.\n". --> 328 > format(target_id, ".", name), value) 329 else: 330 raise Py4JError( > Py4JJavaError: An error occurred while calling o67.json. 
: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 546 > in stage 0.0 failed 4 times, most recent failure: Lost task 546.3 in stage > 0.0 (TID 642, 172.31.30.196, executor 1): java.io.CharConversionException: > Invalid UTF-32 character 0x100(above 10) at char #206, byte #827) at > com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) > at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2017) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:577) > at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:56) > at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:55) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543) at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:55) > at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:53) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at > scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at > scala.collection.Iterator$class.foreach(Iterator.scala:891) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at > scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185) > at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1334) at > scala.collection.TraversableOnce$class.reduceLeftOption(TraversableOnce.scala:203) > at scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1334) > at > scala.collection.TraversableOnce$class.reduceOption(TraversableOnce.scala:210) > at scala.collection.AbstractIterator.reduceOption(Iterator.scala:1334) at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:70) > at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:50) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
[jira] [Updated] (SPARK-32081) facing Invalid UTF-32 character v2.4.5 running pyspark
[ https://issues.apache.org/jira/browse/SPARK-32081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32081: - Priority: Major (was: Blocker) > facing Invalid UTF-32 character v2.4.5 running pyspark > -- > > Key: SPARK-32081 > URL: https://issues.apache.org/jira/browse/SPARK-32081 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 2.4.5 >Reporter: Yaniv Kempler >Priority: Major > > facing Invalid UTF-32 character while reading json files > > Py4JJavaError Traceback (most recent call last) in > ~/.local/lib/python3.6/site-packages/pyspark/sql/readwriter.py in json(self, > path, schema, primitivesAsString, prefersDecimal, allowComments, > allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, > allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord, > dateFormat, timestampFormat, multiLine, allowUnquotedControlChars, lineSep, > samplingRatio, dropFieldIfAllNull, encoding) 284 keyed._bypass_serializer = > True 285 jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString()) --> 286 > return self._df(self._jreader.json(jrdd)) 287 else: 288 raise > TypeError("path can be only string, list or RDD") > ~/.local/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, > *args) 1255 answer = self.gateway_client.send_command(command) 1256 > return_value = get_return_value( -> 1257 answer, self.gateway_client, > self.target_id, self.name) 1258 1259 for temp_arg in temp_args: > ~/.local/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except > py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() > ~/.local/lib/python3.6/site-packages/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) 326 raise > Py4JJavaError( 327 "An error occurred while calling \{0}{1}\{2}.\n". --> 328 > format(target_id, ".", name), value) 329 else: 330 raise Py4JError( > Py4JJavaError: An error occurred while calling o67.json. 
: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 546 > in stage 0.0 failed 4 times, most recent failure: Lost task 546.3 in stage > 0.0 (TID 642, 172.31.30.196, executor 1): java.io.CharConversionException: > Invalid UTF-32 character 0x100(above 10) at char #206, byte #827) at > com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) > at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2017) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:577) > at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:56) > at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:55) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543) at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:55) > at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:53) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at > scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at > scala.collection.Iterator$class.foreach(Iterator.scala:891) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at > scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185) > at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1334) at > scala.collection.TraversableOnce$class.reduceLeftOption(TraversableOnce.scala:203) > at scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1334) > at > scala.collection.TraversableOnce$class.reduceOption(TraversableOnce.scala:210) > at scala.collection.AbstractIterator.reduceOption(Iterator.scala:1334) at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:70) > at > org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:50) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
[jira] [Commented] (SPARK-31998) Change package references for ArrowBuf
[ https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143553#comment-17143553 ] Apache Spark commented on SPARK-31998: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/28915 > Change package references for ArrowBuf > -- > > Key: SPARK-31998 > URL: https://issues.apache.org/jira/browse/SPARK-31998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > Recently, we have moved class ArrowBuf from package io.netty.buffer to > org.apache.arrow.memory. So after upgrading Arrow library, we need to update > the references to ArrowBuf with the correct package name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
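The change this tracks is essentially a package rename at call sites; a minimal sketch of what "updating the references" looks like (the helper below is illustrative only, not code from the Spark patch):

{code:scala}
// Before the Arrow upgrade the class lived in Netty's namespace:
//   import io.netty.buffer.ArrowBuf
// After the upgrade it is Arrow's own class:
import org.apache.arrow.memory.ArrowBuf

// Illustrative helper: the ArrowBuf API itself is unchanged, only the import moves.
def firstByte(buf: ArrowBuf): Byte = buf.getByte(0)
{code}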
[jira] [Assigned] (SPARK-31998) Change package references for ArrowBuf
[ https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31998: Assignee: (was: Apache Spark) > Change package references for ArrowBuf > -- > > Key: SPARK-31998 > URL: https://issues.apache.org/jira/browse/SPARK-31998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > Recently, we have moved class ArrowBuf from package io.netty.buffer to > org.apache.arrow.memory. So after upgrading Arrow library, we need to update > the references to ArrowBuf with the correct package name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31998) Change package references for ArrowBuf
[ https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143551#comment-17143551 ] Apache Spark commented on SPARK-31998: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/28915 > Change package references for ArrowBuf > -- > > Key: SPARK-31998 > URL: https://issues.apache.org/jira/browse/SPARK-31998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > Recently, we have moved class ArrowBuf from package io.netty.buffer to > org.apache.arrow.memory. So after upgrading Arrow library, we need to update > the references to ArrowBuf with the correct package name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31998) Change package references for ArrowBuf
[ https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31998: Assignee: Apache Spark > Change package references for ArrowBuf > -- > > Key: SPARK-31998 > URL: https://issues.apache.org/jira/browse/SPARK-31998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liya Fan >Assignee: Apache Spark >Priority: Major > > Recently, we have moved class ArrowBuf from package io.netty.buffer to > org.apache.arrow.memory. So after upgrading Arrow library, we need to update > the references to ArrowBuf with the correct package name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32081) facing Invalid UTF-32 character v2.4.5 running pyspark
Yaniv Kempler created SPARK-32081: - Summary: facing Invalid UTF-32 character v2.4.5 running pyspark Key: SPARK-32081 URL: https://issues.apache.org/jira/browse/SPARK-32081 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 2.4.5 Reporter: Yaniv Kempler facing Invalid UTF-32 character while reading json files Py4JJavaError Traceback (most recent call last) in ~/.local/lib/python3.6/site-packages/pyspark/sql/readwriter.py in json(self, path, schema, primitivesAsString, prefersDecimal, allowComments, allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord, dateFormat, timestampFormat, multiLine, allowUnquotedControlChars, lineSep, samplingRatio, dropFieldIfAllNull, encoding) 284 keyed._bypass_serializer = True 285 jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString()) --> 286 return self._df(self._jreader.json(jrdd)) 287 else: 288 raise TypeError("path can be only string, list or RDD") ~/.local/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args) 1255 answer = self.gateway_client.send_command(command) 1256 return_value = get_return_value( -> 1257 answer, self.gateway_client, self.target_id, self.name) 1258 1259 for temp_arg in temp_args: ~/.local/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw) 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() ~/.local/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 326 raise Py4JJavaError( 327 "An error occurred while calling \{0}{1}\{2}.\n". --> 328 format(target_id, ".", name), value) 329 else: 330 raise Py4JError( Py4JJavaError: An error occurred while calling o67.json. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 546 in stage 0.0 failed 4 times, most recent failure: Lost task 546.3 in stage 0.0 (TID 642, 172.31.30.196, executor 1): java.io.CharConversionException: Invalid UTF-32 character 0x100(above 10) at char #206, byte #827) at com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2017) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:577) at org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:56) at org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:55) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543) at org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:55) at org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:53) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185) at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1334) at scala.collection.TraversableOnce$class.reduceLeftOption(TraversableOnce.scala:203) at scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1334) at scala.collection.TraversableOnce$class.reduceOption(TraversableOnce.scala:210) at scala.collection.AbstractIterator.reduceOption(Iterator.scala:1334) at org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:70) at org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:50) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at
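A hedged workaround sketch for the failure above: the stack trace shows the error is raised during schema inference (JsonInferSchema), so supplying an explicit schema skips that pass, and setting the JSON reader's encoding option avoids the charset auto-detection that produced the UTF32Reader. The schema, path, and option values are illustrative assumptions, not taken from the report, and whether permissive parsing absorbs the bad characters at read time still depends on the Spark version.
{code:scala}
import org.apache.spark.sql.types._

// Hypothetical schema for the JSON files; replace with the real fields.
val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)
  .add("_corrupt", StringType)               // holds unparsable records in PERMISSIVE mode

val df = spark.read
  .schema(schema)                            // skip schema inference entirely
  .option("encoding", "UTF-8")               // assume the files are really UTF-8
  .option("mode", "PERMISSIVE")              // keep bad records instead of failing the job
  .option("columnNameOfCorruptRecord", "_corrupt")
  .json("s3://my-bucket/events/*.json")      // hypothetical input path
{code}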
[jira] [Resolved] (SPARK-32062) Reset listenerRegistered in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32062. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28899 [https://github.com/apache/spark/pull/28899] > Reset listenerRegistered in SparkSession > > > Key: SPARK-32062 > URL: https://issues.apache.org/jira/browse/SPARK-32062 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32062) Reset listenerRegistered in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-32062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32062: --- Assignee: ulysses you > Reset listenerRegistered in SparkSession > > > Key: SPARK-32062 > URL: https://issues.apache.org/jira/browse/SPARK-32062 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32072) Unaligned benchmark results
[ https://issues.apache.org/jira/browse/SPARK-32072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32072: --- Assignee: Maxim Gekk > Unaligned benchmark results > > > Key: SPARK-32072 > URL: https://issues.apache.org/jira/browse/SPARK-32072 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > If the length of benchmark names is greater than 40, benchmark results are > not aligned to column names. For example: > {code} > OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux > 4.15.0-1044-aws > Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz > make_timestamp(): Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > prepare make_timestamp() 3636 3673 > 38 0.33635.7 1.0X > make_timestamp(2019, 1, 2, 3, 4, 50.123456) 94 99 > 4 10.7 93.8 38.8X > make_timestamp(2019, 1, 2, 3, 4, 60.00) 68 80 > 13 14.6 68.3 53.2X > make_timestamp(2019, 12, 31, 23, 59, 60.00) 65 79 > 19 15.3 65.3 55.7X > make_timestamp(*, *, *, 3, 4, 50.123456)271280 > 14 3.7 270.7 13.4X > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32072) Unaligned benchmark results
[ https://issues.apache.org/jira/browse/SPARK-32072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32072. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28906 [https://github.com/apache/spark/pull/28906] > Unaligned benchmark results > > > Key: SPARK-32072 > URL: https://issues.apache.org/jira/browse/SPARK-32072 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > If the length of benchmark names is greater than 40, benchmark results are > not aligned to column names. For example: > {code} > OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux > 4.15.0-1044-aws > Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz > make_timestamp(): Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > prepare make_timestamp() 3636 3673 > 38 0.33635.7 1.0X > make_timestamp(2019, 1, 2, 3, 4, 50.123456) 94 99 > 4 10.7 93.8 38.8X > make_timestamp(2019, 1, 2, 3, 4, 60.00) 68 80 > 13 14.6 68.3 53.2X > make_timestamp(2019, 12, 31, 23, 59, 60.00) 65 79 > 19 15.3 65.3 55.7X > make_timestamp(*, *, *, 3, 4, 50.123456)271280 > 14 3.7 270.7 13.4X > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
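The fix direction can be illustrated with a small, self-contained sketch (not the actual benchmark code): compute the name-column width from the longest case name instead of assuming it fits in 40 characters. Names and timings below are examples only.
{code:scala}
val names = Seq(
  "prepare make_timestamp()",
  "make_timestamp(2019, 12, 31, 23, 59, 60.00)",
  "make_timestamp(*, *, *, 3, 4, 50.123456)")

// Pad the header and every row to the widest name so the columns stay aligned.
val width = math.max(40, names.map(_.length).max)
println(s"%-${width}s %14s".format("make_timestamp():", "Best Time(ms)"))
names.foreach(n => println(s"%-${width}s %14s".format(n, "123")))
{code}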
[jira] [Resolved] (SPARK-32075) Fix a few issues in parameters table
[ https://issues.apache.org/jira/browse/SPARK-32075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32075. -- Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/28910 > Fix a few issues in parameters table > > > Key: SPARK-32075 > URL: https://issues.apache.org/jira/browse/SPARK-32075 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Zuo Dao >Priority: Trivial > Fix For: 3.0.1, 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32075) Fix a few issues in parameters table
[ https://issues.apache.org/jira/browse/SPARK-32075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32075: Assignee: Zuo Dao > Fix a few issues in parameters table > > > Key: SPARK-32075 > URL: https://issues.apache.org/jira/browse/SPARK-32075 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Zuo Dao >Assignee: Zuo Dao >Priority: Trivial > Fix For: 3.0.1, 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32080) Simplify ArrowColumnVector ListArray accessor
[ https://issues.apache.org/jira/browse/SPARK-32080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-32080: - Priority: Trivial (was: Major) > Simplify ArrowColumnVector ListArray accessor > - > > Key: SPARK-32080 > URL: https://issues.apache.org/jira/browse/SPARK-32080 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Trivial > > The ArrowColumnVector ListArray accessor calculates start and end offset > indices manually. There were APIs added in Arrow 0.15.0 that do this and > using them will simplify this code and make use of more stable APIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32080) Simplify ArrowColumnVector ListArray accessor
[ https://issues.apache.org/jira/browse/SPARK-32080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143479#comment-17143479 ] Apache Spark commented on SPARK-32080: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/28915 > Simplify ArrowColumnVector ListArray accessor > - > > Key: SPARK-32080 > URL: https://issues.apache.org/jira/browse/SPARK-32080 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > The ArrowColumnVector ListArray accessor calculates start and end offset > indices manually. There were APIs added in Arrow 0.15.0 that do this and > using them will simplify this code and make use of more stable APIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32080) Simplify ArrowColumnVector ListArray accessor
[ https://issues.apache.org/jira/browse/SPARK-32080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32080: Assignee: Apache Spark > Simplify ArrowColumnVector ListArray accessor > - > > Key: SPARK-32080 > URL: https://issues.apache.org/jira/browse/SPARK-32080 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Major > > The ArrowColumnVector ListArray accessor calculates start and end offset > indices manually. There were APIs added in Arrow 0.15.0 that do this and > using them will simplify this code and make use of more stable APIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32080) Simplify ArrowColumnVector ListArray accessor
[ https://issues.apache.org/jira/browse/SPARK-32080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32080: Assignee: (was: Apache Spark) > Simplify ArrowColumnVector ListArray accessor > - > > Key: SPARK-32080 > URL: https://issues.apache.org/jira/browse/SPARK-32080 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > The ArrowColumnVector ListArray accessor calculates start and end offset > indices manually. There were APIs added in Arrow 0.15.0 that do this and > using them will simplify this code and make use of more stable APIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32080) Simplify ArrowColumnVector ListArray accessor
Bryan Cutler created SPARK-32080: Summary: Simplify ArrowColumnVector ListArray accessor Key: SPARK-32080 URL: https://issues.apache.org/jira/browse/SPARK-32080 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Bryan Cutler The ArrowColumnVector ListArray accessor calculates start and end offset indices manually. There were APIs added in Arrow 0.15.0 that do this and using them will simplify this code and make use of more stable APIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
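A minimal sketch of the accessor-based approach mentioned above, assuming Arrow 0.15.0+ where ListVector exposes element start/end index accessors; this is an illustration of the idea, not the Spark patch itself.
{code:scala}
import org.apache.arrow.vector.complex.ListVector

// The element range of one row, taken from the offset accessors instead of
// reading the offsets buffer and multiplying by the offset width by hand.
def elementRange(vector: ListVector, rowId: Int): (Int, Int) =
  (vector.getElementStartIndex(rowId), vector.getElementEndIndex(rowId))
{code}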
[jira] [Resolved] (SPARK-32028) App id link in history summary page point to wrong application attempt
[ https://issues.apache.org/jira/browse/SPARK-32028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-32028. -- Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed Issue resolved by pull request 28867 [https://github.com/apache/spark/pull/28867] > App id link in history summary page point to wrong application attempt > -- > > Key: SPARK-32028 > URL: https://issues.apache.org/jira/browse/SPARK-32028 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.4, 3.0.0, 3.1.0 >Reporter: Zhen Li >Assignee: Zhen Li >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > Attachments: multi_same.JPG, wrong_attemptJPG.JPG > > > App id link in history summary page url is wrong, for multi attempts case. > for details, please see attached screen. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32028) App id link in history summary page point to wrong application attempt
[ https://issues.apache.org/jira/browse/SPARK-32028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-32028: Assignee: Zhen Li > App id link in history summary page point to wrong application attempt > -- > > Key: SPARK-32028 > URL: https://issues.apache.org/jira/browse/SPARK-32028 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.4, 3.0.0, 3.1.0 >Reporter: Zhen Li >Assignee: Zhen Li >Priority: Minor > Attachments: multi_same.JPG, wrong_attemptJPG.JPG > > > App id link in history summary page url is wrong, for multi attempts case. > for details, please see attached screen. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32073) Drop R < 3.5 support
[ https://issues.apache.org/jira/browse/SPARK-32073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32073. -- Fix Version/s: 3.1.0 2.4.7 3.0.1 Resolution: Fixed Issue resolved by pull request 28908 [https://github.com/apache/spark/pull/28908] > Drop R < 3.5 support > > > Key: SPARK-32073 > URL: https://issues.apache.org/jira/browse/SPARK-32073 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: releasenotes > Fix For: 3.0.1, 2.4.7, 3.1.0 > > > Spark 3.0.0 is built by R 3.6.3 which does not support R < 3.5: > {code} > Error in readRDS(pfile) : cannot read workspace version 3 written by R 3.6.3; > need R 3.5.0 or newer version. > {code} > In fact, with SPARK-31918, we will have to drop R < 3.5 entirely to support R > 4.0.0. > This JIRA targets to drop R < 3.5 in SparkR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32073) Drop R < 3.5 support
[ https://issues.apache.org/jira/browse/SPARK-32073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32073: Assignee: Hyukjin Kwon > Drop R < 3.5 support > > > Key: SPARK-32073 > URL: https://issues.apache.org/jira/browse/SPARK-32073 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: releasenotes > > Spark 3.0.0 is built by R 3.6.3 which does not support R < 3.5: > {code} > Error in readRDS(pfile) : cannot read workspace version 3 written by R 3.6.3; > need R 3.5.0 or newer version. > {code} > In fact, with SPARK-31918, we will have to drop R < 3.5 entirely to support R > 4.0.0. > This JIRA targets to drop R < 3.5 in SparkR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31918: Assignee: Hyukjin Kwon > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Assignee: Hyukjin Kwon >Priority: Blocker > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31918. -- Fix Version/s: 3.1.0 2.4.7 3.0.1 Resolution: Fixed Issue resolved by pull request 28907 [https://github.com/apache/spark/pull/28907] > SparkR CRAN check gives a warning with R 4.0.0 on OSX > - > > Key: SPARK-31918 > URL: https://issues.apache.org/jira/browse/SPARK-31918 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.6, 3.0.0 >Reporter: Shivaram Venkataraman >Assignee: Hyukjin Kwon >Priority: Blocker > Fix For: 3.0.1, 2.4.7, 3.1.0 > > > When the SparkR package is run through a CRAN check (i.e. with something like > R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR > vignette as a part of the checks. > However this seems to be failing with R 4.0.0 on OSX -- both on my local > machine and on CRAN > https://cran.r-project.org/web/checks/check_results_SparkR.html > cc [~felixcheung] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32073) Drop R < 3.5 support
[ https://issues.apache.org/jira/browse/SPARK-32073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32073: - Labels: releasenotes (was: ) > Drop R < 3.5 support > > > Key: SPARK-32073 > URL: https://issues.apache.org/jira/browse/SPARK-32073 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: releasenotes > > Spark 3.0.0 is built by R 3.6.3 which does not support R < 3.5: > {code} > Error in readRDS(pfile) : cannot read workspace version 3 written by R 3.6.3; > need R 3.5.0 or newer version. > {code} > In fact, with SPARK-31918, we will have to drop R < 3.5 entirely to support R > 4.0.0. > This JIRA targets to drop R < 3.5 in SparkR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32079) PySpark <> Beam pickling issues for collections.namedtuple
Gerard Casas Saez created SPARK-32079: - Summary: PySpark <> Beam pickling issues for collections.namedtuple Key: SPARK-32079 URL: https://issues.apache.org/jira/browse/SPARK-32079 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.0 Reporter: Gerard Casas Saez PySpark monkeypatching namedtuple makes it difficult/impossible to depickle collections.namedtuple instances from outside of a pyspark environment. When PySpark has been loaded into the environment, any time that you try to pickle a namedtuple, you are only able to unpickle it from an environment where the [hijack|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L385] has been applied. This conflicts directly when trying to use Beam from a non-Spark environment (namingly Flink or Dataflow) making it impossible to use the pipeline if it has a namedtuple loaded somewhere. {code:python} import collections import dill ColumnInfo = collections.namedtuple( "ColumnInfo", [ "name", # type: ColumnName # pytype: disable=ignored-type-comment "type", # type: Optional[ColumnType] # pytype: disable=ignored-type-comment ]) dill.dumps(ColumnInfo('test', int)) {code} {{b'\x80\x03cdill._dill\n_create_namedtuple\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x08\x00\x00\x00__main__q\x05\x87q\x06Rq\x07X\x04\x00\x00\x00testq\x08cdill._dill\n_load_type\nq\tX\x03\x00\x00\x00intq\n\x85q\x0bRq\x0c\x86q\r\x81q\x0e.'}} {code:python} import pyspark import collections import dill ColumnInfo = collections.namedtuple( "ColumnInfo", [ "name", # type: ColumnName # pytype: disable=ignored-type-comment "type", # type: Optional[ColumnType] # pytype: disable=ignored-type-comment ]) dill.dumps(ColumnInfo('test', int)) {code} {{b'\x80\x03cpyspark.serializers\n_restore\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x04\x00\x00\x00testq\x05cdill._dill\n_load_type\nq\x06X\x03\x00\x00\x00intq\x07\x85q\x08Rq\t\x86q\n\x87q\x0bRq\x0c.'}} Second pickled object can only be used from an environment with PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32078) Add a redirect to sql-ref from sql-reference
[ https://issues.apache.org/jira/browse/SPARK-32078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143429#comment-17143429 ] Apache Spark commented on SPARK-32078: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/28914 > Add a redirect to sql-ref from sql-reference > > > Key: SPARK-32078 > URL: https://issues.apache.org/jira/browse/SPARK-32078 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > > A number of Google searches I’ve done today have turned up > [https://spark.apache.org/docs/latest/sql-reference.html], which does not > exist any more. Thus, we should add a redirect to sql-ref.html. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32078) Add a redirect to sql-ref from sql-reference
[ https://issues.apache.org/jira/browse/SPARK-32078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32078: Assignee: Xiao Li (was: Apache Spark) > Add a redirect to sql-ref from sql-reference > > > Key: SPARK-32078 > URL: https://issues.apache.org/jira/browse/SPARK-32078 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > > A number of Google searches I’ve done today have turned up > [https://spark.apache.org/docs/latest/sql-reference.html], which does not > exist any more. Thus, we should add a redirect to sql-ref.html. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32078) Add a redirect to sql-ref from sql-reference
[ https://issues.apache.org/jira/browse/SPARK-32078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32078: Assignee: Apache Spark (was: Xiao Li) > Add a redirect to sql-ref from sql-reference > > > Key: SPARK-32078 > URL: https://issues.apache.org/jira/browse/SPARK-32078 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > A number of Google searches I’ve done today have turned up > [https://spark.apache.org/docs/latest/sql-reference.html], which does not > exist any more. Thus, we should add a redirect to sql-ref.html. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32078) Add a redirect to sql-ref from sql-reference
Xiao Li created SPARK-32078: --- Summary: Add a redirect to sql-ref from sql-reference Key: SPARK-32078 URL: https://issues.apache.org/jira/browse/SPARK-32078 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 3.0.0 Reporter: Xiao Li Assignee: Xiao Li A number of Google searches I’ve done today have turned up [https://spark.apache.org/docs/latest/sql-reference.html], which does not exist any more. Thus, we should add a redirect to sql-ref.html. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23631) Add summary to RandomForestClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-23631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143392#comment-17143392 ] Apache Spark commented on SPARK-23631: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28913 > Add summary to RandomForestClassificationModel > -- > > Key: SPARK-23631 > URL: https://issues.apache.org/jira/browse/SPARK-23631 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.0 >Reporter: Evan Zamir >Priority: Major > Labels: bulk-closed > > I'm using the RandomForestClassificationModel and noticed that there is no > summary attribute like there is for LogisticRegressionModel. Specifically, > I'd like to have the roc and pr curves. Is that on the Spark roadmap > anywhere? Is there a reason it hasn't been implemented? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
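For context, the summary being requested already exists on LogisticRegressionModel; a short sketch of that existing API shows the roc/pr access the reporter wants mirrored on RandomForestClassificationModel (trainingDf is a hypothetical labeled DataFrame).
{code:scala}
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = lr.fit(trainingDf)        // trainingDf: hypothetical training DataFrame
val summary = model.binarySummary     // training summary for binary classification

summary.roc.show()                    // DataFrame of (FPR, TPR) points
summary.pr.show()                     // DataFrame of (recall, precision) points
{code}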
[jira] [Updated] (SPARK-32067) [K8S] Executor pod template of subsequent submission inadvertently applies to ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Description: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and with app2 launching while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 end up having app2's executor pod template applied to them. The root cause appears to be that app1's podspec-configmap got overwritten by app2 during the overlapping launching periods because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. The issue can be seen as follows: First, after submitting app1, you get these configmaps: {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors. The podspec-confimap is modified by app2. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app1--podspec-configmap1 13m57s default app2--driver-conf-map 1 10s default app2--podspec-configmap1 3m{code} was: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and with app2 launching while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 end up having app2's executor pod template applied to them. The root cause appears to be that app1's podspec-configmap got overwritten by app2 during the overlapping launching periods because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. The issue can be seen as follows: First, after submitting app1, you get these configmaps: {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors. The podspec-confimap is modified by app2. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. 
{code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 3m{code} > [K8S] Executor pod template of subsequent submission inadvertently applies to > ongoing submission > > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.6, 3.0.0 >Reporter: James Yu >Priority: Minor > > THE BUG: > The bug is reproducible by spark-submit two different apps (app1 and app2) > with different executor pod templates (e.g., different labels) to K8s > sequentially, and with app2 launching while app1 is still ramping up all its > executor pods. The unwanted result is that some launched executor pods of > app1 end up having app2's executor pod template applied to them. > The root cause appears to be that app1's podspec-configmap got overwritten by > app2 during the overlapping launching periods because the configmap names of > the t
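The proposed solution amounts to deriving the configmap name from the application's resource prefix instead of sharing one fixed name; a tiny hedged sketch of that naming idea (prefix values are illustrative):
{code:scala}
// Per-application configmap name instead of the shared "podspec-configmap".
def podSpecConfigMapName(appResourcePrefix: String): String =
  s"$appResourcePrefix-podspec-configmap"

// app1 and app2 now get distinct names and can no longer overwrite each other.
Seq("app1-", "app2-").foreach(prefix => println(podSpecConfigMapName(prefix)))
{code}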
[jira] [Updated] (SPARK-32067) [K8S] Executor pod template of subsequent submission inadvertently applies to ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Description: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and with app2 launching while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 end up having app2's executor pod template applied to them. The root cause appears to be that app1's podspec-configmap got overwritten by app2 during the overlapping launching periods because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. The issue can be seen as follows: First, after submitting app1, you get these configmaps: {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors. The podspec-confimap is modified by app2. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 3m{code} was: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and with app2 launching while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 end up having app2's executor pod template applied to them. The root cause appears to be that app1's podspec-configmap got overwritten by app2 during the overlapping launching periods because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. The issue can be seen as follows: First, after submitting app1, you get these configmaps: {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors. The podspec-confimap is modified by app2. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. 
{code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 13m57s{code} > [K8S] Executor pod template of subsequent submission inadvertently applies to > ongoing submission > > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.6, 3.0.0 >Reporter: James Yu >Priority: Minor > > THE BUG: > The bug is reproducible by spark-submit two different apps (app1 and app2) > with different executor pod templates (e.g., different labels) to K8s > sequentially, and with app2 launching while app1 is still ramping up all its > executor pods. The unwanted result is that some launched executor pods of > app1 end up having app2's executor pod template applied to them. > The root cause appears to be that app1's podspec-configmap got overwritten by > app2 during the overlapping launching periods because the configmap names of > th
[jira] [Updated] (SPARK-32067) [K8S] Executor pod template of subsequent submission inadvertently applies to ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Summary: [K8S] Executor pod template of subsequent submission inadvertently applies to ongoing submission (was: [K8s] Executor pod template of subsequent submission inadvertently applies to ongoing submission) > [K8S] Executor pod template of subsequent submission inadvertently applies to > ongoing submission > > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.6, 3.0.0 >Reporter: James Yu >Priority: Minor > > THE BUG: > The bug is reproducible by spark-submit two different apps (app1 and app2) > with different executor pod templates (e.g., different labels) to K8s > sequentially, and with app2 launching while app1 is still ramping up all its > executor pods. The unwanted result is that some launched executor pods of > app1 end up having app2's executor pod template applied to them. > The root cause appears to be that app1's podspec-configmap got overwritten by > app2 during the overlapping launching periods because the configmap names of > the two apps are the same. This causes some app1's executor pods being ramped > up after app2 is launched to be inadvertently launched with the app2's pod > template. The issue can be seen as follows: > First, after submitting app1, you get these configmaps: > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 9m46s > default podspec-configmap 1 12m{code} > Then submit app2 while app1 is still ramping up its executors. The > podspec-confimap is modified by app2. > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 11m43s > default app2--driver-conf-map 1 10s > default podspec-configmap 1 13m57s{code} > > PROPOSED SOLUTION: > Properly prefix the podspec-configmap for each submitted app. > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 11m43s > default app2--driver-conf-map 1 10s > default app1--podspec-configmap1 13m57s > default app2--podspec-configmap1 13m57s{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32067) [K8s] Executor pod template of subsequent submission inadvertently applies to ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Summary: [K8s] Executor pod template of subsequent submission inadvertently applies to ongoing submission (was: [K8s] Pod template from subsequent submission inadvertently applies to ongoing submission) > [K8s] Executor pod template of subsequent submission inadvertently applies to > ongoing submission > > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.6, 3.0.0 >Reporter: James Yu >Priority: Minor > > THE BUG: > The bug is reproducible by spark-submit two different apps (app1 and app2) > with different executor pod templates (e.g., different labels) to K8s > sequentially, and with app2 launching while app1 is still ramping up all its > executor pods. The unwanted result is that some launched executor pods of > app1 end up having app2's executor pod template applied to them. > The root cause appears to be that app1's podspec-configmap got overwritten by > app2 during the overlapping launching periods because the configmap names of > the two apps are the same. This causes some app1's executor pods being ramped > up after app2 is launched to be inadvertently launched with the app2's pod > template. The issue can be seen as follows: > First, after submitting app1, you get these configmaps: > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 9m46s > default podspec-configmap 1 12m{code} > Then submit app2 while app1 is still ramping up its executors. The > podspec-confimap is modified by app2. > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 11m43s > default app2--driver-conf-map 1 10s > default podspec-configmap 1 13m57s{code} > > PROPOSED SOLUTION: > Properly prefix the podspec-configmap for each submitted app. > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 11m43s > default app2--driver-conf-map 1 10s > default app1--podspec-configmap1 13m57s > default app2--podspec-configmap1 13m57s{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequent submission inadvertently applies to ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Summary: [K8s] Pod template from subsequent submission inadvertently applies to ongoing submission (was: [K8s] Pod template from subsequently submission inadvertently applies to ongoing submission) > [K8s] Pod template from subsequent submission inadvertently applies to > ongoing submission > - > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.6, 3.0.0 >Reporter: James Yu >Priority: Minor > > THE BUG: > The bug is reproducible by spark-submit two different apps (app1 and app2) > with different executor pod templates (e.g., different labels) to K8s > sequentially, and with app2 launching while app1 is still ramping up all its > executor pods. The unwanted result is that some launched executor pods of > app1 end up having app2's executor pod template applied to them. > The root cause appears to be that app1's podspec-configmap got overwritten by > app2 during the overlapping launching periods because the configmap names of > the two apps are the same. This causes some app1's executor pods being ramped > up after app2 is launched to be inadvertently launched with the app2's pod > template. The issue can be seen as follows: > First, after submitting app1, you get these configmaps: > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 9m46s > default podspec-configmap 1 12m{code} > Then submit app2 while app1 is still ramping up its executors. The > podspec-confimap is modified by app2. > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 11m43s > default app2--driver-conf-map 1 10s > default podspec-configmap 1 13m57s{code} > > PROPOSED SOLUTION: > Properly prefix the podspec-configmap for each submitted app. > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 11m43s > default app2--driver-conf-map 1 10s > default app1--podspec-configmap1 13m57s > default app2--podspec-configmap1 13m57s{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31998) Change package references for ArrowBuf
[ https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143306#comment-17143306 ] Kouhei Sutou commented on SPARK-31998: -- Yes. This change will be included in Apache Arrow 1.0.0. Apache Arrow 1.0.0 will be released at the end of 2020-07. We'll start our release process at the beginning of 2020-07. It'll take a few weeks for verification and vote. FYI: https://lists.apache.org/thread.html/re6fe67fd4cf10113f7969bc00ca6c7b4ccc8067d8512be9c7a904005%40%3Cdev.arrow.apache.org%3E > Change package references for ArrowBuf > -- > > Key: SPARK-31998 > URL: https://issues.apache.org/jira/browse/SPARK-31998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > Recently, we have moved class ArrowBuf from package io.netty.buffer to > org.apache.arrow.memory. So after upgrading Arrow library, we need to update > the references to ArrowBuf with the correct package name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32051) Dataset.foreachPartition returns object
[ https://issues.apache.org/jira/browse/SPARK-32051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143271#comment-17143271 ] Jungtaek Lim commented on SPARK-32051: -- [~frankivo] Could you put full source code for Dataset.foreach here? It looks to be returning DataFrame. > Dataset.foreachPartition returns object > --- > > Key: SPARK-32051 > URL: https://issues.apache.org/jira/browse/SPARK-32051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Oosterhuis >Priority: Critical > > I'm trying to map values from the Dataset[Row], but since 3.0.0 this fails. > In 3.0.0 I'm dealing with an error: "Error:(28, 38) value map is not a member > of Object" > > This is the simplest code that works in 2.4.x, but fails in 3.0.0: > {code:scala} > spark.range(100) > .repartition(10) > .foreachPartition(part => println(part.toList)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
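A hedged sketch of the symptom and a workaround commonly suggested for this kind of overload ambiguity: give the lambda parameter an explicit Iterator type so the Scala overload of foreachPartition is selected. This illustrates the idea only and is not a confirmed resolution of the ticket.
{code:scala}
// On 3.0.0 the untyped lambda's parameter can be inferred as Object, so calls
// such as part.toList (or part.map) fail to compile, as described above:
// spark.range(100).repartition(10).foreachPartition(part => println(part.toList))

// Possible workaround: annotate the parameter type explicitly.
spark.range(100)
  .repartition(10)
  .foreachPartition((part: Iterator[java.lang.Long]) => println(part.toList))
{code}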
[jira] [Commented] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED state correctly
[ https://issues.apache.org/jira/browse/SPARK-32057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143231#comment-17143231 ] Apache Spark commented on SPARK-32057: -- User 'alismess-db' has created a pull request for this issue: https://github.com/apache/spark/pull/28912 > SparkExecuteStatementOperation does not set CANCELED state correctly > - > > Key: SPARK-32057 > URL: https://issues.apache.org/jira/browse/SPARK-32057 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Smesseim >Priority: Major > > https://github.com/apache/spark/pull/28671 introduced changes that changed > the way cleanup is done in SparkExecuteStatementOperation. In cancel(), > cleanup (killing jobs) used to be done after setting state to CANCELED. Now, > the order is reversed. Jobs are killed first, causing exception to be thrown > inside execute(), so the status of the operation becomes ERROR before being > set to CANCELED. > cc [~juliuszsompolski] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
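A minimal, self-contained sketch of the ordering issue described in this ticket; the types below are hypothetical stand-ins, not the actual SparkExecuteStatementOperation or Hive OperationState classes.
{code:scala}
sealed trait OpState
case object Running extends OpState
case object Canceled extends OpState
case object Error extends OpState

class Operation(killJobs: () => Unit) {
  @volatile private var state: OpState = Running
  def currentState: OpState = state

  def cancel(): Unit = synchronized {
    // Setting CANCELED before killing jobs keeps the failure later raised
    // inside execute() from flipping the state to ERROR first.
    state = Canceled
    killJobs()
  }

  // Called when execute() observes its jobs being killed.
  def onExecuteFailure(): Unit = synchronized {
    if (state != Canceled) state = Error   // do not clobber a cancellation
  }
}
{code}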
[jira] [Commented] (SPARK-32063) Spark native temporary table
[ https://issues.apache.org/jira/browse/SPARK-32063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143233#comment-17143233 ] L. C. Hsieh commented on SPARK-32063: - For 1 and 2, both seem related to performance. In Spark, we have a caching mechanism that materializes a complex query. I think it can compensate for the shortcomings of a temporary view. For 3, I'm not sure about this point. Can you elaborate on it more? > Spark native temporary table > > > Key: SPARK-32063 > URL: https://issues.apache.org/jira/browse/SPARK-32063 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Lantao Jin >Priority: Major > > Many databases and data warehouse SQL engines support temporary tables. A > temporary table, as its named implied, is a short-lived table that its life > will be only for current session. > In Spark, there is no temporary table. the DDL “CREATE TEMPORARY TABLE AS > SELECT” will create a temporary view. A temporary view is totally different > with a temporary table. > A temporary view is just a VIEW. It doesn’t materialize data in storage. So > it has below shortage: > # View will not give improved performance. Materialize intermediate data in > temporary tables for a complex query will accurate queries, especially in an > ETL pipeline. > # View which calls other views can cause severe performance issues. Even, > executing a very complex view may fail in Spark. > # Temporary view has no database namespace. In some complex ETL pipelines or > data warehouse applications, without database prefix is not convenient. It > needs some tables which only used in current session. > > More details are described in [Design > Docs|https://docs.google.com/document/d/1RS4Q3VbxlZ_Yy0fdWgTJ-k0QxFd1dToCqpLAYvIJ34U/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
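A minimal sketch of the caching approach referred to in this comment, with hypothetical view and column names; caching materializes the intermediate result on first use, which is the complement being suggested for points 1 and 2.
{code:scala}
// Hypothetical intermediate query; table and column names are made up.
val intermediate = spark.sql(
  "SELECT key, sum(value) AS total FROM events GROUP BY key")

intermediate.createOrReplaceTempView("intermediate")
spark.catalog.cacheTable("intermediate")   // or: spark.sql("CACHE TABLE intermediate")

// Later reads hit the cached, materialized data instead of re-running the query.
val result = spark.table("intermediate").where("total > 100")
{code}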
[jira] [Commented] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED state correctly
[ https://issues.apache.org/jira/browse/SPARK-32057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143230#comment-17143230 ] Apache Spark commented on SPARK-32057: -- User 'alismess-db' has created a pull request for this issue: https://github.com/apache/spark/pull/28912 > SparkExecuteStatementOperation does not set CANCELED state correctly > - > > Key: SPARK-32057 > URL: https://issues.apache.org/jira/browse/SPARK-32057 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Smesseim >Priority: Major > > https://github.com/apache/spark/pull/28671 introduced changes that changed > the way cleanup is done in SparkExecuteStatementOperation. In cancel(), > cleanup (killing jobs) used to be done after setting state to CANCELED. Now, > the order is reversed. Jobs are killed first, causing exception to be thrown > inside execute(), so the status of the operation becomes ERROR before being > set to CANCELED. > cc [~juliuszsompolski] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED state correctly
[ https://issues.apache.org/jira/browse/SPARK-32057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32057: Assignee: Apache Spark > SparkExecuteStatementOperation does not set CANCELED state correctly > - > > Key: SPARK-32057 > URL: https://issues.apache.org/jira/browse/SPARK-32057 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Smesseim >Assignee: Apache Spark >Priority: Major > > https://github.com/apache/spark/pull/28671 introduced changes that changed > the way cleanup is done in SparkExecuteStatementOperation. In cancel(), > cleanup (killing jobs) used to be done after setting state to CANCELED. Now, > the order is reversed. Jobs are killed first, causing exception to be thrown > inside execute(), so the status of the operation becomes ERROR before being > set to CANCELED. > cc [~juliuszsompolski] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED state correctly
[ https://issues.apache.org/jira/browse/SPARK-32057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32057: Assignee: (was: Apache Spark) > SparkExecuteStatementOperation does not set CANCELED state correctly > - > > Key: SPARK-32057 > URL: https://issues.apache.org/jira/browse/SPARK-32057 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Smesseim >Priority: Major > > https://github.com/apache/spark/pull/28671 introduced changes that changed > the way cleanup is done in SparkExecuteStatementOperation. In cancel(), > cleanup (killing jobs) used to be done after setting state to CANCELED. Now, > the order is reversed. Jobs are killed first, causing exception to be thrown > inside execute(), so the status of the operation becomes ERROR before being > set to CANCELED. > cc [~juliuszsompolski] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED state correctly
[ https://issues.apache.org/jira/browse/SPARK-32057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ali Smesseim updated SPARK-32057: - Summary: SparkExecuteStatementOperation does not set CANCELED state correctly (was: SparkExecuteStatementOperation does not set CANCELED/CLOSED state correctly ) > SparkExecuteStatementOperation does not set CANCELED state correctly > - > > Key: SPARK-32057 > URL: https://issues.apache.org/jira/browse/SPARK-32057 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Smesseim >Priority: Major > > https://github.com/apache/spark/pull/28671 introduced changes that changed > the way cleanup is done in SparkExecuteStatementOperation. In cancel(), > cleanup (killing jobs) used to be done after setting state to CANCELED. Now, > the order is reversed. Jobs are killed first, causing exception to be thrown > inside execute(), so the status of the operation becomes ERROR before being > set to CANCELED. > cc [~juliuszsompolski] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequently submission inadvertently applies to ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Description: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and with app2 launching while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 end up having app2's executor pod template applied to them. The root cause appears to be that app1's podspec-configmap got overwritten by app2 during the overlapping launching periods because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. The issue can be seen as follows: First, after submitting app1, you get these configmaps: {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors. The podspec-confimap is modified by app2. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 13m57s{code} was: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and with app2 launching while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 end up having app2's executor pod template applied to them. The root cause is that app1's podspec-configmap got overwritten by app2 during the overlapping launching periods because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. The issue can be seen as follows: First, after submitting app1, you get these configmaps: {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors. The podspec-confimap is modified by app2. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. 
{code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 13m57s{code} > [K8s] Pod template from subsequently submission inadvertently applies to > ongoing submission > --- > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.6, 3.0.0 >Reporter: James Yu >Priority: Minor > > THE BUG: > The bug is reproducible by spark-submit two different apps (app1 and app2) > with different executor pod templates (e.g., different labels) to K8s > sequentially, and with app2 launching while app1 is still ramping up all its > executor pods. The unwanted result is that some launched executor pods of > app1 end up having app2's executor pod template applied to them. > The root cause appears to be that app1's podspec-configmap got overwritten by > app2 during the overlapping launching periods because the configmap names of > the two apps are th
[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequent submission inadvertently applies to ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Description: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and with app2 launching while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 end up having app2's executor pod template applied to them. The root cause is that app1's podspec-configmap got overwritten by app2 during the overlapping launching periods because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. The issue can be seen as follows: First, after submitting app1, you get these configmaps: {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors. The podspec-confimap is modified by app2. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 13m57s{code} was: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and with app2 launching while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 appear to have app2's pod template applied. The root cause is that app1's podspec-configmap got overwritten by app2 during the overlapping launching periods because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. The issue can be seen as follows: First, after submitting app1, you get these configmaps: {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors. The podspec-confimap is modified by app2. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 13m57s{code} > [K8s] Pod template from subsequently submission inadvertently applies to > ongoing submission > --- > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.6, 3.0.0 >Reporter: James Yu >Priority: Minor > > THE BUG: > The bug is reproducible by spark-submit two different apps (app1 and app2) > with different executor pod templates (e.g., different labels) to K8s > sequentially, and with app2 launching while app1 is still ramping up all its > executor pods. 
The unwanted result is that some launched executor pods of > app1 end up having app2's executor pod template applied to them. > The root cause is that app1's podspec-configmap got overwritten by app2 > during the overlapping launching periods because the configmap names of the > two apps are the same. This causes some app1's execut
[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequent submission inadvertently applies to ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Description: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and with app2 launching while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 appear to have app2's pod template applied. The root cause is that app1's podspec-configmap got overwritten by app2 during the overlapping launching periods because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. The issue can be seen as follows: First, after submitting app1, you get these configmaps: {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors. The podspec-confimap is modified by app2. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 13m57s{code} was: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and app2 launches while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 appear to have app2's pod template applied. The root cause is that app1's podspec-configmap got overwritten by app2 during the overlapping launching periods because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. The issue can be seen as follows: First, after submitting app1, you get these configmaps: {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors. The podspec-confimap is modified by app2. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 13m57s{code} > [K8s] Pod template from subsequently submission inadvertently applies to > ongoing submission > --- > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.6, 3.0.0 >Reporter: James Yu >Priority: Minor > > THE BUG: > The bug is reproducible by spark-submit two different apps (app1 and app2) > with different executor pod templates (e.g., different labels) to K8s > sequentially, and with app2 launching while app1 is still ramping up all its > executor pods. 
The unwanted result is that some launched executor pods of > app1 appear to have app2's pod template applied. > The root cause is that app1's podspec-configmap got overwritten by app2 > during the overlapping launching periods because the configmap names of the > two apps are the same. This causes some app1's executor pods being ramped up > after app2
[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequent submission inadvertently applies to ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Description: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and app2 launches while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 appear to have app2's pod template applied. The root cause is that app1's podspec-configmap got overwritten by app2 during the overlapping launching periods because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. The issue can be seen as follows: First, after submitting app1, you get these configmaps: {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors. The podspec-confimap is modified by app2. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 13m57s{code} was: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and app2 launches while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 appear to have app2's pod template applied. The root cause is that app1's podspec-configmap got overwritten by app2 during the overlapping launching periods because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. First, submit app1 {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 13m57s{code} > [K8s] Pod template from subsequently submission inadvertently applies to > ongoing submission > --- > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.6, 3.0.0 >Reporter: James Yu >Priority: Minor > > THE BUG: > The bug is reproducible by spark-submit two different apps (app1 and app2) > with different executor pod templates (e.g., different labels) to K8s > sequentially, and app2 launches while app1 is still ramping up all its > executor pods. The unwanted result is that some launched executor pods of > app1 appear to have app2's pod template applied. 
> The root cause is that app1's podspec-configmap got overwritten by app2 > during the overlapping launching periods because the configmap names of the > two apps are the same. This causes some app1's executor pods being ramped up > after app2 is launched to be inadvertently launched with the app2's pod > template. The issue can be seen as follows: > First, after submi
[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequent submission inadvertently applies to ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Description: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and app2 launches while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 appear to have app2's pod template applied. The root cause is that app1's podspec-configmap got overwritten by app2 during the overlapping launching periods because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. First, submit app1 {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 13m57s{code} was: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and app2 launches while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 appear to have app2's pod template applied. The root cause is that app1's podspec-configmap got overwritten by app2 during the launching period because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. First, submit app1 {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 13m57s{code} > [K8s] Pod template from subsequently submission inadvertently applies to > ongoing submission > --- > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.6, 3.0.0 >Reporter: James Yu >Priority: Minor > > THE BUG: > The bug is reproducible by spark-submit two different apps (app1 and app2) > with different executor pod templates (e.g., different labels) to K8s > sequentially, and app2 launches while app1 is still ramping up all its > executor pods. The unwanted result is that some launched executor pods of > app1 appear to have app2's pod template applied. > The root cause is that app1's podspec-configmap got overwritten by app2 > during the overlapping launching periods because the configmap names of the > two apps are the same. 
This causes some app1's executor pods being ramped up > after app2 is launched to be inadvertently launched with the app2's pod > template. > First, submit app1 > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 9m46s > defa
[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequent submission inadvertently applies to ongoing submission
[ https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Yu updated SPARK-32067: - Description: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different executor pod templates (e.g., different labels) to K8s sequentially, and app2 launches while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 appear to have app2's pod template applied. The root cause is that app1's podspec-configmap got overwritten by app2 during the launching period because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. First, submit app1 {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 13m57s{code} was: THE BUG: The bug is reproducible by spark-submit two different apps (app1 and app2) with different pod templates to K8s sequentially, and app2 launches while app1 is still ramping up all its executor pods. The unwanted result is that some launched executor pods of app1 appear to have app2's pod template applied. The root cause is that app1's podspec-configmap got overwritten by app2 during the launching period because the configmap names of the two apps are the same. This causes some app1's executor pods being ramped up after app2 is launched to be inadvertently launched with the app2's pod template. First, submit app1 {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 9m46s default podspec-configmap 1 12m{code} Then submit app2 while app1 is still ramping up its executors {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default podspec-configmap 1 13m57s{code} PROPOSED SOLUTION: Properly prefix the podspec-configmap for each submitted app. {code:java} NAMESPACENAME DATAAGE default app1--driver-conf-map 1 11m43s default app2--driver-conf-map 1 10s default app1--podspec-configmap1 13m57s default app2--podspec-configmap1 13m57s{code} > [K8s] Pod template from subsequently submission inadvertently applies to > ongoing submission > --- > > Key: SPARK-32067 > URL: https://issues.apache.org/jira/browse/SPARK-32067 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.6, 3.0.0 >Reporter: James Yu >Priority: Minor > > THE BUG: > The bug is reproducible by spark-submit two different apps (app1 and app2) > with different executor pod templates (e.g., different labels) to K8s > sequentially, and app2 launches while app1 is still ramping up all its > executor pods. The unwanted result is that some launched executor pods of > app1 appear to have app2's pod template applied. > The root cause is that app1's podspec-configmap got overwritten by app2 > during the launching period because the configmap names of the two apps are > the same. 
This causes some app1's executor pods being ramped up after app2 is > launched to be inadvertently launched with the app2's pod template. > First, submit app1 > {code:java} > NAMESPACENAME DATAAGE > default app1--driver-conf-map 1 9m46s > default podspec-configmap 1 12m{
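A hedged sketch of the prefixing proposed in the SPARK-32067 updates above (the helper name and exact prefix format are illustrative; Spark's Kubernetes submission code builds resource names differently): deriving the executor pod-template configmap name from the per-application resource prefix, as is already done for the driver conf map, keeps overlapping submissions from overwriting each other's configmap.

{code:scala}
// Hypothetical helper; "resourceNamePrefix" stands for the per-application
// prefix (e.g. "app1-") that Spark already uses for the driver conf map.
def podSpecConfigMapName(resourceNamePrefix: String): String =
  s"$resourceNamePrefix-podspec-configmap"

// app1 and app2 now get distinct configmaps instead of sharing one:
val app1Map = podSpecConfigMapName("app1-") // "app1--podspec-configmap"
val app2Map = podSpecConfigMapName("app2-") // "app2--podspec-configmap"
{code}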
[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation
[ https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143151#comment-17143151 ] Thomas Graves commented on SPARK-32037: --- I agree healthy/unhealthy could mean other things than the current blacklist meaning. Another option is excludes, but it has the same problem: something could be excluded because the user specified it. A few other options I found searching around: *grant*list/*block*list *let*list/*ban*list - I like ban but am not sure about the letlist side. SafeList/BlockList Allowlist/DenyList [https://tools.ietf.org/id/draft-knodel-terminology-00.html#rfc.section.1.2.1] has: * Blocklist-allowlist * Block-permit Personally I like blocklist/allowlist > Rename blacklisting feature to avoid language with racist connotation > - > > Key: SPARK-32037 > URL: https://issues.apache.org/jira/browse/SPARK-32037 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Priority: Minor > > As per [discussion on the Spark dev > list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E], > it will be beneficial to remove references to problematic language that can > alienate potential community members. One such reference is "blacklist". > While it seems to me that there is some valid debate as to whether this term > has racist origins, the cultural connotations are inescapable in today's > world. > I've created a separate task, SPARK-32036, to remove references outside of > this feature. Given the large surface area of this feature and the > public-facing UI / configs / etc., more care will need to be taken here. > I'd like to start by opening up debate on what the best replacement name > would be. Reject-/deny-/ignore-/block-list are common replacements for > "blacklist", but I'm not sure that any of them work well for this situation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32077) Support host-local shuffle data reading with external shuffle service disabled
[ https://issues.apache.org/jira/browse/SPARK-32077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32077: Assignee: (was: Apache Spark) > Support host-local shuffle data reading with external shuffle service disabled > -- > > Key: SPARK-32077 > URL: https://issues.apache.org/jira/browse/SPARK-32077 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > After SPARK-27651, Spark can read host-local shuffle data directly from disk > with external shuffle service enabled. To extend the feature, we can also > support it with external shuffle service disabled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32077) Support host-local shuffle data reading with external shuffle service disabled
[ https://issues.apache.org/jira/browse/SPARK-32077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143021#comment-17143021 ] Apache Spark commented on SPARK-32077: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28911 > Support host-local shuffle data reading with external shuffle service disabled > -- > > Key: SPARK-32077 > URL: https://issues.apache.org/jira/browse/SPARK-32077 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > After SPARK-27651, Spark can read host-local shuffle data directly from disk > with external shuffle service enabled. To extend the feature, we can also > support it with external shuffle service disabled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32077) Support host-local shuffle data reading with external shuffle service disabled
[ https://issues.apache.org/jira/browse/SPARK-32077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143020#comment-17143020 ] Apache Spark commented on SPARK-32077: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28911 > Support host-local shuffle data reading with external shuffle service disabled > -- > > Key: SPARK-32077 > URL: https://issues.apache.org/jira/browse/SPARK-32077 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > After SPARK-27651, Spark can read host-local shuffle data directly from disk > with external shuffle service enabled. To extend the feature, we can also > support it with external shuffle service disabled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32077) Support host-local shuffle data reading with external shuffle service disabled
[ https://issues.apache.org/jira/browse/SPARK-32077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32077: Assignee: Apache Spark > Support host-local shuffle data reading with external shuffle service disabled > -- > > Key: SPARK-32077 > URL: https://issues.apache.org/jira/browse/SPARK-32077 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > After SPARK-27651, Spark can read host-local shuffle data directly from disk > with external shuffle service enabled. To extend the feature, we can also > support it with external shuffle service disabled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
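For context on SPARK-32077, a hedged example of how host-local disk reads are enabled today with the external shuffle service on (the configuration key comes from SPARK-27651; verify the exact name against the documentation of the Spark version in use), which this ticket proposes to also support when the service is off:

{code:scala}
import org.apache.spark.sql.SparkSession

// Current behaviour (Spark 3.0): host-local reads require the external shuffle service.
val spark = SparkSession.builder()
  .appName("host-local-shuffle-read")
  .config("spark.shuffle.service.enabled", "true")   // prerequisite today
  .config("spark.shuffle.readHostLocalDisk", "true") // read host-local blocks from local disk
  .getOrCreate()
{code}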
[jira] [Comment Edited] (SPARK-31995) Spark Structured Streaming checkpointFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not
[ https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142990#comment-17142990 ] Jim Huang edited comment on SPARK-31995 at 6/23/20, 3:00 PM: - Thanks Gabor for triaging this issue. SPARK-32076 has been opened to explore the improvement perspective. [~gsomogyi] I am curious as to what part of the code base within the Spark 3.0.0 branch that "should make this issue disappear"? was (Author: jimhuang): Thanks Gabor for triaging this issue. SPARK-32076 has been opened to explore the improvement perspective. > Spark Structure Streaming checkpiontFileManager ERROR when > HDFS.DFSOutputStream.completeFile with IOException unable to close file > because the last block does not have enough number of replicas > - > > Key: SPARK-31995 > URL: https://issues.apache.org/jira/browse/SPARK-31995 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.5 > Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop > Hadoop 2.7.3 - YARN cluster > delta-core_ 2.11:0.6.1 > >Reporter: Jim Huang >Priority: Major > > I am using Spark 2.4.5's Spark Structured Streaming with Delta table (0.6.1) > as the sink running in YARN cluster running on Hadoop 2.7.3. I have been > using Spark Structured Streaming for several months now in this runtime > environment until this new corner case that handicapped my Spark structured > streaming job in partial working state. > > I have included the ERROR message and stack trace. I did a quick search > using the string "MicroBatchExecution: Query terminated with error" but did > not find any existing Jira that looks like my stack trace. > > Based on the naive look at this error message and stack trace, is it possible > the Spark's CheckpointFileManager could attempt to handle this HDFS exception > better to simply wait a little longer for HDFS's pipeline to complete the > replicas? > > Being new to this code, where can I find the configuration parameter that > sets the replica counts for the `streaming.HDFSMetadataLog`? I am just > trying to understand if there are already some holistic configuration tuning > variable(s) the current code provide to be able to handle this IOException > more gracefully? Hopefully experts can provide some pointers or directions. > > {code:java} > 20/06/12 20:14:15 ERROR MicroBatchExecution: Query [id = > yarn-job-id-redacted, runId = run-id-redacted] terminated with error > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. 
> at > org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2511) > at > org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2472) > at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2437) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) > at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106) > at > org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:145) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:547) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:557) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$ap
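Purely as an illustration of the retry idea raised in the SPARK-31995 description above (this is not existing Spark behaviour, and whether retrying the close is the right fix is exactly what the ticket asks; the helper name and retry bounds are invented):

{code:scala}
import java.io.IOException

// Retry a close() that can fail transiently, e.g. while HDFS is still
// replicating the last block, with a bounded number of attempts.
def closeWithRetry(close: () => Unit, maxAttempts: Int = 5, waitMs: Long = 2000L): Unit = {
  var attempt = 1
  var done = false
  while (!done) {
    try {
      close()
      done = true
    } catch {
      case _: IOException if attempt < maxAttempts =>
        Thread.sleep(waitMs)
        attempt += 1
      // once attempts are exhausted the IOException propagates as before
    }
  }
}
{code}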
[jira] [Comment Edited] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143007#comment-17143007 ] Gabor Somogyi edited comment on SPARK-32001 at 6/23/20, 2:54 PM: - With the service loader, no registration is needed and everything works like a charm. Additionally, an enable flag is harder to implement. Let me give an example: With JdbcDialect: one creates a dialect class and then "JdbcDialects.registerDialect(new CustomDialect())" must be called in the app. With ServiceLoader: one creates a provider class + a META-INF/services file (no registration or anything else needed). The service loader scans the classpath for implementations of an API. I'm not super experienced in the dialect area, so it may be needed per app, but a Kerberos authentication provider is not something the user needs to care about in the app code. The company writes one provider, puts it on Spark's classpath, and it can be forgotten. was (Author: gsomogyi): With the service loader, no registration is needed and everything works like a charm. Additionally, an enable flag is harder to implement. Let me give an example: With JdbcDialect: one creates a dialect class and then "JdbcDialects.registerDialect(new CustomDialect())" must be called in the app. With ServiceLoader: one creates a provider class + a META-INF/services file (no registration or anything else needed). I'm not super experienced in the dialect area, so it may be needed per app, but a Kerberos authentication provider is not something the user needs to care about in the app code. The company writes one provider, puts it on Spark's classpath, and it can be forgotten. > Create Kerberos authentication provider API in JDBC connector > - > > Key: SPARK-32001 > URL: https://issues.apache.org/jira/browse/SPARK-32001 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > Adding an embedded provider for all the possible databases would generate a high > maintenance cost on the Spark side. > Instead, an API can be introduced which would allow implementing further > providers independently. > One important requirement that I suggest is: JDBC connection providers must > be loaded independently, just like delegation token providers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
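A hedged sketch of the ServiceLoader wiring being discussed in SPARK-32001 (the trait name, method signatures, and class names below are illustrative; the eventual Spark API may look different):

{code:scala}
import java.sql.{Connection, DriverManager}
import java.util.Properties

// Hypothetical provider SPI.
trait JdbcConnectionProvider {
  def canHandle(driver: String): Boolean
  def getConnection(url: String, properties: Properties): Connection
}

// A company-specific implementation shipped in its own jar.
class MyKerberosProvider extends JdbcConnectionProvider {
  override def canHandle(driver: String): Boolean =
    driver.contains("postgresql")

  override def getConnection(url: String, properties: Properties): Connection = {
    // A real provider would perform the Kerberos handshake here before connecting.
    DriverManager.getConnection(url, properties)
  }
}
{code}

The jar would then carry a resource file under META-INF/services named after the fully qualified trait and listing MyKerberosProvider, and java.util.ServiceLoader.load(classOf[JdbcConnectionProvider]) would discover it from the classpath without any registration call in application code.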
[jira] [Created] (SPARK-32077) Support host-local shuffle data reading with external shuffle service disabled
wuyi created SPARK-32077: Summary: Support host-local shuffle data reading with external shuffle service disabled Key: SPARK-32077 URL: https://issues.apache.org/jira/browse/SPARK-32077 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: wuyi After SPARK-27651, Spark can read host-local shuffle data directly from disk with external shuffle service enabled. To extend the feature, we can also support it with external shuffle service disabled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143007#comment-17143007 ] Gabor Somogyi commented on SPARK-32001: --- With the service loader, no registration is needed and everything works like a charm. Additionally, an enable flag is harder to implement. Let me give an example: With JdbcDialect: one creates a dialect class and then "JdbcDialects.registerDialect(new CustomDialect())" must be called in the app. With ServiceLoader: one creates a provider class + a META-INF/services file (no registration or anything else needed). I'm not super experienced in the dialect area, so it may be needed per app, but a Kerberos authentication provider is not something the user needs to care about in the app code. The company writes one provider, puts it on Spark's classpath, and it can be forgotten. > Create Kerberos authentication provider API in JDBC connector > - > > Key: SPARK-32001 > URL: https://issues.apache.org/jira/browse/SPARK-32001 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > Adding an embedded provider for all the possible databases would generate a high > maintenance cost on the Spark side. > Instead, an API can be introduced which would allow implementing further > providers independently. > One important requirement that I suggest is: JDBC connection providers must > be loaded independently, just like delegation token providers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32051) Dataset.foreachPartition returns object
[ https://issues.apache.org/jira/browse/SPARK-32051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Oosterhuis updated SPARK-32051: - Fix Version/s: (was: 3.0.1) (was: 3.1.0) > Dataset.foreachPartition returns object > --- > > Key: SPARK-32051 > URL: https://issues.apache.org/jira/browse/SPARK-32051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Oosterhuis >Priority: Critical > > I'm trying to map values from the Dataset[Row], but since 3.0.0 this fails. > In 3.0.0 I'm dealing with an error: "Error:(28, 38) value map is not a member > of Object" > > This is the simplest code that works in 2.4.x, but fails in 3.0.0: > {code:scala} > spark.range(100) > .repartition(10) > .foreachPartition(part => println(part.toList)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
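For reference, a hedged sketch of the commonly reported workaround for the repro quoted in the SPARK-32051 description (user-side code only, not a fix in Spark itself): giving the lambda parameter an explicit type lets the Scala overload win, so the partition iterator is no longer inferred as Object.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SPARK-32051-repro").getOrCreate()

// Ambiguous in 3.0.0: the untyped lambda matches both the Scala overload and the
// Java ForeachPartitionFunction overload, so `part` is inferred as Object.
// spark.range(100).repartition(10).foreachPartition(part => println(part.toList))

// Workaround: type the parameter explicitly so the Scala overload is chosen.
spark.range(100)
  .repartition(10)
  .foreachPartition((part: Iterator[java.lang.Long]) => println(part.toList))
{code}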
[jira] [Commented] (SPARK-32051) Dataset.foreachPartition returns object
[ https://issues.apache.org/jira/browse/SPARK-32051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143001#comment-17143001 ] Frank Oosterhuis commented on SPARK-32051: -- Looks like a similar conflict happens with Dataset.foreach. {code:java} Error:(22, 8) overloaded method value foreach with alternatives: (func: org.apache.spark.api.java.function.ForeachFunction[org.apache.spark.sql.Row])Unit (f: org.apache.spark.sql.Row => Unit)Unit cannot be applied to (org.apache.spark.sql.Row => org.apache.spark.sql.DataFrame) .foreach((r : Row) => { {code} Workaround *.foreach((r: Row) =>* does not work here. > Dataset.foreachPartition returns object > --- > > Key: SPARK-32051 > URL: https://issues.apache.org/jira/browse/SPARK-32051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Oosterhuis >Priority: Critical > Fix For: 3.0.1, 3.1.0 > > > I'm trying to map values from the Dataset[Row], but since 3.0.0 this fails. > In 3.0.0 I'm dealing with an error: "Error:(28, 38) value map is not a member > of Object" > > This is the simplest code that works in 2.4.x, but fails in 3.0.0: > {code:scala} > spark.range(100) > .repartition(10) > .foreachPartition(part => println(part.toList)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32075) Fix a few issues in parameters table
[ https://issues.apache.org/jira/browse/SPARK-32075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32075: Assignee: Apache Spark > Fix a few issues in parameters table > > > Key: SPARK-32075 > URL: https://issues.apache.org/jira/browse/SPARK-32075 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Zuo Dao >Assignee: Apache Spark >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32075) Fix a few issues in parameters table
[ https://issues.apache.org/jira/browse/SPARK-32075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142994#comment-17142994 ] Apache Spark commented on SPARK-32075: -- User 'sidedoorleftroad' has created a pull request for this issue: https://github.com/apache/spark/pull/28910 > Fix a few issues in parameters table > > > Key: SPARK-32075 > URL: https://issues.apache.org/jira/browse/SPARK-32075 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Zuo Dao >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32075) Fix a few issues in parameters table
[ https://issues.apache.org/jira/browse/SPARK-32075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32075: Assignee: (was: Apache Spark) > Fix a few issues in parameters table > > > Key: SPARK-32075 > URL: https://issues.apache.org/jira/browse/SPARK-32075 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Zuo Dao >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31995) Spark Structured Streaming checkpointFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not have
[ https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142990#comment-17142990 ] Jim Huang commented on SPARK-31995: --- Thanks Gabor for triaging this issue. SPARK-32076 has been opened to explore the improvement perspective. > Spark Structure Streaming checkpiontFileManager ERROR when > HDFS.DFSOutputStream.completeFile with IOException unable to close file > because the last block does not have enough number of replicas > - > > Key: SPARK-31995 > URL: https://issues.apache.org/jira/browse/SPARK-31995 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.5 > Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop > Hadoop 2.7.3 - YARN cluster > delta-core_ 2.11:0.6.1 > >Reporter: Jim Huang >Priority: Major > > I am using Spark 2.4.5's Spark Structured Streaming with Delta table (0.6.1) > as the sink running in YARN cluster running on Hadoop 2.7.3. I have been > using Spark Structured Streaming for several months now in this runtime > environment until this new corner case that handicapped my Spark structured > streaming job in partial working state. > > I have included the ERROR message and stack trace. I did a quick search > using the string "MicroBatchExecution: Query terminated with error" but did > not find any existing Jira that looks like my stack trace. > > Based on the naive look at this error message and stack trace, is it possible > the Spark's CheckpointFileManager could attempt to handle this HDFS exception > better to simply wait a little longer for HDFS's pipeline to complete the > replicas? > > Being new to this code, where can I find the configuration parameter that > sets the replica counts for the `streaming.HDFSMetadataLog`? I am just > trying to understand if there are already some holistic configuration tuning > variable(s) the current code provide to be able to handle this IOException > more gracefully? Hopefully experts can provide some pointers or directions. > > {code:java} > 20/06/12 20:14:15 ERROR MicroBatchExecution: Query [id = > yarn-job-id-redacted, runId = run-id-redacted] terminated with error > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. 
> at > org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2511) > at > org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2472) > at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2437) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) > at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106) > at > org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:145) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:547) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:557) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198) > at > org.apache.spark.sql.execution.streaming.MicroBa
[jira] [Created] (SPARK-32076) Structured Streaming application continuity when encountering streaming query task level error
Jim Huang created SPARK-32076: - Summary: Structured Streaming application continuity when encountering streaming query task level error Key: SPARK-32076 URL: https://issues.apache.org/jira/browse/SPARK-32076 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.5 Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop Hadoop 2.7.3 - YARN cluster delta-core_ 2.11:0.6.1 Reporter: Jim Huang From the Spark Structured Streaming application continuity perspective, the thread that ran this task was terminated with the ERROR described in SPARK-31995, but to YARN it is still an active, running job even though this instance of the Spark Structured Streaming job is no longer doing any processing. If the monitoring of the Spark Structured Streaming job is done only from the YARN job perspective, it may report a false status. In this situation, should the Spark Structured Streaming application fail hard and completely (failed by the Spark framework or by application exception handling)? Or should the developer investigate and develop a monitoring implementation with the right level of specificity to detect Spark Structured Streaming *task* level failures? Any references on these topics are much appreciated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
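One monitoring approach consistent with the question above, sketched with the public StreamingQueryListener API (whether it is sufficient for a given deployment is exactly what SPARK-32076 asks): register a listener so that a terminated or failed query is surfaced to the application instead of relying on the YARN application state alone.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().appName("streaming-monitoring").getOrCreate()

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    // event.exception is defined when the query stopped because of an error;
    // here the application could alert, or exit so the YARN app fails visibly.
    event.exception.foreach { err =>
      System.err.println(s"Streaming query ${event.id} failed: $err")
    }
  }
})

// Alternatively, awaitAnyTermination() rethrows the failure in the driver,
// which makes the whole YARN application fail instead of idling.
// spark.streams.awaitAnyTermination()
{code}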
[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142969#comment-17142969 ] Takeshi Yamamuro commented on SPARK-32001: -- I just want to know whether the approach (META-INF/services) is the best one for this. Can't we follow interfaces similar to JdbcDialect's (registerDialect and unregisterDialect)? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala#L205-L216 > Create Kerberos authentication provider API in JDBC connector > - > > Key: SPARK-32001 > URL: https://issues.apache.org/jira/browse/SPARK-32001 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > Adding an embedded provider for all the possible databases would generate a high > maintenance cost on the Spark side. > Instead, an API can be introduced which would allow implementing further > providers independently. > One important requirement that I suggest is: JDBC connection providers must > be loaded independently, just like delegation token providers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32075) Fix a few issues in parameters table
Zuo Dao created SPARK-32075: --- Summary: Fix a few issues in parameters table Key: SPARK-32075 URL: https://issues.apache.org/jira/browse/SPARK-32075 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.0.0 Reporter: Zuo Dao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31995) Spark Structured Streaming checkpointFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not have
[ https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi resolved SPARK-31995. --- Resolution: Information Provided The issue should disappear w/ Spark 3.0. Please re-open it if it's not the case. > Spark Structure Streaming checkpiontFileManager ERROR when > HDFS.DFSOutputStream.completeFile with IOException unable to close file > because the last block does not have enough number of replicas > - > > Key: SPARK-31995 > URL: https://issues.apache.org/jira/browse/SPARK-31995 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.5 > Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop > Hadoop 2.7.3 - YARN cluster > delta-core_ 2.11:0.6.1 > >Reporter: Jim Huang >Priority: Major > > I am using Spark 2.4.5's Spark Structured Streaming with Delta table (0.6.1) > as the sink running in YARN cluster running on Hadoop 2.7.3. I have been > using Spark Structured Streaming for several months now in this runtime > environment until this new corner case that handicapped my Spark structured > streaming job in partial working state. > > I have included the ERROR message and stack trace. I did a quick search > using the string "MicroBatchExecution: Query terminated with error" but did > not find any existing Jira that looks like my stack trace. > > Based on the naive look at this error message and stack trace, is it possible > the Spark's CheckpointFileManager could attempt to handle this HDFS exception > better to simply wait a little longer for HDFS's pipeline to complete the > replicas? > > Being new to this code, where can I find the configuration parameter that > sets the replica counts for the `streaming.HDFSMetadataLog`? I am just > trying to understand if there are already some holistic configuration tuning > variable(s) the current code provide to be able to handle this IOException > more gracefully? Hopefully experts can provide some pointers or directions. > > {code:java} > 20/06/12 20:14:15 ERROR MicroBatchExecution: Query [id = > yarn-job-id-redacted, runId = run-id-redacted] terminated with error > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. 
> at > org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2511) > at > org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2472) > at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2437) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) > at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106) > at > org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:145) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:547) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:557) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStrea
[jira] [Closed] (SPARK-31995) Spark Structured Streaming checkpointFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not have en
[ https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi closed SPARK-31995. - > Spark Structure Streaming checkpiontFileManager ERROR when > HDFS.DFSOutputStream.completeFile with IOException unable to close file > because the last block does not have enough number of replicas > - > > Key: SPARK-31995 > URL: https://issues.apache.org/jira/browse/SPARK-31995 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.5 > Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop > Hadoop 2.7.3 - YARN cluster > delta-core_ 2.11:0.6.1 > >Reporter: Jim Huang >Priority: Major > > I am using Spark 2.4.5's Spark Structured Streaming with Delta table (0.6.1) > as the sink running in YARN cluster running on Hadoop 2.7.3. I have been > using Spark Structured Streaming for several months now in this runtime > environment until this new corner case that handicapped my Spark structured > streaming job in partial working state. > > I have included the ERROR message and stack trace. I did a quick search > using the string "MicroBatchExecution: Query terminated with error" but did > not find any existing Jira that looks like my stack trace. > > Based on the naive look at this error message and stack trace, is it possible > the Spark's CheckpointFileManager could attempt to handle this HDFS exception > better to simply wait a little longer for HDFS's pipeline to complete the > replicas? > > Being new to this code, where can I find the configuration parameter that > sets the replica counts for the `streaming.HDFSMetadataLog`? I am just > trying to understand if there are already some holistic configuration tuning > variable(s) the current code provide to be able to handle this IOException > more gracefully? Hopefully experts can provide some pointers or directions. > > {code:java} > 20/06/12 20:14:15 ERROR MicroBatchExecution: Query [id = > yarn-job-id-redacted, runId = run-id-redacted] terminated with error > java.io.IOException: Unable to close file because the last block does not > have enough number of replicas. 
> at > org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2511) > at > org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2472) > at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2437) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72) > at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106) > at > org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:145) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:547) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:557) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:545) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > at > org.apache.spark.sql.execution.streaming.MicroBat
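{code}

As far as I can tell, there is no Spark-side setting that controls a replica count for streaming.HDFSMetadataLog; the metadata log simply writes through the job's HDFS client configuration. The knob closest to this exact stack trace is the HDFS client setting dfs.client.block.write.locateFollowingBlock.retries, whose exhaustion in DFSOutputStream.completeFile() is what produces the "not enough number of replicas" message, so raising it buys the write pipeline more time (the longer-term fix is the HDFS-11486 change mentioned later in this thread, shipped in Hadoop 2.7.4+). A minimal sketch of how the setting could be raised from a Spark job; the value 10 is only illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession

// Any "spark.hadoop.*" entry is copied into the Hadoop Configuration used by
// the HDFS client on the driver and executors, so the checkpoint writes made
// by HDFSMetadataLog pick it up as well.
val spark = SparkSession.builder()
  .appName("structured-streaming-job")
  .config("spark.hadoop.dfs.client.block.write.locateFollowingBlock.retries", "10")
  .getOrCreate()

// Equivalent programmatic form on an existing session:
spark.sparkContext.hadoopConfiguration
  .setInt("dfs.client.block.write.locateFollowingBlock.retries", 10)
{code}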
[jira] [Commented] (SPARK-31995) Spark Structured Streaming checkpointFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not have
[ https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142957#comment-17142957 ] Gabor Somogyi commented on SPARK-31995: --- W/o super deep consideration I would say the streaming query must stop if an exception happens during execution, but I would like to handle that as a separate jira. My proposal is to close this jira as solved by a later version and open another one for the improvement. > Spark Structured Streaming checkpointFileManager ERROR when > HDFS.DFSOutputStream.completeFile with IOException unable to close file > because the last block does not have enough number of replicas > - > > Key: SPARK-31995 > URL: https://issues.apache.org/jira/browse/SPARK-31995 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.5 > Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop > Hadoop 2.7.3 - YARN cluster > delta-core_2.11:0.6.1 > >Reporter: Jim Huang >Priority: Major
[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142937#comment-17142937 ] Gabor Somogyi commented on SPARK-32001: --- Here is the DT provider API: https://github.com/apache/spark/blob/e00f43cb86a6c76720b45176e9f9a7fba1dc3a35/core/src/main/scala/org/apache/spark/security/HadoopDelegationTokenProvider.scala#L31 META-INF.services: https://github.com/apache/spark/blob/master/core/src/main/resources/META-INF/services/org.apache.spark.security.HadoopDelegationTokenProvider > Create Kerberos authentication provider API in JDBC connector > - > > Key: SPARK-32001 > URL: https://issues.apache.org/jira/browse/SPARK-32001 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > Adding embedded provider to all the possible databases would generate high > maintenance cost on Spark side. > Instead an API can be introduced which would allow to implement further > providers independently. > One important requirement what I suggest is: JDBC connection providers must > be loaded independently just like delegation token providers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
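The linked HadoopDelegationTokenProvider gives a feel for the shape such an API could take. The sketch below is only an illustration modeled on that pattern; the trait name, method signatures, and the example implementation are all hypothetical, not the interface Spark eventually ships.

{code:scala}
// Hypothetical sketch of a pluggable JDBC Kerberos/connection provider,
// modeled on the HadoopDelegationTokenProvider pattern linked above.
// None of these names are real Spark APIs.
import java.sql.{Connection, Driver}
import java.util.Properties

trait JdbcConnectionProvider {
  /** Whether this provider knows how to authenticate the given driver/options. */
  def canHandle(driver: Driver, options: Map[String, String]): Boolean
  /** Open a connection, performing any Kerberos login the database needs. */
  def getConnection(driver: Driver, options: Map[String, String]): Connection
}

// Example third-party implementation that would ship in an external jar.
class ExampleDbKerberosProvider extends JdbcConnectionProvider {
  override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
    options.contains("keytab") && driver.getClass.getName.startsWith("org.exampledb")

  override def getConnection(driver: Driver, options: Map[String, String]): Connection = {
    val props = new Properties()
    options.foreach { case (k, v) => props.setProperty(k, v) }
    // A real provider would do a UserGroupInformation login from the keytab here.
    driver.connect(options("url"), props)
  }
}
{code}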
[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142934#comment-17142934 ] Gabor Somogyi commented on SPARK-32001: --- Which system do you mean? A provider can simply be added as an external jar containing the API implementation + the META-INF.services file, and that's all. > Create Kerberos authentication provider API in JDBC connector > - > > Key: SPARK-32001 > URL: https://issues.apache.org/jira/browse/SPARK-32001 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > Adding an embedded provider for every possible database would generate a high > maintenance cost on the Spark side. > Instead, an API can be introduced which would allow further providers to be > implemented independently. > One important requirement I suggest is: JDBC connection providers must > be loaded independently, just like delegation token providers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
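To make "external jar + META-INF.services" concrete: inside the provider jar, registration is a standard java.util.ServiceLoader resource file named after the fully qualified provider interface, listing one implementation class per line. The interface and implementation names below are hypothetical placeholders, not classes that exist in Spark.

{code}
# contents of META-INF/services/org.example.spark.jdbc.JdbcConnectionProvider
# ('#' starts a comment in a ServiceLoader services file)
org.example.db.ExampleDbKerberosProvider
{code}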
[jira] [Commented] (SPARK-31995) Spark Structured Streaming checkpointFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not have
[ https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142932#comment-17142932 ] Jim Huang commented on SPARK-31995: --- Thank you for providing a helpful perspective. I was able to locate HDFS-11486 using the search string you provided, and it was "resolved in (Hadoop) 2.7.4+." I agree with you that the HDFS-11486 fix will definitely improve the HDFS replica exception handling. This issue is pretty unique; I am not sure I am equipped to create and induce such a rare corner case. Spark 3.0.0 was just released this past week. I will need additional application development time to migrate to the Spark 3.x architecture (Delta 0.7.0+) ecosystem, but I will be able to upgrade to Spark 2.4.6 sooner. From the Spark Structured Streaming application continuity perspective, the thread that ran this task was terminated with ERROR, but to YARN it is still an active running job even though my Spark Structured Streaming job is no longer doing any processing. If monitoring of the Spark Structured Streaming job is done only from the YARN job perspective, it may report a false status. In this situation, should the Spark Structured Streaming application fail hard and completely (failed by the Spark framework or by application exception handling)? Or should I investigate and develop a monitoring implementation with the right level of specificity to detect Spark Structured Streaming task-level failures? Any references on these topics are much appreciated. > Spark Structured Streaming checkpointFileManager ERROR when > HDFS.DFSOutputStream.completeFile with IOException unable to close file > because the last block does not have enough number of replicas > - > > Key: SPARK-31995 > URL: https://issues.apache.org/jira/browse/SPARK-31995 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.5 > Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop > Hadoop 2.7.3 - YARN cluster > delta-core_2.11:0.6.1 > >Reporter: Jim Huang >Priority: Major
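When the failure stays inside the stream execution thread and the driver neither re-throws it nor watches query terminations, YARN only sees a live application master, which is why the job still looks active. A sketch of the two standard hooks follows, with a placeholder source, sink, and checkpoint path rather than anything from the reported job:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().appName("monitored-stream").getOrCreate()

// 1) Observe terminations: QueryTerminatedEvent carries the error message,
//    so this is a natural place to emit a metric or alert.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    event.exception.foreach { err =>
      // Hypothetical hook: replace with real alerting/metrics.
      System.err.println(s"Streaming query ${event.id} failed: $err")
    }
})

// 2) Fail the driver with the query: awaitTermination() rethrows the failure
//    as a StreamingQueryException, so the YARN application ends as FAILED
//    instead of lingering as an active-but-idle job.
val query = spark.readStream.format("rate").load()
  .writeStream.format("console")
  .option("checkpointLocation", "/tmp/ckpt") // illustrative path
  .start()
query.awaitTermination()
{code}

Pattern 2 is usually enough when the application runs a single query; the listener is more useful when several queries share one application and a single failure should not necessarily kill the others.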
[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142916#comment-17142916 ] Takeshi Yamamuro commented on SPARK-32001: -- Do you know any other systems supporting that kind of interface? > Create Kerberos authentication provider API in JDBC connector > - > > Key: SPARK-32001 > URL: https://issues.apache.org/jira/browse/SPARK-32001 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > Adding an embedded provider for every possible database would generate a high > maintenance cost on the Spark side. > Instead, an API can be introduced which would allow further providers to be > implemented independently. > One important requirement I suggest is: JDBC connection providers must > be loaded independently, just like delegation token providers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32074) Update AppVeyor R to 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32074: Assignee: Apache Spark > Update AppVeyor R to 4.0.2 > -- > > Key: SPARK-32074 > URL: https://issues.apache.org/jira/browse/SPARK-32074 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > We should test R 4.0.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32074) Update AppVeyor R to 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142915#comment-17142915 ] Apache Spark commented on SPARK-32074: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28909 > Update AppVeyor R to 4.0.2 > -- > > Key: SPARK-32074 > URL: https://issues.apache.org/jira/browse/SPARK-32074 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > We should test R 4.0.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32074) Update AppVeyor R to 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142914#comment-17142914 ] Apache Spark commented on SPARK-32074: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28909 > Update AppVeyor R to 4.0.2 > -- > > Key: SPARK-32074 > URL: https://issues.apache.org/jira/browse/SPARK-32074 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > We should test R 4.0.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32074) Update AppVeyor R to 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32074: Assignee: (was: Apache Spark) > Update AppVeyor R to 4.0.2 > -- > > Key: SPARK-32074 > URL: https://issues.apache.org/jira/browse/SPARK-32074 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > We should test R 4.0.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32074) Update AppVeyor R to 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32074: - Summary: Update AppVeyor R to 4.0.2 (was: Update AppVeyor R to 4.0.1) > Update AppVeyor R to 4.0.2 > -- > > Key: SPARK-32074 > URL: https://issues.apache.org/jira/browse/SPARK-32074 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > We should test R 4.0.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32051) Dataset.foreachPartition returns object
[ https://issues.apache.org/jira/browse/SPARK-32051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Oosterhuis updated SPARK-32051: - Fix Version/s: 3.1.0 > Dataset.foreachPartition returns object > --- > > Key: SPARK-32051 > URL: https://issues.apache.org/jira/browse/SPARK-32051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Oosterhuis >Priority: Critical > Fix For: 3.0.1, 3.1.0 > > > I'm trying to map values from the Dataset[Row], but since 3.0.0 this fails. > In 3.0.0 I'm dealing with an error: "Error:(28, 38) value map is not a member > of Object" > > This is the simplest code that works in 2.4.x, but fails in 3.0.0: > {code:scala} > spark.range(100) > .repartition(10) > .foreachPartition(part => println(part.toList)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
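The compile error here typically comes from overload ambiguity: with Scala 2.12, a bare lambda matches both the Scala Iterator[T] => Unit and the Java ForeachPartitionFunction[T] overloads of Dataset.foreachPartition, and the lambda parameter can end up typed as Object. Two workarounds that are commonly suggested are sketched below under that assumption; they are not taken from this ticket's resolution.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("foreachPartition-workaround").getOrCreate()

// 1) Give the lambda an explicit Iterator type so the Scala overload of
//    Dataset.foreachPartition is chosen instead of the Java functional interface.
spark.range(100)
  .repartition(10)
  .foreachPartition((part: Iterator[java.lang.Long]) => println(part.toList))

// 2) Drop to the RDD API, which has no overloaded foreachPartition.
spark.range(100)
  .repartition(10)
  .rdd
  .foreachPartition(part => println(part.toList))
{code}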
[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142911#comment-17142911 ] Gabor Somogyi commented on SPARK-32001: --- Yeah, it's a developer's API for sure. My plan is to load providers w/ the service loader, so a new provider will be loaded automatically if it's registered w/ an appropriate META-INF.services entry. > Create Kerberos authentication provider API in JDBC connector > - > > Key: SPARK-32001 > URL: https://issues.apache.org/jira/browse/SPARK-32001 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > Adding an embedded provider for every possible database would generate a high > maintenance cost on the Spark side. > Instead, an API can be introduced which would allow further providers to be > implemented independently. > One important requirement I suggest is: JDBC connection providers must > be loaded independently, just like delegation token providers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
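As a rough illustration of the service-loader plan (all names hypothetical, mirroring how delegation token providers are discovered): Spark would load every registered implementation at startup and pick the one whose canHandle matches the connection being opened. This is a sketch of the dispatch side only, not the design that eventually lands.

{code:scala}
import java.sql.{Connection, Driver}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Minimal illustrative provider interface, matching the earlier sketch.
trait JdbcConnectionProvider {
  def canHandle(driver: Driver, options: Map[String, String]): Boolean
  def getConnection(driver: Driver, options: Map[String, String]): Connection
}

object ConnectionProviderRegistry {
  // ServiceLoader finds every class listed in
  // META-INF/services/<fully.qualified.JdbcConnectionProvider> on the classpath,
  // including classes that ship in user-supplied --jars.
  private lazy val providers: Seq[JdbcConnectionProvider] =
    ServiceLoader.load(classOf[JdbcConnectionProvider]).asScala.toSeq

  def connect(driver: Driver, options: Map[String, String]): Connection =
    providers.find(_.canHandle(driver, options)) match {
      case Some(provider) =>
        provider.getConnection(driver, options)
      case None =>
        // Fall back to a plain, non-Kerberos connection.
        driver.connect(options("url"), new java.util.Properties())
    }
}
{code}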
[jira] [Updated] (SPARK-32074) Update AppVeyor R to 4.0.1
[ https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32074: - Summary: Update AppVeyor R to 4.0.1 (was: Update AppVeyor R and Rtools to 4.0.1) > Update AppVeyor R to 4.0.1 > -- > > Key: SPARK-32074 > URL: https://issues.apache.org/jira/browse/SPARK-32074 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > We should test R 4.0.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32051) Dataset.foreachPartition returns object
[ https://issues.apache.org/jira/browse/SPARK-32051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Oosterhuis updated SPARK-32051: - Fix Version/s: 3.0.1 > Dataset.foreachPartition returns object > --- > > Key: SPARK-32051 > URL: https://issues.apache.org/jira/browse/SPARK-32051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Frank Oosterhuis >Priority: Critical > Fix For: 3.0.1 > > > I'm trying to map values from the Dataset[Row], but since 3.0.0 this fails. > In 3.0.0 I'm dealing with an error: "Error:(28, 38) value map is not a member > of Object" > > This is the simplest code that works in 2.4.x, but fails in 3.0.0: > {code:scala} > spark.range(100) > .repartition(10) > .foreachPartition(part => println(part.toList)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org