[jira] [Comment Edited] (SPARK-32063) Spark native temporary table

2020-06-23 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143575#comment-17143575
 ] 

Lantao Jin edited comment on SPARK-32063 at 6/24/20, 6:53 AM:
--

[~viirya] For 1, even though RDD cache or table cache can improve performance, I 
still think they have totally different scopes. Besides, we can also cache a 
temporary table in memory for a further performance improvement. In production 
usage, I found that our data engineers and data scientists do not always 
remember to uncache cached tables or views. This situation becomes worse in the 
Spark thrift-server (which shares one Spark driver). 
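
To make the maintenance burden concrete, here is a minimal sketch (the view name 
and data are made up, not taken from this ticket):
{code:scala}
// Hypothetical names, for illustration only.
spark.range(1000).createOrReplaceTempView("tmp_orders")
spark.catalog.cacheTable("tmp_orders")      // pins the data in executor memory

// ... run queries against tmp_orders ...

// If a user forgets these two calls, the cached data and the view live on for
// the lifetime of the driver, which the thrift-server shares across sessions.
spark.catalog.uncacheTable("tmp_orders")
spark.catalog.dropTempView("tmp_orders")
{code}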

For 2, we found that when Adaptive Query Execution is enabled, complex views 
easily get stuck in the optimization step. Caching such a view does not help.

For 3, the scenario is from our migration case, moving SQL from Teradata to 
Spark. Without temporary tables, TD users have to create permanent tables and 
drop them at the end of a script as a substitute for TD volatile tables; if the 
JDBC session is closed or the script fails before cleaning up, no mechanism 
guarantees that the intermediate data is dropped. If they use Spark temporary 
views instead, a lot of the logic does not work well. For example, they want to 
execute UPDATE/DELETE operations on intermediate tables, but we cannot convert a 
temporary view to a Delta table or a Hudi table ...
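
A rough sketch of the workaround pattern described above (database, table and 
column names are invented):
{code:scala}
// Invented names; only the shape of the pattern matters.
spark.sql(
  """CREATE TABLE etl_db.stage_orders AS
    |SELECT * FROM etl_db.raw_orders WHERE dt = '2020-06-23'""".stripMargin)
try {
  spark.sql(
    "INSERT INTO etl_db.fact_orders SELECT order_id, amount FROM etl_db.stage_orders")
} finally {
  // If the JDBC session drops or the script dies before this line runs,
  // the "temporary" data is left behind as a permanent table.
  spark.sql("DROP TABLE IF EXISTS etl_db.stage_orders")
}
{code}
A session-scoped temporary table would make the explicit DROP unnecessary and 
would clean itself up even when the script fails.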


was (Author: cltlfcjin):
For 1, even though RDD cache or table cache can improve performance, I still 
think they have totally different scopes. Besides, we can also cache a temporary 
table in memory for a further performance improvement. In production usage, I 
found that our data engineers and data scientists do not always remember to 
uncache cached tables or views. This situation becomes worse in the Spark 
thrift-server (which shares one Spark driver). 

For 2, we found that when Adaptive Query Execution is enabled, complex views 
easily get stuck in the optimization step. Caching such a view does not help.

For 3, the scenario is from our migration case, moving SQL from Teradata to 
Spark. Without temporary tables, TD users have to create permanent tables and 
drop them at the end of a script as a substitute for TD volatile tables; if the 
JDBC session is closed or the script fails before cleaning up, no mechanism 
guarantees that the intermediate data is dropped. If they use Spark temporary 
views instead, a lot of the logic does not work well. For example, they want to 
execute UPDATE/DELETE operations on intermediate tables, but we cannot convert a 
temporary view to a Delta table or a Hudi table ...

> Spark native temporary table
> 
>
> Key: SPARK-32063
> URL: https://issues.apache.org/jira/browse/SPARK-32063
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Many databases and data warehouse SQL engines support temporary tables. A 
> temporary table, as its name implies, is a short-lived table whose life is 
> limited to the current session.
> In Spark, there is no temporary table. The DDL “CREATE TEMPORARY TABLE AS 
> SELECT” will create a temporary view. A temporary view is totally different 
> from a temporary table. 
> A temporary view is just a VIEW. It doesn’t materialize data in storage, so 
> it has the following shortcomings:
>  # A view will not give improved performance. Materializing intermediate data 
> in temporary tables for a complex query will accelerate queries, especially 
> in an ETL pipeline.
>  # A view which calls other views can cause severe performance issues. 
> Executing a very complex view may even fail in Spark. 
>  # A temporary view has no database namespace. In some complex ETL pipelines 
> or data warehouse applications, having no database prefix is not convenient. 
> They need some tables which are only used in the current session.
>  
> More details are described in [Design 
> Docs|https://docs.google.com/document/d/1RS4Q3VbxlZ_Yy0fdWgTJ-k0QxFd1dToCqpLAYvIJ34U/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32063) Spark native temporary table

2020-06-23 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143575#comment-17143575
 ] 

Lantao Jin commented on SPARK-32063:


For 1, even though RDD cache or table cache can improve performance, I still 
think they have totally different scopes. Besides, we can also cache a temporary 
table in memory for a further performance improvement. In production usage, I 
found that our data engineers and data scientists do not always remember to 
uncache cached tables or views. This situation becomes worse in the Spark 
thrift-server (which shares one Spark driver). 

For 2, we found that when Adaptive Query Execution is enabled, complex views 
easily get stuck in the optimization step. Caching such a view does not help.

For 3, the scenario is from our migration case, moving SQL from Teradata to 
Spark. Without temporary tables, TD users have to create permanent tables and 
drop them at the end of a script as a substitute for TD volatile tables; if the 
JDBC session is closed or the script fails before cleaning up, no mechanism 
guarantees that the intermediate data is dropped. If they use Spark temporary 
views instead, a lot of the logic does not work well. For example, they want to 
execute UPDATE/DELETE operations on intermediate tables, but we cannot convert a 
temporary view to a Delta table or a Hudi table ...

> Spark native temporary table
> 
>
> Key: SPARK-32063
> URL: https://issues.apache.org/jira/browse/SPARK-32063
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Many databases and data warehouse SQL engines support temporary tables. A 
> temporary table, as its name implies, is a short-lived table whose life is 
> limited to the current session.
> In Spark, there is no temporary table. The DDL “CREATE TEMPORARY TABLE AS 
> SELECT” will create a temporary view. A temporary view is totally different 
> from a temporary table. 
> A temporary view is just a VIEW. It doesn’t materialize data in storage, so 
> it has the following shortcomings:
>  # A view will not give improved performance. Materializing intermediate data 
> in temporary tables for a complex query will accelerate queries, especially 
> in an ETL pipeline.
>  # A view which calls other views can cause severe performance issues. 
> Executing a very complex view may even fail in Spark. 
>  # A temporary view has no database namespace. In some complex ETL pipelines 
> or data warehouse applications, having no database prefix is not convenient. 
> They need some tables which are only used in the current session.
>  
> More details are described in [Design 
> Docs|https://docs.google.com/document/d/1RS4Q3VbxlZ_Yy0fdWgTJ-k0QxFd1dToCqpLAYvIJ34U/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30466) remove dependency on jackson-mapper-asl-1.9.13 and jackson-core-asl-1.9.13

2020-06-23 Thread Prashant Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143569#comment-17143569
 ] 

Prashant Sharma commented on SPARK-30466:
-

I just noticed that Hadoop 3.2.1 still uses these jars (jackson-mapper-asl-1.9.13 
and jackson-core-asl-1.9.13); they come in as a transitive dependency of 
jersey-json. See below.
{code:java}
[INFO] org.apache.hadoop:hadoop-common:jar:3.2.1
[INFO] +- org.apache.hadoop:hadoop-annotations:jar:3.2.1:compile
[INFO] |  \- jdk.tools:jdk.tools:jar:1.8:system
[INFO] +- com.google.guava:guava:jar:27.0-jre:compile
[INFO] |  +- com.google.guava:failureaccess:jar:1.0:compile
[INFO] |  +- 
com.google.guava:listenablefuture:jar:9999.0-empty-to-avoid-conflict-with-guava:compile
[INFO] |  +- org.checkerframework:checker-qual:jar:2.5.2:compile
[INFO] |  +- com.google.errorprone:error_prone_annotations:jar:2.2.0:compile
[INFO] |  +- com.google.j2objc:j2objc-annotations:jar:1.1:compile
[INFO] |  \- org.codehaus.mojo:animal-sniffer-annotations:jar:1.17:compile
[INFO] +- commons-cli:commons-cli:jar:1.2:compile
[INFO] +- org.apache.commons:commons-math3:jar:3.1.1:compile
[INFO] +- org.apache.httpcomponents:httpclient:jar:4.5.6:compile
[INFO] |  \- org.apache.httpcomponents:httpcore:jar:4.4.10:compile
[INFO] +- commons-codec:commons-codec:jar:1.11:compile
[INFO] +- commons-io:commons-io:jar:2.5:compile
[INFO] +- commons-net:commons-net:jar:3.6:compile
[INFO] +- commons-collections:commons-collections:jar:3.2.2:compile
[INFO] +- javax.servlet:javax.servlet-api:jar:3.1.0:compile
[INFO] +- org.eclipse.jetty:jetty-server:jar:9.3.24.v20180605:compile
[INFO] |  +- org.eclipse.jetty:jetty-http:jar:9.3.24.v20180605:compile
[INFO] |  \- org.eclipse.jetty:jetty-io:jar:9.3.24.v20180605:compile
[INFO] +- org.eclipse.jetty:jetty-util:jar:9.3.24.v20180605:compile
[INFO] +- org.eclipse.jetty:jetty-servlet:jar:9.3.24.v20180605:compile
[INFO] |  \- org.eclipse.jetty:jetty-security:jar:9.3.24.v20180605:compile
[INFO] +- org.eclipse.jetty:jetty-webapp:jar:9.3.24.v20180605:compile
[INFO] |  \- org.eclipse.jetty:jetty-xml:jar:9.3.24.v20180605:compile
[INFO] +- org.eclipse.jetty:jetty-util-ajax:jar:9.3.24.v20180605:test
[INFO] +- javax.servlet.jsp:jsp-api:jar:2.1:runtime
[INFO] +- com.sun.jersey:jersey-core:jar:1.19:compile
[INFO] |  \- javax.ws.rs:jsr311-api:jar:1.1.1:compile
[INFO] +- com.sun.jersey:jersey-servlet:jar:1.19:compile
[INFO] +- com.sun.jersey:jersey-json:jar:1.19:compile
[INFO] |  +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO] |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
[INFO] |  |  \- javax.xml.bind:jaxb-api:jar:2.2.11:compile
[INFO] |  +- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile
[INFO] |  +- org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile
[INFO] |  +- org.codehaus.jackson:jackson-jaxrs:jar:1.9.13:compile
[INFO] |  \- org.codehaus.jackson:jackson-xc:jar:1.9.13:compile
[INFO] +- com.sun.jersey:jersey-server:jar:1.19:compile

{code}

> remove dependency on jackson-mapper-asl-1.9.13 and jackson-core-asl-1.9.13
> --
>
> Key: SPARK-30466
> URL: https://issues.apache.org/jira/browse/SPARK-30466
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Michael Burgener
>Priority: Major
>  Labels: security
>
> These 2 libraries are deprecated and replaced by the jackson-databind 
> libraries which are already included.  These two libraries are flagged by our 
> vulnerability scanners as having the following security vulnerabilities.  
> I've set the priority to Major due to the Critical nature and hopefully they 
> can be addressed quickly.  Please note, I'm not a developer but work in 
> InfoSec and this was flagged when we incorporated spark into our product.  If 
> you feel the priority is not set correctly please change accordingly.  I'll 
> watch the issue and flag our dev team to update once resolved.  
> jackson-mapper-asl-1.9.13
> CVE-2018-7489 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2018-7489] 
>  
> CVE-2017-7525 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2017-7525]
>  
> CVE-2017-17485 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2017-17485]
>  
> CVE-2017-15095 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2017-15095]
>  
> CVE-2018-5968 (CVSS 3.0 Score 8.1 High)
> [https://nvd.nist.gov/vuln/detail/CVE-2018-5968]
>  
> jackson-core-asl-1.9.13
> CVE-2016-7051 (CVSS 3.0 Score 8.6 High)
> https://nvd.nist.gov/vuln/detail/CVE-2016-7051



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-32074) Update AppVeyor R to 4.0.2

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32074.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/28909

> Update AppVeyor R to 4.0.2
> --
>
> Key: SPARK-32074
> URL: https://issues.apache.org/jira/browse/SPARK-32074
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> We should test R 4.0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected

2020-06-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143566#comment-17143566
 ] 

Hyukjin Kwon commented on SPARK-25244:
--

This issue was closed because it marked the affected version as 2.3, which is 
EOL. Feel free to create a new JIRA with a reproducer and analysis if the issue 
persists.

> [Python] Setting `spark.sql.session.timeZone` only partially respected
> --
>
> Key: SPARK-25244
> URL: https://issues.apache.org/jira/browse/SPARK-25244
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Anton Daitche
>Priority: Major
>  Labels: bulk-closed
>
> The setting `spark.sql.session.timeZone` is respected by PySpark when 
> converting from and to Pandas, as described 
> [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
>  However, when timestamps are converted directly to Python's `datetime` 
> objects, it is ignored and the system's timezone is used.
> This can be checked by the following code snippet
> {code:java}
> import pyspark.sql
> spark = (pyspark
>  .sql
>  .SparkSession
>  .builder
>  .master('local[1]')
>  .config("spark.sql.session.timeZone", "UTC")
>  .getOrCreate()
> )
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].astype("timestamp"))
> print(df.toPandas().iloc[0,0])
> print(df.collect()[0][0])
> {code}
> Which for me prints (the exact result depends on the timezone of your system, 
> mine is Europe/Berlin)
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method `toPandas` respected the timezone setting (UTC), but the 
> method `collect` ignored it and converted the timestamp to my system's 
> timezone.
> The cause of this behaviour is that the methods `toInternal` and 
> `fromInternal` of PySpark's `TimestampType` class don't take the setting 
> `spark.sql.session.timeZone` into account and use the system timezone instead.
> If the maintainers agree that this should be fixed, I would try to come up 
> with a patch. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31887) Date casting to string is giving wrong value

2020-06-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143564#comment-17143564
 ] 

Hyukjin Kwon commented on SPARK-31887:
--

The change that fixed this issue is likely the calendar switching done in 
SPARK-26651, which is a very big and invasive change. It is unlikely to be 
backported.
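
For anyone stuck on 2.4.x, a possible mitigation (my suggestion only, not 
something discussed in this ticket) is to keep the JVM default time zone and the 
session time zone aligned, so that collect/show and the CSV writer agree:
{code:scala}
// Possible mitigation on 2.4.x, not a confirmed fix. The JVM zone is usually
// pinned at submit time, e.g.
//   --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC
//   --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.session.timeZone", "UTC")
  .getOrCreate()
{code}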

> Date casting to string is giving wrong value
> 
>
> Key: SPARK-31887
> URL: https://issues.apache.org/jira/browse/SPARK-31887
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5
> Environment: The spark is running on cluster mode with Mesos.
>  
> Mesos agents are dockerised running on Ubuntu 18.
>  
> Timezone setting of docker instance: UTC
> Timezone of server hosting docker: America/New_York
> Timezone of driver machine: America/New_York
>Reporter: Amit Gupta
>Priority: Major
>
> The code converts the string to a date and then writes it to CSV.
> {code:java}
> val x = Seq(("2020-02-19", "2020-02-19 05:11:00")).toDF("a", 
> "b").select('a.cast("date"), 'b.cast("timestamp"))
> x.show()
> +----------+-------------------+
> |         a|                  b|
> +----------+-------------------+
> |2020-02-19|2020-02-19 05:11:00|
> +----------+-------------------+
> x.write.mode("overwrite").option("header", true).csv("/tmp/test1.csv")
> {code}
>  
> The date written in the CSV file is different:
> {code:java}
> > snakebite cat "/tmp/test1.csv/*.csv"
> a,b
> 2020-02-18,2020-02-19T05:11:00.000Z{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27281) Wrong latest offsets returned by DirectKafkaInputDStream#latestOffsets

2020-06-23 Thread Yuanyuan Xia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143562#comment-17143562
 ] 

Yuanyuan Xia commented on SPARK-27281:
--

In our environment we encounter the same issue, and the cause also seems to be 
related to [KAFKA-7703|https://issues.apache.org/jira/browse/KAFKA-7703]

> Wrong latest offsets returned by DirectKafkaInputDStream#latestOffsets
> --
>
> Key: SPARK-27281
> URL: https://issues.apache.org/jira/browse/SPARK-27281
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0
>Reporter: Viacheslav Krot
>Priority: Major
>
> I have a very strange and hard-to-reproduce issue when using Kafka direct 
> streaming, version 2.4.0.
>  From time to time, maybe once a day to once a week, I get the following error 
> {noformat}
> java.lang.IllegalArgumentException: requirement failed: numRecords must not 
> be negative
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
> at scala.Option.orElse(Option.scala:289)
> at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> 19/01/29 13:10:00 ERROR apps.BusinessRuleEngine: Job failed. Stopping JVM
> java.lang.IllegalArgumentException: requirement failed: numRecords must not 
> be negative
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.sc

[jira] [Issue Comment Deleted] (SPARK-27281) Wrong latest offsets returned by DirectKafkaInputDStream#latestOffsets

2020-06-23 Thread Yuanyuan Xia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanyuan Xia updated SPARK-27281:
-
Comment: was deleted

(was: In our environment, we encounter the same issue and the cause seems also 
related to [KAFKA-7703|https://issues.apache.org/jira/browse/KAFKA-7703])

> Wrong latest offsets returned by DirectKafkaInputDStream#latestOffsets
> --
>
> Key: SPARK-27281
> URL: https://issues.apache.org/jira/browse/SPARK-27281
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0
>Reporter: Viacheslav Krot
>Priority: Major
>
> I have a very strange and hard-to-reproduce issue when using Kafka direct 
> streaming, version 2.4.0.
>  From time to time, maybe once a day to once a week, I get the following error 
> {noformat}
> java.lang.IllegalArgumentException: requirement failed: numRecords must not 
> be negative
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
> at scala.Option.orElse(Option.scala:289)
> at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> 19/01/29 13:10:00 ERROR apps.BusinessRuleEngine: Job failed. Stopping JVM
> java.lang.IllegalArgumentException: requirement failed: numRecords must not 
> be negative
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
> at 
> org.apac

[jira] [Commented] (SPARK-27281) Wrong latest offsets returned by DirectKafkaInputDStream#latestOffsets

2020-06-23 Thread Yuanyuan Xia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143561#comment-17143561
 ] 

Yuanyuan Xia commented on SPARK-27281:
--

In our environment we encounter the same issue, and the cause also seems to be 
related to [KAFKA-7703|https://issues.apache.org/jira/browse/KAFKA-7703]

> Wrong latest offsets returned by DirectKafkaInputDStream#latestOffsets
> --
>
> Key: SPARK-27281
> URL: https://issues.apache.org/jira/browse/SPARK-27281
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0
>Reporter: Viacheslav Krot
>Priority: Major
>
> I have a very strange and hard-to-reproduce issue when using Kafka direct 
> streaming, version 2.4.0.
>  From time to time, maybe once a day to once a week, I get the following error 
> {noformat}
> java.lang.IllegalArgumentException: requirement failed: numRecords must not 
> be negative
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
> at scala.Option.orElse(Option.scala:289)
> at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> 19/01/29 13:10:00 ERROR apps.BusinessRuleEngine: Job failed. Stopping JVM
> java.lang.IllegalArgumentException: requirement failed: numRecords must not 
> be negative
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:250)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.sc

[jira] [Resolved] (SPARK-32050) GBTClassifier not working with OnevsRest

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32050.
--
Resolution: Duplicate

> GBTClassifier not working with OnevsRest
> 
>
> Key: SPARK-32050
> URL: https://issues.apache.org/jira/browse/SPARK-32050
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
> Environment: spark 2.4.0
>Reporter: Raghuvarran V H
>Priority: Minor
>
> I am trying to use GBT classifier for multi class classification using 
> OnevsRest
>  
> {code:java}
> from pyspark.ml.classification import 
> MultilayerPerceptronClassifier,OneVsRest,GBTClassifier
> from pyspark.ml import Pipeline,PipelineModel
> lr = GBTClassifier(featuresCol='features', labelCol='label', 
> predictionCol='prediction', maxDepth=5,   
>
> maxBins=32,minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, 
> cacheNodeIds=False,checkpointInterval=10, lossType='logistic', 
> maxIter=20,stepSize=0.1, seed=None,subsamplingRate=1.0, 
> featureSubsetStrategy='auto')
> classifier = OneVsRest(featuresCol='features', labelCol='label', 
> predictionCol='prediction', classifier=lr,    weightCol=None,parallelism=1)
> pipeline = Pipeline(stages=[str_indxr,ohe,vecAssembler,normalizer,classifier])
> model = pipeline.fit(train_data)
> {code}
>  
>  
> When I try this I get this error:
> /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/ml/classification.py
>  in _fit(self, dataset)
>  1800 classifier = self.getClassifier()
>  1801 assert isinstance(classifier, HasRawPredictionCol),\
>  -> 1802 "Classifier %s doesn't extend from HasRawPredictionCol." % 
> type(classifier)
>  1803 
>  1804 numClasses = int(dataset.agg(\{labelCol: 
> "max"}).head()["max("+labelCol+")"]) + 1
> AssertionError: Classifier  
> doesn't extend from HasRawPredictionCol.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32051) Dataset.foreachPartition returns object

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32051:
-
Priority: Major  (was: Critical)

> Dataset.foreachPartition returns object
> ---
>
> Key: SPARK-32051
> URL: https://issues.apache.org/jira/browse/SPARK-32051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Frank Oosterhuis
>Priority: Major
>
> I'm trying to map values from the Dataset[Row], but since 3.0.0 this fails.
> In 3.0.0 I'm dealing with an error: "Error:(28, 38) value map is not a member 
> of Object"
>  
> This is the simplest code that works in 2.4.x, but fails in 3.0.0:
> {code:scala}
> spark.range(100)
>   .repartition(10)
>   .foreachPartition(part => println(part.toList))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32053) pyspark save of serialized model is failing for windows.

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32053.
--
Resolution: Incomplete

> pyspark save of serialized model is failing for windows.
> 
>
> Key: SPARK-32053
> URL: https://issues.apache.org/jira/browse/SPARK-32053
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Kayal
>Priority: Major
> Attachments: image-2020-06-22-18-19-32-236.png
>
>
> {color:#172b4d}Hi, {color}
> {color:#172b4d}We are using Spark functionality to save the serialized model 
> to disk. On the Windows platform, we are seeing that saving the serialized 
> model fails with the error:  o288.save() failed. {color}
>  
>  
>  
> !image-2020-06-22-18-19-32-236.png!
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32053) pyspark save of serialized model is failing for windows.

2020-06-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143557#comment-17143557
 ] 

Hyukjin Kwon commented on SPARK-32053:
--

[~kaganesa] Spark 2.3.0 is EOL so we won't be able to land any fix. Can you see 
if this issue still persists in higher versions?
Also, it would be great if you share the full reproducer and full error 
messages.

> pyspark save of serialized model is failing for windows.
> 
>
> Key: SPARK-32053
> URL: https://issues.apache.org/jira/browse/SPARK-32053
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Kayal
>Priority: Major
> Attachments: image-2020-06-22-18-19-32-236.png
>
>
> {color:#172b4d}Hi, {color}
> {color:#172b4d}We are using Spark functionality to save the serialized model 
> to disk. On the Windows platform, we are seeing that saving the serialized 
> model fails with the error:  o288.save() failed. {color}
>  
>  
>  
> !image-2020-06-22-18-19-32-236.png!
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32068) Spark 3 UI task launch time show in error time zone

2020-06-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143556#comment-17143556
 ] 

Hyukjin Kwon commented on SPARK-32068:
--

[~d87904488] can you attach the snapshots?

> Spark 3 UI task launch time show in error time zone
> ---
>
> Key: SPARK-32068
> URL: https://issues.apache.org/jira/browse/SPARK-32068
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Smith Cruise
>Priority: Major
>  Labels: easyfix
>
> For example,
> In this link: history/app-20200623133209-0015/stages/ , stage submit time is 
> correct (UTS)
>  
> But in this link: 
> history/app-20200623133209-0015/stages/stage/?id=0&attempt=0 , task launch 
> time is incorrect(UTC)
>  
> The same problem exists in port 4040 Web UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32068) Spark 3 UI task launch time show in error time zone

2020-06-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143556#comment-17143556
 ] 

Hyukjin Kwon edited comment on SPARK-32068 at 6/24/20, 6:24 AM:


[~d87904488] can you attach the screenshots?


was (Author: hyukjin.kwon):
[~d87904488] can you attach the snapshots?

> Spark 3 UI task launch time show in error time zone
> ---
>
> Key: SPARK-32068
> URL: https://issues.apache.org/jira/browse/SPARK-32068
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Smith Cruise
>Priority: Major
>  Labels: easyfix
>
> For example,
> In this link: history/app-20200623133209-0015/stages/ , stage submit time is 
> correct (UTS)
>  
> But in this link: 
> history/app-20200623133209-0015/stages/stage/?id=0&attempt=0 , task launch 
> time is incorrect(UTC)
>  
> The same problem exists in port 4040 Web UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32081) facing Invalid UTF-32 character v2.4.5 running pyspark

2020-06-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143554#comment-17143554
 ] 

Hyukjin Kwon commented on SPARK-32081:
--

Please don't just copy and paste the errors. The error message says the 
encoding of your file is wrong:

{code}
java.io.CharConversionException: Invalid UTF-32 character 0x100(above 
10) at char #206, byte
{code}
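
If the files really are in UTF-32 (or any other non-UTF-8 encoding), the 
encoding can be declared when reading. A sketch, assuming Spark 2.4+ and a 
made-up path:
{code:scala}
// Sketch only; the path and the concrete encoding are assumptions.
val df = spark.read
  .option("multiLine", "true")     // multiLine mode sidesteps lineSep detection for non-UTF-8 files
  .option("encoding", "UTF-32BE")  // whatever the files are actually encoded in
  .json("/path/to/files/*.json")
{code}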

> facing Invalid UTF-32 character v2.4.5 running pyspark
> --
>
> Key: SPARK-32081
> URL: https://issues.apache.org/jira/browse/SPARK-32081
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 2.4.5
>Reporter: Yaniv Kempler
>Priority: Major
>
> facing Invalid UTF-32 character while reading json files
>  
> Py4JJavaError Traceback (most recent call last)  in  
> ~/.local/lib/python3.6/site-packages/pyspark/sql/readwriter.py in json(self, 
> path, schema, primitivesAsString, prefersDecimal, allowComments, 
> allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, 
> allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord, 
> dateFormat, timestampFormat, multiLine, allowUnquotedControlChars, lineSep, 
> samplingRatio, dropFieldIfAllNull, encoding)  284 keyed._bypass_serializer = 
> True  285 jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString()) --> 286 
> return self._df(self._jreader.json(jrdd))  287 else:  288 raise 
> TypeError("path can be only string, list or RDD") 
> ~/.local/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, 
> *args)  1255 answer = self.gateway_client.send_command(command)  1256 
> return_value = get_return_value( -> 1257 answer, self.gateway_client, 
> self.target_id, self.name)  1258  1259 for temp_arg in temp_args: 
> ~/.local/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)  
> 61 def deco(*a, **kw):  62 try: ---> 63 return f(*a, **kw)  64 except 
> py4j.protocol.Py4JJavaError as e:  65 s = e.java_exception.toString() 
> ~/.local/lib/python3.6/site-packages/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)  326 raise 
> Py4JJavaError(  327 "An error occurred while calling \{0}{1}\{2}.\n". --> 328 
> format(target_id, ".", name), value)  329 else:  330 raise Py4JError( 
> Py4JJavaError: An error occurred while calling o67.json. : 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 546 
> in stage 0.0 failed 4 times, most recent failure: Lost task 546.3 in stage 
> 0.0 (TID 642, 172.31.30.196, executor 1): java.io.CharConversionException: 
> Invalid UTF-32 character 0x100(above 10) at char #206, byte #827) at 
> com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) 
> at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
>  at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2017)
>  at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:577)
>  at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:56)
>  at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:55)
>  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543) at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:55)
>  at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:53)
>  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at 
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:891) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at 
> scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185) 
> at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1334) at 
> scala.collection.TraversableOnce$class.reduceLeftOption(TraversableOnce.scala:203)
>  at scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1334) 
> at 
> scala.collection.TraversableOnce$class.reduceOption(TraversableOnce.scala:210)
>  at scala.collection.AbstractIterator.reduceOption(Iterator.scala:1334) at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:70)
>  at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:50)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1

[jira] [Resolved] (SPARK-32081) facing Invalid UTF-32 character v2.4.5 running pyspark

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32081.
--
Resolution: Cannot Reproduce

> facing Invalid UTF-32 character v2.4.5 running pyspark
> --
>
> Key: SPARK-32081
> URL: https://issues.apache.org/jira/browse/SPARK-32081
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 2.4.5
>Reporter: Yaniv Kempler
>Priority: Major
>
> facing Invalid UTF-32 character while reading json files
>  
> Py4JJavaError Traceback (most recent call last)  in  
> ~/.local/lib/python3.6/site-packages/pyspark/sql/readwriter.py in json(self, 
> path, schema, primitivesAsString, prefersDecimal, allowComments, 
> allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, 
> allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord, 
> dateFormat, timestampFormat, multiLine, allowUnquotedControlChars, lineSep, 
> samplingRatio, dropFieldIfAllNull, encoding)  284 keyed._bypass_serializer = 
> True  285 jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString()) --> 286 
> return self._df(self._jreader.json(jrdd))  287 else:  288 raise 
> TypeError("path can be only string, list or RDD") 
> ~/.local/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, 
> *args)  1255 answer = self.gateway_client.send_command(command)  1256 
> return_value = get_return_value( -> 1257 answer, self.gateway_client, 
> self.target_id, self.name)  1258  1259 for temp_arg in temp_args: 
> ~/.local/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)  
> 61 def deco(*a, **kw):  62 try: ---> 63 return f(*a, **kw)  64 except 
> py4j.protocol.Py4JJavaError as e:  65 s = e.java_exception.toString() 
> ~/.local/lib/python3.6/site-packages/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)  326 raise 
> Py4JJavaError(  327 "An error occurred while calling \{0}{1}\{2}.\n". --> 328 
> format(target_id, ".", name), value)  329 else:  330 raise Py4JError( 
> Py4JJavaError: An error occurred while calling o67.json. : 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 546 
> in stage 0.0 failed 4 times, most recent failure: Lost task 546.3 in stage 
> 0.0 (TID 642, 172.31.30.196, executor 1): java.io.CharConversionException: 
> Invalid UTF-32 character 0x100(above 10) at char #206, byte #827) at 
> com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) 
> at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
>  at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2017)
>  at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:577)
>  at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:56)
>  at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:55)
>  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543) at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:55)
>  at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:53)
>  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at 
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:891) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at 
> scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185) 
> at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1334) at 
> scala.collection.TraversableOnce$class.reduceLeftOption(TraversableOnce.scala:203)
>  at scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1334) 
> at 
> scala.collection.TraversableOnce$class.reduceOption(TraversableOnce.scala:210)
>  at scala.collection.AbstractIterator.reduceOption(Iterator.scala:1334) at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:70)
>  at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:50)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:310) 

[jira] [Updated] (SPARK-32081) facing Invalid UTF-32 character v2.4.5 running pyspark

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32081:
-
Priority: Major  (was: Blocker)

> facing Invalid UTF-32 character v2.4.5 running pyspark
> --
>
> Key: SPARK-32081
> URL: https://issues.apache.org/jira/browse/SPARK-32081
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 2.4.5
>Reporter: Yaniv Kempler
>Priority: Major
>
> facing Invalid UTF-32 character while reading json files
>  
> Py4JJavaError Traceback (most recent call last)  in  
> ~/.local/lib/python3.6/site-packages/pyspark/sql/readwriter.py in json(self, 
> path, schema, primitivesAsString, prefersDecimal, allowComments, 
> allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, 
> allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord, 
> dateFormat, timestampFormat, multiLine, allowUnquotedControlChars, lineSep, 
> samplingRatio, dropFieldIfAllNull, encoding)  284 keyed._bypass_serializer = 
> True  285 jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString()) --> 286 
> return self._df(self._jreader.json(jrdd))  287 else:  288 raise 
> TypeError("path can be only string, list or RDD") 
> ~/.local/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, 
> *args)  1255 answer = self.gateway_client.send_command(command)  1256 
> return_value = get_return_value( -> 1257 answer, self.gateway_client, 
> self.target_id, self.name)  1258  1259 for temp_arg in temp_args: 
> ~/.local/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)  
> 61 def deco(*a, **kw):  62 try: ---> 63 return f(*a, **kw)  64 except 
> py4j.protocol.Py4JJavaError as e:  65 s = e.java_exception.toString() 
> ~/.local/lib/python3.6/site-packages/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)  326 raise 
> Py4JJavaError(  327 "An error occurred while calling \{0}{1}\{2}.\n". --> 328 
> format(target_id, ".", name), value)  329 else:  330 raise Py4JError( 
> Py4JJavaError: An error occurred while calling o67.json. : 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 546 
> in stage 0.0 failed 4 times, most recent failure: Lost task 546.3 in stage 
> 0.0 (TID 642, 172.31.30.196, executor 1): java.io.CharConversionException: 
> Invalid UTF-32 character 0x100(above 10) at char #206, byte #827) at 
> com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) 
> at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
>  at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2017)
>  at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:577)
>  at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:56)
>  at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:55)
>  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543) at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:55)
>  at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:53)
>  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at 
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:891) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at 
> scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185) 
> at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1334) at 
> scala.collection.TraversableOnce$class.reduceLeftOption(TraversableOnce.scala:203)
>  at scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1334) 
> at 
> scala.collection.TraversableOnce$class.reduceOption(TraversableOnce.scala:210)
>  at scala.collection.AbstractIterator.reduceOption(Iterator.scala:1334) at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:70)
>  at 
> org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:50)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:310)

[jira] [Commented] (SPARK-31998) Change package references for ArrowBuf

2020-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143553#comment-17143553
 ] 

Apache Spark commented on SPARK-31998:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/28915

> Change package references for ArrowBuf
> --
>
> Key: SPARK-31998
> URL: https://issues.apache.org/jira/browse/SPARK-31998
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Priority: Major
>
> Recently, we have moved class ArrowBuf from package io.netty.buffer to 
> org.apache.arrow.memory. So after upgrading Arrow library, we need to update 
> the references to ArrowBuf with the correct package name.
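
A minimal sketch of what the reference update looks like on the consuming side 
(only the package move is taken from the description above; the helper is 
illustrative):
{code:scala}
// Before the Arrow upgrade the class lived in a Netty package:
//   import io.netty.buffer.ArrowBuf
// After the upgrade it comes from Arrow's own memory package:
import org.apache.arrow.memory.ArrowBuf

// Illustrative helper, assuming the usual ArrowBuf accessor API.
def sizeInBytes(buf: ArrowBuf): Long = buf.capacity()
{code}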



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31998) Change package references for ArrowBuf

2020-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31998:


Assignee: (was: Apache Spark)

> Change package references for ArrowBuf
> --
>
> Key: SPARK-31998
> URL: https://issues.apache.org/jira/browse/SPARK-31998
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Priority: Major
>
> Recently, we have moved class ArrowBuf from package io.netty.buffer to 
> org.apache.arrow.memory. So after upgrading Arrow library, we need to update 
> the references to ArrowBuf with the correct package name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31998) Change package references for ArrowBuf

2020-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143551#comment-17143551
 ] 

Apache Spark commented on SPARK-31998:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/28915

> Change package references for ArrowBuf
> --
>
> Key: SPARK-31998
> URL: https://issues.apache.org/jira/browse/SPARK-31998
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Priority: Major
>
> Recently, we moved the class ArrowBuf from the package io.netty.buffer to 
> org.apache.arrow.memory. So after upgrading the Arrow library, we need to 
> update the references to ArrowBuf to use the correct package name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31998) Change package references for ArrowBuf

2020-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31998:


Assignee: Apache Spark

> Change package references for ArrowBuf
> --
>
> Key: SPARK-31998
> URL: https://issues.apache.org/jira/browse/SPARK-31998
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Assignee: Apache Spark
>Priority: Major
>
> Recently, we moved the class ArrowBuf from the package io.netty.buffer to 
> org.apache.arrow.memory. So after upgrading the Arrow library, we need to 
> update the references to ArrowBuf to use the correct package name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32081) facing Invalid UTF-32 character v2.4.5 running pyspark

2020-06-23 Thread Yaniv Kempler (Jira)
Yaniv Kempler created SPARK-32081:
-

 Summary: facing Invalid UTF-32 character v2.4.5 running pyspark
 Key: SPARK-32081
 URL: https://issues.apache.org/jira/browse/SPARK-32081
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 2.4.5
Reporter: Yaniv Kempler


Facing an "Invalid UTF-32 character" error while reading JSON files.

 

Py4JJavaError                             Traceback (most recent call last)
 in 

~/.local/lib/python3.6/site-packages/pyspark/sql/readwriter.py in json(self, path, schema, primitivesAsString, prefersDecimal, allowComments, allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZero, allowBackslashEscapingAnyCharacter, mode, columnNameOfCorruptRecord, dateFormat, timestampFormat, multiLine, allowUnquotedControlChars, lineSep, samplingRatio, dropFieldIfAllNull, encoding)
    284             keyed._bypass_serializer = True
    285             jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
--> 286             return self._df(self._jreader.json(jrdd))
    287         else:
    288             raise TypeError("path can be only string, list or RDD")

~/.local/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

~/.local/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

~/.local/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o67.json. :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 546 in 
stage 0.0 failed 4 times, most recent failure: Lost task 546.3 in stage 0.0 
(TID 642, 172.31.30.196, executor 1): java.io.CharConversionException: Invalid 
UTF-32 character 0x100(above 10) at char #206, byte #827) at 
com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) 
at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
 at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2017)
 at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:577)
 at 
org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:56)
 at 
org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(JsonInferSchema.scala:55)
 at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543) at 
org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:55)
 at 
org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1$$anonfun$apply$1.apply(JsonInferSchema.scala:53)
 at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at 
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at 
scala.collection.Iterator$class.foreach(Iterator.scala:891) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at 
scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185) at 
scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1334) at 
scala.collection.TraversableOnce$class.reduceLeftOption(TraversableOnce.scala:203)
 at scala.collection.AbstractIterator.reduceLeftOption(Iterator.scala:1334) at 
scala.collection.TraversableOnce$class.reduceOption(TraversableOnce.scala:210) 
at scala.collection.AbstractIterator.reduceOption(Iterator.scala:1334) at 
org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:70)
 at 
org.apache.spark.sql.catalyst.json.JsonInferSchema$$anonfun$1.apply(JsonInferSchema.scala:50)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.run(Task.scala:123) at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at 
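A possible workaround (not from the report), assuming the files are actually UTF-8 and Jackson's UTF-32 auto-detection is misfiring, is to pin the encoding and keep malformed records instead of failing the task:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hedged sketch only: pin the encoding instead of relying on auto-detection,
// and tolerate malformed records rather than aborting the job.
// The path is a placeholder.
val spark = SparkSession.builder().appName("json-encoding-workaround").getOrCreate()
val df = spark.read
  .option("encoding", "UTF-8")   // assumption about the real file encoding
  .option("mode", "PERMISSIVE")  // keep rows Spark cannot parse
  .json("/path/to/json")
df.printSchema()
{code}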

[jira] [Resolved] (SPARK-32062) Reset listenerRegistered in SparkSession

2020-06-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32062.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28899
[https://github.com/apache/spark/pull/28899]

> Reset listenerRegistered in SparkSession
> 
>
> Key: SPARK-32062
> URL: https://issues.apache.org/jira/browse/SPARK-32062
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32062) Reset listenerRegistered in SparkSession

2020-06-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32062:
---

Assignee: ulysses you

> Reset listenerRegistered in SparkSession
> 
>
> Key: SPARK-32062
> URL: https://issues.apache.org/jira/browse/SPARK-32062
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32072) Unaligned benchmark results

2020-06-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32072:
---

Assignee: Maxim Gekk

> Unaligned benchmark results 
> 
>
> Key: SPARK-32072
> URL: https://issues.apache.org/jira/browse/SPARK-32072
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> If a benchmark name is longer than 40 characters, the benchmark results are 
> not aligned with the column names. For example:
> {code}
> OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 
> 4.15.0-1044-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> make_timestamp(): Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> prepare make_timestamp()   3636   3673
>   38  0.33635.7   1.0X
> make_timestamp(2019, 1, 2, 3, 4, 50.123456) 94 99 
>   4 10.7  93.8  38.8X
> make_timestamp(2019, 1, 2, 3, 4, 60.00) 68 80 
>  13 14.6  68.3  53.2X
> make_timestamp(2019, 12, 31, 23, 59, 60.00) 65 79 
>  19 15.3  65.3  55.7X
> make_timestamp(*, *, *, 3, 4, 50.123456)271280
>   14  3.7 270.7  13.4X
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32072) Unaligned benchmark results

2020-06-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32072.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28906
[https://github.com/apache/spark/pull/28906]

> Unaligned benchmark results 
> 
>
> Key: SPARK-32072
> URL: https://issues.apache.org/jira/browse/SPARK-32072
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> If a benchmark name is longer than 40 characters, the benchmark results are 
> not aligned with the column names. For example:
> {code}
> OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 
> 4.15.0-1044-aws
> Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
> make_timestamp(): Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> prepare make_timestamp()   3636   3673
>   38  0.33635.7   1.0X
> make_timestamp(2019, 1, 2, 3, 4, 50.123456) 94 99 
>   4 10.7  93.8  38.8X
> make_timestamp(2019, 1, 2, 3, 4, 60.00) 68 80 
>  13 14.6  68.3  53.2X
> make_timestamp(2019, 12, 31, 23, 59, 60.00) 65 79 
>  19 15.3  65.3  55.7X
> make_timestamp(*, *, *, 3, 4, 50.123456)271280
>   14  3.7 270.7  13.4X
> {code}
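One plausible shape of the fix (a sketch only, not necessarily what the linked pull request does): derive the name-column width from the longest benchmark name instead of hard-coding 40 characters.

{code:scala}
// Minimal sketch: pad the name column to the longest benchmark name so the
// value columns keep lining up with the header.
val rows = Seq(
  ("prepare make_timestamp()", Seq("3636", "3673", "38")),
  ("make_timestamp(2019, 12, 31, 23, 59, 60.00)", Seq("65", "79", "19")))

val nameWidth = math.max(40, rows.map(_._1.length).max + 2)
rows.foreach { case (name, cols) =>
  println(name.padTo(nameWidth, ' ') + cols.map(c => f"$c%15s").mkString)
}
{code}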



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32075) Fix a few issues in parameters table

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32075.
--
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/28910

> Fix a few issues in parameters table
> 
>
> Key: SPARK-32075
> URL: https://issues.apache.org/jira/browse/SPARK-32075
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Zuo Dao
>Priority: Trivial
> Fix For: 3.0.1, 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32075) Fix a few issues in parameters table

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32075:


Assignee: Zuo Dao

> Fix a few issues in parameters table
> 
>
> Key: SPARK-32075
> URL: https://issues.apache.org/jira/browse/SPARK-32075
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Zuo Dao
>Assignee: Zuo Dao
>Priority: Trivial
> Fix For: 3.0.1, 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32080) Simplify ArrowColumnVector ListArray accessor

2020-06-23 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-32080:
-
Priority: Trivial  (was: Major)

> Simplify ArrowColumnVector ListArray accessor
> -
>
> Key: SPARK-32080
> URL: https://issues.apache.org/jira/browse/SPARK-32080
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Trivial
>
> The ArrowColumnVector ListArray accessor calculates start and end offset 
> indices manually. There were APIs added in Arrow 0.15.0 that do this and 
> using them will simplify this code and make use of more stable APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32080) Simplify ArrowColumnVector ListArray accessor

2020-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143479#comment-17143479
 ] 

Apache Spark commented on SPARK-32080:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/28915

> Simplify ArrowColumnVector ListArray accessor
> -
>
> Key: SPARK-32080
> URL: https://issues.apache.org/jira/browse/SPARK-32080
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> The ArrowColumnVector ListArray accessor calculates start and end offset 
> indices manually. There were APIs added in Arrow 0.15.0 that do this and 
> using them will simplify this code and make use of more stable APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32080) Simplify ArrowColumnVector ListArray accessor

2020-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32080:


Assignee: Apache Spark

> Simplify ArrowColumnVector ListArray accessor
> -
>
> Key: SPARK-32080
> URL: https://issues.apache.org/jira/browse/SPARK-32080
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Major
>
> The ArrowColumnVector ListArray accessor calculates start and end offset 
> indices manually. There were APIs added in Arrow 0.15.0 that do this and 
> using them will simplify this code and make use of more stable APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32080) Simplify ArrowColumnVector ListArray accessor

2020-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32080:


Assignee: (was: Apache Spark)

> Simplify ArrowColumnVector ListArray accessor
> -
>
> Key: SPARK-32080
> URL: https://issues.apache.org/jira/browse/SPARK-32080
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> The ArrowColumnVector ListArray accessor calculates start and end offset 
> indices manually. There were APIs added in Arrow 0.15.0 that do this and 
> using them will simplify this code and make use of more stable APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32080) Simplify ArrowColumnVector ListArray accessor

2020-06-23 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-32080:


 Summary: Simplify ArrowColumnVector ListArray accessor
 Key: SPARK-32080
 URL: https://issues.apache.org/jira/browse/SPARK-32080
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Bryan Cutler


The ArrowColumnVector ListArray accessor calculates start and end offset 
indices manually. There were APIs added in Arrow 0.15.0 that do this and using 
them will simplify this code and make use of more stable APIs.
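A rough sketch of the simplification (the getElementStartIndex/getElementEndIndex names are assumed from the Arrow 0.15.0 Java API; Spark's real accessor lives in ArrowColumnVector):

{code:scala}
import org.apache.arrow.vector.complex.ListVector

// Hedged sketch: fetch the logical bounds of the list at `rowId` via the
// Arrow 0.15.0 element-index API instead of decoding the offset buffer manually.
def elementRange(vector: ListVector, rowId: Int): (Int, Int) = {
  val start = vector.getElementStartIndex(rowId) // assumed API name
  val end = vector.getElementEndIndex(rowId)     // assumed API name
  (start, end)
}
{code}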



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32028) App id link in history summary page point to wrong application attempt

2020-06-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32028.
--
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 28867
[https://github.com/apache/spark/pull/28867]

> App id link in history summary page point to wrong application attempt
> --
>
> Key: SPARK-32028
> URL: https://issues.apache.org/jira/browse/SPARK-32028
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4, 3.0.0, 3.1.0
>Reporter: Zhen Li
>Assignee: Zhen Li
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
> Attachments: multi_same.JPG, wrong_attemptJPG.JPG
>
>
> The app id link URL in the history summary page is wrong for the 
> multi-attempt case. For details, please see the attached screenshots.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32028) App id link in history summary page point to wrong application attempt

2020-06-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-32028:


Assignee: Zhen Li

> App id link in history summary page point to wrong application attempt
> --
>
> Key: SPARK-32028
> URL: https://issues.apache.org/jira/browse/SPARK-32028
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4, 3.0.0, 3.1.0
>Reporter: Zhen Li
>Assignee: Zhen Li
>Priority: Minor
> Attachments: multi_same.JPG, wrong_attemptJPG.JPG
>
>
> The app id link URL in the history summary page is wrong for the 
> multi-attempt case. For details, please see the attached screenshots.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32073) Drop R < 3.5 support

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32073.
--
Fix Version/s: 3.1.0
   2.4.7
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 28908
[https://github.com/apache/spark/pull/28908]

> Drop R < 3.5 support
> 
>
> Key: SPARK-32073
> URL: https://issues.apache.org/jira/browse/SPARK-32073
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.0.1, 2.4.7, 3.1.0
>
>
> Spark 3.0.0 is built with R 3.6.3, which does not support R < 3.5:
> {code}
> Error in readRDS(pfile) : cannot read workspace version 3 written by R 3.6.3; 
> need R 3.5.0 or newer version.
> {code}
> In fact, with SPARK-31918, we will have to drop R < 3.5 entirely to support R 
> 4.0.0.
> This JIRA targets dropping R < 3.5 support in SparkR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32073) Drop R < 3.5 support

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32073:


Assignee: Hyukjin Kwon

> Drop R < 3.5 support
> 
>
> Key: SPARK-32073
> URL: https://issues.apache.org/jira/browse/SPARK-32073
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: releasenotes
>
> Spark 3.0.0 is built with R 3.6.3, which does not support R < 3.5:
> {code}
> Error in readRDS(pfile) : cannot read workspace version 3 written by R 3.6.3; 
> need R 3.5.0 or newer version.
> {code}
> In fact, with SPARK-31918, we will have to drop R < 3.5 entirely to support R 
> 4.0.0.
> This JIRA targets dropping R < 3.5 support in SparkR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31918:


Assignee: Hyukjin Kwon

> SparkR CRAN check gives a warning with R 4.0.0 on OSX
> -
>
> Key: SPARK-31918
> URL: https://issues.apache.org/jira/browse/SPARK-31918
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Shivaram Venkataraman
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> When the SparkR package is run through a CRAN check (i.e. with something like 
> R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR 
> vignette as a part of the checks.
> However this seems to be failing with R 4.0.0 on OSX -- both on my local 
> machine and on CRAN 
> https://cran.r-project.org/web/checks/check_results_SparkR.html
> cc [~felixcheung]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31918.
--
Fix Version/s: 3.1.0
   2.4.7
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 28907
[https://github.com/apache/spark/pull/28907]

> SparkR CRAN check gives a warning with R 4.0.0 on OSX
> -
>
> Key: SPARK-31918
> URL: https://issues.apache.org/jira/browse/SPARK-31918
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Shivaram Venkataraman
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Fix For: 3.0.1, 2.4.7, 3.1.0
>
>
> When the SparkR package is run through a CRAN check (i.e. with something like 
> R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR 
> vignette as a part of the checks.
> However this seems to be failing with R 4.0.0 on OSX -- both on my local 
> machine and on CRAN 
> https://cran.r-project.org/web/checks/check_results_SparkR.html
> cc [~felixcheung]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32073) Drop R < 3.5 support

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32073:
-
Labels: releasenotes  (was: )

> Drop R < 3.5 support
> 
>
> Key: SPARK-32073
> URL: https://issues.apache.org/jira/browse/SPARK-32073
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: releasenotes
>
> Spark 3.0.0 is built with R 3.6.3, which does not support R < 3.5:
> {code}
> Error in readRDS(pfile) : cannot read workspace version 3 written by R 3.6.3; 
> need R 3.5.0 or newer version.
> {code}
> In fact, with SPARK-31918, we will have to drop R < 3.5 entirely to support R 
> 4.0.0.
> This JIRA targets dropping R < 3.5 support in SparkR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32079) PySpark <> Beam pickling issues for collections.namedtuple

2020-06-23 Thread Gerard Casas Saez (Jira)
Gerard Casas Saez created SPARK-32079:
-

 Summary: PySpark <> Beam pickling issues for collections.namedtuple
 Key: SPARK-32079
 URL: https://issues.apache.org/jira/browse/SPARK-32079
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Gerard Casas Saez


PySpark's monkeypatching of namedtuple makes it difficult or impossible to unpickle 
collections.namedtuple instances from outside of a PySpark environment.

Once PySpark has been loaded into the environment, any namedtuple you pickle can 
only be unpickled in an environment where the 
[hijack|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L385]
 has been applied. 

This conflicts directly with using Beam from a non-Spark environment (namely 
Flink or Dataflow), making it impossible to run the pipeline if it has a 
namedtuple loaded somewhere. 

 
{code:python}
import collections
import dill

ColumnInfo = collections.namedtuple(
    "ColumnInfo",
    [
        "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
        "type",  # type: Optional[ColumnType]  # pytype: disable=ignored-type-comment
    ])

dill.dumps(ColumnInfo('test', int))
{code}

{{b'\x80\x03cdill._dill\n_create_namedtuple\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x08\x00\x00\x00__main__q\x05\x87q\x06Rq\x07X\x04\x00\x00\x00testq\x08cdill._dill\n_load_type\nq\tX\x03\x00\x00\x00intq\n\x85q\x0bRq\x0c\x86q\r\x81q\x0e.'}}
{code:python}
import pyspark
import collections
import dill

ColumnInfo = collections.namedtuple(
    "ColumnInfo",
    [
        "name",  # type: ColumnName  # pytype: disable=ignored-type-comment
        "type",  # type: Optional[ColumnType]  # pytype: disable=ignored-type-comment
    ])

dill.dumps(ColumnInfo('test', int))
{code}
{{b'\x80\x03cpyspark.serializers\n_restore\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x04\x00\x00\x00testq\x05cdill._dill\n_load_type\nq\x06X\x03\x00\x00\x00intq\x07\x85q\x08Rq\t\x86q\n\x87q\x0bRq\x0c.'}}


The second pickled object can only be unpickled in an environment where PySpark is loaded. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32078) Add a redirect to sql-ref from sql-reference

2020-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143429#comment-17143429
 ] 

Apache Spark commented on SPARK-32078:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/28914

> Add a redirect to sql-ref from sql-reference
> 
>
> Key: SPARK-32078
> URL: https://issues.apache.org/jira/browse/SPARK-32078
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> A number of Google searches I’ve done today have turned up 
> [https://spark.apache.org/docs/latest/sql-reference.html], which does not 
> exist any more. Thus, we should add a redirect to sql-ref.html. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32078) Add a redirect to sql-ref from sql-reference

2020-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32078:


Assignee: Xiao Li  (was: Apache Spark)

> Add a redirect to sql-ref from sql-reference
> 
>
> Key: SPARK-32078
> URL: https://issues.apache.org/jira/browse/SPARK-32078
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> A number of Google searches I’ve done today have turned up 
> [https://spark.apache.org/docs/latest/sql-reference.html], which does not 
> exist any more. Thus, we should add a redirect to sql-ref.html. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32078) Add a redirect to sql-ref from sql-reference

2020-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32078:


Assignee: Apache Spark  (was: Xiao Li)

> Add a redirect to sql-ref from sql-reference
> 
>
> Key: SPARK-32078
> URL: https://issues.apache.org/jira/browse/SPARK-32078
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> A number of Google searches I’ve done today have turned up 
> [https://spark.apache.org/docs/latest/sql-reference.html], which does not 
> exist any more. Thus, we should add a redirect to sql-ref.html. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32078) Add a redirect to sql-ref from sql-reference

2020-06-23 Thread Xiao Li (Jira)
Xiao Li created SPARK-32078:
---

 Summary: Add a redirect to sql-ref from sql-reference
 Key: SPARK-32078
 URL: https://issues.apache.org/jira/browse/SPARK-32078
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Xiao Li
Assignee: Xiao Li


A number of Google searches I’ve done today have turned up 
[https://spark.apache.org/docs/latest/sql-reference.html], which does not exist 
any more. Thus, we should add a redirect to sql-ref.html. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23631) Add summary to RandomForestClassificationModel

2020-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143392#comment-17143392
 ] 

Apache Spark commented on SPARK-23631:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28913

> Add summary to RandomForestClassificationModel
> --
>
> Key: SPARK-23631
> URL: https://issues.apache.org/jira/browse/SPARK-23631
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Evan Zamir
>Priority: Major
>  Labels: bulk-closed
>
> I'm using the RandomForestClassificationModel and noticed that there is no 
> summary attribute like there is for LogisticRegressionModel. Specifically, 
> I'd like to have the ROC and PR curves. Is that on the Spark roadmap 
> anywhere? Is there a reason it hasn't been implemented?
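For reference, the analogue mentioned above: LogisticRegressionModel already exposes a training summary with ROC/PR curves. A small sketch on toy data (local SparkSession assumed):

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("lr-summary").getOrCreate()
import spark.implicits._

// Toy training data; the point is only to show the summary API shape.
val training = Seq(
  (1.0, Vectors.dense(0.0, 1.1)), (0.0, Vectors.dense(2.0, 1.0)),
  (1.0, Vectors.dense(0.1, 1.2)), (0.0, Vectors.dense(2.1, 0.9))
).toDF("label", "features")

val model = new LogisticRegression().fit(training)
model.binarySummary.roc.show()              // ROC curve as a DataFrame
println(model.binarySummary.areaUnderROC)   // scalar AUC
{code}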



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32067) [K8S] Executor pod template of subsequent submission inadvertently applies to ongoing submission

2020-06-23 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Description: 
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially, 
and with app2 launching while app1 is still ramping up all its executor pods. 
The unwanted result is that some launched executor pods of app1 end up having 
app2's executor pod template applied to them.

The root cause appears to be that app1's podspec-configmap got overwritten by 
app2 during the overlapping launching periods because the configmap names of 
the two apps are the same. This causes some app1's executor pods being ramped 
up after app2 is launched to be inadvertently launched with the app2's pod 
template. The issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACE   NAME   DATA   AGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACE   NAME   DATA   AGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACE   NAME   DATA   AGE
default  app1--driver-conf-map  1   11m43s
default  app1--podspec-configmap1   13m57s
default  app2--driver-conf-map  1   10s 
default  app2--podspec-configmap1   3m{code}

  was:
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially, 
and with app2 launching while app1 is still ramping up all its executor pods. 
The unwanted result is that some launched executor pods of app1 end up having 
app2's executor pod template applied to them.

The root cause appears to be that app1's podspec-configmap got overwritten by 
app2 during the overlapping launching periods because the configmap names of 
the two apps are the same. This causes some app1's executor pods being ramped 
up after app2 is launched to be inadvertently launched with the app2's pod 
template. The issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACE   NAME   DATA   AGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACE   NAME   DATA   AGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACE   NAME   DATA   AGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   3m{code}


> [K8S] Executor pod template of subsequent submission inadvertently applies to 
> ongoing submission
> 
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, and with app2 launching while app1 is still ramping up all its 
> executor pods. The unwanted result is that some launched executor pods of 
> app1 end up having app2's executor pod template applied to them.
> The root cause appears to be that app1's podspec-configmap got overwritten by 
> app2 during the overlapping launching periods because the configmap names of 
> the t

[jira] [Updated] (SPARK-32067) [K8S] Executor pod template of subsequent submission inadvertently applies to ongoing submission

2020-06-23 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Description: 
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially, 
and with app2 launching while app1 is still ramping up all its executor pods. 
The unwanted result is that some launched executor pods of app1 end up having 
app2's executor pod template applied to them.

The root cause appears to be that app1's podspec-configmap got overwritten by 
app2 during the overlapping launching periods because the configmap names of 
the two apps are the same. This causes some app1's executor pods being ramped 
up after app2 is launched to be inadvertently launched with the app2's pod 
template. The issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACE   NAME   DATA   AGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACE   NAME   DATA   AGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACE   NAME   DATA   AGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   3m{code}

  was:
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially, 
and with app2 launching while app1 is still ramping up all its executor pods. 
The unwanted result is that some launched executor pods of app1 end up having 
app2's executor pod template applied to them.

The root cause appears to be that app1's podspec-configmap got overwritten by 
app2 during the overlapping launching periods because the configmap names of 
the two apps are the same. This causes some app1's executor pods being ramped 
up after app2 is launched to be inadvertently launched with the app2's pod 
template. The issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACE   NAME   DATA   AGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACE   NAME   DATA   AGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACE   NAME   DATA   AGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   13m57s{code}


> [K8S] Executor pod template of subsequent submission inadvertently applies to 
> ongoing submission
> 
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, and with app2 launching while app1 is still ramping up all its 
> executor pods. The unwanted result is that some launched executor pods of 
> app1 end up having app2's executor pod template applied to them.
> The root cause appears to be that app1's podspec-configmap got overwritten by 
> app2 during the overlapping launching periods because the configmap names of 
> th

[jira] [Updated] (SPARK-32067) [K8S] Executor pod template of subsequent submission inadvertently applies to ongoing submission

2020-06-23 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Summary: [K8S] Executor pod template of subsequent submission inadvertently 
applies to ongoing submission  (was: [K8s] Executor pod template of subsequent 
submission inadvertently applies to ongoing submission)

> [K8S] Executor pod template of subsequent submission inadvertently applies to 
> ongoing submission
> 
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, and with app2 launching while app1 is still ramping up all its 
> executor pods. The unwanted result is that some launched executor pods of 
> app1 end up having app2's executor pod template applied to them.
> The root cause appears to be that app1's podspec-configmap got overwritten by 
> app2 during the overlapping launching periods because the configmap names of 
> the two apps are the same. This causes some app1's executor pods being ramped 
> up after app2 is launched to be inadvertently launched with the app2's pod 
> template. The issue can be seen as follows:
> First, after submitting app1, you get these configmaps:
> {code:java}
> NAMESPACE   NAME   DATA   AGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{code}
> Then submit app2 while app1 is still ramping up its executors. The 
> podspec-configmap is modified by app2.
> {code:java}
> NAMESPACE   NAME   DATA   AGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  podspec-configmap  1   13m57s{code}
>  
> PROPOSED SOLUTION:
> Properly prefix the podspec-configmap for each submitted app.
> {code:java}
> NAMESPACE   NAME   DATA   AGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  app1--podspec-configmap1   13m57s
> default  app2--podspec-configmap1   13m57s{code}
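A sketch of the proposed naming scheme (the helper is hypothetical; the real prefix handling would live in Spark's Kubernetes submission code):

{code:scala}
// Hypothetical helper illustrating the proposal: derive the configmap name from a
// per-application resource prefix instead of the shared "podspec-configmap" name.
def podSpecConfigMapName(appResourcePrefix: String): String =
  s"$appResourcePrefix-podspec-configmap"

println(podSpecConfigMapName("app1"))  // app1-podspec-configmap
println(podSpecConfigMapName("app2"))  // app2-podspec-configmap
{code}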



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32067) [K8s] Executor pod template of subsequent submission inadvertently applies to ongoing submission

2020-06-23 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Summary: [K8s] Executor pod template of subsequent submission inadvertently 
applies to ongoing submission  (was: [K8s] Pod template from subsequent 
submission inadvertently applies to ongoing submission)

> [K8s] Executor pod template of subsequent submission inadvertently applies to 
> ongoing submission
> 
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, and with app2 launching while app1 is still ramping up all its 
> executor pods. The unwanted result is that some launched executor pods of 
> app1 end up having app2's executor pod template applied to them.
> The root cause appears to be that app1's podspec-configmap got overwritten by 
> app2 during the overlapping launching periods because the configmap names of 
> the two apps are the same. This causes some app1's executor pods being ramped 
> up after app2 is launched to be inadvertently launched with the app2's pod 
> template. The issue can be seen as follows:
> First, after submitting app1, you get these configmaps:
> {code:java}
> NAMESPACE   NAME   DATA   AGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{code}
> Then submit app2 while app1 is still ramping up its executors. The 
> podspec-configmap is modified by app2.
> {code:java}
> NAMESPACE   NAME   DATA   AGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  podspec-configmap  1   13m57s{code}
>  
> PROPOSED SOLUTION:
> Properly prefix the podspec-configmap for each submitted app.
> {code:java}
> NAMESPACE   NAME   DATA   AGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  app1--podspec-configmap1   13m57s
> default  app2--podspec-configmap1   13m57s{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequent submission inadvertently applies to ongoing submission

2020-06-23 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Summary: [K8s] Pod template from subsequent submission inadvertently 
applies to ongoing submission  (was: [K8s] Pod template from subsequently 
submission inadvertently applies to ongoing submission)

> [K8s] Pod template from subsequent submission inadvertently applies to 
> ongoing submission
> -
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, and with app2 launching while app1 is still ramping up all its 
> executor pods. The unwanted result is that some launched executor pods of 
> app1 end up having app2's executor pod template applied to them.
> The root cause appears to be that app1's podspec-configmap got overwritten by 
> app2 during the overlapping launching periods because the configmap names of 
> the two apps are the same. This causes some app1's executor pods being ramped 
> up after app2 is launched to be inadvertently launched with the app2's pod 
> template. The issue can be seen as follows:
> First, after submitting app1, you get these configmaps:
> {code:java}
> NAMESPACE   NAME   DATA   AGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{code}
> Then submit app2 while app1 is still ramping up its executors. The 
> podspec-configmap is modified by app2.
> {code:java}
> NAMESPACE   NAME   DATA   AGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  podspec-configmap  1   13m57s{code}
>  
> PROPOSED SOLUTION:
> Properly prefix the podspec-configmap for each submitted app.
> {code:java}
> NAMESPACE   NAME   DATA   AGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  app1--podspec-configmap1   13m57s
> default  app2--podspec-configmap1   13m57s{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31998) Change package references for ArrowBuf

2020-06-23 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143306#comment-17143306
 ] 

Kouhei Sutou commented on SPARK-31998:
--

Yes. This change will be included in Apache Arrow 1.0.0.
Apache Arrow 1.0.0 will be released at the end of 2020-07. We'll start our 
release process at the beginning of 2020-07. It'll take a few weeks for 
verification and vote.

FYI: 
https://lists.apache.org/thread.html/re6fe67fd4cf10113f7969bc00ca6c7b4ccc8067d8512be9c7a904005%40%3Cdev.arrow.apache.org%3E

> Change package references for ArrowBuf
> --
>
> Key: SPARK-31998
> URL: https://issues.apache.org/jira/browse/SPARK-31998
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Priority: Major
>
> Recently, we moved the class ArrowBuf from the package io.netty.buffer to 
> org.apache.arrow.memory. So after upgrading the Arrow library, we need to 
> update the references to ArrowBuf to use the correct package name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32051) Dataset.foreachPartition returns object

2020-06-23 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143271#comment-17143271
 ] 

Jungtaek Lim commented on SPARK-32051:
--

[~frankivo]

Could you post the full source code for Dataset.foreach here? It looks to be 
returning a DataFrame.

> Dataset.foreachPartition returns object
> ---
>
> Key: SPARK-32051
> URL: https://issues.apache.org/jira/browse/SPARK-32051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Frank Oosterhuis
>Priority: Critical
>
> I'm trying to map values from the Dataset[Row], but since 3.0.0 this fails.
> In 3.0.0 I'm dealing with an error: "Error:(28, 38) value map is not a member 
> of Object"
>  
> This is the simplest code that works in 2.4.x, but fails in 3.0.0:
> {code:scala}
> spark.range(100)
>   .repartition(10)
>   .foreachPartition(part => println(part.toList))
> {code}
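One plausible explanation (not confirmed in this thread): with the overloaded Java ForeachPartitionFunction variant of foreachPartition, Scala may infer the lambda parameter as Object; annotating the iterator type selects the Scala overload explicitly. A sketch, assuming an existing SparkSession named spark:

{code:scala}
// spark.range produces a Dataset[java.lang.Long], hence the boxed element type.
spark.range(100)
  .repartition(10)
  .foreachPartition((part: Iterator[java.lang.Long]) => println(part.toList))
{code}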



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED state correctly

2020-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143231#comment-17143231
 ] 

Apache Spark commented on SPARK-32057:
--

User 'alismess-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/28912

> SparkExecuteStatementOperation does not set CANCELED state correctly 
> -
>
> Key: SPARK-32057
> URL: https://issues.apache.org/jira/browse/SPARK-32057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Smesseim
>Priority: Major
>
> https://github.com/apache/spark/pull/28671 changed the way cleanup is done in 
> SparkExecuteStatementOperation. In cancel(), cleanup (killing jobs) used to be 
> done after setting the state to CANCELED. Now the order is reversed: jobs are 
> killed first, causing an exception to be thrown inside execute(), so the 
> status of the operation becomes ERROR before being set to CANCELED.
> cc [~juliuszsompolski]
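The intended ordering can be sketched as follows (all names are illustrative stand-ins, not Spark's actual classes):

{code:scala}
// Illustrative-only model of the ordering issue: record CANCELED before killing
// jobs, so the failure surfaced in execute() cannot overwrite the state.
object OperationState extends Enumeration { val RUNNING, CANCELED, ERROR = Value }

final class StatementOp {
  @volatile private[this] var state = OperationState.RUNNING
  private def killJobs(): Unit = println("killing jobs ...") // stand-in for cleanup

  def cancel(): Unit = synchronized {
    state = OperationState.CANCELED // set the terminal state first
    killJobs()                      // then interrupt the running jobs
  }

  def currentState: OperationState.Value = state
}
{code}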



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32063) Spark native temporary table

2020-06-23 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143233#comment-17143233
 ] 

L. C. Hsieh commented on SPARK-32063:
-

For 1 and 2, these seem to be all about performance. In Spark, we have a caching 
mechanism that materializes a complex query. I think it can compensate for the 
shortcomings of a temporary view.
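
A brief illustration of that caching mechanism (a sketch only, assuming an existing SparkSession named spark):

{code:scala}
// Materialize a complex intermediate result once and reuse it, without a
// temporary table. The query here is a trivial stand-in.
val intermediate = spark.range(1000).selectExpr("id", "id % 10 AS bucket")
intermediate.createOrReplaceTempView("intermediate")
spark.catalog.cacheTable("intermediate")  // equivalent to SQL: CACHE TABLE intermediate

spark.sql("SELECT bucket, count(*) FROM intermediate GROUP BY bucket").show()
spark.catalog.uncacheTable("intermediate") // uncaching remains a manual step
{code}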

For 3, I'm not sure about this point. Can you elaborate it more?

> Spark native temporary table
> 
>
> Key: SPARK-32063
> URL: https://issues.apache.org/jira/browse/SPARK-32063
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Many databases and data warehouse SQL engines support temporary tables. A 
> temporary table, as its name implies, is a short-lived table whose lifetime 
> is limited to the current session.
> In Spark, there is no temporary table. The DDL “CREATE TEMPORARY TABLE AS 
> SELECT” will create a temporary view. A temporary view is totally different 
> from a temporary table. 
> A temporary view is just a VIEW. It doesn’t materialize data in storage, so 
> it has the following shortcomings:
>  # A view will not give improved performance. Materializing intermediate data 
> in temporary tables for a complex query will accelerate queries, especially 
> in an ETL pipeline.
>  # A view which calls other views can cause severe performance issues; 
> executing a very complex view may even fail in Spark. 
>  # A temporary view has no database namespace. In some complex ETL pipelines 
> or data warehouse applications, working without a database prefix is not 
> convenient; they need tables which are only used in the current session.
>  
> More details are described in [Design 
> Docs|https://docs.google.com/document/d/1RS4Q3VbxlZ_Yy0fdWgTJ-k0QxFd1dToCqpLAYvIJ34U/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED state correctly

2020-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143230#comment-17143230
 ] 

Apache Spark commented on SPARK-32057:
--

User 'alismess-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/28912

> SparkExecuteStatementOperation does not set CANCELED state correctly 
> -
>
> Key: SPARK-32057
> URL: https://issues.apache.org/jira/browse/SPARK-32057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Smesseim
>Priority: Major
>
> https://github.com/apache/spark/pull/28671 changed the way cleanup is done in 
> SparkExecuteStatementOperation. In cancel(), cleanup (killing jobs) used to be 
> done after setting the state to CANCELED. Now the order is reversed: jobs are 
> killed first, causing an exception to be thrown inside execute(), so the 
> status of the operation becomes ERROR before being set to CANCELED.
> cc [~juliuszsompolski]
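To make the ordering concrete, a hedged, self-contained sketch (the names below 
are illustrative and do not match the actual SparkExecuteStatementOperation 
members):
{code:scala}
object CancelOrderingSketch {
  @volatile private var state: String = "RUNNING"

  // In the real operation, killing the jobs makes execute() fail; its error
  // handler moves the operation to ERROR unless it is already CANCELED.
  private def killJobs(): Unit = {
    if (state != "CANCELED") state = "ERROR"
  }

  // Old order: mark CANCELED first, then clean up -> execute() observes CANCELED.
  def cancelOld(): Unit = { state = "CANCELED"; killJobs() }

  // New order: clean up first -> the operation transiently reports ERROR
  // before finally being set to CANCELED, which is the symptom reported here.
  def cancelNew(): Unit = { killJobs(); state = "CANCELED" }
}
{code}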



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED state correctly

2020-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32057:


Assignee: Apache Spark

> SparkExecuteStatementOperation does not set CANCELED state correctly 
> -
>
> Key: SPARK-32057
> URL: https://issues.apache.org/jira/browse/SPARK-32057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Smesseim
>Assignee: Apache Spark
>Priority: Major
>
> https://github.com/apache/spark/pull/28671 changed the way cleanup is done in 
> SparkExecuteStatementOperation. In cancel(), cleanup (killing jobs) used to be 
> done after setting the state to CANCELED. Now the order is reversed: jobs are 
> killed first, causing an exception to be thrown inside execute(), so the 
> status of the operation becomes ERROR before being set to CANCELED.
> cc [~juliuszsompolski]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED state correctly

2020-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32057:


Assignee: (was: Apache Spark)

> SparkExecuteStatementOperation does not set CANCELED state correctly 
> -
>
> Key: SPARK-32057
> URL: https://issues.apache.org/jira/browse/SPARK-32057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Smesseim
>Priority: Major
>
> https://github.com/apache/spark/pull/28671 changed the way cleanup is done in 
> SparkExecuteStatementOperation. In cancel(), cleanup (killing jobs) used to be 
> done after setting the state to CANCELED. Now the order is reversed: jobs are 
> killed first, causing an exception to be thrown inside execute(), so the 
> status of the operation becomes ERROR before being set to CANCELED.
> cc [~juliuszsompolski]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED state correctly

2020-06-23 Thread Ali Smesseim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Smesseim updated SPARK-32057:
-
Summary: SparkExecuteStatementOperation does not set CANCELED state 
correctly   (was: SparkExecuteStatementOperation does not set CANCELED/CLOSED 
state correctly )

> SparkExecuteStatementOperation does not set CANCELED state correctly 
> -
>
> Key: SPARK-32057
> URL: https://issues.apache.org/jira/browse/SPARK-32057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Smesseim
>Priority: Major
>
> https://github.com/apache/spark/pull/28671 changed the way cleanup is done in 
> SparkExecuteStatementOperation. In cancel(), cleanup (killing jobs) used to be 
> done after setting the state to CANCELED. Now the order is reversed: jobs are 
> killed first, causing an exception to be thrown inside execute(), so the 
> status of the operation becomes ERROR before being set to CANCELED.
> cc [~juliuszsompolski]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequently submission inadvertently applies to ongoing submission

2020-06-23 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Description: 
THE BUG:

The bug is reproducible by spark-submitting two different apps (app1 and app2) 
with different executor pod templates (e.g., different labels) to K8s 
sequentially, with app2 launching while app1 is still ramping up all its 
executor pods. The unwanted result is that some launched executor pods of app1 
end up having app2's executor pod template applied to them.

The root cause appears to be that app1's podspec-configmap gets overwritten by 
app2 during the overlapping launch periods because the configmap names of the 
two apps are the same. As a result, some of app1's executor pods that are 
ramped up after app2 is launched are inadvertently launched with app2's pod 
template. The issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   13m57s{code}

  was:
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially, 
and with app2 launching while app1 is still ramping up all its executor pods. 
The unwanted result is that some launched executor pods of app1 end up having 
app2's executor pod template applied to them.

The root cause is that app1's podspec-configmap got overwritten by app2 during 
the overlapping launching periods because the configmap names of the two apps 
are the same. This causes some app1's executor pods being ramped up after app2 
is launched to be inadvertently launched with the app2's pod template. The 
issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   13m57s{code}


> [K8s] Pod template from subsequently submission inadvertently applies to 
> ongoing submission
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, and with app2 launching while app1 is still ramping up all its 
> executor pods. The unwanted result is that some launched executor pods of 
> app1 end up having app2's executor pod template applied to them.
> The root cause appears to be that app1's podspec-configmap got overwritten by 
> app2 during the overlapping launching periods because the configmap names of 
> the two apps are th
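A hedged sketch of the prefixing proposed in the SPARK-32067 description above; 
the "--" separator and the helper name are assumptions for illustration, not 
the actual Spark-on-K8s naming code:
{code:scala}
object ConfigMapNaming {
  // Derive a per-application configmap name so that concurrent submissions
  // no longer collide on a shared "podspec-configmap".
  def podSpecConfigMapName(appResourceNamePrefix: String): String =
    s"$appResourceNamePrefix--podspec-configmap"
}

// e.g. ConfigMapNaming.podSpecConfigMapName("app1") == "app1--podspec-configmap",
// so app2's submission no longer overwrites app1's configmap.
{code}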

[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequently submission inadvertently applies to ongoing submission

2020-06-23 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Description: 
THE BUG:

The bug is reproducible by spark-submitting two different apps (app1 and app2) 
with different executor pod templates (e.g., different labels) to K8s 
sequentially, with app2 launching while app1 is still ramping up all its 
executor pods. The unwanted result is that some launched executor pods of app1 
end up having app2's executor pod template applied to them.

The root cause is that app1's podspec-configmap gets overwritten by app2 during 
the overlapping launch periods because the configmap names of the two apps are 
the same. As a result, some of app1's executor pods that are ramped up after 
app2 is launched are inadvertently launched with app2's pod template. The 
issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   13m57s{code}

  was:
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially, 
and with app2 launching while app1 is still ramping up all its executor pods. 
The unwanted result is that some launched executor pods of app1 appear to have 
app2's pod template applied.

The root cause is that app1's podspec-configmap got overwritten by app2 during 
the overlapping launching periods because the configmap names of the two apps 
are the same. This causes some app1's executor pods being ramped up after app2 
is launched to be inadvertently launched with the app2's pod template. The 
issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   13m57s{code}


> [K8s] Pod template from subsequently submission inadvertently applies to 
> ongoing submission
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, and with app2 launching while app1 is still ramping up all its 
> executor pods. The unwanted result is that some launched executor pods of 
> app1 end up having app2's executor pod template applied to them.
> The root cause is that app1's podspec-configmap got overwritten by app2 
> during the overlapping launching periods because the configmap names of the 
> two apps are the same. This causes some app1's execut

[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequently submission inadvertently applies to ongoing submission

2020-06-23 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Description: 
THE BUG:

The bug is reproducible by spark-submitting two different apps (app1 and app2) 
with different executor pod templates (e.g., different labels) to K8s 
sequentially, with app2 launching while app1 is still ramping up all its 
executor pods. The unwanted result is that some launched executor pods of app1 
appear to have app2's pod template applied.

The root cause is that app1's podspec-configmap gets overwritten by app2 during 
the overlapping launch periods because the configmap names of the two apps are 
the same. As a result, some of app1's executor pods that are ramped up after 
app2 is launched are inadvertently launched with app2's pod template. The 
issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   13m57s{code}

  was:
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially, 
and app2 launches while app1 is still ramping up all its executor pods. The 
unwanted result is that some launched executor pods of app1 appear to have 
app2's pod template applied.

The root cause is that app1's podspec-configmap got overwritten by app2 during 
the overlapping launching periods because the configmap names of the two apps 
are the same. This causes some app1's executor pods being ramped up after app2 
is launched to be inadvertently launched with the app2's pod template. The 
issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   13m57s{code}


> [K8s] Pod template from subsequently submission inadvertently applies to 
> ongoing submission
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, and with app2 launching while app1 is still ramping up all its 
> executor pods. The unwanted result is that some launched executor pods of 
> app1 appear to have app2's pod template applied.
> The root cause is that app1's podspec-configmap got overwritten by app2 
> during the overlapping launching periods because the configmap names of the 
> two apps are the same. This causes some app1's executor pods being ramped up 
> after app2 

[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequently submission inadvertently applies to ongoing submission

2020-06-23 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Description: 
THE BUG:

The bug is reproducible by spark-submitting two different apps (app1 and app2) 
with different executor pod templates (e.g., different labels) to K8s 
sequentially, and app2 launches while app1 is still ramping up all its executor 
pods. The unwanted result is that some launched executor pods of app1 appear to 
have app2's pod template applied.

The root cause is that app1's podspec-configmap gets overwritten by app2 during 
the overlapping launch periods because the configmap names of the two apps are 
the same. As a result, some of app1's executor pods that are ramped up after 
app2 is launched are inadvertently launched with app2's pod template. The 
issue can be seen as follows:

First, after submitting app1, you get these configmaps:
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors. The 
podspec-configmap is modified by app2.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   13m57s{code}

  was:
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially, 
and app2 launches while app1 is still ramping up all its executor pods. The 
unwanted result is that some launched executor pods of app1 appear to have 
app2's pod template applied.

The root cause is that app1's podspec-configmap got overwritten by app2 during 
the overlapping launching periods because the configmap names of the two apps 
are the same. This causes some app1's executor pods being ramped up after app2 
is launched to be inadvertently launched with the app2's pod template.

First, submit app1
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   13m57s{code}


> [K8s] Pod template from subsequently submission inadvertently applies to 
> ongoing submission
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, and app2 launches while app1 is still ramping up all its 
> executor pods. The unwanted result is that some launched executor pods of 
> app1 appear to have app2's pod template applied.
> The root cause is that app1's podspec-configmap got overwritten by app2 
> during the overlapping launching periods because the configmap names of the 
> two apps are the same. This causes some app1's executor pods being ramped up 
> after app2 is launched to be inadvertently launched with the app2's pod 
> template. The issue can be seen as follows:
> First, after submi

[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequently submission inadvertently applies to ongoing submission

2020-06-23 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Description: 
THE BUG:

The bug is reproducible by spark-submitting two different apps (app1 and app2) 
with different executor pod templates (e.g., different labels) to K8s 
sequentially, and app2 launches while app1 is still ramping up all its executor 
pods. The unwanted result is that some launched executor pods of app1 appear to 
have app2's pod template applied.

The root cause is that app1's podspec-configmap gets overwritten by app2 during 
the overlapping launch periods because the configmap names of the two apps are 
the same. As a result, some of app1's executor pods that are ramped up after 
app2 is launched are inadvertently launched with app2's pod template.

First, submit app1
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   13m57s{code}

  was:
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different executor pod templates (e.g., different labels) to K8s sequentially, 
and app2 launches while app1 is still ramping up all its executor pods. The 
unwanted result is that some launched executor pods of app1 appear to have 
app2's pod template applied.

The root cause is that app1's podspec-configmap got overwritten by app2 during 
the launching period because the configmap names of the two apps are the same. 
This causes some app1's executor pods being ramped up after app2 is launched to 
be inadvertently launched with the app2's pod template.

First, submit app1
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   13m57s{code}


> [K8s] Pod template from subsequently submission inadvertently applies to 
> ongoing submission
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, and app2 launches while app1 is still ramping up all its 
> executor pods. The unwanted result is that some launched executor pods of 
> app1 appear to have app2's pod template applied.
> The root cause is that app1's podspec-configmap got overwritten by app2 
> during the overlapping launching periods because the configmap names of the 
> two apps are the same. This causes some app1's executor pods being ramped up 
> after app2 is launched to be inadvertently launched with the app2's pod 
> template.
> First, submit app1
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   9m46s
> defa

[jira] [Updated] (SPARK-32067) [K8s] Pod template from subsequently submission inadvertently applies to ongoing submission

2020-06-23 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-32067:
-
Description: 
THE BUG:

The bug is reproducible by spark-submitting two different apps (app1 and app2) 
with different executor pod templates (e.g., different labels) to K8s 
sequentially, and app2 launches while app1 is still ramping up all its executor 
pods. The unwanted result is that some launched executor pods of app1 appear to 
have app2's pod template applied.

The root cause is that app1's podspec-configmap gets overwritten by app2 during 
the launch period because the configmap names of the two apps are the same. As 
a result, some of app1's executor pods that are ramped up after app2 is 
launched are inadvertently launched with app2's pod template.

First, submit app1
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   13m57s{code}

  was:
THE BUG:

The bug is reproducible by spark-submit two different apps (app1 and app2) with 
different pod templates to K8s sequentially, and app2 launches while app1 is 
still ramping up all its executor pods. The unwanted result is that some 
launched executor pods of app1 appear to have app2's pod template applied.

The root cause is that app1's podspec-configmap got overwritten by app2 during 
the launching period because the configmap names of the two apps are the same. 
This causes some app1's executor pods being ramped up after app2 is launched to 
be inadvertently launched with the app2's pod template.

First, submit app1
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   9m46s
default  podspec-configmap  1   12m{code}
Then submit app2 while app1 is still ramping up its executors
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  podspec-configmap  1   13m57s{code}
 

PROPOSED SOLUTION:

Properly prefix the podspec-configmap for each submitted app.
{code:java}
NAMESPACENAME   DATAAGE
default  app1--driver-conf-map  1   11m43s
default  app2--driver-conf-map  1   10s
default  app1--podspec-configmap1   13m57s
default  app2--podspec-configmap1   13m57s{code}


> [K8s] Pod template from subsequently submission inadvertently applies to 
> ongoing submission
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.6, 3.0.0
>Reporter: James Yu
>Priority: Minor
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially, and app2 launches while app1 is still ramping up all its 
> executor pods. The unwanted result is that some launched executor pods of 
> app1 appear to have app2's pod template applied.
> The root cause is that app1's podspec-configmap got overwritten by app2 
> during the launching period because the configmap names of the two apps are 
> the same. This causes some app1's executor pods being ramped up after app2 is 
> launched to be inadvertently launched with the app2's pod template.
> First, submit app1
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{

[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation

2020-06-23 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143151#comment-17143151
 ] 

Thomas Graves commented on SPARK-32037:
---

I agree healthy/unhealthy could mean other things than the current blacklist 
meaning.  Another option is excludes, but again it has the same problem that 
something could be excluded because the user specified it.

A few other options I found searching around:

*grant*list/*block*list

*let*list/*ban*list - I like ban but am not sure about the letlist side.
SafeList/BlockList
Allowlist/DenyList
 
[https://tools.ietf.org/id/draft-knodel-terminology-00.html#rfc.section.1.2.1]
 
has:
 * Blocklist-allowlist
 * Block-permit

 

Personally I like the blocklist/allowlist

> Rename blacklisting feature to avoid language with racist connotation
> -
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Minor
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist". 
> While it seems to me that there is some valid debate as to whether this term 
> has racist origins, the cultural connotations are inescapable in today's 
> world.
> I've created a separate task, SPARK-32036, to remove references outside of 
> this feature. Given the large surface area of this feature and the 
> public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name 
> would be. Reject-/deny-/ignore-/block-list are common replacements for 
> "blacklist", but I'm not sure that any of them work well for this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32077) Support host-local shuffle data reading with external shuffle service disabled

2020-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32077:


Assignee: (was: Apache Spark)

> Support host-local shuffle data reading with external shuffle service disabled
> --
>
> Key: SPARK-32077
> URL: https://issues.apache.org/jira/browse/SPARK-32077
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> After SPARK-27651, Spark can read host-local shuffle data directly from disk 
> with external shuffle service enabled. To extend the feature, we can also 
> support it with external shuffle service disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32077) Support host-local shuffle data reading with external shuffle service disabled

2020-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143021#comment-17143021
 ] 

Apache Spark commented on SPARK-32077:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/28911

> Support host-local shuffle data reading with external shuffle service disabled
> --
>
> Key: SPARK-32077
> URL: https://issues.apache.org/jira/browse/SPARK-32077
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> After SPARK-27651, Spark can read host-local shuffle data directly from disk 
> with external shuffle service enabled. To extend the feature, we can also 
> support it with external shuffle service disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32077) Support host-local shuffle data reading with external shuffle service disabled

2020-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143020#comment-17143020
 ] 

Apache Spark commented on SPARK-32077:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/28911

> Support host-local shuffle data reading with external shuffle service disabled
> --
>
> Key: SPARK-32077
> URL: https://issues.apache.org/jira/browse/SPARK-32077
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> After SPARK-27651, Spark can read host-local shuffle data directly from disk 
> with external shuffle service enabled. To extend the feature, we can also 
> support it with external shuffle service disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32077) Support host-local shuffle data reading with external shuffle service disabled

2020-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32077:


Assignee: Apache Spark

> Support host-local shuffle data reading with external shuffle service disabled
> --
>
> Key: SPARK-32077
> URL: https://issues.apache.org/jira/browse/SPARK-32077
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> After SPARK-27651, Spark can read host-local shuffle data directly from disk 
> with external shuffle service enabled. To extend the feature, we can also 
> support it with external shuffle service disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31995) Spark Structure Streaming checkpiontFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not

2020-06-23 Thread Jim Huang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142990#comment-17142990
 ] 

Jim Huang edited comment on SPARK-31995 at 6/23/20, 3:00 PM:
-

Thanks Gabor for triaging this issue.  SPARK-32076 has been opened to explore 
the improvement angle.  

[~gsomogyi] I am curious which part of the code base in the Spark 3.0.0 branch 
"should make this issue disappear"?
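Regarding the tuning question in the quoted issue description below: one knob 
commonly associated with this particular HDFS error is the DFS client's retry 
count when completing a file. A hedged sketch (the config name comes from the 
HDFS client configuration, not from Spark; verify it against your Hadoop 
version, and an active SparkSession named spark is assumed):
{code:scala}
// Raise the DFS client's retries for completing a file so the "not enough
// replicas" condition has more time to resolve before close() gives up.
spark.sparkContext.hadoopConfiguration
  .setInt("dfs.client.block.write.locateFollowingBlock.retries", 10)
{code}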

 


was (Author: jimhuang):
Thanks Gabor for triaging this issue.  SPARK-32076 has been opened to explore 
the improvement perspective.  

> Spark Structure Streaming checkpiontFileManager ERROR when 
> HDFS.DFSOutputStream.completeFile with IOException unable to close file 
> because the last block does not have enough number of replicas
> -
>
> Key: SPARK-31995
> URL: https://issues.apache.org/jira/browse/SPARK-31995
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop
> Hadoop 2.7.3 - YARN cluster
> delta-core_ 2.11:0.6.1
>  
>Reporter: Jim Huang
>Priority: Major
>
> I am using Spark 2.4.5's Spark Structured Streaming with a Delta table (0.6.1) 
> as the sink, running in a YARN cluster on Hadoop 2.7.3.  I had been using 
> Spark Structured Streaming for several months in this runtime environment 
> until this new corner case left my Spark Structured Streaming job in a 
> partially working state.
>  
> I have included the ERROR message and stack trace.  I did a quick search 
> using the string "MicroBatchExecution: Query terminated with error" but did 
> not find any existing Jira that looks like my stack trace.  
>  
> Based on a naive look at this error message and stack trace, is it possible 
> for Spark's CheckpointFileManager to handle this HDFS exception better by 
> simply waiting a little longer for HDFS's pipeline to complete the replicas?  
>  
> Being new to this code, where can I find the configuration parameter that 
> sets the replica count for the `streaming.HDFSMetadataLog`?  I am just trying 
> to understand whether the current code already provides some holistic 
> configuration tuning variable(s) to handle this IOException more gracefully.  
> Hopefully experts can provide some pointers or directions.  
>  
> {code:java}
> 20/06/12 20:14:15 ERROR MicroBatchExecution: Query [id = 
> yarn-job-id-redacted, runId = run-id-redacted] terminated with error
>  java.io.IOException: Unable to close file because the last block does not 
> have enough number of replicas.
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2511)
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2472)
>  at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2437)
>  at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
>  at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:145)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:547)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:557)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$ap

[jira] [Comment Edited] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector

2020-06-23 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143007#comment-17143007
 ] 

Gabor Somogyi edited comment on SPARK-32001 at 6/23/20, 2:54 PM:
-

With the service loader, no registration is needed and everything works like a 
charm. Additionally, an enable flag is harder to implement. Let me give an 
example:

With JdbcDialect: one creates a dialect class and then 
"JdbcDialects.registerDialect(new CustomDialect())" must be called in the app.
With ServiceLoader: one creates a provider class + a META-INF.services file (no 
registration or anything else needed). The service loader scans the classpath 
for implementations of the API.

I'm not super experienced in the dialect area, so per-app registration may be 
needed there, but a Kerberos authentication provider is not something the user 
needs to care about in the app code. A company writes one provider, puts it on 
Spark's classpath, and it can be forgotten.



was (Author: gsomogyi):
With the service loader no registration is needed and everything works like 
charm.
Additionally harder to implement enable flag. Let me give an example:

With JdbcDiaclect: one creates a dialect class and then in the app 
"JdbcDialects.registerDialect(new CustomDialect())" must be called.
With ServiceLoader: one creates a provider class + META-INF.services file (no 
registration or whatever needed)

Not super experienced in the dialect area so it may be needed per app but 
kerberos authentication provider is not something what the user needs to care 
about in the app code. The company must write one provider, put it on the 
classpath of Spark and must be forgotten.


> Create Kerberos authentication provider API in JDBC connector
> -
>
> Key: SPARK-32001
> URL: https://issues.apache.org/jira/browse/SPARK-32001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Adding an embedded provider for every possible database would generate a high 
> maintenance cost on the Spark side.
> Instead, an API can be introduced which would allow further providers to be 
> implemented independently.
> One important requirement I suggest is: JDBC connection providers must be 
> loaded independently, just like delegation token providers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32077) Support host-local shuffle data reading with external shuffle service disabled

2020-06-23 Thread wuyi (Jira)
wuyi created SPARK-32077:


 Summary: Support host-local shuffle data reading with external 
shuffle service disabled
 Key: SPARK-32077
 URL: https://issues.apache.org/jira/browse/SPARK-32077
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: wuyi


After SPARK-27651, Spark can read host-local shuffle data directly from disk 
with external shuffle service enabled. To extend the feature, we can also 
support it with external shuffle service disabled.
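For context, a hedged sketch of the current (Spark 3.0) setup this ticket would 
extend; the config names are taken from the 3.0 documentation and should be 
verified against your Spark version:
{code:scala}
import org.apache.spark.SparkConf

// Host-local disk reads currently require the external shuffle service;
// this ticket proposes supporting them without it.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")     // required today
  .set("spark.shuffle.readHostLocalDisk", "true")   // enable host-local disk reads
{code}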



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector

2020-06-23 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143007#comment-17143007
 ] 

Gabor Somogyi commented on SPARK-32001:
---

With the service loader, no registration is needed and everything works like a 
charm. Additionally, an enable flag is harder to implement. Let me give an 
example:

With JdbcDialect: one creates a dialect class and then 
"JdbcDialects.registerDialect(new CustomDialect())" must be called in the app.
With ServiceLoader: one creates a provider class + a META-INF.services file (no 
registration or anything else needed).

I'm not super experienced in the dialect area, so per-app registration may be 
needed there, but a Kerberos authentication provider is not something the user 
needs to care about in the app code. A company writes one provider, puts it on 
Spark's classpath, and it can be forgotten.
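A hedged, self-contained sketch of the ServiceLoader approach described above; 
the provider trait below is an assumed illustration, not the actual interface 
Spark would expose:
{code:scala}
package com.example.auth

import java.sql.{Connection, Driver}
import java.util.Properties

// Assumed provider interface shape, for illustration only.
trait JdbcConnectionProvider {
  def canHandle(driver: Driver, options: Map[String, String]): Boolean
  def getConnection(driver: Driver, options: Map[String, String]): Connection
}

// The company-specific provider: dropped on Spark's classpath together with a
// META-INF/services/com.example.auth.JdbcConnectionProvider file containing
// the line "com.example.auth.MyKerberosConnectionProvider". No call such as
// registerDialect(...) is needed in application code.
class MyKerberosConnectionProvider extends JdbcConnectionProvider {
  override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
    options.contains("keytab") && options.contains("principal")

  override def getConnection(driver: Driver, options: Map[String, String]): Connection = {
    // Obtain a Kerberos TGT from the keytab/principal here, then delegate.
    driver.connect(options("url"), new Properties())
  }
}
{code}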


> Create Kerberos authentication provider API in JDBC connector
> -
>
> Key: SPARK-32001
> URL: https://issues.apache.org/jira/browse/SPARK-32001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Adding an embedded provider for every possible database would generate a high 
> maintenance cost on the Spark side.
> Instead, an API can be introduced which would allow further providers to be 
> implemented independently.
> One important requirement I suggest is: JDBC connection providers must be 
> loaded independently, just like delegation token providers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32051) Dataset.foreachPartition returns object

2020-06-23 Thread Frank Oosterhuis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Oosterhuis updated SPARK-32051:
-
Fix Version/s: (was: 3.0.1)
   (was: 3.1.0)

> Dataset.foreachPartition returns object
> ---
>
> Key: SPARK-32051
> URL: https://issues.apache.org/jira/browse/SPARK-32051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Frank Oosterhuis
>Priority: Critical
>
> I'm trying to map values from the Dataset[Row], but since 3.0.0 this fails.
> In 3.0.0 I'm dealing with an error: "Error:(28, 38) value map is not a member 
> of Object"
>  
> This is the simplest code that works in 2.4.x, but fails in 3.0.0:
> {code:scala}
> spark.range(100)
>   .repartition(10)
>   .foreachPartition(part => println(part.toList))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32051) Dataset.foreachPartition returns object

2020-06-23 Thread Frank Oosterhuis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143001#comment-17143001
 ] 

Frank Oosterhuis commented on SPARK-32051:
--

Looks like a similar conflict happens with Dataset.foreach.

 
{code:java}
Error:(22, 8) overloaded method value foreach with alternatives:
  (func: 
org.apache.spark.api.java.function.ForeachFunction[org.apache.spark.sql.Row])Unit
 
  (f: org.apache.spark.sql.Row => Unit)Unit
 cannot be applied to (org.apache.spark.sql.Row => 
org.apache.spark.sql.DataFrame)
  .foreach((r : Row) => {
{code}
The workaround *.foreach((r: Row) =>* does not work here.
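For reference, a hedged sketch of the type-annotation workaround sometimes used 
for the foreachPartition overload ambiguity (assuming an active SparkSession 
named spark); as noted above, it does not resolve the Dataset.foreach case:
{code:scala}
// Annotating the iterator's element type lets the compiler pick the Scala
// overload instead of inferring Object.
spark.range(100)
  .repartition(10)
  .foreachPartition((part: Iterator[java.lang.Long]) => println(part.toList))
{code}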

> Dataset.foreachPartition returns object
> ---
>
> Key: SPARK-32051
> URL: https://issues.apache.org/jira/browse/SPARK-32051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Frank Oosterhuis
>Priority: Critical
> Fix For: 3.0.1, 3.1.0
>
>
> I'm trying to map values from the Dataset[Row], but since 3.0.0 this fails.
> In 3.0.0 I'm dealing with an error: "Error:(28, 38) value map is not a member 
> of Object"
>  
> This is the simplest code that works in 2.4.x, but fails in 3.0.0:
> {code:scala}
> spark.range(100)
>   .repartition(10)
>   .foreachPartition(part => println(part.toList))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32075) Fix a few issues in parameters table

2020-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32075:


Assignee: Apache Spark

> Fix a few issues in parameters table
> 
>
> Key: SPARK-32075
> URL: https://issues.apache.org/jira/browse/SPARK-32075
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Zuo Dao
>Assignee: Apache Spark
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32075) Fix a few issues in parameters table

2020-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142994#comment-17142994
 ] 

Apache Spark commented on SPARK-32075:
--

User 'sidedoorleftroad' has created a pull request for this issue:
https://github.com/apache/spark/pull/28910

> Fix a few issues in parameters table
> 
>
> Key: SPARK-32075
> URL: https://issues.apache.org/jira/browse/SPARK-32075
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Zuo Dao
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32075) Fix a few issues in parameters table

2020-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32075:


Assignee: (was: Apache Spark)

> Fix a few issues in parameters table
> 
>
> Key: SPARK-32075
> URL: https://issues.apache.org/jira/browse/SPARK-32075
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Zuo Dao
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31995) Spark Structure Streaming checkpiontFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not have

2020-06-23 Thread Jim Huang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142990#comment-17142990
 ] 

Jim Huang commented on SPARK-31995:
---

Thanks Gabor for triaging this issue.  SPARK-32076 has been opened to explore 
the improvement angle.  

> Spark Structure Streaming checkpiontFileManager ERROR when 
> HDFS.DFSOutputStream.completeFile with IOException unable to close file 
> because the last block does not have enough number of replicas
> -
>
> Key: SPARK-31995
> URL: https://issues.apache.org/jira/browse/SPARK-31995
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop
> Hadoop 2.7.3 - YARN cluster
> delta-core_ 2.11:0.6.1
>  
>Reporter: Jim Huang
>Priority: Major
>
> I am using Spark 2.4.5's Spark Structured Streaming with a Delta table (0.6.1) 
> as the sink, running in a YARN cluster on Hadoop 2.7.3.  I had been using 
> Spark Structured Streaming for several months in this runtime environment 
> until this new corner case left my Spark Structured Streaming job in a 
> partially working state.
>  
> I have included the ERROR message and stack trace.  I did a quick search 
> using the string "MicroBatchExecution: Query terminated with error" but did 
> not find any existing Jira that looks like my stack trace.  
>  
> Based on a naive look at this error message and stack trace, is it possible 
> for Spark's CheckpointFileManager to handle this HDFS exception better by 
> simply waiting a little longer for HDFS's pipeline to complete the replicas?  
>  
> Being new to this code, where can I find the configuration parameter that 
> sets the replica count for the `streaming.HDFSMetadataLog`?  I am just trying 
> to understand whether the current code already provides some holistic 
> configuration tuning variable(s) to handle this IOException more gracefully.  
> Hopefully experts can provide some pointers or directions.  
>  
> {code:java}
> 20/06/12 20:14:15 ERROR MicroBatchExecution: Query [id = 
> yarn-job-id-redacted, runId = run-id-redacted] terminated with error
>  java.io.IOException: Unable to close file because the last block does not 
> have enough number of replicas.
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2511)
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2472)
>  at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2437)
>  at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
>  at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:145)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:547)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:557)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:545)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBa

[jira] [Created] (SPARK-32076) Structured Streaming application continuity when encountering streaming query task level error

2020-06-23 Thread Jim Huang (Jira)
Jim Huang created SPARK-32076:
-

 Summary: Structured Streaming application continuity when 
encountering streaming query task level error
 Key: SPARK-32076
 URL: https://issues.apache.org/jira/browse/SPARK-32076
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.5
 Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop

Hadoop 2.7.3 - YARN cluster

delta-core_ 2.11:0.6.1
Reporter: Jim Huang


From the Spark Structured Streaming application continuity perspective, the 
thread that ran this task was terminated with the ERROR from SPARK-31995, but 
to YARN it is still an active running job even though this instance of the 
Spark Structured Streaming job is no longer doing any processing.  If the 
monitoring of the Spark Structured Streaming job is done only from the YARN 
job perspective, it may report a false status.  In this situation, should the 
Spark Structured Streaming application fail hard and completely (failed by the 
Spark framework or by application exception handling)?  Or should the 
developer investigate and develop a monitoring implementation with the right 
level of specificity to detect Spark Structured Streaming *task* level 
failures?  Any references on these topics are much appreciated.
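
For illustration, one way to surface such query-level failures to YARN (a 
minimal sketch, not something prescribed in this ticket) is to register a 
StreamingQueryListener that stops the application when a query terminates 
with an exception, and/or to rely on awaitAnyTermination() rethrowing the 
failure:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

object FailFastStreamingApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fail-fast-streaming").getOrCreate()

    // If any streaming query dies with an exception, stop the application
    // instead of letting the driver idle in YARN as a "healthy" RUNNING job.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
        event.exception.foreach { err =>
          System.err.println(s"Streaming query ${event.id} failed: $err")
          spark.stop()
          sys.exit(1)  // non-zero exit so YARN marks the attempt as failed
        }
    })

    // ... define and start streaming queries here ...

    // awaitAnyTermination() rethrows the failing query's exception, which has
    // the same fail-fast effect for single-query applications.
    spark.streams.awaitAnyTermination()
  }
}
{code}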



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector

2020-06-23 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142969#comment-17142969
 ] 

Takeshi Yamamuro commented on SPARK-32001:
--

I just want to know whether the approach (META-INF.services) is the best one 
for that. Couldn't we follow an interface similar to JdbcDialects 
(registerDialect and unregisterDialect)? 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala#L205-L216
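
For context, the registerDialect-style API mentioned above is used like this 
from user code (a small illustrative sketch; MyDbDialect and the jdbc:mydb: 
URL prefix are made-up names):

{code:scala}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Minimal custom dialect: only canHandle is abstract, the other members have defaults.
object MyDbDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb:")
}

// Programmatic registration/unregistration, no service files involved.
JdbcDialects.registerDialect(MyDbDialect)
JdbcDialects.unregisterDialect(MyDbDialect)
{code}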

> Create Kerberos authentication provider API in JDBC connector
> -
>
> Key: SPARK-32001
> URL: https://issues.apache.org/jira/browse/SPARK-32001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Adding an embedded provider for every possible database would generate a 
> high maintenance cost on the Spark side.
> Instead, an API can be introduced which would allow further providers to be 
> implemented independently.
> One important requirement I suggest is: JDBC connection providers must be 
> loaded independently, just like delegation token providers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32075) Fix a few issues in parameters table

2020-06-23 Thread Zuo Dao (Jira)
Zuo Dao created SPARK-32075:
---

 Summary: Fix a few issues in parameters table
 Key: SPARK-32075
 URL: https://issues.apache.org/jira/browse/SPARK-32075
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Zuo Dao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31995) Spark Structured Streaming checkpointFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not have

2020-06-23 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-31995.
---
Resolution: Information Provided

The issue should disappear w/ Spark 3.0. Please re-open it if it's not the case.

> Spark Structured Streaming checkpointFileManager ERROR when 
> HDFS.DFSOutputStream.completeFile with IOException unable to close file 
> because the last block does not have enough number of replicas
> -
>
> Key: SPARK-31995
> URL: https://issues.apache.org/jira/browse/SPARK-31995
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop
> Hadoop 2.7.3 - YARN cluster
> delta-core_ 2.11:0.6.1
>  
>Reporter: Jim Huang
>Priority: Major
>
> I am using Spark 2.4.5's Structured Streaming with a Delta table (0.6.1) as 
> the sink, running in a YARN cluster on Hadoop 2.7.3.  I have been using 
> Spark Structured Streaming for several months in this runtime environment, 
> until this new corner case left my Spark Structured Streaming job in a 
> partially working state.
>  
> I have included the ERROR message and stack trace.  I did a quick search 
> using the string "MicroBatchExecution: Query terminated with error" but did 
> not find any existing Jira that looks like my stack trace.  
>  
> Based on a naive look at this error message and stack trace, is it possible 
> for Spark's CheckpointFileManager to handle this HDFS exception better by 
> simply waiting a little longer for HDFS's pipeline to complete the replicas?  
>  
> Being new to this code, where can I find the configuration parameter that 
> sets the replica counts for the `streaming.HDFSMetadataLog`?  I am just 
> trying to understand whether there are already some holistic configuration 
> tuning variable(s) the current code provides to handle this IOException 
> more gracefully.  Hopefully experts can provide some pointers or directions.  
>  
> {code:java}
> 20/06/12 20:14:15 ERROR MicroBatchExecution: Query [id = 
> yarn-job-id-redacted, runId = run-id-redacted] terminated with error
>  java.io.IOException: Unable to close file because the last block does not 
> have enough number of replicas.
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2511)
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2472)
>  at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2437)
>  at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
>  at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:145)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:547)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:557)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:545)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStrea

[jira] [Closed] (SPARK-31995) Spark Structured Streaming checkpointFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not have en

2020-06-23 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi closed SPARK-31995.
-

> Spark Structured Streaming checkpointFileManager ERROR when 
> HDFS.DFSOutputStream.completeFile with IOException unable to close file 
> because the last block does not have enough number of replicas
> -
>
> Key: SPARK-31995
> URL: https://issues.apache.org/jira/browse/SPARK-31995
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop
> Hadoop 2.7.3 - YARN cluster
> delta-core_ 2.11:0.6.1
>  
>Reporter: Jim Huang
>Priority: Major
>
> I am using Spark 2.4.5's Structured Streaming with a Delta table (0.6.1) as 
> the sink, running in a YARN cluster on Hadoop 2.7.3.  I have been using 
> Spark Structured Streaming for several months in this runtime environment, 
> until this new corner case left my Spark Structured Streaming job in a 
> partially working state.
>  
> I have included the ERROR message and stack trace.  I did a quick search 
> using the string "MicroBatchExecution: Query terminated with error" but did 
> not find any existing Jira that looks like my stack trace.  
>  
> Based on a naive look at this error message and stack trace, is it possible 
> for Spark's CheckpointFileManager to handle this HDFS exception better by 
> simply waiting a little longer for HDFS's pipeline to complete the replicas?  
>  
> Being new to this code, where can I find the configuration parameter that 
> sets the replica counts for the `streaming.HDFSMetadataLog`?  I am just 
> trying to understand whether there are already some holistic configuration 
> tuning variable(s) the current code provides to handle this IOException 
> more gracefully.  Hopefully experts can provide some pointers or directions.  
>  
> {code:java}
> 20/06/12 20:14:15 ERROR MicroBatchExecution: Query [id = 
> yarn-job-id-redacted, runId = run-id-redacted] terminated with error
>  java.io.IOException: Unable to close file because the last block does not 
> have enough number of replicas.
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2511)
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2472)
>  at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2437)
>  at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
>  at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:145)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:547)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:557)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:545)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBat

[jira] [Commented] (SPARK-31995) Spark Structured Streaming checkpointFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not have

2020-06-23 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142957#comment-17142957
 ] 

Gabor Somogyi commented on SPARK-31995:
---

W/o super deep consideration I would say the streaming query must stop if an 
exception happens during execution, but I would like to handle that as a 
separate jira. My proposal is to close this jira since a later version solves 
the issue, and to open another one for the improvement.

> Spark Structured Streaming checkpointFileManager ERROR when 
> HDFS.DFSOutputStream.completeFile with IOException unable to close file 
> because the last block does not have enough number of replicas
> -
>
> Key: SPARK-31995
> URL: https://issues.apache.org/jira/browse/SPARK-31995
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop
> Hadoop 2.7.3 - YARN cluster
> delta-core_ 2.11:0.6.1
>  
>Reporter: Jim Huang
>Priority: Major
>
> I am using Spark 2.4.5's Structured Streaming with a Delta table (0.6.1) as 
> the sink, running in a YARN cluster on Hadoop 2.7.3.  I have been using 
> Spark Structured Streaming for several months in this runtime environment, 
> until this new corner case left my Spark Structured Streaming job in a 
> partially working state.
>  
> I have included the ERROR message and stack trace.  I did a quick search 
> using the string "MicroBatchExecution: Query terminated with error" but did 
> not find any existing Jira that looks like my stack trace.  
>  
> Based on a naive look at this error message and stack trace, is it possible 
> for Spark's CheckpointFileManager to handle this HDFS exception better by 
> simply waiting a little longer for HDFS's pipeline to complete the replicas?  
>  
> Being new to this code, where can I find the configuration parameter that 
> sets the replica counts for the `streaming.HDFSMetadataLog`?  I am just 
> trying to understand whether there are already some holistic configuration 
> tuning variable(s) the current code provides to handle this IOException 
> more gracefully.  Hopefully experts can provide some pointers or directions.  
>  
> {code:java}
> 20/06/12 20:14:15 ERROR MicroBatchExecution: Query [id = 
> yarn-job-id-redacted, runId = run-id-redacted] terminated with error
>  java.io.IOException: Unable to close file because the last block does not 
> have enough number of replicas.
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2511)
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2472)
>  at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2437)
>  at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
>  at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:145)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:547)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:545)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:557)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:545)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchEx

[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector

2020-06-23 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142937#comment-17142937
 ] 

Gabor Somogyi commented on SPARK-32001:
---

Here is the DT provider API: 
https://github.com/apache/spark/blob/e00f43cb86a6c76720b45176e9f9a7fba1dc3a35/core/src/main/scala/org/apache/spark/security/HadoopDelegationTokenProvider.scala#L31
META-INF.services: 
https://github.com/apache/spark/blob/master/core/src/main/resources/META-INF/services/org.apache.spark.security.HadoopDelegationTokenProvider
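
To illustrate the idea (the names below are hypothetical, not a final API), a 
JDBC connection provider could follow the same pattern: a small trait 
implemented in an external jar and registered through a META-INF/services 
entry so the JDK ServiceLoader can discover it:

{code:scala}
package com.example

import java.sql.{Connection, Driver}

// Hypothetical provider SPI, mirroring the HadoopDelegationTokenProvider pattern.
trait JdbcConnectionProvider {
  // True if this provider can authenticate connections for the given JDBC options.
  def canHandle(driver: Driver, options: Map[String, String]): Boolean
  // Opens the (e.g. Kerberos-authenticated) connection.
  def getConnection(driver: Driver, options: Map[String, String]): Connection
}

// A third-party implementation shipped in its own jar:
class MyDbKerberosConnectionProvider extends JdbcConnectionProvider {
  override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
    options.get("url").exists(_.startsWith("jdbc:mydb:"))

  override def getConnection(driver: Driver, options: Map[String, String]): Connection =
    driver.connect(options("url"), new java.util.Properties())
}
{code}

The jar would then carry a resource file named 
META-INF/services/com.example.JdbcConnectionProvider containing the line 
com.example.MyDbKerberosConnectionProvider, and no change in Spark itself 
would be needed to pick it up.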


> Create Kerberos authentication provider API in JDBC connector
> -
>
> Key: SPARK-32001
> URL: https://issues.apache.org/jira/browse/SPARK-32001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Adding an embedded provider for every possible database would generate a 
> high maintenance cost on the Spark side.
> Instead, an API can be introduced which would allow further providers to be 
> implemented independently.
> One important requirement I suggest is: JDBC connection providers must be 
> loaded independently, just like delegation token providers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector

2020-06-23 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142934#comment-17142934
 ] 

Gabor Somogyi commented on SPARK-32001:
---

Which system do you mean? A provider can simply be added as an external jar 
containing the API implementation + the META-INF.services file, and that's all.

> Create Kerberos authentication provider API in JDBC connector
> -
>
> Key: SPARK-32001
> URL: https://issues.apache.org/jira/browse/SPARK-32001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Adding an embedded provider for every possible database would generate a 
> high maintenance cost on the Spark side.
> Instead, an API can be introduced which would allow further providers to be 
> implemented independently.
> One important requirement I suggest is: JDBC connection providers must be 
> loaded independently, just like delegation token providers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31995) Spark Structured Streaming checkpointFileManager ERROR when HDFS.DFSOutputStream.completeFile with IOException unable to close file because the last block does not have

2020-06-23 Thread Jim Huang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142932#comment-17142932
 ] 

Jim Huang commented on SPARK-31995:
---

Thank you for providing a helpful perspective.  I was able to locate HDFS-11486 
using the search string you provided, and it was "resolved in (Hadoop) 
2.7.4+."  I agree with you that the HDFS-11486 fix will definitely improve the 
HDFS replica exception handling.  

This issue is pretty unique; I am not sure I am equipped to create and induce 
such a rare corner case.  Spark 3.0.0 just got released this past week.  I will 
need additional application development time to migrate to the Spark 3.x 
(Delta 0.7.0+) ecosystem.  I will be able to upgrade to Spark 2.4.6 sooner.  
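
In the meantime, one client-side knob that is commonly suggested for this 
particular IOException (an assumption on my part, not something confirmed in 
this thread) is the HDFS client's retry budget for completing the last block; 
it can be passed through Spark's spark.hadoop.* prefix, for example:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: raise dfs.client.block.write.locateFollowingBlock.retries (default 5)
// so a slow replica pipeline is less likely to surface as
// "Unable to close file because the last block does not have enough number of replicas".
val spark = SparkSession.builder()
  .appName("streaming-with-hdfs-retry-tuning")
  .config("spark.hadoop.dfs.client.block.write.locateFollowingBlock.retries", "10")
  .getOrCreate()
{code}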

 

From the Spark Structured Streaming application continuity perspective, the 
thread that ran this task was terminated with the ERROR, but to YARN it is 
still an active running job even though my Spark Structured Streaming job is 
no longer doing any processing.  If the monitoring of the Spark Structured 
Streaming job is done only from the YARN job perspective, it may report a 
false status.  In this situation, should the Spark Structured Streaming 
application fail hard and completely (failed by the Spark framework or by 
application exception handling)?  Or should I investigate and develop a 
monitoring implementation with the right level of specificity to detect Spark 
Structured Streaming task level failures?  Any references on these topics are 
much appreciated.

 

> Spark Structured Streaming checkpointFileManager ERROR when 
> HDFS.DFSOutputStream.completeFile with IOException unable to close file 
> because the last block does not have enough number of replicas
> -
>
> Key: SPARK-31995
> URL: https://issues.apache.org/jira/browse/SPARK-31995
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5 Scala 2.11 without Hadoop
> Hadoop 2.7.3 - YARN cluster
> delta-core_ 2.11:0.6.1
>  
>Reporter: Jim Huang
>Priority: Major
>
> I am using Spark 2.4.5's Structured Streaming with a Delta table (0.6.1) as 
> the sink, running in a YARN cluster on Hadoop 2.7.3.  I have been using 
> Spark Structured Streaming for several months in this runtime environment, 
> until this new corner case left my Spark Structured Streaming job in a 
> partially working state.
>  
> I have included the ERROR message and stack trace.  I did a quick search 
> using the string "MicroBatchExecution: Query terminated with error" but did 
> not find any existing Jira that looks like my stack trace.  
>  
> Based on a naive look at this error message and stack trace, is it possible 
> for Spark's CheckpointFileManager to handle this HDFS exception better by 
> simply waiting a little longer for HDFS's pipeline to complete the replicas?  
>  
> Being new to this code, where can I find the configuration parameter that 
> sets the replica counts for the `streaming.HDFSMetadataLog`?  I am just 
> trying to understand whether there are already some holistic configuration 
> tuning variable(s) the current code provides to handle this IOException 
> more gracefully.  Hopefully experts can provide some pointers or directions.  
>  
> {code:java}
> 20/06/12 20:14:15 ERROR MicroBatchExecution: Query [id = 
> yarn-job-id-redacted, runId = run-id-redacted] terminated with error
>  java.io.IOException: Unable to close file because the last block does not 
> have enough number of replicas.
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2511)
>  at 
> org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2472)
>  at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2437)
>  at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
>  at 
> org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:145)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatchToFile(HDFSMetadataLog.scala:126)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply$mcZ$sp(HDFSMetadataLog.scala:112)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$$anonfun$add$1.apply(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadata

[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector

2020-06-23 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142916#comment-17142916
 ] 

Takeshi Yamamuro commented on SPARK-32001:
--

Do you know of any other systems supporting that kind of interface?

> Create Kerberos authentication provider API in JDBC connector
> -
>
> Key: SPARK-32001
> URL: https://issues.apache.org/jira/browse/SPARK-32001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Adding an embedded provider for every possible database would generate a 
> high maintenance cost on the Spark side.
> Instead, an API can be introduced which would allow further providers to be 
> implemented independently.
> One important requirement I suggest is: JDBC connection providers must be 
> loaded independently, just like delegation token providers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32074) Update AppVeyor R to 4.0.2

2020-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32074:


Assignee: Apache Spark

> Update AppVeyor R to 4.0.2
> --
>
> Key: SPARK-32074
> URL: https://issues.apache.org/jira/browse/SPARK-32074
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> We should test R 4.0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32074) Update AppVeyor R to 4.0.2

2020-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142915#comment-17142915
 ] 

Apache Spark commented on SPARK-32074:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28909

> Update AppVeyor R to 4.0.2
> --
>
> Key: SPARK-32074
> URL: https://issues.apache.org/jira/browse/SPARK-32074
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should test R 4.0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32074) Update AppVeyor R to 4.0.2

2020-06-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142914#comment-17142914
 ] 

Apache Spark commented on SPARK-32074:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28909

> Update AppVeyor R to 4.0.2
> --
>
> Key: SPARK-32074
> URL: https://issues.apache.org/jira/browse/SPARK-32074
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should test R 4.0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32074) Update AppVeyor R to 4.0.2

2020-06-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32074:


Assignee: (was: Apache Spark)

> Update AppVeyor R to 4.0.2
> --
>
> Key: SPARK-32074
> URL: https://issues.apache.org/jira/browse/SPARK-32074
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should test R 4.0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32074) Update AppVeyor R to 4.0.2

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32074:
-
Summary: Update AppVeyor R to 4.0.2  (was: Update AppVeyor R to 4.0.1)

> Update AppVeyor R to 4.0.2
> --
>
> Key: SPARK-32074
> URL: https://issues.apache.org/jira/browse/SPARK-32074
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should test R 4.0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32051) Dataset.foreachPartition returns object

2020-06-23 Thread Frank Oosterhuis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Oosterhuis updated SPARK-32051:
-
Fix Version/s: 3.1.0

> Dataset.foreachPartition returns object
> ---
>
> Key: SPARK-32051
> URL: https://issues.apache.org/jira/browse/SPARK-32051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Frank Oosterhuis
>Priority: Critical
> Fix For: 3.0.1, 3.1.0
>
>
> I'm trying to map values from the Dataset[Row], but since 3.0.0 this fails.
> In 3.0.0 I'm dealing with an error: "Error:(28, 38) value map is not a member 
> of Object"
>  
> This is the simplest code that works in 2.4.x, but fails in 3.0.0:
> {code:scala}
> spark.range(100)
>   .repartition(10)
>   .foreachPartition(part => println(part.toList))
> {code}
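
A possible workaround on 3.0.0 (an assumption based on the overload ambiguity 
between the Scala and Java foreachPartition signatures, not something 
confirmed in this ticket) is to annotate the lambda's parameter type so the 
Scala overload is selected:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("foreachPartition-workaround").getOrCreate()

// Giving the parameter an explicit Iterator type disambiguates between
// foreachPartition(f: Iterator[T] => Unit) and
// foreachPartition(func: ForeachPartitionFunction[T]).
spark.range(100)
  .repartition(10)
  .foreachPartition((part: Iterator[java.lang.Long]) => println(part.toList))
{code}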



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector

2020-06-23 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142911#comment-17142911
 ] 

Gabor Somogyi commented on SPARK-32001:
---

Yeah, developer's API for sure. My plan is to load providers w/ the service 
loader, and a new provider will be loaded automatically if it's registered w/ 
an appropriate META-INF.services entry.
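
For illustration, service-loader based discovery would look roughly like this 
(a sketch; JdbcConnectionProvider is a hypothetical stand-in for whatever the 
final trait ends up being called):

{code:scala}
package com.example

import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Hypothetical SPI that external jars implement.
trait JdbcConnectionProvider {
  def name: String
}

object JdbcConnectionProviderRegistry {
  // ServiceLoader scans every META-INF/services/com.example.JdbcConnectionProvider
  // file on the classpath, so a provider added via --jars is picked up
  // automatically, with no code change in Spark.
  lazy val providers: Seq[JdbcConnectionProvider] =
    ServiceLoader.load(classOf[JdbcConnectionProvider]).asScala.toSeq
}
{code}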

> Create Kerberos authentication provider API in JDBC connector
> -
>
> Key: SPARK-32001
> URL: https://issues.apache.org/jira/browse/SPARK-32001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Adding an embedded provider for every possible database would generate a 
> high maintenance cost on the Spark side.
> Instead, an API can be introduced which would allow further providers to be 
> implemented independently.
> One important requirement I suggest is: JDBC connection providers must be 
> loaded independently, just like delegation token providers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32074) Update AppVeyor R to 4.0.1

2020-06-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32074:
-
Summary: Update AppVeyor R to 4.0.1  (was: Update AppVeyor R and Rtools to 
4.0.1)

> Update AppVeyor R to 4.0.1
> --
>
> Key: SPARK-32074
> URL: https://issues.apache.org/jira/browse/SPARK-32074
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should test R 4.0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32051) Dataset.foreachPartition returns object

2020-06-23 Thread Frank Oosterhuis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Oosterhuis updated SPARK-32051:
-
Fix Version/s: 3.0.1

> Dataset.foreachPartition returns object
> ---
>
> Key: SPARK-32051
> URL: https://issues.apache.org/jira/browse/SPARK-32051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Frank Oosterhuis
>Priority: Critical
> Fix For: 3.0.1
>
>
> I'm trying to map values from the Dataset[Row], but since 3.0.0 this fails.
> In 3.0.0 I'm dealing with an error: "Error:(28, 38) value map is not a member 
> of Object"
>  
> This is the simplest code that works in 2.4.x, but fails in 3.0.0:
> {code:scala}
> spark.range(100)
>   .repartition(10)
>   .foreachPartition(part => println(part.toList))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


