[jira] [Resolved] (SPARK-27514) Empty window expression results in error in optimizer
[ https://issues.apache.org/jira/browse/SPARK-27514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27514. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24411 [https://github.com/apache/spark/pull/24411] > Empty window expression results in error in optimizer > - > > Key: SPARK-27514 > URL: https://issues.apache.org/jira/browse/SPARK-27514 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yifei Huang >Assignee: Yifei Huang >Priority: Major > Fix For: 3.0.0 > > > Currently, the optimizer will break on the following code: > {code:java} > val schema = StructType(Seq( > StructField("colA", StringType, true), > StructField("colB", IntegerType, true) > )) > var df = sqlContext.sparkSession.createDataFrame(new util.ArrayList[Row](), > schema) > val w = Window.partitionBy("colA") > df = df.withColumn("col1", sum("colB").over(w)) > df = df.withColumn("col3", sum("colB").over(w)) > df = df.withColumn("col4", sum("col3").over(w)) > df = df.withColumn("col2", sum("col1").over(w)) > df = df.select("col2") > df.explain(true) > {code} > with the following stacktrace: > {code:java} > next on empty iterator > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) > at scala.collection.IterableLike$class.head(IterableLike.scala:107) > at > scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:48) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) > at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:48) > at > org.apache.spark.sql.catalyst.optimizer.CollapseWindow$$anonfun$apply$15.applyOrElse(Optimizer.scala:803) > at > 
org.apache.spark.sql.catalyst.optimizer.CollapseWindow$$anonfun$apply$15.applyOrElse(Optimizer.scala:798) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:282) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:282) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:281) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158) > at > org.apache.spark.sql.catalyst.plans.logical.Log
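The stacktrace above bottoms out in `ArrayBuffer.head` inside `CollapseWindow`: the rule takes the first window expression of a node without first checking that the node has any. A minimal plain-Python sketch of that failure mode and the guard (the helper name and list representation here are hypothetical illustrations, not Spark's actual `Optimizer.scala` code):

```python
# Hypothetical sketch: collapsing two adjacent "window" nodes is only safe
# when both actually carry window expressions. Taking the head of an empty
# list is exactly what raised NoSuchElementException in the report above.

def collapse_windows(parent_exprs, child_exprs):
    """Merge child window expressions into the parent, guarding empties."""
    # Guard: an empty expression list has no head to inspect, so skip the
    # collapse and leave the plan unchanged instead of failing.
    if not parent_exprs or not child_exprs:
        return None
    return child_exprs + parent_exprs

# An empty parent (e.g. all its window columns were pruned away by a
# select) no longer crashes the rewrite:
assert collapse_windows([], ["sum(colB)"]) is None
assert collapse_windows(["sum(col1)"], ["sum(colB)"]) == ["sum(colB)", "sum(col1)"]
```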
[jira] [Assigned] (SPARK-27514) Empty window expression results in error in optimizer
[ https://issues.apache.org/jira/browse/SPARK-27514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27514: --- Assignee: Yifei Huang > Empty window expression results in error in optimizer > - > > Key: SPARK-27514 > URL: https://issues.apache.org/jira/browse/SPARK-27514 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yifei Huang >Assignee: Yifei Huang >Priority: Major
[jira] [Comment Edited] (SPARK-27367) Faster RoaringBitmap Serialization with v0.8.0
[ https://issues.apache.org/jira/browse/SPARK-27367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821653#comment-16821653 ] Liang-Chi Hsieh edited comment on SPARK-27367 at 4/19/19 4:32 AM: -- I do upgrade it in local. But seems the performance improvement isn't so obvious. Maybe the optimization is only significant on larger bitmap. I'm not sure if in Spark we will have large bitmap that can take advantage of this optimization. I compare 0.7.45 (used in current Spark) and 0.8.1 (latest release), except for serde to bytebuffer, I didn't see other noticeable commits. So, do we still want to upgrade to 0.8.1? If so, I can make a PR. was (Author: viirya): I do upgrade it in local. But seems the performance improvement isn't so obvious. Maybe the optimization is only significant on larger bitmap. I'm not sure if in Spark we will have large bitmap that can take advantage of this optimization. I compare 0.7.45 (used in current master) and 0.8.1 (latest release), except for serde to bytebuffer, I didn't see other noticeable commits. So, do we still want to upgrade to 0.8.1? If so, I can make a PR. > Faster RoaringBitmap Serialization with v0.8.0 > -- > > Key: SPARK-27367 > URL: https://issues.apache.org/jira/browse/SPARK-27367 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > RoaringBitmap 0.8.0 adds faster serde, but also requires us to change how we > call the serde routines slightly to take advantage of it. This is probably a > worthwhile optimization as the every shuffle map task with a large # of > partitions generates these bitmaps, and the driver especially has to > deserialize many of these messages. 
> See > * https://github.com/apache/spark/pull/24264#issuecomment-479675572 > * https://github.com/RoaringBitmap/RoaringBitmap/pull/325 > * https://github.com/RoaringBitmap/RoaringBitmap/issues/319 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27367) Faster RoaringBitmap Serialization with v0.8.0
[ https://issues.apache.org/jira/browse/SPARK-27367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821653#comment-16821653 ] Liang-Chi Hsieh commented on SPARK-27367: - I do upgrade it in local. But seems the performance improvement isn't so obvious. Maybe the optimization is only significant on larger bitmap. I'm not sure if in Spark we will have large bitmap that can take advantage of this optimization. I compare 0.7.45 (used in current master) and 0.8.1 (latest release), except for serde to bytebuffer, I didn't see other noticeable commits. So, do we still want to upgrade to 0.8.1? If so, I can make a PR.
[jira] [Updated] (SPARK-27518) Show statistics in Optimized Logical Plan in the "Details" On SparkSQL ui page when CBO is enabled
[ https://issues.apache.org/jira/browse/SPARK-27518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peng bo updated SPARK-27518: Attachment: SPARK-27518-1.jpg > Show statistics in Optimized Logical Plan in the "Details" On SparkSQL ui > page when CBO is enabled > -- > > Key: SPARK-27518 > URL: https://issues.apache.org/jira/browse/SPARK-27518 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: peng bo >Priority: Major > Attachments: SPARK-27518-1.jpg > > > {{Statistics}} snapshot info with current query is really helpful to find out > why the query runs slowly, especially caused by bad rejoin order when {{CBO}} > is enabled. > This issue is to show statistics in Optimized Logical Plan in the "Details" > On SparkSQL ui page when CBO is enabled.
[jira] [Created] (SPARK-27518) Show statistics in Optimized Logical Plan in the "Details" On SparkSQL ui page when CBO is enabled
peng bo created SPARK-27518: --- Summary: Show statistics in Optimized Logical Plan in the "Details" On SparkSQL ui page when CBO is enabled Key: SPARK-27518 URL: https://issues.apache.org/jira/browse/SPARK-27518 Project: Spark Issue Type: Improvement Components: SQL, Web UI Affects Versions: 3.0.0 Reporter: peng bo
[jira] [Commented] (SPARK-22044) explain function with codegen and cost parameters
[ https://issues.apache.org/jira/browse/SPARK-22044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821645#comment-16821645 ] Huon Wilson commented on SPARK-22044: - I think this would be great, since the current ways to do it are moderately annoying, and the mismatch with the direct {{EXPLAIN CODEGEN}} and {{EXPLAIN COST}} in SQL is a bit jarring/unexpected. * For {{codegen}}, there's the work-around of using {{df.queryExecution.debug.codegen}} in Scala, but this is somewhat awkward to use from pyspark ({{df._jdf.queryExecution().debug().codegen()}}, which doesn't use Python's {{stdout}} for printing, and so can't be captured easily, if required), and very awkward for sparkR (I believe {{invisible(sparkR.callJMethod(sparkR.callJMethod(sparkR.callJMethod(df@sdf, "queryExecution"), "debug"), "codegen"))}}, but again, cannot be captured via {{capture.output}} easily). * For {{cost}}, there's a similar work around of using {{df.queryExecution.stringWithStats}}, but this has the same awkwardness as {{codegen}} for calling from pyspark and sparkR. > explain function with codegen and cost parameters > - > > Key: SPARK-22044 > URL: https://issues.apache.org/jira/browse/SPARK-22044 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > {{explain}} operator creates {{ExplainCommand}} runnable command that accepts > (among other things) {{codegen}} and {{cost}} arguments. > There's no version of {{explain}} to allow for this. That's however possible > using SQL which is kind of surprising (given how much focus is devoted to the > Dataset API). > This is to have another {{explain}} with {{codegen}} and {{cost}} arguments, > i.e. 
> {code} > def explain(codegen: Boolean = false, cost: Boolean = false): Unit > {code}
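The boolean-flag API proposed above corresponds one-to-one with the SQL `EXPLAIN CODEGEN` / `EXPLAIN COST` variants the comment mentions. A rough Python sketch of that mapping (the helper name and dispatch are hypothetical illustrations, not Spark API):

```python
def explain_statement(query, codegen=False, cost=False):
    """Build the SQL EXPLAIN statement corresponding to the proposed flags."""
    if codegen and cost:
        # EXPLAIN accepts at most one of CODEGEN / COST at a time.
        raise ValueError("choose either codegen or cost, not both")
    keyword = "CODEGEN " if codegen else "COST " if cost else ""
    return f"EXPLAIN {keyword}{query}"

print(explain_statement("SELECT * FROM t", cost=True))  # EXPLAIN COST SELECT * FROM t
```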
[jira] [Commented] (SPARK-27465) Kafka Client 0.11.0.0 is not Supporting the kafkatestutils package
[ https://issues.apache.org/jira/browse/SPARK-27465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821639#comment-16821639 ] Praveen commented on SPARK-27465: - Hi Shahid, Can you please let me know if you have any update on this issue? > Kafka Client 0.11.0.0 is not Supporting the kafkatestutils package > -- > > Key: SPARK-27465 > URL: https://issues.apache.org/jira/browse/SPARK-27465 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1 >Reporter: Praveen >Priority: Critical > > Hi Team, > We are getting the below exceptions with Kafka Client Version 0.11.0.0 for > KafkaTestUtils Package. But its working fine when we use the Kafka Client > Version 0.10.0.1. Please suggest the way forwards. We are using the package " > org.apache.spark.streaming.kafka010.KafkaTestUtils;" > And the Spark Streaming Version is 2.2.3 and above. > > ERROR: > java.lang.NoSuchMethodError: > kafka.server.KafkaServer$.$lessinit$greater$default$2()Lkafka/utils/Time; > at > org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:110) > at > org.apache.spark.streaming.kafka010.KafkaTestUtils$$anonfun$setupEmbeddedKafkaServer$2.apply(KafkaTestUtils.scala:107) > at > org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2234) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) > at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2226) > at > org.apache.spark.streaming.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:107) > at > org.apache.spark.streaming.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:122) > at > com.netcracker.rms.smart.esp.ESPTestEnv.prepareKafkaTestUtils(ESPTestEnv.java:203) > at com.netcracker.rms.smart.esp.ESPTestEnv.setUp(ESPTestEnv.java:157) > at > 
com.netcracker.rms.smart.esp.TestEventStreamProcessor.setUp(TestEventStreamProcessor.java:58)
[jira] [Updated] (SPARK-27517) python.PythonRDD: Error while sending iterator
[ https://issues.apache.org/jira/browse/SPARK-27517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lucasysfeng updated SPARK-27517: Attachment: spark.stderr > python.PythonRDD: Error while sending iterator > --- > > Key: SPARK-27517 > URL: https://issues.apache.org/jira/browse/SPARK-27517 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: lucasysfeng >Priority: Major > Attachments: spark.stderr > > > when using the collect function, it occasionally throws the exception below: > ERROR python.PythonRDD: Error while sending iterator > java.net.SocketTimeoutException: Accept timed out > at java.net.PlainSocketImpl.socketAccept(Native Method) > at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409) > at java.net.ServerSocket.implAccept(ServerSocket.java:545) > at java.net.ServerSocket.accept(ServerSocket.java:513) > at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:702)
[jira] [Created] (SPARK-27517) python.PythonRDD: Error while sending iterator
lucasysfeng created SPARK-27517: --- Summary: python.PythonRDD: Error while sending iterator Key: SPARK-27517 URL: https://issues.apache.org/jira/browse/SPARK-27517 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0 Reporter: lucasysfeng
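The "Accept timed out" in SPARK-27517 comes from a server socket whose accept call gives up before the peer connects. A plain-Python analogue of that failure mode (an illustration only, not Spark's `PythonRDD` code):

```python
import socket

# A listening socket with a short accept timeout: if no client connects in
# time, accept() raises socket.timeout -- the plain-Python analogue of the
# java.net.SocketTimeoutException ("Accept timed out") in the report above.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # bind to an ephemeral port
server.listen(1)
server.settimeout(0.1)          # 100 ms, deliberately too short

try:
    server.accept()
    timed_out = False
except socket.timeout:
    timed_out = True
finally:
    server.close()

assert timed_out  # no client ever connected, so accept() must time out
```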
[jira] [Updated] (SPARK-27516) java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
[ https://issues.apache.org/jira/browse/SPARK-27516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lucasysfeng updated SPARK-27516: Attachment: driver_gc > java.util.concurrent.TimeoutException: Futures timed out after [100000 > milliseconds] > > > Key: SPARK-27516 > URL: https://issues.apache.org/jira/browse/SPARK-27516 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 > Environment: linux > YARN cluster mode >Reporter: lucasysfeng >Priority: Minor > Attachments: driver_gc, spark.stderr > > > > {code:java} > #! /usr/bin/env python > # -*- coding: utf-8 -*- > from pyspark import SparkContext > from pyspark.sql import SparkSession > if __name__ == '__main__': > spark = SparkSession.builder.appName('sparktest').getOrCreate() > # Other code is omitted below > {code} > > *The code is simple, but occasionally throws the following exception:* > 19/04/15 21:30:00 ERROR yarn.ApplicationMaster: Uncaught exception: > java.util.concurrent.TimeoutException: Futures timed out after [100000 > milliseconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201) > at > org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:400) > at > org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:253) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:771) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1743) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:769) > at > org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) > > I know spark.yarn.am.waitTime can increase the sparkcontext initialization > time. > Why does SparkContext initialization take so long?
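The timeout above is raised while the ApplicationMaster waits a bounded time (`spark.yarn.am.waitTime`, here 100000 ms) for the user's SparkContext to come up. A plain-Python analogue of that bounded wait, using the standard library rather than Spark's `ThreadUtils.awaitResult` (an illustration only):

```python
import concurrent.futures
import time

# Wait a bounded time for a result from a worker; if the worker is too slow,
# a TimeoutError surfaces -- analogous to "Futures timed out after
# [100000 milliseconds]" when SparkContext initialization outruns the limit.
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(time.sleep, 1.0)  # stands in for slow driver startup
    try:
        fut.result(timeout=0.05)        # the bounded wait
        timed_out = False
    except concurrent.futures.TimeoutError:
        timed_out = True

assert timed_out  # the 1.0 s "startup" cannot finish within 0.05 s
```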
[jira] [Updated] (SPARK-27516) java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
[ https://issues.apache.org/jira/browse/SPARK-27516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lucasysfeng updated SPARK-27516: Attachment: spark.stderr
[jira] [Updated] (SPARK-27516) java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
[ https://issues.apache.org/jira/browse/SPARK-27516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lucasysfeng updated SPARK-27516: Description: {code:java} #! /usr/bin/env python # -*- coding: utf-8 -*- from pyspark import SparkContext from pyspark.sql import SparkSession if __name__ == '__main__': spark = SparkSession.builder.appName('sparktest').getOrCreate() # Other code is omitted below {code} *The code is simple, but occasionally throws the following exception:* 19/04/15 21:30:00 ERROR yarn.ApplicationMaster: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201) at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:400) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:253) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:771) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1743) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:769) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) I know spark.yarn.am.waitTime can increase the sparkcontext initialization time. Why does SparkContext initialization take so long?
[jira] [Created] (SPARK-27516) java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
lucasysfeng created SPARK-27516: --- Summary: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds] Key: SPARK-27516 URL: https://issues.apache.org/jira/browse/SPARK-27516 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0 Environment: linux YARN cluster mode Reporter: lucasysfeng {code:java} #! /usr/bin/env python # -*- coding: utf-8 -*- from pyspark import SparkContext from pyspark.sql import SparkSession if __name__ == '__main__': spark = SparkSession.builder.appName('sparktest').getOrCreate() # Other code is omitted below {code} *The code is simple, but occasionally throws the following exception:* 19/04/15 21:30:00 ERROR yarn.ApplicationMaster: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201) at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:400) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:253) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:771) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1743) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:769) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) I know spark.yarn.am.waitTime can increase how long the ApplicationMaster waits for SparkContext
initialization. Why does SparkContext initialization take so long? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
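As a workaround for the timeout above, the ApplicationMaster's wait can be lengthened when submitting in YARN cluster mode. A sketch only: the value and the application file name are illustrative, not from the report.

```shell
# Raise how long the YARN ApplicationMaster waits for SparkContext
# initialization (default 100s); 300s and app.py are illustrative.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.am.waitTime=300s \
  app.py
```

This only widens the window; it does not explain why initialization was slow in the first place (e.g. resource contention or slow driver startup), which is the reporter's actual question.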
[jira] [Reopened] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.6
[ https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp reopened SPARK-25079: - going to reopen this until i'm done w/the branch-2.3 and -2.4 python deployment. > [PYTHON] upgrade python 3.4 -> 3.6 > -- > > Key: SPARK-25079 > URL: https://issues.apache.org/jira/browse/SPARK-25079 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark >Affects Versions: 2.3.1 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Fix For: 3.0.0 > > > for the impending arrow upgrade > (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python > 3.4 -> 3.5. > i have been testing this here: > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69] > my methodology: > 1) upgrade python + arrow to 3.5 and 0.10.0 > 2) run python tests > 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and > upgrade centos workers to python3.5 > 4) simultaneously do the following: > - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that > points to python3.5 (this is currently being tested here: > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)] > - push a change to python/run-tests.py replacing 3.4 with 3.5 > 5) once the python3.5 change to run-tests.py is merged, we will need to > back-port this to all existing branches > 6) then and only then can i remove the python3.4 -> python3.5 symlink -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27515) [Deploy] When application master retry after a long time running, the hdfs delegation token may be expired
[ https://issues.apache.org/jira/browse/SPARK-27515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-27515: Description: When submit a spark yarn application, we first create a container launch context and store the relative tokens. And for each attempt of applicationMaster, it would transfer origin tokens for connecting to yarn . However, it also transfer origin hdfs delegation tokens. For a spark streaming application, if its applicationMaster failed when it has run for a long duration. The hdfs token stored in container launch context may be expired. When the new attempt applicationMaster prepareLocalResources, it would access the hdfs and failed for token expired. This error occured when we rolling upgrading our cluster. was: When submit a spark yarn application, we first create a container launch context and store the relative tokens. And for each attempt of applicationMaster, it would transfer origin tokens. However, it also transfer origin hdfs delegation tokens. For a spark streaming application, if its applicationMaster failed when it has run for a long duration. The hdfs token stored in container launch context may be expired. When the new attempt applicationMaster prepareLocalResources, it would access the hdfs and failed for token expired. This error occured when we rolling upgrading our cluster. > [Deploy] When application master retry after a long time running, the hdfs > delegation token may be expired > -- > > Key: SPARK-27515 > URL: https://issues.apache.org/jira/browse/SPARK-27515 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.2 >Reporter: feiwang >Priority: Major > > When submit a spark yarn application, we first create a container launch > context and store the relative tokens. > And for each attempt of applicationMaster, it would transfer origin tokens > for connecting to yarn . > However, it also transfer origin hdfs delegation tokens. 
> For a spark streaming application, if its applicationMaster failed when it > has run for a long duration. > The hdfs token stored in container launch context may be expired. > When the new attempt applicationMaster prepareLocalResources, it would access > the hdfs and failed for token expired. > This error occured when we rolling upgrading our cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27515) [Deploy] When application master retry after a long time running, the hdfs delegation token may be expired
[ https://issues.apache.org/jira/browse/SPARK-27515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-27515: Description: When submit a spark yarn application, we first create a container launch context and store the relative tokens. And for each attempt of applicationMaster, it would transfer origin tokens for connecting to yarn. However, it also transfer origin hdfs delegation tokens. For a spark streaming application, if its applicationMaster failed when it has run for a long duration. The hdfs token stored in container launch context may be expired. When the new attempt applicationMaster prepareLocalResources, it would access the hdfs and failed for token expired. This error occured when we rolling upgrading our cluster. was: When submit a spark yarn application, we first create a container launch context and store the relative tokens. And for each attempt of applicationMaster, it would transfer origin tokens for connecting to yarn . However, it also transfer origin hdfs delegation tokens. For a spark streaming application, if its applicationMaster failed when it has run for a long duration. The hdfs token stored in container launch context may be expired. When the new attempt applicationMaster prepareLocalResources, it would access the hdfs and failed for token expired. This error occured when we rolling upgrading our cluster. > [Deploy] When application master retry after a long time running, the hdfs > delegation token may be expired > -- > > Key: SPARK-27515 > URL: https://issues.apache.org/jira/browse/SPARK-27515 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.2 >Reporter: feiwang >Priority: Major > > When submit a spark yarn application, we first create a container launch > context and store the relative tokens. > And for each attempt of applicationMaster, it would transfer origin tokens > for connecting to yarn. > However, it also transfer origin hdfs delegation tokens. 
> For a spark streaming application, if its applicationMaster failed when it > has run for a long duration. > The hdfs token stored in container launch context may be expired. > When the new attempt applicationMaster prepareLocalResources, it would access > the hdfs and failed for token expired. > This error occured when we rolling upgrading our cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27515) [Deploy] When application master retry after a long time running, the hdfs delegation token may be expired
[ https://issues.apache.org/jira/browse/SPARK-27515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-27515: Description: When submit a spark yarn application, we first create a container launch context and store the relative tokens. And for each attempt of applicationMaster, it would transfer origin tokens. However, it also transfer origin hdfs delegation tokens. For a spark streaming application, if its applicationMaster failed when it has run for a long duration. The hdfs token stored in container launch context may be expired. When the new attempt applicationMaster prepareLocalResources, it would access the hdfs and failed for token expired. This error occured when we rolling upgrading our cluster. was: When submit a spark yarn application, we first create a container launch context and store the relative tokens. And for each attempt of applicationMaster, it would transfer origin tokens to connect yarn. However, it also transfer origin hdfs delegation tokens. For a spark streaming application, if its applicationMaster failed when it has run for a long duration. The hdfs token stored in container launch context may be expired. When the new attempt applicationMaster prepareLocalResources, it would access the hdfs and failed for token expired. This error occured when we rolling upgrading our cluster. > [Deploy] When application master retry after a long time running, the hdfs > delegation token may be expired > -- > > Key: SPARK-27515 > URL: https://issues.apache.org/jira/browse/SPARK-27515 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.2 >Reporter: feiwang >Priority: Major > > When submit a spark yarn application, we first create a container launch > context and store the relative tokens. > And for each attempt of applicationMaster, it would transfer origin tokens. > However, it also transfer origin hdfs delegation tokens. 
> For a spark streaming application, if its applicationMaster failed when it > has run for a long duration. > The hdfs token stored in container launch context may be expired. > When the new attempt applicationMaster prepareLocalResources, it would access > the hdfs and failed for token expired. > This error occured when we rolling upgrading our cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27515) [Deploy] When application master retry after a long time running, the hdfs delegation token may be expired
[ https://issues.apache.org/jira/browse/SPARK-27515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-27515: Description: When submit a spark yarn application, we first create a container launch context and store the relative tokens. And for each attempt of applicationMaster, it would transfer origin tokens to connect yarn. However, it also transfer origin hdfs delegation tokens. For a spark streaming application, if its applicationMaster failed when it has run for a long duration. The hdfs token stored in container launch context may be expired. When the new attempt applicationMaster prepareLocalResources, it would access the hdfs and failed for token expired. This error occured when we rolling upgrading our cluster. > [Deploy] When application master retry after a long time running, the hdfs > delegation token may be expired > -- > > Key: SPARK-27515 > URL: https://issues.apache.org/jira/browse/SPARK-27515 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.2 >Reporter: feiwang >Priority: Major > > When submit a spark yarn application, we first create a container launch > context and store the relative tokens. > And for each attempt of applicationMaster, it would transfer origin tokens to > connect yarn. > However, it also transfer origin hdfs delegation tokens. > For a spark streaming application, if its applicationMaster failed when it > has run for a long duration. > The hdfs token stored in container launch context may be expired. > When the new attempt applicationMaster prepareLocalResources, it would access > the hdfs and failed for token expired. > This error occured when we rolling upgrading our cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27515) [Deploy] When application master retry after a long time running, the hdfs delegation token may be expired
[ https://issues.apache.org/jira/browse/SPARK-27515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27515. Resolution: Duplicate > [Deploy] When application master retry after a long time running, the hdfs > delegation token may be expired > -- > > Key: SPARK-27515 > URL: https://issues.apache.org/jira/browse/SPARK-27515 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.2 >Reporter: feiwang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27515) [Deploy] When application master retry after a long time running, the hdfs delegation token may be expired
feiwang created SPARK-27515: --- Summary: [Deploy] When application master retry after a long time running, the hdfs delegation token may be expired Key: SPARK-27515 URL: https://issues.apache.org/jira/browse/SPARK-27515 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 2.3.2 Reporter: feiwang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
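A common mitigation for the token-expiry problem described above is to let Spark re-obtain HDFS delegation tokens itself from a keytab, instead of relying on the tokens captured in the container launch context at submit time. A hedged sketch; the principal, keytab path, and application name are hypothetical:

```shell
# With --principal/--keytab, Spark periodically logs in and re-creates
# delegation tokens, so a late AM retry does not reuse expired ones.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal user@EXAMPLE.COM \
  --keytab /etc/security/keytabs/user.keytab \
  streaming_app.py
```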
[jira] [Resolved] (SPARK-27501) Add test for HIVE-13083: Writing HiveDecimal to ORC can wrongly suppress present stream
[ https://issues.apache.org/jira/browse/SPARK-27501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27501. -- Resolution: Fixed Fix Version/s: 3.0.0 Fixed in https://github.com/apache/spark/pull/24397 > Add test for HIVE-13083: Writing HiveDecimal to ORC can wrongly suppress > present stream > --- > > Key: SPARK-27501 > URL: https://issues.apache.org/jira/browse/SPARK-27501 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27501) Add test for HIVE-13083: Writing HiveDecimal to ORC can wrongly suppress present stream
[ https://issues.apache.org/jira/browse/SPARK-27501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-27501: Assignee: Yuming Wang > Add test for HIVE-13083: Writing HiveDecimal to ORC can wrongly suppress > present stream > --- > > Key: SPARK-27501 > URL: https://issues.apache.org/jira/browse/SPARK-27501 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.6
[ https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25079. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24266 [https://github.com/apache/spark/pull/24266] > [PYTHON] upgrade python 3.4 -> 3.6 > -- > > Key: SPARK-25079 > URL: https://issues.apache.org/jira/browse/SPARK-25079 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark >Affects Versions: 2.3.1 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Fix For: 3.0.0 > > > for the impending arrow upgrade > (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python > 3.4 -> 3.5. > i have been testing this here: > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69] > my methodology: > 1) upgrade python + arrow to 3.5 and 0.10.0 > 2) run python tests > 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and > upgrade centos workers to python3.5 > 4) simultaneously do the following: > - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that > points to python3.5 (this is currently being tested here: > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)] > - push a change to python/run-tests.py replacing 3.4 with 3.5 > 5) once the python3.5 change to run-tests.py is merged, we will need to > back-port this to all existing branches > 6) then and only then can i remove the python3.4 -> python3.5 symlink -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821557#comment-16821557 ] Xiao Li commented on SPARK-15348: - Please follow the announcement of the upcoming Spark+AI summit. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of hive's transactional tables, > you cannot use spark to delete/update a table and it also has problems > reading the aggregated data when no compaction was done. > Also it seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27514) Empty window expression results in error in optimizer
[ https://issues.apache.org/jira/browse/SPARK-27514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821552#comment-16821552 ] Yifei Huang commented on SPARK-27514: - [~cloud_fan] [~dongjoon] > Empty window expression results in error in optimizer > - > > Key: SPARK-27514 > URL: https://issues.apache.org/jira/browse/SPARK-27514 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yifei Huang >Priority: Major > > Currently, the optimizer will break on the following code: > {code:java} > val schema = StructType(Seq( > StructField("colA", StringType, true), > StructField("colB", IntegerType, true) > )) > var df = sqlContext.sparkSession.createDataFrame(new util.ArrayList[Row](), > schema) > val w = Window.partitionBy("colA") > df = df.withColumn("col1", sum("colB").over(w)) > df = df.withColumn("col3", sum("colB").over(w)) > df = df.withColumn("col4", sum("col3").over(w)) > df = df.withColumn("col2", sum("col1").over(w)) > df = df.select("col2") > df.explain(true) > {code} > with the following stacktrace: > {code:java} > next on empty iterator > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) > at scala.collection.IterableLike$class.head(IterableLike.scala:107) > at > scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:48) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) > at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:48) > at > org.apache.spark.sql.catalyst.optimizer.CollapseWindow$$anonfun$apply$15.applyOrElse(Optimizer.scala:803) > at > org.apache.spark.sql.catalyst.optimizer.CollapseWindow$$anonfun$apply$15.applyOrElse(Optimizer.scala:798) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:282) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:282) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:281) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformU
[jira] [Created] (SPARK-27514) Empty window expression results in error in optimizer
Yifei Huang created SPARK-27514: --- Summary: Empty window expression results in error in optimizer Key: SPARK-27514 URL: https://issues.apache.org/jira/browse/SPARK-27514 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yifei Huang Currently, the optimizer will break on the following code: {code:java} val schema = StructType(Seq( StructField("colA", StringType, true), StructField("colB", IntegerType, true) )) var df = sqlContext.sparkSession.createDataFrame(new util.ArrayList[Row](), schema) val w = Window.partitionBy("colA") df = df.withColumn("col1", sum("colB").over(w)) df = df.withColumn("col3", sum("colB").over(w)) df = df.withColumn("col4", sum("col3").over(w)) df = df.withColumn("col2", sum("col1").over(w)) df = df.select("col2") df.explain(true) {code} with the following stacktrace: {code:java} next on empty iterator java.util.NoSuchElementException: next on empty iterator at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) at scala.collection.IterableLike$class.head(IterableLike.scala:107) at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:48) at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:48) at org.apache.spark.sql.catalyst.optimizer.CollapseWindow$$anonfun$apply$15.applyOrElse(Optimizer.scala:803) at org.apache.spark.sql.catalyst.optimizer.CollapseWindow$$anonfun$apply$15.applyOrElse(Optimizer.scala:798) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:282) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:282) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:281) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:279) at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformUp(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformUp(AnalysisHelper.scala:158) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformUp(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:330) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:191) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:328) at org.apache.spa
[jira] [Commented] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue
[ https://issues.apache.org/jira/browse/SPARK-27068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821450#comment-16821450 ] Sean Owen commented on SPARK-27068: --- I dunno, is this common that you want to research a job from 1000 jobs ago? This is also what the history server's output is for. > Support failed jobs ui and completed jobs ui use different queue > > > Key: SPARK-27068 > URL: https://issues.apache.org/jira/browse/SPARK-27068 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.0 >Reporter: zhoukang >Priority: Major > > For some long running jobs,we may want to check out the cause of some failed > jobs. > But most jobs has completed and failed jobs ui may disappear, we can use > different queue for this two kinds of jobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
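Short of the separate-queue feature requested above, one existing knob for long-running applications is to retain more job history in the live UI so that older failed jobs stay inspectable. A sketch; the limits and application name are illustrative:

```shell
# Keep more completed/failed jobs and stages in the live web UI
# (defaults are 1000 each); 5000 and app.py are illustrative.
spark-submit \
  --conf spark.ui.retainedJobs=5000 \
  --conf spark.ui.retainedStages=5000 \
  app.py
```

As the comment notes, the history server remains the durable record once entries are evicted from the live UI.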
[jira] [Updated] (SPARK-27513) Spark tarball with binaries should have files owned by uid 0
[ https://issues.apache.org/jira/browse/SPARK-27513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-27513: -- Description: currently the tarball is created in dev/make-distribution.sh like this: {code:bash} tar czf "spark-$VERSION-bin-$NAME.tgz" -C "$SPARK_HOME" "$TARDIR_NAME" {code} the problem with this is that if root unpacks this tarball the files are owned by whatever the uid is of the person that created the tarball. this uid probably doesnt exist or belongs to a different unrelated user. this is unlikely to be what anyone wants. for other users this problem doesnt exist since tar is not allowed to change uid. so when they unpack the tarball the files are owned by them. it is more typical to set the uid and gid to 0 for a tarball. that way when root unpacks it the files are owned by root. so like this: {code:bash} tar czf "spark-$VERSION-bin-$NAME.tgz" --numeric-owner --owner=0 --group=0 -C "$SPARK_HOME" "$TARDIR_NAME" {code} was: currently the tarball is created in dev/make-distribution.sh like this: {code:bash} tar czf "spark-$VERSION-bin-$NAME.tgz" -C "$SPARK_HOME" "$TARDIR_NAME" {code} the problem with this is that if root unpacks this tarball the files are owned by whatever the uid is of the person that created the tarball. this uid probably doesnt exist or belongs to a different unrelated user. this is unlikely to be what anyone wants. for other users this problem doesnt exist since tar is not allowed to change uid. so when they unpack the tarball the files are owned by them. it is more typical to set the uid and gid to 0 for a tarball. that way when root unpacks it the files are owned by root. 
so like this: {code:bash} tar czf "spark-$VERSION-bin-$NAME.tgz" --numeric-owner --owner=0 --group=0 -C "$SPARK_HOME" "$TARDIR_NAME {code} > Spark tarball with binaries should have files owned by uid 0 > > > Key: SPARK-27513 > URL: https://issues.apache.org/jira/browse/SPARK-27513 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.1 >Reporter: koert kuipers >Priority: Minor > Fix For: 3.0.0 > > > currently the tarball is created in dev/make-distribution.sh like this: > {code:bash} > tar czf "spark-$VERSION-bin-$NAME.tgz" -C "$SPARK_HOME" "$TARDIR_NAME" > {code} > the problem with this is that if root unpacks this tarball the files are > owned by whatever the uid is of the person that created the tarball. this uid > probably doesnt exist or belongs to a different unrelated user. this is > unlikely to be what anyone wants. > for other users this problem doesnt exist since tar is now allowed to change > uid. so when they unpack the tarball the files are owned by them. > it is more typical to set the uid and gid to 0 for a tarball. that way when > root unpacks it the files are owned by root. so like this: > {code:bash} > tar czf "spark-$VERSION-bin-$NAME.tgz" --numeric-owner --owner=0 --group=0 -C > "$SPARK_HOME" "$TARDIR_NAME" > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27513) Spark tarball with binaries should have files owned by uid 0
koert kuipers created SPARK-27513: - Summary: Spark tarball with binaries should have files owned by uid 0 Key: SPARK-27513 URL: https://issues.apache.org/jira/browse/SPARK-27513 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.1 Reporter: koert kuipers Fix For: 3.0.0 currently the tarball is created in dev/make-distribution.sh like this: {code:bash} tar czf "spark-$VERSION-bin-$NAME.tgz" -C "$SPARK_HOME" "$TARDIR_NAME" {code} the problem with this is that if root unpacks this tarball the files are owned by whatever the uid is of the person that created the tarball. this uid probably doesn't exist or belongs to a different, unrelated user. this is unlikely to be what anyone wants. for other users this problem doesn't exist since tar is not allowed to change the uid, so when they unpack the tarball the files are owned by them. it is more typical to set the uid and gid to 0 for a tarball. that way when root unpacks it the files are owned by root. so like this: {code:bash} tar czf "spark-$VERSION-bin-$NAME.tgz" --numeric-owner --owner=0 --group=0 -C "$SPARK_HOME" "$TARDIR_NAME" {code}
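The effect of `--numeric-owner --owner=0 --group=0` proposed above can be sketched with Python's standard tarfile module, which lets a filter rewrite each member's ownership before it is written. This is an illustration of the idea, not Spark's build script; the archive member name is made up:

```python
import io
import tarfile

def root_owned(tarinfo):
    # Mirror tar's --numeric-owner --owner=0 --group=0: force every
    # member to uid/gid 0 so that extraction by root yields
    # root-owned files instead of files owned by the builder's uid.
    tarinfo.uid = 0
    tarinfo.gid = 0
    tarinfo.uname = ""
    tarinfo.gname = ""
    return tarinfo

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="spark-dist/README.md")  # hypothetical member
    data = b"placeholder"
    info.size = len(data)
    info.uid = 1000   # simulate the builder's uid leaking into the archive
    info.gid = 1000
    tar.addfile(root_owned(info), io.BytesIO(data))

# Re-open the archive and confirm the member is recorded as uid/gid 0.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    member = tar.getmember("spark-dist/README.md")
    print(member.uid, member.gid)  # 0 0
```

The same check can be done on a real tarball with `tar tzvf`, which lists the recorded owner of each member.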
[jira] [Updated] (SPARK-27512) Decimal parsing leads to unexpected type inference
[ https://issues.apache.org/jira/browse/SPARK-27512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-27512: -- Summary: Decimal parsing leads to unexpected type inference (was: Decimal parsing leading to unexpected type inference) > Decimal parsing leads to unexpected type inference > -- > > Key: SPARK-27512 > URL: https://issues.apache.org/jira/browse/SPARK-27512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark 3.0.0-SNAPSHOT from this commit: > {code:bash} > commit 3ab96d7acf870e53c9016b0b63d0b328eec23bed > Author: Dilip Biswal > Date: Mon Apr 15 21:26:45 2019 +0800 > {code} >Reporter: koert kuipers >Priority: Minor > Fix For: 3.0.0 > > > {code:bash} > $ hadoop fs -text test.bsv > x|y > 1|1,2 > 2|2,3 > 3|3,4 > {code} > in spark 2.4.1: > {code:bash} > scala> val data = spark.read.format("csv").option("header", > true).option("delimiter", "|").option("inferSchema", true).load("test.bsv") > scala> data.printSchema > root > |-- x: integer (nullable = true) > |-- y: string (nullable = true) > scala> data.show > +---+---+ > | x| y| > +---+---+ > | 1|1,2| > | 2|2,3| > | 3|3,4| > +---+---+ > {code} > in spark 3.0.0-SNAPSHOT: > {code:bash} > scala> val data = spark.read.format("csv").option("header", > true).option("delimiter", "|").option("inferSchema", true).load("test.bsv") > scala> data.printSchema > root > |-- x: integer (nullable = true) > |-- y: decimal(2,0) (nullable = true) > scala> data.show > +---+---+ > | x| y| > +---+---+ > | 1| 12| > | 2| 23| > | 3| 34| > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27512) Decimal parsing leading to unexpected type inference
koert kuipers created SPARK-27512: - Summary: Decimal parsing leading to unexpected type inference Key: SPARK-27512 URL: https://issues.apache.org/jira/browse/SPARK-27512 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: spark 3.0.0-SNAPSHOT from this commit: {code:bash} commit 3ab96d7acf870e53c9016b0b63d0b328eec23bed Author: Dilip Biswal Date: Mon Apr 15 21:26:45 2019 +0800 {code} Reporter: koert kuipers Fix For: 3.0.0 {code:bash} $ hadoop fs -text test.bsv x|y 1|1,2 2|2,3 3|3,4 {code} in spark 2.4.1: {code:bash} scala> val data = spark.read.format("csv").option("header", true).option("delimiter", "|").option("inferSchema", true).load("test.bsv") scala> data.printSchema root |-- x: integer (nullable = true) |-- y: string (nullable = true) scala> data.show +---+---+ | x| y| +---+---+ | 1|1,2| | 2|2,3| | 3|3,4| +---+---+ {code} in spark 3.0.0-SNAPSHOT: {code:bash} scala> val data = spark.read.format("csv").option("header", true).option("delimiter", "|").option("inferSchema", true).load("test.bsv") scala> data.printSchema root |-- x: integer (nullable = true) |-- y: decimal(2,0) (nullable = true) scala> data.show +---+---+ | x| y| +---+---+ | 1| 12| | 2| 23| | 3| 34| +---+---+ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
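A plausible mechanism for the regression above (a sketch of the idea only, not Spark's actual inference code path) is a locale-aware decimal parser, in the style of java.text.DecimalFormat, that treats ',' as a grouping separator and therefore accepts "1,2" as the number 12 instead of rejecting the token as non-numeric:

```python
from decimal import Decimal, InvalidOperation

def infer_decimal(token, grouping=","):
    # A parser that honors a grouping separator drops it before
    # converting, so "1,2" silently becomes 12 rather than failing
    # (which is what would keep the column typed as string).
    try:
        return Decimal(token.replace(grouping, ""))
    except InvalidOperation:
        return None  # not a decimal; inference would fall back to string

for raw in ["1,2", "2,3", "3,4"]:
    print(raw, "->", infer_decimal(raw))  # e.g. "1,2 -> 12"
```

Whatever the cause, a workaround is to bypass inference entirely: pass an explicit schema to the reader (`spark.read.schema(...)`) or leave `inferSchema` at false so the column stays a string.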
[jira] [Created] (SPARK-27511) Spark Streaming Driver Memory
Badri Krishnan created SPARK-27511: -- Summary: Spark Streaming Driver Memory Key: SPARK-27511 URL: https://issues.apache.org/jira/browse/SPARK-27511 Project: Spark Issue Type: Question Components: DStreams Affects Versions: 2.4.0 Reporter: Badri Krishnan Hello Apache Spark Community. We are currently facing an issue with one of our Spark Streaming jobs which consumes data from a IBM MQ, this is run on a AWS EMR cluster using DStreams and Checkpointing. Our Spark streaming job failed with several containers exiting with error code: 143. I checked your container logs. For example, one of the killed container's stdout logs [1] show the below error: (Exit code from container container_1553356041292_0001_15_04 is : 143) 2019-03-28 19:32:26,569 ERROR [dispatcher-event-loop-3] org.apache.spark.streaming.receiver.ReceiverSupervisorImpl:Error stopping receiver 2 org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Failed to connect to ip-**-***-*.***.***.com/**.**.***.**:* at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187) at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198) at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194) at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190) at java.util.concurrent.FutureTask.run(FutureTask.java:266) ... 3 more These containers exited with code 143 because it was not able to reach the application master(Driver Process). Amazon mentioned that the Application Master is consuming more memory and hence recommended us to double it. 
As the AM runs on the driver, we were asked to increase spark.driver.memory from 1.4G to 3G. But the question that was left unanswered was whether increasing the memory would solve the problem or merely delay the failure. As this is an ever-running streaming application, do we need to consider something to understand whether the memory usage builds up over a period of time, or are there any properties that need to be set specific to how the AM (Application Master) works for streaming applications? Any inputs on how to track the AM memory usage? Any insights will be helpful.
[jira] [Comment Edited] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue
[ https://issues.apache.org/jira/browse/SPARK-27068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821286#comment-16821286 ] shahid edited comment on SPARK-27068 at 4/18/19 4:31 PM: - cc [~srowen] Can we raise a PR for the issue?. The actual issue is, when there are lots of jobs, UI cleans older jobs if the number of jobs exceeds a threshold. Eventually it removes failure jobs as well. If user want to see the reason for failure, it won't be available in UI. The solution could be, we can remove the jobs only from successful jobs table and retain failed or killed jobs table. Kindly give the feedback was (Author: shahid): cc [~srowen] Can we raise a PR for the issue?. The actual issue is, when there are lots of jobs, UI cleans older jobs if the number of jobs exceeds a threshold. Eventually it removes failure jobs as well. If user want to see the reason for failure, it won't be available in UI. The solution could be, we can remove the jobs only from successful jobs table and retain failed of killed jobs table. Kindly give the feedback > Support failed jobs ui and completed jobs ui use different queue > > > Key: SPARK-27068 > URL: https://issues.apache.org/jira/browse/SPARK-27068 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.0 >Reporter: zhoukang >Priority: Major > > For some long running jobs,we may want to check out the cause of some failed > jobs. > But most jobs has completed and failed jobs ui may disappear, we can use > different queue for this two kinds of jobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue
[ https://issues.apache.org/jira/browse/SPARK-27068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821286#comment-16821286 ] shahid commented on SPARK-27068: cc [~srowen] Can we raise a PR for the issue? The actual issue is: when there are lots of jobs, the UI cleans older jobs once the number of jobs exceeds a threshold. Eventually it removes failed jobs as well, so if users want to see the reason for a failure, it won't be available in the UI. The solution could be to remove jobs only from the successful jobs table and retain the failed or killed jobs table. Kindly give feedback. > Support failed jobs ui and completed jobs ui use different queue > > > Key: SPARK-27068 > URL: https://issues.apache.org/jira/browse/SPARK-27068 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.0 >Reporter: zhoukang >Priority: Major > > For some long-running jobs, we may want to check out the cause of some failed > jobs. > But most jobs have completed and the failed jobs UI may disappear; we can use > a different queue for these two kinds of jobs.
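The retention policy shahid proposes can be sketched as a store that evicts the oldest *successful* job first and keeps failed or killed jobs visible. Class, method, and status names here are illustrative, not Spark's internals:

```python
class JobStore:
    """Sketch of a UI job store that evicts only successful jobs once
    a retention threshold is exceeded, so failure causes stay visible."""

    def __init__(self, max_jobs):
        self.max_jobs = max_jobs
        self.jobs = []  # (job_id, status) tuples in insertion order

    def add(self, job_id, status):
        self.jobs.append((job_id, status))
        self._evict()

    def _evict(self):
        while len(self.jobs) > self.max_jobs:
            # Drop the oldest successful job first.
            for i, (_, status) in enumerate(self.jobs):
                if status == "SUCCEEDED":
                    del self.jobs[i]
                    break
            else:
                # Only failed/killed jobs remain: fall back to
                # dropping the oldest entry so the store stays bounded.
                del self.jobs[0]

store = JobStore(max_jobs=3)
store.add(1, "SUCCEEDED")
store.add(2, "FAILED")
store.add(3, "SUCCEEDED")
store.add(4, "SUCCEEDED")
store.add(5, "SUCCEEDED")
print([job_id for job_id, _ in store.jobs])  # job 2 (FAILED) survives eviction
```

A real implementation would also need a cap on retained failed jobs, which is why the fallback branch above still evicts when nothing else is left.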
[jira] [Commented] (SPARK-24655) [K8S] Custom Docker Image Expectations and Documentation
[ https://issues.apache.org/jira/browse/SPARK-24655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821270#comment-16821270 ] Thomas Graves commented on SPARK-24655: --- From the linked issues it seems the goals would be: * Support more than the Alpine image base, i.e. a glibc version * Allow for adding at least certain support like GPUs, although this may just mean making the base image configurable * Allow for overriding the start commands for things like using Jupyter docker images. * Add in Python pip requirements, and I assume the same would be nice for R; is there something generic we can do to make this easy? Correct me if I'm wrong, but for anything Spark-related you should be able to use Spark confs, like env variables, e.g. {{spark.kubernetes.driverEnv.[EnvironmentVariableName]}} and spark.executorEnv.. Otherwise you could just use the dockerfile built here as a base and build on it. I think we would just want to try to make it easy for the common cases and allow users to override things we may have hardcoded, to allow them to reuse it as a base. [~mcheah] From the original description, why do we want to try to not rebuild the image if the spark version changes? It seems ok to allow them to override to point to their own spark version (which they could then use to do this), but I would think normally you would build a new docker image for a new version of spark? Dependencies may have changed, the docker template may have changed, etc. It seems if they really wanted this, they would just specify their own docker image as a base and just add the spark pieces; is that what you are getting at?
We can make the base image an argument to the docker-image-tool.sh script. > [K8S] Custom Docker Image Expectations and Documentation > > > Key: SPARK-24655 > URL: https://issues.apache.org/jira/browse/SPARK-24655 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Matt Cheah >Priority: Major > > A common use case we want to support with Kubernetes is the usage of custom > Docker images. Some examples include: > * A user builds an application using Gradle or Maven, using Spark as a > compile-time dependency. The application's jars (both the custom-written jars > and the dependencies) need to be packaged in a docker image that can be run > via spark-submit. > * A user builds a PySpark or R application and desires to include custom > dependencies > * A user wants to switch the base image from Alpine to CentOS while using > either built-in or custom jars > We currently do not document how these custom Docker images are supposed to > be built, nor do we guarantee stability of these Docker images with various > spark-submit versions. To illustrate how this can break down, suppose for > example we decide to change the names of environment variables that denote > the driver/executor extra JVM options specified by > {{spark.[driver|executor].extraJavaOptions}}. If we change the environment > variable spark-submit provides then the user must update their custom > Dockerfile and build new images. > Rather than jumping to an implementation immediately though, it's worth > taking a step back and considering these matters from the perspective of the > end user. Towards that end, this ticket will serve as a forum where we can > answer at least the following questions, and any others pertaining to the > matter: > # What would be the steps a user would need to take to build a custom Docker > image, given their desire to customize the dependencies and the content (OS > or otherwise) of said images?
> # How can we ensure the user does not need to rebuild the image if only the > spark-submit version changes? > The end deliverable for this ticket is a design document, and then we'll > create sub-issues for the technical implementation and documentation of the > contract.
[jira] [Created] (SPARK-27510) Master falls into a dead loop when launching an executor fails in the Worker
wuyi created SPARK-27510: Summary: Master falls into a dead loop when launching an executor fails in the Worker Key: SPARK-27510 URL: https://issues.apache.org/jira/browse/SPARK-27510 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0, 2.3.0, 2.2.0, 2.1.0, 2.0.0, 1.6.0 Reporter: wuyi In Standalone mode, when launching an executor via ExecutorRunner.start() always fails on a Worker, the Master will continue to launch new executors on the same Worker indefinitely. The issue is easy to reproduce by running a unit test in local-cluster mode with a wrong spark.test.home (e.g. /tmp). The test then gets stuck, and we can see endless executor directories under /tmp/work/app/.
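The fix direction implied by this report can be sketched as a bounded retry loop: stop re-launching after a fixed number of consecutive failures instead of retrying forever. The function, parameter names, and threshold below are illustrative, not Spark's actual Master scheduling logic:

```python
def schedule_executor(launch, max_failures=10):
    """Keep trying to launch an executor, but give up after
    max_failures consecutive failures instead of looping forever."""
    failures = 0
    while failures < max_failures:
        if launch():
            return True        # executor came up
        failures += 1
    return False               # give up; the caller would mark the
                               # worker or application as unhealthy

# A launcher that always fails, as when spark.test.home points to a
# bad location and every ExecutorRunner.start() dies immediately:
attempts = []
def always_fail():
    attempts.append(1)
    return False

result = schedule_executor(always_fail, max_failures=5)
print(result, len(attempts))  # False 5: bounded, not an endless loop
```

Without the `max_failures` bound, the `while` loop above is exactly the dead loop described: every failed launch immediately triggers another.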
[jira] [Resolved] (SPARK-27502) Update nested schema benchmark result for Orc V2
[ https://issues.apache.org/jira/browse/SPARK-27502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27502. --- Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24399 > Update nested schema benchmark result for Orc V2 > > > Key: SPARK-27502 > URL: https://issues.apache.org/jira/browse/SPARK-27502 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 3.0.0 > > > We added nested schema pruning support to Orc V2 recently. The benchmark > result should be updated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27502) Update nested schema benchmark result for Orc V2
[ https://issues.apache.org/jira/browse/SPARK-27502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27502: -- Issue Type: Sub-task (was: Test) Parent: SPARK-25603 > Update nested schema benchmark result for Orc V2 > > > Key: SPARK-27502 > URL: https://issues.apache.org/jira/browse/SPARK-27502 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Minor > > We added nested schema pruning support to Orc V2 recently. The benchmark > result should be updated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25422) flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)
[ https://issues.apache.org/jira/browse/SPARK-25422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821178#comment-16821178 ] Imran Rashid commented on SPARK-25422: -- [~magnusfa] handling blocks > 2GB was broken in many ways before 2.4. You'll need to upgrade to 2.4.0 to get it work. There are no known issues with large blocks in 2.4.0 (as far as I know, anyway). > flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated > (encryption = on) (with replication as stream) > > > Key: SPARK-25422 > URL: https://issues.apache.org/jira/browse/SPARK-25422 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Imran Rashid >Priority: Major > Fix For: 2.4.0 > > > stacktrace > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 > (TID 7, localhost, executor 1): java.io.IOException: > org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of > broadcast_0: 1651574976 != 1165629262 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1320) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1347) > at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: corrupt remote block > broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262 > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:167) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:151) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:231) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1313) > ... 13 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27509) enable connection in cluster mode for output in client machine
Ilya Brodetsky created SPARK-27509: -- Summary: enable connection in cluster mode for output in client machine Key: SPARK-27509 URL: https://issues.apache.org/jira/browse/SPARK-27509 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 3.0.0 Reporter: Ilya Brodetsky While working on Spark, I implemented a feature: when you submit a Spark job, you can enable an option so that, when the job finishes, the client machine reads from a file the Spark job wrote to. It enables a quick and easy way to get the output of, say, a dynamic Spark SQL query on the client without manually running an HDFS download command or anything like it. There are some prerequisites (like having the configuration for HDFS communication), but I think it's completely possible to implement it and help many users have an established communication line with a Spark job. What do you think about the option? I would be glad to hear any issues with this idea or why it shouldn't be implemented.
[jira] [Resolved] (SPARK-27460) Running slowest test suites in their own forked JVMs for higher parallelism
[ https://issues.apache.org/jira/browse/SPARK-27460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27460. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24373 [https://github.com/apache/spark/pull/24373] > Running slowest test suites in their own forked JVMs for higher parallelism > --- > > Key: SPARK-27460 > URL: https://issues.apache.org/jira/browse/SPARK-27460 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Gengliang Wang >Priority: Critical > Fix For: 3.0.0 > > > We should modify SparkBuild so that the largest / slowest test suites (or > collections of suites) can run in their own forked JVMs, allowing them to be > run in parallel with each other -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27508) Reduce test time of HiveClientSuites
[ https://issues.apache.org/jira/browse/SPARK-27508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-27508. Resolution: Won't Fix Sorry, I think I have made some mistakes. The time is quite the same. > Reduce test time of HiveClientSuites > > > Key: SPARK-27508 > URL: https://issues.apache.org/jira/browse/SPARK-27508 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > The test time of HiveClientSuites on Jenkins is about 3.5 minutes. > The test suite itself is sometimes flaky. > I find that changing the default table from Managed table to External Table > can fasten the tests, while the test scenarios are still covered. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27508) Reduce test time of HiveClientSuites
[ https://issues.apache.org/jira/browse/SPARK-27508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27508: --- Component/s: (was: SQL) Tests > Reduce test time of HiveClientSuites > > > Key: SPARK-27508 > URL: https://issues.apache.org/jira/browse/SPARK-27508 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > The test time of HiveClientSuites on Jenkins is about 3.5 minutes. > The test suite itself is sometimes flaky. > I find that changing the default table from Managed table to External Table > can fasten the tests, while the test scenarios are still covered. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27508) Reduce test time of HiveClientSuites
Gengliang Wang created SPARK-27508: -- Summary: Reduce test time of HiveClientSuites Key: SPARK-27508 URL: https://issues.apache.org/jira/browse/SPARK-27508 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang The test time of HiveClientSuites on Jenkins is about 3.5 minutes, and the test suite itself is sometimes flaky. I find that changing the default table from a managed table to an external table can speed up the tests, while the test scenarios are still covered.
[jira] [Updated] (SPARK-27507) get_json_object fails somewhat arbitrarily on long input
[ https://issues.apache.org/jira/browse/SPARK-27507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Chirico updated SPARK-27507: Description: Some long JSON objects are parsed incorrectly by {{get_json_object}}. The specific string we noticed this on can't be shared, but here's some reproduction in Pyspark: {code:java} # v2.3.1 spark = SparkSession.builder.enableHiveSupport().getOrCreate() from string import ascii_lowercase # create a long string alpha_rep = ascii_lowercase*1000 # create a simple query on a simple json object which contains this string test_q = ''' select get_json_object('{{"a": "{}"}}', '$.a') ''' def run_q(s): return len(spark.sql(test_q.format(s)).collect()[0][0]) def diagnose(s): out_len = run_q(s) # input & output should be identical (length match is a necessary condition) print('Input length: %d\tOutput length: %d' % (len(s), out_len)) return True def test_l(n): diagnose(alpha_rep[:n]) return True test_l(2264) test_l(2265) test_l(2667) test_l(2666) test_l(2668) test_l(len(alpha_rep)){code} With results on my instance: {code:java} Input length: 2264 Output length: 2264 Input length: 2265 Output length: 2265 Input length: 2667 Output length: 2660 < problematic!! Input length: 2666 Output length: 2666 Input length: 2668 Output length: 2661 < problematic!! Input length: 26000 Output length: 26000 {code} It's strange that the error triggers for some lengths, but it's apparently not exclusively about the input being large. 
More details from a {{pandas}} exploration: {code:java} import pandas as pd DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)}) N = DF.shape[0] # note -- takes about 20 minutes to run on my machine for ii in range(N): DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']]) if ii % 520 == 0: print("%.0f%% Done" % (100.0*ii/N)) DF[DF['n'] != DF['m']].shape # (1326, 2) DF['miss'] = DF['n'] - DF['m'] DF.plot('n', 'miss') {code} Plot attached So it appears to fail for a narrowly defined range of about 1300 characters before recovering and continuing to function as expected. was: Some long JSON objects are parsed incorrectly by {{get_json_object}}. The specific string we noticed this on can't be shared, but here's some reproduction in Pyspark: {code:java} # v2.3.1 spark = SparkSession.builder.enableHiveSupport().getOrCreate() from string import ascii_lowercase # create a long string alpha_rep = ascii_lowercase*1000 # create a simple query on a simple json object which contains this string test_q = ''' select get_json_object('{{"a": "{}"}}', '$.a') ''' def run_q(s): return len(spark.sql(test_q.format(s)).collect()[0][0]) def diagnose(s): out_len = run_q(s) # input & output should be identical (length match is a necessary condition) print('Input length: %d\tOutput length: %d' % (len(s), out_len)) return True def test_l(n): diagnose(alpha_rep[:n]) return True test_l(2264) test_l(2265) test_l(2667) test_l(2666) test_l(2668) test_l(len(alpha_rep)){code} With results on my instance: {code:java} Input length: 2264 Output length: 2264 Input length: 2265 Output length: 2265 Input length: 2667 Output length: 2660 Input length: 2666 Output length: 2666 Input length: 2668 Output length: 2661 < problematic!! Input length: 26000 Output length: 26000 {code} It's strange that the error triggers for some lengths, but it's apparently not exclusively about the input being large. 
More details from a {{pandas}} exploration: {code:java} import pandas as pd DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)}) N = DF.shape[0] # note -- takes about 20 minutes to run on my machine for ii in range(N): DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']]) if ii % 520 == 0: print("%.0f%% Done" % (100.0*ii/N)) DF[DF['n'] != DF['m']].shape # (1326, 2) DF['miss'] = DF['n'] - DF['m'] DF.plot('n', 'miss') {code} Plot attached So it appears to fail for a narrowly defined range of about 1300 characters before recovering and continuing to function as expected. > get_json_object fails somewhat arbitrarily on long input > > > Key: SPARK-27507 > URL: https://issues.apache.org/jira/browse/SPARK-27507 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: Michael Chirico >Priority: Major > Attachments: Screen Shot 2019-04-18 at 7.13.02 PM.png > > > Some long JSON objects are parsed incorrectly by {{get_json_object}}. > The specific string we noticed this on can't be shared, but here's some > reproduction in Pyspark: > {code:java} > # v2.3.1 > spark = SparkSession.builder.enableHiveSupport().getOrCreate() > from string import ascii_lowercase >
[jira] [Updated] (SPARK-27507) get_json_object fails somewhat arbitrarily on long input
[ https://issues.apache.org/jira/browse/SPARK-27507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Chirico updated SPARK-27507: Attachment: Screen Shot 2019-04-18 at 7.13.02 PM.png > get_json_object fails somewhat arbitrarily on long input > > > Key: SPARK-27507 > URL: https://issues.apache.org/jira/browse/SPARK-27507 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: Michael Chirico >Priority: Major > Attachments: Screen Shot 2019-04-18 at 7.13.02 PM.png > > > Some long JSON objects are parsed incorrectly by {{get_json_object}}. > The specific string we noticed this on can't be shared, but here's some > reproduction in Pyspark: > {code:java} > # v2.3.1 > spark = SparkSession.builder.enableHiveSupport().getOrCreate() > from string import ascii_lowercase > # create a long string > alpha_rep = ascii_lowercase*1000 > # create a simple query on a simple json object which contains this string > test_q = ''' > select get_json_object('{{"a": "{}"}}', '$.a') > ''' > def run_q(s): > return len(spark.sql(test_q.format(s)).collect()[0][0]) > def diagnose(s): > out_len = run_q(s) > # input & output should be identical (length match is a necessary > condition) > print('Input length: %d\tOutput length: %d' % (len(s), out_len)) > return True > def test_l(n): > diagnose(alpha_rep[:n]) > return True > test_l(2264) > test_l(2265) > test_l(2667) > test_l(2666) > test_l(2668) > test_l(len(alpha_rep)){code} > With results on my instance: > {code:java} > Input length: 2264Output length: 2264 > Input length: 2265Output length: 2265 > Input length: 2667Output length: 2660 > Input length: 2666Output length: 2666 > Input length: 2668Output length: 2661 < problematic!! > Input length: 26000 Output length: 26000 > {code} > It's strange that the error triggers for some lengths, but it's apparently > not exclusively about the input being large. 
> > More details from a {{pandas}} exploration: > {code:java} > import pandas as pd > DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)}) > N = DF.shape[0] > # note -- takes about 20 minutes to run on my machine > for ii in range(N): > DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']]) > if ii % 520 == 0: > print("%.0f%% Done" % (100.0*ii/N)) > DF[DF['n'] != DF['m']].shape > # (1326, 2) > DF['miss'] = DF['n'] - DF['m'] > DF.plot('n', 'miss') > {code} > Plot here: > [https://imgur.com/vCPLNwy] > So it appears to fail for a narrowly defined range of about 1300 characters > before recovering and continuing to function as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27507) get_json_object fails somewhat arbitrarily on long input
[ https://issues.apache.org/jira/browse/SPARK-27507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Chirico updated SPARK-27507: Description: Some long JSON objects are parsed incorrectly by {{get_json_object}}. The specific string we noticed this on can't be shared, but here's some reproduction in Pyspark: {code:java} # v2.3.1 spark = SparkSession.builder.enableHiveSupport().getOrCreate() from string import ascii_lowercase # create a long string alpha_rep = ascii_lowercase*1000 # create a simple query on a simple json object which contains this string test_q = ''' select get_json_object('{{"a": "{}"}}', '$.a') ''' def run_q(s): return len(spark.sql(test_q.format(s)).collect()[0][0]) def diagnose(s): out_len = run_q(s) # input & output should be identical (length match is a necessary condition) print('Input length: %d\tOutput length: %d' % (len(s), out_len)) return True def test_l(n): diagnose(alpha_rep[:n]) return True test_l(2264) test_l(2265) test_l(2667) test_l(2666) test_l(2668) test_l(len(alpha_rep)){code} With results on my instance: {code:java} Input length: 2264 Output length: 2264 Input length: 2265 Output length: 2265 Input length: 2667 Output length: 2660 Input length: 2666 Output length: 2666 Input length: 2668 Output length: 2661 < problematic!! Input length: 26000 Output length: 26000 {code} It's strange that the error triggers for some lengths, but it's apparently not exclusively about the input being large. 
More details from a {{pandas}} exploration: {code:java} import pandas as pd DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)}) N = DF.shape[0] # note -- takes about 20 minutes to run on my machine for ii in range(N): DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']]) if ii % 520 == 0: print("%.0f%% Done" % (100.0*ii/N)) DF[DF['n'] != DF['m']].shape # (1326, 2) DF['miss'] = DF['n'] - DF['m'] DF.plot('n', 'miss') {code} Plot attached So it appears to fail for a narrowly defined range of about 1300 characters before recovering and continuing to function as expected. was: Some long JSON objects are parsed incorrectly by {{get_json_object}}. The specific string we noticed this on can't be shared, but here's some reproduction in Pyspark: {code:java} # v2.3.1 spark = SparkSession.builder.enableHiveSupport().getOrCreate() from string import ascii_lowercase # create a long string alpha_rep = ascii_lowercase*1000 # create a simple query on a simple json object which contains this string test_q = ''' select get_json_object('{{"a": "{}"}}', '$.a') ''' def run_q(s): return len(spark.sql(test_q.format(s)).collect()[0][0]) def diagnose(s): out_len = run_q(s) # input & output should be identical (length match is a necessary condition) print('Input length: %d\tOutput length: %d' % (len(s), out_len)) return True def test_l(n): diagnose(alpha_rep[:n]) return True test_l(2264) test_l(2265) test_l(2667) test_l(2666) test_l(2668) test_l(len(alpha_rep)){code} With results on my instance: {code:java} Input length: 2264 Output length: 2264 Input length: 2265 Output length: 2265 Input length: 2667 Output length: 2660 Input length: 2666 Output length: 2666 Input length: 2668 Output length: 2661 < problematic!! Input length: 26000 Output length: 26000 {code} It's strange that the error triggers for some lengths, but it's apparently not exclusively about the input being large. 
More details from a {{pandas}} exploration: {code:java} import pandas as pd DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)}) N = DF.shape[0] # note -- takes about 20 minutes to run on my machine for ii in range(N): DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']]) if ii % 520 == 0: print("%.0f%% Done" % (100.0*ii/N)) DF[DF['n'] != DF['m']].shape # (1326, 2) DF['miss'] = DF['n'] - DF['m'] DF.plot('n', 'miss') {code} Plot here: [https://imgur.com/vCPLNwy] So it appears to fail for a narrowly defined range of about 1300 characters before recovering and continuing to function as expected. > get_json_object fails somewhat arbitrarily on long input > > > Key: SPARK-27507 > URL: https://issues.apache.org/jira/browse/SPARK-27507 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: Michael Chirico >Priority: Major > Attachments: Screen Shot 2019-04-18 at 7.13.02 PM.png > > > Some long JSON objects are parsed incorrectly by {{get_json_object}}. > The specific string we noticed this on can't be shared, but here's some > reproduction in Pyspark: > {code:java} > # v2.3.1 > spark = SparkSession.builder.enableHiveSupport().getOrCreate() > from string import ascii_lowerc
[jira] [Created] (SPARK-27507) get_json_object fails somewhat arbitrarily on long input
Michael Chirico created SPARK-27507: --- Summary: get_json_object fails somewhat arbitrarily on long input Key: SPARK-27507 URL: https://issues.apache.org/jira/browse/SPARK-27507 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.3.1 Reporter: Michael Chirico Some long JSON objects are parsed incorrectly by {{get_json_object}}. The specific string we noticed this on can't be shared, but here's some reproduction in Pyspark:
{code:java}
# v2.3.1
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

from string import ascii_lowercase

# create a long string
alpha_rep = ascii_lowercase * 1000

# create a simple query on a simple json object which contains this string
test_q = '''
select get_json_object('{{"a": "{}"}}', '$.a')
'''

def run_q(s):
    return len(spark.sql(test_q.format(s)).collect()[0][0])

def diagnose(s):
    out_len = run_q(s)
    # input & output should be identical (length match is a necessary condition)
    print('Input length: %d\tOutput length: %d' % (len(s), out_len))
    return True

def test_l(n):
    diagnose(alpha_rep[:n])
    return True

test_l(2264)
test_l(2265)
test_l(2667)
test_l(2666)
test_l(2668)
test_l(len(alpha_rep))
{code}
With results on my instance:
{code:java}
Input length: 2264	Output length: 2264
Input length: 2265	Output length: 2265
Input length: 2667	Output length: 2660
Input length: 2666	Output length: 2666
Input length: 2668	Output length: 2661   < problematic!!
Input length: 26000	Output length: 26000
{code}
It's strange that the error triggers for some lengths, but it's apparently not exclusively about the input being large. 
More details from a {{pandas}} exploration:
{code:java}
import pandas as pd

DF = pd.DataFrame({'n': range(1, len(alpha_rep) + 1)})
N = DF.shape[0]

# note -- takes about 20 minutes to run on my machine
for ii in range(N):
    DF.loc[ii, 'm'] = run_q(alpha_rep[:DF.loc[ii, 'n']])
    if ii % 520 == 0:
        print("%.0f%% Done" % (100.0*ii/N))

DF[DF['n'] != DF['m']].shape
# (1326, 2)
DF['miss'] = DF['n'] - DF['m']
DF.plot('n', 'miss')
{code}
Plot here: [https://imgur.com/vCPLNwy] So it appears to fail for a narrowly defined range of about 1300 characters before recovering and continuing to function as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
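The diagnostic pattern in the report generalizes to any string transform: scan prefix lengths and record every length where the output no longer matches the input. A self-contained sketch, with a hypothetical {{broken_extract}} standing in for the Spark call (the defect window here is simulated for illustration, not Spark's actual behavior):

```python
import json

def broken_extract(s):
    """Hypothetical stand-in for get_json_object: drops characters
    for inputs whose length falls inside a simulated 'bad' window."""
    out = json.loads(json.dumps({"a": s}))["a"]
    if 2667 <= len(s) <= 2700:   # simulated defect window
        out = out[:-7]           # silently lose 7 characters
    return out

def scan_misses(transform, alphabet, max_n):
    """Return (n, n - len(transform(prefix))) for every prefix
    length n where output length differs from input length."""
    base = (alphabet * (max_n // len(alphabet) + 1))[:max_n]
    misses = []
    for n in range(1, max_n + 1):
        m = len(transform(base[:n]))
        if m != n:
            misses.append((n, n - m))
    return misses

misses = scan_misses(broken_extract, "abcdefghijklmnopqrstuvwxyz", 2710)
print(misses[0])   # (2667, 7) -- first length at which characters go missing
```

The same scan, pointed at {{run_q}} instead of the toy transform, is essentially what the pandas loop above does, just without the 20-minute DataFrame round trips.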
[jira] [Created] (SPARK-27506) Function `from_avro` doesn't allow deserialization of data using other compatible schemas
Gianluca Amori created SPARK-27506: -- Summary: Function `from_avro` doesn't allow deserialization of data using other compatible schemas Key: SPARK-27506 URL: https://issues.apache.org/jira/browse/SPARK-27506 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.1 Reporter: Gianluca Amori SPARK-24768 and its subtasks introduced support for reading and writing Avro data by parsing a binary column of Avro format and converting it into its corresponding catalyst value (and vice versa). The current implementation has the limitation of requiring deserialization of an event with the exact same schema with which it was serialized. This breaks one of the most important features of Avro, schema evolution [https://docs.confluent.io/current/schema-registry/avro.html] - most importantly, the ability to read old data with a newer (compatible) schema without breaking the consumer. The GenericDatumReader in the Avro library already supports passing an optional *writer's schema* (the schema with which the record was serialized) alongside a mandatory *reader's schema* (the schema with which the record is going to be deserialized). The proposed change is to do the same in the from_avro function, allowing an optional writer's schema to be passed and used in the deserialization. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
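Avro's schema-resolution rules are what make the writer/reader split work. As a rough illustration only (a toy resolver in plain Python, not the Avro library's API or its full rule set): a record written with an older schema stays readable under a newer reader schema, as long as every added field carries a default.

```python
# Simplified sketch of Avro-style schema resolution. The real rules
# (aliases, promotions, unions, ...) live in the Avro specification;
# this only models "reader field missing from writer -> use default".
def resolve_record(datum, writer_fields, reader_fields):
    """Project a record written with writer_fields onto reader_fields.
    reader_fields is a list of (name, default) pairs; default=None
    marks a required field with no default."""
    out = {}
    for name, default in reader_fields:
        if name in writer_fields:
            out[name] = datum[name]   # present in writer: take its value
        elif default is not None:
            out[name] = default       # new field: fall back to the default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out

# Old (writer) schema had only 'id'; the new (reader) schema adds 'tag'
# with a default, so old data remains readable.
old_record = {"id": 7}
resolved = resolve_record(old_record, {"id"}, [("id", None), ("tag", "none")])
print(resolved)   # {'id': 7, 'tag': 'none'}
```

This is exactly the behavior the ticket wants from {{from_avro}}: the consumer's (reader's) schema drives the output shape, while the writer's schema says how the bytes were laid down.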
[jira] [Commented] (SPARK-27505) autoBroadcastJoinThreshold including bigger table
[ https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820951#comment-16820951 ] Mike Chan commented on SPARK-27505: --- Table desc extended result: Statistics |24452111 bytes > autoBroadcastJoinThreshold including bigger table > - > > Key: SPARK-27505 > URL: https://issues.apache.org/jira/browse/SPARK-27505 > Project: Spark > Issue Type: Question > Components: PySpark >Affects Versions: 2.3.1 > Environment: Hive table with Spark 2.3.1 on Azure, using Azure > storage as storage layer >Reporter: Mike Chan >Priority: Major > Attachments: explain_plan.txt > > > I'm on a case that when certain table being exposed to broadcast join, the > query will eventually failed with remote block error. > > Firstly. We set the spark.sql.autoBroadcastJoinThreshold to 10MB, namely > 10485760 > !https://mail.google.com/mail/u/1?ui=2&ik=6f09461656&attid=0.0.2&permmsgid=msg-a:r2073778291349183964&th=16a2fd58ea74551c&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ96l-PZQKRrU2lSlUA7MGbz1DAK62y0fMFOG07rfgI3oXkalm4An9eHtd6hX3hsKDd9EJK46cGTaqj_qKVrzs7xLyJgvx8XHuu36HSSfBtxW9OnrckzikIDRPI&disp=emb&realattid=ii_jumg5jxd1|width=542,height=66! > > Then we proceed to perform query. In the SQL plan, we found that one table > that is 25MB in size is broadcast as well. > > !https://mail.google.com/mail/u/1?ui=2&ik=6f09461656&attid=0.0.1&permmsgid=msg-a:r2073778291349183964&th=16a2fd58ea74551c&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ_Fx_sEOI2n4yYfOn0gCUYqFYMDrxsSzd-S9ehtl67Imi87NN3y8cCFUOrHwKYO3MTfi3LVCIGg7J9jEuqnlqa76pvrUaAzEKSUm9VtBoH-Zsf9qepJiS4NKLE&disp=emb&realattid=ii_jumg53fq0|width=227,height=542! > > Also in desc extended the table is 24452111 bytes. It is a Hive table. We > always ran into error when this table being broadcast. 
Below is the sample > error > > Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt > remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350) at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > > Also attached the physical plan if you're interested. One thing to note that, > if I turn down autoBroadcastJoinThreshold{color:#00}to 5MB, this query > will get successfully executed and default.product NOT broadcasted.{color} > {color:#00} > {color}{color:#00}However, when I change to another query that querying > even less columns than pervious one, even in 5MB this table still get > broadcasted and failed with the same error. I even changed to 1MB and still > the same. {color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
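A rough model of the size check involved, using the numbers from the report (this is a simplification of Spark's actual planner rule, which compares an *estimated* plan size, possibly after column pruning, not the raw table size): under a naive comparison the 24452111-byte table should not qualify for a 10MB threshold, which is what makes the observed broadcast surprising.

```python
# Hedged sketch: planner-style broadcast eligibility check.
# Values taken from the report; the predicate itself is a simplified
# model, not Spark's implementation.
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024   # 10MB = 10485760 bytes

def would_broadcast(estimated_size_bytes, threshold=AUTO_BROADCAST_THRESHOLD):
    """A relation is broadcast-eligible when its estimated size fits
    under the threshold (simplified model of the real rule)."""
    return 0 < estimated_size_bytes <= threshold

hive_stats_size = 24452111   # 'desc extended' Statistics from the report
print(would_broadcast(hive_stats_size))   # False under a 10MB threshold
```

If the planner nevertheless broadcasts, the size it is comparing is evidently not the 24452111-byte statistic, e.g. a per-query estimate after pruning, which would also explain why the behavior changes from query to query.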
[jira] [Updated] (SPARK-27505) autoBroadcastJoinThreshold including bigger table
[ https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Chan updated SPARK-27505: -- Attachment: explain_plan.txt > autoBroadcastJoinThreshold including bigger table > - > > Key: SPARK-27505 > URL: https://issues.apache.org/jira/browse/SPARK-27505 > Project: Spark > Issue Type: Question > Components: PySpark >Affects Versions: 2.3.1 > Environment: Hive table with Spark 2.3.1 on Azure, using Azure > storage as storage layer >Reporter: Mike Chan >Priority: Major > Attachments: explain_plan.txt > > > I'm on a case that when certain table being exposed to broadcast join, the > query will eventually failed with remote block error. > > Firstly. We set the spark.sql.autoBroadcastJoinThreshold to 10MB, namely > 10485760 > !https://mail.google.com/mail/u/1?ui=2&ik=6f09461656&attid=0.0.2&permmsgid=msg-a:r2073778291349183964&th=16a2fd58ea74551c&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ96l-PZQKRrU2lSlUA7MGbz1DAK62y0fMFOG07rfgI3oXkalm4An9eHtd6hX3hsKDd9EJK46cGTaqj_qKVrzs7xLyJgvx8XHuu36HSSfBtxW9OnrckzikIDRPI&disp=emb&realattid=ii_jumg5jxd1|width=542,height=66! > > Then we proceed to perform query. In the SQL plan, we found that one table > that is 25MB in size is broadcast as well. > > !https://mail.google.com/mail/u/1?ui=2&ik=6f09461656&attid=0.0.1&permmsgid=msg-a:r2073778291349183964&th=16a2fd58ea74551c&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ_Fx_sEOI2n4yYfOn0gCUYqFYMDrxsSzd-S9ehtl67Imi87NN3y8cCFUOrHwKYO3MTfi3LVCIGg7J9jEuqnlqa76pvrUaAzEKSUm9VtBoH-Zsf9qepJiS4NKLE&disp=emb&realattid=ii_jumg53fq0|width=227,height=542! > > Also in desc extended the table is 24452111 bytes. It is a Hive table. We > always ran into error when this table being broadcast. 
Below is the sample > error > > Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt > remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350) at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > > Also attached the physical plan if you're interested. One thing to note that, > if I turn down autoBroadcastJoinThreshold{color:#00}to 5MB, this query > will get successfully executed and default.product NOT broadcasted.{color} > {color:#00} > {color}{color:#00}However, when I change to another query that querying > even less columns than pervious one, even in 5MB this table still get > broadcasted and failed with the same error. I even changed to 1MB and still > the same. {color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27505) autoBroadcastJoinThreshold including bigger table
Mike Chan created SPARK-27505: - Summary: autoBroadcastJoinThreshold including bigger table Key: SPARK-27505 URL: https://issues.apache.org/jira/browse/SPARK-27505 Project: Spark Issue Type: Question Components: PySpark Affects Versions: 2.3.1 Environment: Hive table with Spark 2.3.1 on Azure, using Azure storage as the storage layer Reporter: Mike Chan I'm working on a case where, when a certain table is exposed to a broadcast join, the query eventually fails with a remote block error. Firstly, we set spark.sql.autoBroadcastJoinThreshold to 10MB, namely 10485760. !https://mail.google.com/mail/u/1?ui=2&ik=6f09461656&attid=0.0.2&permmsgid=msg-a:r2073778291349183964&th=16a2fd58ea74551c&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ96l-PZQKRrU2lSlUA7MGbz1DAK62y0fMFOG07rfgI3oXkalm4An9eHtd6hX3hsKDd9EJK46cGTaqj_qKVrzs7xLyJgvx8XHuu36HSSfBtxW9OnrckzikIDRPI&disp=emb&realattid=ii_jumg5jxd1|width=542,height=66! Then we proceed to run the query. In the SQL plan, we found that one table that is 25MB in size is broadcast as well. !https://mail.google.com/mail/u/1?ui=2&ik=6f09461656&attid=0.0.1&permmsgid=msg-a:r2073778291349183964&th=16a2fd58ea74551c&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ_Fx_sEOI2n4yYfOn0gCUYqFYMDrxsSzd-S9ehtl67Imi87NN3y8cCFUOrHwKYO3MTfi3LVCIGg7J9jEuqnlqa76pvrUaAzEKSUm9VtBoH-Zsf9qepJiS4NKLE&disp=emb&realattid=ii_jumg53fq0|width=227,height=542! Also, in desc extended the table is 24452111 bytes. It is a Hive table. We always run into this error when the table is broadcast. 
Below is the sample error: Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) I've also attached the physical plan if you're interested. One thing to note: if I turn autoBroadcastJoinThreshold down to 5MB, this query executes successfully and default.product is NOT broadcast. However, when I change to another query that reads even fewer columns than the previous one, even at 5MB this table still gets broadcast and fails with the same error. I even changed to 1MB and still the same. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24492) Endless attempted task when TaskCommitDenied exception writing to S3A
[ https://issues.apache.org/jira/browse/SPARK-24492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu-Jhe Li resolved SPARK-24492. --- Resolution: Won't Do This hasn't happened again since upgrading to 2.3.2; I'm going to close this issue. > Endless attempted task when TaskCommitDenied exception writing to S3A > - > > Key: SPARK-24492 > URL: https://issues.apache.org/jira/browse/SPARK-24492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yu-Jhe Li >Priority: Critical > Attachments: retry_stage.png, 螢幕快照 2018-05-16 上午11.10.46.png, 螢幕快照 > 2018-05-16 上午11.10.57.png > > > Hi, when we run a Spark application under spark-2.2.0 on AWS spot instances and > output files to S3, some tasks endlessly retry and all of them fail with a > TaskCommitDenied exception. This happens when we run the Spark application on > instances with network issues. (it runs well on healthy spot instances) > Sorry, I can't find an easy way to reproduce this issue; here's all I can > provide. > The Spark UI shows (in attachments) that one task of stage 112 failed due to > FetchFailedException (a network issue) and a new stage > 112 (retry 1) was attempted. But in stage 112 (retry 1), all tasks failed due to the > TaskCommitDenied exception, and keep retrying (they never succeed and cause lots > of S3 requests). 
> On the other side, driver logs shows: > # task 123.0 in stage 112.0 failed due to FetchFailedException (network > issue cause corrupted file) > # warning message from OutputCommitCoordinator > # task 92.0 in stage 112.1 failed when writing rows > # keep retry the failed tasks, but never succeed > {noformat} > 2018-05-16 02:38:055 WARN TaskSetManager:66 - Lost task 123.0 in stage 112.0 > (TID 42909, 10.47.20.17, executor 64): FetchFailed(BlockManagerId(137, > 10.235.164.113, 60758, None), shuffleId=39, mapId=59, reduceId=123, message= > org.apache.spark.shuffle.FetchFailedException: Stream is corrupted > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:403) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.T
[jira] [Created] (SPARK-27504) File source V2: support refreshing metadata cache
Gengliang Wang created SPARK-27504: -- Summary: File source V2: support refreshing metadata cache Key: SPARK-27504 URL: https://issues.apache.org/jira/browse/SPARK-27504 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang In file source V1, if a file is deleted manually, reading the DataFrame/Table throws an exception with the suggestion message "It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.". After refreshing the table/DataFrame, reads should return correct results. We should follow this in file source V2 as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
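The caching behavior described above can be sketched in miniature (plain Python with illustrative names; this is not Spark's actual FileIndex API): reads go through a cached file listing, so a manual deletion is invisible until the cache is refreshed.

```python
# Toy model of a cached file index: reads use a snapshot listing, so
# deletions surface as errors until refresh() re-lists the files.
class CachedFileIndex:
    def __init__(self, fs):
        self.fs = fs                  # dict: path -> contents ("filesystem")
        self._listing = list(fs)      # cached listing, snapshotted here

    def read_all(self):
        try:
            return [self.fs[p] for p in self._listing]
        except KeyError as e:
            raise IOError(
                f"File {e} does not exist. It is possible the underlying "
                "files have been updated; refresh the table.") from None

    def refresh(self):
        self._listing = list(self.fs)  # drop deleted files from the cache

fs = {"part-0": "a", "part-1": "b"}
idx = CachedFileIndex(fs)
del fs["part-1"]                       # file deleted manually
try:
    idx.read_all()                     # stale listing -> error + suggestion
except IOError:
    idx.refresh()
print(idx.read_all())                  # ['a']
```

The ticket's ask is that file source V2 follow the same contract: fail with the refresh suggestion on stale reads, and return correct results after a refresh.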
[jira] [Created] (SPARK-27503) JobGenerator thread exit for some fatal errors but application keeps running
Genmao Yu created SPARK-27503: - Summary: JobGenerator thread exit for some fatal errors but application keeps running Key: SPARK-27503 URL: https://issues.apache.org/jira/browse/SPARK-27503 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 3.0.0 Reporter: Genmao Yu The JobGenerator thread (along with some other EventLoop threads) may exit on a fatal error, like OOM, while the Spark Streaming job keeps running with no batch jobs being generated. Currently, we only report non-fatal errors:
{code}
override def run(): Unit = {
  try {
    while (!stopped.get) {
      val event = eventQueue.take()
      try {
        onReceive(event)
      } catch {
        case NonFatal(e) =>
          try {
            onError(e)
          } catch {
            case NonFatal(e) => logError("Unexpected error in " + name, e)
          }
      }
    }
  } catch {
    case ie: InterruptedException => // exit even if eventQueue is not empty
    case NonFatal(e) => logError("Unexpected error in " + name, e)
  }
}
{code}
In some corner cases, these event threads may exit with an OOM error, but the driver thread can still keep running. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
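The gap can be illustrated in Python as a stand-in for the Scala EventLoop (note Scala's NonFatal excludes OutOfMemoryError, so a fatal error escapes both catch blocks and the thread dies silently; here MemoryError plays that role, and {{on_fatal}} is a hypothetical hook of the kind the ticket is asking for):

```python
# Sketch of the failure mode: the loop swallows ordinary exceptions but
# a "fatal" one terminates the loop while the caller keeps going.
# on_fatal models the proposed fix: surface fatal errors on exit.
def run_loop(events, on_fatal=None):
    """Process events in order; return how many were handled."""
    handled = 0
    try:
        for receive in events:
            try:
                receive()             # onReceive(event)
            except MemoryError:
                raise                 # fatal: escapes the loop
            except Exception:
                pass                  # non-fatal: "log" and continue
            handled += 1
    except MemoryError as e:
        if on_fatal is not None:
            on_fatal(e)               # report instead of dying silently
    return handled

def ok():
    pass

def oom():
    raise MemoryError("simulated OOM in event handler")

reports = []
n = run_loop([ok, oom, ok], on_fatal=reports.append)
print(n, len(reports))   # 1 1  (the third event never runs)
```

Without the {{on_fatal}} hook, the loop exits after the second event and the application has no signal that batch generation has stopped, which is exactly the symptom described above.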
[jira] [Created] (SPARK-27502) Update nested schema benchmark result for Orc V2
Liang-Chi Hsieh created SPARK-27502: --- Summary: Update nested schema benchmark result for Orc V2 Key: SPARK-27502 URL: https://issues.apache.org/jira/browse/SPARK-27502 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh We added nested schema pruning support to Orc V2 recently. The benchmark result should be updated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27475) dev/deps/spark-deps-hadoop-3.2 is incorrect
[ https://issues.apache.org/jira/browse/SPARK-27475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820787#comment-16820787 ] Yuming Wang edited comment on SPARK-27475 at 4/18/19 7:08 AM: -- sbt can do it, but some result is incorrect. It added {{jetty-*-9.4.12.jar}}: {code:java} build/sbt "show assembly/compile:dependencyClasspath" -Phadoop-3.2 | grep "Attributed(" | awk -F "/" '{print $NF}' | sed 's/'\)'//g' | sort | grep -v spark {code} was (Author: q79969786): sbt can do it, but some result is incorrect. It added {{jetty-*-9.4.12.jar}}: {code:java} build/sbt "show assembly/compile:dependencyClasspath" -Phadoop-3.2 | grep "Attributed(" | rev | sed \ 's/^.//' | cut -d "/" -f 1 | rev | sort | grep -v spark {code} > dev/deps/spark-deps-hadoop-3.2 is incorrect > --- > > Key: SPARK-27475 > URL: https://issues.apache.org/jira/browse/SPARK-27475 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27475) dev/deps/spark-deps-hadoop-3.2 is incorrect
[ https://issues.apache.org/jira/browse/SPARK-27475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820787#comment-16820787 ] Yuming Wang commented on SPARK-27475: - sbt can do it, but some results are incorrect. It adds {{jetty-*-9.4.12.jar}}: {code:java} build/sbt "show assembly/compile:dependencyClasspath" -Phadoop-3.2 | grep "Attributed(" | rev | sed 's/^.//' | cut -d "/" -f 1 | rev | sort | grep -v spark {code} > dev/deps/spark-deps-hadoop-3.2 is incorrect > --- > > Key: SPARK-27475 > URL: https://issues.apache.org/jira/browse/SPARK-27475 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
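For reference, the basename-extraction step of that shell pipeline can be done portably in Python (the sample lines below are illustrative, not real sbt output):

```python
# Extract jar basenames from sbt's "Attributed(/path/to.jar)" output
# lines, drop Spark's own artifacts, and sort -- a Python equivalent
# of the grep | rev | sed | cut | rev | sort | grep -v pipeline above.
import os

def dep_jars(lines, exclude="spark"):
    """Pull Attributed(path) entries, keep the file name, drop
    anything whose name contains `exclude`, and sort."""
    names = []
    for line in lines:
        start = line.find("Attributed(")
        if start == -1:
            continue
        path = line[start + len("Attributed("):].rstrip().rstrip(")")
        name = os.path.basename(path)
        if exclude not in name:
            names.append(name)
    return sorted(names)

sample = [
    "Attributed(/home/u/.ivy2/jars/jetty-util-9.4.12.jar)",
    "Attributed(/home/u/.ivy2/jars/spark-core_2.12-3.0.0.jar)",
    "Attributed(/home/u/.ivy2/jars/avro-1.8.2.jar)",
]
print(dep_jars(sample))   # ['avro-1.8.2.jar', 'jetty-util-9.4.12.jar']
```

A script like this is easier to extend than the shell pipeline when, as noted in the comment, some of sbt's reported entries (e.g. the jetty jars) need case-by-case filtering before comparing against dev/deps/spark-deps-hadoop-3.2.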