[jira] [Updated] (SPARK-27145) Close store after test, in the SQLAppStatusListenerSuite
[ https://issues.apache.org/jira/browse/SPARK-27145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shahid updated SPARK-27145:
---------------------------
    Summary: Close store after test, in the SQLAppStatusListenerSuite  (was: Close store after test, in SQLAppStatusListenerSuite)

> Close store after test, in the SQLAppStatusListenerSuite
> --------------------------------------------------------
>
> Key: SPARK-27145
> URL: https://issues.apache.org/jira/browse/SPARK-27145
> Project: Spark
> Issue Type: Bug
> Components: SQL, Tests
> Affects Versions: 2.3.3, 2.4.0, 3.0.0
> Reporter: shahid
> Priority: Minor
>
> We create many stores in the SQLAppStatusListenerSuite, but we need to
> close the store after each test.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27144) Explode with structType may throw NPE
Yoga created SPARK-27144:
----------------------------
             Summary: Explode with structType may throw NPE
                 Key: SPARK-27144
                 URL: https://issues.apache.org/jira/browse/SPARK-27144
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.0
         Environment: Spark 2.3.0, local mode.
            Reporter: Yoga

Create a DataFrame containing two columns named [weight, animal], where weight's nullable is false and animal's nullable is true. Put a null value in the animal column, then construct a new column with

{code:java}
explode(
  array(
    struct(lit("weight").alias("key"), col("weight").cast(StringType).alias("value")),
    struct(lit("animal").alias("key"), col("animal").cast(StringType).alias("value"))
  )
)
{code}

Selecting the struct with .* then makes Spark throw an NPE:

{code:java}
19/03/13 14:39:10 INFO DAGScheduler: ResultStage 3 (show at SparkTest.scala:74) failed in 0.043 s due to Job aborted due to stage failure: Task 3 in stage 3.0 failed 1 times, most recent failure: Lost task 3.0 in stage 3.0 (TID 9, localhost, executor driver): java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:194)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.project_doConsume$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}

Code to reproduce:

{code:java}
val data = Seq(
  Row(20.0, "dog", "a"),
  Row(3.5, "cat", "b"),
  Row(0.06, null, "c")
)
val schema = StructType(List(
  StructField("weight", DoubleType, false),
  StructField("animal", StringType, true),
  StructField("extra", StringType, true)
))
// build the source DataFrame from data and schema
val originalDF = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

val col1 = "weight"
val col2 = "animal"

// this fails in select("test.*")
val df1 = originalDF.withColumn("test",
  explode(
    array(
      struct(lit(col1).alias("key"), col(col1).cast(StringType).alias("value")),
      struct(lit(col2).alias("key"), col(col2).cast(StringType).alias("value"))
    )
  )
)
df1.printSchema()
df1.select("test.*").show()

// this succeeds in select("test.*")
val df2 = originalDF.withColumn("test",
  explode(
    array(
      struct(lit(col2).alias("key"), col(col2).cast(StringType).alias("value")),
      struct(lit(col1).alias("key"), col(col1).cast(StringType).alias("value"))
    )
  )
)
df2.printSchema()
df2.select("test.*").show()
{code}
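Besides reordering the array elements (the reporter's df2 already shows that works), a possible workaround is to force the first struct's `value` expression to be nullable so both array elements agree on the same schema. This is an untested sketch against the reproduction above; the `when(lit(true), …)` wrapper is my assumption for producing a nullable expression, not something from the report:

{code:java}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// Untested workaround sketch: wrap the non-nullable weight column so its
// cast result is treated as nullable, making both struct elements of the
// array share the schema (key: string, value: nullable string).
val df3 = originalDF.withColumn("test",
  explode(
    array(
      struct(lit("weight").alias("key"),
             when(lit(true), col("weight").cast(StringType)).alias("value")),
      struct(lit("animal").alias("key"),
             col("animal").cast(StringType).alias("value"))
    )
  )
)
df3.select("test.*").show()
{code}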
[jira] [Assigned] (SPARK-27145) Close store after test, in the SQLAppStatusListenerSuite
[ https://issues.apache.org/jira/browse/SPARK-27145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27145:
------------------------------------
    Assignee:     (was: Apache Spark)

> Close store after test, in the SQLAppStatusListenerSuite
> --------------------------------------------------------
>
> Key: SPARK-27145
> URL: https://issues.apache.org/jira/browse/SPARK-27145
> Project: Spark
> Issue Type: Bug
> Components: SQL, Tests
> Affects Versions: 2.3.3, 2.4.0, 3.0.0
> Reporter: shahid
> Priority: Minor
>
> We create many stores in the SQLAppStatusListenerSuite, but we need to
> close the store after each test.
[jira] [Updated] (SPARK-27144) Explode with structType may throw NPE when the first column's nullable is false while the second column's nullable is true
[ https://issues.apache.org/jira/browse/SPARK-27144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yoga updated SPARK-27144:
-------------------------
    Summary: Explode with structType may throw NPE when the first column's nullable is false while the second column's nullable is true  (was: Explode with structType may throw NPE)

> Explode with structType may throw NPE when the first column's nullable is
> false while the second column's nullable is true
> -------------------------------------------------------------------------
>
> Key: SPARK-27144
> URL: https://issues.apache.org/jira/browse/SPARK-27144
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.0
> Environment: Spark 2.3.0, local mode.
> Reporter: Yoga
> Priority: Major
>
> (Quoted issue description, stack trace, and reproduction code trimmed; they
> duplicate the original report above verbatim.)
[jira] [Comment Edited] (SPARK-27144) Explode with structType may throw NPE
[ https://issues.apache.org/jira/browse/SPARK-27144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791376#comment-16791376 ]

Yoga edited comment on SPARK-27144 at 3/13/19 6:48 AM:
-------------------------------------------------------

I think the problem comes from how structs with differing field nullability among the array elements are handled: after the explode, the schema of the `test` column is wrong, as below,

{code:java}
root
 |-- weight: double (nullable = false)
 |-- animal: string (nullable = true)
 |-- extra: string (nullable = true)
 |-- test: struct (nullable = false)
 |    |-- key: string (nullable = false)
 |    |-- value: string (nullable = false)
{code}

The `value` field in `test` is marked non-nullable, yet the column does in fact contain NULL values.

> Explode with structType may throw NPE
> -------------------------------------
>
> Key: SPARK-27144
> URL: https://issues.apache.org/jira/browse/SPARK-27144
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.0
> Environment: Spark 2.3.0, local mode.
> Reporter: Yoga
> Priority: Major
>
> (Quoted issue description, stack trace, and reproduction code trimmed; they
> duplicate the original report above verbatim.)
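The commenter's schema observation, that the exploded struct inherits the field nullability of the *first* array element, can be checked with a small spark-shell sketch. This is untested; `originalDF` is the reporter's DataFrame from the reproduction code, and Spark 2.3.x is assumed:

{code:java}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// Two structs whose `value` fields differ only in nullability:
// weight is declared non-nullable in the schema, animal is nullable.
val fromWeight = struct(lit("weight").alias("key"), col("weight").cast(StringType).alias("value"))
val fromAnimal = struct(lit("animal").alias("key"), col("animal").cast(StringType).alias("value"))

// If array() really takes its element schema from the first element, the two
// printed schemas should disagree on the nullability of test.value.
originalDF.select(explode(array(fromWeight, fromAnimal)).alias("test")).printSchema()
originalDF.select(explode(array(fromAnimal, fromWeight)).alias("test")).printSchema()
{code}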
[jira] [Comment Edited] (SPARK-27144) Explode with structType may throw NPE
[ https://issues.apache.org/jira/browse/SPARK-27144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791376#comment-16791376 ]

Yoga edited comment on SPARK-27144 at 3/13/19 6:48 AM:
-------------------------------------------------------

I think the problem comes from how structs with differing field nullability among the array elements are handled: after the explode, the schema of the `test` column is wrong, as below,

{code:java}
root
 |-- weight: double (nullable = false)
 |-- animal: string (nullable = true)
 |-- extra: string (nullable = true)
 |-- test: struct (nullable = false)
 |    |-- key: string (nullable = false)
 |    |-- value: string (nullable = false)
{code}

The `value` field in `test` is set to be NOT-nullable, yet the column does in fact contain NULL values.

> Explode with structType may throw NPE
> -------------------------------------
>
> Key: SPARK-27144
> URL: https://issues.apache.org/jira/browse/SPARK-27144
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.0
> Environment: Spark 2.3.0, local mode.
> Reporter: Yoga
> Priority: Major
>
> (Quoted issue description, stack trace, and reproduction code trimmed; they
> duplicate the original report above verbatim.)
[jira] [Created] (SPARK-27145) Close store after test, in SQLAppStatusListenerSuite
shahid created SPARK-27145:
------------------------------
             Summary: Close store after test, in SQLAppStatusListenerSuite
                 Key: SPARK-27145
                 URL: https://issues.apache.org/jira/browse/SPARK-27145
             Project: Spark
          Issue Type: Bug
          Components: SQL, Tests
    Affects Versions: 2.4.0, 2.3.3, 3.0.0
            Reporter: shahid

We create many stores in the SQLAppStatusListenerSuite, but we need to close the store after each test.
[jira] [Commented] (SPARK-27017) Creating orc table with special symbols in column name via spark.sql
[ https://issues.apache.org/jira/browse/SPARK-27017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791364#comment-16791364 ]

Chakravarthi commented on SPARK-27017:
--------------------------------------

Hi [~uNxe], sorry for the delay. Please check this JIRA: SPARK-21912

> Creating orc table with special symbols in column name via spark.sql
> --------------------------------------------------------------------
>
> Key: SPARK-27017
> URL: https://issues.apache.org/jira/browse/SPARK-27017
> Project: Spark
> Issue Type: Question
> Components: Spark Shell
> Affects Versions: 2.3.0
> Reporter: Henryk Cesnolovic
> Priority: Major
>
> The issue is creating an ORC table with special symbols in a column name in
> Spark with Hive support. Example:
> _spark.sql("Create table abc_orc (`Column with speci@l symbo|s` string) stored as orc")_
> throws org.apache.spark.sql.AnalysisException: Column name "Column with
> speci@l symbo|s" contains invalid character(s). Please use alias to rename it.
> It's interesting because in Hive we can create such a table, and after that,
> in Spark, we can select data from that table and it resolves the schema
> correctly.
> My question is: is this correct behaviour of Spark, and if so, what is the
> reason for that behaviour?
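One workaround in the spirit of the error's own "use alias to rename it" hint is to rename the offending columns to safe names before writing. This is an untested sketch; `df` is a hypothetical DataFrame holding the data, and the character class in the regex is my assumption about which characters the validation rejects:

{code:java}
// Sketch: replace characters the ORC writer rejects in field names with
// underscores, then write the table under the cleaned schema.
val cleaned = df.columns.foldLeft(df) { (d, name) =>
  d.withColumnRenamed(name, name.replaceAll("[ ,;{}()\\n\\t=@|]", "_"))
}
cleaned.write.format("orc").saveAsTable("abc_orc")
{code}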
[jira] [Commented] (SPARK-25449) Don't send zero accumulators in heartbeats
[ https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791321#comment-16791321 ] Xiao Li commented on SPARK-25449: - This changed the unit of conf. > Don't send zero accumulators in heartbeats > -- > > Key: SPARK-25449 > URL: https://issues.apache.org/jira/browse/SPARK-25449 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Mukul Murthy >Assignee: Mukul Murthy >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Heartbeats sent from executors to the driver every 10 seconds contain metrics > and are generally on the order of a few KBs. However, for large jobs with > lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks > to die with heartbeat failures. We can mitigate this by not sending zero > metrics to the driver.
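The mitigation described above is simple to sketch in isolation. This is illustrative only; the real change lives in Spark's executor heartbeat path, and the object and method names below are made up for the example:

{code:java}
// Illustrative sketch: drop zero-valued accumulator updates before they are
// put into the heartbeat message, shrinking the payload for large jobs where
// most of the per-task metrics are zero.
object HeartbeatTrimming {
  def nonZeroUpdates(updates: Map[String, Long]): Map[String, Long] =
    updates.filter { case (_, value) => value != 0L }
}

// nonZeroUpdates(Map("bytesRead" -> 1024L, "shuffleWriteTime" -> 0L))
// keeps only the "bytesRead" entry.
{code}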
[jira] [Updated] (SPARK-25449) Don't send zero accumulators in heartbeats
[ https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25449: Labels: release-notes (was: ) > Don't send zero accumulators in heartbeats > -- > > Key: SPARK-25449 > URL: https://issues.apache.org/jira/browse/SPARK-25449 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Mukul Murthy >Assignee: Mukul Murthy >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Heartbeats sent from executors to the driver every 10 seconds contain metrics > and are generally on the order of a few KBs. However, for large jobs with > lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks > to die with heartbeat failures. We can mitigate this by not sending zero > metrics to the driver.
[jira] [Commented] (SPARK-27141) Use ConfigEntry for hardcoded configs Yarn
[ https://issues.apache.org/jira/browse/SPARK-27141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791284#comment-16791284 ]

Sandeep Katta commented on SPARK-27141:
---------------------------------------

I will be working on this PR

> Use ConfigEntry for hardcoded configs Yarn
> ------------------------------------------
>
> Key: SPARK-27141
> URL: https://issues.apache.org/jira/browse/SPARK-27141
> Project: Spark
> Issue Type: Sub-task
> Components: YARN
> Affects Versions: 3.0.0
> Reporter: wangjiaochun
> Priority: Major
> Fix For: 3.0.0
>
> Some of the YARN-related files below use hardcoded config strings instead of
> ConfigEntry values; try to replace them.
> ApplicationMaster
> YarnAllocatorSuite
> ApplicationMasterSuite
> BaseYarnClusterSuite
> YarnClusterSuite
[jira] [Assigned] (SPARK-27142) Provide REST API for SQL level information
[ https://issues.apache.org/jira/browse/SPARK-27142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27142:
------------------------------------
    Assignee: Apache Spark

> Provide REST API for SQL level information
> ------------------------------------------
>
> Key: SPARK-27142
> URL: https://issues.apache.org/jira/browse/SPARK-27142
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Ajith S
> Assignee: Apache Spark
> Priority: Minor
>
> Currently, SQL information for monitoring a Spark application is not
> available from REST, only via the UI. REST provides only applications, jobs,
> stages, and environment. This JIRA is targeted at providing a REST API so
> that SQL level information can be found.
[jira] [Commented] (SPARK-27143) Provide REST API for JDBC/ODBC level information
[ https://issues.apache.org/jira/browse/SPARK-27143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791272#comment-16791272 ]

Ajith S commented on SPARK-27143:
---------------------------------

ping [~srowen] [~cloud_fan] [~dongjoon] Please suggest if this sounds reasonable

> Provide REST API for JDBC/ODBC level information
> ------------------------------------------------
>
> Key: SPARK-27143
> URL: https://issues.apache.org/jira/browse/SPARK-27143
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Ajith S
> Priority: Minor
>
> Currently, JDBC/ODBC information for monitoring a Spark application is not
> available from REST, only via the UI. REST provides only applications, jobs,
> stages, and environment. This JIRA is targeted at providing a REST API so
> that JDBC/ODBC level information like session statistics and SQL statistics
> can be provided.
[jira] [Commented] (SPARK-27143) Provide REST API for JDBC/ODBC level information
[ https://issues.apache.org/jira/browse/SPARK-27143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791270#comment-16791270 ]

Ajith S commented on SPARK-27143:
---------------------------------

I will be working on this

> Provide REST API for JDBC/ODBC level information
> ------------------------------------------------
>
> Key: SPARK-27143
> URL: https://issues.apache.org/jira/browse/SPARK-27143
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Ajith S
> Priority: Minor
>
> Currently, JDBC/ODBC information for monitoring a Spark application is not
> available from REST, only via the UI. REST provides only applications, jobs,
> stages, and environment. This JIRA is targeted at providing a REST API so
> that JDBC/ODBC level information like session statistics and SQL statistics
> can be provided.
[jira] [Created] (SPARK-27143) Provide REST API for JDBC/ODBC level information
Ajith S created SPARK-27143:
---------------------------------
             Summary: Provide REST API for JDBC/ODBC level information
                 Key: SPARK-27143
                 URL: https://issues.apache.org/jira/browse/SPARK-27143
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Ajith S

Currently, JDBC/ODBC information for monitoring a Spark application is not available from REST, only via the UI. REST provides only applications, jobs, stages, and environment. This JIRA is targeted at providing a REST API so that JDBC/ODBC level information like session statistics and SQL statistics can be provided.
[jira] [Commented] (SPARK-27142) Provide REST API for SQL level information
[ https://issues.apache.org/jira/browse/SPARK-27142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791269#comment-16791269 ]

Ajith S commented on SPARK-27142:
---------------------------------

I will be working on this

> Provide REST API for SQL level information
> ------------------------------------------
>
> Key: SPARK-27142
> URL: https://issues.apache.org/jira/browse/SPARK-27142
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Ajith S
> Priority: Minor
>
> Currently, SQL information for monitoring a Spark application is not
> available from REST, only via the UI. REST provides only applications, jobs,
> stages, and environment. This JIRA is targeted at providing a REST API so
> that SQL level information can be found.
[jira] [Assigned] (SPARK-27142) Provide REST API for SQL level information
[ https://issues.apache.org/jira/browse/SPARK-27142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27142:
------------------------------------
    Assignee:     (was: Apache Spark)

> Provide REST API for SQL level information
> ------------------------------------------
>
> Key: SPARK-27142
> URL: https://issues.apache.org/jira/browse/SPARK-27142
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Ajith S
> Priority: Minor
>
> Currently, SQL information for monitoring a Spark application is not
> available from REST, only via the UI. REST provides only applications, jobs,
> stages, and environment. This JIRA is targeted at providing a REST API so
> that SQL level information can be found.
[jira] [Created] (SPARK-27142) Provide REST API for SQL level information
Ajith S created SPARK-27142:
---------------------------------
             Summary: Provide REST API for SQL level information
                 Key: SPARK-27142
                 URL: https://issues.apache.org/jira/browse/SPARK-27142
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Ajith S

Currently, SQL information for monitoring a Spark application is not available from REST, only via the UI. REST provides only applications, jobs, stages, and environment. This JIRA is targeted at providing a REST API so that SQL level information can be found.
[jira] [Created] (SPARK-27141) Use ConfigEntry for hardcoded configs Yarn
wangjiaochun created SPARK-27141:
------------------------------------
             Summary: Use ConfigEntry for hardcoded configs Yarn
                 Key: SPARK-27141
                 URL: https://issues.apache.org/jira/browse/SPARK-27141
             Project: Spark
          Issue Type: Sub-task
          Components: YARN
    Affects Versions: 3.0.0
            Reporter: wangjiaochun
             Fix For: 3.0.0

Some of the YARN-related files below use hardcoded config strings instead of ConfigEntry values; try to replace them.

ApplicationMaster
YarnAllocatorSuite
ApplicationMasterSuite
BaseYarnClusterSuite
YarnClusterSuite
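For reference, the ConfigEntry pattern this ticket asks for looks roughly like the sketch below. The entry shown, spark.yarn.am.waitTime, is an existing YARN config in Spark's internal config DSL; whether it is one of the strings this ticket actually replaces is an assumption:

{code:java}
// Define the config once as a typed entry instead of repeating the raw string
// "spark.yarn.am.waitTime" across ApplicationMaster and the test suites:
private[spark] val AM_MAX_WAIT_TIME = ConfigBuilder("spark.yarn.am.waitTime")
  .timeConf(TimeUnit.MILLISECONDS)
  .createWithDefaultString("100s")

// Call sites then read it through the typed accessor, getting parsing and the
// default for free:
val waitTimeMs: Long = sparkConf.get(AM_MAX_WAIT_TIME)
{code}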
[jira] [Updated] (SPARK-26936) On yarn-client mode, insert overwrite local directory can not create temporary path in local staging directory
[ https://issues.apache.org/jira/browse/SPARK-26936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-26936: --- Summary: On yarn-client mode, insert overwrite local directory can not create temporary path in local staging directory (was: insert overwrite local directory can not create temporary path in local staging directory) > On yarn-client mode, insert overwrite local directory can not create > temporary path in local staging directory > -- > > Key: SPARK-26936 > URL: https://issues.apache.org/jira/browse/SPARK-26936 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0, 3.0.0 >Reporter: jiaan.geng >Priority: Major > > Let me introduce bug of 'insert overwrite local directory'. > If I execute the SQL mentioned before, a HiveException will appear as follows: > {code:java} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2037) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194) > ... 36 more > Caused by: org.apache.spark.SparkException: Task failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: > java.io.IOException: Mkdirs failed to create > file:/home/xitong/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-1/_temporary/0/_temporary/attempt_20190219173233_0002_m_00_3 > (exists=false, > cwd=file:/data10/yarn/nm-local-dir/usercache/xitong/appcache/application_1543893582405_6126857/container_e124_1543893582405_6126857_01_11) > at > org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249) > at > 
org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123) > at > org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267) > at > org.apache.sp
[jira] [Updated] (SPARK-26936) insert overwrite local directory can not create temporary path in local staging directory
[ https://issues.apache.org/jira/browse/SPARK-26936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-26936: --- Description: Let me introduce bug of 'insert overwrite local directory'. If I execute the SQL mentioned before, a HiveException will appear as follows: {code:java} Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2037) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194) ... 36 more Caused by: org.apache.spark.SparkException: Task failed while writing rows. 
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs failed to create file:/home/xitong/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-1/_temporary/0/_temporary/attempt_20190219173233_0002_m_00_3 (exists=false, cwd=file:/data10/yarn/nm-local-dir/usercache/xitong/appcache/application_1543893582405_6126857/container_e124_1543893582405_6126857_01_11) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249) at org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123) at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269) at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) ... 8 more Caused by: java.io.IOException: Mkdirs failed to create file:/home/xitong/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-1/_temporary/0/_temporary/attempt_20190219173233_0002_m_00_3 (exists=false, cwd=file:/data10/yarn/nm-local-dir/usercache/xitong/appcache/application_1543893582405_6126857/container_e124_1543893582405_6126857_01_11) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:449) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:435) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:928) at org.apache.hadoop.fs.File
[jira] [Assigned] (SPARK-27140) The feature is 'insert overwrite local directory' has an inconsistent behavior in different environment.
[ https://issues.apache.org/jira/browse/SPARK-27140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27140: Assignee: (was: Apache Spark) > The feature is 'insert overwrite local directory' has an inconsistent > behavior in different environment. > > > Key: SPARK-27140 > URL: https://issues.apache.org/jira/browse/SPARK-27140 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0, 3.0.0 >Reporter: jiaan.geng >Priority: Major > > In local[*] mode, maropu give a test case as follows: > {code:java} > $ls /tmp/noexistdir > ls: /tmp/noexistdir: No such file or directory > scala> sql("""create table t(c0 int, c1 int)""") > scala> spark.table("t").explain > == Physical Plan == > Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6] > scala> sql("""insert into t values(1, 1)""") > scala> sql("""select * from t""").show > +---+---+ > | c0| c1| > +---+---+ > | 1| 1| > +---+---+ > scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * > from t""") > $ls /tmp/noexistdir/t/ > _SUCCESS part-0-bbea4213-071a-49b4-aac8-8510e7263d45-c000 > {code} > This test case prove spark will create the not exists path and move middle > result from local temporary path to created path.This test based on newest > master. > I follow the test case provided by maropu,but find another behavior. > I run these SQL maropu provided on local[*] deploy mode based on 2.3.0. 
> Inconsistent behavior appears as follows: > {code:java} > ls /tmp/noexistdir > ls: cannot access /tmp/noexistdir: No such file or directory > scala> sql("""create table t(c0 int, c1 int)""") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.table("t").explain > == Physical Plan == > HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6] > scala> sql("""insert into t values(1, 1)""") > scala> sql("""select * from t""").show > +---+---+ > > | c0| c1| > +---+---+ > | 1| 1| > +---+---+ > scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * > from t""") > res1: org.apache.spark.sql.DataFrame = [] > ls /tmp/noexistdir/t/ > /tmp/noexistdir/t > vi /tmp/noexistdir/t > 1 > {code} > Then I pull the master branch and compile it and deploy it on my hadoop > cluster.I get the inconsistent behavior again. The spark version to test is > 3.0.0. > {code:java} > ls /tmp/noexistdir > ls: cannot access /tmp/noexistdir: No such file or directory > Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector > with the Serial old collector is deprecated and will likely be removed in a > future release > Spark context Web UI available at http://10.198.66.204:55326 > Spark context available as 'sc' (master = local[*], app id = > local-1551259036573). > Spark session available as 'spark'. > Welcome to spark version 3.0.0-SNAPSHOT > Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> sql("""select * from t""").show > +---+---+ > > | c0| c1| > +---+---+ > | 1| 1| > +---+---+ > scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * > from t""") > res1: org.apache.spark.sql.DataFrame = [] > > scala> > ll /tmp/noexistdir/t > -rw-r--r-- 1 xitong xitong 0 Feb 27 17:19 /tmp/noexistdir/t > vi /tmp/noexistdir/t > 1 > {code} > The /tmp/noexistdir/t is a file too. > I create a PR `https://github.com/apache/spark/pull/23950` used for test the > behavior by UT. > UT results are the same as those of maropu's test, but different from mine. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27140) The feature is 'insert overwrite local directory' has an inconsistent behavior in different environment.
[ https://issues.apache.org/jira/browse/SPARK-27140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27140: Assignee: Apache Spark > The feature is 'insert overwrite local directory' has an inconsistent > behavior in different environment. > > > Key: SPARK-27140 > URL: https://issues.apache.org/jira/browse/SPARK-27140 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0, 3.0.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > In local[*] mode, maropu give a test case as follows: > {code:java} > $ls /tmp/noexistdir > ls: /tmp/noexistdir: No such file or directory > scala> sql("""create table t(c0 int, c1 int)""") > scala> spark.table("t").explain > == Physical Plan == > Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6] > scala> sql("""insert into t values(1, 1)""") > scala> sql("""select * from t""").show > +---+---+ > | c0| c1| > +---+---+ > | 1| 1| > +---+---+ > scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * > from t""") > $ls /tmp/noexistdir/t/ > _SUCCESS part-0-bbea4213-071a-49b4-aac8-8510e7263d45-c000 > {code} > This test case prove spark will create the not exists path and move middle > result from local temporary path to created path.This test based on newest > master. > I follow the test case provided by maropu,but find another behavior. > I run these SQL maropu provided on local[*] deploy mode based on 2.3.0. 
> Inconsistent behavior appears as follows: > {code:java} > ls /tmp/noexistdir > ls: cannot access /tmp/noexistdir: No such file or directory > scala> sql("""create table t(c0 int, c1 int)""") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.table("t").explain > == Physical Plan == > HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6] > scala> sql("""insert into t values(1, 1)""") > scala> sql("""select * from t""").show > +---+---+ > > | c0| c1| > +---+---+ > | 1| 1| > +---+---+ > scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * > from t""") > res1: org.apache.spark.sql.DataFrame = [] > ls /tmp/noexistdir/t/ > /tmp/noexistdir/t > vi /tmp/noexistdir/t > 1 > {code} > Then I pull the master branch and compile it and deploy it on my hadoop > cluster.I get the inconsistent behavior again. The spark version to test is > 3.0.0. > {code:java} > ls /tmp/noexistdir > ls: cannot access /tmp/noexistdir: No such file or directory > Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector > with the Serial old collector is deprecated and will likely be removed in a > future release > Spark context Web UI available at http://10.198.66.204:55326 > Spark context available as 'sc' (master = local[*], app id = > local-1551259036573). > Spark session available as 'spark'. > Welcome to spark version 3.0.0-SNAPSHOT > Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> sql("""select * from t""").show > +---+---+ > > | c0| c1| > +---+---+ > | 1| 1| > +---+---+ > scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * > from t""") > res1: org.apache.spark.sql.DataFrame = [] > > scala> > ll /tmp/noexistdir/t > -rw-r--r-- 1 xitong xitong 0 Feb 27 17:19 /tmp/noexistdir/t > vi /tmp/noexistdir/t > 1 > {code} > The /tmp/noexistdir/t is a file too. > I create a PR `https://github.com/apache/spark/pull/23950` used for test the > behavior by UT. > UT results are the same as those of maropu's test, but different from mine. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27140) The feature is 'insert overwrite local directory' has an inconsistent behavior in different environment.
[ https://issues.apache.org/jira/browse/SPARK-27140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791236#comment-16791236 ] Apache Spark commented on SPARK-27140: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/23950 > The feature is 'insert overwrite local directory' has an inconsistent > behavior in different environment. > > > Key: SPARK-27140 > URL: https://issues.apache.org/jira/browse/SPARK-27140 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0, 3.0.0 >Reporter: jiaan.geng >Priority: Major > > In local[*] mode, maropu give a test case as follows: > {code:java} > $ls /tmp/noexistdir > ls: /tmp/noexistdir: No such file or directory > scala> sql("""create table t(c0 int, c1 int)""") > scala> spark.table("t").explain > == Physical Plan == > Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6] > scala> sql("""insert into t values(1, 1)""") > scala> sql("""select * from t""").show > +---+---+ > | c0| c1| > +---+---+ > | 1| 1| > +---+---+ > scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * > from t""") > $ls /tmp/noexistdir/t/ > _SUCCESS part-0-bbea4213-071a-49b4-aac8-8510e7263d45-c000 > {code} > This test case prove spark will create the not exists path and move middle > result from local temporary path to created path.This test based on newest > master. > I follow the test case provided by maropu,but find another behavior. > I run these SQL maropu provided on local[*] deploy mode based on 2.3.0. 
> Inconsistent behavior appears as follows: > {code:java} > ls /tmp/noexistdir > ls: cannot access /tmp/noexistdir: No such file or directory > scala> sql("""create table t(c0 int, c1 int)""") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.table("t").explain > == Physical Plan == > HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6] > scala> sql("""insert into t values(1, 1)""") > scala> sql("""select * from t""").show > +---+---+ > > | c0| c1| > +---+---+ > | 1| 1| > +---+---+ > scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * > from t""") > res1: org.apache.spark.sql.DataFrame = [] > ls /tmp/noexistdir/t/ > /tmp/noexistdir/t > vi /tmp/noexistdir/t > 1 > {code} > Then I pull the master branch and compile it and deploy it on my hadoop > cluster.I get the inconsistent behavior again. The spark version to test is > 3.0.0. > {code:java} > ls /tmp/noexistdir > ls: cannot access /tmp/noexistdir: No such file or directory > Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector > with the Serial old collector is deprecated and will likely be removed in a > future release > Spark context Web UI available at http://10.198.66.204:55326 > Spark context available as 'sc' (master = local[*], app id = > local-1551259036573). > Spark session available as 'spark'. > Welcome to spark version 3.0.0-SNAPSHOT > Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> sql("""select * from t""").show > +---+---+ > > | c0| c1| > +---+---+ > | 1| 1| > +---+---+ > scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * > from t""") > res1: org.apache.spark.sql.DataFrame = [] > > scala> > ll /tmp/noexistdir/t > -rw-r--r-- 1 xitong xitong 0 Feb 27 17:19 /tmp/noexistdir/t > vi /tmp/noexistdir/t > 1 > {code} > The /tmp/noexistdir/t is a file too. > I create a PR `https://github.com/apache/spark/pull/23950` used for test the > behavior by UT. > UT results are the same as those of maropu's test, but different from mine. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27140) The feature is 'insert overwrite local directory' has an inconsistent behavior in different environment.
[ https://issues.apache.org/jira/browse/SPARK-27140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27140: --- Description: In local[*] mode, maropu gives a test case as follows: {code:java} $ls /tmp/noexistdir ls: /tmp/noexistdir: No such file or directory scala> sql("""create table t(c0 int, c1 int)""") scala> spark.table("t").explain == Physical Plan == Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6] scala> sql("""insert into t values(1, 1)""") scala> sql("""select * from t""").show +---+---+ | c0| c1| +---+---+ | 1| 1| +---+---+ scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""") $ls /tmp/noexistdir/t/ _SUCCESS part-0-bbea4213-071a-49b4-aac8-8510e7263d45-c000 {code} This test case proves Spark will create the non-existent path and move the intermediate result from the local temporary path to the created path. The test is based on the newest master. I followed the test case maropu provided, but found different behavior: I ran the same SQL in local[*] deploy mode on 2.3.0. Inconsistent behavior appears as follows: {code:java} ls /tmp/noexistdir ls: cannot access /tmp/noexistdir: No such file or directory scala> sql("""create table t(c0 int, c1 int)""") res0: org.apache.spark.sql.DataFrame = [] scala> spark.table("t").explain == Physical Plan == HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6] scala> sql("""insert into t values(1, 1)""") scala> sql("""select * from t""").show +---+---+ | c0| c1| +---+---+ | 1| 1| +---+---+ scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""") res1: org.apache.spark.sql.DataFrame = [] ls /tmp/noexistdir/t/ /tmp/noexistdir/t vi /tmp/noexistdir/t 1 {code} Then I pulled the master branch, compiled it, and deployed it on my Hadoop cluster; I got the inconsistent behavior again. 
The Spark version under test is 3.0.0. {code:java} ls /tmp/noexistdir ls: cannot access /tmp/noexistdir: No such file or directory Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release Spark context Web UI available at http://10.198.66.204:55326 Spark context available as 'sc' (master = local[*], app id = local-1551259036573). Spark session available as 'spark'. Welcome to spark version 3.0.0-SNAPSHOT Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) Type in expressions to have them evaluated. Type :help for more information. scala> sql("""select * from t""").show +---+---+ | c0| c1| +---+---+ | 1| 1| +---+---+ scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""") res1: org.apache.spark.sql.DataFrame = [] scala> ll /tmp/noexistdir/t -rw-r--r-- 1 xitong xitong 0 Feb 27 17:19 /tmp/noexistdir/t vi /tmp/noexistdir/t 1 {code} Here, /tmp/noexistdir/t is a file too. I created PR `https://github.com/apache/spark/pull/23950` to test this behavior with a UT. The UT results are the same as those of maropu's test, but different from mine. 
was: In local[*] mode, maropu give a test case as follows: {code:java} $ls /tmp/noexistdir ls: /tmp/noexistdir: No such file or directory scala> sql("""create table t(c0 int, c1 int)""") scala> spark.table("t").explain == Physical Plan == Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6] scala> sql("""insert into t values(1, 1)""") scala> sql("""select * from t""").show +---+---+ | c0| c1| +---+---+ | 1| 1| +---+---+ scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""") $ls /tmp/noexistdir/t/ _SUCCESS part-0-bbea4213-071a-49b4-aac8-8510e7263d45-c000 {code} This test case prove spark will create the not exists path and move middle result from local temporary path to created path.This test based on newest master. I follow the test case provided by maropu,but find another behavior. I run these SQL maropu provided on local[*] deploy mode based on 2.3.0. Inconsistent behavior appears as follows: {code:java} ls /tmp/noexistdir ls: cannot access /tmp/noexistdir: No such file or directory scala> sql("""create table t(c0 int, c1 int)""") res0: org.apache.spark.sql.DataFrame = [] scala> spark.table("t").explain == Physical Plan == HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6] scala> sql("""insert into t values(1, 1)""") scala> sql("""select * from t""").show +---+---+
[jira] [Resolved] (SPARK-26976) Forbid reserved keywords as identifiers when ANSI mode is on
[ https://issues.apache.org/jira/browse/SPARK-26976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-26976. -- Resolution: Fixed Fix Version/s: 3.0.0 Resolved by https://github.com/apache/spark/pull/23880 > Forbid reserved keywords as identifiers when ANSI mode is on > > > Key: SPARK-26976 > URL: https://issues.apache.org/jira/browse/SPARK-26976 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.0.0 > > > We need to throw an exception to forbid reserved keywords as identifiers when > ANSI mode is on. > This is a follow-up of SPARK-26215. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27140) The feature is 'insert overwrite local directory' has an inconsistent behavior in different environment.
jiaan.geng created SPARK-27140: -- Summary: The feature is 'insert overwrite local directory' has an inconsistent behavior in different environment. Key: SPARK-27140 URL: https://issues.apache.org/jira/browse/SPARK-27140 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 2.3.0, 3.0.0 Reporter: jiaan.geng In local[*] mode, maropu gives a test case as follows: {code:java} $ls /tmp/noexistdir ls: /tmp/noexistdir: No such file or directory scala> sql("""create table t(c0 int, c1 int)""") scala> spark.table("t").explain == Physical Plan == Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6] scala> sql("""insert into t values(1, 1)""") scala> sql("""select * from t""").show +---+---+ | c0| c1| +---+---+ | 1| 1| +---+---+ scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""") $ls /tmp/noexistdir/t/ _SUCCESS part-0-bbea4213-071a-49b4-aac8-8510e7263d45-c000 {code} This test case proves Spark will create the non-existent path and move the intermediate result from the local temporary path to the created path. The test is based on the newest master. I followed the test case maropu provided, but found different behavior: I ran the same SQL in local[*] deploy mode on 2.3.0. 
Inconsistent behavior appears as follows: {code:java} ls /tmp/noexistdir ls: cannot access /tmp/noexistdir: No such file or directory scala> sql("""create table t(c0 int, c1 int)""") res0: org.apache.spark.sql.DataFrame = [] scala> spark.table("t").explain == Physical Plan == HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6] scala> sql("""insert into t values(1, 1)""") scala> sql("""select * from t""").show +---+---+ | c0| c1| +---+---+ | 1| 1| +---+---+ scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""") res1: org.apache.spark.sql.DataFrame = [] ls /tmp/noexistdir/t/ /tmp/noexistdir/t vi /tmp/noexistdir/t 1 {code} Then I pulled the master branch, compiled it, and deployed it on my Hadoop cluster; I got the inconsistent behavior again. The Spark version under test is 3.0.0. {code:java} ls /tmp/noexistdir ls: cannot access /tmp/noexistdir: No such file or directory Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release Spark context Web UI available at http://10.198.66.204:55326 Spark context available as 'sc' (master = local[*], app id = local-1551259036573). Spark session available as 'spark'. Welcome to spark version 3.0.0-SNAPSHOT Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131) Type in expressions to have them evaluated. Type :help for more information. scala> sql("""select * from t""").show +---+---+ | c0| c1| +---+---+ | 1| 1| +---+---+ scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""") res1: org.apache.spark.sql.DataFrame = [] scala> ll /tmp/noexistdir/t -rw-r--r-- 1 xitong xitong 0 Feb 27 17:19 /tmp/noexistdir/t vi /tmp/noexistdir/t 1 {code} Here, /tmp/noexistdir/t is a file too. 
So -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27137) Spark captured variable is null if the code is pasted via :paste
[ https://issues.apache.org/jira/browse/SPARK-27137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27137. -- Resolution: Cannot Reproduce Seems not failing in the current master: {code} scala> :paste // Entering paste mode (ctrl-D to finish) val foo = "foo" def f(arg: Any): Unit = { Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo")) } sc.parallelize(Seq(1, 2), 2).foreach(f) // Exiting paste mode, now interpreting. foo: String = foo f: (arg: Any)Unit {code} It would be great if we can identify the JIRA and backport it if applicable. > Spark captured variable is null if the code is pasted via :paste > > > Key: SPARK-27137 > URL: https://issues.apache.org/jira/browse/SPARK-27137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Osira Ben >Priority: Major > > If I execute this piece of code > {code:java} > val foo = "foo" > def f(arg: Any): Unit = { > Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo")) > } > sc.parallelize(Seq(1, 2), 2).foreach(f) > {code} > {{in spark2-shell via :paste it throws}} > {code:java} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val foo = "foo" > def f(arg: Any): Unit = { > Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo")) > } > sc.parallelize(Seq(1, 2), 2).foreach(f) > // Exiting paste mode, now interpreting. > 19/03/11 15:02:06 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 > (TID 2, hadoop.company.com, executor 1): java.lang.NullPointerException: foo > at java.util.Objects.requireNonNull(Objects.java:228) > {code} > However if I execute it pasting without :paste or via spark2-shell -i it > doesn't. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27131) Merge function in QuantileSummaries
[ https://issues.apache.org/jira/browse/SPARK-27131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791217#comment-16791217 ] Hyukjin Kwon commented on SPARK-27131: -- Let's ask questions on the mailing list. You could have a better and quicker answer. (see https://spark.apache.org/community.html) > Merge function in QuantileSummaries > --- > > Key: SPARK-27131 > URL: https://issues.apache.org/jira/browse/SPARK-27131 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.0 >Reporter: Mingchao Tan >Priority: Minor > > In the QuantileSummaries.scala file, line 167 > This function merges two QuantileSummaries into one. You merge the two sampled > arrays and then compress the result. My question is: when compressing the merged array, > why do you use count instead of the sum of count and other.count? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27131) Merge function in QuantileSummaries
[ https://issues.apache.org/jira/browse/SPARK-27131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27131. -- Resolution: Invalid > Merge function in QuantileSummaries > --- > > Key: SPARK-27131 > URL: https://issues.apache.org/jira/browse/SPARK-27131 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.4.0 >Reporter: Mingchao Tan >Priority: Minor > > In the QuantileSummaries.scala file, line 167 > This function merges two QuantileSummaries into one. You merge the two sampled > arrays and then compress the result. My question is: when compressing the merged array, > why do you use count instead of the sum of count and other.count? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
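To make the question above concrete, here is a heavily simplified pure-Python sketch of merging two quantile summaries. It is not Spark's QuantileSummaries implementation (names such as `TinySummary` are invented for illustration); it only shows the point at issue: after a merge, any compression budget should be derived from the combined count (`self.count + other.count`), not from one side's count alone.

```python
# Simplified sketch of merging two quantile summaries (pure Python,
# NOT Spark's QuantileSummaries). The point raised in the issue: after a
# merge, the compression budget should be based on the COMBINED count.

class TinySummary:
    def __init__(self, samples, count, eps=0.1):
        self.samples = sorted(samples)  # retained sample values
        self.count = count              # number of values observed
        self.eps = eps                  # target relative rank error

    def merge(self, other):
        merged = TinySummary(self.samples + other.samples,
                             self.count + other.count, self.eps)
        merged._compress()              # budget uses the merged count
        return merged

    def _compress(self):
        # Keep roughly 1/eps evenly spaced samples; the spacing is derived
        # from the merged sample set so the rank guarantee still holds.
        target = max(1, int(1 / self.eps))
        if len(self.samples) > target:
            step = len(self.samples) / target
            self.samples = [self.samples[int(i * step)] for i in range(target)]

    def query(self, q):
        # Approximate q-th quantile from the retained samples.
        idx = min(int(q * len(self.samples)), len(self.samples) - 1)
        return self.samples[idx]

a = TinySummary(range(0, 50), 50)
b = TinySummary(range(50, 100), 50)
m = a.merge(b)
print(m.count)       # 100: counts are summed on merge
print(m.query(0.5))  # 50: near the true median of 0..99
```

If compression instead used only one side's count, the error budget would be too tight (or too loose) for the merged data, which is exactly the concern the reporter raises about line 167.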
[jira] [Resolved] (SPARK-27045) SQL tab in UI shows actual SQL instead of callsite
[ https://issues.apache.org/jira/browse/SPARK-27045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27045. --- Resolution: Fixed Assignee: Ajith S Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/23958 > SQL tab in UI shows actual SQL instead of callsite > -- > > Key: SPARK-27045 > URL: https://issues.apache.org/jira/browse/SPARK-27045 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 2.3.2, 2.3.3, 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2019-03-04-18-24-27-469.png, > image-2019-03-04-18-24-54-053.png > > > When we run sql in spark ( for example via thrift server), the SparkUI SQL > tab must show SQL instead of stacktrace which is more useful to end user. > Instead in description column it currently shows the callsite shortform which > is less useful > Actual: > !image-2019-03-04-18-24-27-469.png! > > Expected: > !image-2019-03-04-18-24-54-053.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27045) SQL tab in UI shows actual SQL instead of callsite in case of SparkSQLDriver
[ https://issues.apache.org/jira/browse/SPARK-27045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27045: -- Description: When we run SQL in Spark via SparkSQLDriver (thrift server, spark-sql), the SQL string is set via {{setJobDescription}}. The SparkUI SQL tab should show the SQL instead of the stacktrace when {{setJobDescription}} is set, which is more useful to the end user. Instead, the description column currently shows the callsite short form, which is less useful. Actual: !image-2019-03-04-18-24-27-469.png! Expected: !image-2019-03-04-18-24-54-053.png! was: When we run sql in spark ( for example via thrift server), the SparkUI SQL tab must show SQL instead of stacktrace which is more useful to end user. Instead in description column it currently shows the callsite shortform which is less useful Actual: !image-2019-03-04-18-24-27-469.png! Expected: !image-2019-03-04-18-24-54-053.png! > SQL tab in UI shows actual SQL instead of callsite in case of SparkSQLDriver > > > Key: SPARK-27045 > URL: https://issues.apache.org/jira/browse/SPARK-27045 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 2.3.2, 2.3.3, 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2019-03-04-18-24-27-469.png, > image-2019-03-04-18-24-54-053.png > > > When we run SQL in Spark via SparkSQLDriver (thrift server, spark-sql), the SQL > string is set via {{setJobDescription}}. The SparkUI SQL tab should show the SQL > instead of the stacktrace when {{setJobDescription}} is set, which is more > useful to the end user. Instead, the description column currently shows the > callsite short form, which is less useful. > Actual: > !image-2019-03-04-18-24-27-469.png! > > Expected: > !image-2019-03-04-18-24-54-053.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27045) SQL tab in UI shows actual SQL instead of callsite in case of SparkSQLDriver
[ https://issues.apache.org/jira/browse/SPARK-27045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27045: -- Summary: SQL tab in UI shows actual SQL instead of callsite in case of SparkSQLDriver (was: SQL tab in UI shows actual SQL instead of callsite) > SQL tab in UI shows actual SQL instead of callsite in case of SparkSQLDriver > > > Key: SPARK-27045 > URL: https://issues.apache.org/jira/browse/SPARK-27045 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 2.3.2, 2.3.3, 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2019-03-04-18-24-27-469.png, > image-2019-03-04-18-24-54-053.png > > > When we run sql in spark ( for example via thrift server), the SparkUI SQL > tab must show SQL instead of stacktrace which is more useful to end user. > Instead in description column it currently shows the callsite shortform which > is less useful > Actual: > !image-2019-03-04-18-24-27-469.png! > > Expected: > !image-2019-03-04-18-24-54-053.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27130) Automatically select profile when executing sbt-checkstyle
[ https://issues.apache.org/jira/browse/SPARK-27130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27130. -- Resolution: Fixed Fix Version/s: 3.0.0 Fixed in https://github.com/apache/spark/pull/24065 > Automatically select profile when executing sbt-checkstyle > -- > > Key: SPARK-27130 > URL: https://issues.apache.org/jira/browse/SPARK-27130 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27130) Automatically select profile when executing sbt-checkstyle
[ https://issues.apache.org/jira/browse/SPARK-27130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-27130: Assignee: Hyukjin Kwon > Automatically select profile when executing sbt-checkstyle > -- > > Key: SPARK-27130 > URL: https://issues.apache.org/jira/browse/SPARK-27130 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Hyukjin Kwon >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27130) Automatically select profile when executing sbt-checkstyle
[ https://issues.apache.org/jira/browse/SPARK-27130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-27130: Assignee: Yuming Wang (was: Hyukjin Kwon) > Automatically select profile when executing sbt-checkstyle > -- > > Key: SPARK-27130 > URL: https://issues.apache.org/jira/browse/SPARK-27130 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC
[ https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791075#comment-16791075 ] Dongjoon Hyun commented on SPARK-27107: --- Thank you for confirmation, [~Dhruve Ashar]. > Spark SQL Job failing because of Kryo buffer overflow with ORC > -- > > Key: SPARK-27107 > URL: https://issues.apache.org/jira/browse/SPARK-27107 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Dhruve Ashar >Priority: Major > > The issue occurs while trying to read ORC data and setting the SearchArgument. > {code:java} > Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. > Available: 0, required: 9 > Serialization trace: > literalList > (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl) > leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl) > at com.esotericsoftware.kryo.io.Output.require(Output.java:163) > at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614) > at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) > at > org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96) > at > org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315) > at > org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121) > at > org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1
[jira] [Resolved] (SPARK-27034) Nested schema pruning for ORC
[ https://issues.apache.org/jira/browse/SPARK-27034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27034. --- Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/23943 > Nested schema pruning for ORC > - > > Key: SPARK-27034 > URL: https://issues.apache.org/jira/browse/SPARK-27034 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > > We only support nested schema pruning for Parquet currently. This is opened > to propose to support nested schema pruning for ORC too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26176) Verify column name when creating table via `STORED AS`
[ https://issues.apache.org/jira/browse/SPARK-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26176: Assignee: Apache Spark > Verify column name when creating table via `STORED AS` > -- > > Key: SPARK-26176 > URL: https://issues.apache.org/jira/browse/SPARK-26176 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > Labels: starter > > We can issue a reasonable exception when we creating Parquet native tables, > {code:java} > CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1; > {code} > {code:java} > org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains > invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.; > {code} > However, the error messages are misleading when we create a table using the > Hive serde "STORED AS" > {code:java} > CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1; > {code} > {code:java} > 18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST > stored as parquet AS SELECT COUNT(col1) FROM TAB1] > org.apache.spark.SparkException: Job aborted. 
> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97) > at > org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99) > at > org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113) > at > org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201) > at > org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266) > at org.apache.spark.sql.Dataset.(Dataset.scala:201) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62) > at > 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371) > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 3.
[jira] [Assigned] (SPARK-26176) Verify column name when creating table via `STORED AS`
[ https://issues.apache.org/jira/browse/SPARK-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26176: Assignee: (was: Apache Spark) > Verify column name when creating table via `STORED AS` > -- > > Key: SPARK-26176 > URL: https://issues.apache.org/jira/browse/SPARK-26176 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > Labels: starter > > We can issue a reasonable exception when we creating Parquet native tables, > {code:java} > CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1; > {code} > {code:java} > org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains > invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.; > {code} > However, the error messages are misleading when we create a table using the > Hive serde "STORED AS" > {code:java} > CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1; > {code} > {code:java} > 18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST > stored as parquet AS SELECT COUNT(col1) FROM TAB1] > org.apache.spark.SparkException: Job aborted. 
> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97) > at > org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99) > at > org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113) > at > org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201) > at > org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266) > at org.apache.spark.sql.Dataset.(Dataset.scala:201) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62) > at > 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371) > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 3.0 failed 1 times, most re
[jira] [Commented] (SPARK-26176) Verify column name when creating table via `STORED AS`
[ https://issues.apache.org/jira/browse/SPARK-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16791060#comment-16791060 ] Sujith Chacko commented on SPARK-26176: --- Issue is still happening with spark 2.4 latest version. I fixed and raised a PR. > Verify column name when creating table via `STORED AS` > -- > > Key: SPARK-26176 > URL: https://issues.apache.org/jira/browse/SPARK-26176 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > Labels: starter > > We can issue a reasonable exception when we creating Parquet native tables, > {code:java} > CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1; > {code} > {code:java} > org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains > invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.; > {code} > However, the error messages are misleading when we create a table using the > Hive serde "STORED AS" > {code:java} > CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1; > {code} > {code:java} > 18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST > stored as parquet AS SELECT COUNT(col1) FROM TAB1] > org.apache.spark.SparkException: Job aborted. 
> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97) > at > org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99) > at > org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113) > at > org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201) > at > org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266) > at org.apache.spark.sql.Dataset.(Dataset.scala:201) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62) > at > 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371) > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: org.apache.spark.
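The AnalysisException quoted in the SPARK-26176 description names the characters Parquet rejects in attribute names (" ,;{}()\n\t="). A short Python sketch of that kind of validation — illustrative only, not Spark's actual code, with an invented helper name — shows what the Hive `STORED AS` path was missing:

```python
# Illustrative sketch of the column-name check discussed above: Parquet
# attribute names may not contain any of " ,;{}()\n\t=". This mirrors the
# kind of validation the native `USING PARQUET` path performs and the
# Hive `STORED AS` path lacked; it is NOT Spark's actual implementation.

INVALID_CHARS = set(" ,;{}()\n\t=")

def check_column_name(name: str) -> None:
    # Reject the name up front with a clear message, instead of letting
    # the write job fail later with an opaque "Job aborted" stack trace.
    bad = sorted(set(name) & INVALID_CHARS)
    if bad:
        raise ValueError(
            f'Attribute name "{name}" contains invalid character(s) '
            'among " ,;{}()\\n\\t=". Please use alias to rename it.'
        )

check_column_name("total_ids")      # fine: no invalid characters
try:
    check_column_name("count(ID)")  # parentheses are invalid
except ValueError as e:
    print(e)
```

Running the same check early for both code paths would turn the misleading `STORED AS` failure into the same clear message the native Parquet path already produces.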
[jira] [Assigned] (SPARK-27134) array_distinct function does not work correctly with columns containing array of array
[ https://issues.apache.org/jira/browse/SPARK-27134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27134: Assignee: Apache Spark > array_distinct function does not work correctly with columns containing array > of array > -- > > Key: SPARK-27134 > URL: https://issues.apache.org/jira/browse/SPARK-27134 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4, scala 2.11.11 >Reporter: Mike Trenaman >Assignee: Apache Spark >Priority: Major > > The array_distinct function introduced in spark 2.4 is producing strange > results when used on an array column which contains a nested array. The > resulting output can still contain duplicate values, and furthermore, > previously distinct values may be removed. > This is easily repeatable, e.g. with this code: > val df = Seq( > Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5)) > ).toDF("Number_Combinations") > val dfWithDistinct = df.withColumn("distinct_combinations", > array_distinct(col("Number_Combinations"))) > > The initial 'df' DataFrame contains one row, where column > 'Number_Combinations' contains the following values: > [[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]] > > The array_distinct function run on this column produces a new column > containing the following values: > [[1, 2], [1, 2], [1, 2]] > > As you can see, this contains three occurrences of the same value (1, 2), and > furthermore, the distinct values (3, 4), (4, 5) have been removed. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27123) Improve CollapseProject to handle projects cross limit/repartition/sample
[ https://issues.apache.org/jira/browse/SPARK-27123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-27123. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24049 [https://github.com/apache/spark/pull/24049] > Improve CollapseProject to handle projects cross limit/repartition/sample > - > > Key: SPARK-27123 > URL: https://issues.apache.org/jira/browse/SPARK-27123 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > `CollapseProject` optimizer simplifies the plan by merging the adjacent > projects and performing alias substitution. > {code:java} > scala> sql("SELECT b c FROM (SELECT a b FROM t)").explain > == Physical Plan == > *(1) Project [a#5 AS c#1] > +- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5] > {code} > We can do that more complex cases like the following. > *BEFORE* > {code:java} > scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM > t)").explain > == Physical Plan == > *(2) Project [b#0 AS c#1] > +- Exchange RoundRobinPartitioning(1) >+- *(1) Project [a#5 AS b#0] > +- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5] > {code} > *AFTER* > {code:java} > scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM > t)").explain > == Physical Plan == > Exchange RoundRobinPartitioning(1) > +- *(1) Project [a#11 AS c#7] >+- Scan hive default.t [a#11], HiveTableRelation `default`.`t`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#11] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
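The alias substitution that CollapseProject performs, as described above, can be modeled in a few lines. This toy Python sketch (invented names; plain column renames only, whereas the real rule handles arbitrary expressions and checks determinism) shows how `SELECT b AS c` over `SELECT a AS b` collapses into `SELECT a AS c`:

```python
# Toy model of CollapseProject's alias substitution: an inner projection
# maps a -> b and an outer one maps b -> c; collapsing composes them into
# a single projection a -> c. This assumes simple renames only and is NOT
# the actual optimizer rule.

def collapse(outer: dict, inner: dict) -> dict:
    # outer/inner map output-name -> input-name (simple aliases only).
    # For each outer output, follow its source through the inner aliases.
    return {out_name: inner.get(src, src) for out_name, src in outer.items()}

inner = {"b": "a"}             # SELECT a AS b FROM t
outer = {"c": "b"}             # SELECT b AS c FROM (...)
print(collapse(outer, inner))  # {'c': 'a'}  i.e. SELECT a AS c FROM t
```

The improvement in SPARK-27123 is about applying this same composition even when a limit, repartition, or sample node sits between the two projections, as the BEFORE/AFTER plans above show.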
[jira] [Assigned] (SPARK-27134) array_distinct function does not work correctly with columns containing array of array
[ https://issues.apache.org/jira/browse/SPARK-27134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27134: Assignee: (was: Apache Spark) > array_distinct function does not work correctly with columns containing array > of array > -- > > Key: SPARK-27134 > URL: https://issues.apache.org/jira/browse/SPARK-27134 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4, scala 2.11.11 >Reporter: Mike Trenaman >Priority: Major > > The array_distinct function introduced in spark 2.4 is producing strange > results when used on an array column which contains a nested array. The > resulting output can still contain duplicate values, and furthermore, > previously distinct values may be removed. > This is easily repeatable, e.g. with this code: > val df = Seq( > Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5)) > ).toDF("Number_Combinations") > val dfWithDistinct = df.withColumn("distinct_combinations", > array_distinct(col("Number_Combinations"))) > > The initial 'df' DataFrame contains one row, where column > 'Number_Combinations' contains the following values: > [[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]] > > The array_distinct function run on this column produces a new column > containing the following values: > [[1, 2], [1, 2], [1, 2]] > > As you can see, this contains three occurrences of the same value (1, 2), and > furthermore, the distinct values (3, 4), (4, 5) have been removed. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
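For reference, the distinct semantics the reporter expects can be sketched in plain Python (a hypothetical helper, not Spark code): deduplicating an array of arrays requires comparing elements by value, e.g. by keying each inner list on its tuple form.

```python
def array_distinct_nested(arrays):
    """Return the input list of lists with duplicates removed, preserving order.

    Lists are unhashable, so each element is keyed by its tuple form.
    """
    seen = set()
    out = []
    for arr in arrays:
        key = tuple(arr)
        if key not in seen:
            seen.add(key)
            out.append(arr)
    return out

# The column value from the report:
combos = [[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]
print(array_distinct_nested(combos))  # → [[1, 2], [3, 4], [4, 5]]
```

This is the output one would expect from `array_distinct` on the reported column, rather than the `[[1, 2], [1, 2], [1, 2]]` it actually produced.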
[jira] [Resolved] (SPARK-27139) NettyBlockTransferService does not abide by spark.blockManager.port config option
[ https://issues.apache.org/jira/browse/SPARK-27139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bolke de Bruin resolved SPARK-27139. Resolution: Invalid And proper casing does the trick > NettyBlockTransferService does not abide by spark.blockManager.port config > option > - > > Key: SPARK-27139 > URL: https://issues.apache.org/jira/browse/SPARK-27139 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Bolke de Bruin >Priority: Blocker > > This is a regression from a fix in SPARK-4837 > The NettyBlockTransferService always binds to a random port, and does not use > the spark.blockManager.port config as specified. > this is a blocker for tightly controlled environments where random ports are > not allowed to pass firewalls. > neither `spark.driver.blockmanager.port` nor `spark.blockmanager.port` works > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27139) NettyBlockTransferService does not abide by spark.blockManager.port config option
[ https://issues.apache.org/jira/browse/SPARK-27139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bolke de Bruin updated SPARK-27139: --- Description: This is a regression from a fix in SPARK-4837 The NettyBlockTransferService always binds to a random port, and does not use the spark.blockManager.port config as specified. this is a blocker for tightly controlled environments where random ports are not allowed to pass firewalls. neither `spark.driver.blockmanager.port` nor `spark.blockmanager.port` works was: This is a regression from a fix in SPARK-4837 The NettyBlockTransferService always binds to a random port, and does not use the spark.blockManager.port config as specified. this is a blocker for tightly controlled environments where random ports are not allowed to pass firewalls. > NettyBlockTransferService does not abide by spark.blockManager.port config > option > - > > Key: SPARK-27139 > URL: https://issues.apache.org/jira/browse/SPARK-27139 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Bolke de Bruin >Priority: Blocker > > This is a regression from a fix in SPARK-4837 > The NettyBlockTransferService always binds to a random port, and does not use > the spark.blockManager.port config as specified. > this is a blocker for tightly controlled environments where random ports are > not allowed to pass firewalls. > neither `spark.driver.blockmanager.port` nor `spark.blockmanager.port` works > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
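For anyone hitting the same symptom: the issue was closed as invalid because Spark config keys are case-sensitive. A minimal spark-defaults.conf sketch with the documented camel-case keys (the port numbers here are purely illustrative):

```
spark.blockManager.port         10025
spark.driver.blockManager.port  10026
```

The all-lowercase variants quoted in the report (`spark.blockmanager.port`, `spark.driver.blockmanager.port`) are simply not recognized, so the block manager falls back to a random port.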
[jira] [Assigned] (SPARK-26927) Race condition may cause dynamic allocation not working
[ https://issues.apache.org/jira/browse/SPARK-26927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26927: -- Assignee: liupengcheng > Race condition may cause dynamic allocation not working > --- > > Key: SPARK-26927 > URL: https://issues.apache.org/jira/browse/SPARK-26927 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.4.0 >Reporter: liupengcheng >Assignee: liupengcheng >Priority: Major > Attachments: Selection_042.jpg, Selection_043.jpg, Selection_044.jpg, > Selection_045.jpg, Selection_046.jpg > > > Recently, we caught a bug that caused our production Spark Thriftserver to hang: > There is a race condition in the ExecutorAllocationManager: the > `SparkListenerExecutorRemoved` event is posted before the > `SparkListenerTaskStart` event, which corrupts the `executorIds` set. Then, when some executor idles, real executors are removed even when the executor count equals `minNumExecutors`, because of the incorrect computation of > `newExecutorTotal` (which may be greater than `minNumExecutors`), finally leaving zero available executors while a > wrong set of executorIds is kept in memory. > What's more, even the `SparkListenerTaskEnd` event cannot release the fake > `executorIds`, because later idle events for the fake executors cannot trigger > a real removal: those executors have already been removed and no longer exist in the `executorDataMap` of > `CoarseGrainedSchedulerBackend`. > Logs: > !Selection_042.jpg! > !Selection_043.jpg! > !Selection_044.jpg! > !Selection_045.jpg! > !Selection_046.jpg! 
> EventLogs(DisOrder of events): > {code:java} > {"Event":"SparkListenerExecutorRemoved","Timestamp":1549936077543,"Executor > ID":"131","Removed Reason":"Container > container_e28_1547530852233_236191_02_000180 exited from explicit termination > request."} > {"Event":"SparkListenerTaskStart","Stage ID":136689,"Stage Attempt > ID":0,"Task Info":{"Task ID":448048,"Index":2,"Attempt":0,"Launch > Time":1549936032872,"Executor > ID":"131","Host":"mb2-hadoop-prc-st474.awsind","Locality":"RACK_LOCAL", > "Speculative":false,"Getting Result Time":0,"Finish > Time":1549936032906,"Failed":false,"Killed":false,"Accumulables":[{"ID":12923945,"Name":"internal.metrics.executorDeserializeTime","Update":10,"Value":13,"Internal":true,"Count > Faile d > Values":true},{"ID":12923946,"Name":"internal.metrics.executorDeserializeCpuTime","Update":2244016,"Value":4286494,"Internal":true,"Count > Failed > Values":true},{"ID":12923947,"Name":"internal.metrics.executorRunTime","Update":20,"Val > ue":39,"Internal":true,"Count Failed > Values":true},{"ID":12923948,"Name":"internal.metrics.executorCpuTime","Update":13412614,"Value":26759061,"Internal":true,"Count > Failed Values":true},{"ID":12923949,"Name":"internal.metrics.resultS > ize","Update":3578,"Value":7156,"Internal":true,"Count Failed > Values":true},{"ID":12923954,"Name":"internal.metrics.peakExecutionMemory","Update":33816576,"Value":67633152,"Internal":true,"Count > Failed Values":true},{"ID":12923962,"Na > me":"internal.metrics.shuffle.write.bytesWritten","Update":1367,"Value":2774,"Internal":true,"Count > Failed > Values":true},{"ID":12923963,"Name":"internal.metrics.shuffle.write.recordsWritten","Update":23,"Value":45,"Internal":true,"Cou > nt Failed > Values":true},{"ID":12923964,"Name":"internal.metrics.shuffle.write.writeTime","Update":3259051,"Value":6858121,"Internal":true,"Count > Failed Values":true},{"ID":12921550,"Name":"number of output > rows","Update":"158","Value" :"289","Internal":true,"Count Failed > 
Values":true,"Metadata":"sql"},{"ID":12921546,"Name":"number of output > rows","Update":"23","Value":"45","Internal":true,"Count Failed > Values":true,"Metadata":"sql"},{"ID":12921547,"Name":"peak memo ry total > (min, med, > max)","Update":"33816575","Value":"67633149","Internal":true,"Count Failed > Values":true,"Metadata":"sql"},{"ID":12921541,"Name":"data size total (min, > med, max)","Update":"551","Value":"1077","Internal":true,"Count Failed > Values":true,"Metadata":"sql"}]}} > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
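The bookkeeping problem in the event log above can be illustrated with a small stand-in for the listener (a toy sketch, not the actual ExecutorAllocationManager code): when the removed event is processed before the late task-start event, the task-start re-registers the executor id, leaving a ghost entry that no later removal will clean up.

```python
class TrackerSketch:
    """Toy model of executor-id bookkeeping in an allocation manager."""

    def __init__(self):
        self.executor_ids = set()

    def on_task_start(self, executor_id):
        # A task start implies the executor exists, so its id is registered.
        self.executor_ids.add(executor_id)

    def on_executor_removed(self, executor_id):
        self.executor_ids.discard(executor_id)


tracker = TrackerSketch()

# Expected order: task start on executor "131", then its removal.
tracker.on_task_start("131")
tracker.on_executor_removed("131")
print(tracker.executor_ids)  # → set()

# Out-of-order delivery, as in the event log: the removal is processed first,
# then the late task-start resurrects a ghost executor id that is never cleaned
# up, because the real executor is already gone.
tracker.on_executor_removed("131")
tracker.on_task_start("131")
print(tracker.executor_ids)  # → {'131'}
```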
[jira] [Resolved] (SPARK-26927) Race condition may cause dynamic allocation not working
[ https://issues.apache.org/jira/browse/SPARK-26927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26927. Resolution: Fixed Fix Version/s: 2.3.4 2.4.2 3.0.0 Issue resolved by pull request 23842 [https://github.com/apache/spark/pull/23842] > Race condition may cause dynamic allocation not working > --- > > Key: SPARK-26927 > URL: https://issues.apache.org/jira/browse/SPARK-26927 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.4.0 >Reporter: liupengcheng >Assignee: liupengcheng >Priority: Major > Fix For: 3.0.0, 2.4.2, 2.3.4 > > Attachments: Selection_042.jpg, Selection_043.jpg, Selection_044.jpg, > Selection_045.jpg, Selection_046.jpg > > > Recently, we caught a bug that caused our production Spark Thriftserver to hang: > There is a race condition in the ExecutorAllocationManager: the > `SparkListenerExecutorRemoved` event is posted before the > `SparkListenerTaskStart` event, which corrupts the `executorIds` set. Then, when some executor idles, real executors are removed even when the executor count equals `minNumExecutors`, because of the incorrect computation of > `newExecutorTotal` (which may be greater than `minNumExecutors`), finally leaving zero available executors while a > wrong set of executorIds is kept in memory. > What's more, even the `SparkListenerTaskEnd` event cannot release the fake > `executorIds`, because later idle events for the fake executors cannot trigger > a real removal: those executors have already been removed and no longer exist in the `executorDataMap` of > `CoarseGrainedSchedulerBackend`. > Logs: > !Selection_042.jpg! > !Selection_043.jpg! > !Selection_044.jpg! > !Selection_045.jpg! > !Selection_046.jpg! 
> EventLogs(DisOrder of events): > {code:java} > {"Event":"SparkListenerExecutorRemoved","Timestamp":1549936077543,"Executor > ID":"131","Removed Reason":"Container > container_e28_1547530852233_236191_02_000180 exited from explicit termination > request."} > {"Event":"SparkListenerTaskStart","Stage ID":136689,"Stage Attempt > ID":0,"Task Info":{"Task ID":448048,"Index":2,"Attempt":0,"Launch > Time":1549936032872,"Executor > ID":"131","Host":"mb2-hadoop-prc-st474.awsind","Locality":"RACK_LOCAL", > "Speculative":false,"Getting Result Time":0,"Finish > Time":1549936032906,"Failed":false,"Killed":false,"Accumulables":[{"ID":12923945,"Name":"internal.metrics.executorDeserializeTime","Update":10,"Value":13,"Internal":true,"Count > Faile d > Values":true},{"ID":12923946,"Name":"internal.metrics.executorDeserializeCpuTime","Update":2244016,"Value":4286494,"Internal":true,"Count > Failed > Values":true},{"ID":12923947,"Name":"internal.metrics.executorRunTime","Update":20,"Val > ue":39,"Internal":true,"Count Failed > Values":true},{"ID":12923948,"Name":"internal.metrics.executorCpuTime","Update":13412614,"Value":26759061,"Internal":true,"Count > Failed Values":true},{"ID":12923949,"Name":"internal.metrics.resultS > ize","Update":3578,"Value":7156,"Internal":true,"Count Failed > Values":true},{"ID":12923954,"Name":"internal.metrics.peakExecutionMemory","Update":33816576,"Value":67633152,"Internal":true,"Count > Failed Values":true},{"ID":12923962,"Na > me":"internal.metrics.shuffle.write.bytesWritten","Update":1367,"Value":2774,"Internal":true,"Count > Failed > Values":true},{"ID":12923963,"Name":"internal.metrics.shuffle.write.recordsWritten","Update":23,"Value":45,"Internal":true,"Cou > nt Failed > Values":true},{"ID":12923964,"Name":"internal.metrics.shuffle.write.writeTime","Update":3259051,"Value":6858121,"Internal":true,"Count > Failed Values":true},{"ID":12921550,"Name":"number of output > rows","Update":"158","Value" :"289","Internal":true,"Count Failed > 
Values":true,"Metadata":"sql"},{"ID":12921546,"Name":"number of output > rows","Update":"23","Value":"45","Internal":true,"Count Failed > Values":true,"Metadata":"sql"},{"ID":12921547,"Name":"peak memo ry total > (min, med, > max)","Update":"33816575","Value":"67633149","Internal":true,"Count Failed > Values":true,"Metadata":"sql"},{"ID":12921541,"Name":"data size total (min, > med, max)","Update":"551","Value":"1077","Internal":true,"Count Failed > Values":true,"Metadata":"sql"}]}} > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27139) NettyBlockTransferService does not abide by spark.blockManager.port config option
Bolke de Bruin created SPARK-27139: -- Summary: NettyBlockTransferService does not abide by spark.blockManager.port config option Key: SPARK-27139 URL: https://issues.apache.org/jira/browse/SPARK-27139 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Bolke de Bruin This is a regression from a fix in SPARK-4837 The NettyBlockTransferService always binds to a random port, and does not use the spark.blockManager.port config as specified. this is a blocker for tightly controlled environments where random ports are not allowed to pass firewalls. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC
[ https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790911#comment-16790911 ] Dhruve Ashar commented on SPARK-27107: -- I verified the changes and we are no longer seeing the issue. Thanks for testing+voting the ORC RC. I think I am not on the ORC mailing list, so I might have missed the voting. > Spark SQL Job failing because of Kryo buffer overflow with ORC > -- > > Key: SPARK-27107 > URL: https://issues.apache.org/jira/browse/SPARK-27107 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Dhruve Ashar >Priority: Major > > The issue occurs while trying to read ORC data and setting the SearchArgument. > {code:java} > Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. > Available: 0, required: 9 > Serialization trace: > literalList > (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl) > leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl) > at com.esotericsoftware.kryo.io.Output.require(Output.java:163) > at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614) > at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at 
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) > at > org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96) > at > org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315) > at > org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121) > at > org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.S
[jira] [Commented] (SPARK-27087) Inability to access to column alias in pyspark
[ https://issues.apache.org/jira/browse/SPARK-27087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790929#comment-16790929 ] Thincrs commented on SPARK-27087: - A user of thincrs has selected this issue. Deadline: Tue, Mar 19, 2019 7:41 PM > Inability to access to column alias in pyspark > -- > > Key: SPARK-27087 > URL: https://issues.apache.org/jira/browse/SPARK-27087 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Vincent >Priority: Minor > > In pyspark I have the following: > {code:java} > import pyspark.sql.functions as F > cc = F.lit(1).alias("A") > print(cc) > print(cc._jc.toString()) > {code} > I get: > {noformat} > Column > 1 AS `A` > {noformat} > Is there any way for me to just print "A" from cc? It seems I'm unable to > extract the alias programmatically from the column object. > Also, I think that in Spark SQL in Scala, printing "cc" would just print > "A" instead, so this seems like a bug or a missing feature to me -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
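Until such an accessor exists, one fragile workaround is to parse the alias out of the column's SQL string form, as returned by `cc._jc.toString()` above. A minimal sketch that only assumes the `1 AS `A`` representation shown in the report:

```python
import re


def alias_of(col_sql):
    """Extract the alias from an "<expr> AS `name`" string, or None if absent."""
    m = re.search(r"AS `([^`]+)`\s*$", col_sql)
    return m.group(1) if m else None


print(alias_of("1 AS `A`"))  # → A
print(alias_of("1"))         # → None
```

This relies on the string rendering (and on the `_jc` internal), not a public API, so it can break across Spark versions.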
[jira] [Updated] (SPARK-27138) Remove AdminUtils calls
[ https://issues.apache.org/jira/browse/SPARK-27138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dylan Guedes updated SPARK-27138: - Description: KafkaTestUtils (from kafka010) currently uses AdminUtils to create and delete topics for test suites (which is now deprecated). Since it will stop working at some point, I think that this is a good opportunity to change the API calls. (was: KafkaTestUtils (from kafka010) currently uses AdminUtils to create and delete topics for test suites (what is currently deprecated). Since it will stop to work at some point, I think that it is a good opportunity.) > Remove AdminUtils calls > --- > > Key: SPARK-27138 > URL: https://issues.apache.org/jira/browse/SPARK-27138 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 2.4.0 >Reporter: Dylan Guedes >Priority: Minor > > KafkaTestUtils (from kafka010) currently uses AdminUtils to create and delete > topics for test suites (which is now deprecated). Since it will stop working at > some point, I think that this is a good opportunity to change the API > calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26251) isnan function not picking non-numeric values
[ https://issues.apache.org/jira/browse/SPARK-26251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790914#comment-16790914 ] Thincrs commented on SPARK-26251: - A user of thincrs has selected this issue. Deadline: Tue, Mar 19, 2019 7:38 PM > isnan function not picking non-numeric values > - > > Key: SPARK-26251 > URL: https://issues.apache.org/jira/browse/SPARK-26251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kunal Rao >Priority: Minor > > import org.apache.spark.sql.functions._ > List("po box 7896", "8907", > "435435").toDF("rgid").filter(isnan(col("rgid"))).show > > should pick "po box 7896" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
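This is arguably expected behavior: `isnan` matches only the floating-point NaN value, not strings that fail to parse as numbers. The distinction can be sketched in plain Python (the helper names are illustrative, not Spark API):

```python
import math


def is_float_nan(value):
    """True only for an actual floating-point NaN value (what isnan checks)."""
    return isinstance(value, float) and math.isnan(value)


def is_non_numeric(text):
    """True when the string cannot be parsed as a number at all
    (what the reporter actually wants: a cast-and-check)."""
    try:
        float(text)
        return False
    except ValueError:
        return True


print(is_float_nan(float("nan")))     # → True
print(is_float_nan("po box 7896"))    # → False: isnan-style checks miss strings
print(is_non_numeric("po box 7896"))  # → True: a cast-and-check catches it
print(is_non_numeric("8907"))         # → False
```

In Spark terms, the equivalent of the cast-and-check is filtering on the null result of casting the column to a numeric type, rather than on `isnan`.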
[jira] [Assigned] (SPARK-26089) Handle large corrupt shuffle blocks
[ https://issues.apache.org/jira/browse/SPARK-26089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-26089: Assignee: Ankur Gupta > Handle large corrupt shuffle blocks > --- > > Key: SPARK-26089 > URL: https://issues.apache.org/jira/browse/SPARK-26089 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Ankur Gupta >Priority: Major > Fix For: 3.0.0 > > > We've seen a bad disk lead to corruption in a shuffle block, which led to > tasks repeatedly failing after fetching the data with an IOException. The > tasks get retried, but the same corrupt data gets fetched again, and the > tasks keep failing. As there isn't a fetch failure, the jobs eventually > fail, and Spark never tries to regenerate the shuffle data. > This is the same as SPARK-4105, but that fix only covered small blocks. > There was some discussion during that change about this limitation > (https://github.com/apache/spark/pull/15923#discussion_r88756017) and > followups to cover larger blocks (which would involve spilling to disk to > avoid OOM), but it looks like that never happened. > I can think of a few approaches to this: > 1) Wrap the shuffle block input stream with another input stream that > converts all exceptions into FetchFailures. This is similar to the fix for > SPARK-4105, but that reads the entire input stream up-front; instead, I'm > proposing to do it within the InputStream itself so it's streaming and does > not have a large memory overhead. > 2) Add checksums to shuffle blocks. This was proposed > [here|https://github.com/apache/spark/pull/15894] and abandoned as being too > complex. > 3) Try to tackle this with blacklisting instead: when there is any failure in > a task that is reading shuffle data, assign some "blame" to the source of the > shuffle data, and eventually blacklist the source. 
It seems really tricky to > get sensible heuristics for this, though. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
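Approach 1 above, wrapping the block stream so corruption surfaces as a fetch failure rather than a plain IOException, can be sketched language-agnostically. A toy Python version of the idea (the class and exception names are illustrative, not Spark's):

```python
class FetchFailedError(Exception):
    """Stand-in for a fetch-failure signal, which would trigger shuffle
    regeneration instead of a plain task retry."""


class FetchFailureWrappingStream:
    """Wraps a block stream; read-time I/O errors become fetch failures.

    The translation happens per read() call, so the block is never buffered
    up-front and there is no large memory overhead.
    """

    def __init__(self, inner, block_id):
        self.inner = inner
        self.block_id = block_id

    def read(self, size=-1):
        try:
            return self.inner.read(size)
        except OSError as e:
            raise FetchFailedError(f"corrupt block {self.block_id}") from e


class CorruptStream:
    """Simulates a stream backed by corrupt on-disk data."""

    def read(self, size=-1):
        raise OSError("checksum mismatch")


wrapped = FetchFailureWrappingStream(CorruptStream(), "shuffle_0_3_1")
try:
    wrapped.read()
except FetchFailedError as e:
    print(e)  # → corrupt block shuffle_0_3_1
```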
[jira] [Resolved] (SPARK-26089) Handle large corrupt shuffle blocks
[ https://issues.apache.org/jira/browse/SPARK-26089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-26089. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23453 [https://github.com/apache/spark/pull/23453] > Handle large corrupt shuffle blocks > --- > > Key: SPARK-26089 > URL: https://issues.apache.org/jira/browse/SPARK-26089 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Shuffle, Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Fix For: 3.0.0 > > > We've seen a bad disk lead to corruption in a shuffle block, which led to > tasks repeatedly failing after fetching the data with an IOException. The > tasks get retried, but the same corrupt data gets fetched again, and the > tasks keep failing. As there isn't a fetch failure, the jobs eventually > fail, and Spark never tries to regenerate the shuffle data. > This is the same as SPARK-4105, but that fix only covered small blocks. > There was some discussion during that change about this limitation > (https://github.com/apache/spark/pull/15923#discussion_r88756017) and > followups to cover larger blocks (which would involve spilling to disk to > avoid OOM), but it looks like that never happened. > I can think of a few approaches to this: > 1) Wrap the shuffle block input stream with another input stream that > converts all exceptions into FetchFailures. This is similar to the fix for > SPARK-4105, but that reads the entire input stream up-front; instead, I'm > proposing to do it within the InputStream itself so it's streaming and does > not have a large memory overhead. > 2) Add checksums to shuffle blocks. This was proposed > [here|https://github.com/apache/spark/pull/15894] and abandoned as being too > complex. 
> 3) Try to tackle this with blacklisting instead: when there is any failure in > a task that is reading shuffle data, assign some "blame" to the source of the > shuffle data, and eventually blacklist the source. It seems really tricky to > get sensible heuristics for this, though. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27138) Remove AdminUtils calls
Dylan Guedes created SPARK-27138: Summary: Remove AdminUtils calls Key: SPARK-27138 URL: https://issues.apache.org/jira/browse/SPARK-27138 Project: Spark Issue Type: Task Components: Tests Affects Versions: 2.4.0 Reporter: Dylan Guedes KafkaTestUtils (from kafka010) currently uses AdminUtils to create and delete topics for test suites (which is now deprecated). Since it will stop working at some point, I think that this is a good opportunity. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27138) Remove AdminUtils calls
[ https://issues.apache.org/jira/browse/SPARK-27138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27138: Assignee: (was: Apache Spark) > Remove AdminUtils calls > --- > > Key: SPARK-27138 > URL: https://issues.apache.org/jira/browse/SPARK-27138 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 2.4.0 >Reporter: Dylan Guedes >Priority: Minor > > KafkaTestUtils (from kafka010) currently uses AdminUtils to create and delete > topics for test suites (which is now deprecated). Since it will stop working at > some point, I think that this is a good opportunity. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27138) Remove AdminUtils calls
[ https://issues.apache.org/jira/browse/SPARK-27138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27138: Assignee: Apache Spark > Remove AdminUtils calls > --- > > Key: SPARK-27138 > URL: https://issues.apache.org/jira/browse/SPARK-27138 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 2.4.0 >Reporter: Dylan Guedes >Assignee: Apache Spark >Priority: Minor > > KafkaTestUtils (from kafka010) currently uses AdminUtils to create and delete > topics for test suites (which is now deprecated). Since it will stop working at > some point, I think that this is a good opportunity. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27010) find out the actual port number when hive.server2.thrift.port=0
[ https://issues.apache.org/jira/browse/SPARK-27010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27010: - Assignee: zuotingbing > find out the actual port number when hive.server2.thrift.port=0 > --- > > Key: SPARK-27010 > URL: https://issues.apache.org/jira/browse/SPARK-27010 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: zuotingbing >Assignee: zuotingbing >Priority: Minor > Attachments: 2019-02-28_170844.png, 2019-02-28_170904.png, > 2019-02-28_170942.png, 2019-03-01_092511.png > > > Currently, if we set *hive.server2.thrift.port=0*, it is hard to find out the > actual port number to use when connecting with beeline. > before: > !2019-02-28_170942.png! > after: > !2019-02-28_170904.png! > beeline connects successfully: > !2019-02-28_170844.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27010) find out the actual port number when hive.server2.thrift.port=0
[ https://issues.apache.org/jira/browse/SPARK-27010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27010. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23917 [https://github.com/apache/spark/pull/23917] > find out the actual port number when hive.server2.thrift.port=0 > --- > > Key: SPARK-27010 > URL: https://issues.apache.org/jira/browse/SPARK-27010 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: zuotingbing >Assignee: zuotingbing >Priority: Minor > Fix For: 3.0.0 > > Attachments: 2019-02-28_170844.png, 2019-02-28_170904.png, > 2019-02-28_170942.png, 2019-03-01_092511.png > > > Currently, if we set *hive.server2.thrift.port=0*, it is hard to find out the > actual port number to use when connecting with beeline. > before: > !2019-02-28_170942.png! > after: > !2019-02-28_170904.png! > beeline connects successfully: > !2019-02-28_170844.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
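Background for the fix: binding to port 0 asks the OS for an ephemeral port, so the actual port is only known after the bind and has to be surfaced explicitly (e.g. in the thrift server's log, as the patch does). A minimal Python sketch of that pattern:

```python
import socket

# Bind to port 0: the OS picks a free ephemeral port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))

# getsockname() reveals the port actually assigned; this is the number
# that must be logged so clients such as beeline know where to connect.
actual_port = server.getsockname()[1]
print(actual_port > 0)  # → True

server.close()
```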
[jira] [Resolved] (SPARK-27090) Removing old LEGACY_DRIVER_IDENTIFIER ("")
[ https://issues.apache.org/jira/browse/SPARK-27090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27090. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24026 [https://github.com/apache/spark/pull/24026] > Removing old LEGACY_DRIVER_IDENTIFIER ("") > -- > > Key: SPARK-27090 > URL: https://issues.apache.org/jira/browse/SPARK-27090 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Shivu Sondur >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > For legacy reasons, LEGACY_DRIVER_IDENTIFIER was checked in a few places > along with the new DRIVER_IDENTIFIER ("driver") to decide whether the driver > or an executor is running. > The new DRIVER_IDENTIFIER ("driver") was introduced in Spark version 1.4, so > I think we have a chance to get rid of LEGACY_DRIVER_IDENTIFIER.
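The shape of the check being removed can be sketched as below. This is a Python illustration, not the Scala source; the legacy spelling "&lt;driver&gt;" is an assumption from memory of the Spark codebase (the angle brackets were likely stripped from the issue title), and the helper name is hypothetical:

```python
DRIVER_IDENTIFIER = "driver"           # introduced in Spark 1.4
LEGACY_DRIVER_IDENTIFIER = "<driver>"  # assumed pre-1.4 form, removed by this task

def is_driver(executor_id: str) -> bool:
    # Before SPARK-27090 the check had to accept both spellings:
    #   executor_id in (DRIVER_IDENTIFIER, LEGACY_DRIVER_IDENTIFIER)
    # After the removal, only the modern identifier remains.
    return executor_id == DRIVER_IDENTIFIER

print(is_driver("driver"))      # True
print(is_driver("executor-1"))  # False
```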
[jira] [Assigned] (SPARK-27090) Removing old LEGACY_DRIVER_IDENTIFIER ("")
[ https://issues.apache.org/jira/browse/SPARK-27090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27090: - Assignee: Shivu Sondur > Removing old LEGACY_DRIVER_IDENTIFIER ("") > -- > > Key: SPARK-27090 > URL: https://issues.apache.org/jira/browse/SPARK-27090 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Shivu Sondur >Priority: Minor > Labels: release-notes > > For legacy reasons LEGACY_DRIVER_IDENTIFIER was checked for a few places > along with the new DRIVER_IDENTIFIER ("driver") to decided whether a driver > is running or an executor. > The new DRIVER_IDENTIFIER ("driver") was introduced in spark version 1.4. So > I think we have a chance to get rid of the LEGACY_DRIVER_IDENTIFIER. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23961) pyspark toLocalIterator throws an exception
[ https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23961: Assignee: (was: Apache Spark) > pyspark toLocalIterator throws an exception > --- > > Key: SPARK-23961 > URL: https://issues.apache.org/jira/browse/SPARK-23961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Michel Lemay >Priority: Minor > Labels: DataFrame, pyspark > > Given a DataFrame, call toLocalIterator. If we do not consume all records, > it throws: > {quote}ERROR PythonRDD: Error while sending iterator > java.net.SocketException: Connection reset by peer: socket write error > at java.net.SocketOutputStream.socketWrite0(Native Method) > at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) > at java.net.SocketOutputStream.write(SocketOutputStream.java:155) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122) > at java.io.DataOutputStream.write(DataOutputStream.java:107) > at java.io.FilterOutputStream.write(FilterOutputStream.java:97) > at > org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) > at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) > at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706) > {quote} > > To reproduce, here is a simple pyspark shell script that shows the error: > {quote}import itertools > df = spark.read.parquet("large parquet folder").cache() > print(df.count()) > b = df.toLocalIterator() > print(len(list(itertools.islice(b, 20)))) > b = None # Make the iterator go out of scope. Throws here. > {quote} > > Observations: > * Consuming all records does not throw. Taking only a subset of the > partitions creates the error. > * In another experiment, doing the same on a regular RDD works if we > cache/materialize it. If we do not cache the RDD, it throws similarly. > * It works in the Scala shell >
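The failure mode generalizes beyond Spark: a partially consumed iterator backed by a live resource (here, the socket the JVM side is still writing to) is abandoned, and the producer fails when the consumer goes away. A pure-Python sketch of that shape and the defensive pattern of closing the iterator explicitly instead of relying on garbage collection — `record_stream` is a stand-in, not Spark internals:

```python
import itertools

def record_stream(n=100):
    """Stand-in for toLocalIterator: a producer with cleanup to run."""
    try:
        for i in range(n):
            yield i
    finally:
        # In Spark this is roughly where the serving thread notices the
        # consumer is gone; here it is just an ordinary finally block.
        pass

it = record_stream()
first = list(itertools.islice(it, 20))  # consume only a prefix
it.close()  # run the generator's cleanup deterministically
print(first[:3])  # -> [0, 1, 2]
```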
[jira] [Assigned] (SPARK-23961) pyspark toLocalIterator throws an exception
[ https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23961: Assignee: Apache Spark > pyspark toLocalIterator throws an exception > --- > > Key: SPARK-23961 > URL: https://issues.apache.org/jira/browse/SPARK-23961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Michel Lemay >Assignee: Apache Spark >Priority: Minor > Labels: DataFrame, pyspark > > Given a DataFrame, call toLocalIterator. If we do not consume all records, > it throws: > {quote}ERROR PythonRDD: Error while sending iterator > java.net.SocketException: Connection reset by peer: socket write error > at java.net.SocketOutputStream.socketWrite0(Native Method) > at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) > at java.net.SocketOutputStream.write(SocketOutputStream.java:155) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122) > at java.io.DataOutputStream.write(DataOutputStream.java:107) > at java.io.FilterOutputStream.write(FilterOutputStream.java:97) > at > org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705) > at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) > at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706) > {quote} > > To reproduce, here is a simple pyspark shell script that shows the error: > {quote}import itertools > df = spark.read.parquet("large parquet folder").cache() > print(df.count()) > b = df.toLocalIterator() > print(len(list(itertools.islice(b, 20)))) > b = None # Make the iterator go out of scope. Throws here. > {quote} > > Observations: > * Consuming all records does not throw. Taking only a subset of the > partitions creates the error. > * In another experiment, doing the same on a regular RDD works if we > cache/materialize it. If we do not cache the RDD, it throws similarly. > * It works in the Scala shell >
[jira] [Created] (SPARK-27137) Spark captured variable is null if the code is pasted via :paste
Osira Ben created SPARK-27137: - Summary: Spark captured variable is null if the code is pasted via :paste Key: SPARK-27137 URL: https://issues.apache.org/jira/browse/SPARK-27137 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Osira Ben If I execute this piece of code {code:java} val foo = "foo" def f(arg: Any): Unit = { Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo")) } sc.parallelize(Seq(1, 2), 2).foreach(f) {code} in spark2-shell via :paste, it throws {code:java} scala> :paste // Entering paste mode (ctrl-D to finish) val foo = "foo" def f(arg: Any): Unit = { Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo")) } sc.parallelize(Seq(1, 2), 2).foreach(f) // Exiting paste mode, now interpreting. 19/03/11 15:02:06 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, hadoop.company.com, executor 1): java.lang.NullPointerException: foo at java.util.Objects.requireNonNull(Objects.java:228) {code} However, if I paste the code without :paste or run it via spark2-shell -i, it doesn't.
[jira] [Updated] (SPARK-27137) Spark captured variable is null if the code is pasted via :paste
[ https://issues.apache.org/jira/browse/SPARK-27137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Osira Ben updated SPARK-27137: -- Description: If I execute this piece of code {code:java} val foo = "foo" def f(arg: Any): Unit = { Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo")) } sc.parallelize(Seq(1, 2), 2).foreach(f) {code} {{in spark2-shell via :paste it throws}} {code:java} scala> :paste // Entering paste mode (ctrl-D to finish) val foo = "foo" def f(arg: Any): Unit = { Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo")) } sc.parallelize(Seq(1, 2), 2).foreach(f) // Exiting paste mode, now interpreting. 19/03/11 15:02:06 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, hadoop.company.com, executor 1): java.lang.NullPointerException: foo at java.util.Objects.requireNonNull(Objects.java:228) {code} However if I execute it pasting without :paste or via spark2-shell -i it doesn't. was: If I execute this piece of code {code:java} val foo = "foo" def f(arg: Any): Unit = { Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo")) } sc.parallelize(Seq(1, 2), 2).foreach(f) {code} {{in spark2-shell via :paste it throws}}{{}} {code:java} scala> :paste // Entering paste mode (ctrl-D to finish) val foo = "foo" def f(arg: Any): Unit = { Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo")) } sc.parallelize(Seq(1, 2), 2).foreach(f) // Exiting paste mode, now interpreting. 19/03/11 15:02:06 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, hadoop.company.com, executor 1): java.lang.NullPointerException: foo at java.util.Objects.requireNonNull(Objects.java:228) {code} However if I execute it pasting without :paste or via spark2-shell -i it doesn't. 
> Spark captured variable is null if the code is pasted via :paste > > > Key: SPARK-27137 > URL: https://issues.apache.org/jira/browse/SPARK-27137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Osira Ben >Priority: Major > > If I execute this piece of code > {code:java} > val foo = "foo" > def f(arg: Any): Unit = { > Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo")) > } > sc.parallelize(Seq(1, 2), 2).foreach(f) > {code} > in spark2-shell via :paste, it throws > {code:java} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val foo = "foo" > def f(arg: Any): Unit = { > Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo")) > } > sc.parallelize(Seq(1, 2), 2).foreach(f) > // Exiting paste mode, now interpreting. > 19/03/11 15:02:06 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 > (TID 2, hadoop.company.com, executor 1): java.lang.NullPointerException: foo > at java.util.Objects.requireNonNull(Objects.java:228) > {code} > However, if I paste the code without :paste or run it via spark2-shell -i, it > doesn't.
[jira] [Updated] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting
[ https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Parth Gandhi updated SPARK-27112: - Description: Recently, a few spark users in the organization have reported that their jobs were getting stuck. On further analysis, it was found out that there exist two independent deadlocks and either of them occur under different circumstances. The screenshots for these two deadlocks are attached here. We were able to reproduce the deadlocks with the following piece of code: {code:java} import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FileSystem, Path} import org.apache.spark._ import org.apache.spark.TaskContext // Simple example of Word Count in Scala object ScalaWordCount { def main(args: Array[String]) { if (args.length < 2) { System.err.println("Usage: ScalaWordCount ") System.exit(1) } val conf = new SparkConf().setAppName("Scala Word Count") val sc = new SparkContext(conf) // get the input file uri val inputFilesUri = args(0) // get the output file uri val outputFilesUri = args(1) while (true) { val textFile = sc.textFile(inputFilesUri) val counts = textFile.flatMap(line => line.split(" ")) .map(word => {if (TaskContext.get.partitionId == 5 && TaskContext.get.attemptNumber == 0) throw new Exception("Fail for blacklisting") else (word, 1)}) .reduceByKey(_ + _) counts.saveAsTextFile(outputFilesUri) val conf: Configuration = new Configuration() val path: Path = new Path(outputFilesUri) val hdfs: FileSystem = FileSystem.get(conf) hdfs.delete(path, true) } sc.stop() } } {code} Additionally, to ensure that the deadlock surfaces up soon enough, I also added a small delay in the Spark code here: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256] {code:java} executorIdToFailureList.remove(exec) updateNextExpiryTime() Thread.sleep(2000) killBlacklistedExecutor(exec) {code} Also make sure that the following configs are 
set when launching the above spark job: *spark.blacklist.enabled=true* *spark.blacklist.killBlacklistedExecutors=true* *spark.blacklist.application.maxFailedTasksPerExecutor=1* was: Recently, a few spark users in the organization have reported that their jobs were getting stuck. On further analysis, it was found out that there exist two independent deadlocks and either of them occur under different circumstances. The screenshots for these two deadlocks are attached here. We were able to reproduce the deadlocks with the following piece of code: {code:java} import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FileSystem, Path} import org.apache.spark._ import org.apache.spark.TaskContext // Simple example of Word Count in Scala object ScalaWordCount { def main(args: Array[String]) { if (args.length < 2) { System.err.println("Usage: ScalaWordCount ") System.exit(1) } val conf = new SparkConf().setAppName("Scala Word Count") val sc = new SparkContext(conf) // get the input file uri val inputFilesUri = args(0) // get the output file uri val outputFilesUri = args(1) while (true) { val textFile = sc.textFile(inputFilesUri) val counts = textFile.flatMap(line => line.split(" ")) .map(word => {if (TaskContext.get.partitionId == 5 && TaskContext.get.attemptNumber == 0) throw new Exception("Fail for blacklisting") else (word, 1)}) .reduceByKey(_ + _) counts.saveAsTextFile(outputFilesUri) val conf: Configuration = new Configuration() val path: Path = new Path(outputFilesUri) val hdfs: FileSystem = FileSystem.get(conf) hdfs.delete(path, true) } sc.stop() } } {code} Additionally, to ensure that the deadlock surfaces up soon enough, I also added a small delay in the Spark code here: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256] {code:java} executorIdToFailureList.remove(exec) updateNextExpiryTime() Thread.sleep(2000) killBlacklistedExecutor(exec) {code} > Spark Scheduler encounters two 
independent Deadlocks when trying to kill > executors either due to dynamic allocation or blacklisting > > > Key: SPARK-27112 > URL: https://issues.apache.org/jira/browse/SPARK-27112 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Parth Gandhi >Priority: Major > Attachments: Screen Shot 2019-02-26 at 4.10.26 PM.png, Screen Shot > 2019-02-26 at 4.10.48 PM.png, Screen Shot 2019-02-26 at 4.11.11 PM.png, > Screen Shot 2019-02-26 at 4.11.26 PM.png > > > Recently, a few spark users in the organization have reported that their jobs > were getting stuck. On further analysis, it was found out
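Assuming the ScalaWordCount repro above is packaged into a jar (the jar name and HDFS paths below are placeholders), the three blacklisting configs can be supplied at submit time, e.g.:

```shell
spark-submit \
  --class ScalaWordCount \
  --conf spark.blacklist.enabled=true \
  --conf spark.blacklist.killBlacklistedExecutors=true \
  --conf spark.blacklist.application.maxFailedTasksPerExecutor=1 \
  wordcount.jar hdfs:///tmp/input hdfs:///tmp/output
```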
[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC
[ https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790739#comment-16790739 ] Dongjoon Hyun commented on SPARK-27107: --- [~Dhruve Ashar]. Please vote on ORC RC1 after testing your environment. :) > Spark SQL Job failing because of Kryo buffer overflow with ORC > -- > > Key: SPARK-27107 > URL: https://issues.apache.org/jira/browse/SPARK-27107 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Dhruve Ashar >Priority: Major > > The issue occurs while trying to read ORC data and setting the SearchArgument. > {code:java} > Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. > Available: 0, required: 9 > Serialization trace: > literalList > (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl) > leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl) > at com.esotericsoftware.kryo.io.Output.require(Output.java:163) > at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614) > at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) > at > org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96) > at > org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315) > at > org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121) > at > org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.Spa
[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC
[ https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790672#comment-16790672 ] Dongjoon Hyun commented on SPARK-27107: --- Yep. [~Dhruve Ashar]. I already tested and voted that. I'll do if the vote passes and the artifacts are uploaded. And, we need a test case in Spark side. Without a test case, it's just a dependency upgrade. Also, users can simply replace their ORC jar files. Apache Spark is not a fat assembly jar file for this reason. > Spark SQL Job failing because of Kryo buffer overflow with ORC > -- > > Key: SPARK-27107 > URL: https://issues.apache.org/jira/browse/SPARK-27107 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Dhruve Ashar >Priority: Major > > The issue occurs while trying to read ORC data and setting the SearchArgument. > {code:java} > Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. > Available: 0, required: 9 > Serialization trace: > literalList > (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl) > leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl) > at com.esotericsoftware.kryo.io.Output.require(Output.java:163) > at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614) > at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) > at > org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96) > at > org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315) > at > org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121) > at > org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89) >
[jira] [Comment Edited] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC
[ https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790672#comment-16790672 ] Dongjoon Hyun edited comment on SPARK-27107 at 3/12/19 3:47 PM: Yep. [~Dhruve Ashar]. I already tested and voted for that ORC RC vote yesterday. I'll do if the vote passes and the artifacts are uploaded. And, we need a test case in Spark side. Without a test case, it's just a dependency upgrade. Also, users can simply replace their ORC jar files without waiting Spark releases. Apache Spark is not a fat assembly jar file for this reason. was (Author: dongjoon): Yep. [~Dhruve Ashar]. I already tested and voted for that ORC RC vote. I'll do if the vote passes and the artifacts are uploaded. And, we need a test case in Spark side. Without a test case, it's just a dependency upgrade. Also, users can simply replace their ORC jar files. Apache Spark is not a fat assembly jar file for this reason. > Spark SQL Job failing because of Kryo buffer overflow with ORC > -- > > Key: SPARK-27107 > URL: https://issues.apache.org/jira/browse/SPARK-27107 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Dhruve Ashar >Priority: Major > > The issue occurs while trying to read ORC data and setting the SearchArgument. > {code:java} > Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9 > Serialization trace: > literalList > (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl) > leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl) > at com.esotericsoftware.kryo.io.Output.require(Output.java:163) > at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614) > at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) > at > org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96) > at > org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57) > at > 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315) > at > org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121) > at > org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
[jira] [Comment Edited] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC
[ https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790672#comment-16790672 ] Dongjoon Hyun edited comment on SPARK-27107 at 3/12/19 3:46 PM: Yep. [~Dhruve Ashar]. I already tested and voted for that ORC RC vote. I'll do if the vote passes and the artifacts are uploaded. And, we need a test case in Spark side. Without a test case, it's just a dependency upgrade. Also, users can simply replace their ORC jar files. Apache Spark is not a fat assembly jar file for this reason. was (Author: dongjoon): Yep. [~Dhruve Ashar]. I already tested and voted that. I'll do if the vote passes and the artifacts are uploaded. And, we need a test case in Spark side. Without a test case, it's just a dependency upgrade. Also, users can simply replace their ORC jar files. Apache Spark is not a fat assembly jar file for this reason. > Spark SQL Job failing because of Kryo buffer overflow with ORC > -- > > Key: SPARK-27107 > URL: https://issues.apache.org/jira/browse/SPARK-27107 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Dhruve Ashar >Priority: Major > > The issue occurs while trying to read ORC data and setting the SearchArgument. > {code:java} > Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9 > Serialization trace: > literalList > (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl) > leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl) > at com.esotericsoftware.kryo.io.Output.require(Output.java:163) > at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614) > at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) > at > org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96) > at > org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57) > at > 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315) > at > org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121) > at > org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$a
[jira] [Assigned] (SPARK-27041) large partition data cause pyspark with python2.x oom
[ https://issues.apache.org/jira/browse/SPARK-27041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27041: - Assignee: David Yang > large partition data cause pyspark with python2.x oom > - > > Key: SPARK-27041 > URL: https://issues.apache.org/jira/browse/SPARK-27041 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: David Yang >Assignee: David Yang >Priority: Major > > With large partitions, PySpark may exceed the executor memory limit and trigger > out-of-memory errors on Python 2.7. > This is because map() is used: unlike in Python 3.x, Python 2.7's map() > generates a list and must read all data into memory. > The proposed fix uses imap in Python 2.7 and has been verified.
[jira] [Resolved] (SPARK-27041) large partition data cause pyspark with python2.x oom
[ https://issues.apache.org/jira/browse/SPARK-27041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27041. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23954 [https://github.com/apache/spark/pull/23954] > large partition data cause pyspark with python2.x oom > - > > Key: SPARK-27041 > URL: https://issues.apache.org/jira/browse/SPARK-27041 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: David Yang >Assignee: David Yang >Priority: Major > Fix For: 3.0.0 > > > With large partitions, PySpark may exceed the executor memory limit and trigger > out-of-memory errors on Python 2.7. > This is because map() is used: unlike in Python 3.x, Python 2.7's map() > generates a list and must read all data into memory. > The proposed fix uses imap in Python 2.7 and has been verified.
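The map()-vs-imap distinction described above can be sketched as a small version-agnostic helper. This is an illustrative sketch of the idea only, not the code from the merged patch:

```python
import sys

def lazy_map(fn, iterable):
    """Return a lazy mapping on both Python 2 and Python 3.

    Python 2's built-in map() materializes the entire result as a list,
    which is what let a large partition exceed the executor's memory
    limit; Python 3's map() (and Python 2's itertools.imap) instead
    yields elements one at a time.
    """
    if sys.version_info[0] == 2:
        from itertools import imap  # Python 2 only; removed in Python 3
        return imap(fn, iterable)
    return map(fn, iterable)
```

On Python 3 this returns a map iterator rather than a list, so memory use stays bounded regardless of partition size.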
[jira] [Resolved] (SPARK-27125) Add test suite for sql execution page
[ https://issues.apache.org/jira/browse/SPARK-27125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27125. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24052 [https://github.com/apache/spark/pull/24052] > Add test suite for sql execution page > - > > Key: SPARK-27125 > URL: https://issues.apache.org/jira/browse/SPARK-27125 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: shahid >Assignee: shahid >Priority: Minor > Fix For: 3.0.0 > > > Add test suite for sql execution page -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27125) Add test suite for sql execution page
[ https://issues.apache.org/jira/browse/SPARK-27125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27125: - Assignee: shahid > Add test suite for sql execution page > - > > Key: SPARK-27125 > URL: https://issues.apache.org/jira/browse/SPARK-27125 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: shahid >Assignee: shahid >Priority: Minor > > Add test suite for sql execution page -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC
[ https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790610#comment-16790610 ] Dhruve Ashar commented on SPARK-27107: -- Update: The PR was merged in the orc repository. My understanding is that we should update our pom once a new orc release is cut out. > Spark SQL Job failing because of Kryo buffer overflow with ORC > -- > > Key: SPARK-27107 > URL: https://issues.apache.org/jira/browse/SPARK-27107 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Dhruve Ashar >Priority: Major > > The issue occurs while trying to read ORC data and setting the SearchArgument. > {code:java} > Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. > Available: 0, required: 9 > Serialization trace: > literalList > (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl) > leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl) > at com.esotericsoftware.kryo.io.Output.require(Output.java:163) > at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614) > at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at 
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) > at > org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96) > at > org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315) > at > org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121) > at > org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.s
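For intuition, the overflow above amounts to a fixed-capacity write buffer being asked to hold an arbitrarily long literal list from the SearchArgument. A toy pure-Python model of that failure mode (hypothetical names; this is not ORC's or Kryo's actual API):

```python
class BoundedOutput:
    """Toy stand-in for a fixed-capacity serialization buffer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = bytearray()

    def write_long(self, value):
        # Each long costs 8 bytes here; a real Kryo Output uses varint
        # encodings, but the bounded-capacity failure is the same.
        encoded = value.to_bytes(8, "big", signed=True)
        if len(self.buf) + len(encoded) > self.capacity:
            # Mirrors the shape of the KryoException seen in the trace.
            raise OverflowError(
                "Buffer overflow. Available: %d, required: %d"
                % (self.capacity - len(self.buf), len(encoded)))
        self.buf += encoded

out = BoundedOutput(capacity=64)
try:
    for literal in range(1000):  # e.g. a very large IN (...) literal list
        out.write_long(literal)
except OverflowError as exc:
    print(exc)
```

Note the overflow in the trace happens inside ORC's own Kryo instance while writing the SearchArgument into the job configuration, which is why the fix landed in the ORC library rather than in a Spark setting.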
[jira] [Commented] (SPARK-23098) Migrate Kafka batch source to v2
[ https://issues.apache.org/jira/browse/SPARK-23098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790585#comment-16790585 ] Dylan Guedes commented on SPARK-23098: -- Hi, I would like to work on this one. [~joseph.torres] do you mind helping me with a few suggestions if I get really stuck? Also, is this one similar to the CSVReader/JSONReader? > Migrate Kafka batch source to v2 > > > Key: SPARK-23098 > URL: https://issues.apache.org/jira/browse/SPARK-23098 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major >
[jira] [Comment Edited] (SPARK-23450) jars option in spark submit is documented in misleading way
[ https://issues.apache.org/jira/browse/SPARK-23450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790558#comment-16790558 ] Sujith Chacko edited comment on SPARK-23450 at 3/12/19 1:43 PM: As per my understanding, the jar will be distributed to the worker nodes. I already tested a UDF scenario where a custom jar is added to the nodes via --jars and executed the UDF query, and it's working fine. was (Author: s71955): As pr my understanding the jar will be distributed to the worker nodes. Already tested a UDF scenario where a custom jar is added to the nodes via --jars and executed the UDF query and its working fine > jars option in spark submit is documented in misleading way > --- > > Key: SPARK-23450 > URL: https://issues.apache.org/jira/browse/SPARK-23450 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.2.1 >Reporter: Gregory Reshetniak >Priority: Major > > I am wondering if the {{--jars}} option on spark submit is actually meant for > distributing the dependency jars onto the nodes in cluster? > > In my case I can see it working as a "symlink" of sorts. But the > documentation is written in the way that suggests otherwise. Please help me > figure out if this is a bug or just my reading of the docs. Thanks!
[jira] [Commented] (SPARK-23450) jars option in spark submit is documented in misleading way
[ https://issues.apache.org/jira/browse/SPARK-23450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790558#comment-16790558 ] Sujith Chacko commented on SPARK-23450: --- As per my understanding, the jar will be distributed to the worker nodes. I already tested a UDF scenario where a custom jar is added to the nodes via --jars and executed the UDF query, and it's working fine. > jars option in spark submit is documented in misleading way > --- > > Key: SPARK-23450 > URL: https://issues.apache.org/jira/browse/SPARK-23450 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.2.1 >Reporter: Gregory Reshetniak >Priority: Major > > I am wondering if the {{--jars}} option on spark submit is actually meant for > distributing the dependency jars onto the nodes in cluster? > > In my case I can see it working as a "symlink" of sorts. But the > documentation is written in the way that suggests otherwise. Please help me > figure out if this is a bug or just my reading of the docs. Thanks!
[jira] [Assigned] (SPARK-27136) Remove data source option check_files_exist
[ https://issues.apache.org/jira/browse/SPARK-27136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27136: Assignee: Apache Spark > Remove data source option check_files_exist > --- > > Key: SPARK-27136 > URL: https://issues.apache.org/jira/browse/SPARK-27136 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > The data source option check_files_exist is introduced in In #23383 when the > file source V2 framework is implemented. In the PR, FileIndex was created as > a member of FileTable, so that we could implement partition pruning like > 0f9fcab in the future. For file writes, we needed the option to decide > whether to check file existence. > After https://github.com/apache/spark/pull/23774, the option is not needed > anymore. This PR is to clean the option. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27136) Remove data source option check_files_exist
[ https://issues.apache.org/jira/browse/SPARK-27136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27136: --- Description: The data source option check_files_exist is introduced in In https://github.com/apache/spark/pull/23383 when the file source V2 framework is implemented. In the PR, FileIndex was created as a member of FileTable, so that we could implement partition pruning like 0f9fcab in the future. At that time FileIndexes will always be created for file writes, so we needed the option to decide whether to check file existence. After https://github.com/apache/spark/pull/23774, the option is not needed anymore. This PR is to clean the option. was: The data source option check_files_exist is introduced in In https://github.com/apache/spark/pull/23383 when the file source V2 framework is implemented. In the PR, FileIndex was created as a member of FileTable, so that we could implement partition pruning like 0f9fcab in the future. For file writes, we needed the option to decide whether to check file existence. After https://github.com/apache/spark/pull/23774, the option is not needed anymore. This PR is to clean the option. > Remove data source option check_files_exist > --- > > Key: SPARK-27136 > URL: https://issues.apache.org/jira/browse/SPARK-27136 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > The data source option check_files_exist is introduced in In > https://github.com/apache/spark/pull/23383 when the file source V2 framework > is implemented. In the PR, FileIndex was created as a member of FileTable, so > that we could implement partition pruning like 0f9fcab in the future. At that > time FileIndexes will always be created for file writes, so we needed the > option to decide whether to check file existence. > After https://github.com/apache/spark/pull/23774, the option is not needed > anymore. This PR is to clean the option. 
[jira] [Assigned] (SPARK-27136) Remove data source option check_files_exist
[ https://issues.apache.org/jira/browse/SPARK-27136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27136: Assignee: (was: Apache Spark) > Remove data source option check_files_exist > --- > > Key: SPARK-27136 > URL: https://issues.apache.org/jira/browse/SPARK-27136 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > The data source option check_files_exist is introduced in In #23383 when the > file source V2 framework is implemented. In the PR, FileIndex was created as > a member of FileTable, so that we could implement partition pruning like > 0f9fcab in the future. For file writes, we needed the option to decide > whether to check file existence. > After https://github.com/apache/spark/pull/23774, the option is not needed > anymore. This PR is to clean the option. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27136) Remove data source option check_files_exist
[ https://issues.apache.org/jira/browse/SPARK-27136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-27136: --- Description: The data source option check_files_exist is introduced in In https://github.com/apache/spark/pull/23383 when the file source V2 framework is implemented. In the PR, FileIndex was created as a member of FileTable, so that we could implement partition pruning like 0f9fcab in the future. For file writes, we needed the option to decide whether to check file existence. After https://github.com/apache/spark/pull/23774, the option is not needed anymore. This PR is to clean the option. was: The data source option check_files_exist is introduced in In #23383 when the file source V2 framework is implemented. In the PR, FileIndex was created as a member of FileTable, so that we could implement partition pruning like 0f9fcab in the future. For file writes, we needed the option to decide whether to check file existence. After https://github.com/apache/spark/pull/23774, the option is not needed anymore. This PR is to clean the option. > Remove data source option check_files_exist > --- > > Key: SPARK-27136 > URL: https://issues.apache.org/jira/browse/SPARK-27136 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > The data source option check_files_exist is introduced in In > https://github.com/apache/spark/pull/23383 when the file source V2 framework > is implemented. In the PR, FileIndex was created as a member of FileTable, so > that we could implement partition pruning like 0f9fcab in the future. For > file writes, we needed the option to decide whether to check file existence. > After https://github.com/apache/spark/pull/23774, the option is not needed > anymore. This PR is to clean the option. 
[jira] [Created] (SPARK-27136) Remove data source option check_files_exist
Gengliang Wang created SPARK-27136: -- Summary: Remove data source option check_files_exist Key: SPARK-27136 URL: https://issues.apache.org/jira/browse/SPARK-27136 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang The data source option check_files_exist is introduced in In #23383 when the file source V2 framework is implemented. In the PR, FileIndex was created as a member of FileTable, so that we could implement partition pruning like 0f9fcab in the future. For file writes, we needed the option to decide whether to check file existence. After https://github.com/apache/spark/pull/23774, the option is not needed anymore. This PR is to clean the option. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27135) Description column under jobs does not show complete text if there is overflow
[ https://issues.apache.org/jira/browse/SPARK-27135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-27135: -- Summary: Description column under jobs does not show complete text if there is overflow (was: Description column under jobs does not complete text if there is overflow) > Description column under jobs does not show complete text if there is overflow > -- > > Key: SPARK-27135 > URL: https://issues.apache.org/jira/browse/SPARK-27135 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sandeep Katta >Priority: Minor > Attachments: UIIssue.PNG > > > In Spark webUI if the query is big then user cannot see the complete query > details even on the mouse over -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27105) Prevent exponential complexity in ORC `createFilter`
[ https://issues.apache.org/jira/browse/SPARK-27105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27105: Assignee: Apache Spark > Prevent exponential complexity in ORC `createFilter` > -- > > Key: SPARK-27105 > URL: https://issues.apache.org/jira/browse/SPARK-27105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Ivan Vergiliev >Assignee: Apache Spark >Priority: Major > Labels: performance > > `OrcFilters.createFilters` currently has complexity that's exponential in the > height of the filter tree. There are multiple places in Spark that try to > prevent the generation of skewed trees so as to not trigger this behaviour, > for example: > - `org.apache.spark.sql.catalyst.parser.AstBuilder.visitLogicalBinary` > combines a number of binary logical expressions into a balanced tree. > - https://github.com/apache/spark/pull/22313 introduced a change to > `OrcFilters` to create a balanced tree instead of a skewed tree. > However, the underlying exponential behaviour can still be triggered by code > paths that don't go through any of the tree balancing methods. For example, > if one generates a tree of `Column`s directly in user code, there's nothing > in Spark that automatically balances that tree and, hence, skewed trees hit > the exponential behaviour. We have hit this in production with jobs > mysteriously taking hours on the Spark driver with no worker activity, with > as few as ~30 OR filters. > I have a fix locally that makes the underlying logic have linear complexity > instead of exponential complexity. With this fix, the code can handle > thousands of filters in milliseconds. I'll send a PR with the fix soon. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27105) Prevent exponential complexity in ORC `createFilter`
[ https://issues.apache.org/jira/browse/SPARK-27105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27105: Assignee: (was: Apache Spark) > Prevent exponential complexity in ORC `createFilter` > -- > > Key: SPARK-27105 > URL: https://issues.apache.org/jira/browse/SPARK-27105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Ivan Vergiliev >Priority: Major > Labels: performance > > `OrcFilters.createFilters` currently has complexity that's exponential in the > height of the filter tree. There are multiple places in Spark that try to > prevent the generation of skewed trees so as to not trigger this behaviour, > for example: > - `org.apache.spark.sql.catalyst.parser.AstBuilder.visitLogicalBinary` > combines a number of binary logical expressions into a balanced tree. > - https://github.com/apache/spark/pull/22313 introduced a change to > `OrcFilters` to create a balanced tree instead of a skewed tree. > However, the underlying exponential behaviour can still be triggered by code > paths that don't go through any of the tree balancing methods. For example, > if one generates a tree of `Column`s directly in user code, there's nothing > in Spark that automatically balances that tree and, hence, skewed trees hit > the exponential behaviour. We have hit this in production with jobs > mysteriously taking hours on the Spark driver with no worker activity, with > as few as ~30 OR filters. > I have a fix locally that makes the underlying logic have linear complexity > instead of exponential complexity. With this fix, the code can handle > thousands of filters in milliseconds. I'll send a PR with the fix soon. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
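The tree-balancing idea referenced in the description (combining predicates into a balanced tree rather than a skewed chain) can be applied directly in user code. This is an illustrative Python sketch, not the Scala fix in `OrcFilters`; it works on anything supporting `|`, such as pyspark `Column` objects:

```python
def balanced_or(preds):
    """Combine predicates with | so the tree height is O(log n).

    A left-skewed chain such as ((p0 | p1) | p2) | p3 ... is the shape
    that triggers the exponential behaviour in OrcFilters.createFilter;
    merging pairwise keeps the resulting expression tree balanced.
    """
    preds = list(preds)
    if not preds:
        raise ValueError("need at least one predicate")
    while len(preds) > 1:
        merged = []
        for i in range(0, len(preds) - 1, 2):
            merged.append(preds[i] | preds[i + 1])
        if len(preds) % 2:           # carry the unpaired last element over
            merged.append(preds[-1])
        preds = merged
    return preds[0]
```

With ~30 OR conditions, a skewed chain has height ~30 while the balanced form has height ~5, which is the difference the description's exponential-in-height cost makes visible.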
[jira] [Commented] (SPARK-27135) Description column under jobs does not complete text if there is overflow
[ https://issues.apache.org/jira/browse/SPARK-27135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790440#comment-16790440 ] Sandeep Katta commented on SPARK-27135: --- On hover complete Query should be shown > Description column under jobs does not complete text if there is overflow > - > > Key: SPARK-27135 > URL: https://issues.apache.org/jira/browse/SPARK-27135 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sandeep Katta >Priority: Minor > Attachments: UIIssue.PNG > > > In Spark webUI if the query is big then user cannot see the complete query > details even on the mouse over -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27135) Description column under jobs does not complete text if there is overflow
[ https://issues.apache.org/jira/browse/SPARK-27135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27135: Assignee: Apache Spark > Description column under jobs does not complete text if there is overflow > - > > Key: SPARK-27135 > URL: https://issues.apache.org/jira/browse/SPARK-27135 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sandeep Katta >Assignee: Apache Spark >Priority: Minor > Attachments: UIIssue.PNG > > > In Spark webUI if the query is big then user cannot see the complete query > details even on the mouse over -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27135) Description column under jobs does not complete text if there is overflow
[ https://issues.apache.org/jira/browse/SPARK-27135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27135: Assignee: (was: Apache Spark) > Description column under jobs does not complete text if there is overflow > - > > Key: SPARK-27135 > URL: https://issues.apache.org/jira/browse/SPARK-27135 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sandeep Katta >Priority: Minor > Attachments: UIIssue.PNG > > > In Spark webUI if the query is big then user cannot see the complete query > details even on the mouse over -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27135) Description column under jobs does not complete text if there is overflow
Sandeep Katta created SPARK-27135: - Summary: Description column under jobs does not complete text if there is overflow Key: SPARK-27135 URL: https://issues.apache.org/jira/browse/SPARK-27135 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.0, 2.3.2 Reporter: Sandeep Katta Attachments: UIIssue.PNG In Spark webUI if the query is big then user cannot see the complete query details even on the mouse over -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27135) Description column under jobs does not complete text if there is overflow
[ https://issues.apache.org/jira/browse/SPARK-27135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-27135: -- Attachment: UIIssue.PNG > Description column under jobs does not complete text if there is overflow > - > > Key: SPARK-27135 > URL: https://issues.apache.org/jira/browse/SPARK-27135 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sandeep Katta >Priority: Minor > Attachments: UIIssue.PNG > > > In Spark webUI if the query is big then user cannot see the complete query > details even on the mouse over -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24486) Slow performance reading ArrayType columns
[ https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali resolved SPARK-24486. - Resolution: Fixed Fix Version/s: 3.0.0 > Slow performance reading ArrayType columns > -- > > Key: SPARK-24486 > URL: https://issues.apache.org/jira/browse/SPARK-24486 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Luca Canali >Priority: Minor > Fix For: 3.0.0 > > > We have found an issue of slow performance in one of our applications when > running on Spark 2.3.0 (the same workload does not have a performance issue > on Spark 2.2.1). We suspect a regression in the area of handling columns of > ArrayType. I have built a simplified test case showing a manifestation of the > issue to help with troubleshooting: > > > {code:scala} > // prepare test data > val stringListValues = Range(1, 3).mkString(",") > sql(s"select 1 as myid, Array($stringListValues) as myarray from > range(2)").repartition(1).write.parquet("file:///tmp/deleteme1") > // run test > spark.read.parquet("file:///tmp/deleteme1").limit(1).show(){code} > Performance measurements: > > On a desktop-size test system, the test runs in about 2 sec using Spark 2.2.1 > (runtime goes down to subsecond in subsequent runs) and takes close to 20 sec > on Spark 2.3.0 > > Additional drill-down using Spark task metrics data shows that in Spark 2.2.1 > only 2 records are read by this workload, while on Spark 2.3.0 all rows in > the file are read, which appears anomalous. 
> Example: > {code:java} > bin/spark-shell --master local[*] --driver-memory 2g --packages > ch.cern.sparkmeasure:spark-measure_2.11:0.11 > val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark) > stageMetrics.runAndMeasure(spark.read.parquet("file:///tmp/deleteme1").limit(1).show()) > {code} > > > Selected metrics from Spark 2.3.0 run: > > {noformat} > elapsedTime => 17849 (18 s) > sum(numTasks) => 11 > sum(recordsRead) => 2 > sum(bytesRead) => 1136448171 (1083.0 MB){noformat} > > > From Spark 2.2.1 run: > > {noformat} > elapsedTime => 1329 (1 s) > sum(numTasks) => 2 > sum(recordsRead) => 2 > sum(bytesRead) => 269162610 (256.0 MB) > {noformat} > > Note: Using Spark built from master (as I write this, June 7th 2018) shows > the same behavior as found in Spark 2.3.0 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21385) hive-thriftserver register too many listener in listenerbus
[ https://issues.apache.org/jira/browse/SPARK-21385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790434#comment-16790434 ] Ekaterina Shurgalina commented on SPARK-21385: -- Hi [~honestman]! Is this issue still there? I would like to work on it. Please assign it to me. > hive-thriftserver register too many listener in listenerbus > --- > > Key: SPARK-21385 > URL: https://issues.apache.org/jira/browse/SPARK-21385 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.1.0 >Reporter: honestman >Priority: Minor > Labels: easyfix > > When spark.sql.hive.thriftServer.singleSession is set to true, > SparkSQLSessionManager creates a new session for each connection; while creating > the new SqlContext it also creates a new SparkListener and registers it with the > listener bus, but the listener is not removed when the Hive session is closed. > Over time too many listeners accumulate in the listener bus, which may cause it > to drop events. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
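The shape of the fix the report implies can be sketched in plain Scala. This is a minimal, hypothetical model (`ListenerBus` and `HiveSession` here are stand-ins, not Spark's actual classes): each session keeps a handle on the listener it registered and deregisters it on close, the step SPARK-21385 says is missing.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Spark's listener bus: just tracks registered listeners.
class ListenerBus {
  private val listeners = mutable.ListBuffer.empty[String => Unit]
  def register(l: String => Unit): Unit = listeners += l
  def remove(l: String => Unit): Unit = listeners -= l
  def size: Int = listeners.size
}

// Hypothetical session: registers one listener on open and, crucially,
// removes the same listener instance again when the session is closed.
class HiveSession(bus: ListenerBus) {
  private val listener: String => Unit = _ => ()
  bus.register(listener)
  def close(): Unit = bus.remove(listener)
}

val bus = new ListenerBus
val sessions = (1 to 100).map(_ => new HiveSession(bus))
sessions.foreach(_.close())
// After closing every session the bus is empty again, instead of
// accumulating one leaked listener per connection.
```

Without the `remove` call in `close()`, `bus.size` would grow by one per connection, which is the accumulation (and eventual event dropping) the issue describes.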
[jira] [Comment Edited] (SPARK-10746) count ( distinct columnref) over () returns wrong result set
[ https://issues.apache.org/jira/browse/SPARK-10746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790428#comment-16790428 ] Izek Greenfield edited comment on SPARK-10746 at 3/12/19 10:41 AM: --- you can implement that by using: {code:scala} import org.apache.spark.sql.functions._ size(collect_set(column).over(window)) {code} was (Author: igreenfi): you can implement that by using: {code:java} // Some comments here size(collect_set(column).over(window)) {code} > count ( distinct columnref) over () returns wrong result set > > > Key: SPARK-10746 > URL: https://issues.apache.org/jira/browse/SPARK-10746 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: N Campbell >Priority: Major > > Same issue as report against HIVE (HIVE-9534) > Result set was expected to contain 5 rows instead of 1 row as others vendors > (ORACLE, Netezza etc) would. > select count( distinct column) over () from t1 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10746) count ( distinct columnref) over () returns wrong result set
[ https://issues.apache.org/jira/browse/SPARK-10746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790428#comment-16790428 ] Izek Greenfield commented on SPARK-10746: - you can implement that by using: {code:java} // Some comments here size(collect_set(column).over(window)) {code} > count ( distinct columnref) over () returns wrong result set > > > Key: SPARK-10746 > URL: https://issues.apache.org/jira/browse/SPARK-10746 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: N Campbell >Priority: Major > > Same issue as report against HIVE (HIVE-9534) > Result set was expected to contain 5 rows instead of 1 row as others vendors > (ORACLE, Netezza etc) would. > select count( distinct column) over () from t1 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
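The idea behind the suggested workaround can be sketched with plain Scala collections (no Spark required): over an unbounded window, `size(collect_set(column))` is simply the number of distinct values, broadcast to every row, which is what `count(distinct column) over ()` should return.

```scala
// Values of the column across the whole (unbounded) window.
val column = Seq(1, 2, 2, 3, 1)

// collect_set(column) over the window gathers the distinct values...
val collected = column.toSet

// ...and size(...) yields the distinct count, repeated for every input row.
val perRow = column.map(_ => collected.size)
// Five rows in, five identical counts out -- not a single-row result.
```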
[jira] [Created] (SPARK-27134) array_distinct function does not work correctly with columns containing array of array
Mike Trenaman created SPARK-27134: - Summary: array_distinct function does not work correctly with columns containing array of array Key: SPARK-27134 URL: https://issues.apache.org/jira/browse/SPARK-27134 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Environment: Spark 2.4, Scala 2.11.11 Reporter: Mike Trenaman The array_distinct function introduced in Spark 2.4 is producing strange results when used on an array column which contains a nested array. The resulting output can still contain duplicate values, and furthermore, previously distinct values may be removed. This is easily repeatable, e.g. with this code: {code:scala} val df = Seq( Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5)) ).toDF("Number_Combinations") val dfWithDistinct = df.withColumn("distinct_combinations", array_distinct(col("Number_Combinations"))) {code} The initial 'df' DataFrame contains one row, where column 'Number_Combinations' contains the following values: [[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]] The array_distinct function run on this column produces a new column containing the following values: [[1, 2], [1, 2], [1, 2]] As you can see, this contains three occurrences of the same value (1, 2), and furthermore, the distinct values (3, 4), (4, 5) have been removed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
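For comparison, plain Scala's `distinct` on the same nested sequence shows the semantics the reporter expected (a sketch of the intended behavior using ordinary collections, not of Spark's internal codegen where the bug lives):

```scala
// The same nested data as the 'Number_Combinations' column in the report.
val combos = Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))

// Structural equality on the inner sequences: the three copies of Seq(1, 2)
// collapse to one, and Seq(3, 4) and Seq(4, 5) are kept.
val expected = combos.distinct
```

The expected result is `[[1, 2], [3, 4], [4, 5]]`, not the `[[1, 2], [1, 2], [1, 2]]` that `array_distinct` produced in Spark 2.4.0.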
[jira] [Created] (SPARK-27132) Improve file source V2 framework
Gengliang Wang created SPARK-27132: -- Summary: Improve file source V2 framework Key: SPARK-27132 URL: https://issues.apache.org/jira/browse/SPARK-27132 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang During the migration of CSV V2, I found that we can improve the file source V2 framework by: 1. Checking for duplicated column names in both read and write. 2. Not all file sources support filter push-down, so remove `SupportsPushDownFilters` from FileScanBuilder. 3. The method `isSplitable` might require data source options, so add a new member `options` to FileScan. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
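Point 1 (the duplicated-column-name check) could look roughly like the following. This is a hedged sketch: the helper name and the case-insensitive comparison are assumptions for illustration, not the actual Spark patch.

```scala
// Hypothetical helper: report column names that appear more than once,
// compared case-insensitively (Spark resolves columns case-insensitively
// by default, under spark.sql.caseSensitive=false).
def duplicatedColumns(names: Seq[String]): Seq[String] =
  names
    .groupBy(_.toLowerCase)
    .collect { case (_, group) if group.size > 1 => group.head }
    .toSeq

// "id" and "ID" collide case-insensitively, so "id" is flagged.
val dups = duplicatedColumns(Seq("id", "name", "ID", "value"))
```

A file source V2 implementation could run such a check once in both the read path (against the inferred or user-supplied schema) and the write path (against the output schema) and fail fast with a clear error.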
[jira] [Updated] (SPARK-27133) Refactor the REST based spark app management API to follow the new interface
[ https://issues.apache.org/jira/browse/SPARK-27133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27133: Summary: Refactor the REST based spark app management API to follow the new interface (was: Refactor the spark app management API based on REST to follow the new interface) > Refactor the REST based spark app management API to follow the new interface > > > Key: SPARK-27133 > URL: https://issues.apache.org/jira/browse/SPARK-27133 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Based on the discussion > [here|https://github.com/apache/spark/pull/23599/files#r254864701], we have > introduced in that PR a new interface to manage the `kill` and `list` ops for > spark apps specifically for k8s. We should refactor the REST based one for > mesos and standalone too, to use that interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
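The kind of interface being discussed can be sketched as a small Scala trait. The names below are illustrative, loosely following the linked PR discussion; they are not the exact Spark API. The point is that the k8s backend and the REST-based Mesos/standalone backends would satisfy one common contract for the `kill` and `list` operations.

```scala
// Illustrative contract for app-management backends (names are assumptions).
trait SparkAppHandler {
  def kill(submissionId: String): Boolean
  def list(): Seq[String]
}

// An in-memory stub standing in for a REST-backed implementation,
// just to show the contract in use.
class InMemoryHandler(initial: Seq[String]) extends SparkAppHandler {
  private var apps = initial
  def kill(submissionId: String): Boolean = {
    val existed = apps.contains(submissionId)
    apps = apps.filterNot(_ == submissionId)   // drop the killed submission
    existed
  }
  def list(): Seq[String] = apps
}

val handler: SparkAppHandler = new InMemoryHandler(Seq("driver-001", "driver-002"))
val killed = handler.kill("driver-001")
val remaining = handler.list()
```

With such a trait in place, the refactor described in the issue amounts to making the REST-based Mesos and standalone management paths implement it alongside the existing k8s one.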
[jira] [Updated] (SPARK-27133) Refactor the spark app management API based on REST to follow the new interface
[ https://issues.apache.org/jira/browse/SPARK-27133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27133: Description: Based on the discussion [here|https://github.com/apache/spark/pull/23599/files#r254864701], we have introduced in that PR a new interface to manage the `kill` and `list` ops for spark apps specifically for k8s. We should refactor the REST based one for mesos and standalone too, to use that interface. (was: Based on the discussion [here|https://github.com/apache/spark/pull/23599/files#r254864701], we have introduced in that PR a new interface to manage the `kill` and `list` ops for spark apps specifically for k8s. We should refactor the rest based one for mesos and standalone too to use that one. ) > Refactor the spark app management API based on REST to follow the new > interface > --- > > Key: SPARK-27133 > URL: https://issues.apache.org/jira/browse/SPARK-27133 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Based on the discussion > [here|https://github.com/apache/spark/pull/23599/files#r254864701], we have > introduced in that PR a new interface to manage the `kill` and `list` ops for > spark apps specifically for k8s. We should refactor the REST based one for > mesos and standalone too, to use that interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org