[jira] [Created] (SPARK-36889) Respect `spark.sql.parquet.filterPushdown` by explain() for DSv2

2021-09-28 Thread Max Gekk (Jira)
Max Gekk created SPARK-36889:


 Summary: Respect `spark.sql.parquet.filterPushdown` by explain() 
for DSv2
 Key: SPARK-36889
 URL: https://issues.apache.org/jira/browse/SPARK-36889
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk
Assignee: Max Gekk


When filter pushdown for Parquet is disabled via the SQL config 
spark.sql.parquet.filterPushdown, explain() still shows pushed-down filters:

{code}
== Parsed Logical Plan ==
'Filter ('c0 = 1)
+- RelationV2[c0#7] parquet 
file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/spark-ff7e9a24-fd4e-4981-9c75-e1bcde78e91a

== Analyzed Logical Plan ==
c0: int
Filter (c0#7 = 1)
+- RelationV2[c0#7] parquet 
file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/spark-ff7e9a24-fd4e-4981-9c75-e1bcde78e91a

== Optimized Logical Plan ==
Filter (isnotnull(c0#7) AND (c0#7 = 1))
+- RelationV2[c0#7] parquet 
file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/spark-ff7e9a24-fd4e-4981-9c75-e1bcde78e91a

== Physical Plan ==
*(1) Filter (isnotnull(c0#7) AND (c0#7 = 1))
+- *(1) ColumnarToRow
   +- BatchScan[c0#7] ParquetScan DataFilters: [isnotnull(c0#7), (c0#7 = 1)], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/spark-ff..., PartitionFilters: [], PushedFilters: [IsNotNull(c0), EqualTo(c0,1)], ReadSchema: struct<c0:int>, PushedFilters: [IsNotNull(c0), EqualTo(c0,1)] RuntimeFilters: []
{code}
See PushedFilters: [IsNotNull(c0), EqualTo(c0,1)]
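
A minimal sketch of how to reproduce this (the path and column name are illustrative, and spark.sql.sources.useV1SourceList is cleared here only so the DSv2 Parquet reader produces the RelationV2/BatchScan plan shown above):

{code:scala}
// Route Parquet reads through DSv2 and disable Parquet filter pushdown.
spark.conf.set("spark.sql.sources.useV1SourceList", "")
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

spark.range(10).selectExpr("CAST(id AS INT) AS c0").write.mode("overwrite").parquet("/tmp/spark-36889")
spark.read.parquet("/tmp/spark-36889").filter("c0 = 1").explain(true)
// Expected: no pushed filters reported by the scan node; observed: PushedFilters are still printed.
{code}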




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2021-09-28 Thread Vladimir Prus (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421969#comment-17421969
 ] 

Vladimir Prus commented on SPARK-18105:
---

FYI, we recently started to see a lot of these errors; they appear to be 
correlated with increased load and increased spot terminations in AWS. As an 
experiment, I disabled executor decommissioning, and the errors all 
disappeared. Specifically, I set these options:
{noformat}
storage.decommission.enabled: false
storage.decommission.rddBlocks.enabled: false
storage.decommission.shuffleBlocks.enabled: false{noformat}
It is of course not a perfect set of options for production, but it may hint at 
the problem. I am using a recent build from branch-3.1, specifically 
from commit e1fc62de8e05.
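
For reference, a sketch of the same settings as fully qualified Spark properties (assuming the keys above map onto the standard spark.storage.decommission.* options):

{code:scala}
import org.apache.spark.SparkConf

// Disable block-manager decommissioning entirely (the equivalent of the options listed above).
val conf = new SparkConf()
  .set("spark.storage.decommission.enabled", "false")
  .set("spark.storage.decommission.rddBlocks.enabled", "false")
  .set("spark.storage.decommission.shuffleBlocks.enabled", "false")
{code}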

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Davies Liu
>Priority: Major
> Attachments: TestWeightedGraph.java
>
>
> When lz4 is used to compress the shuffle files, it may fail to decompress it 
> as "stream is corrupt"
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> https://github.com/jpountz/lz4-java/issues/89






[jira] [Updated] (SPARK-35874) AQE Shuffle should wait for its subqueries to finish before materializing

2021-09-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35874:
--
Parent: SPARK-33828
Issue Type: Sub-task  (was: Bug)

> AQE Shuffle should wait for its subqueries to finish before materializing
> -
>
> Key: SPARK-35874
> URL: https://issues.apache.org/jira/browse/SPARK-35874
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0
>
>







[jira] [Commented] (SPARK-35874) AQE Shuffle should wait for its subqueries to finish before materializing

2021-09-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421950#comment-17421950
 ] 

Dongjoon Hyun commented on SPARK-35874:
---

Of course, yes. Thank you, [~shardulm].

> AQE Shuffle should wait for its subqueries to finish before materializing
> -
>
> Key: SPARK-35874
> URL: https://issues.apache.org/jira/browse/SPARK-35874
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0
>
>







[jira] [Resolved] (SPARK-36882) Support ILIKE API on Python

2021-09-28 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-36882.

Fix Version/s: 3.3.0
 Assignee: Leona Yoda
   Resolution: Fixed

Issue resolved in https://github.com/apache/spark/pull/34135

> Support ILIKE API on Python
> ---
>
> Key: SPARK-36882
> URL: https://issues.apache.org/jira/browse/SPARK-36882
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Assignee: Leona Yoda
>Priority: Major
> Fix For: 3.3.0
>
>
> Support ILIKE (case-insensitive LIKE) API on Python
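> A hedged sketch of the intended semantics, shown via the ILIKE SQL syntax from the parent task (this sub-task would expose the same behavior through the Python {{Column}} API):
> {code:scala}
> // ILIKE is a case-insensitive LIKE: the lower-case pattern matches the mixed-case string.
> spark.sql("SELECT 'SparkSQL' ILIKE '%spark%' AS matched").show()  // matched = true
> {code}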






[jira] [Commented] (SPARK-36876) Support Dynamic Partition pruning for HiveTableScanExec

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421944#comment-17421944
 ] 

Apache Spark commented on SPARK-36876:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34139

> Support Dynamic Partition pruning for HiveTableScanExec
> ---
>
> Key: SPARK-36876
> URL: https://issues.apache.org/jira/browse/SPARK-36876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Support dynamic partition pruning for hive serde scan
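> A sketch of the query shape this targets (table and column names are illustrative): a Hive-serde fact table partitioned by {{part}}, joined to a filtered dimension, where the dimension filter should prune fact partitions at runtime.
> {code:scala}
> spark.sql("""
>   SELECT f.id
>   FROM fact_hive_serde f
>   JOIN dim d ON f.part = d.part
>   WHERE d.flag = true
> """).explain()
> // With dynamic partition pruning, the scan of `fact_hive_serde` would read only the
> // partitions whose `part` values survive the filter on `dim`, instead of all partitions.
> {code}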






[jira] [Commented] (SPARK-36876) Support Dynamic Partition pruning for HiveTableScanExec

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421943#comment-17421943
 ] 

Apache Spark commented on SPARK-36876:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34139

> Support Dynamic Partition pruning for HiveTableScanExec
> ---
>
> Key: SPARK-36876
> URL: https://issues.apache.org/jira/browse/SPARK-36876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Support dynamic partition pruning for hive serde scan






[jira] [Assigned] (SPARK-36876) Support Dynamic Partition pruning for HiveTableScanExec

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36876:


Assignee: (was: Apache Spark)

> Support Dynamic Partition pruning for HiveTableScanExec
> ---
>
> Key: SPARK-36876
> URL: https://issues.apache.org/jira/browse/SPARK-36876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Support dynamic partition pruning for hive serde scan






[jira] [Assigned] (SPARK-36876) Support Dynamic Partition pruning for HiveTableScanExec

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36876:


Assignee: Apache Spark

> Support Dynamic Partition pruning for HiveTableScanExec
> ---
>
> Key: SPARK-36876
> URL: https://issues.apache.org/jira/browse/SPARK-36876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Support dynamic partition pruning for hive serde scan






[jira] [Commented] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421941#comment-17421941
 ] 

Apache Spark commented on SPARK-36883:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34138

> Upgrade R version to 4.1.1 in CI images
> ---
>
> Key: SPARK-36883
> URL: https://issues.apache.org/jira/browse/SPARK-36883
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.
> R 4.1.1 has been released. We should test the latest version of R with 
> SparkR.






[jira] [Assigned] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36883:


Assignee: (was: Apache Spark)

> Upgrade R version to 4.1.1 in CI images
> ---
>
> Key: SPARK-36883
> URL: https://issues.apache.org/jira/browse/SPARK-36883
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.
> R 4.1.1 has been released. We should test the latest version of R with 
> SparkR.






[jira] [Assigned] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36883:


Assignee: (was: Apache Spark)

> Upgrade R version to 4.1.1 in CI images
> ---
>
> Key: SPARK-36883
> URL: https://issues.apache.org/jira/browse/SPARK-36883
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.
> R 4.1.1 has been released. We should test the latest version of R with 
> SparkR.






[jira] [Assigned] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36883:


Assignee: Apache Spark

> Upgrade R version to 4.1.1 in CI images
> ---
>
> Key: SPARK-36883
> URL: https://issues.apache.org/jira/browse/SPARK-36883
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.
> R 4.1.1 has been released. We should test the latest version of R with 
> SparkR.






[jira] [Commented] (SPARK-36877) Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing reruns

2021-09-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421939#comment-17421939
 ] 

Hyukjin Kwon commented on SPARK-36877:
--

cc [~maryannxue] too FYI

> Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing 
> reruns
> --
>
> Key: SPARK-36877
> URL: https://issues.apache.org/jira/browse/SPARK-36877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Shardul Mahadik
>Priority: Major
> Attachments: Screen Shot 2021-09-28 at 09.32.20.png
>
>
> In one of our jobs we perform the following operation:
> {code:scala}
> val df = /* some expensive multi-table/multi-stage join */
> val numPartitions = df.rdd.getNumPartitions
> df.repartition(x).write.
> {code}
> With AQE enabled, we found that the expensive stages were being run twice, 
> causing a significant performance regression: once when calling {{df.rdd}} 
> and again when calling {{df.write}}.
> A more concrete example:
> {code:scala}
> scala> sql("SET spark.sql.adaptive.enabled=true")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val df1 = spark.range(10).withColumn("id2", $"id")
> df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), 
> "id").join(spark.range(10), "id")
> df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df3 = df2.groupBy("id2").count()
> df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint]
> scala> df3.rdd.getNumPartitions
> res2: Int = 10
> scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1")
> {code}
> In the screenshot below, you can see that the first 3 stages (0 to 4) were 
> rerun again (5 to 9).
> I have two questions:
> 1) Should calling df.rdd trigger actual job execution when AQE is enabled?
> 2) Should calling df.write later cause rerun of the stages? If df.rdd has 
> already partially executed the stages, shouldn't it reuse the result from 
> previous stages?
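> For what it's worth, a hedged sketch of one possible mitigation (assuming caching the intermediate result is acceptable), reusing {{df3}} from the example above so the stages executed by {{.rdd}} are not recomputed by the write:
> {code:scala}
> df3.persist()                  // materialize the expensive stages once
> df3.rdd.getNumPartitions       // triggers execution; the result is now cached
> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1")  // reuses the cached data
> df3.unpersist()
> {code}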






[jira] [Resolved] (SPARK-36875) 21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resour

2021-09-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36875.
--
Resolution: Invalid

Apparently it failed to connect to the YARN cluster. This seems like an 
environment issue.

> 21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources
> ---
>
> Key: SPARK-36875
> URL: https://issues.apache.org/jira/browse/SPARK-36875
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 3.1.2
> Environment: Eclipse
> Hadoop 3.3
> Spark3.1.2-hadoop3.2
> Dependencies
>  
> {noformat}
> 
> org.apache.spark
>   
> spark-core_2.12 
> 3.1.2
>   
> 
>  org.apache.spark 
> spark-sql_2.12 
> 3.1.2  
> 
>  janino 
> org.codehaus.janino 
>
>  org.codehaus.janino 
> janino 
> 3.0.8
>  
>   org.apache.spark 
> spark-yarn_2.12
>  3.1.2 provided
>  
>   
> org.scala-lang
>  scala-library 
> 2.12.13
>  
> Enviroment Variables set in eclipse: SPARK_HOME path/to/my/sparkfolder
> OS Linux with UBUNTU 20
> The test is launched on my first user davben. 
> Spark folder and hadoop are on my second user hadoop{noformat}
>  
>  
>Reporter: Davide Benedetto
>Priority: Major
>  Labels: Eclipse, YARN, dataset, hadoop, spark-conf, spark-core
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi, I am running a Spark job on YARN programmatically using the Eclipse IDE. 
> Here I:
> 1: Open the Spark session, passing a SparkConf as an input parameter.
>  
> {quote} 
> {code:java}
> System.setProperty("hadoop.home.dir", "/home/hadoop/hadoop"); 
> System.setProperty("hadoop.home.dir", "/home/hadoop/hadoop");        
> System.setProperty("SPARK_YARN_MODE", "yarn");        
> System.setProperty("HADOOP_USER_NAME", "hadoop");
> SparkConf sparkConf = new 
> SparkConf().setAppName("simpleTest2").setMaster("yarn") 
> .set("spark.executor.memory", "1g")
> .set("deploy.mode", "cluster")
> .set("spark.yarn.stagingDir", "hdfs://localhost:9000/user/hadoop/")
> .set("spark.yarn.am.memory", "512m") 
> .set("spark.dynamicAllocation.minExecutors","1") 
> .set("spark.dynamicAllocation.maxExecutors","40") 
> .set("spark.dynamicAllocation.initialExecutors","2")         
> .set("spark.shuffle.service.enabled", "true")         
> .set("spark.dynamicAllocation.enabled", "false")
> .set("spark.cores.max", "1")
>  .set("spark.yarn.executor.memoryOverhead", "500m")         
> .set("spark.executor.instances","2")
> .set("spark.executor.memory","500m")
> .set("spark.num.executors","2")
> .set("spark.executor.cores","1")
> .set("spark.worker.instances","1")
> .set("spark.worker.memory","512m")
> .set("spark.worker.max.heapsize","512m")
> .set("spark.worker.cores","1")
> .set("maximizeResourceAllocation", "true") 
> .set("spark.yarn.nodemanager.resource.cpu-vcores","4") 
> .set("spark.yarn.submit.file.replication", "1")
> SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate(); 
> {code}
>  
> {quote}
> 2: Create a dataset of Rows and show them
> {code:java}
> List<Row> rows = new ArrayList<>();
> rows.add(RowFactory.create("a", "b"));
> rows.add(RowFactory.create("b", "c"));
> rows.add(RowFactory.create("a", "a"));
> StructType structType = new StructType();
> structType = structType.add("edge_1", DataTypes.StringType, false);
> structType = structType.add("edge_2", DataTypes.StringType, false);
> ExpressionEncoder<Row> edgeEncoder = RowEncoder.apply(structType);
> Dataset<Row> edge = spark.createDataset(rows, edgeEncoder);
> edge.show();
>  
> {code}
> {{Up to this point everything is OK: the job is submitted on Hadoop and the rows are shown correctly}}
> {{3: I perform a map that upper-cases the elements in each row}}
>  
> {quote}
> {code:java}
> Dataset<Row> edge2 = edge.map(new MyFunction2(), edgeEncoder);{code}
> {quote}
>  
> {quote}
> {code:java}
> public static class MyFunction2 implements MapFunction<Row, Row> {
>   private static final long serialVersionUID = 1L;
>   @Override
>   public Row call(Row v1) throws Exception {
>     String el1 = v1.get(0).toString().toUpperCase();
>     String el2 = v1.get(1).toString().toUpperCase();
>     return RowFactory.create(el1, el2);
>   }
> }
> {code}
> {quote}
> {{4: Then I show the dataset after map is performed}}
> {quote}
> {code:java}
> edge2.show();{code}
> {quote}
> {{And precisely here the log Loops saying }}
> {{21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any 
> resources; check your cluster UI to ensure that worke

[jira] [Updated] (SPARK-36875) 21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resourc

2021-09-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36875:
-
Target Version/s:   (was: 3.1.2)

> 21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources
> ---
>
> Key: SPARK-36875
> URL: https://issues.apache.org/jira/browse/SPARK-36875
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 3.1.2
> Environment: Eclipse
> Hadoop 3.3
> Spark3.1.2-hadoop3.2
> Dependencies
>  
> {noformat}
> 
> org.apache.spark
>   
> spark-core_2.12 
> 3.1.2
>   
> 
>  org.apache.spark 
> spark-sql_2.12 
> 3.1.2  
> 
>  janino 
> org.codehaus.janino 
>
>  org.codehaus.janino 
> janino 
> 3.0.8
>  
>   org.apache.spark 
> spark-yarn_2.12
>  3.1.2 provided
>  
>   
> org.scala-lang
>  scala-library 
> 2.12.13
>  
> Enviroment Variables set in eclipse: SPARK_HOME path/to/my/sparkfolder
> OS Linux with UBUNTU 20
> The test is launched on my first user davben. 
> Spark folder and hadoop are on my second user hadoop{noformat}
>  
>  
>Reporter: Davide Benedetto
>Priority: Major
>  Labels: Eclipse, YARN, dataset, hadoop, spark-conf, spark-core
> Fix For: 3.1.2
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi, I am running a Spark job on YARN programmatically using the Eclipse IDE. 
> Here I:
> 1: Open the Spark session, passing a SparkConf as an input parameter.
>  
> {quote} 
> {code:java}
> System.setProperty("hadoop.home.dir", "/home/hadoop/hadoop"); 
> System.setProperty("hadoop.home.dir", "/home/hadoop/hadoop");        
> System.setProperty("SPARK_YARN_MODE", "yarn");        
> System.setProperty("HADOOP_USER_NAME", "hadoop");
> SparkConf sparkConf = new 
> SparkConf().setAppName("simpleTest2").setMaster("yarn") 
> .set("spark.executor.memory", "1g")
> .set("deploy.mode", "cluster")
> .set("spark.yarn.stagingDir", "hdfs://localhost:9000/user/hadoop/")
> .set("spark.yarn.am.memory", "512m") 
> .set("spark.dynamicAllocation.minExecutors","1") 
> .set("spark.dynamicAllocation.maxExecutors","40") 
> .set("spark.dynamicAllocation.initialExecutors","2")         
> .set("spark.shuffle.service.enabled", "true")         
> .set("spark.dynamicAllocation.enabled", "false")
> .set("spark.cores.max", "1")
>  .set("spark.yarn.executor.memoryOverhead", "500m")         
> .set("spark.executor.instances","2")
> .set("spark.executor.memory","500m")
> .set("spark.num.executors","2")
> .set("spark.executor.cores","1")
> .set("spark.worker.instances","1")
> .set("spark.worker.memory","512m")
> .set("spark.worker.max.heapsize","512m")
> .set("spark.worker.cores","1")
> .set("maximizeResourceAllocation", "true") 
> .set("spark.yarn.nodemanager.resource.cpu-vcores","4") 
> .set("spark.yarn.submit.file.replication", "1")
> SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate(); 
> {code}
>  
> {quote}
> 2: Create a dataset of Rows and show them
> {code:java}
> List<Row> rows = new ArrayList<>();
> rows.add(RowFactory.create("a", "b"));
> rows.add(RowFactory.create("b", "c"));
> rows.add(RowFactory.create("a", "a"));
> StructType structType = new StructType();
> structType = structType.add("edge_1", DataTypes.StringType, false);
> structType = structType.add("edge_2", DataTypes.StringType, false);
> ExpressionEncoder<Row> edgeEncoder = RowEncoder.apply(structType);
> Dataset<Row> edge = spark.createDataset(rows, edgeEncoder);
> edge.show();
>  
> {code}
> {{Up to this point everything is OK: the job is submitted on Hadoop and the rows are shown correctly}}
> {{3: I perform a map that upper-cases the elements in each row}}
>  
> {quote}
> {code:java}
> Dataset<Row> edge2 = edge.map(new MyFunction2(), edgeEncoder);{code}
> {quote}
>  
> {quote}
> {code:java}
> public static class MyFunction2 implements MapFunction<Row, Row> {
>   private static final long serialVersionUID = 1L;
>   @Override
>   public Row call(Row v1) throws Exception {
>     String el1 = v1.get(0).toString().toUpperCase();
>     String el2 = v1.get(1).toString().toUpperCase();
>     return RowFactory.create(el1, el2);
>   }
> }
> {code}
> {quote}
> {{4: Then I show the dataset after map is performed}}
> {quote}
> {code:java}
> edge2.show();{code}
> {quote}
> {{And precisely here the log Loops saying }}
> {{21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient reso

[jira] [Updated] (SPARK-36875) 21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resourc

2021-09-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36875:
-
Fix Version/s: (was: 3.1.2)

> 21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources
> ---
>
> Key: SPARK-36875
> URL: https://issues.apache.org/jira/browse/SPARK-36875
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 3.1.2
> Environment: Eclipse
> Hadoop 3.3
> Spark3.1.2-hadoop3.2
> Dependencies
>  
> {noformat}
> 
> org.apache.spark
>   
> spark-core_2.12 
> 3.1.2
>   
> 
>  org.apache.spark 
> spark-sql_2.12 
> 3.1.2  
> 
>  janino 
> org.codehaus.janino 
>
>  org.codehaus.janino 
> janino 
> 3.0.8
>  
>   org.apache.spark 
> spark-yarn_2.12
>  3.1.2 provided
>  
>   
> org.scala-lang
>  scala-library 
> 2.12.13
>  
> Enviroment Variables set in eclipse: SPARK_HOME path/to/my/sparkfolder
> OS Linux with UBUNTU 20
> The test is launched on my first user davben. 
> Spark folder and hadoop are on my second user hadoop{noformat}
>  
>  
>Reporter: Davide Benedetto
>Priority: Major
>  Labels: Eclipse, YARN, dataset, hadoop, spark-conf, spark-core
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi, I am running a Spark job on YARN programmatically using the Eclipse IDE. 
> Here I:
> 1: Open the Spark session, passing a SparkConf as an input parameter.
>  
> {quote} 
> {code:java}
> System.setProperty("hadoop.home.dir", "/home/hadoop/hadoop"); 
> System.setProperty("hadoop.home.dir", "/home/hadoop/hadoop");        
> System.setProperty("SPARK_YARN_MODE", "yarn");        
> System.setProperty("HADOOP_USER_NAME", "hadoop");
> SparkConf sparkConf = new 
> SparkConf().setAppName("simpleTest2").setMaster("yarn") 
> .set("spark.executor.memory", "1g")
> .set("deploy.mode", "cluster")
> .set("spark.yarn.stagingDir", "hdfs://localhost:9000/user/hadoop/")
> .set("spark.yarn.am.memory", "512m") 
> .set("spark.dynamicAllocation.minExecutors","1") 
> .set("spark.dynamicAllocation.maxExecutors","40") 
> .set("spark.dynamicAllocation.initialExecutors","2")         
> .set("spark.shuffle.service.enabled", "true")         
> .set("spark.dynamicAllocation.enabled", "false")
> .set("spark.cores.max", "1")
>  .set("spark.yarn.executor.memoryOverhead", "500m")         
> .set("spark.executor.instances","2")
> .set("spark.executor.memory","500m")
> .set("spark.num.executors","2")
> .set("spark.executor.cores","1")
> .set("spark.worker.instances","1")
> .set("spark.worker.memory","512m")
> .set("spark.worker.max.heapsize","512m")
> .set("spark.worker.cores","1")
> .set("maximizeResourceAllocation", "true") 
> .set("spark.yarn.nodemanager.resource.cpu-vcores","4") 
> .set("spark.yarn.submit.file.replication", "1")
> SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate(); 
> {code}
>  
> {quote}
> 2: Create a dataset of Rows and show them
> {code:java}
> List<Row> rows = new ArrayList<>();
> rows.add(RowFactory.create("a", "b"));
> rows.add(RowFactory.create("b", "c"));
> rows.add(RowFactory.create("a", "a"));
> StructType structType = new StructType();
> structType = structType.add("edge_1", DataTypes.StringType, false);
> structType = structType.add("edge_2", DataTypes.StringType, false);
> ExpressionEncoder<Row> edgeEncoder = RowEncoder.apply(structType);
> Dataset<Row> edge = spark.createDataset(rows, edgeEncoder);
> edge.show();
>  
> {code}
> {{Up to this point everything is OK: the job is submitted on Hadoop and the rows are shown correctly}}
> {{3: I perform a map that upper-cases the elements in each row}}
>  
> {quote}
> {code:java}
> Dataset<Row> edge2 = edge.map(new MyFunction2(), edgeEncoder);{code}
> {quote}
>  
> {quote}
> {code:java}
> public static class MyFunction2 implements MapFunction<Row, Row> {
>   private static final long serialVersionUID = 1L;
>   @Override
>   public Row call(Row v1) throws Exception {
>     String el1 = v1.get(0).toString().toUpperCase();
>     String el2 = v1.get(1).toString().toUpperCase();
>     return RowFactory.create(el1, el2);
>   }
> }
> {code}
> {quote}
> {{4: Then I show the dataset after map is performed}}
> {quote}
> {code:java}
> edge2.show();{code}
> {quote}
> {{And precisely here the log Loops saying }}
> {{21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources}}
>  
> {quote}{{Here is t

[jira] [Updated] (SPARK-36875) 21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resourc

2021-09-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36875:
-
Language:   (was: JAVA,)

> 21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources
> ---
>
> Key: SPARK-36875
> URL: https://issues.apache.org/jira/browse/SPARK-36875
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 3.1.2
> Environment: Eclipse
> Hadoop 3.3
> Spark3.1.2-hadoop3.2
> Dependencies
>  
> {noformat}
> 
> org.apache.spark
>   
> spark-core_2.12 
> 3.1.2
>   
> 
>  org.apache.spark 
> spark-sql_2.12 
> 3.1.2  
> 
>  janino 
> org.codehaus.janino 
>
>  org.codehaus.janino 
> janino 
> 3.0.8
>  
>   org.apache.spark 
> spark-yarn_2.12
>  3.1.2 provided
>  
>   
> org.scala-lang
>  scala-library 
> 2.12.13
>  
> Enviroment Variables set in eclipse: SPARK_HOME path/to/my/sparkfolder
> OS Linux with UBUNTU 20
> The test is launched on my first user davben. 
> Spark folder and hadoop are on my second user hadoop{noformat}
>  
>  
>Reporter: Davide Benedetto
>Priority: Major
>  Labels: Eclipse, YARN, dataset, hadoop, spark-conf, spark-core
> Fix For: 3.1.2
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi, I am running a Spark job on YARN programmatically using the Eclipse IDE. 
> Here I:
> 1: Open the Spark session, passing a SparkConf as an input parameter.
>  
> {quote} 
> {code:java}
> System.setProperty("hadoop.home.dir", "/home/hadoop/hadoop"); 
> System.setProperty("hadoop.home.dir", "/home/hadoop/hadoop");        
> System.setProperty("SPARK_YARN_MODE", "yarn");        
> System.setProperty("HADOOP_USER_NAME", "hadoop");
> SparkConf sparkConf = new 
> SparkConf().setAppName("simpleTest2").setMaster("yarn") 
> .set("spark.executor.memory", "1g")
> .set("deploy.mode", "cluster")
> .set("spark.yarn.stagingDir", "hdfs://localhost:9000/user/hadoop/")
> .set("spark.yarn.am.memory", "512m") 
> .set("spark.dynamicAllocation.minExecutors","1") 
> .set("spark.dynamicAllocation.maxExecutors","40") 
> .set("spark.dynamicAllocation.initialExecutors","2")         
> .set("spark.shuffle.service.enabled", "true")         
> .set("spark.dynamicAllocation.enabled", "false")
> .set("spark.cores.max", "1")
>  .set("spark.yarn.executor.memoryOverhead", "500m")         
> .set("spark.executor.instances","2")
> .set("spark.executor.memory","500m")
> .set("spark.num.executors","2")
> .set("spark.executor.cores","1")
> .set("spark.worker.instances","1")
> .set("spark.worker.memory","512m")
> .set("spark.worker.max.heapsize","512m")
> .set("spark.worker.cores","1")
> .set("maximizeResourceAllocation", "true") 
> .set("spark.yarn.nodemanager.resource.cpu-vcores","4") 
> .set("spark.yarn.submit.file.replication", "1")
> SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate(); 
> {code}
>  
> {quote}
> 2: Create a dataset of Rows and show them
> {code:java}
> List<Row> rows = new ArrayList<>();
> rows.add(RowFactory.create("a", "b"));
> rows.add(RowFactory.create("b", "c"));
> rows.add(RowFactory.create("a", "a"));
> StructType structType = new StructType();
> structType = structType.add("edge_1", DataTypes.StringType, false);
> structType = structType.add("edge_2", DataTypes.StringType, false);
> ExpressionEncoder<Row> edgeEncoder = RowEncoder.apply(structType);
> Dataset<Row> edge = spark.createDataset(rows, edgeEncoder);
> edge.show();
>  
> {code}
> {{Up to this point everything is OK: the job is submitted on Hadoop and the rows are shown correctly}}
> {{3: I perform a map that upper-cases the elements in each row}}
>  
> {quote}
> {code:java}
> Dataset<Row> edge2 = edge.map(new MyFunction2(), edgeEncoder);{code}
> {quote}
>  
> {quote}
> {code:java}
> public static class MyFunction2 implements MapFunction<Row, Row> {
>   private static final long serialVersionUID = 1L;
>   @Override
>   public Row call(Row v1) throws Exception {
>     String el1 = v1.get(0).toString().toUpperCase();
>     String el2 = v1.get(1).toString().toUpperCase();
>     return RowFactory.create(el1, el2);
>   }
> }
> {code}
> {quote}
> {{4: Then I show the dataset after map is performed}}
> {quote}
> {code:java}
> edge2.show();{code}
> {quote}
> {{And precisely here the log Loops saying }}
> {{21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources}}

[jira] [Updated] (SPARK-36875) 21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resourc

2021-09-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36875:
-
Flags:   (was: Important)

> 21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources
> ---
>
> Key: SPARK-36875
> URL: https://issues.apache.org/jira/browse/SPARK-36875
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 3.1.2
> Environment: Eclipse
> Hadoop 3.3
> Spark3.1.2-hadoop3.2
> Dependencies
>  
> {noformat}
> 
> org.apache.spark
>   
> spark-core_2.12 
> 3.1.2
>   
> 
>  org.apache.spark 
> spark-sql_2.12 
> 3.1.2  
> 
>  janino 
> org.codehaus.janino 
>
>  org.codehaus.janino 
> janino 
> 3.0.8
>  
>   org.apache.spark 
> spark-yarn_2.12
>  3.1.2 provided
>  
>   
> org.scala-lang
>  scala-library 
> 2.12.13
>  
> Enviroment Variables set in eclipse: SPARK_HOME path/to/my/sparkfolder
> OS Linux with UBUNTU 20
> The test is launched on my first user davben. 
> Spark folder and hadoop are on my second user hadoop{noformat}
>  
>  
>Reporter: Davide Benedetto
>Priority: Major
>  Labels: Eclipse, YARN, dataset, hadoop, spark-conf, spark-core
> Fix For: 3.1.2
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Hi, I am running a Spark job on YARN programmatically using the Eclipse IDE. 
> Here I:
> 1: Open the Spark session, passing a SparkConf as an input parameter.
>  
> {quote} 
> {code:java}
> System.setProperty("hadoop.home.dir", "/home/hadoop/hadoop"); 
> System.setProperty("hadoop.home.dir", "/home/hadoop/hadoop");        
> System.setProperty("SPARK_YARN_MODE", "yarn");        
> System.setProperty("HADOOP_USER_NAME", "hadoop");
> SparkConf sparkConf = new 
> SparkConf().setAppName("simpleTest2").setMaster("yarn") 
> .set("spark.executor.memory", "1g")
> .set("deploy.mode", "cluster")
> .set("spark.yarn.stagingDir", "hdfs://localhost:9000/user/hadoop/")
> .set("spark.yarn.am.memory", "512m") 
> .set("spark.dynamicAllocation.minExecutors","1") 
> .set("spark.dynamicAllocation.maxExecutors","40") 
> .set("spark.dynamicAllocation.initialExecutors","2")         
> .set("spark.shuffle.service.enabled", "true")         
> .set("spark.dynamicAllocation.enabled", "false")
> .set("spark.cores.max", "1")
>  .set("spark.yarn.executor.memoryOverhead", "500m")         
> .set("spark.executor.instances","2")
> .set("spark.executor.memory","500m")
> .set("spark.num.executors","2")
> .set("spark.executor.cores","1")
> .set("spark.worker.instances","1")
> .set("spark.worker.memory","512m")
> .set("spark.worker.max.heapsize","512m")
> .set("spark.worker.cores","1")
> .set("maximizeResourceAllocation", "true") 
> .set("spark.yarn.nodemanager.resource.cpu-vcores","4") 
> .set("spark.yarn.submit.file.replication", "1")
> SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate(); 
> {code}
>  
> {quote}
> 2: Create a dataset of Rows and show them
> {code:java}
> List<Row> rows = new ArrayList<>();
> rows.add(RowFactory.create("a", "b"));
> rows.add(RowFactory.create("b", "c"));
> rows.add(RowFactory.create("a", "a"));
> StructType structType = new StructType();
> structType = structType.add("edge_1", DataTypes.StringType, false);
> structType = structType.add("edge_2", DataTypes.StringType, false);
> ExpressionEncoder<Row> edgeEncoder = RowEncoder.apply(structType);
> Dataset<Row> edge = spark.createDataset(rows, edgeEncoder);
> edge.show();
>  
> {code}
> {{Up to this point everything is OK: the job is submitted on Hadoop and the rows are shown correctly}}
> {{3: I perform a map that upper-cases the elements in each row}}
>  
> {quote}
> {code:java}
> Dataset<Row> edge2 = edge.map(new MyFunction2(), edgeEncoder);{code}
> {quote}
>  
> {quote}
> {code:java}
> public static class MyFunction2 implements MapFunction<Row, Row> {
>   private static final long serialVersionUID = 1L;
>   @Override
>   public Row call(Row v1) throws Exception {
>     String el1 = v1.get(0).toString().toUpperCase();
>     String el2 = v1.get(1).toString().toUpperCase();
>     return RowFactory.create(el1, el2);
>   }
> }
> {code}
> {quote}
> {{4: Then I show the dataset after map is performed}}
> {quote}
> {code:java}
> edge2.show();{code}
> {quote}
> {{And precisely here the log Loops saying }}
> {{21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources}}

[jira] [Commented] (SPARK-36874) Ambiguous Self-Join detected only on right dataframe

2021-09-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421937#comment-17421937
 ] 

Hyukjin Kwon commented on SPARK-36874:
--

cc [~cloud_fan] FYI

> Ambiguous Self-Join detected only on right dataframe
> 
>
> Key: SPARK-36874
> URL: https://issues.apache.org/jira/browse/SPARK-36874
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Vincent Doba
>Priority: Major
>  Labels: correctness
>
> When joining two dataframes that share the same lineage, where one dataframe 
> is a transformation of the other, ambiguous self-join detection only works 
> when the transformed dataframe is the right dataframe.
> For instance, given {{df1}} and {{df2}} where {{df2}} is a filtered {{df1}}, 
> ambiguous self-join detection only works when {{df2}} is the right dataframe:
> - {{df1.join(df2, ...)}} correctly fails with an ambiguous self-join error
> - {{df2.join(df1, ...)}} silently returns a dataframe instead of failing
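> For reference, a hedged sketch of the usual alias-based workaround (the same approach the analyzer error message below recommends), which disambiguates the join regardless of which side {{df2}} is on:
> {code:scala}
> df2.as("b").join(df1.as("a"), $"a.key1" === $"b.key2").show()
> {code}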
> h1. Minimum Reproducible example
> h2. Code
> {code:scala}
> import sparkSession.implicits._
> val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
> val df2 = df1.filter($"value" === "A2")
> df2.join(df1, df1("key1") === df2("key2")).show()
> {code}
> h2. Expected Result
> Throw the following exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Column 
> key2#11 are ambiguous. It's probably because you joined several Datasets 
> together, and some of these Datasets are the same. This column points to one 
> of the Datasets but Spark is unable to figure out which one. Please alias the 
> Datasets with different names via `Dataset.as` before joining them, and 
> specify the column using qualified name, e.g. `df.as("a").join(df.as("b"), 
> $"a.id" > $"b.id")`. You can also set 
> spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
>   at 
> org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
>   at 
> org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:43)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
>   at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>   at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>   at scala.collection.immutable.List.foldLeft(List.scala:91)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:196)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:190)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:174)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:173)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:143)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:143)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
>   at org.apache.spark.sql.SparkSession.wi

[jira] [Comment Edited] (SPARK-36860) Create the external hive table for HBase failed

2021-09-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421936#comment-17421936
 ] 

Hyukjin Kwon edited comment on SPARK-36860 at 9/29/21, 5:26 AM:


Unless it's explicitly documented, nothing is supported officially.


was (Author: hyukjin.kwon):
Unless it's explicitly documented, nothing is supported.

> Create the external hive table for HBase failed 
> 
>
> Key: SPARK-36860
> URL: https://issues.apache.org/jira/browse/SPARK-36860
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: wineternity
>Priority: Major
> Attachments: image-2021-09-27-14-18-10-910.png, 
> image-2021-09-27-14-25-28-900.png
>
>
> We use the following SQL to create a Hive external table, which reads from HBase:
> {code:java}
> CREATE EXTERNAL TABLE if not exists dev.sanyu_spotlight_headline_material(
>rowkey string COMMENT 'HBase primary key',
>content string COMMENT 'article body (text and images)')
> USING HIVE   
> ROW FORMAT SERDE
>'org.apache.hadoop.hive.hbase.HBaseSerDe'
>  STORED BY
>'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>  WITH SERDEPROPERTIES (
>'hbase.columns.mapping'=':key, cf1:content'
> )
>  TBLPROPERTIES (
>'hbase.table.name'='spotlight_headline_material'
>  );
> {code}
> But the SQL fails in Spark 3.1.2, which throws this exception:
> {code:java}
> 21/09/27 11:44:24 INFO scheduler.DAGScheduler: Asked to cancel job group 
> 26d7459f-7b58-4c18-9939-5f2737525ff2
> 21/09/27 11:44:24 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query with 26d7459f-7b58-4c18-9939-5f2737525ff2, currentState 
> RUNNING,
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: Unexpected combination of ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.hbase.HBaseSerDe' and STORED BY 
> 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'WITHSERDEPROPERTIES('hbase.columns.mapping'=':key,
>  cf1:content')(line 5, pos 0)
> {code}
> This check was introduced by this change: 
> [https://github.com/apache/spark/pull/28026]
>  
> Could anyone explain how to create an external table for HBase in Spark 3 
> now?






[jira] [Commented] (SPARK-36860) Create the external hive table for HBase failed

2021-09-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421936#comment-17421936
 ] 

Hyukjin Kwon commented on SPARK-36860:
--

Unless it's explicitly documented, nothing is supported.

> Create the external hive table for HBase failed 
> 
>
> Key: SPARK-36860
> URL: https://issues.apache.org/jira/browse/SPARK-36860
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: wineternity
>Priority: Major
> Attachments: image-2021-09-27-14-18-10-910.png, 
> image-2021-09-27-14-25-28-900.png
>
>
> We use the following SQL to create a Hive external table, which reads from HBase:
> {code:java}
> CREATE EXTERNAL TABLE if not exists dev.sanyu_spotlight_headline_material(
>rowkey string COMMENT 'HBase primary key',
>content string COMMENT 'article body (text and images)')
> USING HIVE   
> ROW FORMAT SERDE
>'org.apache.hadoop.hive.hbase.HBaseSerDe'
>  STORED BY
>'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>  WITH SERDEPROPERTIES (
>'hbase.columns.mapping'=':key, cf1:content'
> )
>  TBLPROPERTIES (
>'hbase.table.name'='spotlight_headline_material'
>  );
> {code}
> But the SQL fails in Spark 3.1.2, which throws this exception:
> {code:java}
> 21/09/27 11:44:24 INFO scheduler.DAGScheduler: Asked to cancel job group 
> 26d7459f-7b58-4c18-9939-5f2737525ff2
> 21/09/27 11:44:24 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query with 26d7459f-7b58-4c18-9939-5f2737525ff2, currentState 
> RUNNING,
> org.apache.spark.sql.catalyst.parser.ParseException:
> Operation not allowed: Unexpected combination of ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.hbase.HBaseSerDe' and STORED BY 
> 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'WITHSERDEPROPERTIES('hbase.columns.mapping'=':key,
>  cf1:content')(line 5, pos 0)
> {code}
> This check was introduced by this change: 
> [https://github.com/apache/spark/pull/28026]
>  
> Could anyone explain how to create an external table for HBase in Spark 3 
> now?






[jira] [Commented] (SPARK-36858) Spark API to apply same function to multiple columns

2021-09-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421935#comment-17421935
 ] 

Hyukjin Kwon commented on SPARK-36858:
--

Can't we simply do this in a for loop?

> Spark API to apply same function to multiple columns
> 
>
> Key: SPARK-36858
> URL: https://issues.apache.org/jira/browse/SPARK-36858
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.1.2
>Reporter: Armand BERGES
>Priority: Minor
>
> Hi,
> My team and I regularly need to apply the same function to multiple columns 
> at once.
> For example, we want to remove all non-alphanumeric characters from each 
> column of our dataframes.
> When we first hit this use case, some people on my team were using this kind 
> of code:
> {code:scala}
> val colListToClean = ... // Generate some list, could be very long.
> val dfToClean: DataFrame = ... // This is the dataframe we want to clean.
> def cleanFunction(colName: String): Column = ... // Some function that 
> manipulates a column based on its name.
> val dfCleaned = colListToClean.foldLeft(dfToClean)((df, colName) =>
>   df.withColumn(colName, cleanFunction(colName))){code}
> This kind of code, when applied to a large set of columns, overloaded our 
> driver (because a new DataFrame plan is generated for each column to clean).
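> For comparison, a hedged sketch of a single-projection alternative that builds one {{select}} instead of one new plan per column ({{dfToClean}}, {{colListToClean}} and {{cleanFunction}} are the same names as in the snippet above):
> {code:scala}
> import org.apache.spark.sql.functions.col
> 
> // One projection: cleaned columns replace the originals, other columns pass through unchanged.
> val projected = dfToClean.columns.map { c =>
>   if (colListToClean.contains(c)) cleanFunction(c).as(c) else col(c)
> }
> val dfCleaned = dfToClean.select(projected: _*)
> {code}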
> Based on this issue, we developed some code to add two functions : 
>  * One to apply the same function to multiple columns
>  * One to rename multiple columns based on a Map. 
>  
> I wonder if you have ever been asked to add this kind of API? If you have, 
> did you run into any issues regarding the implementation? If you haven't, is 
> this an idea you could add to Spark?
> Best regards, 
>  
> LvffY
>  
>  






[jira] [Commented] (SPARK-36888) Sha2 with bit_length 512 not being tested

2021-09-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421934#comment-17421934
 ] 

Hyukjin Kwon commented on SPARK-36888:
--

Please go ahead and add a test
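
A hedged sketch of what such a test could assert through the SQL function, without hard-coding a digest (SHA-512 yields a 128-character hex string, while 1024 remains a genuinely unsupported bit length):

{code:scala}
val sha512 = spark.sql("SELECT sha2('Spark', 512) AS h").head().getString(0)
assert(sha512 != null && sha512.length == 128)

// A truly unsupported bit length still returns null.
assert(spark.sql("SELECT sha2('Spark', 1024) AS h").head().isNullAt(0))
{code}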

> Sha2 with bit_length 512 not being tested
> -
>
> Key: SPARK-36888
> URL: https://issues.apache.org/jira/browse/SPARK-36888
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: H. Vetinari
>Priority: Major
>
> Looking at 
> [https://github.com/apache/spark/commit/6c6291b3f6ac13b8415b87b2b741a9cd95bc6c3b]
>  for https://issues.apache.org/jira/browse/SPARK-36836, it's clear that 512 
> bits are supported
> {{bitLength match {}}
>  {{[...]}}
>  {{  case 512 =>}}
>  {{    UTF8String.fromString(DigestUtils.sha512Hex(input))}}
> resp.
> {{nullSafeCodeGen(ctx, ev, (eval1, eval2) => {}}
>  {{    [...]}}
>  {{    else if ($eval2 == 512) {}}
>  {{   ${ev.value} =}}
>  {{ UTF8String.fromString($digestUtils.sha512Hex($eval1));}}
> but the test claims it is unsupported:
> {{// unsupported bit length}}
>  {{checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(1024)), 
> null)}}
>  {{checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(512)), 
> null)}}
> To avoid a similar fate as SPARK-36836, tests should be added.
>   
>  CC [~richardc-db]
>   






[jira] [Commented] (SPARK-36588) Use v2 commands by default

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421929#comment-17421929
 ] 

Apache Spark commented on SPARK-36588:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/34137

> Use v2 commands by default
> --
>
> Key: SPARK-36588
> URL: https://issues.apache.org/jira/browse/SPARK-36588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
> It's been a while since we introduced the v2 commands, and I think it's time 
> to use v2 commands by default even for the session catalog, with a legacy 
> config to fall back to the v1 commands.
> We can do this one command at a time, with tests for both the v1 and v2 
> versions. The tests should help us understand the behavior differences between 
> v1 and v2 commands, so that we can:
>  # fix the v2 commands to match the v1 behavior
>  # or accept the behavior difference and write a migration guide
> We can reuse the test framework built in 
> https://issues.apache.org/jira/browse/SPARK-33381
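
For illustration, a minimal ScalaTest sketch of the shared v1/v2 suite layout the 
description suggests (all names here are made up for the example and are not the 
actual framework from SPARK-33381):
{code:scala}
import org.scalatest.funsuite.AnyFunSuite

// Common tests live in a base trait; the v1 and v2 suites only supply the variant.
trait MyCommandSuiteBase extends AnyFunSuite {
  def catalogVersion: String  // "V1" or "V2", provided by the concrete suite

  test(s"basic command behavior ($catalogVersion)") {
    // shared assertions exercising the command would go here
  }
}

class MyCommandV1Suite extends MyCommandSuiteBase {
  override def catalogVersion: String = "V1"
}

class MyCommandV2Suite extends MyCommandSuiteBase {
  override def catalogVersion: String = "V2"
}
{code}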



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36588) Use v2 commands by default

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36588:


Assignee: (was: Apache Spark)

> Use v2 commands by default
> --
>
> Key: SPARK-36588
> URL: https://issues.apache.org/jira/browse/SPARK-36588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
> It's been a while since we introduced the v2 commands, and I think it's time 
> to use v2 commands by default even for the session catalog, with a legacy 
> config to fall back to the v1 commands.
> We can do this one command at a time, with tests for both the v1 and v2 
> versions. The tests should help us understand the behavior differences between 
> v1 and v2 commands, so that we can:
>  # fix the v2 commands to match the v1 behavior
>  # or accept the behavior difference and write a migration guide
> We can reuse the test framework built in 
> https://issues.apache.org/jira/browse/SPARK-33381



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36588) Use v2 commands by default

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36588:


Assignee: Apache Spark

> Use v2 commands by default
> --
>
> Key: SPARK-36588
> URL: https://issues.apache.org/jira/browse/SPARK-36588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>
> It's been a while since we introduced the v2 commands, and I think it's time 
> to use v2 commands by default even for the session catalog, with a legacy 
> config to fall back to the v1 commands.
> We can do this one command at a time, with tests for both the v1 and v2 
> versions. The tests should help us understand the behavior differences between 
> v1 and v2 commands, so that we can:
>  # fix the v2 commands to match the v1 behavior
>  # or accept the behavior difference and write a migration guide
> We can reuse the test framework built in 
> https://issues.apache.org/jira/browse/SPARK-33381



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421928#comment-17421928
 ] 

Hyukjin Kwon commented on SPARK-36883:
--

Nice!

> Upgrade R version to 4.1.1 in CI images
> ---
>
> Key: SPARK-36883
> URL: https://issues.apache.org/jira/browse/SPARK-36883
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.
> R 4.1.1 has been released. We should test the latest version of R with 
> SparkR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421925#comment-17421925
 ] 

Dongjoon Hyun commented on SPARK-36883:
---

Oh, it seems to have been fixed a few minutes ago.

> Upgrade R version to 4.1.1 in CI images
> ---
>
> Key: SPARK-36883
> URL: https://issues.apache.org/jira/browse/SPARK-36883
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.
> R 4.1.1 has been released. We should test the latest version of R with 
> SparkR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36888) Sha2 with bit_length 512 not being tested

2021-09-28 Thread H. Vetinari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

H. Vetinari updated SPARK-36888:

Description: 
Looking at 
[https://github.com/apache/spark/commit/6c6291b3f6ac13b8415b87b2b741a9cd95bc6c3b]
 for https://issues.apache.org/jira/browse/SPARK-36836, it's clear that 512 
bits are supported

{{bitLength match {}}
 {{[...]}}
 {{  case 512 =>}}
 {{    UTF8String.fromString(DigestUtils.sha512Hex(input))}}

resp.

{{nullSafeCodeGen(ctx, ev, (eval1, eval2) => {}}
 {{    [...]}}
 {{    else if ($eval2 == 512) {}}
 {{   ${ev.value} =}}
 {{ UTF8String.fromString($digestUtils.sha512Hex($eval1));}}

but the test claims it is unsupported:

{{// unsupported bit length}}
 {{checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(1024)), 
null)}}
 {{checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(512)), null)}}

To avoid a similar fate as SPARK-36836, tests should be added.
  
 CC [~richardc-db]
  

  was:
Looking at 
[https://github.com/apache/spark/commit/6c6291b3f6ac13b8415b87b2b741a9cd95bc6c3b]
 for https://issues.apache.org/jira/browse/SPARK-36836, it's clear that 512 
bits are supported

{{bitLength match}}
{{[...]}}
{{  case 512 =>}}
{{    UTF8String.fromString(DigestUtils.sha512Hex(input))}}

resp.

{{nullSafeCodeGen(ctx, ev, (eval1, eval2) =>}}
{{    [...]}}
{{    else if ($eval2 == 512) {}}
{{   ${ev.value} =}}
{{ UTF8String.fromString($digestUtils.sha512Hex($eval1));}}

but the test claims it is unsupported:

{{// unsupported bit length}}
{{checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(1024)), null)}}
{{checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(512)), null)}}

To avoid a similar fate as SPARK-36836, tests should be added.
  
 CC [~richardc-db]
  


> Sha2 with bit_length 512 not being tested
> -
>
> Key: SPARK-36888
> URL: https://issues.apache.org/jira/browse/SPARK-36888
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: H. Vetinari
>Priority: Major
>
> Looking at 
> [https://github.com/apache/spark/commit/6c6291b3f6ac13b8415b87b2b741a9cd95bc6c3b]
>  for https://issues.apache.org/jira/browse/SPARK-36836, it's clear that 512 
> bits are supported
> {{bitLength match {}}
>  {{[...]}}
>  {{  case 512 =>}}
>  {{    UTF8String.fromString(DigestUtils.sha512Hex(input))}}
> resp.
> {{nullSafeCodeGen(ctx, ev, (eval1, eval2) => {}}
>  {{    [...]}}
>  {{    else if ($eval2 == 512) {}}
>  {{   ${ev.value} =}}
>  {{ UTF8String.fromString($digestUtils.sha512Hex($eval1));}}
> but the test claims it is unsupported:
> {{// unsupported bit length}}
>  {{checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(1024)), 
> null)}}
>  {{checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(512)), 
> null)}}
> To avoid a similar fate as SPARK-36836, tests should be added.
>   
>  CC [~richardc-db]
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36888) Sha2 with bit_length 512 not being tested

2021-09-28 Thread H. Vetinari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

H. Vetinari updated SPARK-36888:

Description: 
Looking at 
[https://github.com/apache/spark/commit/6c6291b3f6ac13b8415b87b2b741a9cd95bc6c3b]
 for https://issues.apache.org/jira/browse/SPARK-36836, it's clear that 512 
bits are supported

{{bitLength match}}
{{[...]}}
{{  case 512 =>}}
{{    UTF8String.fromString(DigestUtils.sha512Hex(input))}}

resp.

{{nullSafeCodeGen(ctx, ev, (eval1, eval2) =>}}
{{    [...]}}
{{    else if ($eval2 == 512) {}}
{{   ${ev.value} =}}
{{ UTF8String.fromString($digestUtils.sha512Hex($eval1));}}

but the test claims it is unsupported:

{{// unsupported bit length}}
{{checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(1024)), null)}}
{{checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(512)), null)}}

To avoid a similar fate as SPARK-36836, tests should be added.
  
 CC [~richardc-db]
  

  was:
Looking at 
[https://github.com/apache/spark/commit/6c6291b3f6ac13b8415b87b2b741a9cd95bc6c3b]
 for https://issues.apache.org/jira/browse/SPARK-36836, it's clear that 512 
bits are supported

{{bitLength match {}}
 {{[...]
   case 512 =>}}
 {{    UTF8String.fromString(DigestUtils.sha512Hex(input))}}

resp.

{{nullSafeCodeGen(ctx, ev, (eval1, eval2) => }}{{{}}
 {{  }}{{s"""}}
 {{    }}{{[...]}}
 {{    else if ($eval2 == 512) {}}
 {{   ${ev.value} =}}
 {{ UTF8String.fromString($digestUtils.sha512Hex($eval1));}}

but the test claims it is unsupported:

{{// unsupported bit length}}
{{ checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(1024)), 
null)}}
{{ checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(512)), null)}}

To avoid a similar fate as SPARK-36836, tests should be added.
 
CC [~richardc-db]
 


> Sha2 with bit_length 512 not being tested
> -
>
> Key: SPARK-36888
> URL: https://issues.apache.org/jira/browse/SPARK-36888
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: H. Vetinari
>Priority: Major
>
> Looking at 
> [https://github.com/apache/spark/commit/6c6291b3f6ac13b8415b87b2b741a9cd95bc6c3b]
>  for https://issues.apache.org/jira/browse/SPARK-36836, it's clear that 512 
> bits are supported
> {{bitLength match}}
> {{[...]}}
> {{  case 512 =>}}
> {{    UTF8String.fromString(DigestUtils.sha512Hex(input))}}
> resp.
> {{nullSafeCodeGen(ctx, ev, (eval1, eval2) =>}}
> {{    [...]}}
> {{    else if ($eval2 == 512) {}}
> {{   ${ev.value} =}}
> {{ UTF8String.fromString($digestUtils.sha512Hex($eval1));}}
> but the test claims it is unsupported:
> {{// unsupported bit length}}
> {{checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(1024)), 
> null)}}
> {{checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(512)), 
> null)}}
> To avoid a similar fate as SPARK-36836, tests should be added.
>   
>  CC [~richardc-db]
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36888) Sha2 with bit_length 512 not being tested

2021-09-28 Thread H. Vetinari (Jira)
H. Vetinari created SPARK-36888:
---

 Summary: Sha2 with bit_length 512 not being tested
 Key: SPARK-36888
 URL: https://issues.apache.org/jira/browse/SPARK-36888
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.0
Reporter: H. Vetinari


Looking at 
[https://github.com/apache/spark/commit/6c6291b3f6ac13b8415b87b2b741a9cd95bc6c3b]
 for https://issues.apache.org/jira/browse/SPARK-36836, it's clear that 512 
bits are supported

{{bitLength match {}}
 {{[...]
   case 512 =>}}
 {{    UTF8String.fromString(DigestUtils.sha512Hex(input))}}

resp.

{{nullSafeCodeGen(ctx, ev, (eval1, eval2) => }}{{{}}
 {{  }}{{s"""}}
 {{    }}{{[...]}}
 {{    else if ($eval2 == 512) {}}
 {{   ${ev.value} =}}
 {{ UTF8String.fromString($digestUtils.sha512Hex($eval1));}}

but the test claims it is unsupported:

{{// unsupported bit length}}
{{ checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(1024)), 
null)}}
{{ checkEvaluation(Sha2(Literal.create(null, BinaryType), Literal(512)), null)}}

To avoid a similar fate as SPARK-36836, tests should be added.
 
CC [~richardc-db]
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421921#comment-17421921
 ] 

Dongjoon Hyun commented on SPARK-36883:
---

BTW, [~hyukjin.kwon], the CRAN mirror is out of sync as of today. It may take 
some time because we need to wait for it to recover.
{code}
Get:4 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ Packages [46.9 
kB]
Err:4 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ Packages
  File has unexpected size (47211 != 46867). Mirror sync in progress? [IP: 
65.8.158.74 443]
  Hashes of expected file:
   - Filesize:46867 [weak]
   - 
SHA512:731daccdf2e1f3edd55db65f90ed45fa58009f675477662c8be01f07eb3544611b0e1226c4183b4d1dd1adf753094ff86b5dd1d707ba19c6285c9b64f202fd0a
   - SHA256:041e4b8fe49f5bab84ff5a0b2d424a9478fb35b6e0767879c9a758f498d32024
   - MD5Sum:3665a6fdae98e9d262f4641fa4465df9 [weak]
  Release file created at: Sun, 26 Sep 2021 01:09:15 +
{code}

> Upgrade R version to 4.1.1 in CI images
> ---
>
> Key: SPARK-36883
> URL: https://issues.apache.org/jira/browse/SPARK-36883
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.
> R 4.1.1 has been released. We should test the latest version of R with 
> SparkR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36886) Inline type hints for python/pyspark/sql/context.py

2021-09-28 Thread dgd_contributor (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dgd_contributor updated SPARK-36886:

Description: Inline type hints for python/pyspark/sql/context.py from 
Inline type hints for python/pyspark/sql/context.pyi.  (was: Inline type hints 
for python/pyspark/sql/column.py from Inline type hints for 
python/pyspark/sql/column.pyi.)

> Inline type hints for python/pyspark/sql/context.py
> ---
>
> Key: SPARK-36886
> URL: https://issues.apache.org/jira/browse/SPARK-36886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dgd_contributor
>Priority: Major
>
> Inline type hints for python/pyspark/sql/context.py from Inline type hints 
> for python/pyspark/sql/context.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36886) Inline type hints for python/pyspark/sql/context.py

2021-09-28 Thread dgd_contributor (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dgd_contributor updated SPARK-36886:

Summary: Inline type hints for python/pyspark/sql/context.py  (was: Inline 
type hints for python/pyspark/sql/column.py)

> Inline type hints for python/pyspark/sql/context.py
> ---
>
> Key: SPARK-36886
> URL: https://issues.apache.org/jira/browse/SPARK-36886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dgd_contributor
>Priority: Major
>
> Inline type hints for python/pyspark/sql/column.py from Inline type hints for 
> python/pyspark/sql/column.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421908#comment-17421908
 ] 

Dongjoon Hyun commented on SPARK-36883:
---

Let me update the image.

> Upgrade R version to 4.1.1 in CI images
> ---
>
> Key: SPARK-36883
> URL: https://issues.apache.org/jira/browse/SPARK-36883
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.
> R 4.1.1 has been released. We should test the latest version of R with 
> SparkR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421907#comment-17421907
 ] 

Dongjoon Hyun commented on SPARK-36883:
---

Thank you, [~hyukjin.kwon].

> Upgrade R version to 4.1.1 in CI images
> ---
>
> Key: SPARK-36883
> URL: https://issues.apache.org/jira/browse/SPARK-36883
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.
> R 4.1.1 has been released. We should test the latest version of R with 
> SparkR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36526) Add supportsIndex interface

2021-09-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36526:
---

Assignee: Huaxin Gao

> Add supportsIndex interface
> ---
>
> Key: SPARK-36526
> URL: https://issues.apache.org/jira/browse/SPARK-36526
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Add supportsIndex interface with the following APIs:
> * createIndex
> * deleteIndex
> * indexExists
> * listIndexes
> * dropIndex
> * restoreIndex
> * refreshIndex
> * alterIndex



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36526) Add supportsIndex interface

2021-09-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36526.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33754
[https://github.com/apache/spark/pull/33754]

> Add supportsIndex interface
> ---
>
> Key: SPARK-36526
> URL: https://issues.apache.org/jira/browse/SPARK-36526
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>
> Add supportsIndex interface with the following APIs:
> * createIndex
> * deleteIndex
> * indexExists
> * listIndexes
> * dropIndex
> * restoreIndex
> * refreshIndex
> * alterIndex



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36884) Inline type hints for python/pyspark/sql/session.py

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36884:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/sql/session.py
> ---
>
> Key: SPARK-36884
> URL: https://issues.apache.org/jira/browse/SPARK-36884
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Inline type hints for python/pyspark/sql/session.py from Inline type hints 
> for python/pyspark/sql/session.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36884) Inline type hints for python/pyspark/sql/session.py

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36884:


Assignee: Apache Spark

> Inline type hints for python/pyspark/sql/session.py
> ---
>
> Key: SPARK-36884
> URL: https://issues.apache.org/jira/browse/SPARK-36884
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints for python/pyspark/sql/session.py from Inline type hints 
> for python/pyspark/sql/session.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36884) Inline type hints for python/pyspark/sql/session.py

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421898#comment-17421898
 ] 

Apache Spark commented on SPARK-36884:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/34136

> Inline type hints for python/pyspark/sql/session.py
> ---
>
> Key: SPARK-36884
> URL: https://issues.apache.org/jira/browse/SPARK-36884
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Inline type hints for python/pyspark/sql/session.py from Inline type hints 
> for python/pyspark/sql/session.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36849) Migrate UseStatement to v2 command framework

2021-09-28 Thread dohongdayi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421888#comment-17421888
 ] 

dohongdayi commented on SPARK-36849:


[~huaxingao] Thanks a lot!

> Migrate UseStatement to v2 command framework
> 
>
> Key: SPARK-36849
> URL: https://issues.apache.org/jira/browse/SPARK-36849
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36887) Inline type hints for python/pyspark/sql/conf.py

2021-09-28 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421880#comment-17421880
 ] 

dgd_contributor commented on SPARK-36887:
-

I'm working on this.

> Inline type hints for python/pyspark/sql/conf.py
> 
>
> Key: SPARK-36887
> URL: https://issues.apache.org/jira/browse/SPARK-36887
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dgd_contributor
>Priority: Major
>
> Inline type hints for python/pyspark/sql/conf.py from Inline type hints 
> for python/pyspark/sql/conf.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36887) Inline type hints for python/pyspark/sql/conf.py

2021-09-28 Thread dgd_contributor (Jira)
dgd_contributor created SPARK-36887:
---

 Summary: Inline type hints for python/pyspark/sql/conf.py
 Key: SPARK-36887
 URL: https://issues.apache.org/jira/browse/SPARK-36887
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: dgd_contributor


Inline type hints for python/pyspark/sql/conf.py from Inline type hints for 
python/pyspark/sql/conf.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36886) Inline type hints for python/pyspark/sql/column.py

2021-09-28 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421878#comment-17421878
 ] 

dgd_contributor commented on SPARK-36886:
-

Working on this.

> Inline type hints for python/pyspark/sql/column.py
> --
>
> Key: SPARK-36886
> URL: https://issues.apache.org/jira/browse/SPARK-36886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dgd_contributor
>Priority: Major
>
> Inline type hints for python/pyspark/sql/column.py from Inline type hints for 
> python/pyspark/sql/column.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36886) Inline type hints for python/pyspark/sql/column.py

2021-09-28 Thread dgd_contributor (Jira)
dgd_contributor created SPARK-36886:
---

 Summary: Inline type hints for python/pyspark/sql/column.py
 Key: SPARK-36886
 URL: https://issues.apache.org/jira/browse/SPARK-36886
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: dgd_contributor


Inline type hints for python/pyspark/sql/column.py from Inline type hints for 
python/pyspark/sql/column.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36845) Inline type hint files

2021-09-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421865#comment-17421865
 ] 

Hyukjin Kwon commented on SPARK-36845:
--

please go ahead.

> Inline type hint files
> --
>
> Key: SPARK-36845
> URL: https://issues.apache.org/jira/browse/SPARK-36845
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Currently there are type hint stub files ({{*.pyi}}) to show the expected 
> types for functions, but we can also take advantage of static type checking 
> within the functions by inlining the type hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36885) Inline type hints for python/pyspark/sql/dataframe.py

2021-09-28 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36885:
-

 Summary: Inline type hints for python/pyspark/sql/dataframe.py
 Key: SPARK-36885
 URL: https://issues.apache.org/jira/browse/SPARK-36885
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Takuya Ueshin


Inline type hints for python/pyspark/sql/dataframe.py from Inline type hints 
for python/pyspark/sql/dataframe.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36885) Inline type hints for python/pyspark/sql/dataframe.py

2021-09-28 Thread Takuya Ueshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421863#comment-17421863
 ] 

Takuya Ueshin commented on SPARK-36885:
---

I'm working on this.

> Inline type hints for python/pyspark/sql/dataframe.py
> -
>
> Key: SPARK-36885
> URL: https://issues.apache.org/jira/browse/SPARK-36885
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Inline type hints for python/pyspark/sql/dataframe.py from Inline type hints 
> for python/pyspark/sql/dataframe.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36884) Inline type hints for python/pyspark/sql/session.py

2021-09-28 Thread Takuya Ueshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421861#comment-17421861
 ] 

Takuya Ueshin commented on SPARK-36884:
---

I'm working on this.

> Inline type hints for python/pyspark/sql/session.py
> ---
>
> Key: SPARK-36884
> URL: https://issues.apache.org/jira/browse/SPARK-36884
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Inline type hints for python/pyspark/sql/session.py from Inline type hints 
> for python/pyspark/sql/session.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36884) Inline type hints for python/pyspark/sql/session.py

2021-09-28 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36884:
-

 Summary: Inline type hints for python/pyspark/sql/session.py
 Key: SPARK-36884
 URL: https://issues.apache.org/jira/browse/SPARK-36884
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Takuya Ueshin


Inline type hints for python/pyspark/sql/session.py from Inline type hints for 
python/pyspark/sql/session.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36845) Inline type hint files

2021-09-28 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421859#comment-17421859
 ] 

dgd_contributor commented on SPARK-36845:
-

Can I work on this? :D

> Inline type hint files
> --
>
> Key: SPARK-36845
> URL: https://issues.apache.org/jira/browse/SPARK-36845
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Currently there are type hint stub files ({{*.pyi}}) to show the expected 
> types for functions, but we can also take advantage of static type checking 
> within the functions by inlining the type hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36882) Support ILIKE API on Python

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421858#comment-17421858
 ] 

Apache Spark commented on SPARK-36882:
--

User 'yoda-mon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34135

> Support ILIKE API on Python
> ---
>
> Key: SPARK-36882
> URL: https://issues.apache.org/jira/browse/SPARK-36882
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Priority: Major
>
> Support the ILIKE (case-insensitive LIKE) API in Python



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36882) Support ILIKE API on Python

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36882:


Assignee: (was: Apache Spark)

> Support ILIKE API on Python
> ---
>
> Key: SPARK-36882
> URL: https://issues.apache.org/jira/browse/SPARK-36882
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Priority: Major
>
> Support the ILIKE (case-insensitive LIKE) API in Python



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36882) Support ILIKE API on Python

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36882:


Assignee: Apache Spark

> Support ILIKE API on Python
> ---
>
> Key: SPARK-36882
> URL: https://issues.apache.org/jira/browse/SPARK-36882
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Assignee: Apache Spark
>Priority: Major
>
> Support the ILIKE (case-insensitive LIKE) API in Python



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421854#comment-17421854
 ] 

Hyukjin Kwon commented on SPARK-36883:
--

[~dongjoon], FYI. BTW, I think this could fix the linter issue currently going on 
in CI, e.g. https://github.com/apache/spark/runs/3738920366

> Upgrade R version to 4.1.1 in CI images
> ---
>
> Key: SPARK-36883
> URL: https://issues.apache.org/jira/browse/SPARK-36883
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.
> R 4.1.1 has been released. We should test the latest version of R with 
> SparkR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-36883:


 Summary: Upgrade R version to 4.1.1 in CI images
 Key: SPARK-36883
 URL: https://issues.apache.org/jira/browse/SPARK-36883
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.

R 4.1.1 has been released. We should test the latest version of R with 
SparkR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36882) Support ILIKE API on Python

2021-09-28 Thread Leona Yoda (Jira)
Leona Yoda created SPARK-36882:
--

 Summary: Support ILIKE API on Python
 Key: SPARK-36882
 URL: https://issues.apache.org/jira/browse/SPARK-36882
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Leona Yoda


Support the ILIKE (case-insensitive LIKE) API in Python
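
For context, a minimal illustration of the intended semantics (the table and column 
names are made up; this assumes the SQL-side ILIKE support that the Python API is 
meant to expose):
{code:scala}
// Assumes an active SparkSession named `spark` and a registered `people` table.
// ILIKE matches case-insensitively, so 'Alice', 'ALICE' and 'alice' all match 'al%'.
val matched = spark.sql("SELECT name FROM people WHERE name ILIKE 'al%'")
{code}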



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36752) Support ILIKE by Scala/Java, PySpark and R APIs

2021-09-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421852#comment-17421852
 ] 

Hyukjin Kwon commented on SPARK-36752:
--

It's no problem at all. I was just checking :-).

> Support ILIKE by Scala/Java, PySpark and R APIs
> ---
>
> Key: SPARK-36752
> URL: https://issues.apache.org/jira/browse/SPARK-36752
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Add the ilike function to Scala/Java, Python and R APIs, update docs and 
> examples.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36752) Support ILIKE by Scala/Java, PySpark and R APIs

2021-09-28 Thread Leona Yoda (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421847#comment-17421847
 ] 

Leona Yoda commented on SPARK-36752:


[~hyukjin.kwon] I'm sorry for worrying you (I was on vacation). I will create 
a PR for Python today and one for R soon.

> Support ILIKE by Scala/Java, PySpark and R APIs
> ---
>
> Key: SPARK-36752
> URL: https://issues.apache.org/jira/browse/SPARK-36752
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Add the ilike function to Scala/Java, Python and R APIs, update docs and 
> examples.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36849) Migrate UseStatement to v2 command framework

2021-09-28 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421845#comment-17421845
 ] 

Huaxin Gao commented on SPARK-36849:


I just started the test

> Migrate UseStatement to v2 command framework
> 
>
> Key: SPARK-36849
> URL: https://issues.apache.org/jira/browse/SPARK-36849
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36849) Migrate UseStatement to v2 command framework

2021-09-28 Thread dohongdayi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421843#comment-17421843
 ] 

dohongdayi commented on SPARK-36849:


Hi [~huaxingao], I'm not sure why SparkQA didn't run automatically. Could you 
please advise?

Thanks

> Migrate UseStatement to v2 command framework
> 
>
> Key: SPARK-36849
> URL: https://issues.apache.org/jira/browse/SPARK-36849
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36813) Implement ps.merge_asof

2021-09-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36813.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34053
[https://github.com/apache/spark/pull/34053]

> Implement ps.merge_asof
> ---
>
> Key: SPARK-36813
> URL: https://issues.apache.org/jira/browse/SPARK-36813
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36813) Implement ps.merge_asof

2021-09-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36813:


Assignee: Takuya Ueshin

> Implement ps.merge_asof
> ---
>
> Key: SPARK-36813
> URL: https://issues.apache.org/jira/browse/SPARK-36813
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder

2021-09-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36846.
--
Fix Version/s: 3.3.0
 Assignee: Takuya Ueshin
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/34101

> Inline most of type hint files under pyspark/sql/pandas folder
> --
>
> Key: SPARK-36846
> URL: https://issues.apache.org/jira/browse/SPARK-36846
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hint files under {{pyspark/sql/pandas}} folder, except for 
> {{pyspark/sql/pandas/functions.pyi}} and files under 
> {{pyspark/sql/pandas/_typing}}.
>  * Since that file contains a lot of overloads, we should revisit and manage 
> it separately.
>  * We can't inline the files under {{pyspark/sql/pandas/_typing}} because they 
> use newer type-hint syntax.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36664) Log time spent waiting for cluster resources

2021-09-28 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421824#comment-17421824
 ] 

John Zhuge commented on SPARK-36664:


This can be very useful. For example, we'd like to track how long YARN jobs are 
stuck in the ACCEPTED state.

> Log time spent waiting for cluster resources
> 
>
> Key: SPARK-36664
> URL: https://issues.apache.org/jira/browse/SPARK-36664
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Holden Karau
>Priority: Major
>
> To provide better visibility into why jobs might be running slowly, it would be 
> useful to log when we are waiting for cluster resources and for how long, so 
> that the user is aware of any underlying cluster issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36862) ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2021-09-28 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421823#comment-17421823
 ] 

Jungtaek Lim commented on SPARK-36862:
--

No, I meant you're encouraged to paste the "generated" code in the log. There is 
less to worry about, as it's quite hard to infer the actual business logic from 
generated code.

> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java'
> -
>
> Key: SPARK-36862
> URL: https://issues.apache.org/jira/browse/SPARK-36862
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 3.1.1
> Environment: Spark 3.1.1 and Spark 3.1.2
> hadoop 3.2.1
>Reporter: Magdalena Pilawska
>Priority: Major
>
> Hi,
> I am getting the following error when running a spark-submit command:
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 321, Column 103: ')' expected instead of '['
>  
> It fails when running a Spark SQL command on Delta Lake: 
> spark.sql(sqlTransformation)
> The template of sqlTransformation is as follows:
> MERGE INTO target_table AS d
>  USING source_table AS s 
>  on s.id = d.id
>  WHEN MATCHED AND d.hash_value <> s.hash_value
>  THEN UPDATE SET d.name =s.name, d.address = s.address
>  
> It is a permanent error for both *Spark 3.1.1* and *Spark 3.1.2*.
>  
> The same works fine with Spark 3.0.0.
>  
> Here is the full log:
> 2021-09-22 16:43:22,110 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 55, Column 103: ')' expected instead of '['2021-09-22 16:43:22,110 ERROR 
> CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 55, Column 103: ')' expected instead of 
> '['org.codehaus.commons.compiler.CompileException: File 'generated.java', 
> Line 55, Column 103: ')' expected instead of '[' at 
> org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
>  at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:150) at 
> org.codehaus.janino.Parser.read(Parser.java:3703) at 
> org.codehaus.janino.Parser.parseFormalParameters(Parser.java:1622) at 
> org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:1518) at 
> org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:1028) at 
> org.codehaus.janino.Parser.parseClassBody(Parser.java:841) at 
> org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:736) at 
> org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:941) at 
> org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:234) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:205) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1427)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1524)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1521)
>  at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) 
> at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1375)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:721)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181) at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:160)

[jira] [Resolved] (SPARK-36878) Optimization in PushDownPredicates to push all filters in a single iteration has broken some optimizations in PruneFilter rule

2021-09-28 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif resolved SPARK-36878.
--
Resolution: Not A Bug

Following further input from a colleague, who pointed out that some other rule 
might be handling this case, it turns out that the BooleanSimplification rule, 
which runs before PruneFilters, reduces the composite condition to a single 
condition (by pruning unnecessary predicates). As a result, the current 
PruneFilters code is sufficient; it does not have to handle composite conditions 
containing nulls, false, or true.
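
A quick way to see this combined effect from a spark-shell (the column name and 
values are illustrative; {{spark}} is an active SparkSession):
{code:scala}
import org.apache.spark.sql.functions._

val df = spark.range(10).withColumnRenamed("id", "a")
// BooleanSimplification reduces (a > 1) AND false to false, and PruneFilters
// then replaces the whole filter with an empty LocalRelation in the optimized plan.
df.filter(col("a") > 1 && lit(false)).explain(true)
{code}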


> Optimization in PushDownPredicates to push all filters in a single iteration 
> has broken  some optimizations in PruneFilter rule
> ---
>
> Key: SPARK-36878
> URL: https://issues.apache.org/jira/browse/SPARK-36878
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Asif
>Priority: Major
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It appears that the optimization in the PushDownPredicates rule, which pushes 
> all filters in a single pass to reduce iterations, has broken the PruneFilters 
> rule's ability to substitute an empty relation when the filter condition is 
> composite and statically evaluates to false, either because one of the 
> non-redundant predicates is Literal(false) or because all of the non-redundant 
> predicates are null.
> The new PushDownPredicates rule is created by chaining CombineFilters, 
> PushPredicateThroughNonJoin and PushPredicateThroughJoin, so individual filters 
> get combined into a single filter while being pushed.
> But the PruneFilters rule does not substitute an empty relation if the filter 
> is composite; it is coded to handle single predicates.
> The test is falsely passing because it tests PushPredicateThroughNonJoin, which 
> does not combine filters, while the actual rule in action has the combined 
> effect produced by CombineFilters.
> In fact, I believe all the other tests that individually test 
> PushPredicateThroughNonJoin or PushPredicateThroughJoin should be corrected 
> (maybe with the PushDownPredicates rule) and re-tested.
> I will add a bug test and open a PR.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36881) Inline type hints for python/pyspark/sql/catalog.py

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421817#comment-17421817
 ] 

Apache Spark commented on SPARK-36881:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34133

> Inline type hints for python/pyspark/sql/catalog.py
> ---
>
> Key: SPARK-36881
> URL: https://issues.apache.org/jira/browse/SPARK-36881
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints for python/pyspark/sql/catalog.py from Inline type hints 
> for python/pyspark/sql/catalog.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36881) Inline type hints for python/pyspark/sql/catalog.py

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36881:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/sql/catalog.py
> ---
>
> Key: SPARK-36881
> URL: https://issues.apache.org/jira/browse/SPARK-36881
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints for python/pyspark/sql/catalog.py from Inline type hints 
> for python/pyspark/sql/catalog.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36881) Inline type hints for python/pyspark/sql/catalog.py

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421816#comment-17421816
 ] 

Apache Spark commented on SPARK-36881:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34133

> Inline type hints for python/pyspark/sql/catalog.py
> ---
>
> Key: SPARK-36881
> URL: https://issues.apache.org/jira/browse/SPARK-36881
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints for python/pyspark/sql/catalog.py from Inline type hints 
> for python/pyspark/sql/catalog.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36881) Inline type hints for python/pyspark/sql/catalog.py

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36881:


Assignee: Apache Spark

> Inline type hints for python/pyspark/sql/catalog.py
> ---
>
> Key: SPARK-36881
> URL: https://issues.apache.org/jira/browse/SPARK-36881
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints for python/pyspark/sql/catalog.py from 
> python/pyspark/sql/catalog.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36881) Inline type hints for python/pyspark/sql/catalog.py

2021-09-28 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36881:


 Summary: Inline type hints for python/pyspark/sql/catalog.py
 Key: SPARK-36881
 URL: https://issues.apache.org/jira/browse/SPARK-36881
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Inline type hints for python/pyspark/sql/catalog.py from 
python/pyspark/sql/catalog.pyi.
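
As a rough illustration (a simplified, hypothetical sketch, not the exact 
upstream signatures), the migration moves annotations from the stub file into 
the .py module itself:
{code:python}
# Hypothetical, simplified sketch of inlining type hints.
# Before: annotations lived only in the stub python/pyspark/sql/catalog.pyi,
#   e.g.  def listDatabases(self) -> List[Database]: ...
# After: the same annotations are written inline in catalog.py.
from typing import List, NamedTuple


class Database(NamedTuple):
    name: str
    description: str
    locationUri: str


class Catalog:
    def listDatabases(self) -> List[Database]:
        """Return a list of databases available in this session."""
        ...
{code}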



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35874) AQE Shuffle should wait for its subqueries to finish before materializing

2021-09-28 Thread Shardul Mahadik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421780#comment-17421780
 ] 

Shardul Mahadik commented on SPARK-35874:
-

[~dongjoon] Should this be linked in SPARK-33828?

> AQE Shuffle should wait for its subqueries to finish before materializing
> -
>
> Key: SPARK-35874
> URL: https://issues.apache.org/jira/browse/SPARK-35874
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36871) Migrate CreateViewStatement to v2 command

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421779#comment-17421779
 ] 

Apache Spark commented on SPARK-36871:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34131

> Migrate CreateViewStatement to v2 command
> -
>
> Key: SPARK-36871
> URL: https://issues.apache.org/jira/browse/SPARK-36871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36871) Migrate CreateViewStatement to v2 command

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36871:


Assignee: Apache Spark

> Migrate CreateViewStatement to v2 command
> -
>
> Key: SPARK-36871
> URL: https://issues.apache.org/jira/browse/SPARK-36871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36871) Migrate CreateViewStatement to v2 command

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421778#comment-17421778
 ] 

Apache Spark commented on SPARK-36871:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34131

> Migrate CreateViewStatement to v2 command
> -
>
> Key: SPARK-36871
> URL: https://issues.apache.org/jira/browse/SPARK-36871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36871) Migrate CreateViewStatement to v2 command

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36871:


Assignee: (was: Apache Spark)

> Migrate CreateViewStatement to v2 command
> -
>
> Key: SPARK-36871
> URL: https://issues.apache.org/jira/browse/SPARK-36871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-36869) Spark job fails due to java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef; local class incompatible

2021-09-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-36869.
-

> Spark job fails due to java.io.InvalidClassException: 
> scala.collection.mutable.WrappedArray$ofRef; local class incompatible
> ---
>
> Key: SPARK-36869
> URL: https://issues.apache.org/jira/browse/SPARK-36869
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.2
> Environment: * RHEL 8.4
>  * Java 11.0.12
>  * Spark 3.1.2 (only prebuilt with *2.12.10)*
>  * Scala *2.12.14* for the application code
>Reporter: Hamid EL MAAZOUZ
>Priority: Blocker
>  Labels: scala, serialization, spark
>
> This is a Scala problem. It has already been reported here 
> [https://github.com/scala/bug/issues/5046] and a fix has been merged here 
> [https://github.com/scala/scala/pull/9166.|https://github.com/scala/scala/pull/9166]
> According to 
> [https://github.com/scala/bug/issues/5046#issuecomment-928108088], the *fix* 
> is available on *Scala 2.12.14*, but *Spark 3.0+* is only pre-built with 
> Scala *2.12.10*.
>  
>  * Stacktrace of the failure: (Taken from stderr of a worker process)
> {code:java}
> Spark Executor Command: "/usr/java/jdk-11.0.12/bin/java" "-cp" 
> "/opt/apache/spark-3.1.2-bin-hadoop3.2/conf/:/opt/apache/spark-3.1.2-bin-hadoop3.2/jars/*"
>  "-Xmx1024M" "-Dspark.driver.port=45887" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.191:45887" "--executor-id" "0" 
> "--hostname" "192.168.0.191" "--cores" "12" "--app-id" 
> "app-20210927231035-" "--worker-url" "spark://Worker@192.168.0.191:35261"
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 21/09/27 23:10:36 INFO CoarseGrainedExecutorBackend: Started daemon with 
> process name: 18957@localhost
> 21/09/27 23:10:36 INFO SignalUtils: Registering signal handler for TERM
> 21/09/27 23:10:36 INFO SignalUtils: Registering signal handler for HUP
> 21/09/27 23:10:36 INFO SignalUtils: Registering signal handler for INT
> 21/09/27 23:10:36 WARN Utils: Your hostname, localhost resolves to a loopback 
> address: 127.0.0.1; using 192.168.0.191 instead (on interface wlp82s0)
> 21/09/27 23:10:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/opt/apache/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) 
> to constructor java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 21/09/27 23:10:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 21/09/27 23:10:36 INFO SecurityManager: Changing view acls to: hamidelmaazouz
> 21/09/27 23:10:36 INFO SecurityManager: Changing modify acls to: 
> hamidelmaazouz
> 21/09/27 23:10:36 INFO SecurityManager: Changing view acls groups to: 
> 21/09/27 23:10:36 INFO SecurityManager: Changing modify acls groups to: 
> 21/09/27 23:10:36 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: 
> Set(hamidelmaazouz); groups with view permissions: Set(); users  with modify 
> permissions: Set(hamidelmaazouz); groups with modify permissions: Set()
> 21/09/27 23:10:37 INFO TransportClientFactory: Successfully created 
> connection to /192.168.0.191:45887 after 44 ms (0 ms spent in bootstraps)
> 21/09/27 23:10:37 WARN TransportChannelHandler: Exception in connection from 
> /192.168.0.191:45887
> java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef; 
> local class incompatible: stream classdesc serialVersionUID = 
> 3456489343829468865, local class serialVersionUID = 1028182004549731694
>   at 
> java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:689)
>   at 
> java.base/java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2012)
>   at 
> java.base/java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1862)
>   at 
> java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2169)
>   at 
> java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
>   at 
> java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
> 

[jira] [Resolved] (SPARK-36869) Spark job fails due to java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef; local class incompatible

2021-09-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36869.
---
Resolution: Duplicate

This is superseded by SPARK-36759.

> Spark job fails due to java.io.InvalidClassException: 
> scala.collection.mutable.WrappedArray$ofRef; local class incompatible
> ---
>
> Key: SPARK-36869
> URL: https://issues.apache.org/jira/browse/SPARK-36869
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.2
> Environment: * RHEL 8.4
>  * Java 11.0.12
>  * Spark 3.1.2 (only prebuilt with *2.12.10)*
>  * Scala *2.12.14* for the application code
>Reporter: Hamid EL MAAZOUZ
>Priority: Blocker
>  Labels: scala, serialization, spark
>
> This is a Scala problem. It has already been reported here 
> [https://github.com/scala/bug/issues/5046] and a fix has been merged here 
> [https://github.com/scala/scala/pull/9166.|https://github.com/scala/scala/pull/9166]
> According to 
> [https://github.com/scala/bug/issues/5046#issuecomment-928108088], the *fix* 
> is available on *Scala 2.12.14*, but *Spark 3.0+* is only pre-built with 
> Scala *2.12.10*.
>  
>  * Stacktrace of the failure: (Taken from stderr of a worker process)
> {code:java}
> Spark Executor Command: "/usr/java/jdk-11.0.12/bin/java" "-cp" 
> "/opt/apache/spark-3.1.2-bin-hadoop3.2/conf/:/opt/apache/spark-3.1.2-bin-hadoop3.2/jars/*"
>  "-Xmx1024M" "-Dspark.driver.port=45887" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.191:45887" "--executor-id" "0" 
> "--hostname" "192.168.0.191" "--cores" "12" "--app-id" 
> "app-20210927231035-" "--worker-url" "spark://Worker@192.168.0.191:35261"
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 21/09/27 23:10:36 INFO CoarseGrainedExecutorBackend: Started daemon with 
> process name: 18957@localhost
> 21/09/27 23:10:36 INFO SignalUtils: Registering signal handler for TERM
> 21/09/27 23:10:36 INFO SignalUtils: Registering signal handler for HUP
> 21/09/27 23:10:36 INFO SignalUtils: Registering signal handler for INT
> 21/09/27 23:10:36 WARN Utils: Your hostname, localhost resolves to a loopback 
> address: 127.0.0.1; using 192.168.0.191 instead (on interface wlp82s0)
> 21/09/27 23:10:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/opt/apache/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) 
> to constructor java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 21/09/27 23:10:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 21/09/27 23:10:36 INFO SecurityManager: Changing view acls to: hamidelmaazouz
> 21/09/27 23:10:36 INFO SecurityManager: Changing modify acls to: 
> hamidelmaazouz
> 21/09/27 23:10:36 INFO SecurityManager: Changing view acls groups to: 
> 21/09/27 23:10:36 INFO SecurityManager: Changing modify acls groups to: 
> 21/09/27 23:10:36 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: 
> Set(hamidelmaazouz); groups with view permissions: Set(); users  with modify 
> permissions: Set(hamidelmaazouz); groups with modify permissions: Set()
> 21/09/27 23:10:37 INFO TransportClientFactory: Successfully created 
> connection to /192.168.0.191:45887 after 44 ms (0 ms spent in bootstraps)
> 21/09/27 23:10:37 WARN TransportChannelHandler: Exception in connection from 
> /192.168.0.191:45887
> java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef; 
> local class incompatible: stream classdesc serialVersionUID = 
> 3456489343829468865, local class serialVersionUID = 1028182004549731694
>   at 
> java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:689)
>   at 
> java.base/java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2012)
>   at 
> java.base/java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1862)
>   at 
> java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2169)
>   at 
> java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
>   at 
> java.base/java.io.Obje

[jira] [Resolved] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-09-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34276.
---
  Assignee: Chao Sun
Resolution: Done

This is superseded by SPARK-36726 by [~csun].
I'm resolving this blocker issue because I don't see any open Parquet issues to 
block us.

cc [~Gengliang.Wang]

> Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
> --
>
> Key: SPARK-34276
> URL: https://issues.apache.org/jira/browse/SPARK-34276
> Project: Spark
>  Issue Type: Task
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Chao Sun
>Priority: Blocker
>
> Before the release, we need to double check the unreleased/unresolved 
> JIRAs/PRs of Parquet 1.11/1.12 and then decide whether we should 
> upgrade/revert Parquet. At the same time, we should encourage the whole 
> community to do the compatibility and performance tests for their production 
> workloads, including both read and write code paths.
> More details: 
> [https://github.com/apache/spark/pull/26804#issuecomment-768790620]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36880) Inline type hints for python/pyspark/sql/functions.py

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421602#comment-17421602
 ] 

Apache Spark commented on SPARK-36880:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34130

> Inline type hints for python/pyspark/sql/functions.py
> -
>
> Key: SPARK-36880
> URL: https://issues.apache.org/jira/browse/SPARK-36880
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints from python/pyspark/sql/functions.pyi to 
> python/pyspark/sql/functions.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36880) Inline type hints for python/pyspark/sql/functions.py

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421603#comment-17421603
 ] 

Apache Spark commented on SPARK-36880:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34130

> Inline type hints for python/pyspark/sql/functions.py
> -
>
> Key: SPARK-36880
> URL: https://issues.apache.org/jira/browse/SPARK-36880
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints from python/pyspark/sql/functions.pyi to 
> python/pyspark/sql/functions.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36880) Inline type hints for python/pyspark/sql/functions.py

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36880:


Assignee: Apache Spark

> Inline type hints for python/pyspark/sql/functions.py
> -
>
> Key: SPARK-36880
> URL: https://issues.apache.org/jira/browse/SPARK-36880
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints from python/pyspark/sql/functions.pyi to 
> python/pyspark/sql/functions.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36880) Inline type hints for python/pyspark/sql/functions.py

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36880:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/sql/functions.py
> -
>
> Key: SPARK-36880
> URL: https://issues.apache.org/jira/browse/SPARK-36880
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints from python/pyspark/sql/functions.pyi to 
> python/pyspark/sql/functions.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36880) Inline type hints for python/pyspark/sql/functions.py

2021-09-28 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36880:


 Summary: Inline type hints for python/pyspark/sql/functions.py
 Key: SPARK-36880
 URL: https://issues.apache.org/jira/browse/SPARK-36880
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Inline type hints from python/pyspark/sql/functions.pyi to 
python/pyspark/sql/functions.py.
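
Where the stub declares overloaded signatures, those would move inline with 
typing.overload (a hypothetical sketch; the function below is illustrative 
only, not an actual upstream signature):
{code:python}
# Hypothetical sketch: an overloaded signature that previously lived in
# functions.pyi is declared inline in functions.py with typing.overload.
from typing import Union, overload

from pyspark.sql import functions as F
from pyspark.sql.column import Column


@overload
def to_upper(col: Column) -> Column: ...
@overload
def to_upper(col: str) -> Column: ...


def to_upper(col: Union[Column, str]) -> Column:
    """Illustrative wrapper accepting a Column or a column name."""
    return F.upper(col)
{code}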



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path

2021-09-28 Thread Chao Sun (Jira)
Chao Sun created SPARK-36879:


 Summary: Support Parquet v2 data page encodings for the vectorized 
path
 Key: SPARK-36879
 URL: https://issues.apache.org/jira/browse/SPARK-36879
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


Currently Spark only supports the Parquet v1 encodings (i.e., 
PLAIN/DICTIONARY/RLE) in the vectorized path, and throws an exception otherwise:
{code}
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
{code}

It would be good to support the v2 encodings too, including DELTA_BINARY_PACKED, 
DELTA_LENGTH_BYTE_ARRAY and DELTA_BYTE_ARRAY, as well as BYTE_STREAM_SPLIT, as 
listed in https://github.com/apache/parquet-format/blob/master/Encodings.md
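
For context, a minimal way to hit the current limitation and its existing 
workaround is sketched below. It assumes a hypothetical /tmp/v2.parquet file 
written with the Parquet v2 writer (e.g. parquet.writer.version=PARQUET_2_0); 
the path is a placeholder.
{code:python}
# Hedged sketch: reading a file that uses Parquet v2 data page encodings
# such as DELTA_BYTE_ARRAY.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default vectorized path -- currently raises UnsupportedOperationException
# for v2 encodings.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
# spark.read.parquet("/tmp/v2.parquet").show()

# Workaround: fall back to the non-vectorized parquet-mr reader, which
# handles the v2 encodings.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.read.parquet("/tmp/v2.parquet").show()
{code}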



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36877) Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing reruns

2021-09-28 Thread Shardul Mahadik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shardul Mahadik updated SPARK-36877:

Summary: Calling ds.rdd with AQE enabled leads to jobs being run, 
eventually causing reruns  (was: Calling ds.rdd with AQE enabled leads to being 
jobs being run, eventually causing reruns)

> Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing 
> reruns
> --
>
> Key: SPARK-36877
> URL: https://issues.apache.org/jira/browse/SPARK-36877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Shardul Mahadik
>Priority: Major
> Attachments: Screen Shot 2021-09-28 at 09.32.20.png
>
>
> In one of our jobs we perform the following operation:
> {code:scala}
> val df = /* some expensive multi-table/multi-stage join */
> val numPartitions = df.rdd.getNumPartitions
> df.repartition(x).write.
> {code}
> With AQE enabled, we found that the expensive stages were being run twice 
> causing significant performance regression after enabling AQE; once when 
> calling {{df.rdd}} and again when calling {{df.write}}.
> A more concrete example:
> {code:scala}
> scala> sql("SET spark.sql.adaptive.enabled=true")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val df1 = spark.range(10).withColumn("id2", $"id")
> df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), 
> "id").join(spark.range(10), "id")
> df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df3 = df2.groupBy("id2").count()
> df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint]
> scala> df3.rdd.getNumPartitions
> res2: Int = 10(0 + 16) / 
> 16]
> scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1")
> {code}
> In the screenshot below, you can see that the first 3 stages (0 to 4) were 
> rerun again (5 to 9).
> I have two questions:
> 1) Should calling df.rdd trigger actual job execution when AQE is enabled?
> 2) Should calling df.write later cause rerun of the stages? If df.rdd has 
> already partially executed the stages, shouldn't it reuse the result from 
> previous stages?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36786) SPIP: Improving the compile time performance, by improving a couple of rules, from 24 hrs to under 8 minutes

2021-09-28 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-36786:
-
Description: 
h2. Q1. What are you trying to do? Articulate your objectives using absolutely 
no jargon.

The aim is to improve the compile time performance of a query which, in 
WorkDay's use case, takes > 24 hrs (& eventually fails), to < 8 min.

To explain the problem, I will provide the context.

The query plan in our production system is huge, with nested *case when* 
expressions (the level of nesting could be > 8), where each *case when* can 
sometimes have > 1000 branches.

The plan could look like
{quote}
Project1
  |
Filter1
  |
Project2
  |
Filter2
  |
Project3
  |
Filter3
  |
Join
{quote}
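
(As a rough, hypothetical illustration in PySpark, a chain of withColumn/filter 
calls like the sketch below produces this kind of alternating Project/Filter 
plan; the column names are made up.)
{code:python}
# Hypothetical sketch: each withColumn adds a Project and each filter adds a
# Filter on top of it, yielding the alternating Project/Filter chain above.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(100).toDF("c0")
for i in range(1, 4):
    df = df.withColumn(f"c{i}", F.col(f"c{i - 1}") + 1)  # Project i
    df = df.filter(F.col(f"c{i}") > i)                    # Filter i

# Inspect how the optimizer collapses projects and pushes filters down.
df.explain(True)
{code}
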
Now the optimizer has a Batch of Rules, intended to run at most 100 times.

*Also note that the batch will continue to run till one of the conditions is 
satisfied,*

*i.e. either numIter == 100 || inputPlan == outputPlan (idempotency is 
achieved).*

One of the early rules is *PushDownPredicateRule*, followed by 
*CollapseProject*.

The first issue is the *PushDownPredicate* rule.

It picks one filter at a time & pushes it to the lowest level (I understand 
that in 3.1 it pushes through a join, while in 2.4 it stops at a Join), but in 
either case it picks 1 filter at a time, starting from the top, in each 
iteration.

*The above comment is no longer true in the 3.1 release, as it now combines 
filters, so it pushes all the encountered filters in a single pass. But it 
still materializes the filter on each push by re-aliasing.*

*Also, it seems that with this change, the PruneFilter rule's functionality of 
replacement with an empty relation appears to be broken by the addition of the 
CombineFilters rule in the PushDownPredicates rule. More on it later.*

*I have filed a new bug ticket ([PruneFilter optimization 
broken|https://issues.apache.org/jira/browse/SPARK-36878])*

So if there are, say, 50 projects interspersed with filters, idempotency is 
guaranteed not to be achieved till around 49 iterations. Moreover, 
CollapseProject will also be modifying the tree on each iteration, as a filter 
will get removed within a Project.

Moreover, on each movement of a filter through the project tree, the filter is 
re-aliased using the transformUp rule. transformUp is very expensive compared 
to transformDown. As the filter keeps getting pushed down, its size increases.

To optimize this rule, 2 things are needed:
 # Instead of pushing one filter at a time, collect all the filters as we 
traverse the tree in that iteration itself.
 # Do not re-alias the filters on each push. Collect the sequence of projects 
it has passed through, and when the filters have reached their resting place, 
do the re-aliasing by processing the collected projects in a down-to-up manner.

This will result in achieving idempotency in a couple of iterations. 

*How reducing the number of iterations helps performance*

There are many rules like *NullPropagation, OptimizeIn, SimplifyConditionals 
(... there are around 6 more such rules)* which traverse the tree using 
transformUp, and they run unnecessarily in each iteration, even when the 
expressions in an operator have not changed since the previous runs.

*I have a different proposal, which I will share later, as to how to avoid the 
above rules from running unnecessarily, if it can be guaranteed that the 
expression is not going to mutate in the operator.*

The cause of our huge compilation time has been identified as the above.
  
h2. Q2. What problem is this proposal NOT designed to solve?

It is not going to change any runtime profile.
h2. Q3. How is it done today, and what are the limits of current practice?

As mentioned above, currently PushDownPredicate pushes one filter at a time & 
at each Project it materializes the re-aliased filter. This results in a large 
number of iterations to achieve idempotency; the immediate materialization of 
the Filter after each Project pass also results in unnecessary tree traversals 
of the filter expression, that too using transformUp, and the expression tree 
of the filter is bound to keep increasing as it is pushed down.
h2. Q4. What is new in your approach and why do you think it will be successful?

In the new approach we push all the filters down in a single pass, and do not 
materialize filters as they pass through Projects. Instead, we keep collecting 
the projects in sequential order and materialize the final filter once its 
final position is achieved (above a join, in the case of 2.1, or above the 
base relation, etc.).

This approach, when coupled with the logic of identifying those Project 
operators whose expressions will not mutate (which I will share later), so 
that rules like

NullPropagation,
 OptimizeIn,
 LikeSimplification,
 BooleanSimplification,
 SimplifyConditionals,
 RemoveDispensableExpressions,
 SimplifyBinaryComparison,
 SimplifyCaseConversionExpr

[jira] [Updated] (SPARK-36786) SPIP: Improving the compile time performance, by improving a couple of rules, from 24 hrs to under 8 minutes

2021-09-28 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-36786:
-
Description: 
h2. Q1. What are you trying to do? Articulate your objectives using absolutely 
no jargon.

The aim is to improve the compile time performance of a query which, in 
WorkDay's use case, takes > 24 hrs (& eventually fails), to < 8 min.

To explain the problem, I will provide the context.

The query plan in our production system is huge, with nested *case when* 
expressions (the level of nesting could be > 8), where each *case when* can 
sometimes have > 1000 branches.

The plan could look like
{quote}
Project1
  |
Filter1
  |
Project2
  |
Filter2
  |
Project3
  |
Filter3
  |
Join
{quote}
Now the optimizer has a Batch of Rules, intended to run at most 100 times.

*Also note that the batch will continue to run till one of the conditions is 
satisfied,*

*i.e. either numIter == 100 || inputPlan == outputPlan (idempotency is 
achieved).*

One of the early rules is *PushDownPredicateRule*, followed by 
*CollapseProject*.

The first issue is the *PushDownPredicate* rule.

It picks one filter at a time & pushes it to the lowest level (I understand 
that in 3.1 it pushes through a join, while in 2.4 it stops at a Join), but in 
either case it picks 1 filter at a time, starting from the top, in each 
iteration.

*The above comment is no longer true in the 3.1 release, as it now combines 
filters, so it pushes all the encountered filters in a single pass. But it 
still materializes the filter on each push by re-aliasing.*

*Also, it seems that with this change, the PruneFilter rule's functionality of 
replacement with an empty relation appears to be broken by the addition of the 
CombineFilters rule in the PushDownPredicates rule. More on it later.*

*I have filed a new bug ticket ([PruneFilter optimization 
broken|https://issues.apache.org/jira/browse/SPARK-36789])*

So if there are, say, 50 projects interspersed with filters, idempotency is 
guaranteed not to be achieved till around 49 iterations. Moreover, 
CollapseProject will also be modifying the tree on each iteration, as a filter 
will get removed within a Project.

Moreover, on each movement of a filter through the project tree, the filter is 
re-aliased using the transformUp rule. transformUp is very expensive compared 
to transformDown. As the filter keeps getting pushed down, its size increases.

To optimize this rule, 2 things are needed:
 # Instead of pushing one filter at a time, collect all the filters as we 
traverse the tree in that iteration itself.
 # Do not re-alias the filters on each push. Collect the sequence of projects 
it has passed through, and when the filters have reached their resting place, 
do the re-aliasing by processing the collected projects in a down-to-up manner.

This will result in achieving idempotency in a couple of iterations. 

*How reducing the number of iterations helps performance*

There are many rules like *NullPropagation, OptimizeIn, SimplifyConditionals 
(... there are around 6 more such rules)* which traverse the tree using 
transformUp, and they run unnecessarily in each iteration, even when the 
expressions in an operator have not changed since the previous runs.

*I have a different proposal, which I will share later, as to how to avoid the 
above rules from running unnecessarily, if it can be guaranteed that the 
expression is not going to mutate in the operator.*

The cause of our huge compilation time has been identified as the above.
  
h2. Q2. What problem is this proposal NOT designed to solve?

It is not going to change any runtime profile.
h2. Q3. How is it done today, and what are the limits of current practice?

As mentioned above, currently PushDownPredicate pushes one filter at a time & 
at each Project it materializes the re-aliased filter. This results in a large 
number of iterations to achieve idempotency; the immediate materialization of 
the Filter after each Project pass also results in unnecessary tree traversals 
of the filter expression, that too using transformUp, and the expression tree 
of the filter is bound to keep increasing as it is pushed down.
h2. Q4. What is new in your approach and why do you think it will be successful?

In the new approach we push all the filters down in a single pass, and do not 
materialize filters as they pass through Projects. Instead, we keep collecting 
the projects in sequential order and materialize the final filter once its 
final position is achieved (above a join, in the case of 2.1, or above the 
base relation, etc.).

This approach, when coupled with the logic of identifying those Project 
operators whose expressions will not mutate (which I will share later), so 
that rules like

NullPropagation,
 OptimizeIn,
 LikeSimplification,
 BooleanSimplification,
 SimplifyConditionals,
 RemoveDispensableExpressions,
 SimplifyBinaryComparison,
 SimplifyCaseConversionExpr

[jira] [Created] (SPARK-36878) Optimization in PushDownPredicates to push all filters in a single iteration has broken some optimizations in PruneFilter rule

2021-09-28 Thread Asif (Jira)
Asif created SPARK-36878:


 Summary: Optimization in PushDownPredicates to push all filters in 
a single iteration has broken  some optimizations in PruneFilter rule
 Key: SPARK-36878
 URL: https://issues.apache.org/jira/browse/SPARK-36878
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: Asif


It appears that the optimization in the PushDownPredicates rule to push all 
filters in a single pass (to reduce iterations) has broken the PruneFilters 
rule's substitution of an empty relation when the filter condition is a 
composite that statically evaluates to false, either because one of the 
non-redundant predicates is Literal(false) or because all the non-redundant 
predicates are null.
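
(For illustration, a minimal PySpark sketch of such a composite, statically 
false filter is below; ideally the optimizer replaces the whole scan with an 
empty LocalRelation. The example is hypothetical and only meant to show the 
shape of the predicate.)
{code:python}
# Hypothetical sketch: after CombineFilters the two filters below become one
# composite predicate "(id > 5) AND false", which statically evaluates to
# false; PruneFilters should ideally turn the plan into an empty relation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(10).filter(F.col("id") > 5).filter(F.lit(False))
# Check whether the optimized plan is an empty LocalRelation or still a scan.
df.explain(True)
{code}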

The new PushDownPredicates rule is created by chaining CombineFilters, 
PushPredicateThroughNonJoin and PushPredicateThroughJoin, so individual 
filters will get combined into a single filter while being pushed.

But the PruneFilters rule does not substitute it with an empty relation if 
the filter is composite; it is coded to handle single predicates.

The test is falsely passing as it is testing PushPredicateThroughNonJoin, 
which does not combine filters, while the actual rule in action has the 
effect produced by CombineFilters.

In fact, I believe all the places in other tests which are testing 
individually for PushDownPredicateThroughNonJoin or 
PushDownPredicateThroughJoin should be corrected (maybe with the rule 
PushPredicates) & re-tested.

I will add a bug test & open a PR.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36877) Calling ds.rdd with AQE enabled leads to being jobs being run, eventually causing reruns

2021-09-28 Thread Shardul Mahadik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421510#comment-17421510
 ] 

Shardul Mahadik commented on SPARK-36877:
-

cc: [~cloud_fan] [~mridulm80]

> Calling ds.rdd with AQE enabled leads to being jobs being run, eventually 
> causing reruns
> 
>
> Key: SPARK-36877
> URL: https://issues.apache.org/jira/browse/SPARK-36877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Shardul Mahadik
>Priority: Major
> Attachments: Screen Shot 2021-09-28 at 09.32.20.png
>
>
> In one of our jobs we perform the following operation:
> {code:scala}
> val df = /* some expensive multi-table/multi-stage join */
> val numPartitions = df.rdd.getNumPartitions
> df.repartition(x).write.
> {code}
> With AQE enabled, we found that the expensive stages were being run twice 
> causing significant performance regression after enabling AQE; once when 
> calling {{df.rdd}} and again when calling {{df.write}}.
> A more concrete example:
> {code:scala}
> scala> sql("SET spark.sql.adaptive.enabled=true")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val df1 = spark.range(10).withColumn("id2", $"id")
> df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), 
> "id").join(spark.range(10), "id")
> df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df3 = df2.groupBy("id2").count()
> df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint]
> scala> df3.rdd.getNumPartitions
> res2: Int = 10(0 + 16) / 
> 16]
> scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1")
> {code}
> In the screenshot below, you can see that the first 3 stages (0 to 4) were 
> rerun again (5 to 9).
> I have two questions:
> 1) Should calling df.rdd trigger actual job execution when AQE is enabled?
> 2) Should calling df.write later cause rerun of the stages? If df.rdd has 
> already partially executed the stages, shouldn't it reuse the result from 
> previous stages?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36877) Calling ds.rdd with AQE enabled leads to being jobs being run, eventually causing reruns

2021-09-28 Thread Shardul Mahadik (Jira)
Shardul Mahadik created SPARK-36877:
---

 Summary: Calling ds.rdd with AQE enabled leads to being jobs being 
run, eventually causing reruns
 Key: SPARK-36877
 URL: https://issues.apache.org/jira/browse/SPARK-36877
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2, 3.2.1
Reporter: Shardul Mahadik
 Attachments: Screen Shot 2021-09-28 at 09.32.20.png

In one of our jobs we perform the following operation:
{code:scala}
val df = /* some expensive multi-table/multi-stage join */
val numPartitions = df.rdd.getNumPartitions
df.repartition(x).write.
{code}

With AQE enabled, we found that the expensive stages were being run twice 
causing significant performance regression after enabling AQE; once when 
calling {{df.rdd}} and again when calling {{df.write}}.

A more concrete example:
{code:scala}
scala> sql("SET spark.sql.adaptive.enabled=true")
res0: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> val df1 = spark.range(10).withColumn("id2", $"id")
df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]

scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), 
"id").join(spark.range(10), "id")
df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]

scala> val df3 = df2.groupBy("id2").count()
df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint]

scala> df3.rdd.getNumPartitions
res2: Int = 10(0 + 16) / 16]

scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1")
{code}

In the screenshot below, you can see that the first 3 stages (0 to 4) were 
rerun again (5 to 9).

I have two questions:
1) Should calling df.rdd trigger actual job execution when AQE is enabled?
2) Should calling df.write later cause rerun of the stages? If df.rdd has 
already partially executed the stages, shouldn't it reuse the result from 
previous stages?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36877) Calling ds.rdd with AQE enabled leads to being jobs being run, eventually causing reruns

2021-09-28 Thread Shardul Mahadik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shardul Mahadik updated SPARK-36877:

Attachment: Screen Shot 2021-09-28 at 09.32.20.png

> Calling ds.rdd with AQE enabled leads to being jobs being run, eventually 
> causing reruns
> 
>
> Key: SPARK-36877
> URL: https://issues.apache.org/jira/browse/SPARK-36877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Shardul Mahadik
>Priority: Major
> Attachments: Screen Shot 2021-09-28 at 09.32.20.png
>
>
> In one of our jobs we perform the following operation:
> {code:scala}
> val df = /* some expensive multi-table/multi-stage join */
> val numPartitions = df.rdd.getNumPartitions
> df.repartition(x).write.
> {code}
> With AQE enabled, we found that the expensive stages were being run twice 
> causing significant performance regression after enabling AQE; once when 
> calling {{df.rdd}} and again when calling {{df.write}}.
> A more concrete example:
> {code:scala}
> scala> sql("SET spark.sql.adaptive.enabled=true")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val df1 = spark.range(10).withColumn("id2", $"id")
> df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), 
> "id").join(spark.range(10), "id")
> df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df3 = df2.groupBy("id2").count()
> df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint]
> scala> df3.rdd.getNumPartitions
> res2: Int = 10(0 + 16) / 
> 16]
> scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1")
> {code}
> In the screenshot below, you can see that the first 3 stages (0 to 4) were 
> rerun again (5 to 9).
> I have two questions:
> 1) Should calling df.rdd trigger actual job execution when AQE is enabled?
> 2) Should calling df.write later cause rerun of the stages? If df.rdd has 
> already partially executed the stages, shouldn't it reuse the result from 
> previous stages?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36873) Add provided Guava dependency for network-yarn module

2021-09-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36873:
-

Assignee: Chao Sun  (was: Apache Spark)

> Add provided Guava dependency for network-yarn module
> -
>
> Key: SPARK-36873
> URL: https://issues.apache.org/jira/browse/SPARK-36873
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> In Spark 3.1 and earlier the network-yarn module implicitly relied on guava 
> from the hadoop-client dependency. This was changed by SPARK-33212, where we 
> moved to the shaded Hadoop client which no longer exposes the transitive 
> guava dependency. This was fine for a while since we were not using 
> {{createDependencyReducedPom}}, so the module picked up the transitive 
> dependency from {{spark-network-common}}. However, this got changed by 
> SPARK-36835 when we restored {{createDependencyReducedPom}}, and now it is 
> no longer able to find guava classes:
> {code}
> mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver 
> -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 
> -Pspark-ganglia-lgpl -Pyarn
> ...
> [INFO] Compiling 1 Java source to 
> /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
> [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32:
>  package com.google.common.annotations does not exist
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33:
>  package com.google.common.base does not exist
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34:
>  package com.google.common.collect does not exist
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118:
>  cannot find symbol
>   symbol:   class VisibleForTesting
>   location: class org.apache.spark.network.yarn.YarnShuffleService
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36862) ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2021-09-28 Thread Magdalena Pilawska (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Magdalena Pilawska updated SPARK-36862:
---
Affects Version/s: (was: 3.1.2)
  Description: 
Hi,

I am getting the following error running spark-submit command:

ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
321, Column 103: ')' expected instead of '['

 

It fails running the spark sql command on delta lake: 
spark.sql(sqlTransformation)

The template of sqlTransformation is as follows:

MERGE INTO target_table AS d
 USING source_table AS s 
 on s.id = d.id
 WHEN MATCHED AND d.hash_value <> s.hash_value
 THEN UPDATE SET d.name =s.name, d.address = s.address

 

It is a permanent error for the *spark 3.1.1* version.

 

The same works fine with spark 3.0.0.

 

Here is the full log:

2021-09-22 16:43:22,110 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 55, 
Column 103: ')' expected instead of '['2021-09-22 16:43:22,110 ERROR 
CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 55, 
Column 103: ')' expected instead of 
'['org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
55, Column 103: ')' expected instead of '[' at 
org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362) 
at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:150) at 
org.codehaus.janino.Parser.read(Parser.java:3703) at 
org.codehaus.janino.Parser.parseFormalParameters(Parser.java:1622) at 
org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:1518) at 
org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:1028) at 
org.codehaus.janino.Parser.parseClassBody(Parser.java:841) at 
org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:736) at 
org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:941) at 
org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:234) at 
org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:205) at 
org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1427)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1524)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1521)
 at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
 at 
org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) 
at 
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
 at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) 
at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000) at 
org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1375)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:721)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181) at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:160)
 at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD(ShuffleExchangeExec.scala:160)
 at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture$lzycompute(ShuffleExchangeExec.scala:164)
 at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture(ShuffleExchangeExec.scala:163)
 at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeLike.$anonfun$materializeFuture$2(ShuffleExchangeExec.scala:100)
 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) 
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeLike.$anonfun$materializeFuture$1(ShuffleExchangeExec.scala:100)
 at org.apache.spark.sql.util.LazyValue.getOrInit(LazyValue.scala:41) at 
org.apache.spark.sql.execution.exchange.Exchange.getOrInitMaterializeFuture(Exchange.scala:68)
 at 
org.apache.spark.sql.execution.ex

[jira] [Comment Edited] (SPARK-36862) ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2021-09-28 Thread Magdalena Pilawska (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421374#comment-17421374
 ] 

Magdalena Pilawska edited comment on SPARK-36862 at 9/28/21, 1:11 PM:
--

I get the physical execution plan as part of the output log, but I cannot 
share it publicly, if that is what you mean.

Any thoughts on why the same works on spark 3.0.0?


was (Author: mpilaw):
I get the physical execution plan as a part of output log but I cannot share 
that in public if you mean so.

Any thoughts why the same works on 3.0.0? 

> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java'
> -
>
> Key: SPARK-36862
> URL: https://issues.apache.org/jira/browse/SPARK-36862
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 3.1.1, 3.1.2
> Environment: Spark 3.1.1 and Spark 3.1.2
> hadoop 3.2.1
>Reporter: Magdalena Pilawska
>Priority: Major
>
> Hi,
> I am getting the following error running spark-submit command:
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 321, Column 103: ')' expected instead of '['
>  
> It fails running the spark sql command on delta lake: 
> spark.sql(sqlTransformation)
> The template of sqlTransformation is as follows:
> MERGE INTO target_table AS d
>  USING source_table AS s 
>  on s.id = d.id
>  WHEN MATCHED AND d.hash_value <> s.hash_value
>  THEN UPDATE SET d.name =s.name, d.address = s.address
>  
> It is permanent error both for *spark 3.1.1* and *3.1.2* versions.
>  
> The same works fine with spark 3.0.0.
>  
> Here is the full log:
> 2021-09-22 16:43:22,110 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 55, Column 103: ')' expected instead of '['
> 2021-09-22 16:43:22,110 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 55, Column 103: ')' expected instead of '['
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 55, Column 103: ')' expected instead of '['
>  at org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
>  at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:150)
>  at org.codehaus.janino.Parser.read(Parser.java:3703)
>  at org.codehaus.janino.Parser.parseFormalParameters(Parser.java:1622)
>  at org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:1518)
>  at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:1028)
>  at org.codehaus.janino.Parser.parseClassBody(Parser.java:841)
>  at org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:736)
>  at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:941)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:234)
>  at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:205)
>  at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
>  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1427)
>  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1524)
>  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1521)
>  at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>  at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
>  at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>  at org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1375)
>  at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:721)
>  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
>  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
>  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
>  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution
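
For context, a minimal sketch of how a transformation like the quoted template is typically executed from Scala. The table names (target_table, source_table) and columns (id, hash_value, name, address) follow the template above but are assumptions, not details from the reporter's environment; this illustrates the call pattern, it is not a confirmed reproduction of the bug.

{code}
// Hypothetical repro sketch, assuming two Delta tables matching the quoted template.
val sqlTransformation =
  """MERGE INTO target_table AS d
    |USING source_table AS s
    |ON s.id = d.id
    |WHEN MATCHED AND d.hash_value <> s.hash_value
    |THEN UPDATE SET d.name = s.name, d.address = s.address
    |""".stripMargin

// Executed exactly as described in the report; on Spark 3.1.x this is where
// the janino CompileException is reported to surface.
spark.sql(sqlTransformation)
{code}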

[jira] [Commented] (SPARK-36862) ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2021-09-28 Thread Magdalena Pilawska (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421374#comment-17421374
 ] 

Magdalena Pilawska commented on SPARK-36862:


I get the physical execution plan as part of the output log, but I cannot share
it in public, if that is what you mean.

Any thoughts on why the same works on 3.0.0?

> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java'
> -
>
> Key: SPARK-36862
> URL: https://issues.apache.org/jira/browse/SPARK-36862
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 3.1.1, 3.1.2
> Environment: Spark 3.1.1 and Spark 3.1.2
> hadoop 3.2.1
>Reporter: Magdalena Pilawska
>Priority: Major
>
> Hi,
> I am getting the following error when running a spark-submit command:
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 321, Column 103: ')' expected instead of '['
>  
> It fails when running the Spark SQL command on Delta Lake:
> spark.sql(sqlTransformation)
> The template of sqlTransformation is as follows:
> MERGE INTO target_table AS d
>  USING source_table AS s 
>  on s.id = d.id
>  WHEN MATCHED AND d.hash_value <> s.hash_value
>  THEN UPDATE SET d.name =s.name, d.address = s.address
>  
> It is a permanent error for both *Spark 3.1.1* and *3.1.2*.
>  
> The same works fine with Spark 3.0.0.
>  
> Here is the full log:
> 2021-09-22 16:43:22,110 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 55, Column 103: ')' expected instead of '['
> 2021-09-22 16:43:22,110 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 55, Column 103: ')' expected instead of '['
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 55, Column 103: ')' expected instead of '['
>  at org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
>  at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:150)
>  at org.codehaus.janino.Parser.read(Parser.java:3703)
>  at org.codehaus.janino.Parser.parseFormalParameters(Parser.java:1622)
>  at org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:1518)
>  at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:1028)
>  at org.codehaus.janino.Parser.parseClassBody(Parser.java:841)
>  at org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:736)
>  at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:941)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:234)
>  at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:205)
>  at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
>  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1427)
>  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1524)
>  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1521)
>  at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>  at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
>  at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>  at org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1375)
>  at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:721)
>  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
>  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
>  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
>  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
>  at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:1
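
One way to narrow this down, offered here only as a diagnostic sketch and not as something the reporter tried or as a recommended production setting: the stack trace shows the failure inside WholeStageCodegenExec while compiling the generated Java, so temporarily disabling whole-stage codegen via the standard spark.sql.codegen.wholeStage config would show whether the same MERGE succeeds on the non-codegen path, which would point at a codegen regression between 3.0.0 and 3.1.x. The tables and columns below are the same hypothetical ones used in the earlier sketch.

{code}
// Diagnostic sketch: spark.sql.codegen.wholeStage is a standard Spark SQL config;
// setting it to false forces the non-codegen execution path for this session.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

spark.sql(
  """MERGE INTO target_table AS d
    |USING source_table AS s
    |ON s.id = d.id
    |WHEN MATCHED AND d.hash_value <> s.hash_value
    |THEN UPDATE SET d.name = s.name, d.address = s.address
    |""".stripMargin)

// Restore the default afterwards.
spark.conf.set("spark.sql.codegen.wholeStage", "true")
{code}

If the statement still fails with codegen disabled, the problem is more likely in the planning of the MERGE itself rather than in the generated-code compilation step.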

[jira] [Commented] (SPARK-36848) Migrate ShowCurrentNamespaceStatement to v2 command framework

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421369#comment-17421369
 ] 

Apache Spark commented on SPARK-36848:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/34128

> Migrate ShowCurrentNamespaceStatement to v2 command framework
> -
>
> Key: SPARK-36848
> URL: https://issues.apache.org/jira/browse/SPARK-36848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


