[jira] [Updated] (SPARK-37690) Recursive view `df` detected (cycle: `df` -> `df`)

2021-12-19 Thread Robin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin updated SPARK-37690:
--
Description: 
In Spark 3.2.0, you can no longer reuse the same name for a temporary view.  
This change is backwards incompatible, and means a common way of running 
pipelines of SQL queries no longer works.   The following is a simple 
reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: 

{code:python}
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """
df = spark.sql(sql)
df.createOrReplaceTempView("df")

sql = """ SELECT * FROM df """
df = spark.sql(sql)
df.createOrReplaceTempView("df")

sql = """ SELECT * FROM df """
df = spark.sql(sql)
{code}

The following error is now produced:   
{code:python}
AnalysisException: Recursive view `df` detected (cycle: `df` -> `df`)
{code}

I'm reasonably sure this change is unintentional in 3.2.0, since it breaks a lot 
of legacy code and the `createOrReplaceTempView` method is named explicitly to 
indicate that replacing an existing view should be allowed.  An internet search 
suggests other users have run into similar problems, e.g. 
[here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
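
For anyone hitting this while upgrading, here is a minimal workaround sketch, 
not a statement about the intended Spark behaviour: give each pipeline stage 
its own view name so that no view is ever defined in terms of itself. It uses 
the same SparkSession as above; the view names {{step_0}} and {{step_1}} are 
made up for illustration.

{code:python}
# Hedged workaround sketch: register each intermediate result under a
# distinct view name instead of reusing "df" for every stage.
df = spark.sql("SELECT id AS col_1, rand() AS col_2 FROM RANGE(10)")
df.createOrReplaceTempView("step_0")

df = spark.sql("SELECT * FROM step_0")
df.createOrReplaceTempView("step_1")

df = spark.sql("SELECT * FROM step_1")
{code}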
  

  was:
In Spark 3.2.0, you can no longer reuse the same name for a temporary view.  
This change is backwards incompatible, and means a common way of running 
pipelines of SQL queries no longer works.   The following is a simple 
reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: 

{code:python}
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """
df = spark.sql(sql)
df.createOrReplaceTempView("df")

sql = """ SELECT * FROM df """
df = spark.sql(sql)
df.createOrReplaceTempView("df")

sql = """ SELECT * FROM df """
df = spark.sql(sql)
{code}

The following error is now produced:   
{code:python}
AnalysisException: Recursive view `df` detected (cycle: `df` -> `df`)
{code}

I'm reasonably sure this change is unintentional in 3.2.0, since it breaks a lot 
of legacy code and the `createOrReplaceTempView` method is named explicitly to 
indicate that replacing an existing view should be allowed.  An internet search 
suggests other users have run into similar problems, e.g. 
[here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
  


> Recursive view `df` detected (cycle: `df` -> `df`)
> --
>
> Key: SPARK-37690
> URL: https://issues.apache.org/jira/browse/SPARK-37690
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Robin
>Priority: Major
>
> In Spark 3.2.0, you can no longer reuse the same name for a temporary view.  
> This change is backwards incompatible, and means a common way of running 
> pipelines of SQL queries no longer works.   The following is a simple 
> reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: 
> {code:python}
> from pyspark.context import SparkContext 
> from pyspark.sql import SparkSession 
> sc = SparkContext.getOrCreate() 
> spark = SparkSession(sc) 
> sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """ 
> df = spark.sql(sql) 
> df.createOrReplaceTempView("df") 
> sql = """ SELECT * FROM df """ 
> df = spark.sql(sql) 
> df.createOrReplaceTempView("df") 
> sql = """ SELECT * FROM df """ 
> df = spark.sql(sql)
> {code}
> The following error is now produced:   
> {code:python}
> AnalysisException: Recursive view `df` detected (cycle: `df` -> `df`)
> {code}
> I'm reasonably sure this change is unintentional in 3.2.0, since it breaks a 
> lot of legacy code and the `createOrReplaceTempView` method is named 
> explicitly to indicate that replacing an existing view should be allowed.  An 
> internet search suggests other users have run into similar problems, e.g. 
> [here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
>   



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37690) Recursive view `df` detected (cycle: `df` -> `df`)

2021-12-19 Thread Robin (Jira)
Robin created SPARK-37690:
-

 Summary: Recursive view `df` detected (cycle: `df` -> `df`)
 Key: SPARK-37690
 URL: https://issues.apache.org/jira/browse/SPARK-37690
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Robin


In Spark 3.2.0, you can no longer reuse the same name for a temporary view.  
This change is backwards incompatible, and means a common way of running 
pipelines of SQL queries no longer works.   The following is a simple 
reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: 

{code:python}
from pyspark.context import SparkContext 
from pyspark.sql import SparkSession 

sc = SparkContext.getOrCreate() 
spark = SparkSession(sc) 

sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """ 
df = spark.sql(sql) 
df.createOrReplaceTempView("df") 

sql = """ SELECT * FROM df """ 
df = spark.sql(sql) 
df.createOrReplaceTempView("df") 

sql = """ SELECT * FROM df """ 
df = spark.sql(sql) 
{code}

The following error is now produced:

{code:python}
AnalysisException: Recursive view `df` detected (cycle: `df` -> `df`)
{code}

I'm reasonably sure this change is unintentional in 3.2.0, since it breaks a lot 
of legacy code and the `createOrReplaceTempView` method is named explicitly to 
indicate that replacing an existing view should be allowed.  An internet search 
suggests other users have run into similar problems, e.g. 
[here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
  






[jira] [Commented] (SPARK-37668) 'Index' object has no attribute 'levels' in pyspark.pandas.frame.DataFrame.insert

2021-12-19 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462429#comment-17462429
 ] 

Haejoon Lee commented on SPARK-37668:
-

This should only be applied to MultiIndex columns; let me address it.

> 'Index' object has no attribute 'levels' in  
> pyspark.pandas.frame.DataFrame.insert
> --
>
> Key: SPARK-37668
> URL: https://issues.apache.org/jira/browse/SPARK-37668
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
>  [This piece of 
> code|https://github.com/apache/spark/blob/6e45b04db48008fa033b09df983d3bd1c4f790ea/python/pyspark/pandas/frame.py#L3991-L3993]
>  in {{pyspark.pandas.frame}} is going to fail at runtime when 
> {{is_name_like_tuple}} evaluates to {{True}}
> {code:python}
> if is_name_like_tuple(column):
> if len(column) != len(self.columns.levels):
> {code}
> with 
> {code}
> 'Index' object has no attribute 'levels'
> {code}
> To be honest, I am not sure what the intended behavior is (initially, I 
> suspected that we should have 
> {code:python}
>  if len(column) != self.columns.nlevels
> {code}
> but {{nlevels}} is hard-coded to one, and that wouldn't be consistent with 
> Pandas at all).
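
For reference, a minimal pandas-only sketch of the distinction being discussed 
(plain pandas, not {{pyspark.pandas}}; shown as an illustration, not taken from 
the Spark code base):

{code:python}
import pandas as pd

# A flat Index has nlevels (always 1) but no .levels attribute,
# which is exactly the AttributeError reported above.
flat = pd.Index(["a", "b", "c"])
print(flat.nlevels)             # 1
print(hasattr(flat, "levels"))  # False

# A MultiIndex exposes both .levels and .nlevels.
multi = pd.MultiIndex.from_tuples([("x", 1), ("y", 2)])
print(multi.nlevels)        # 2
print(len(multi.levels))    # 2
{code}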






[jira] [Commented] (SPARK-37613) Support ANSI Aggregate Function: regr_count

2021-12-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462423#comment-17462423
 ] 

Apache Spark commented on SPARK-37613:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34955

> Support ANSI Aggregate Function: regr_count
> ---
>
> Key: SPARK-37613
> URL: https://issues.apache.org/jira/browse/SPARK-37613
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>
> REGR_COUNT is an ANSI aggregate function. Many databases support it.
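
For context, a small sketch of what the function computes: REGR_COUNT(y, x) 
counts the rows where both arguments are non-null. The snippet below is 
illustrative only; it assumes an existing SparkSession named {{spark}} and a 
Spark build that already ships the function (e.g. 3.3.0), and the table and 
column names are made up.

{code:python}
# Hypothetical example data for illustration.
df = spark.createDataFrame(
    [(1.0, 2.0), (None, 3.0), (4.0, None), (5.0, 6.0)], ["y", "x"]
)
df.createOrReplaceTempView("points")

# Only the rows where both y and x are non-null are counted, so this returns 2.
spark.sql("SELECT regr_count(y, x) AS cnt FROM points").show()
{code}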






[jira] [Commented] (SPARK-37689) Expand should be supported PropagateEmptyRelation

2021-12-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462416#comment-17462416
 ] 

Apache Spark commented on SPARK-37689:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34954

> Expand should be supported PropagateEmptyRelation
> -
>
> Key: SPARK-37689
> URL: https://issues.apache.org/jira/browse/SPARK-37689
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2021-12-20-14-30-34-276.png, 
> image-2021-12-20-14-31-16-893.png
>
>
> Empty LocalScan still triggers HashAggregate execution. Expand should be 
> supported in PropagateEmptyRelation too.
> !image-2021-12-20-14-30-34-276.png|width=347,height=336!
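
For context, a minimal PySpark sketch of the kind of plan being discussed: an 
aggregate with CUBE over a relation the optimizer can prove empty is planned 
with an Expand node, and the resulting plan can be inspected with explain(). 
This assumes an existing SparkSession named {{spark}}; the view name is made up.

{code:python}
# Build a relation that the optimizer folds to an empty LocalRelation.
spark.range(10).selectExpr("id % 2 AS k").where("1 = 0").createOrReplaceTempView("t")

# CUBE (like GROUPING SETS) introduces an Expand node above the scan.
q = spark.sql("SELECT k, count(*) AS c FROM t GROUP BY k WITH CUBE")
q.explain()  # inspect whether Expand/HashAggregate survive over the empty scan
{code}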






[jira] [Assigned] (SPARK-37689) Expand should be supported PropagateEmptyRelation

2021-12-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37689:


Assignee: Apache Spark

> Expand should be supported PropagateEmptyRelation
> -
>
> Key: SPARK-37689
> URL: https://issues.apache.org/jira/browse/SPARK-37689
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
> Attachments: image-2021-12-20-14-30-34-276.png, 
> image-2021-12-20-14-31-16-893.png
>
>
> Empty LocalScan still triggers HashAggregate execution. Expand should be 
> supported in PropagateEmptyRelation too.
> !image-2021-12-20-14-30-34-276.png|width=347,height=336!






[jira] [Assigned] (SPARK-37689) Expand should be supported PropagateEmptyRelation

2021-12-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37689:


Assignee: (was: Apache Spark)

> Expand should be supported PropagateEmptyRelation
> -
>
> Key: SPARK-37689
> URL: https://issues.apache.org/jira/browse/SPARK-37689
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2021-12-20-14-30-34-276.png, 
> image-2021-12-20-14-31-16-893.png
>
>
> Empty LocalScan still triggers HashAggregate execution. Expand should be 
> supported in PropagateEmptyRelation too.
> !image-2021-12-20-14-30-34-276.png|width=347,height=336!






[jira] [Updated] (SPARK-37689) Expand should be supported PropagateEmptyRelation

2021-12-19 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37689:
--
Summary: Expand should be supported PropagateEmptyRelation  (was: Expand 
should be supported PropagateEmptyRelationBase)

> Expand should be supported PropagateEmptyRelation
> -
>
> Key: SPARK-37689
> URL: https://issues.apache.org/jira/browse/SPARK-37689
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2021-12-20-14-30-34-276.png, 
> image-2021-12-20-14-31-16-893.png
>
>
> Empty LocalScan still triggers HashAggregate execution. Expand should be 
> supported in PropagateEmptyRelation too.
> !image-2021-12-20-14-30-34-276.png|width=347,height=336!






[jira] [Updated] (SPARK-37689) Expand should be supported PropagateEmptyRelationBase

2021-12-19 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37689:
--
Description: 
Empty LocalScan still triggers HashAggregate execution. Expand should be supported 
in PropagateEmptyRelation too.

!image-2021-12-20-14-30-34-276.png|width=347,height=336!

  was:!image-2021-12-20-14-30-24-864.png!


> Expand should be supported PropagateEmptyRelationBase
> -
>
> Key: SPARK-37689
> URL: https://issues.apache.org/jira/browse/SPARK-37689
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2021-12-20-14-30-34-276.png, 
> image-2021-12-20-14-31-16-893.png
>
>
> Empty LocalScan still triggers HashAggregate execution. Expand should be 
> supported in PropagateEmptyRelation too.
> !image-2021-12-20-14-30-34-276.png|width=347,height=336!






[jira] [Updated] (SPARK-37689) Expand should be supported PropagateEmptyRelationBase

2021-12-19 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37689:
--
Attachment: image-2021-12-20-14-31-16-893.png

> Expand should be supported PropagateEmptyRelationBase
> -
>
> Key: SPARK-37689
> URL: https://issues.apache.org/jira/browse/SPARK-37689
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2021-12-20-14-30-34-276.png, 
> image-2021-12-20-14-31-16-893.png
>
>
> !image-2021-12-20-14-30-24-864.png!






[jira] [Updated] (SPARK-37689) Expand should be supported PropagateEmptyRelationBase

2021-12-19 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37689:
--
Attachment: image-2021-12-20-14-30-34-276.png

> Expand should be supported PropagateEmptyRelationBase
> -
>
> Key: SPARK-37689
> URL: https://issues.apache.org/jira/browse/SPARK-37689
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2021-12-20-14-30-34-276.png
>
>
> !image-2021-12-20-14-30-24-864.png!






[jira] [Created] (SPARK-37689) Expand should be supported PropagateEmptyRelationBase

2021-12-19 Thread angerszhu (Jira)
angerszhu created SPARK-37689:
-

 Summary: Expand should be supported PropagateEmptyRelationBase
 Key: SPARK-37689
 URL: https://issues.apache.org/jira/browse/SPARK-37689
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu
 Attachments: image-2021-12-20-14-30-34-276.png

!image-2021-12-20-14-30-24-864.png!






[jira] [Assigned] (SPARK-37682) Reduce memory pressure of RewriteDistinctAggregates

2021-12-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37682:


Assignee: (was: Apache Spark)

> Reduce memory pressure of RewriteDistinctAggregates
> ---
>
> Key: SPARK-37682
> URL: https://issues.apache.org/jira/browse/SPARK-37682
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kevin Liu
>Priority: Minor
>  Labels: performance
>
> In some cases, the current RewriteDistinctAggregates rule duplicates 
> unnecessary input data across distinct groups.
> This wastes a lot of memory and hurts performance.
> We could apply 'merged column' and 'bit vector' tricks to alleviate the 
> problem. For example:
> {code:sql}
> SELECT
>   COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_filter_cnt_dist,
>   COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_filter_cnt_dist,
>   COUNT(DISTINCT IF(value > 5, cat1, null)) as cat1_if_cnt_dist,
>   COUNT(DISTINCT id) as id_cnt_dist,
>   SUM(DISTINCT value) as id_sum_dist
> FROM data
> GROUP BY key
> {code}
> Current rule will rewrite the above sql plan to the following (pseudo) 
> logical plan:
> {noformat}
> Aggregate(
>key = ['key]
>functions = [
>count('cat1) FILTER (WHERE (('gid = 1) AND 'max(id > 1))),
>count('(IF((value > 5), cat1, null))) FILTER (WHERE ('gid = 5)),
>count('cat2) FILTER (WHERE (('gid = 3) AND 'max(id > 2))),
>count('id) FILTER (WHERE ('gid = 2)),
>sum('value) FILTER (WHERE ('gid = 4))
>]
>output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 
> 'cat1_if_cnt_dist,
>  'id_cnt_dist, 'id_sum_dist])
>   Aggregate(
>  key = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, 
> 'gid]
>  functions = [max('id > 1), max('id > 2)]
>  output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 
> 'id, 'gid,
>'max(id > 1), 'max(id > 2)])
> Expand(
>projections = [
>  ('key, 'cat1, null, null, null, null, 1, ('id > 1), null),
>  ('key, null, null, null, null, 'id, 2, null, null),
>  ('key, null, null, 'cat2, null, null, 3, null, ('id > 2)),
>  ('key, null, 'value, null, null, null, 4, null, null),
>  ('key, null, null, null, if (('value > 5)) 'cat1 else null, null, 5, 
> null, null)
>]
>output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 
> 'id,
>  'gid, '(id > 1), '(id > 2)])
>   LocalTableScan [...]
> {noformat}
> After applying 'merged column' and 'bit vector' tricks, the logical plan will 
> become:
> {noformat}
> Aggregate(
>key = ['key]
>functions = [
>count('merged_string_1) FILTER (WHERE (('gid = 1) AND NOT 
> (('filter_vector_1 & 1) = 0)))
>count('merged_string_1) FILTER (WHERE ('gid = 1)),
>count(if (NOT (('if_vector_1 & 1) = 0)) 'merged_string_1 else null) 
> FILTER (WHERE ('gid = 1)),
>count('merged_string_1) FILTER (WHERE (('gid = 2) AND NOT 
> (('filter_vector_1 & 1) = 0))),
>count('merged_integer_1) FILTER (WHERE ('gid = 3)),
>sum('merged_integer_1) FILTER (WHERE ('gid = 4))
>]
>output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 
> 'cat1_if_cnt_dist,
>  'id_cnt_dist, 'id_sum_dist])
>   Aggregate(
>  key = ['key, 'merged_string_1, 'merged_integer_1, 'gid]
>  functions = [bit_or('if_vector_1),bit_or('filter_vector_1)]
>  output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 
> 'bit_or(if_vector_1), 'bit_or(filter_vector_1)])
> Expand(
>projections = [
>  ('key, 'cat1, null, 1, if ('value > 5) 1 else 0, if ('id > 1) 1 else 
> 0),
>  ('key, 'cat2, null, 2, null, if ('id > 2) 1 else 0),
>  ('key, null, 'id, 3, null, null),
>  ('key, null, 'value, 4, null, null)
>]
>output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 
> 'if_vector_1, 'filter_vector_1])
>   LocalTableScan [...]
> {noformat}
> 1. merged column: Children with same datatype from different aggregate 
> functions can share same project column (e.g. cat1, cat2).
> 2. bit vector: If multiple aggregate function children have conditional 
> expressions, these conditions will output one column when it is true, and 
> output null when it is false. The detail logic will be in 
> RewriteDistinctAggregates.groupDistinctAggExpr of the following github link. 
> Then these aggregate functions can share one row group, and store the results 
> of their respective conditional expressions in the bit vector column, 
> reducing the number of rows of data expansion (e.g. cat1_filter_cnt_dist, 
> cat1_if_cnt_dist).
> If there are many similar aggregate functions with 

[jira] [Assigned] (SPARK-37682) Reduce memory pressure of RewriteDistinctAggregates

2021-12-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37682:


Assignee: Apache Spark

> Reduce memory pressure of RewriteDistinctAggregates
> ---
>
> Key: SPARK-37682
> URL: https://issues.apache.org/jira/browse/SPARK-37682
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kevin Liu
>Assignee: Apache Spark
>Priority: Minor
>  Labels: performance
>
> In some cases, the current RewriteDistinctAggregates rule duplicates 
> unnecessary input data across distinct groups.
> This wastes a lot of memory and hurts performance.
> We could apply 'merged column' and 'bit vector' tricks to alleviate the 
> problem. For example:
> {code:sql}
> SELECT
>   COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_filter_cnt_dist,
>   COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_filter_cnt_dist,
>   COUNT(DISTINCT IF(value > 5, cat1, null)) as cat1_if_cnt_dist,
>   COUNT(DISTINCT id) as id_cnt_dist,
>   SUM(DISTINCT value) as id_sum_dist
> FROM data
> GROUP BY key
> {code}
> Current rule will rewrite the above sql plan to the following (pseudo) 
> logical plan:
> {noformat}
> Aggregate(
>key = ['key]
>functions = [
>count('cat1) FILTER (WHERE (('gid = 1) AND 'max(id > 1))),
>count('(IF((value > 5), cat1, null))) FILTER (WHERE ('gid = 5)),
>count('cat2) FILTER (WHERE (('gid = 3) AND 'max(id > 2))),
>count('id) FILTER (WHERE ('gid = 2)),
>sum('value) FILTER (WHERE ('gid = 4))
>]
>output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 
> 'cat1_if_cnt_dist,
>  'id_cnt_dist, 'id_sum_dist])
>   Aggregate(
>  key = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, 
> 'gid]
>  functions = [max('id > 1), max('id > 2)]
>  output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 
> 'id, 'gid,
>'max(id > 1), 'max(id > 2)])
> Expand(
>projections = [
>  ('key, 'cat1, null, null, null, null, 1, ('id > 1), null),
>  ('key, null, null, null, null, 'id, 2, null, null),
>  ('key, null, null, 'cat2, null, null, 3, null, ('id > 2)),
>  ('key, null, 'value, null, null, null, 4, null, null),
>  ('key, null, null, null, if (('value > 5)) 'cat1 else null, null, 5, 
> null, null)
>]
>output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 
> 'id,
>  'gid, '(id > 1), '(id > 2)])
>   LocalTableScan [...]
> {noformat}
> After applying 'merged column' and 'bit vector' tricks, the logical plan will 
> become:
> {noformat}
> Aggregate(
>key = ['key]
>functions = [
>count('merged_string_1) FILTER (WHERE (('gid = 1) AND NOT 
> (('filter_vector_1 & 1) = 0)))
>count('merged_string_1) FILTER (WHERE ('gid = 1)),
>count(if (NOT (('if_vector_1 & 1) = 0)) 'merged_string_1 else null) 
> FILTER (WHERE ('gid = 1)),
>count('merged_string_1) FILTER (WHERE (('gid = 2) AND NOT 
> (('filter_vector_1 & 1) = 0))),
>count('merged_integer_1) FILTER (WHERE ('gid = 3)),
>sum('merged_integer_1) FILTER (WHERE ('gid = 4))
>]
>output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 
> 'cat1_if_cnt_dist,
>  'id_cnt_dist, 'id_sum_dist])
>   Aggregate(
>  key = ['key, 'merged_string_1, 'merged_integer_1, 'gid]
>  functions = [bit_or('if_vector_1),bit_or('filter_vector_1)]
>  output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 
> 'bit_or(if_vector_1), 'bit_or(filter_vector_1)])
> Expand(
>projections = [
>  ('key, 'cat1, null, 1, if ('value > 5) 1 else 0, if ('id > 1) 1 else 
> 0),
>  ('key, 'cat2, null, 2, null, if ('id > 2) 1 else 0),
>  ('key, null, 'id, 3, null, null),
>  ('key, null, 'value, 4, null, null)
>]
>output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 
> 'if_vector_1, 'filter_vector_1])
>   LocalTableScan [...]
> {noformat}
> 1. merged column: Children with same datatype from different aggregate 
> functions can share same project column (e.g. cat1, cat2).
> 2. bit vector: If multiple aggregate function children have conditional 
> expressions, these conditions will output one column when it is true, and 
> output null when it is false. The detail logic will be in 
> RewriteDistinctAggregates.groupDistinctAggExpr of the following github link. 
> Then these aggregate functions can share one row group, and store the results 
> of their respective conditional expressions in the bit vector column, 
> reducing the number of rows of data expansion (e.g. cat1_filter_cnt_dist, 
> cat1_if_cnt_dist).
> If there are many similar 

[jira] [Commented] (SPARK-37682) Reduce memory pressure of RewriteDistinctAggregates

2021-12-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462381#comment-17462381
 ] 

Apache Spark commented on SPARK-37682:
--

User 'Flyangz' has created a pull request for this issue:
https://github.com/apache/spark/pull/34953

> Reduce memory pressure of RewriteDistinctAggregates
> ---
>
> Key: SPARK-37682
> URL: https://issues.apache.org/jira/browse/SPARK-37682
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kevin Liu
>Priority: Minor
>  Labels: performance
>
> In some cases, the current RewriteDistinctAggregates rule duplicates 
> unnecessary input data across distinct groups.
> This wastes a lot of memory and hurts performance.
> We could apply 'merged column' and 'bit vector' tricks to alleviate the 
> problem. For example:
> {code:sql}
> SELECT
>   COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_filter_cnt_dist,
>   COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_filter_cnt_dist,
>   COUNT(DISTINCT IF(value > 5, cat1, null)) as cat1_if_cnt_dist,
>   COUNT(DISTINCT id) as id_cnt_dist,
>   SUM(DISTINCT value) as id_sum_dist
> FROM data
> GROUP BY key
> {code}
> Current rule will rewrite the above sql plan to the following (pseudo) 
> logical plan:
> {noformat}
> Aggregate(
>key = ['key]
>functions = [
>count('cat1) FILTER (WHERE (('gid = 1) AND 'max(id > 1))),
>count('(IF((value > 5), cat1, null))) FILTER (WHERE ('gid = 5)),
>count('cat2) FILTER (WHERE (('gid = 3) AND 'max(id > 2))),
>count('id) FILTER (WHERE ('gid = 2)),
>sum('value) FILTER (WHERE ('gid = 4))
>]
>output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 
> 'cat1_if_cnt_dist,
>  'id_cnt_dist, 'id_sum_dist])
>   Aggregate(
>  key = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, 
> 'gid]
>  functions = [max('id > 1), max('id > 2)]
>  output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 
> 'id, 'gid,
>'max(id > 1), 'max(id > 2)])
> Expand(
>projections = [
>  ('key, 'cat1, null, null, null, null, 1, ('id > 1), null),
>  ('key, null, null, null, null, 'id, 2, null, null),
>  ('key, null, null, 'cat2, null, null, 3, null, ('id > 2)),
>  ('key, null, 'value, null, null, null, 4, null, null),
>  ('key, null, null, null, if (('value > 5)) 'cat1 else null, null, 5, 
> null, null)
>]
>output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 
> 'id,
>  'gid, '(id > 1), '(id > 2)])
>   LocalTableScan [...]
> {noformat}
> After applying 'merged column' and 'bit vector' tricks, the logical plan will 
> become:
> {noformat}
> Aggregate(
>key = ['key]
>functions = [
>count('merged_string_1) FILTER (WHERE (('gid = 1) AND NOT 
> (('filter_vector_1 & 1) = 0)))
>count('merged_string_1) FILTER (WHERE ('gid = 1)),
>count(if (NOT (('if_vector_1 & 1) = 0)) 'merged_string_1 else null) 
> FILTER (WHERE ('gid = 1)),
>count('merged_string_1) FILTER (WHERE (('gid = 2) AND NOT 
> (('filter_vector_1 & 1) = 0))),
>count('merged_integer_1) FILTER (WHERE ('gid = 3)),
>sum('merged_integer_1) FILTER (WHERE ('gid = 4))
>]
>output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 
> 'cat1_if_cnt_dist,
>  'id_cnt_dist, 'id_sum_dist])
>   Aggregate(
>  key = ['key, 'merged_string_1, 'merged_integer_1, 'gid]
>  functions = [bit_or('if_vector_1),bit_or('filter_vector_1)]
>  output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 
> 'bit_or(if_vector_1), 'bit_or(filter_vector_1)])
> Expand(
>projections = [
>  ('key, 'cat1, null, 1, if ('value > 5) 1 else 0, if ('id > 1) 1 else 
> 0),
>  ('key, 'cat2, null, 2, null, if ('id > 2) 1 else 0),
>  ('key, null, 'id, 3, null, null),
>  ('key, null, 'value, 4, null, null)
>]
>output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 
> 'if_vector_1, 'filter_vector_1])
>   LocalTableScan [...]
> {noformat}
> 1. merged column: Children with same datatype from different aggregate 
> functions can share same project column (e.g. cat1, cat2).
> 2. bit vector: If multiple aggregate function children have conditional 
> expressions, these conditions will output one column when it is true, and 
> output null when it is false. The detail logic will be in 
> RewriteDistinctAggregates.groupDistinctAggExpr of the following github link. 
> Then these aggregate functions can share one row group, and store the results 
> of their respective conditional expressions in the bit vector column, 
> reducing the number of rows of data expansion (e.g. 

[jira] [Updated] (SPARK-37688) ExecutorMonitor should ignore SparkListenerBlockUpdated event if executor was not active

2021-12-19 Thread hujiahua (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiahua updated SPARK-37688:
-
Description: 
When a executor was not alive, and ExecutorMonitor received late 
SparkListenerBlockUpdated event. The `onBlockUpdated` hander will call 
`ensureExecutorIsTracked`, which will create a new executor tracker with 
UNKNOWN_RESOURCE_PROFILE_ID for the dead executor. And 
ExecutorAllocationManager will not remove executor with 
UNKNOWN_RESOURCE_PROFILE_ID, which cause a executor slot is occupied by the 
dead executor, so a new one cannot be created . 

The ExecutorAllocationManager log was like this:
21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
ExecutorAllocationManager: Not removing executor 34324 because the 
ResourceProfile was UNKNOWN!
21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
ExecutorAllocationManager: Not removing executor 34324 because the 
ResourceProfile was UNKNOWN!
21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
ExecutorAllocationManager: Not removing executor 34324 because the 
ResourceProfile was UNKNOWN!
21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
ExecutorAllocationManager: Not removing executor 34324 because the 
ResourceProfile was UNKNOWN!
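
A minimal sketch of the proposed guard, using only the public listener API (the 
class name and the bookkeeping below are illustrative assumptions, not the 
actual ExecutorMonitor patch):
{code:scala}
import scala.collection.mutable

import org.apache.spark.scheduler._

// Illustrative only: drop late block-updated events for executors that are not
// currently active, so a dead executor never gets a fresh tracker.
class IgnoreStaleBlockUpdates extends SparkListener {
  private val activeExecutors = mutable.Set[String]()
  // Blocks seen per active executor, as a stand-in for the real tracking state.
  private val trackedBlocks = mutable.Map[String, Int]().withDefaultValue(0)

  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit =
    activeExecutors += event.executorId

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit =
    activeExecutors -= event.executorId

  override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
    val executorId = event.blockUpdatedInfo.blockManagerId.executorId
    if (activeExecutors.contains(executorId)) {
      trackedBlocks(executorId) += 1
    }
    // A late event from a dead executor is simply ignored instead of
    // allocating a tracker with UNKNOWN_RESOURCE_PROFILE_ID for it.
  }
}

// Registered for illustration via: sc.addSparkListener(new IgnoreStaleBlockUpdates())
{code}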

  was:When a executor was not alive, and ExecutorMonitor received late 
SparkListenerBlockUpdated event. The `onBlockUpdated` hander will call 
`ensureExecutorIsTracked`, which will create a new executor tracker with 
UNKNOWN_RESOURCE_PROFILE_ID for the dead executor. And 
ExecutorAllocationManager will not remove executor with 
UNKNOWN_RESOURCE_PROFILE_ID, which cause a executor slot is occupied by the 
dead executor, so a new one cannot be created . 


> ExecutorMonitor should ignore SparkListenerBlockUpdated event if executor was 
> not active
> 
>
> Key: SPARK-37688
> URL: https://issues.apache.org/jira/browse/SPARK-37688
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: hujiahua
>Priority: Major
>
> When an executor is no longer alive and ExecutorMonitor receives a late 
> SparkListenerBlockUpdated event, the `onBlockUpdated` handler calls 
> `ensureExecutorIsTracked`, which creates a new executor tracker with 
> UNKNOWN_RESOURCE_PROFILE_ID for the dead executor. ExecutorAllocationManager 
> will not remove an executor with UNKNOWN_RESOURCE_PROFILE_ID, so an executor 
> slot stays occupied by the dead executor and a new one cannot be created. 
> The ExecutorAllocationManager log was like this:
> 21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
> ExecutorAllocationManager: Not removing executor 34324 because the 
> ResourceProfile was UNKNOWN!
> 21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
> ExecutorAllocationManager: Not removing executor 34324 because the 
> ResourceProfile was UNKNOWN!
> 21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
> ExecutorAllocationManager: Not removing executor 34324 because the 
> ResourceProfile was UNKNOWN!
> 21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
> ExecutorAllocationManager: Not removing executor 34324 because the 
> ResourceProfile was UNKNOWN!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37688) ExecutorMonitor should ignore SparkListenerBlockUpdated event if executor was not active

2021-12-19 Thread hujiahua (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiahua updated SPARK-37688:
-
Description: 
When an executor is no longer alive and `ExecutorMonitor` receives a late 
`SparkListenerBlockUpdated` event, the `onBlockUpdated` handler calls 
`ensureExecutorIsTracked`, which creates a new executor tracker with 
UNKNOWN_RESOURCE_PROFILE_ID for the dead executor. `ExecutorAllocationManager` 
will not remove an executor with UNKNOWN_RESOURCE_PROFILE_ID, so an executor 
slot stays occupied by the dead executor and a new one cannot be created. 

The ExecutorAllocationManager log was like this:
21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
ExecutorAllocationManager: Not removing executor 34324 because the 
ResourceProfile was UNKNOWN!
21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
ExecutorAllocationManager: Not removing executor 34324 because the 
ResourceProfile was UNKNOWN!
21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
ExecutorAllocationManager: Not removing executor 34324 because the 
ResourceProfile was UNKNOWN!
21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
ExecutorAllocationManager: Not removing executor 34324 because the 
ResourceProfile was UNKNOWN!

  was:
When a executor was not alive, and ExecutorMonitor received late 
SparkListenerBlockUpdated event. The `onBlockUpdated` hander will call 
`ensureExecutorIsTracked`, which will create a new executor tracker with 
UNKNOWN_RESOURCE_PROFILE_ID for the dead executor. And 
ExecutorAllocationManager will not remove executor with 
UNKNOWN_RESOURCE_PROFILE_ID, which cause a executor slot is occupied by the 
dead executor, so a new one cannot be created . 

The ExecutorAllocationManager log was like this:
21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
ExecutorAllocationManager: Not removing executor 34324 because the 
ResourceProfile was UNKNOWN!
21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
ExecutorAllocationManager: Not removing executor 34324 because the 
ResourceProfile was UNKNOWN!
21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
ExecutorAllocationManager: Not removing executor 34324 because the 
ResourceProfile was UNKNOWN!
21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
ExecutorAllocationManager: Not removing executor 34324 because the 
ResourceProfile was UNKNOWN!


> ExecutorMonitor should ignore SparkListenerBlockUpdated event if executor was 
> not active
> 
>
> Key: SPARK-37688
> URL: https://issues.apache.org/jira/browse/SPARK-37688
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: hujiahua
>Priority: Major
>
> When an executor is no longer alive and `ExecutorMonitor` receives a late 
> `SparkListenerBlockUpdated` event, the `onBlockUpdated` handler calls 
> `ensureExecutorIsTracked`, which creates a new executor tracker with 
> UNKNOWN_RESOURCE_PROFILE_ID for the dead executor. `ExecutorAllocationManager` 
> will not remove an executor with UNKNOWN_RESOURCE_PROFILE_ID, so an executor 
> slot stays occupied by the dead executor and a new one cannot be created. 
> The ExecutorAllocationManager log was like this:
> 21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
> ExecutorAllocationManager: Not removing executor 34324 because the 
> ResourceProfile was UNKNOWN!
> 21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
> ExecutorAllocationManager: Not removing executor 34324 because the 
> ResourceProfile was UNKNOWN!
> 21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
> ExecutorAllocationManager: Not removing executor 34324 because the 
> ResourceProfile was UNKNOWN!
> 21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] 
> ExecutorAllocationManager: Not removing executor 34324 because the 
> ResourceProfile was UNKNOWN!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37688) ExecutorMonitor should ignore SparkListenerBlockUpdated event if executor was not active

2021-12-19 Thread hujiahua (Jira)
hujiahua created SPARK-37688:


 Summary: ExecutorMonitor should ignore SparkListenerBlockUpdated 
event if executor was not active
 Key: SPARK-37688
 URL: https://issues.apache.org/jira/browse/SPARK-37688
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: hujiahua


When an executor is no longer alive and ExecutorMonitor receives a late 
SparkListenerBlockUpdated event, the `onBlockUpdated` handler calls 
`ensureExecutorIsTracked`, which creates a new executor tracker with 
UNKNOWN_RESOURCE_PROFILE_ID for the dead executor. ExecutorAllocationManager 
will not remove an executor with UNKNOWN_RESOURCE_PROFILE_ID, so an executor 
slot stays occupied by the dead executor and a new one cannot be created. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37687) Cleanup direct usage of OkHttpClient

2021-12-19 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-37687:
---

 Summary: Cleanup direct usage of OkHttpClient
 Key: SPARK-37687
 URL: https://issues.apache.org/jira/browse/SPARK-37687
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.3.0
Reporter: Yikun Jiang


- There are some problems (such as IPv6-based cluster support) with okhttpclient 
v3, but it is a little bit difficult to upgrade to v4 [1].

- The Kubernetes client is also considering supporting other HTTP clients [2] 
rather than only okhttpclient.

- The Kubernetes client added an abstraction layer [3] for the HTTP client, with 
the okhttp client as just one of the supported implementations.

So we should consider cleaning up direct okhttpclient usage and using the HTTP 
client that the Kubernetes client supports directly, to reduce the potential 
risk in future upgrades.

 

See also:

[1] [https://github.com/fabric8io/kubernetes-client/issues/2632]

[2][https://github.com/fabric8io/kubernetes-client/issues/3663#issuecomment-997402993]

[3] [https://github.com/fabric8io/kubernetes-client/issues/3547]

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37648) Spark catalog and Delta tables

2021-12-19 Thread Cheng Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462361#comment-17462361
 ] 

Cheng Pan edited comment on SPARK-37648 at 12/20/21, 2:42 AM:
--

We have a workaround for this issue in Apache Kyuubi (Incubating): 
[https://github.com/apache/incubator-kyuubi/pull/1476]

Kyuubi can be considered a more powerful Spark Thrift Server; it's worth a 
try.


was (Author: pan3793):
This issue has been fixed in Apache Kyuubi (Incubating), 
[https://github.com/apache/incubator-kyuubi/pull/1476]

Kyuubi can be considered as a more powerful Spark Thrift Server, it's worth a 
try.

> Spark catalog and Delta tables
> --
>
> Key: SPARK-37648
> URL: https://issues.apache.org/jira/browse/SPARK-37648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
> Environment: Spark version 3.1.2
> Scala version 2.12.10
> Hive version 2.3.7
> Delta version 1.0.0
>Reporter: Hanna Liashchuk
>Priority: Major
>
> I'm using Spark with Delta tables; while the tables are created, there are no 
> columns in the table.
> Steps to reproduce:
> 1. Start spark-shell 
> {code:java}
> spark-shell --conf 
> "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
> "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
>  --conf "spark.sql.legacy.parquet.int96RebaseModeInWrite=LEGACY"{code}
> 2. Create delta table
> {code:java}
> spark.range(10).write.format("delta").option("path", 
> "tmp/delta").saveAsTable("delta"){code}
> 3. Make sure table exists 
> {code:java}
> spark.catalog.listTables.show{code}
> 4. Find out that the columns are not listed
> {code:java}
> spark.catalog.listColumns("delta").show{code}
> This is critical for Delta integration with different BI tools such as Power 
> BI or Tableau, as they are querying spark catalog for the metadata and we are 
> getting errors that no columns are found. 
> Discussion can be found in Delta repository - 
> https://github.com/delta-io/delta/issues/695



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37648) Spark catalog and Delta tables

2021-12-19 Thread Cheng Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462361#comment-17462361
 ] 

Cheng Pan commented on SPARK-37648:
---

This issue has been fixed in Apache Kyuubi (Incubating): 
[https://github.com/apache/incubator-kyuubi/pull/1476]

Kyuubi can be considered a more powerful Spark Thrift Server; it's worth a 
try.

> Spark catalog and Delta tables
> --
>
> Key: SPARK-37648
> URL: https://issues.apache.org/jira/browse/SPARK-37648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
> Environment: Spark version 3.1.2
> Scala version 2.12.10
> Hive version 2.3.7
> Delta version 1.0.0
>Reporter: Hanna Liashchuk
>Priority: Major
>
> I'm using Spark with Delta tables; while the tables are created, there are no 
> columns in the table.
> Steps to reproduce:
> 1. Start spark-shell 
> {code:java}
> spark-shell --conf 
> "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
> "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
>  --conf "spark.sql.legacy.parquet.int96RebaseModeInWrite=LEGACY"{code}
> 2. Create delta table
> {code:java}
> spark.range(10).write.format("delta").option("path", 
> "tmp/delta").saveAsTable("delta"){code}
> 3. Make sure table exists 
> {code:java}
> spark.catalog.listTables.show{code}
> 4. Find out that the columns are not listed
> {code:java}
> spark.catalog.listColumns("delta").show{code}
> This is critical for Delta integration with different BI tools such as Power 
> BI or Tableau, as they are querying spark catalog for the metadata and we are 
> getting errors that no columns are found. 
> Discussion can be found in Delta repository - 
> https://github.com/delta-io/delta/issues/695



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37682) Reduce memory pressure of RewriteDistinctAggregates

2021-12-19 Thread Kevin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Liu updated SPARK-37682:
--
Description: 
In some cases, the current RewriteDistinctAggregates rule duplicates 
unnecessary input data across distinct groups.
This wastes a lot of memory and hurts performance.
We could apply 'merged column' and 'bit vector' tricks to alleviate the 
problem. For example:
{code:sql}
SELECT
  COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_filter_cnt_dist,
  COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_filter_cnt_dist,
  COUNT(DISTINCT IF(value > 5, cat1, null)) as cat1_if_cnt_dist,
  COUNT(DISTINCT id) as id_cnt_dist,
  SUM(DISTINCT value) as id_sum_dist
FROM data
GROUP BY key
{code}

Current rule will rewrite the above sql plan to the following (pseudo) logical 
plan:
{noformat}
Aggregate(
   key = ['key]
   functions = [
   count('cat1) FILTER (WHERE (('gid = 1) AND 'max(id > 1))),
   count('(IF((value > 5), cat1, null))) FILTER (WHERE ('gid = 5)),
   count('cat2) FILTER (WHERE (('gid = 3) AND 'max(id > 2))),
   count('id) FILTER (WHERE ('gid = 2)),
   sum('value) FILTER (WHERE ('gid = 4))
   ]
   output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 
'cat1_if_cnt_dist,
 'id_cnt_dist, 'id_sum_dist])
  Aggregate(
 key = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, 
'gid]
 functions = [max('id > 1), max('id > 2)]
 output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, 
'gid,
   'max(id > 1), 'max(id > 2)])
Expand(
   projections = [
 ('key, 'cat1, null, null, null, null, 1, ('id > 1), null),
 ('key, null, null, null, null, 'id, 2, null, null),
 ('key, null, null, 'cat2, null, null, 3, null, ('id > 2)),
 ('key, null, 'value, null, null, null, 4, null, null),
 ('key, null, null, null, if (('value > 5)) 'cat1 else null, null, 5, 
null, null)
   ]
   output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 
'id,
 'gid, '(id > 1), '(id > 2)])
  LocalTableScan [...]
{noformat}

After applying 'merged column' and 'bit vector' tricks, the logical plan will 
become:
{noformat}
Aggregate(
   key = ['key]
   functions = [
   count('merged_string_1) FILTER (WHERE (('gid = 1) AND NOT 
(('filter_vector_1 & 1) = 0)))
   count('merged_string_1) FILTER (WHERE ('gid = 1)),
   count(if (NOT (('if_vector_1 & 1) = 0)) 'merged_string_1 else null) 
FILTER (WHERE ('gid = 1)),
   count('merged_string_1) FILTER (WHERE (('gid = 2) AND NOT 
(('filter_vector_1 & 1) = 0))),
   count('merged_integer_1) FILTER (WHERE ('gid = 3)),
   sum('merged_integer_1) FILTER (WHERE ('gid = 4))
   ]
   output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 
'cat1_if_cnt_dist,
 'id_cnt_dist, 'id_sum_dist])
  Aggregate(
 key = ['key, 'merged_string_1, 'merged_integer_1, 'gid]
 functions = [bit_or('if_vector_1),bit_or('filter_vector_1)]
 output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 
'bit_or(if_vector_1), 'bit_or(filter_vector_1)])
Expand(
   projections = [
 ('key, 'cat1, null, 1, if ('value > 5) 1 else 0, if ('id > 1) 1 else 
0),
 ('key, 'cat2, null, 2, null, if ('id > 2) 1 else 0),
 ('key, null, 'id, 3, null, null),
 ('key, null, 'value, 4, null, null)
   ]
   output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 'if_vector_1, 
'filter_vector_1])
  LocalTableScan [...]
{noformat}

1. merged column: children with the same data type from different aggregate 
functions can share the same projected column (e.g. cat1, cat2).
2. bit vector: if multiple aggregate function children have conditional 
expressions, each condition is turned into a column that outputs 1 when the 
condition is true and null when it is false. The detailed logic is in 
RewriteDistinctAggregates.groupDistinctAggExpr of the linked GitHub change. 
These aggregate functions can then share one row group and store the results 
of their respective conditional expressions in the bit vector column, reducing 
the number of rows produced by the data expansion (e.g. cat1_filter_cnt_dist, 
cat1_if_cnt_dist).
If there are many similar DISTINCT aggregate functions, with or without FILTER 
clauses, these tricks can save a lot of memory and improve performance.
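
A rough, self-contained illustration of the 'merged column' + 'bit vector' idea, 
written against the public SQL API rather than the optimizer rule itself (the 
table layout and column names are assumed to mirror the example above, and the 
two cat1 aggregates stand in for the full query):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("bit-vector-sketch").getOrCreate()

// Tiny stand-in for the `data` table used in the example above.
spark.range(0, 10)
  .selectExpr("id % 3 AS key", "id", "CAST(id % 4 AS STRING) AS cat1", "id * 2 AS value")
  .createOrReplaceTempView("data")

// One shared projected column (cat1) plus a bit vector that encodes both
// conditions, folded per (key, cat1) group with BIT_OR and then tested with
// bit masks in the final aggregation.
spark.sql("""
  SELECT key,
         COUNT(DISTINCT IF((vec & 1) != 0, cat1, NULL)) AS cat1_filter_cnt_dist,
         COUNT(DISTINCT IF((vec & 2) != 0, cat1, NULL)) AS cat1_if_cnt_dist
  FROM (
    SELECT key, cat1,
           BIT_OR(IF(id > 1, 1, 0) | IF(value > 5, 2, 0)) AS vec
    FROM data
    GROUP BY key, cat1
  ) grouped
  GROUP BY key
""").show()
{code}
Both distinct counts over cat1 are answered from a single (key, cat1) row group; 
the BIT_OR-folded vector records which of the two conditions ever held in that 
group, which is the row reduction the proposed rule change is after.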

  was:
In some cases, current RewriteDistinctAggregates duplicates unnecessary input 
data in distinct groups.
This will cause a lot of waste of memory and affects performance.
We could apply 'merged column' and 'bit vector' tricks to alleviate the 
problem. For example:
{code:sql}
SELECT
  COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_filter_cnt_dist,
  COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_filter_cnt_dist,
  COUNT(DISTINCT IF(value > 5, cat1, null)) as cat1_if_cnt_dist,
  COUNT(DISTINCT id) as id_cnt_dist,
  SUM(DISTINCT value) as id_sum_dist
FROM 

[jira] [Updated] (SPARK-37682) Reduce memory pressure of RewriteDistinctAggregates

2021-12-19 Thread Kevin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Liu updated SPARK-37682:
--
Description: 
In some cases, the current RewriteDistinctAggregates rule duplicates 
unnecessary input data across distinct groups.
This wastes a lot of memory and hurts performance.
We could apply 'merged column' and 'bit vector' tricks to alleviate the 
problem. For example:
{code:sql}
SELECT
  COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_filter_cnt_dist,
  COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_filter_cnt_dist,
  COUNT(DISTINCT IF(value > 5, cat1, null)) as cat1_if_cnt_dist,
  COUNT(DISTINCT id) as id_cnt_dist,
  SUM(DISTINCT value) as id_sum_dist
FROM data
GROUP BY key
{code}

Current rule will rewrite the above sql plan to the following (pseudo) logical 
plan:
{noformat}
Aggregate(
   key = ['key]
   functions = [
   count('cat1) FILTER (WHERE (('gid = 1) AND 'max(id > 1))),
   count('(IF((value > 5), cat1, null))) FILTER (WHERE ('gid = 5)),
   count('cat2) FILTER (WHERE (('gid = 3) AND 'max(id > 2))),
   count('id) FILTER (WHERE ('gid = 2)),
   sum('value) FILTER (WHERE ('gid = 4))
   ]
   output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 
'cat1_if_cnt_dist,
 'id_cnt_dist, 'id_sum_dist])
  Aggregate(
 key = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, 
'gid]
 functions = [max('id > 1), max('id > 2)]
 output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, 
'gid,
   'max(id > 1), 'max(id > 2)])
Expand(
   projections = [
 ('key, 'cat1, null, null, null, null, 1, ('id > 1), null),
 ('key, null, null, null, null, 'id, 2, null, null),
 ('key, null, null, 'cat2, null, null, 3, null, ('id > 2)),
 ('key, null, 'value, null, null, null, 4, null, null),
 ('key, null, null, null, if (('value > 5)) 'cat1 else null, null, 5, 
null, null)
   ]
   output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 
'id,
 'gid, '(id > 1), '(id > 2)])
  LocalTableScan [...]
{noformat}

After applying 'merged column' and 'bit vector' tricks, the logical plan will 
become:
{noformat}
Aggregate(
   key = ['key]
   functions = [
   count('merged_string_1) FILTER (WHERE (('gid = 1) AND NOT 
(('filter_vector_1 & 1) = 0)))
   count('merged_string_1) FILTER (WHERE ('gid = 1)),
   count(if (NOT (('if_vector_1 & 1) = 0)) 'merged_string_1 else null) 
FILTER (WHERE ('gid = 1)),
   count('merged_string_1) FILTER (WHERE (('gid = 2) AND NOT 
(('filter_vector_1 & 1) = 0))),
   count('merged_integer_1) FILTER (WHERE ('gid = 3)),
   sum('merged_integer_1) FILTER (WHERE ('gid = 4))
   ]
   output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 
'cat1_if_cnt_dist,
 'id_cnt_dist, 'id_sum_dist])
  Aggregate(
 key = ['key, 'merged_string_1, 'merged_integer_1, 'gid]
 functions = [bit_or('if_vector_1),bit_or('filter_vector_1)]
 output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 
'bit_or(if_vector_1), 'bit_or(filter_vector_1)])
Expand(
   projections = [
 ('key, 'cat1, null, 1, if ('value > 5) 1 else 0, if ('id > 1) 1 else 
0),
 ('key, 'cat2, null, 2, null, if ('id > 2) 1 else 0),
 ('key, null, 'id, 3, null, null),
 ('key, null, 'value, 4, null, null)
   ]
   output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 'if_vector_1, 
'filter_vector_1])
  LocalTableScan [...]
{noformat}

1. merged column: children with the same data type from different aggregate 
functions can share the same projected column (e.g. cat1, cat2).
2. bit vector: if multiple aggregate function children have conditional 
expressions, each condition is turned into a column that outputs 1 when the 
condition is true and null when it is false. The detailed logic is in 
RewriteDistinctAggregates.groupDistinctAggExpr of the linked GitHub change. 
These aggregate functions can then share one row group and store the results 
of their respective conditional expressions in the bit vector column, reducing 
the number of rows produced by the data expansion (e.g. cat1_filter_cnt_dist, 
cat1_if_cnt_dist).
If there are many similar DISTINCT aggregate functions, with or without FILTER 
clauses, these tricks can save a lot of memory and improve performance.

  was:
In some cases, current RewriteDistinctAggregates duplicates unnecessary input 
data in distinct groups.
This will cause a lot of waste of memory and affects performance.
We could apply 'merged column' and 'bit vector' tricks to alleviate the 
problem. For example:
{code:sql}
SELECT
  COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_filter_cnt_dist,
  COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_filter_cnt_dist,
  COUNT(DISTINCT IF(value > 5, cat1, null)) as cat1_if_cnt_dist,
  COUNT(DISTINCT id) as id_cnt_dist,
  SUM(DISTINCT value) as id_sum_dist
FROM data

[jira] [Commented] (SPARK-37686) Migrate remaining pyspark.sql.functions to _invoke_* style

2021-12-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462144#comment-17462144
 ] 

Apache Spark commented on SPARK-37686:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/34951

> Migrate remaining pyspark.sql.functions to _invoke_* style
> --
>
> Key: SPARK-37686
> URL: https://issues.apache.org/jira/browse/SPARK-37686
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> In SPARK-32084 we converted a number of dynamically created functions to 
> standard defs using the newly added {{_invoke_function}} helpers. The 
> remaining functions stayed untouched. 
> These helpers allow us to implement such functions with a minimal amount of 
> boilerplate code.
> With the recent migration of type hints to the inline variant, we had to add 
> type checker hints:
> {code:python}
> sc = SparkContext._active_spark_context
> assert sc is not None and sc._jvm is not None
> {code}
> to the functions not covered in SPARK-32084. This, added to the standard 
> conversions (i.e. {{_to_java_column}}, {{Column(jc)}}), makes these pretty 
> verbose.
> However, it is possible to convert the remaining functions using 
> {{_invoke_function}} helpers to address that.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37686) Migrate remaining pyspark.sql.functions to _invoke_* style

2021-12-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37686:


Assignee: (was: Apache Spark)

> Migrate remaining pyspark.sql.functions to _invoke_* style
> --
>
> Key: SPARK-37686
> URL: https://issues.apache.org/jira/browse/SPARK-37686
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> In SPARK-32084 we converted a number of dynamically created functions to 
> standard defs using the newly added {{_invoke_function}} helpers. The 
> remaining functions stayed untouched. 
> These helpers allow us to implement such functions with a minimal amount of 
> boilerplate code.
> With the recent migration of type hints to the inline variant, we had to add 
> type checker hints:
> {code:python}
> sc = SparkContext._active_spark_context
> assert sc is not None and sc._jvm is not None
> {code}
> to the functions not covered in SPARK-32084. This, added to the standard 
> conversions (i.e. {{_to_java_column}}, {{Column(jc)}}), makes these pretty 
> verbose.
> However, it is possible to convert the remaining functions using 
> {{_invoke_function}} helpers to address that.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37686) Migrate remaining pyspark.sql.functions to _invoke_* style

2021-12-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37686:


Assignee: Apache Spark

> Migrate remaining pyspark.sql.functions to _invoke_* style
> --
>
> Key: SPARK-37686
> URL: https://issues.apache.org/jira/browse/SPARK-37686
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> In SPARK-32084 we converted a number of dynamically created functions to 
> standard defs using the newly added {{_invoke_function}} helpers. The 
> remaining functions stayed untouched. 
> These helpers allow us to implement such functions with a minimal amount of 
> boilerplate code.
> With the recent migration of type hints to the inline variant, we had to add 
> type checker hints:
> {code:python}
> sc = SparkContext._active_spark_context
> assert sc is not None and sc._jvm is not None
> {code}
> to the functions not covered in SPARK-32084. This, added to the standard 
> conversions (i.e. {{_to_java_column}}, {{Column(jc)}}), makes these pretty 
> verbose.
> However, it is possible to convert the remaining functions using 
> {{_invoke_function}} helpers to address that.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37686) Migrare remaining pyspark.sql.functions to _invoke_* style

2021-12-19 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-37686:
--

 Summary: Migrare remaining pyspark.sql.functions to _invoke_* style
 Key: SPARK-37686
 URL: https://issues.apache.org/jira/browse/SPARK-37686
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.3.0
Reporter: Maciej Szymkiewicz


In SPARK-32084 we converted a number of dynamically created functions to 
standard defs using the newly added {{_invoke_function}} helpers. The remaining 
functions stayed untouched. 

These helpers allow us to implement such functions with a minimal amount of 
boilerplate code.

With the recent migration of type hints to the inline variant, we had to add 
type checker hints:

{code:python}
sc = SparkContext._active_spark_context
assert sc is not None and sc._jvm is not None
{code}

to the functions not covered in SPARK-32084. This, added to the standard 
conversions (i.e. {{_to_java_column}}, {{Column(jc)}}), makes these pretty 
verbose.

However, it is possible to convert the remaining functions using 
{{_invoke_function}} helpers to address that.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37686) Migrate remaining pyspark.sql.functions to _invoke_* style

2021-12-19 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-37686:
---
Summary: Migrate remaining pyspark.sql.functions to _invoke_* style  (was: 
Migrare remaining pyspark.sql.functions to _invoke_* style)

> Migrate remaining pyspark.sql.functions to _invoke_* style
> --
>
> Key: SPARK-37686
> URL: https://issues.apache.org/jira/browse/SPARK-37686
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> In SPARK-32084 we converted a number of dynamically created functions to 
> standard defs using the newly added {{_invoke_function}} helpers. The 
> remaining functions stayed untouched. 
> These helpers allow us to implement such functions with a minimal amount of 
> boilerplate code.
> With the recent migration of type hints to the inline variant, we had to add 
> type checker hints:
> {code:python}
> sc = SparkContext._active_spark_context
> assert sc is not None and sc._jvm is not None
> {code}
> to the functions not covered in SPARK-32084. This, added to the standard 
> conversions (i.e. {{_to_java_column}}, {{Column(jc)}}), makes these pretty 
> verbose.
> However, it is possible to convert the remaining functions using 
> {{_invoke_function}} helpers to address that.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37682) Reduce memory pressure of RewriteDistinctAggregates

2021-12-19 Thread Kevin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Liu updated SPARK-37682:
--
Priority: Minor  (was: Major)

> Reduce memory pressure of RewriteDistinctAggregates
> ---
>
> Key: SPARK-37682
> URL: https://issues.apache.org/jira/browse/SPARK-37682
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kevin Liu
>Priority: Minor
>  Labels: performance
>
> In some cases, the current RewriteDistinctAggregates rule duplicates 
> unnecessary input data across distinct groups.
> This wastes a lot of memory and hurts performance.
> We could apply 'merged column' and 'bit vector' tricks to alleviate the 
> problem. For example:
> {code:sql}
> SELECT
>   COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_filter_cnt_dist,
>   COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_filter_cnt_dist,
>   COUNT(DISTINCT IF(value > 5, cat1, null)) as cat1_if_cnt_dist,
>   COUNT(DISTINCT id) as id_cnt_dist,
>   SUM(DISTINCT value) as id_sum_dist
> FROM data
> GROUP BY key
> {code}
> Current rule will rewrite the above sql plan to the following (pseudo) 
> logical plan:
> {noformat}
> Aggregate(
>key = ['key]
>functions = [
>count('cat1) FILTER (WHERE (('gid = 1) AND 'max(id > 1))),
>count('(IF((value > 5), cat1, null))) FILTER (WHERE ('gid = 5)),
>count('cat2) FILTER (WHERE (('gid = 3) AND 'max(id > 2))),
>count('id) FILTER (WHERE ('gid = 2)),
>sum('value) FILTER (WHERE ('gid = 4))
>]
>output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 
> 'cat1_if_cnt_dist,
>  'id_cnt_dist, 'id_sum_dist])
>   Aggregate(
>  key = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, 
> 'gid]
>  functions = [max('id > 1), max('id > 2)]
>  output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 
> 'id, 'gid,
>'max(id > 1), 'max(id > 2)])
> Expand(
>projections = [
>  ('key, 'cat1, null, null, null, null, 1, ('id > 1), null),
>  ('key, null, null, null, null, 'id, 2, null, null),
>  ('key, null, null, 'cat2, null, null, 3, null, ('id > 2)),
>  ('key, null, 'value, null, null, null, 4, null, null),
>  ('key, null, null, null, if (('value > 5)) 'cat1 else null, null, 5, 
> null, null)
>]
>output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 
> 'id,
>  'gid, '(id > 1), '(id > 2)])
>   LocalTableScan [...]
> {noformat}
> After applying 'merged column' and 'bit vector' tricks, the logical plan will 
> become:
> {noformat}
> Aggregate(
>key = ['key]
>functions = [
>count('merged_string_1) FILTER (WHERE (('gid = 1) AND NOT (('vector_1 
> & 1) = 0))),
>count('merged_string_1) FILTER (WHERE (('gid = 1) AND NOT (('vector_1 
> & 2) = 0))),
>count('merged_string_1) FILTER (WHERE (('gid = 2) AND NOT (('vector_1 
> & 1) = 0))),
>count('merged_integer_1) FILTER (WHERE ('gid = 3)),
>sum('merged_integer_1) FILTER (WHERE ('gid = 4))
>]
>output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 
> 'cat1_if_cnt_dist,
>  'id_cnt_dist, 'id_sum_dist])
>   Aggregate(
>  key = ['key, 'merged_string_1, 'merged_integer_1, 'gid]
>  functions = [bit_or('vector_1)]
>  output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 
> 'bit_or(vector_1)])
> Expand(
>projections = [
>  ('key, 'cat1, null, 1, (if (('id > 1)) 1 else 0 | if (('value > 5)) 
> 2 else 0)),
>  ('key, 'cat2, null, 2, if (('id > 2)) 1 else 0),
>  ('key, null, 'id, 3, null),
>  ('key, null, 'value, 4, null)
>]
>output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 'vector_1])
>   LocalTableScan [...]
> {noformat}
> 1. merged column: children with the same data type from different aggregate 
> functions can share the same projected column (e.g. cat1, cat2).
> 2. bit vector: if multiple aggregate function children have conditional 
> expressions, each condition is turned into a column that outputs 1 when the 
> condition is true and null when it is false. The detailed logic is in 
> RewriteDistinctAggregates.groupDistinctAggExpr of the linked GitHub change. 
> These aggregate functions can then share one row group and store the results 
> of their respective conditional expressions in the bit vector column, 
> reducing the number of rows produced by the data expansion (e.g. 
> cat1_filter_cnt_dist, cat1_if_cnt_dist).
> If there are many similar DISTINCT aggregate functions, with or without FILTER 
> clauses, these tricks can save a lot of memory and improve performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (SPARK-37588) lot of strings get accumulated in the heap dump of spark thrift server

2021-12-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462105#comment-17462105
 ] 

Hyukjin Kwon commented on SPARK-37588:
--

[~rkchilaka] can we have a minimized self-contained reproducer? At least you 
could remove the configurations one by one and see if the issue still persists.

> lot of strings get accumulated in the heap dump of spark thrift server
> --
>
> Key: SPARK-37588
> URL: https://issues.apache.org/jira/browse/SPARK-37588
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
> Environment: Open JDK (8 build 1.8.0_312-b07) and scala 2.12
> OS: Red Hat Enterprise Linux 8.4 (Ootpa), platform:el8
>Reporter: ramakrishna chilaka
>Priority: Major
> Attachments: screenshot-1.png
>
>
> I am starting spark thrift server using the following options
> ```
> /data/spark/sbin/start-thriftserver.sh --master spark://*:7077 --conf 
> "spark.cores.max=320" --conf "spark.executor.cores=3" --conf 
> "spark.driver.cores=15" --executor-memory=10G --driver-memory=50G --conf 
> spark.sql.adaptive.coalescePartitions.enabled=true --conf 
> spark.sql.adaptive.skewJoin.enabled=true --conf spark.sql.cbo.enabled=true 
> --conf spark.sql.adaptive.enabled=true --conf spark.rpc.io.serverThreads=64 
> --conf "spark.driver.maxResultSize=4G" --conf 
> "spark.max.fetch.failures.per.stage=10" --conf 
> "spark.sql.thriftServer.incrementalCollect=false" --conf 
> "spark.ui.reverseProxy=true" --conf "spark.ui.reverseProxyUrl=/spark_ui" 
> --conf "spark.sql.autoBroadcastJoinThreshold=1073741824" --conf 
> spark.sql.thriftServer.interruptOnCancel=true --conf 
> spark.sql.thriftServer.queryTimeout=0 --hiveconf 
> hive.server2.transport.mode=http --hiveconf 
> hive.server2.thrift.http.path=spark_sql --hiveconf 
> hive.server2.thrift.min.worker.threads=500 --hiveconf 
> hive.server2.thrift.max.worker.threads=2147483647 --hiveconf 
> hive.server2.thrift.http.cookie.is.secure=false --hiveconf 
> hive.server2.thrift.http.cookie.auth.enabled=false --hiveconf 
> hive.server2.authentication=NONE --hiveconf hive.server2.enable.doAs=false 
> --hiveconf spark.sql.hive.thriftServer.singleSession=true --hiveconf 
> hive.server2.thrift.bind.host=0.0.0.0 --conf 
> "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
> "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
>  --conf "spark.sql.cbo.joinReorder.enabled=true" --conf 
> "spark.sql.optimizer.dynamicPartitionPruning.enabled=true" --conf 
> "spark.worker.cleanup.enabled=true" --conf 
> "spark.worker.cleanup.appDataTtl=3600" --hiveconf 
> hive.exec.scratchdir=/data/spark_scratch/hive --hiveconf 
> hive.exec.local.scratchdir=/data/spark_scratch/local_scratch_dir --hiveconf 
> hive.download.resources.dir=/data/spark_scratch/hive.downloaded.resources.dir 
> --hiveconf hive.querylog.location=/data/spark_scratch/hive.querylog.location 
> --conf spark.executor.extraJavaOptions="-XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps" --conf spark.driver.extraJavaOptions="-verbose:gc 
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps 
> -Xloggc:/data/thrift_driver_gc.log -XX:+ExplicitGCInvokesConcurrent 
> -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 
> -XX:AdaptiveSizePolicyWeight=90 -XX:MaxRAM=55g" --hiveconf 
> "hive.server2.session.check.interval=6" --hiveconf 
> "hive.server2.idle.session.timeout=90" --hiveconf 
> "hive.server2.idle.session.check.operation=true" --conf 
> "spark.eventLog.enabled=false" --conf 
> "spark.cleaner.periodicGC.interval=5min" --conf 
> "spark.appStateStore.asyncTracking.enable=false" --conf 
> "spark.ui.retainedJobs=30" --conf "spark.ui.retainedStages=100" --conf 
> "spark.ui.retainedTasks=500" --conf "spark.sql.ui.retainedExecutions=10" 
> --conf "spark.ui.retainedDeadExecutors=10" --conf 
> "spark.worker.ui.retainedExecutors=10" --conf 
> "spark.worker.ui.retainedDrivers=10" --conf spark.ui.enabled=false --conf 
> spark.stage.maxConsecutiveAttempts=10 --conf spark.executor.memoryOverhead=1G 
> --conf "spark.io.compression.codec=snappy" --conf 
> "spark.default.parallelism=640" --conf spark.memory.offHeap.enabled=true 
> --conf "spark.memory.offHeap.size=3g" --conf "spark.memory.fraction=0.75" 
> --conf "spark.memory.storageFraction=0.75"
> ```
> the java heap dump after heavy usage is as follows
> ```
>  1:  50465861 9745837152  [C
>2:  23337896 1924089944  [Ljava.lang.Object;
>3:  72524905 1740597720  java.lang.Long
>4:  50463694 1614838208  java.lang.String
>5:  22718029  726976928  
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
>6:   2259416  343483328  

[jira] [Commented] (SPARK-37640) rolled event log still need be clean after compact

2021-12-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462104#comment-17462104
 ] 

Hyukjin Kwon commented on SPARK-37640:
--

[~m-sir] mind filling in the JIRA description, please?

> rolled event log still need be clean after compact
> --
>
> Key: SPARK-37640
> URL: https://issues.apache.org/jira/browse/SPARK-37640
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: muhong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37648) Spark catalog and Delta tables

2021-12-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37648:
-
Component/s: SQL
 (was: Spark Core)

> Spark catalog and Delta tables
> --
>
> Key: SPARK-37648
> URL: https://issues.apache.org/jira/browse/SPARK-37648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
> Environment: Spark version 3.1.2
> Scala version 2.12.10
> Hive version 2.3.7
> Delta version 1.0.0
>Reporter: Hanna Liashchuk
>Priority: Major
>
> I'm using Spark with Delta tables; while the tables are created, there are no 
> columns in the table.
> Steps to reproduce:
> 1. Start spark-shell 
> {code:java}
> spark-shell --conf 
> "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
> "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
>  --conf "spark.sql.legacy.parquet.int96RebaseModeInWrite=LEGACY"{code}
> 2. Create delta table
> {code:java}
> spark.range(10).write.format("delta").option("path", 
> "tmp/delta").saveAsTable("delta"){code}
> 3. Make sure table exists 
> {code:java}
> spark.catalog.listTables.show{code}
> 4. Find out that the columns are not listed
> {code:java}
> spark.catalog.listColumns("delta").show{code}
> This is critical for Delta integration with different BI tools such as Power 
> BI or Tableau, as they are querying spark catalog for the metadata and we are 
> getting errors that no columns are found. 
> Discussion can be found in Delta repository - 
> https://github.com/delta-io/delta/issues/695



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37660) Spark-3.2.0 Fetch Hbase Data not working

2021-12-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462103#comment-17462103
 ] 

Hyukjin Kwon commented on SPARK-37660:
--

Mind showing actual output? There have been almost zero changes made to this 
code path in Apache Spark between 3.1 and 3.2.

> Spark-3.2.0 Fetch Hbase Data not working
> 
>
> Key: SPARK-37660
> URL: https://issues.apache.org/jira/browse/SPARK-37660
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
> Environment: Hadoop version : hadoop-2.9.2
> HBase version : hbase-2.2.5
> Spark version : spark-3.2.0-bin-without-hadoop
> java version : jdk1.8.0_151
> scala version : scala-sdk-2.12.10
> os version : Red Hat Enterprise Linux Server release 6.6 (Santiago)
>Reporter: Bhavya Raj Sharma
>Priority: Major
>
> Below is a sample code snippet that is used to fetch data from HBase. This 
> used to work fine with spark-3.1.1.
> However, after upgrading to spark-3.2.0 it is not working. The issue is that 
> it does not throw any exception; it just doesn't fill the RDD.
>  
> {code:java}
>  
>    def getInfo(sc: SparkContext, startDate:String, cachingValue: Int, 
> sparkLoggerParams: SparkLoggerParams, zkIP: String, zkPort: String): 
> RDD[(String)] = {
>     val scan = new Scan
>     scan.addFamily(Bytes.toBytes("family"))
>     scan.addColumn(Bytes.toBytes("family"), Bytes.toBytes("time"))
>     val rdd = getHbaseConfiguredRDDFromScan(sc, zkIP, zkPort, "myTable", 
> scan, cachingValue, sparkLoggerParams)
>     val output: RDD[(String)] = rdd.map { row =>
>       (Bytes.toString(row._2.getRow))
>     }
>     output
>   }
>  
> def getHbaseConfiguredRDDFromScan(sc: SparkContext, zkIP: String, zkPort: 
> String, tableName: String,
>                                     scan: Scan, cachingValue: Int, 
> sparkLoggerParams: SparkLoggerParams): NewHadoopRDD[ImmutableBytesWritable, 
> Result] = {
>     scan.setCaching(cachingValue)
>     val scanString = 
> Base64.getEncoder.encodeToString(org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(scan).toByteArray)
>     val hbaseContext = new SparkHBaseContext(zkIP, zkPort)
>     val hbaseConfig = hbaseContext.getConfiguration()
>     hbaseConfig.set(TableInputFormat.INPUT_TABLE, tableName)
>     hbaseConfig.set(TableInputFormat.SCAN, scanString)
>     sc.newAPIHadoopRDD(
>       hbaseConfig,
>       classOf[TableInputFormat],
>       classOf[ImmutableBytesWritable], classOf[Result]
>     ).asInstanceOf[NewHadoopRDD[ImmutableBytesWritable, Result]]
>   }
>  
> {code}
>  
> If we fetch using the scan directly, without using newAPIHadoopRDD, it works.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37667) Spark throws TreeNodeException ("Couldn't find gen_alias") during wildcard column expansion

2021-12-19 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462102#comment-17462102
 ] 

Hyukjin Kwon commented on SPARK-37667:
--

quick question, does it fail in Spark 3.2 too?

> Spark throws TreeNodeException ("Couldn't find gen_alias") during wildcard 
> column expansion
> ---
>
> Key: SPARK-37667
> URL: https://issues.apache.org/jira/browse/SPARK-37667
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kellan B Cummings
>Priority: Major
>
> I'm seeing a TreeNodeException ("Couldn't find {_}gen_alias{_}") when running 
> certain operations in Spark 3.1.2.
> A few conditions need to be met to trigger the bug:
>  - a DF with a nested struct joins to a second DF
>  - a filter that compares a column in the right DF to a column in the left DF
>  - wildcard column expansion of the nested struct
>  - a group by statement on a struct column
> *Data*
> g...@github.com:kellanburket/spark3bug.git
>  
> {code:java}
> val rightDf = spark.read.parquet("right.parquet")
> val leftDf = spark.read.parquet("left.parquet"){code}
>  
> *Schemas*
> {code:java}
> leftDf.printSchema()
> root
>  |-- row: struct (nullable = true)
>  |    |-- mid: string (nullable = true)
>  |    |-- start: struct (nullable = true)
>  |    |    |-- latitude: double (nullable = true)
>  |    |    |-- longitude: double (nullable = true)
>  |-- s2_cell_id: long (nullable = true){code}
> {code:java}
> rightDf.printSchema()
> root
>  |-- id: string (nullable = true)
>  |-- s2_cell_id: long (nullable = true){code}
>  
> *Breaking Code*
> {code:java}
> leftDf.join(rightDf, "s2_cell_id").filter(
>     "id != row.start.latitude"
> ).select(
>    col("row.*"), col("id")
> ).groupBy(
>     "start"
> ).agg(
>    min("id")
> ).show(){code}
>  
> *Working Examples*
> The following examples don't seem to be affected by the bug
> Works without group by:
> {code:java}
> leftDf.join(rightDf, "s2_cell_id").filter(
>     "id != row.start.latitude"
> ).select(
>    col("row.*"), col("id")
> ).show(){code}
> Works without filter
> {code:java}
> leftDf.join(rightDf, "s2_cell_id").select(
>    col("row.*"), col("id")
> ).groupBy(
>     "start"
> ).agg(
>    min("id")
> ).show(){code}
> Works without wildcard expansion
> {code:java}
> leftDf.join(rightDf, "s2_cell_id").filter(
>     "id != row.start.latitude"
> ).select(
>    col("row.start"), col("id")
> ).groupBy(
>     "start"
> ).agg(
>    min("id")
> ).show(){code}
> Works with caching
> {code:java}
> leftDf.join(rightDf, "s2_cell_id").filter(
>     "id != row.start.latitude"
> ).cache().select(
>    col("row.*"),
>    col("id")
> ).groupBy(
>     "start"
> ).agg(
>    min("id")
> ).show(){code}
> *Error message*
>  
>  
> {code:java}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange hashpartitioning(start#2116, 1024), ENSURE_REQUIREMENTS, [id=#3849]
> +- SortAggregate(key=[knownfloatingpointnormalized(if (isnull(start#2116)) 
> null else named_struct(latitude, 
> knownfloatingpointnormalized(normalizenanandzero(start#2116.latitude)), 
> longitude, 
> knownfloatingpointnormalized(normalizenanandzero(start#2116.longitude AS 
> start#2116], functions=[partial_min(id#2103)], output=[start#2116, min#2138])
>    +- *(2) Sort [knownfloatingpointnormalized(if (isnull(start#2116)) null 
> else named_struct(latitude, 
> knownfloatingpointnormalized(normalizenanandzero(start#2116.latitude)), 
> longitude, 
> knownfloatingpointnormalized(normalizenanandzero(start#2116.longitude AS 
> start#2116 ASC NULLS FIRST], false, 0
>       +- *(2) Project [_gen_alias_2133#2133 AS start#2116, id#2103]
>          +- *(2) !BroadcastHashJoin [s2_cell_id#2108L], [s2_cell_id#2104L], 
> Inner, BuildLeft, NOT (cast(id#2103 as double) = _gen_alias_2134#2134), false
>             :- BroadcastQueryStage 0
>             :  +- BroadcastExchange HashedRelationBroadcastMode(List(input[1, 
> bigint, false]),false), [id=#3768]
>             :     +- *(1) Project [row#2107.start AS _gen_alias_2133#2133, 
> s2_cell_id#2108L]
>             :        +- *(1) Filter isnotnull(s2_cell_id#2108L)
>             :           +- FileScan parquet [row#2107,s2_cell_id#2108L] 
> Batched: false, DataFilters: [isnotnull(s2_cell_id#2108L)], Format: Parquet, 
> Location: InMemoryFileIndex[s3://co.mira.public/spark3_bug/left], 
> PartitionFilters: [], PushedFilters: [IsNotNull(s2_cell_id)], ReadSchema: 
> struct>,s2_cell_id:bigint>
>             +- *(2) Filter (isnotnull(id#2103) AND 
> isnotnull(s2_cell_id#2104L))
>                +- *(2) ColumnarToRow
>                   +- FileScan parquet [id#2103,s2_cell_id#2104L] Batched: 
> true, DataFilters: [isnotnull(id#2103), isnotnull(s2_cell_id#2104L)], 

[jira] [Assigned] (SPARK-37685) Make log event immutable for LogAppender

2021-12-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37685:


Assignee: L. C. Hsieh

> Make log event immutable for LogAppender
> 
>
> Key: SPARK-37685
> URL: https://issues.apache.org/jira/browse/SPARK-37685
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Log4j2 will reuse log events. In Spark tests, we have LogAppender, which 
> records log events and checks them later. So sometimes an event will be 
> changed and cause a test failure.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37685) Make log event immutable for LogAppender

2021-12-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37685.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34949
[https://github.com/apache/spark/pull/34949]

> Make log event immutable for LogAppender
> 
>
> Key: SPARK-37685
> URL: https://issues.apache.org/jira/browse/SPARK-37685
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.3.0
>
>
> Log4j2 will reuse log events. In Spark tests, we have LogAppender, which 
> records log events and checks them later. So sometimes an event will be 
> changed and cause a test failure.
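
A rough sketch of the kind of test-appender change this implies (assumptions: 
Log4j2 2.x core API; this is only an illustration, not necessarily what the 
linked pull request does):
{code:scala}
import scala.collection.mutable.ArrayBuffer

import org.apache.logging.log4j.core.LogEvent
import org.apache.logging.log4j.core.appender.AbstractAppender
import org.apache.logging.log4j.core.config.Property

// Illustrative only: keep an immutable snapshot of every event, because Log4j2
// may recycle the mutable LogEvent instance after append() returns.
class SnapshotAppender
  extends AbstractAppender("snapshot", null, null, true, Property.EMPTY_ARRAY) {

  val events = new ArrayBuffer[LogEvent]()

  override def append(event: LogEvent): Unit = synchronized {
    events += event.toImmutable()  // copy before the mutable event is reused
  }
}
{code}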



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org