[jira] [Updated] (SPARK-37690) Recursive view `df` detected (cycle: `df` -> `df`)

[ https://issues.apache.org/jira/browse/SPARK-37690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin updated SPARK-37690:
--------------------------
    Description:

In Spark 3.2.0, you can no longer reuse the same name for a temporary view. This change is backwards incompatible and breaks a common way of running pipelines of SQL queries.

The following is a simple reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0:

{code:python}
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

sql = """
SELECT id AS col_1, rand() AS col_2
FROM RANGE(10);
"""
df = spark.sql(sql)
df.createOrReplaceTempView("df")

sql = """
SELECT * FROM df
"""
df = spark.sql(sql)
df.createOrReplaceTempView("df")

sql = """
SELECT * FROM df
"""
df = spark.sql(sql)
{code}

The following error is now produced:

{code:python}
AnalysisException: Recursive view `df` detected (cycle: `df` -> `df`)
{code}

I'm reasonably sure this change is unintentional in 3.2.0, since it breaks a lot of legacy code, and the `createOrReplaceTempView` method is named explicitly such that replacing an existing view should be allowed. An internet search suggests other users have run into similar problems, e.g. [here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]

> Recursive view `df` detected (cycle: `df` -> `df`)
> --------------------------------------------------
>
>                 Key: SPARK-37690
>                 URL: https://issues.apache.org/jira/browse/SPARK-37690
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Robin
>            Priority: Major

--
This message was sent by Atlassian Jira (v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
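Until this is resolved, one possible workaround (a sketch only; `step_view_names` is a hypothetical helper, not part of any Spark API) is to register each pipeline stage under a fresh view name, so that no view definition ever references its own name:

```python
# Workaround sketch: give every pipeline step a distinct temp-view name
# instead of re-registering the same name "df" at each step.

def step_view_names(base, n_steps):
    """Generate distinct temp-view names: base_0, base_1, ..."""
    return [f"{base}_{i}" for i in range(n_steps)]

names = step_view_names("df", 3)
print(names)  # ['df_0', 'df_1', 'df_2']

# In PySpark this would be used roughly as:
#   df = spark.sql("SELECT id AS col_1, rand() AS col_2 FROM RANGE(10)")
#   df.createOrReplaceTempView(names[0])
#   df = spark.sql(f"SELECT * FROM {names[0]}")
#   df.createOrReplaceTempView(names[1])
#   ...
```

Since each view name is used only once, the analyzer never sees a view whose definition refers to itself, and the recursive-view check is not triggered.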
[jira] [Created] (SPARK-37690) Recursive view `df` detected (cycle: `df` -> `df`)

Robin created SPARK-37690:
--------------------------
             Summary: Recursive view `df` detected (cycle: `df` -> `df`)
                 Key: SPARK-37690
                 URL: https://issues.apache.org/jira/browse/SPARK-37690
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.2.0
            Reporter: Robin
[jira] [Commented] (SPARK-37668) 'Index' object has no attribute 'levels' in pyspark.pandas.frame.DataFrame.insert

[ https://issues.apache.org/jira/browse/SPARK-37668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462429#comment-17462429 ]

Haejoon Lee commented on SPARK-37668:
-------------------------------------
This should only be applied to MultiIndex columns; let me address it.

> 'Index' object has no attribute 'levels' in pyspark.pandas.frame.DataFrame.insert
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-37668
>                 URL: https://issues.apache.org/jira/browse/SPARK-37668
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Maciej Szymkiewicz
>            Priority: Major
>
> [This piece of code|https://github.com/apache/spark/blob/6e45b04db48008fa033b09df983d3bd1c4f790ea/python/pyspark/pandas/frame.py#L3991-L3993] in {{pyspark.pandas.frame}} is going to fail at runtime when {{is_name_like_tuple}} evaluates to {{True}}:
> {code:python}
> if is_name_like_tuple(column):
>     if len(column) != len(self.columns.levels):
> {code}
> with
> {code}
> 'Index' object has no attribute 'levels'
> {code}
> To be honest, I am not sure what the intended behavior is (initially, I suspected that it should be
> {code:python}
> if len(column) != self.columns.nlevels
> {code}
> but {{nlevels}} is hard-coded to one, and that wouldn't be consistent with Pandas at all).
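For reference, the attribute mismatch the bug hinges on is easy to see in plain pandas (a minimal sketch, not the pyspark.pandas code path itself): a flat `Index` reports `nlevels == 1` but has no `.levels` attribute, which only `MultiIndex` defines:

```python
import pandas as pd

# A flat Index has nlevels but no .levels attribute...
idx = pd.Index(["a", "b", "c"])
print(idx.nlevels)             # 1
print(hasattr(idx, "levels"))  # False

# ...while a MultiIndex defines both.
midx = pd.MultiIndex.from_tuples([("a", 1), ("b", 2)])
print(midx.nlevels)            # 2
print(len(midx.levels))        # 2
```

This is why `len(self.columns.levels)` raises `AttributeError` whenever the columns are a flat `Index`, while `self.columns.nlevels` would work for both kinds of index.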
[jira] [Commented] (SPARK-37613) Support ANSI Aggregate Function: regr_count

[ https://issues.apache.org/jira/browse/SPARK-37613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462423#comment-17462423 ]

Apache Spark commented on SPARK-37613:
--------------------------------------
User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34955

> Support ANSI Aggregate Function: regr_count
> -------------------------------------------
>
>                 Key: SPARK-37613
>                 URL: https://issues.apache.org/jira/browse/SPARK-37613
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: jiaan.geng
>            Assignee: jiaan.geng
>            Priority: Major
>             Fix For: 3.3.0
>
> REGR_COUNT is an ANSI aggregate function. Many databases support it.
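For context, ANSI `REGR_COUNT(y, x)` returns the number of input rows in which both arguments are non-null. A minimal Python sketch of that semantics (an illustration only, not Spark's implementation):

```python
def regr_count(pairs):
    """ANSI REGR_COUNT semantics: count (y, x) pairs where both are non-null."""
    return sum(1 for y, x in pairs if y is not None and x is not None)

# Rows with a NULL (None) on either side are excluded from the count.
print(regr_count([(1, 2), (None, 3), (4, None), (5, 6)]))  # 2
```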
[jira] [Commented] (SPARK-37689) Expand should be supported PropagateEmptyRelation
[ https://issues.apache.org/jira/browse/SPARK-37689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462416#comment-17462416 ] Apache Spark commented on SPARK-37689: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34954 > Expand should be supported PropagateEmptyRelation > - > > Key: SPARK-37689 > URL: https://issues.apache.org/jira/browse/SPARK-37689 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.2.0 > Reporter: angerszhu > Priority: Major > Attachments: image-2021-12-20-14-30-34-276.png, > image-2021-12-20-14-31-16-893.png > > > An empty LocalScan still triggers HashAggregate execution. Expand should be > supported in PropagateEmptyRelation too. > !image-2021-12-20-14-30-34-276.png|width=347,height=336!
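The idea behind the proposed rule can be sketched with a toy optimizer. The node classes below are hypothetical and only mirror Catalyst concepts (this is not Spark's PropagateEmptyRelation): when a child relation is provably empty, an Expand over it can be collapsed to the empty relation, so downstream operators such as HashAggregate never run.

```python
# Toy plan nodes; names echo Catalyst concepts but this is not Spark code.
class LocalRelation:
    def __init__(self, rows):
        self.rows = rows

class Expand:
    def __init__(self, child, projections):
        self.child, self.projections = child, projections

def propagate_empty_relation(plan):
    """If Expand's child is an empty LocalRelation, replace Expand with it."""
    if isinstance(plan, Expand) and isinstance(plan.child, LocalRelation) \
            and not plan.child.rows:
        return plan.child  # Expand over zero rows produces zero rows
    return plan

empty = LocalRelation(rows=[])
optimized = propagate_empty_relation(Expand(empty, projections=[]))
assert isinstance(optimized, LocalRelation) and optimized.rows == []
```

A non-empty child is left untouched, which is the safety condition such pruning rules must respect.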
[jira] [Updated] (SPARK-37689) Expand should be supported PropagateEmptyRelation
[ https://issues.apache.org/jira/browse/SPARK-37689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-37689: -- Summary: Expand should be supported PropagateEmptyRelation (was: Expand should be supported PropagateEmptyRelationBase) > Expand should be supported PropagateEmptyRelation > - > > Key: SPARK-37689 > URL: https://issues.apache.org/jira/browse/SPARK-37689 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2021-12-20-14-30-34-276.png, > image-2021-12-20-14-31-16-893.png > > > Empty LocalScan still trigger HashAggregate execute. Expand should be > supported in PropagateEmptyRelation too. > !image-2021-12-20-14-30-34-276.png|width=347,height=336! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37689) Expand should be supported PropagateEmptyRelationBase
[ https://issues.apache.org/jira/browse/SPARK-37689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-37689: -- Description: Empty LocalScan still trigger HashAggregate execute. Expand should be supported in PropagateEmptyRelation too. !image-2021-12-20-14-30-34-276.png|width=347,height=336! was:!image-2021-12-20-14-30-24-864.png! > Expand should be supported PropagateEmptyRelationBase > - > > Key: SPARK-37689 > URL: https://issues.apache.org/jira/browse/SPARK-37689 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2021-12-20-14-30-34-276.png, > image-2021-12-20-14-31-16-893.png > > > Empty LocalScan still trigger HashAggregate execute. Expand should be > supported in PropagateEmptyRelation too. > !image-2021-12-20-14-30-34-276.png|width=347,height=336! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37689) Expand should be supported PropagateEmptyRelationBase
[ https://issues.apache.org/jira/browse/SPARK-37689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-37689: -- Attachment: image-2021-12-20-14-31-16-893.png > Expand should be supported PropagateEmptyRelationBase > - > > Key: SPARK-37689 > URL: https://issues.apache.org/jira/browse/SPARK-37689 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2021-12-20-14-30-34-276.png, > image-2021-12-20-14-31-16-893.png > > > !image-2021-12-20-14-30-24-864.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37689) Expand should be supported PropagateEmptyRelationBase
[ https://issues.apache.org/jira/browse/SPARK-37689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-37689: -- Attachment: image-2021-12-20-14-30-34-276.png > Expand should be supported PropagateEmptyRelationBase > - > > Key: SPARK-37689 > URL: https://issues.apache.org/jira/browse/SPARK-37689 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2021-12-20-14-30-34-276.png > > > !image-2021-12-20-14-30-24-864.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37689) Expand should be supported PropagateEmptyRelationBase
angerszhu created SPARK-37689: - Summary: Expand should be supported PropagateEmptyRelationBase Key: SPARK-37689 URL: https://issues.apache.org/jira/browse/SPARK-37689 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu Attachments: image-2021-12-20-14-30-34-276.png !image-2021-12-20-14-30-24-864.png! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37682) Reduce memory pressure of RewriteDistinctAggregates
[ https://issues.apache.org/jira/browse/SPARK-37682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462381#comment-17462381 ] Apache Spark commented on SPARK-37682: -- User 'Flyangz' has created a pull request for this issue: https://github.com/apache/spark/pull/34953 > Reduce memory pressure of RewriteDistinctAggregates > --- > > Key: SPARK-37682 > URL: https://issues.apache.org/jira/browse/SPARK-37682 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kevin Liu >Priority: Minor > Labels: performance > > In some cases, current RewriteDistinctAggregates duplicates unnecessary input > data in distinct groups. > This will cause a lot of waste of memory and affects performance. > We could apply 'merged column' and 'bit vector' tricks to alleviate the > problem. For example: > {code:sql} > SELECT > COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_filter_cnt_dist, > COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_filter_cnt_dist, > COUNT(DISTINCT IF(value > 5, cat1, null)) as cat1_if_cnt_dist, > COUNT(DISTINCT id) as id_cnt_dist, > SUM(DISTINCT value) as id_sum_dist > FROM data > GROUP BY key > {code} > Current rule will rewrite the above sql plan to the following (pseudo) > logical plan: > {noformat} > Aggregate( >key = ['key] >functions = [ >count('cat1) FILTER (WHERE (('gid = 1) AND 'max(id > 1))), >count('(IF((value > 5), cat1, null))) FILTER (WHERE ('gid = 5)), >count('cat2) FILTER (WHERE (('gid = 3) AND 'max(id > 2))), >count('id) FILTER (WHERE ('gid = 2)), >sum('value) FILTER (WHERE ('gid = 4)) >] >output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, > 'cat1_if_cnt_dist, > 'id_cnt_dist, 'id_sum_dist]) > Aggregate( > key = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, > 'gid] > functions = [max('id > 1), max('id > 2)] > output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), > 'id, 'gid, >'max(id > 1), 'max(id > 2)]) > Expand( >projections = [ > 
('key, 'cat1, null, null, null, null, 1, ('id > 1), null), > ('key, null, null, null, null, 'id, 2, null, null), > ('key, null, null, 'cat2, null, null, 3, null, ('id > 2)), > ('key, null, 'value, null, null, null, 4, null, null), > ('key, null, null, null, if (('value > 5)) 'cat1 else null, null, 5, > null, null) >] >output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), > 'id, > 'gid, '(id > 1), '(id > 2)]) > LocalTableScan [...] > {noformat} > After applying 'merged column' and 'bit vector' tricks, the logical plan will > become: > {noformat} > Aggregate( >key = ['key] >functions = [ >count('merged_string_1) FILTER (WHERE (('gid = 1) AND NOT > (('filter_vector_1 & 1) = 0))) >count('merged_string_1) FILTER (WHERE ('gid = 1)), >count(if (NOT (('if_vector_1 & 1) = 0)) 'merged_string_1 else null) > FILTER (WHERE ('gid = 1)), >count('merged_string_1) FILTER (WHERE (('gid = 2) AND NOT > (('filter_vector_1 & 1) = 0))), >count('merged_integer_1) FILTER (WHERE ('gid = 3)), >sum('merged_integer_1) FILTER (WHERE ('gid = 4)) >] >output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, > 'cat1_if_cnt_dist, > 'id_cnt_dist, 'id_sum_dist]) > Aggregate( > key = ['key, 'merged_string_1, 'merged_integer_1, 'gid] > functions = [bit_or('if_vector_1),bit_or('filter_vector_1)] > output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, > 'bit_or(if_vector_1), 'bit_or(filter_vector_1)]) > Expand( >projections = [ > ('key, 'cat1, null, 1, if ('value > 5) 1 else 0, if ('id > 1) 1 else > 0), > ('key, 'cat2, null, 2, null, if ('id > 2) 1 else 0), > ('key, null, 'id, 3, null, null), > ('key, null, 'value, 4, null, null) >] >output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, > 'if_vector_1, 'filter_vector_1]) > LocalTableScan [...] > {noformat} > 1. merged column: Children with same datatype from different aggregate > functions can share same project column (e.g. cat1, cat2). > 2. 
bit vector: If multiple aggregate function children have conditional > expressions, these conditions will output one column when it is true, and > output null when it is false. The detail logic will be in > RewriteDistinctAggregates.groupDistinctAggExpr of the following github link. > Then these aggregate functions can share one row group, and store the results > of their respective conditional expressions in the bit vector column, > reducing the number of rows of data expansion (e.g.
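The 'bit vector' trick described above can be illustrated independently of Spark: several boolean filter conditions are packed into one integer column per input row, the per-group aggregation reduces that column with a single bitwise OR (instead of one max() per condition), and each final aggregate tests only its own bit. A minimal sketch with a hypothetical helper (not the proposed Spark code):

```python
from functools import reduce

def pack_bits(conditions):
    """Pack a list of booleans into one int, one bit per condition."""
    return sum(1 << i for i, c in enumerate(conditions) if c)

# Rows: (id, value); the conditions id > 1 and value > 5 share one vector column.
rows = [(1, 4), (2, 7), (3, 9)]
vectors = [pack_bits([r[0] > 1, r[1] > 5]) for r in rows]

# Per-group aggregation: one bit_or replaces a separate max() per condition.
group_vector = reduce(lambda a, b: a | b, vectors, 0)
assert (group_vector & 1) != 0   # some row in the group had id > 1
assert (group_vector & 2) != 0   # some row in the group had value > 5
```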
[jira] [Updated] (SPARK-37688) ExecutorMonitor should ignore SparkListenerBlockUpdated event if executor was not active
[ https://issues.apache.org/jira/browse/SPARK-37688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiahua updated SPARK-37688: - Description: When an executor is no longer alive and `ExecutorMonitor` receives a late `SparkListenerBlockUpdated` event, the `onBlockUpdated` handler calls `ensureExecutorIsTracked`, which creates a new executor tracker with UNKNOWN_RESOURCE_PROFILE_ID for the dead executor. `ExecutorAllocationManager` will not remove an executor with UNKNOWN_RESOURCE_PROFILE_ID, so an executor slot stays occupied by the dead executor and a new one cannot be created. The ExecutorAllocationManager log looks like this (the same line logged repeatedly): 21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] ExecutorAllocationManager: Not removing executor 34324 because the ResourceProfile was UNKNOWN! was: When an executor is no longer alive and ExecutorMonitor receives a late SparkListenerBlockUpdated event, the `onBlockUpdated` handler calls `ensureExecutorIsTracked`, which creates a new executor tracker with UNKNOWN_RESOURCE_PROFILE_ID for the dead executor. ExecutorAllocationManager will not remove an executor with UNKNOWN_RESOURCE_PROFILE_ID, so an executor slot stays occupied by the dead executor and a new one cannot be created. The ExecutorAllocationManager log looks like this (the same line logged repeatedly): 21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] ExecutorAllocationManager: Not removing executor 34324 because the ResourceProfile was UNKNOWN! > ExecutorMonitor should ignore SparkListenerBlockUpdated event if executor was > not active > > > Key: SPARK-37688 > URL: https://issues.apache.org/jira/browse/SPARK-37688 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.1.2 > Reporter: hujiahua > Priority: Major > > When an executor is no longer alive and `ExecutorMonitor` receives a late > `SparkListenerBlockUpdated` event, the `onBlockUpdated` handler calls > `ensureExecutorIsTracked`, which creates a new executor tracker with > UNKNOWN_RESOURCE_PROFILE_ID for the dead executor. `ExecutorAllocationManager` > will not remove an executor with UNKNOWN_RESOURCE_PROFILE_ID, so an executor > slot stays occupied by the dead executor and a new one cannot be created. > The ExecutorAllocationManager log looks like this (the same line logged repeatedly): > 21/08/24 15:38:14 WARN [spark-dynamic-executor-allocation] > ExecutorAllocationManager: Not removing executor 34324 because the > ResourceProfile was UNKNOWN!
[jira] [Created] (SPARK-37688) ExecutorMonitor should ignore SparkListenerBlockUpdated event if executor was not active
hujiahua created SPARK-37688: Summary: ExecutorMonitor should ignore SparkListenerBlockUpdated event if executor was not active Key: SPARK-37688 URL: https://issues.apache.org/jira/browse/SPARK-37688 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.2 Reporter: hujiahua When an executor is no longer alive and ExecutorMonitor receives a late SparkListenerBlockUpdated event, the `onBlockUpdated` handler calls `ensureExecutorIsTracked`, which creates a new executor tracker with UNKNOWN_RESOURCE_PROFILE_ID for the dead executor. ExecutorAllocationManager will not remove an executor with UNKNOWN_RESOURCE_PROFILE_ID, so an executor slot stays occupied by the dead executor and a new one cannot be created.
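The guard that SPARK-37688 proposes can be sketched as follows (hypothetical classes, not Spark's ExecutorMonitor): the block-update handler simply drops events for executors that are not in the live set, instead of creating an UNKNOWN_RESOURCE_PROFILE_ID tracker for them.

```python
UNKNOWN_RESOURCE_PROFILE_ID = -1  # sentinel mirroring Spark's constant in spirit

class ExecutorMonitorSketch:
    def __init__(self):
        self.live = set()      # executor ids currently alive
        self.trackers = {}     # executor id -> tracker state

    def on_block_updated(self, executor_id):
        # Proposed behavior: ignore late block updates from dead executors,
        # so no tracker with an UNKNOWN resource profile is ever created.
        if executor_id not in self.live:
            return
        self.trackers.setdefault(executor_id, {"profile_id": 0})

mon = ExecutorMonitorSketch()
mon.on_block_updated("34324")        # executor already dead: event is ignored
assert "34324" not in mon.trackers   # no stuck slot for the dead executor
```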
[jira] [Created] (SPARK-37687) Cleanup direct usage of OkHttpClient
Yikun Jiang created SPARK-37687: --- Summary: Cleanup direct usage of OkHttpClient Key: SPARK-37687 URL: https://issues.apache.org/jira/browse/SPARK-37687 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.3.0 Reporter: Yikun Jiang - There are some problems (such as IPv6-based cluster support) with OkHttpClient v3, but it is difficult to upgrade to v4 [1]. - The Kubernetes client is also considering supporting other HTTP clients [2] rather than only OkHttpClient. - The Kubernetes client added an abstraction layer [3] for the HTTP client, with the OkHttp client as one of the supported implementations. So we should consider cleaning up direct OkHttpClient usage and using the HTTP client abstraction that the Kubernetes client supports directly, to reduce the potential risk in future upgrades. See also: [1] https://github.com/fabric8io/kubernetes-client/issues/2632 [2] https://github.com/fabric8io/kubernetes-client/issues/3663#issuecomment-997402993 [3] https://github.com/fabric8io/kubernetes-client/issues/3547
[jira] [Comment Edited] (SPARK-37648) Spark catalog and Delta tables
[ https://issues.apache.org/jira/browse/SPARK-37648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462361#comment-17462361 ] Cheng Pan edited comment on SPARK-37648 at 12/20/21, 2:42 AM: -- We have a workaround for this issue in Apache Kyuubi (Incubating), [https://github.com/apache/incubator-kyuubi/pull/1476] Kyuubi can be considered as a more powerful Spark Thrift Server, it's worth a try. was (Author: pan3793): This issue has been fixed in Apache Kyuubi (Incubating), [https://github.com/apache/incubator-kyuubi/pull/1476] Kyuubi can be considered as a more powerful Spark Thrift Server, it's worth a try. > Spark catalog and Delta tables > -- > > Key: SPARK-37648 > URL: https://issues.apache.org/jira/browse/SPARK-37648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: Spark version 3.1.2 > Scala version 2.12.10 > Hive version 2.3.7 > Delta version 1.0.0 >Reporter: Hanna Liashchuk >Priority: Major > > I'm using Spark with Delta tables, while tables are created, there are no > columns in the table. > Steps to reproduce: > 1. Start spark-shell > {code:java} > spark-shell --conf > "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > --conf "spark.sql.legacy.parquet.int96RebaseModeInWrite=LEGACY"{code} > 2. Create delta table > {code:java} > spark.range(10).write.format("delta").option("path", > "tmp/delta").saveAsTable("delta"){code} > 3. Make sure table exists > {code:java} > spark.catalog.listTables.show{code} > 4. Find out that columns are not > {code:java} > spark.catalog.listColumns("delta").show{code} > This is critical for Delta integration with different BI tools such as Power > BI or Tableau, as they are querying spark catalog for the metadata and we are > getting errors that no columns are found. 
> Discussion can be found in Delta repository - > https://github.com/delta-io/delta/issues/695 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37648) Spark catalog and Delta tables
[ https://issues.apache.org/jira/browse/SPARK-37648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462361#comment-17462361 ] Cheng Pan commented on SPARK-37648: --- This issue has been fixed in Apache Kyuubi (Incubating), [https://github.com/apache/incubator-kyuubi/pull/1476] Kyuubi can be considered as a more powerful Spark Thrift Server, it's worth a try. > Spark catalog and Delta tables > -- > > Key: SPARK-37648 > URL: https://issues.apache.org/jira/browse/SPARK-37648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: Spark version 3.1.2 > Scala version 2.12.10 > Hive version 2.3.7 > Delta version 1.0.0 >Reporter: Hanna Liashchuk >Priority: Major > > I'm using Spark with Delta tables, while tables are created, there are no > columns in the table. > Steps to reproduce: > 1. Start spark-shell > {code:java} > spark-shell --conf > "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > --conf "spark.sql.legacy.parquet.int96RebaseModeInWrite=LEGACY"{code} > 2. Create delta table > {code:java} > spark.range(10).write.format("delta").option("path", > "tmp/delta").saveAsTable("delta"){code} > 3. Make sure table exists > {code:java} > spark.catalog.listTables.show{code} > 4. Find out that columns are not > {code:java} > spark.catalog.listColumns("delta").show{code} > This is critical for Delta integration with different BI tools such as Power > BI or Tableau, as they are querying spark catalog for the metadata and we are > getting errors that no columns are found. > Discussion can be found in Delta repository - > https://github.com/delta-io/delta/issues/695 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37682) Reduce memory pressure of RewriteDistinctAggregates
[ https://issues.apache.org/jira/browse/SPARK-37682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Liu updated SPARK-37682: -- Description: In some cases, current RewriteDistinctAggregates duplicates unnecessary input data in distinct groups. This will cause a lot of waste of memory and affects performance. We could apply 'merged column' and 'bit vector' tricks to alleviate the problem. For example: {code:sql} SELECT COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_filter_cnt_dist, COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_filter_cnt_dist, COUNT(DISTINCT IF(value > 5, cat1, null)) as cat1_if_cnt_dist, COUNT(DISTINCT id) as id_cnt_dist, SUM(DISTINCT value) as id_sum_dist FROM data GROUP BY key {code} Current rule will rewrite the above sql plan to the following (pseudo) logical plan: {noformat} Aggregate( key = ['key] functions = [ count('cat1) FILTER (WHERE (('gid = 1) AND 'max(id > 1))), count('(IF((value > 5), cat1, null))) FILTER (WHERE ('gid = 5)), count('cat2) FILTER (WHERE (('gid = 3) AND 'max(id > 2))), count('id) FILTER (WHERE ('gid = 2)), sum('value) FILTER (WHERE ('gid = 4)) ] output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 'cat1_if_cnt_dist, 'id_cnt_dist, 'id_sum_dist]) Aggregate( key = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, 'gid] functions = [max('id > 1), max('id > 2)] output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, 'gid, 'max(id > 1), 'max(id > 2)]) Expand( projections = [ ('key, 'cat1, null, null, null, null, 1, ('id > 1), null), ('key, null, null, null, null, 'id, 2, null, null), ('key, null, null, 'cat2, null, null, 3, null, ('id > 2)), ('key, null, 'value, null, null, null, 4, null, null), ('key, null, null, null, if (('value > 5)) 'cat1 else null, null, 5, null, null) ] output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, 'gid, '(id > 1), '(id > 2)]) LocalTableScan [...] 
{noformat} After applying 'merged column' and 'bit vector' tricks, the logical plan will become: {noformat} Aggregate( key = ['key] functions = [ count('merged_string_1) FILTER (WHERE (('gid = 1) AND NOT (('filter_vector_1 & 1) = 0))) count('merged_string_1) FILTER (WHERE ('gid = 1)), count(if (NOT (('if_vector_1 & 1) = 0)) 'merged_string_1 else null) FILTER (WHERE ('gid = 1)), count('merged_string_1) FILTER (WHERE (('gid = 2) AND NOT (('filter_vector_1 & 1) = 0))), count('merged_integer_1) FILTER (WHERE ('gid = 3)), sum('merged_integer_1) FILTER (WHERE ('gid = 4)) ] output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 'cat1_if_cnt_dist, 'id_cnt_dist, 'id_sum_dist]) Aggregate( key = ['key, 'merged_string_1, 'merged_integer_1, 'gid] functions = [bit_or('if_vector_1),bit_or('filter_vector_1)] output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 'bit_or(if_vector_1), 'bit_or(filter_vector_1)]) Expand( projections = [ ('key, 'cat1, null, 1, if ('value > 5) 1 else 0, if ('id > 1) 1 else 0), ('key, 'cat2, null, 2, null, if ('id > 2) 1 else 0), ('key, null, 'id, 3, null, null), ('key, null, 'value, 4, null, null) ] output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 'if_vector_1, 'filter_vector_1]) LocalTableScan [...] {noformat} 1. merged column: Children with same datatype from different aggregate functions can share same project column (e.g. cat1, cat2). 2. bit vector: If multiple aggregate function children have conditional expressions, these conditions will output one column when it is true, and output null when it is false. The detail logic will be in RewriteDistinctAggregates.groupDistinctAggExpr of the following github link. Then these aggregate functions can share one row group, and store the results of their respective conditional expressions in the bit vector column, reducing the number of rows of data expansion (e.g. cat1_filter_cnt_dist, cat1_if_cnt_dist). 
If there are many similar DISTINCT aggregate functions, with or without filters, these tricks can save substantial memory and improve performance. was: In some cases, current RewriteDistinctAggregates duplicates unnecessary input data in distinct groups. This will cause a lot of waste of memory and affects performance. We could apply 'merged column' and 'bit vector' tricks to alleviate the problem. For example: {code:sql} SELECT COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_filter_cnt_dist, COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_filter_cnt_dist, COUNT(DISTINCT IF(value > 5, cat1, null)) as cat1_if_cnt_dist, COUNT(DISTINCT id) as id_cnt_dist, SUM(DISTINCT value) as id_sum_dist FROM
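The 'bit vector' trick described above can be illustrated with plain Python. This is a toy model only, the real rewrite operates on Catalyst logical plans; the conditions and bit positions are taken from the example query (bit 0: id > 1, bit 1: value > 5).

```python
# Toy illustration of the 'bit vector' trick: several per-row boolean
# conditions are packed into one integer column, combined across rows with
# bitwise OR during aggregation, and later tested with (vector & mask) != 0.

rows = [
    {"key": "a", "id": 0, "value": 9},
    {"key": "a", "id": 3, "value": 2},
    {"key": "b", "id": 5, "value": 7},
]

def filter_vector(row):
    # bit 0: id > 1 (FILTER condition), bit 1: value > 5 (IF condition)
    return (1 if row["id"] > 1 else 0) | (2 if row["value"] > 5 else 0)

# One vector column replaces two separate boolean columns in the Expand,
# so both conditions share a single row group per key.
vectors = {}
for row in rows:
    vectors[row["key"]] = vectors.get(row["key"], 0) | filter_vector(row)

assert vectors["a"] & 1  # some row in group 'a' satisfied id > 1
assert vectors["a"] & 2  # some row in group 'a' satisfied value > 5
```

In the rewritten plan this corresponds to the bit_or('vector_1) aggregate and the NOT (('vector_1 & mask) = 0) predicates, which let cat1_filter_cnt_dist and cat1_if_cnt_dist share one set of expanded rows.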
[jira] [Updated] (SPARK-37682) Reduce memory pressure of RewriteDistinctAggregates
[ https://issues.apache.org/jira/browse/SPARK-37682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Liu updated SPARK-37682: -- Description: In some cases, current RewriteDistinctAggregates duplicates unnecessary input data in distinct groups. This will cause a lot of waste of memory and affects performance. We could apply 'merged column' and 'bit vector' tricks to alleviate the problem. For example: {code:sql} SELECT COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_filter_cnt_dist, COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_filter_cnt_dist, COUNT(DISTINCT IF(value > 5, cat1, null)) as cat1_if_cnt_dist, COUNT(DISTINCT id) as id_cnt_dist, SUM(DISTINCT value) as id_sum_dist FROM data GROUP BY key {code} Current rule will rewrite the above sql plan to the following (pseudo) logical plan: {noformat} Aggregate( key = ['key] functions = [ count('cat1) FILTER (WHERE (('gid = 1) AND 'max(id > 1))), count('(IF((value > 5), cat1, null))) FILTER (WHERE ('gid = 5)), count('cat2) FILTER (WHERE (('gid = 3) AND 'max(id > 2))), count('id) FILTER (WHERE ('gid = 2)), sum('value) FILTER (WHERE ('gid = 4)) ] output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 'cat1_if_cnt_dist, 'id_cnt_dist, 'id_sum_dist]) Aggregate( key = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, 'gid] functions = [max('id > 1), max('id > 2)] output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, 'gid, 'max(id > 1), 'max(id > 2)]) Expand( projections = [ ('key, 'cat1, null, null, null, null, 1, ('id > 1), null), ('key, null, null, null, null, 'id, 2, null, null), ('key, null, null, 'cat2, null, null, 3, null, ('id > 2)), ('key, null, 'value, null, null, null, 4, null, null), ('key, null, null, null, if (('value > 5)) 'cat1 else null, null, 5, null, null) ] output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, 'gid, '(id > 1), '(id > 2)]) LocalTableScan [...] 
{noformat} After applying 'merged column' and 'bit vector' tricks, the logical plan will become: {noformat} Aggregate( key = ['key] functions = [ count('merged_string_1) FILTER (WHERE (('gid = 1) AND NOT (('filter_vector_1 & 1) = 0))) count('merged_string_1) FILTER (WHERE ('gid = 1)), count(if (NOT (('if_vector_1 & 1) = 0)) 'merged_string_1 else null) FILTER (WHERE ('gid = 1)), count('merged_string_1) FILTER (WHERE (('gid = 2) AND NOT (('filter_vector_1 & 1) = 0))), count('merged_integer_1) FILTER (WHERE ('gid = 3)), sum('merged_integer_1) FILTER (WHERE ('gid = 4)) ] output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, 'cat1_if_cnt_dist, 'id_cnt_dist, 'id_sum_dist]) Aggregate( key = ['key, 'merged_string_1, 'merged_integer_1, 'gid] functions = [bit_or('if_vector_1),bit_or('filter_vector_1)] output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 'bit_or(if_vector_1), 'bit_or(filter_vector_1)]) Expand( projections = [ ('key, 'cat1, null, 1, if ('value > 5) 1 else 0, if ('id > 1) 1 else 0), ('key, 'cat2, null, 2, null, if ('id > 2) 1 else 0), ('key, null, 'id, 3, null, null), ('key, null, 'value, 4, null, null) ] output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 'if_vector_1, 'filter_vector_1]) LocalTableScan [...] {noformat} 1. merged column: Children with same datatype from different aggregate functions can share same project column (e.g. cat1, cat2). 2. bit vector: If multiple aggregate function children have conditional expressions, these conditions will output one column when it is true, and output null when it is false. The detail logic is in RewriteDistinctAggregates.groupDistinctAggExpr of the following github link. Then these aggregate functions can share one row group, and store the results of their respective conditional expressions in the bit vector column, reducing the number of rows of data expansion (e.g. cat1_filter_cnt_dist, cat1_if_cnt_dist). 
If there are many similar DISTINCT aggregate functions, with or without filters, these tricks can save substantial memory and improve performance. was: In some cases, current RewriteDistinctAggregates duplicates unnecessary input data in distinct groups. This will cause a lot of waste of memory and affects performance. We could apply 'merged column' and 'bit vector' tricks to alleviate the problem. For example: {code:sql} SELECT COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_filter_cnt_dist, COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_filter_cnt_dist, COUNT(DISTINCT IF(value > 5, cat1, null)) as cat1_if_cnt_dist, COUNT(DISTINCT id) as id_cnt_dist, SUM(DISTINCT value) as id_sum_dist FROM data
[jira] [Commented] (SPARK-37686) Migrate remaining pyspark.sql.functions to _invoke_* style
[ https://issues.apache.org/jira/browse/SPARK-37686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462144#comment-17462144 ] Apache Spark commented on SPARK-37686: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/34951 > Migrate remaining pyspark.sql.functions to _invoke_* style > -- > > Key: SPARK-37686 > URL: https://issues.apache.org/jira/browse/SPARK-37686 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > In SPARK-32084 we converted a number of dynamically created functions to > standard def using newly added {{_invoke_function}} helpers. Remaining > functions stayed untouched. > These allow us to implement these with minimum amount of boilerplate code. > With recent migration of type hints to inline variant we had to type checker > hints: > {code:python} > sc = SparkContext._active_spark_context > assert sc is not None and sc._jvm is not Non > {code} > to the functions not covered in SPARK-32084. This added to standard > conversions (i.e. ({_to_java_column}}, {{Column(jc)}}) make these pretty > verbose. > However, it is possible to convert remaining functions using > {{_invoke_function}} helpers to address that. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37686) Migrate remaining pyspark.sql.functions to _invoke_* style
[ https://issues.apache.org/jira/browse/SPARK-37686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37686: Assignee: (was: Apache Spark) > Migrate remaining pyspark.sql.functions to _invoke_* style > -- > > Key: SPARK-37686 > URL: https://issues.apache.org/jira/browse/SPARK-37686 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > In SPARK-32084 we converted a number of dynamically created functions to > standard def using newly added {{_invoke_function}} helpers. Remaining > functions stayed untouched. > These allow us to implement these with minimum amount of boilerplate code. > With recent migration of type hints to inline variant we had to type checker > hints: > {code:python} > sc = SparkContext._active_spark_context > assert sc is not None and sc._jvm is not Non > {code} > to the functions not covered in SPARK-32084. This added to standard > conversions (i.e. ({_to_java_column}}, {{Column(jc)}}) make these pretty > verbose. > However, it is possible to convert remaining functions using > {{_invoke_function}} helpers to address that. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37686) Migrate remaining pyspark.sql.functions to _invoke_* style
[ https://issues.apache.org/jira/browse/SPARK-37686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37686: Assignee: Apache Spark > Migrate remaining pyspark.sql.functions to _invoke_* style > -- > > Key: SPARK-37686 > URL: https://issues.apache.org/jira/browse/SPARK-37686 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > In SPARK-32084 we converted a number of dynamically created functions to > standard def using newly added {{_invoke_function}} helpers. Remaining > functions stayed untouched. > These allow us to implement these with minimum amount of boilerplate code. > With recent migration of type hints to inline variant we had to type checker > hints: > {code:python} > sc = SparkContext._active_spark_context > assert sc is not None and sc._jvm is not Non > {code} > to the functions not covered in SPARK-32084. This added to standard > conversions (i.e. ({_to_java_column}}, {{Column(jc)}}) make these pretty > verbose. > However, it is possible to convert remaining functions using > {{_invoke_function}} helpers to address that. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37686) Migrare remaining pyspark.sql.functions to _invoke_* style
Maciej Szymkiewicz created SPARK-37686: -- Summary: Migrare remaining pyspark.sql.functions to _invoke_* style Key: SPARK-37686 URL: https://issues.apache.org/jira/browse/SPARK-37686 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Maciej Szymkiewicz In SPARK-32084 we converted a number of dynamically created functions to standard defs using the newly added {{_invoke_function}} helpers. The remaining functions stayed untouched. These helpers allow us to implement such functions with a minimal amount of boilerplate code. With the recent migration of type hints to the inline variant, we had to add type checker hints: {code:python} sc = SparkContext._active_spark_context assert sc is not None and sc._jvm is not None {code} to the functions not covered in SPARK-32084. This, added to the standard conversions (i.e. {{_to_java_column}}, {{Column(jc)}}), makes these pretty verbose. However, it is possible to convert the remaining functions using {{_invoke_function}} helpers to address that. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37686) Migrate remaining pyspark.sql.functions to _invoke_* style
[ https://issues.apache.org/jira/browse/SPARK-37686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-37686: --- Summary: Migrate remaining pyspark.sql.functions to _invoke_* style (was: Migrare remaining pyspark.sql.functions to _invoke_* style) > Migrate remaining pyspark.sql.functions to _invoke_* style > -- > > Key: SPARK-37686 > URL: https://issues.apache.org/jira/browse/SPARK-37686 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > In SPARK-32084 we converted a number of dynamically created functions to > standard def using newly added {{_invoke_function}} helpers. Remaining > functions stayed untouched. > These allow us to implement these with minimum amount of boilerplate code. > With recent migration of type hints to inline variant we had to type checker > hints: > {code:python} > sc = SparkContext._active_spark_context > assert sc is not None and sc._jvm is not Non > {code} > to the functions not covered in SPARK-32084. This added to standard > conversions (i.e. ({_to_java_column}}, {{Column(jc)}}) make these pretty > verbose. > However, it is possible to convert remaining functions using > {{_invoke_function}} helpers to address that. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37682) Reduce memory pressure of RewriteDistinctAggregates
[ https://issues.apache.org/jira/browse/SPARK-37682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Liu updated SPARK-37682: -- Priority: Minor (was: Major) > Reduce memory pressure of RewriteDistinctAggregates > --- > > Key: SPARK-37682 > URL: https://issues.apache.org/jira/browse/SPARK-37682 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kevin Liu >Priority: Minor > Labels: performance > > In some cases, current RewriteDistinctAggregates duplicates unnecessary input > data in distinct groups. > This will cause a lot of waste of memory and affects performance. > We could apply 'merged column' and 'bit vector' tricks to alleviate the > problem. For example: > {code:sql} > SELECT > COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_filter_cnt_dist, > COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_filter_cnt_dist, > COUNT(DISTINCT IF(value > 5, cat1, null)) as cat1_if_cnt_dist, > COUNT(DISTINCT id) as id_cnt_dist, > SUM(DISTINCT value) as id_sum_dist > FROM data > GROUP BY key > {code} > Current rule will rewrite the above sql plan to the following (pseudo) > logical plan: > {noformat} > Aggregate( >key = ['key] >functions = [ >count('cat1) FILTER (WHERE (('gid = 1) AND 'max(id > 1))), >count('(IF((value > 5), cat1, null))) FILTER (WHERE ('gid = 5)), >count('cat2) FILTER (WHERE (('gid = 3) AND 'max(id > 2))), >count('id) FILTER (WHERE ('gid = 2)), >sum('value) FILTER (WHERE ('gid = 4)) >] >output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, > 'cat1_if_cnt_dist, > 'id_cnt_dist, 'id_sum_dist]) > Aggregate( > key = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), 'id, > 'gid] > functions = [max('id > 1), max('id > 2)] > output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), > 'id, 'gid, >'max(id > 1), 'max(id > 2)]) > Expand( >projections = [ > ('key, 'cat1, null, null, null, null, 1, ('id > 1), null), > ('key, null, null, null, null, 'id, 2, null, null), 
> ('key, null, null, 'cat2, null, null, 3, null, ('id > 2)), > ('key, null, 'value, null, null, null, 4, null, null), > ('key, null, null, null, if (('value > 5)) 'cat1 else null, null, 5, > null, null) >] >output = ['key, 'cat1, 'value, 'cat2, '(IF((value > 5), cat1, null)), > 'id, > 'gid, '(id > 1), '(id > 2)]) > LocalTableScan [...] > {noformat} > After applying 'merged column' and 'bit vector' tricks, the logical plan will > become: > {noformat} > Aggregate( >key = ['key] >functions = [ >count('merged_string_1) FILTER (WHERE (('gid = 1) AND NOT (('vector_1 > & 1) = 0))), >count('merged_string_1) FILTER (WHERE (('gid = 1) AND NOT (('vector_1 > & 2) = 0))), >count('merged_string_1) FILTER (WHERE (('gid = 2) AND NOT (('vector_1 > & 1) = 0))), >count('merged_integer_1) FILTER (WHERE ('gid = 3)), >sum('merged_integer_1) FILTER (WHERE ('gid = 4)) >] >output = ['key, 'cat1_filter_cnt_dist, 'cat2_filter_cnt_dist, > 'cat1_if_cnt_dist, > 'id_cnt_dist, 'id_sum_dist]) > Aggregate( > key = ['key, 'merged_string_1, 'merged_integer_1, 'gid] > functions = [bit_or('vector_1)] > output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, > 'bit_or(vector_1)]) > Expand( >projections = [ > ('key, 'cat1, null, 1, (if (('id > 1)) 1 else 0 | if (('value > 5)) > 2 else 0)), > ('key, 'cat2, null, 2, if (('id > 2)) 1 else 0), > ('key, null, 'id, 3, null), > ('key, null, 'value, 4, null) >] >output = ['key, 'merged_string_1, 'merged_integer_1, 'gid, 'vector_1]) > LocalTableScan [...] > {noformat} > 1. merged column: Children with same datatype from different aggregate > functions can share same project column (e.g. cat1, cat2). > 2. bit vector: If multiple aggregate function children have conditional > expressions, these conditions will output one column when it is true, and > output null when it is false. The detail logic is in > RewriteDistinctAggregates.groupDistinctAggExpr of the following github link. 
> Then these aggregate functions can share one row group, and store the results > of their respective conditional expressions in the bit vector column, > reducing the number of rows of data expansion (e.g. cat1_filter_cnt_dist, > cat1_if_cnt_dist). > If there are many similar aggregate functions with or without filter in > distinct, these tricks can save mass memory and improve performance. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (SPARK-37588) lot of strings get accumulated in the heap dump of spark thrift server
[ https://issues.apache.org/jira/browse/SPARK-37588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462105#comment-17462105 ] Hyukjin Kwon commented on SPARK-37588: -- [~rkchilaka] can we have a minimized self-contained reproducer? At least you can remove the configuration one by one, and see if the issue still persists. > lot of strings get accumulated in the heap dump of spark thrift server > -- > > Key: SPARK-37588 > URL: https://issues.apache.org/jira/browse/SPARK-37588 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.2.0 > Environment: Open JDK (8 build 1.8.0_312-b07) and scala 2.12 > OS: Red Hat Enterprise Linux 8.4 (Ootpa), platform:el8 >Reporter: ramakrishna chilaka >Priority: Major > Attachments: screenshot-1.png > > > I am starting spark thrift server using the following options > ``` > /data/spark/sbin/start-thriftserver.sh --master spark://*:7077 --conf > "spark.cores.max=320" --conf "spark.executor.cores=3" --conf > "spark.driver.cores=15" --executor-memory=10G --driver-memory=50G --conf > spark.sql.adaptive.coalescePartitions.enabled=true --conf > spark.sql.adaptive.skewJoin.enabled=true --conf spark.sql.cbo.enabled=true > --conf spark.sql.adaptive.enabled=true --conf spark.rpc.io.serverThreads=64 > --conf "spark.driver.maxResultSize=4G" --conf > "spark.max.fetch.failures.per.stage=10" --conf > "spark.sql.thriftServer.incrementalCollect=false" --conf > "spark.ui.reverseProxy=true" --conf "spark.ui.reverseProxyUrl=/spark_ui" > --conf "spark.sql.autoBroadcastJoinThreshold=1073741824" --conf > spark.sql.thriftServer.interruptOnCancel=true --conf > spark.sql.thriftServer.queryTimeout=0 --hiveconf > hive.server2.transport.mode=http --hiveconf > hive.server2.thrift.http.path=spark_sql --hiveconf > hive.server2.thrift.min.worker.threads=500 --hiveconf > hive.server2.thrift.max.worker.threads=2147483647 --hiveconf > hive.server2.thrift.http.cookie.is.secure=false --hiveconf > 
hive.server2.thrift.http.cookie.auth.enabled=false --hiveconf > hive.server2.authentication=NONE --hiveconf hive.server2.enable.doAs=false > --hiveconf spark.sql.hive.thriftServer.singleSession=true --hiveconf > hive.server2.thrift.bind.host=0.0.0.0 --conf > "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > --conf "spark.sql.cbo.joinReorder.enabled=true" --conf > "spark.sql.optimizer.dynamicPartitionPruning.enabled=true" --conf > "spark.worker.cleanup.enabled=true" --conf > "spark.worker.cleanup.appDataTtl=3600" --hiveconf > hive.exec.scratchdir=/data/spark_scratch/hive --hiveconf > hive.exec.local.scratchdir=/data/spark_scratch/local_scratch_dir --hiveconf > hive.download.resources.dir=/data/spark_scratch/hive.downloaded.resources.dir > --hiveconf hive.querylog.location=/data/spark_scratch/hive.querylog.location > --conf spark.executor.extraJavaOptions="-XX:+PrintGCDetails > -XX:+PrintGCTimeStamps" --conf spark.driver.extraJavaOptions="-verbose:gc > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps > -Xloggc:/data/thrift_driver_gc.log -XX:+ExplicitGCInvokesConcurrent > -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:GCTimeRatio=4 > -XX:AdaptiveSizePolicyWeight=90 -XX:MaxRAM=55g" --hiveconf > "hive.server2.session.check.interval=6" --hiveconf > "hive.server2.idle.session.timeout=90" --hiveconf > "hive.server2.idle.session.check.operation=true" --conf > "spark.eventLog.enabled=false" --conf > "spark.cleaner.periodicGC.interval=5min" --conf > "spark.appStateStore.asyncTracking.enable=false" --conf > "spark.ui.retainedJobs=30" --conf "spark.ui.retainedStages=100" --conf > "spark.ui.retainedTasks=500" --conf "spark.sql.ui.retainedExecutions=10" > --conf "spark.ui.retainedDeadExecutors=10" --conf > "spark.worker.ui.retainedExecutors=10" --conf > "spark.worker.ui.retainedDrivers=10" --conf spark.ui.enabled=false --conf > 
spark.stage.maxConsecutiveAttempts=10 --conf spark.executor.memoryOverhead=1G > --conf "spark.io.compression.codec=snappy" --conf > "spark.default.parallelism=640" --conf spark.memory.offHeap.enabled=true > --conf "spark.memory.offHeap.size=3g" --conf "spark.memory.fraction=0.75" > --conf "spark.memory.storageFraction=0.75" > ``` > the java heap dump after heavy usage is as follows > ``` > 1: 50465861 9745837152 [C >2: 23337896 1924089944 [Ljava.lang.Object; >3: 72524905 1740597720 java.lang.Long >4: 50463694 1614838208 java.lang.String >5: 22718029 726976928 > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema >6: 2259416 343483328
[jira] [Commented] (SPARK-37640) rolled event log still need be clean after compact
[ https://issues.apache.org/jira/browse/SPARK-37640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462104#comment-17462104 ] Hyukjin Kwon commented on SPARK-37640: -- [~m-sir] mind filling the JIRA description please? > rolled event log still need be clean after compact > -- > > Key: SPARK-37640 > URL: https://issues.apache.org/jira/browse/SPARK-37640 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.1 >Reporter: muhong >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37648) Spark catalog and Delta tables
[ https://issues.apache.org/jira/browse/SPARK-37648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37648: - Component/s: SQL (was: Spark Core) > Spark catalog and Delta tables > -- > > Key: SPARK-37648 > URL: https://issues.apache.org/jira/browse/SPARK-37648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: Spark version 3.1.2 > Scala version 2.12.10 > Hive version 2.3.7 > Delta version 1.0.0 >Reporter: Hanna Liashchuk >Priority: Major > > I'm using Spark with Delta tables, while tables are created, there are no > columns in the table. > Steps to reproduce: > 1. Start spark-shell > {code:java} > spark-shell --conf > "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf > "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" > --conf "spark.sql.legacy.parquet.int96RebaseModeInWrite=LEGACY"{code} > 2. Create delta table > {code:java} > spark.range(10).write.format("delta").option("path", > "tmp/delta").saveAsTable("delta"){code} > 3. Make sure table exists > {code:java} > spark.catalog.listTables.show{code} > 4. Find out that columns are not > {code:java} > spark.catalog.listColumns("delta").show{code} > This is critical for Delta integration with different BI tools such as Power > BI or Tableau, as they are querying spark catalog for the metadata and we are > getting errors that no columns are found. > Discussion can be found in Delta repository - > https://github.com/delta-io/delta/issues/695 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37660) Spark-3.2.0 Fetch Hbase Data not working
[ https://issues.apache.org/jira/browse/SPARK-37660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462103#comment-17462103 ] Hyukjin Kwon commented on SPARK-37660: -- Mind showing the actual output? There have been almost zero changes made to this code path in Apache Spark between 3.1 and 3.2. > Spark-3.2.0 Fetch Hbase Data not working > > > Key: SPARK-37660 > URL: https://issues.apache.org/jira/browse/SPARK-37660 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 > Environment: Hadoop version : hadoop-2.9.2 > HBase version : hbase-2.2.5 > Spark version : spark-3.2.0-bin-without-hadoop > java version : jdk1.8.0_151 > scala version : scala-sdk-2.12.10 > os version : Red Hat Enterprise Linux Server release 6.6 (Santiago) >Reporter: Bhavya Raj Sharma >Priority: Major > > Below is the sample code snippet that is used to fetch data from HBase. This > used to work fine with spark-3.1.1. > However, after upgrading to spark-3.2.0 it is not working. The issue is that it does > not throw any exception; it just doesn't fill the RDD.
> > {code:java} > > def getInfo(sc: SparkContext, startDate:String, cachingValue: Int, > sparkLoggerParams: SparkLoggerParams, zkIP: String, zkPort: String): > RDD[(String)] = { > val scan = new Scan > scan.addFamily("family") > scan.addColumn("family","time") > val rdd = getHbaseConfiguredRDDFromScan(sc, zkIP, zkPort, "myTable", > scan, cachingValue, sparkLoggerParams) > val output: RDD[(String)] = rdd.map { row => > (Bytes.toString(row._2.getRow)) > } > output > } > > def getHbaseConfiguredRDDFromScan(sc: SparkContext, zkIP: String, zkPort: > String, tableName: String, > scan: Scan, cachingValue: Int, > sparkLoggerParams: SparkLoggerParams): NewHadoopRDD[ImmutableBytesWritable, > Result] = { > scan.setCaching(cachingValue) > val scanString = > Base64.getEncoder.encodeToString(org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(scan).toByteArray) > val hbaseContext = new SparkHBaseContext(zkIP, zkPort) > val hbaseConfig = hbaseContext.getConfiguration() > hbaseConfig.set(TableInputFormat.INPUT_TABLE, tableName) > hbaseConfig.set(TableInputFormat.SCAN, scanString) > sc.newAPIHadoopRDD( > hbaseConfig, > classOf[TableInputFormat], > classOf[ImmutableBytesWritable], classOf[Result] > ).asInstanceOf[NewHadoopRDD[ImmutableBytesWritable, Result]] > } > > {code} > > If we fetch using the scan directly, without newAPIHadoopRDD, it works.
[jira] [Commented] (SPARK-37667) Spark throws TreeNodeException ("Couldn't find gen_alias") during wildcard column expansion
[ https://issues.apache.org/jira/browse/SPARK-37667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17462102#comment-17462102 ] Hyukjin Kwon commented on SPARK-37667: -- quick question, does it fail in Spark 3.2 too? > Spark throws TreeNodeException ("Couldn't find gen_alias") during wildcard > column expansion > --- > > Key: SPARK-37667 > URL: https://issues.apache.org/jira/browse/SPARK-37667 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Kellan B Cummings >Priority: Major > > I'm seeing a TreeNodeException ("Couldn't find _gen_alias_") when running > certain operations in Spark 3.1.2. > A few conditions need to be met to trigger the bug: > - a DF with a nested struct joins to a second DF > - a filter that compares a column in the right DF to a column in the left DF > - wildcard column expansion of the nested struct > - a group by statement on a struct column > *Data* > g...@github.com:kellanburket/spark3bug.git > > {code:java} > val rightDf = spark.read.parquet("right.parquet") > val leftDf = spark.read.parquet("left.parquet"){code} > > *Schemas* > {code:java} > leftDf.printSchema() > root > |-- row: struct (nullable = true) > | |-- mid: string (nullable = true) > | |-- start: struct (nullable = true) > | | |-- latitude: double (nullable = true) > | | |-- longitude: double (nullable = true) > |-- s2_cell_id: long (nullable = true){code} > {code:java} > rightDf.printSchema() > root > |-- id: string (nullable = true) > |-- s2_cell_id: long (nullable = true){code} > > *Breaking Code* > {code:java} > leftDf.join(rightDf, "s2_cell_id").filter( > "id != row.start.latitude" > ).select( > col("row.*"), col("id") > ).groupBy( > "start" > ).agg( > min("id") > ).show(){code} > > *Working Examples* > The following examples don't seem to be affected by the bug > Works without group by: > {code:java} > leftDf.join(rightDf, "s2_cell_id").filter( > "id != row.start.latitude" > ).select( > col("row.*"), col("id")
> ).show(){code} > Works without filter > {code:java} > leftDf.join(rightDf, "s2_cell_id").select( > col("row.*"), col("id") > ).groupBy( > "start" > ).agg( > min("id") > ).show(){code} > Works without wildcard expansion > {code:java} > leftDf.join(rightDf, "s2_cell_id").filter( > "id != row.start.latitude" > ).select( > col("row.start"), col("id") > ).groupBy( > "start" > ).agg( > min("id") > ).show(){code} > Works with caching > {code:java} > leftDf.join(rightDf, "s2_cell_id").filter( > "id != row.start.latitude" > ).cache().select( > col("row.*"), > col("id") > ).groupBy( > "start" > ).agg( > min("id") > ).show(){code} > *Error message* > > > {code:java} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange hashpartitioning(start#2116, 1024), ENSURE_REQUIREMENTS, [id=#3849] > +- SortAggregate(key=[knownfloatingpointnormalized(if (isnull(start#2116)) > null else named_struct(latitude, > knownfloatingpointnormalized(normalizenanandzero(start#2116.latitude)), > longitude, > knownfloatingpointnormalized(normalizenanandzero(start#2116.longitude AS > start#2116], functions=[partial_min(id#2103)], output=[start#2116, min#2138]) > +- *(2) Sort [knownfloatingpointnormalized(if (isnull(start#2116)) null > else named_struct(latitude, > knownfloatingpointnormalized(normalizenanandzero(start#2116.latitude)), > longitude, > knownfloatingpointnormalized(normalizenanandzero(start#2116.longitude AS > start#2116 ASC NULLS FIRST], false, 0 > +- *(2) Project [_gen_alias_2133#2133 AS start#2116, id#2103] > +- *(2) !BroadcastHashJoin [s2_cell_id#2108L], [s2_cell_id#2104L], > Inner, BuildLeft, NOT (cast(id#2103 as double) = _gen_alias_2134#2134), false > :- BroadcastQueryStage 0 > : +- BroadcastExchange HashedRelationBroadcastMode(List(input[1, > bigint, false]),false), [id=#3768] > : +- *(1) Project [row#2107.start AS _gen_alias_2133#2133, > s2_cell_id#2108L] > : +- *(1) Filter isnotnull(s2_cell_id#2108L) > : +- FileScan parquet 
[row#2107,s2_cell_id#2108L] > Batched: false, DataFilters: [isnotnull(s2_cell_id#2108L)], Format: Parquet, > Location: InMemoryFileIndex[s3://co.mira.public/spark3_bug/left], > PartitionFilters: [], PushedFilters: [IsNotNull(s2_cell_id)], ReadSchema: > struct>,s2_cell_id:bigint> > +- *(2) Filter (isnotnull(id#2103) AND > isnotnull(s2_cell_id#2104L)) > +- *(2) ColumnarToRow > +- FileScan parquet [id#2103,s2_cell_id#2104L] Batched: > true, DataFilters: [isnotnull(id#2103), isnotnull(s2_cell_id#2104L)],
[jira] [Assigned] (SPARK-37685) Make log event immutable for LogAppender
[ https://issues.apache.org/jira/browse/SPARK-37685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37685: Assignee: L. C. Hsieh > Make log event immutable for LogAppender > > > Key: SPARK-37685 > URL: https://issues.apache.org/jira/browse/SPARK-37685 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Log4j2 will reuse log events. In Spark tests, we have a LogAppender which records > log events and checks them later. So sometimes an event will be changed and > cause a test failure.
[jira] [Resolved] (SPARK-37685) Make log event immutable for LogAppender
[ https://issues.apache.org/jira/browse/SPARK-37685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37685. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34949 [https://github.com/apache/spark/pull/34949] > Make log event immutable for LogAppender > > > Key: SPARK-37685 > URL: https://issues.apache.org/jira/browse/SPARK-37685 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.3.0 > > > Log4j2 will reuse log events. In Spark tests, we have a LogAppender which records > log events and checks them later. So sometimes an event will be changed and > cause a test failure.
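The event-reuse pitfall behind SPARK-37685 is not specific to Log4j2. As a rough, hypothetical analogue (using Python's stdlib `logging` rather than Spark's actual Scala `LogAppender`), a capturing handler should snapshot each event immediately, e.g. by storing the formatted string, instead of keeping a reference to a record object that the logging framework may later mutate or reuse:

```python
import logging

class CapturingHandler(logging.Handler):
    """Hypothetical analogue of Spark's test LogAppender: captures log
    events so a test can assert on them later. Storing the formatted
    string (an immutable snapshot) avoids the failure mode where the
    framework reuses/mutates the event object before the assertion runs."""

    def __init__(self):
        super().__init__()
        self.messages = []  # immutable snapshots, safe to inspect later

    def emit(self, record):
        # Snapshot immediately: copy out the rendered message rather
        # than holding on to the (potentially reused) record object.
        self.messages.append(self.format(record))

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep output out of the root logger

handler = CapturingHandler()
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
logger.addHandler(handler)

logger.info("first event")
logger.warning("second event")
print(handler.messages)  # ['INFO:first event', 'WARNING:second event']
```

The same idea underlies the fix described above: make the recorded event effectively immutable at capture time, so later reuse of the underlying event object cannot change what the test observes.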