[jira] [Updated] (SPARK-39893) Remove Aggregate if it is group only and all grouping and aggregate expressions are foldable

2022-07-27 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-39893:

Description: 
If all groupingExpressions and aggregateExpressions in an aggregate are 
foldable, we can remove the aggregate.
For example, for the query:
{code:sql}
SELECT distinct 1001 as id , cast('2022-06-03' as date) AS DT FROM testData
{code}

The grouping expressions are *[1001, 2022-06-03]* and the aggregate 
expressions are *[1001 AS id#274, 2022-06-03 AS DT#275]*, so we can skip 
scanning table testData and remove the aggregate operation.

Before this PR:

{code:java}
Aggregate [1001, 2022-06-03], [1001 AS id#274, 2022-06-03 AS DT#275], 
Statistics(sizeInBytes=16.0 EiB)
+- SerializeFromObject, Statistics(sizeInBytes=8.0 EiB)
   +- ExternalRDD [obj#12], Statistics(sizeInBytes=8.0 EiB)
{code}

After this PR:

{code:java}
Project [1001 AS id#218, 2022-06-03 AS DT#219], Statistics(sizeInBytes=2.0 B)
+- OneRowRelation, Statistics(sizeInBytes=1.0 B)
{code}
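As a rough illustration of the idea (not the actual PR code; the rule and 
helper names below are made up), a Catalyst rule for this rewrite could look 
like the following sketch. Note that a grouped aggregate over an *empty* 
child produces zero rows, so substituting OneRowRelation is only safe when 
the child provably emits at least one row:

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, OneRowRelation, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: rewrite a group-only Aggregate whose grouping and
// aggregate expressions are all foldable into a constant Project over
// OneRowRelation, so the child (e.g. a table scan) can be dropped.
object RemoveFoldableAggregate extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Aggregate(groupingExprs, aggExprs, child)
        if groupingExprs.nonEmpty &&
          groupingExprs.forall(_.foldable) &&
          aggExprs.forall(_.foldable) &&
          producesAtLeastOneRow(child) =>
      Project(aggExprs, OneRowRelation())
  }

  // Placeholder: Catalyst has no general "at least one row" proof, so this
  // sketch is deliberately conservative; the real proposal would need a
  // stronger non-emptiness argument to cover scans like testData.
  private def producesAtLeastOneRow(plan: LogicalPlan): Boolean = plan match {
    case _: OneRowRelation => true
    case _ => false
  }
}
{code}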



  was:
If all groupingExpressions and aggregateExpressions in an aggregate are 
foldable, we can remove the aggregate.
For example, for the query:
{code:sql}
SELECT distinct 1001 as id , cast('2022-06-03' as date) AS DT FROM testData
{code}

The grouping expressions are *[1001, 2022-06-03]* and the aggregate 
expressions are *[1001 AS id#274, 2022-06-03 AS DT#275]*, so we can skip 
scanning table testData and remove the aggregate operation.


> Remove Aggregate if it is group only and all grouping and aggregate 
> expressions are foldable
> 
>
> Key: SPARK-39893
> URL: https://issues.apache.org/jira/browse/SPARK-39893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wan Kun
>Priority: Major
>
> If all groupingExpressions and aggregateExpressions in an aggregate are 
> foldable, we can remove the aggregate.
> For example, for the query:
> {code:sql}
> SELECT distinct 1001 as id , cast('2022-06-03' as date) AS DT FROM testData
> {code}
> The grouping expressions are *[1001, 2022-06-03]* and the aggregate 
> expressions are *[1001 AS id#274, 2022-06-03 AS DT#275]*, so we can skip 
> scanning table testData and remove the aggregate operation.
> Before this PR:
> {code:java}
> Aggregate [1001, 2022-06-03], [1001 AS id#274, 2022-06-03 AS DT#275], 
> Statistics(sizeInBytes=16.0 EiB)
> +- SerializeFromObject, Statistics(sizeInBytes=8.0 EiB)
>+- ExternalRDD [obj#12], Statistics(sizeInBytes=8.0 EiB)
> {code}
> After this PR:
> {code:java}
> Project [1001 AS id#218, 2022-06-03 AS DT#219], Statistics(sizeInBytes=2.0 B)
> +- OneRowRelation, Statistics(sizeInBytes=1.0 B)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39893) Remove Aggregate if it is group only and all grouping and aggregate expressions are foldable

2022-07-27 Thread Wan Kun (Jira)
Wan Kun created SPARK-39893:
---

 Summary: Remove Aggregate if it is group only and all grouping and 
aggregate expressions are foldable
 Key: SPARK-39893
 URL: https://issues.apache.org/jira/browse/SPARK-39893
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Wan Kun


If all groupingExpressions and aggregateExpressions in an aggregate are 
foldable, we can remove the aggregate.
For example, for the query:
{code:sql}
SELECT distinct 1001 as id , cast('2022-06-03' as date) AS DT FROM testData
{code}

The grouping expressions are *[1001, 2022-06-03]* and the aggregate 
expressions are *[1001 AS id#274, 2022-06-03 AS DT#275]*, so we can skip 
scanning table testData and remove the aggregate operation.






[jira] [Resolved] (SPARK-39834) Include the origin stats and constraints for LogicalRDD if it comes from DataFrame

2022-07-27 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-39834.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37248
[https://github.com/apache/spark/pull/37248]

> Include the origin stats and constraints for LogicalRDD if it comes from 
> DataFrame
> --
>
> Key: SPARK-39834
> URL: https://issues.apache.org/jira/browse/SPARK-39834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.4.0
>
>
> With SPARK-39748, Spark includes the origin logical plan for LogicalRDD if it 
> comes from DataFrame, to carry over stats as well as to provide information 
> that could connect two disconnected logical plans into one.
> After we introduced the change, we identified several issues:
> 1. One of the major use cases for DataFrame.checkpoint is ML, especially 
> iterative algorithms, whose purpose is to "prune" the logical plan. That 
> works against the purpose of including the origin logical plan, and we risk 
> nesting LogicalRDDs, which grows the size of the logical plan without bound 
> (see the sketch below).
> 2. We leverage the logical plan to carry over stats, but the correct stats 
> information is in the optimized plan.
> 3. (Not an issue, but a missed spot) constraints are also something we can 
> carry over.
> To address the above issues, it would be better to include the stats and 
> constraints in LogicalRDD rather than the logical plan.
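
To make issue 1 concrete, here is a minimal sketch of the iterative pattern 
whose whole point is to truncate lineage (assuming a spark-shell session with 
`spark` in scope; the path and loop body are illustrative):

{code:scala}
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
import spark.implicits._

var df = spark.range(0, 1000000).toDF("v")
for (_ <- 1 to 100) {
  df = df.filter($"v" % 2 === 0)  // stand-in for one training iteration
  df = df.checkpoint()            // meant to *prune* the accumulated plan
}
{code}

If every resulting LogicalRDD embedded its full origin logical plan, each 
iteration would nest another copy and the plan would grow instead of being 
pruned, which is why carrying only stats and constraints is preferable.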






[jira] [Assigned] (SPARK-39834) Include the origin stats and constraints for LogicalRDD if it comes from DataFrame

2022-07-27 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-39834:


Assignee: Jungtaek Lim

> Include the origin stats and constraints for LogicalRDD if it comes from 
> DataFrame
> --
>
> Key: SPARK-39834
> URL: https://issues.apache.org/jira/browse/SPARK-39834
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> With SPARK-39748, Spark includes the origin logical plan for LogicalRDD if it 
> comes from DataFrame, to carry over stats as well as to provide information 
> that could connect two disconnected logical plans into one.
> After we introduced the change, we identified several issues:
> 1. One of the major use cases for DataFrame.checkpoint is ML, especially 
> iterative algorithms, whose purpose is to "prune" the logical plan. That 
> works against the purpose of including the origin logical plan, and we risk 
> nesting LogicalRDDs, which grows the size of the logical plan without bound.
> 2. We leverage the logical plan to carry over stats, but the correct stats 
> information is in the optimized plan.
> 3. (Not an issue, but a missed spot) constraints are also something we can 
> carry over.
> To address the above issues, it would be better to include the stats and 
> constraints in LogicalRDD rather than the logical plan.






[jira] [Assigned] (SPARK-39892) Use ArrowType.Decimal(precision, scale, bitWidth) instead of ArrowType.Decimal(precision, scale)

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39892:


Assignee: Apache Spark

> Use ArrowType.Decimal(precision, scale, bitWidth) instead of 
> ArrowType.Decimal(precision, scale)
> 
>
> Key: SPARK-39892
> URL: https://issues.apache.org/jira/browse/SPARK-39892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.4.0
>
>
> [warn] 
> /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:48:49:
>  [deprecation @ org.apache.spark.sql.util.ArrowUtils.toArrowType | 
> origin=org.apache.arrow.vector.types.pojo.ArrowType.Decimal. | 
> version=] constructor Decimal in class Decimal is deprecated
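
The remedy, sketched below with made-up precision/scale values, is to switch 
to Arrow's three-argument constructor; 128 is the bit width the deprecated 
two-argument form implicitly assumed (a standard 16-byte Arrow decimal):

{code:scala}
import org.apache.arrow.vector.types.pojo.ArrowType

val precision = 38
val scale = 18

// Deprecated two-argument constructor, the source of the build warning:
val before = new ArrowType.Decimal(precision, scale)

// Replacement with an explicit 128-bit width:
val after = new ArrowType.Decimal(precision, scale, 128)
{code}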






[jira] [Assigned] (SPARK-39892) Use ArrowType.Decimal(precision, scale, bitWidth) instead of ArrowType.Decimal(precision, scale)

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39892:


Assignee: (was: Apache Spark)

> Use ArrowType.Decimal(precision, scale, bitWidth) instead of 
> ArrowType.Decimal(precision, scale)
> 
>
> Key: SPARK-39892
> URL: https://issues.apache.org/jira/browse/SPARK-39892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> [warn] 
> /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:48:49:
>  [deprecation @ org.apache.spark.sql.util.ArrowUtils.toArrowType | 
> origin=org.apache.arrow.vector.types.pojo.ArrowType.Decimal. | 
> version=] constructor Decimal in class Decimal is deprecated






[jira] [Commented] (SPARK-39892) Use ArrowType.Decimal(precision, scale, bitWidth) instead of ArrowType.Decimal(precision, scale)

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571780#comment-17571780
 ] 

Apache Spark commented on SPARK-39892:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37315

> Use ArrowType.Decimal(precision, scale, bitWidth) instead of 
> ArrowType.Decimal(precision, scale)
> 
>
> Key: SPARK-39892
> URL: https://issues.apache.org/jira/browse/SPARK-39892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> [warn] 
> /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:48:49:
>  [deprecation @ org.apache.spark.sql.util.ArrowUtils.toArrowType | 
> origin=org.apache.arrow.vector.types.pojo.ArrowType.Decimal. | 
> version=] constructor Decimal in class Decimal is deprecated






[jira] [Commented] (SPARK-39892) Use ArrowType.Decimal(precision, scale, bitWidth) instead of ArrowType.Decimal(precision, scale)

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571779#comment-17571779
 ] 

Apache Spark commented on SPARK-39892:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37315

> Use ArrowType.Decimal(precision, scale, bitWidth) instead of 
> ArrowType.Decimal(precision, scale)
> 
>
> Key: SPARK-39892
> URL: https://issues.apache.org/jira/browse/SPARK-39892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> [warn] 
> /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:48:49:
>  [deprecation @ org.apache.spark.sql.util.ArrowUtils.toArrowType | 
> origin=org.apache.arrow.vector.types.pojo.ArrowType.Decimal. | 
> version=] constructor Decimal in class Decimal is deprecated






[jira] [Updated] (SPARK-39892) Use ArrowType.Decimal(precision, scale, bitWidth) instead of ArrowType.Decimal(precision, scale)

2022-07-27 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-39892:

Summary: Use ArrowType.Decimal(precision, scale, bitWidth) instead of 
ArrowType.Decimal(precision, scale)  (was: Use ArrowType.Decimal(int precision, 
int scale, int bitWidth) instead of ArrowType.Decimal(int precision, int scale))

> Use ArrowType.Decimal(precision, scale, bitWidth) instead of 
> ArrowType.Decimal(precision, scale)
> 
>
> Key: SPARK-39892
> URL: https://issues.apache.org/jira/browse/SPARK-39892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> [warn] 
> /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:48:49:
>  [deprecation @ org.apache.spark.sql.util.ArrowUtils.toArrowType | 
> origin=org.apache.arrow.vector.types.pojo.ArrowType.Decimal. | 
> version=] constructor Decimal in class Decimal is deprecated






[jira] [Created] (SPARK-39892) Use ArrowType.Decimal(int precision, int scale, int bitWidth) instead of ArrowType.Decimal(int precision, int scale)

2022-07-27 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-39892:
---

 Summary: Use ArrowType.Decimal(int precision, int scale, int 
bitWidth) instead of ArrowType.Decimal(int precision, int scale)
 Key: SPARK-39892
 URL: https://issues.apache.org/jira/browse/SPARK-39892
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: BingKun Pan
 Fix For: 3.4.0


[warn] 
/home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:48:49:
 [deprecation @ org.apache.spark.sql.util.ArrowUtils.toArrowType | 
origin=org.apache.arrow.vector.types.pojo.ArrowType.Decimal. | version=] 
constructor Decimal in class Decimal is deprecated






[jira] [Commented] (SPARK-39833) Filtered parquet data frame count() and show() produce inconsistent results when spark.sql.parquet.filterPushdown is true

2022-07-27 Thread Ivan Sadikov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571768#comment-17571768
 ] 

Ivan Sadikov commented on SPARK-39833:
--

Interesting, I will take a look.

> Filtered parquet data frame count() and show() produce inconsistent results 
> when spark.sql.parquet.filterPushdown is true
> -
>
> Key: SPARK-39833
> URL: https://issues.apache.org/jira/browse/SPARK-39833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Michael Allman
>Priority: Major
>  Labels: correctness
>
> One of our data scientists discovered a problem wherein a data frame 
> `.show()` call printed non-empty results, but `.count()` printed 0. I've 
> narrowed the issue to a small, reproducible test case that exhibits this 
> aberrant behavior. In PySpark, run the following code:
> {code:python}
> from pyspark.sql.types import *
> parquet_pushdown_bug_df = spark.createDataFrame([{"COL0": int(0)}], 
> schema=StructType(fields=[StructField("COL0",IntegerType(),True)]))
> parquet_pushdown_bug_df.repartition(1).write.mode("overwrite").parquet("parquet_pushdown_bug/col0=0/parquet_pushdown_bug.parquet")
> reread_parquet_pushdown_bug_df = spark.read.parquet("parquet_pushdown_bug")
> reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
> print(reread_parquet_pushdown_bug_df.filter("col0 = 0").count())
> {code}
> In my usage, this prints a data frame with 1 row and a count of 0. However, 
> disabling `spark.sql.parquet.filterPushdown` produces consistent results:
> {code:python}
> spark.conf.set("spark.sql.parquet.filterPushdown", False)
> reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
> reread_parquet_pushdown_bug_df.filter("col0 = 0").count()
> {code}
> This will print the same data frame; however, it will print a count of 1. The 
> key to triggering this bug is not just enabling 
> `spark.sql.parquet.filterPushdown` (which is enabled by default): the case of 
> the column in the data frame (before writing) must differ from the case of 
> the partition column in the file path, i.e. COL0 versus col0 or col0 versus 
> COL0.






[jira] [Resolved] (SPARK-39888) Fix scala code style in SecurityManagerSuite

2022-07-27 Thread xiaoping.huang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiaoping.huang resolved SPARK-39888.

Resolution: Won't Do

> Fix scala code style in SecurityManagerSuite
> 
>
> Key: SPARK-39888
> URL: https://issues.apache.org/jira/browse/SPARK-39888
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: xiaoping.huang
>Priority: Major
>
> Usually, semicolons are not used at the end of a line of code in Scala. To 
> maintain the code style, we need to delete some semicolons.






[jira] [Commented] (SPARK-39891) Bump h2 to 2.1.214

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571765#comment-17571765
 ] 

Apache Spark commented on SPARK-39891:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37314

> Bump h2 to 2.1.214
> --
>
> Key: SPARK-39891
> URL: https://issues.apache.org/jira/browse/SPARK-39891
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> release notes:
> https://github.com/h2database/h2database/releases






[jira] [Assigned] (SPARK-39891) Bump h2 to 2.1.214

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39891:


Assignee: Apache Spark

> Bump h2 to 2.1.214
> --
>
> Key: SPARK-39891
> URL: https://issues.apache.org/jira/browse/SPARK-39891
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.4.0
>
>
> release notes:
> https://github.com/h2database/h2database/releases






[jira] [Assigned] (SPARK-39891) Bump h2 to 2.1.214

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39891:


Assignee: (was: Apache Spark)

> Bump h2 to 2.1.214
> --
>
> Key: SPARK-39891
> URL: https://issues.apache.org/jira/browse/SPARK-39891
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> release notes:
> https://github.com/h2database/h2database/releases






[jira] [Commented] (SPARK-39833) Filtered parquet data frame count() and show() produce inconsistent results when spark.sql.parquet.filterPushdown is true

2022-07-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571763#comment-17571763
 ] 

Hyukjin Kwon commented on SPARK-39833:
--

Seems like a bug from Parquet side in rowgroup filtering.

> Filtered parquet data frame count() and show() produce inconsistent results 
> when spark.sql.parquet.filterPushdown is true
> -
>
> Key: SPARK-39833
> URL: https://issues.apache.org/jira/browse/SPARK-39833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Michael Allman
>Priority: Major
>  Labels: correctness
>
> One of our data scientists discovered a problem wherein a data frame 
> `.show()` call printed non-empty results, but `.count()` printed 0. I've 
> narrowed the issue to a small, reproducible test case that exhibits this 
> aberrant behavior. In PySpark, run the following code:
> {code:python}
> from pyspark.sql.types import *
> parquet_pushdown_bug_df = spark.createDataFrame([{"COL0": int(0)}], 
> schema=StructType(fields=[StructField("COL0",IntegerType(),True)]))
> parquet_pushdown_bug_df.repartition(1).write.mode("overwrite").parquet("parquet_pushdown_bug/col0=0/parquet_pushdown_bug.parquet")
> reread_parquet_pushdown_bug_df = spark.read.parquet("parquet_pushdown_bug")
> reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
> print(reread_parquet_pushdown_bug_df.filter("col0 = 0").count())
> {code}
> In my usage, this prints a data frame with 1 row and a count of 0. However, 
> disabling `spark.sql.parquet.filterPushdown` produces consistent results:
> {code:python}
> spark.conf.set("spark.sql.parquet.filterPushdown", False)
> reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
> reread_parquet_pushdown_bug_df.filter("col0 = 0").count()
> {code}
> This will print the same data frame; however, it will print a count of 1. The 
> key to triggering this bug is not just enabling 
> `spark.sql.parquet.filterPushdown` (which is enabled by default): the case of 
> the column in the data frame (before writing) must differ from the case of 
> the partition column in the file path, i.e. COL0 versus col0 or col0 versus 
> COL0.






[jira] [Commented] (SPARK-39891) Bump h2 to 2.1.214

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571764#comment-17571764
 ] 

Apache Spark commented on SPARK-39891:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37314

> Bump h2 to 2.1.214
> --
>
> Key: SPARK-39891
> URL: https://issues.apache.org/jira/browse/SPARK-39891
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> release notes:
> https://github.com/h2database/h2database/releases






[jira] [Created] (SPARK-39891) Bump h2 to 2.1.214

2022-07-27 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-39891:
---

 Summary: Bump h2 to 2.1.214
 Key: SPARK-39891
 URL: https://issues.apache.org/jira/browse/SPARK-39891
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: BingKun Pan
 Fix For: 3.4.0


release notes:

https://github.com/h2database/h2database/releases






[jira] [Commented] (SPARK-39889) Use different error classes for numeric/interval divided by 0

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571751#comment-17571751
 ] 

Apache Spark commented on SPARK-39889:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37313

> Use different error classes for numeric/interval divided by 0
> -
>
> Key: SPARK-39889
> URL: https://issues.apache.org/jira/browse/SPARK-39889
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, when numbers are divided by 0 under ANSI mode, the error message 
> looks like:
> {quote}[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate 
> divisor being 0 and return NULL instead. If necessary set "ansi_mode" to 
> "false" (except for ANSI interval type) to bypass this error.{quote}
> The "(except for ANSI interval type)" part is confusing. We should remove it 
> and introduce a new error class "INTERVAL_DIVIDED_BY_ZERO".
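
A quick behavior sketch of the two cases the separate error classes would 
distinguish (an illustrative spark-shell snippet, not code from the proposed 
change):

{code:scala}
// Numeric division by zero errors only under ANSI mode, and try_divide is
// the documented escape hatch:
spark.conf.set("spark.sql.ansi.enabled", true)
spark.sql("SELECT 3 / 0").show()              // throws [DIVIDE_BY_ZERO]
spark.sql("SELECT try_divide(3, 0)").show()   // returns NULL

// ANSI interval division by zero fails regardless of the ANSI setting,
// which is why the "set ansi to false" hint is misleading for intervals:
spark.sql("SELECT INTERVAL '2' YEAR / 0").show()
{code}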






[jira] [Assigned] (SPARK-39889) Use different error classes for numeric/interval divided by 0

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39889:


Assignee: Apache Spark  (was: Gengliang Wang)

> Use different error classes for numeric/interval divided by 0
> -
>
> Key: SPARK-39889
> URL: https://issues.apache.org/jira/browse/SPARK-39889
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Currently, when numbers are divided by 0 under ANSI mode, the error message 
> looks like:
> {quote}[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate 
> divisor being 0 and return NULL instead. If necessary set "ansi_mode" to 
> "false" (except for ANSI interval type) to bypass this error.{quote}
> The "(except for ANSI interval type)" part is confusing. We should remove it 
> and introduce a new error class "INTERVAL_DIVIDED_BY_ZERO".






[jira] [Assigned] (SPARK-39889) Use different error classes for numeric/interval divided by 0

2022-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39889:


Assignee: Gengliang Wang  (was: Apache Spark)

> Use different error classes for numeric/interval divided by 0
> -
>
> Key: SPARK-39889
> URL: https://issues.apache.org/jira/browse/SPARK-39889
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, when numbers are divided by 0 under ANSI mode, the error message 
> looks like:
> {quote}[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate 
> divisor being 0 and return NULL instead. If necessary set "ansi_mode" to 
> "false" (except for ANSI interval type) to bypass this error.{quote}
> The "(except for ANSI interval type)" part is confusing. We should remove it 
> and introduce a new error class "INTERVAL_DIVIDED_BY_ZERO".






[jira] [Commented] (SPARK-39889) Use different error classes for numeric/interval divided by 0

2022-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571750#comment-17571750
 ] 

Apache Spark commented on SPARK-39889:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37313

> Use different error classes for numeric/interval divided by 0
> -
>
> Key: SPARK-39889
> URL: https://issues.apache.org/jira/browse/SPARK-39889
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, when numbers are divided by 0 under ANSI mode, the error message 
> is like
> {quote}[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate 
> divisor being 0 and return NULL instead. If necessary set "ansi_mode" to 
> "false" (except for ANSI interval type) to bypass this error.{quote}
> The "(except for ANSI interval type)" part is confusing.  We should remove it 
> and have a new error class "INTERVAL_DIVIDED_BY_ZERO"





