[ 
https://issues.apache.org/jira/browse/SPARK-26709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26709:
----------------------------
    Description: 
{code:java}
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.internal.SQLConf
withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
  withTempPath { path =>
    val tabLocation = path.getAbsolutePath
    val partLocation = new Path(path.getAbsolutePath, "partCol1=3")
    val df = spark.emptyDataFrame.select(lit(1).as("col1"))
    df.write.parquet(partLocation.toString)
    val readDF = spark.read.parquet(tabLocation)
    checkAnswer(readDF.selectExpr("max(partCol1)"), Row(null))
    checkAnswer(readDF.selectExpr("max(col1)"), Row(null))
  }
}
{code}

OptimizeMetadataOnlyQuery has a correctness bug when handling files with 
empty records in partitioned tables. The test above fails on 2.4, which can 
write such an empty file, but the underlying issue in the read path also 
exists in 2.3, 2.2, and 2.1. 
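For reference, the same reproduction can be sketched outside the test harness. The {{SparkSession}} setup, temp-directory handling, and config key usage below are illustrative assumptions, not part of the original report:

{code:java}
import java.nio.file.Files
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SPARK-26709-repro")
  .getOrCreate()

// SQLConf.OPTIMIZER_METADATA_ONLY.key resolves to this config key.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")

// Write a Parquet file that has a schema but zero rows into one partition dir.
val tabLocation = Files.createTempDirectory("spark26709").toFile.getAbsolutePath
val partLocation = new Path(tabLocation, "partCol1=3")
val df = spark.emptyDataFrame.select(lit(1).as("col1"))
df.write.parquet(partLocation.toString)

val readDF = spark.read.parquet(tabLocation)

// Correct answer for both aggregates is NULL, since there are no rows.
// Under the bug, the metadata-only optimization answers max(partCol1) from
// the partition directory name instead of scanning the (empty) data.
readDF.selectExpr("max(partCol1)").show()
readDF.selectExpr("max(col1)").show()
{code}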

  was:
{code:java}
import org.apache.spark.sql.functions.lit
withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
  withTempPath { path =>
    val tabLocation = path.getAbsolutePath
    val partLocation = new Path(path.getAbsolutePath, "partCol1=3")
    val df = spark.emptyDataFrame.select(lit(1).as("col1"))
    df.write.parquet(partLocation.toString)
    val readDF = spark.read.parquet(tabLocation)
    checkAnswer(readDF.selectExpr("max(partCol1)"), Row(null))
    checkAnswer(readDF.selectExpr("max(col1)"), Row(null))
  }
}
{code}

OptimizeMetadataOnlyQuery has a correctness bug when handling files with 
empty records in partitioned tables. 


> OptimizeMetadataOnlyQuery does not correctly handle the empty files
> -------------------------------------------------------------------
>
>                 Key: SPARK-26709
>                 URL: https://issues.apache.org/jira/browse/SPARK-26709
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.3, 2.2.3, 2.3.2, 2.4.0
>            Reporter: Xiao Li
>            Priority: Blocker
>              Labels: correctness
>
> {code:java}
> import org.apache.hadoop.fs.Path
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.functions.lit
> import org.apache.spark.sql.internal.SQLConf
> withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
>   withTempPath { path =>
>     val tabLocation = path.getAbsolutePath
>     val partLocation = new Path(path.getAbsolutePath, "partCol1=3")
>     val df = spark.emptyDataFrame.select(lit(1).as("col1"))
>     df.write.parquet(partLocation.toString)
>     val readDF = spark.read.parquet(tabLocation)
>     checkAnswer(readDF.selectExpr("max(partCol1)"), Row(null))
>     checkAnswer(readDF.selectExpr("max(col1)"), Row(null))
>   }
> }
> {code}
> OptimizeMetadataOnlyQuery has a correctness bug when handling files with 
> empty records in partitioned tables. The test above fails on 2.4, which 
> can write such an empty file, but the underlying issue in the read path 
> also exists in 2.3, 2.2, and 2.1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
