[jira] [Updated] (SPARK-26709) OptimizeMetadataOnlyQuery does not correctly handle the empty files

2019-01-23 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26709:

Description: 
{code:java}
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.internal.SQLConf

withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
  withTempPath { path =>
    val tabLocation = path.getAbsolutePath
    val partLocation = new Path(path.getAbsolutePath, "partCol1=3")
    val df = spark.emptyDataFrame.select(lit(1).as("col1"))
    df.write.parquet(partLocation.toString)
    val readDF = spark.read.parquet(tabLocation)
    checkAnswer(readDF.selectExpr("max(partCol1)"), Row(null))
    checkAnswer(readDF.selectExpr("max(col1)"), Row(null))
  }
}
{code}

OptimizeMetadataOnlyQuery has a correctness bug when handling files with 
empty records in partitioned tables. The above test fails in 2.4, which can 
generate an empty file, but the underlying issue in the read path also 
exists in 2.3, 2.2 and 2.1. 
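For readers without the Spark test harness, the same behavior can be reproduced in a plain spark-shell session. This is a sketch only: {{withSQLConf}}, {{withTempPath}}, and {{checkAnswer}} are Spark test-suite helpers, replaced here with public API calls, and a running SparkSession named {{spark}} is assumed.

{code:java}
// Standalone reproduction sketch (assumes a SparkSession named `spark`,
// e.g. inside spark-shell). Not part of the original test case.
import java.nio.file.Files
import org.apache.spark.sql.functions.lit

// Enable the metadata-only optimization under test.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")

val tabLocation = Files.createTempDirectory("spark26709").toString
val partLocation = s"$tabLocation/partCol1=3"

// Write an empty Parquet file into the partition directory: the partition
// exists on disk but contains zero rows.
spark.emptyDataFrame.select(lit(1).as("col1")).write.parquet(partLocation)

val readDF = spark.read.parquet(tabLocation)

// Correct answers are null for both aggregates, since the table has no rows.
// With the buggy optimization, max(partCol1) is answered from the partition
// directory name alone and can wrongly return 3.
readDF.selectExpr("max(partCol1)").show()
readDF.selectExpr("max(col1)").show()
{code}

The point of the sketch is that the optimizer answers partition-column aggregates from catalog/directory metadata without checking whether the listed files contain any rows.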

  was:
{code:java}
import org.apache.spark.sql.functions.lit
withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
  withTempPath { path =>
    val tabLocation = path.getAbsolutePath
    val partLocation = new Path(path.getAbsolutePath, "partCol1=3")
    val df = spark.emptyDataFrame.select(lit(1).as("col1"))
    df.write.parquet(partLocation.toString)
    val readDF = spark.read.parquet(tabLocation)
    checkAnswer(readDF.selectExpr("max(partCol1)"), Row(null))
    checkAnswer(readDF.selectExpr("max(col1)"), Row(null))
  }
}
{code}

OptimizeMetadataOnlyQuery has a correctness bug when handling files with 
empty records in partitioned tables. The above test fails in 2.4, which can 
write an empty file, but the underlying issue in the read path also exists 
in 2.3, 2.2 and 2.1. 


> OptimizeMetadataOnlyQuery does not correctly handle the empty files
> ---
>
> Key: SPARK-26709
> URL: https://issues.apache.org/jira/browse/SPARK-26709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.2, 2.4.0
>Reporter: Xiao Li
>Priority: Blocker
>  Labels: correctness
>
> {code:java}
> import org.apache.spark.sql.functions.lit
> withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
>   withTempPath { path =>
>     val tabLocation = path.getAbsolutePath
>     val partLocation = new Path(path.getAbsolutePath, "partCol1=3")
>     val df = spark.emptyDataFrame.select(lit(1).as("col1"))
>     df.write.parquet(partLocation.toString)
>     val readDF = spark.read.parquet(tabLocation)
>     checkAnswer(readDF.selectExpr("max(partCol1)"), Row(null))
>     checkAnswer(readDF.selectExpr("max(col1)"), Row(null))
>   }
> }
> {code}
> OptimizeMetadataOnlyQuery has a correctness bug when handling files with 
> empty records in partitioned tables. The above test fails in 2.4, which 
> can generate an empty file, but the underlying issue in the read path 
> also exists in 2.3, 2.2 and 2.1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


