thegiive created SPARK-8690:
-------------------------------

             Summary: Add a setting to disable Spark SQL Parquet schema merging when using the datasource API
                 Key: SPARK-8690
                 URL: https://issues.apache.org/jira/browse/SPARK-8690
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.4.0
         Environment: all
            Reporter: thegiive
            Priority: Minor


We need a general config to disable the Parquet schema merging feature.

Our Spark SQL applications have the following requirements:

# In Spark 1.1 and 1.2, Spark SQL reads our Parquet data in around 1~5 seconds, and we don't want the read time to grow much beyond that. Our roughly 2000 Parquet files all share the same schema, so we don't need the schema merging feature.
# We need datasource API features such as partition discovery, so we cannot stay on Spark 1.2 or earlier.
# We have many Spark SQL products. They read Parquet files via *sqlContext.parquetFile(filename)* (see the sketch after this list), and we don't want to change that application code. A single setting to disable the feature is what we want.
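
For reference, a minimal sketch of our current read pattern (Spark 1.x Scala API; the path and table name are illustrative):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-read"))
val sqlContext = new SQLContext(sc)

// Call sites like this exist throughout our applications;
// we want them to keep working unchanged.
val events = sqlContext.parquetFile("hdfs:///data/events")
events.registerTempTable("events")
sqlContext.sql("SELECT COUNT(*) FROM events").show()
{code}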


In 1.4 there are several options, but none of them perfectly matches our use case:

# Set spark.sql.parquet.useDataSourceApi to false. This satisfies requirements 1 and 3, but it falls back to the old Parquet API and fails requirement 2.
# Use sqlContext.load("parquet", Map("path" -> "...", "mergeSchema" -> "false")), as in the sketch after this list. This meets requirements 1 and 2, but it requires changing every place where we load Parquet.
# Spark 1.4 improves schema merging a lot compared to 1.3, but reading Parquet with the default settings still increases our load time from 1~5 seconds to around 100 seconds, which fails requirement 1.
# Try the config from PR 5231, but it cannot disable schema merging.
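
For example, workaround 2 looks like this at each call site, using the sqlContext from the sketch above (the path is illustrative):

{code:scala}
// Per-call workaround via the datasource API: schema merging is turned
// off for this one load, but every parquetFile() call site would have
// to be rewritten this way.
val events = sqlContext.load("parquet",
  Map("path" -> "hdfs:///data/events", "mergeSchema" -> "false"))
{code}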

I think it is better to add a config that disables the datasource API's schema merging feature. A PR will be provided later.
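
A minimal sketch of the intended usage, again using the sqlContext from the first sketch; the config key name is an assumption for illustration and would be settled in the PR:

{code:scala}
// Hypothetical config key (the exact name would be decided in the PR):
// turn off schema merging globally for the datasource Parquet path.
sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")

// Existing call sites then need no changes and still go through the
// datasource API, so partition discovery keeps working.
val events = sqlContext.parquetFile("hdfs:///data/events")
{code}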




