thegiive created SPARK-8690:
-------------------------------

             Summary: Add a setting to disable SparkSQL parquet schema merge by using datasource API
                 Key: SPARK-8690
                 URL: https://issues.apache.org/jira/browse/SPARK-8690
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.4.0
         Environment: all
            Reporter: thegiive
            Priority: Minor

We need a general config to disable the Parquet schema merge feature.

Our Spark SQL application requirements are:
# In Spark 1.1 and 1.2, Spark SQL takes around 1~5 sec to read our Parquet data, and we don't want that read time to increase much. We have around 2000 Parquet files, all with the same schema, so we don't need the schema merge feature.
# We need data source API features such as partition discovery, so we cannot use Spark 1.2 or previous versions.
# We have a lot of Spark SQL products, and they use *sqlContext.parquetFile(filename)* to read Parquet files. We don't want to change the application code. A single setting to disable this feature is what we want.

In 1.4 we have several methods, but none of them perfectly matches our use case:
# Set spark.sql.parquet.useDataSourceApi to false. This meets requirements 1 and 3, but it uses the old Parquet API and fails requirement 2.
# Use sqlContext.load("parquet" , Map( "path" -> "..." , "mergeSchema" -> "false" )). This meets requirements 1 and 2, but we would need to change a lot of the code that loads Parquet.
# Spark 1.4 improves schema merging a lot over 1.3, but using the default Parquet read path directly increases the load time from 1~5 sec to 100 sec. This fails requirement 1.
# Try the config from PR 5231. But it cannot disable schema merge.

I think it is better to add a config that disables the data source API's schema merge feature. A PR will be provided later.
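
Until such a config exists, the code churn of the sqlContext.load workaround could be limited by building the load options in one place. The following is only a minimal sketch: ParquetLoad and its options method are hypothetical names and not part of Spark, and the path "/data/events" is a placeholder. The "path" and "mergeSchema" option keys are the ones quoted in this issue.

```scala
// Sketch only: ParquetLoad and options are hypothetical names, not Spark APIs.
object ParquetLoad {
  // Build the option map passed to the data source API.
  // Schema merging is off unless the caller explicitly asks for it.
  def options(path: String, mergeSchema: Boolean = false): Map[String, String] =
    Map("path" -> path, "mergeSchema" -> mergeSchema.toString)
}

// With a Spark 1.4 SQLContext in scope, a call site would then read:
//   sqlContext.load("parquet", ParquetLoad.options("/data/events"))
```

Every call site still has to be edited once, so this does not meet the requirement of leaving application code unchanged. It only keeps the mergeSchema choice in a single place.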