[ https://issues.apache.org/jira/browse/SPARK-8690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604631#comment-14604631 ]
Apache Spark commented on SPARK-8690:
-------------------------------------

User 'thegiive' has created a pull request for this issue:
https://github.com/apache/spark/pull/7070

> Add a setting to disable SparkSQL parquet schema merge by using datasource API
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-8690
>                 URL: https://issues.apache.org/jira/browse/SPARK-8690
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.0
>         Environment: all
>            Reporter: thegiive
>            Priority: Minor
>
> We need a general config to disable the Parquet schema merge feature.
> Our Spark SQL application requirements are:
> # In Spark 1.1 and 1.2, the Spark SQL Parquet read time is around 1~5 sec.
> We don't want the read time to increase much. We have around 2000 Parquet
> files, all with the same schema, so we don't need the schema merge feature.
> # We need data source API features such as partition discovery, so we
> cannot use Spark 1.2 or previous versions.
> # We have many Spark SQL products. They use
> *sqlContext.parquetFile(filename)* to read Parquet files, and we don't want
> to change the application code. A single setting to disable this feature is
> what we want.
> In 1.4 there are several options, but none of them fully matches our use
> case:
> # Set spark.sql.parquet.useDataSourceApi to false. This meets requirements
> 1 and 3, but it uses the old Parquet API and fails requirement 2.
> # Use sqlContext.load("parquet" , Map( "path" -> "..." , "mergeSchema" ->
> "false" )). This meets requirements 1 and 2, but it requires changing a lot
> of our Parquet loading code.
> # Use the default Parquet data source directly. Spark 1.4 improves schema
> merging considerably over 1.3, but the load time still increases from 1~5
> sec to 100 sec, which fails requirement 1.
> # Try the config from PR 5231. It cannot disable schema merge.
> I think it is better to add a config that disables the data source API's
> schema merge feature.
> A PR will be provided later.
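
For reference, the second workaround in the description (passing "mergeSchema" -> "false" through sqlContext.load) could be wrapped in one helper so that call sites change only once. This is a minimal sketch against the Spark 1.4 SQLContext API, not the change proposed in the pull request. The object name ParquetReader and the key "myapp.parquet.mergeSchema" are hypothetical, application-level names chosen for illustration; the key is not a Spark setting.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

object ParquetReader {
  // Hypothetical application-level key (an assumption, not a Spark config).
  val MergeSchemaKey = "myapp.parquet.mergeSchema"

  // Reads a Parquet path through the data source API, so partition
  // discovery still applies, and forwards the mergeSchema option.
  // Schema merging is off unless the key above is set to "true".
  def read(sqlContext: SQLContext, path: String): DataFrame = {
    val mergeSchema = sqlContext.getConf(MergeSchemaKey, "false")
    sqlContext.load("parquet", Map("path" -> path, "mergeSchema" -> mergeSchema))
  }
}
```

Existing sqlContext.parquetFile(filename) calls would still have to be replaced with ParquetReader.read(sqlContext, filename), which is the code change the reporter wants to avoid, so this only reduces the cost of the workaround and does not remove the need for a built-in setting.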