[ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403308#comment-15403308 ]
Hyukjin Kwon commented on SPARK-16842:
--------------------------------------

Thanks for your feedback.

Yes, but I don't think it would be very time consuming (though it still has some cost), since in most cases we would touch only a single file for Parquet and ORC, whereas JSON and CSV need an entire scan, as you already know. So for Parquet and ORC this overhead would be almost constant regardless of the number or size of the files.

Personally, I am in favor of allowing schema compatibility (and I did open a PR), but I saw some opinions and comments which *I assume* imply not supporting schema compatibility. In that case, this one might be an option.

> Concern about disallowing user-given schema for Parquet and ORC
> ---------------------------------------------------------------
>
>                 Key: SPARK-16842
>                 URL: https://issues.apache.org/jira/browse/SPARK-16842
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> If my understanding is correct, when the user-given schema differs from the
> inferred schema, it is handled differently for each datasource.
> - For JSON and CSV
>   it is generally permissive (for example, compatibility among
>   numeric types).
> - For ORC and Parquet
>   it is generally strict about types, so they don't allow this compatibility
>   (except for very few cases, e.g. for Parquet,
>   https://github.com/apache/spark/pull/14272 and
>   https://github.com/apache/spark/pull/14278).
> - For Text
>   it only supports {{StringType}}.
> - For JDBC
>   it does not take a user-given schema since it does not implement
>   {{SchemaRelationProvider}}.
> By allowing the user-given schema, we can use some types such as {{DateType}}
> and {{TimestampType}} for JSON and CSV. CSV and JSON arguably allow a
> permissive schema.
> To cut this short, JSON and CSV do not have the complete schema information
> written in the data whereas ORC and Parquet do.
> So, we might have to just disallow user-given schemas for Parquet and
> ORC. Actually, we can almost never give a different schema for ORC and
> Parquet, if my understanding is correct.
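
To make the per-datasource difference described in the issue concrete, here is a minimal sketch in plain Python. It is only a toy model under stated assumptions, not Spark code: the function names, the type strings, and the exact compatibility rules (any numeric pair is accepted, and a string field may be read as date or timestamp) are hypothetical simplifications of "permissive" versus "strict".

```python
# Toy model of the per-datasource behaviour described in this issue.
# Nothing here is a Spark API; all names and type strings are hypothetical.

NUMERIC = {"int", "long", "float", "double"}
FROM_STRING = {"date", "timestamp"}  # types JSON/CSV would only get via a user schema


def permissive_ok(user_type, inferred_type):
    """JSON/CSV-style: same type, any numeric pair, or a string read as date/timestamp."""
    if user_type == inferred_type:
        return True
    if user_type in NUMERIC and inferred_type in NUMERIC:
        return True
    return inferred_type == "string" and user_type in FROM_STRING


def strict_ok(user_type, stored_type):
    """Parquet/ORC-style: the type written in the file must match exactly."""
    return user_type == stored_type


def accepts_user_schema(source, user_schema, file_schema):
    """Return True if this toy model would accept user_schema for the given source."""
    if source == "jdbc":
        return False  # modelled as never taking a user-given schema
    if source == "text":
        return all(t == "string" for t in user_schema.values())
    # Anything other than JSON/CSV (i.e. Parquet, ORC) falls back to the strict check.
    check = permissive_ok if source in ("json", "csv") else strict_ok
    return all(check(t, file_schema.get(col)) for col, t in user_schema.items())


inferred = {"id": "long", "created": "string"}
wanted = {"id": "int", "created": "timestamp"}
print(accepts_user_schema("json", wanted, inferred))     # True
print(accepts_user_schema("parquet", wanted, inferred))  # False
```

In this model a user-given schema for Parquet or ORC is only accepted when it matches what is already stored in the file, which is the situation the issue describes as being unable to give a different schema almost all the time.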