[ https://issues.apache.org/jira/browse/SPARK-15693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust updated SPARK-15693:
-------------------------------------
    Target Version/s: 2.3.0  (was: 2.2.0)

> Write schema definition out for file-based data sources to avoid schema inference
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-15693
>                 URL: https://issues.apache.org/jira/browse/SPARK-15693
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> Spark supports reading a variety of data formats, many of which don't have
> a self-describing schema. For these file formats, Spark can often infer the
> schema by going through all the data. However, schema inference is expensive
> and does not always infer the intended schema (for example, with JSON data
> Spark always infers integer types as long rather than int).
>
> It would be great if Spark could write the schema definition out for file-based
> formats, so that when reading the data in, the schema can be "inferred" directly
> by reading the schema definition file without going through full schema
> inference. If the file does not exist, then the good old schema inference
> should be performed.
>
> This ticket certainly merits a design doc that should discuss the spec for
> the schema definition, as well as all the corner cases that this feature needs
> to handle (e.g. schema merging, schema evolution, partitioning). It would be
> great if the schema definition used a human-readable format (e.g. JSON).
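
The read path proposed in the ticket (use the schema definition file if it exists, otherwise fall back to inference) could be sketched roughly as below. This is a toy in plain Python over a CSV file, not Spark code and not a proposed design: the sidecar name `_schema.json`, the data file name `part-0.csv`, the `{"fields": [...]}` layout, and the helper names are all assumptions made up for illustration. The toy inference labels every integer column "long" to mimic the int/long issue the ticket mentions.

```python
import csv
import json
import os
import tempfile

SCHEMA_FILE = "_schema.json"  # hypothetical sidecar name, not a Spark convention
DATA_FILE = "part-0.csv"      # hypothetical single data file


def infer_schema(data_path):
    """Slow path: scan every row and guess a type for each column."""
    with open(data_path, newline="") as f:
        reader = csv.DictReader(f)
        types = {name: "long" for name in (reader.fieldnames or [])}
        for row in reader:
            for name, value in row.items():
                if types.get(name) == "long":
                    try:
                        int(value)
                    except (TypeError, ValueError):
                        types[name] = "string"
    return [{"name": n, "type": t} for n, t in types.items()]


def write_with_schema(directory, rows, schema):
    """Write the data and, next to it, a human-readable schema definition."""
    os.makedirs(directory, exist_ok=True)
    names = [field["name"] for field in schema]
    with open(os.path.join(directory, DATA_FILE), "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=names)
        writer.writeheader()
        writer.writerows(rows)
    with open(os.path.join(directory, SCHEMA_FILE), "w") as f:
        json.dump({"fields": schema}, f, indent=2)


def load_schema(directory):
    """Prefer the schema file; fall back to inference if it is missing."""
    schema_path = os.path.join(directory, SCHEMA_FILE)
    if os.path.exists(schema_path):
        with open(schema_path) as f:
            return json.load(f)["fields"], "schema file"
    return infer_schema(os.path.join(directory, DATA_FILE)), "inference"


# Usage: read once with the schema file present, once after deleting it.
with tempfile.TemporaryDirectory() as d:
    declared = [{"name": "id", "type": "int"}, {"name": "name", "type": "string"}]
    write_with_schema(d, [{"id": 1, "name": "a"}], declared)
    from_file = load_schema(d)
    os.remove(os.path.join(d, SCHEMA_FILE))
    from_inference = load_schema(d)
```

With the schema file present, the declared "int" type is returned without reading any data rows; once the file is removed, the fallback scans the data and reports "long" instead. The corner cases the ticket calls out (schema merging, schema evolution, partitioning) are deliberately not handled here.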