Reynold Xin created SPARK-15693:
-----------------------------------

             Summary: Write schema definition out for file-based data sources
                 Key: SPARK-15693
                 URL: https://issues.apache.org/jira/browse/SPARK-15693
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: Reynold Xin


Spark supports reading a variety of data formats, many of which don't have 
self-describing schemas. For these file formats, Spark can often infer the 
schema by going through all the data. However, schema inference is expensive 
and does not always infer the intended schema (for example, with JSON data 
Spark always infers integer types as long, rather than int).
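
To make the cost and the type-widening concrete, here is a minimal toy sketch in plain Python. It is not Spark's implementation: the function name infer_schema and the widening rules are assumptions made for illustration. It shows why inference has to touch every record, and why an integer field ends up as long when nothing tells the reader that int was intended.

```python
import json


def infer_schema(lines):
    """Toy inference over JSON lines: visits every record (hence the cost)."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for name, value in record.items():
            if value is None:
                continue  # a null carries no type information
            if isinstance(value, bool):  # check bool first: bool is a subclass of int
                inferred = "boolean"
            elif isinstance(value, int):
                inferred = "long"  # mirrors the behavior described above: never "int"
            elif isinstance(value, float):
                inferred = "double"
            else:
                inferred = "string"
            previous = schema.get(name)
            if previous is None or previous == inferred:
                schema[name] = inferred
            elif {previous, inferred} == {"long", "double"}:
                schema[name] = "double"  # widen numeric conflicts
            else:
                schema[name] = "string"  # fall back to the most general type
    return schema


records = ['{"id": 1, "name": "a"}', '{"id": 2, "score": 1.5}']
print(infer_schema(records))
```

Even though every id value here fits in a 32-bit int, the toy reader reports long, which is the mismatch a stored schema definition would avoid.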

It would be great if Spark could write the schema definition out for file-based 
formats, so that when reading the data in, the schema can be "inferred" directly 
by reading the schema definition file without going through full schema 
inference. If the file does not exist, then the good old schema inference should 
be performed.
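
The read path described above could be sketched roughly as follows, in plain Python rather than Spark code. Everything named here is hypothetical: the file name _schema.json, the helpers write_schema, infer_schema and resolve_schema, and the simplified two-type inference. The schema layout is modeled on the struct/fields JSON shape that Spark's StructType serializes to, but the ticket does not fix a spec, so treat that as an assumption too.

```python
import json
import os

SCHEMA_FILE = "_schema.json"  # hypothetical name; the ticket does not define one


def write_schema(data_dir, schema):
    """Write the schema definition next to the data, as human-readable JSON."""
    with open(os.path.join(data_dir, SCHEMA_FILE), "w") as f:
        json.dump(schema, f, indent=2)


def infer_schema(data_dir):
    """Fallback: scan every record of every JSON-lines data file (simplified)."""
    types = {}
    for name in sorted(os.listdir(data_dir)):
        if name.startswith("_") or not name.endswith(".json"):
            continue  # skip the schema file and any non-data files
        with open(os.path.join(data_dir, name)) as f:
            for line in f:
                if not line.strip():
                    continue
                for field, value in json.loads(line).items():
                    is_int = isinstance(value, int) and not isinstance(value, bool)
                    types.setdefault(field, "long" if is_int else "string")
    fields = [{"name": n, "type": t, "nullable": True, "metadata": {}}
              for n, t in types.items()]
    return {"type": "struct", "fields": fields}


def resolve_schema(data_dir):
    """Prefer the schema definition file; fall back to inference if it is absent."""
    path = os.path.join(data_dir, SCHEMA_FILE)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), "definition file"
    return infer_schema(data_dir), "inference"
```

With a definition file present, resolve_schema never opens the data files, and the writer can record "integer" where inference would have produced "long".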

This ticket certainly merits a design doc that should discuss the spec for 
the schema definition, as well as all the corner cases that this feature needs 
to handle (e.g. schema merging, schema evolution, partitioning). It would be 
great if the schema definition used a human-readable format (e.g. JSON).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

