[jira] [Updated] (SPARK-15693) Write schema definition out for file-based data sources to avoid schema inference
[ https://issues.apache.org/jira/browse/SPARK-15693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Agarwal updated SPARK-15693:
-----------------------------------
    Target Version/s: 2.4.0  (was: 2.3.0)

> Write schema definition out for file-based data sources to avoid schema inference
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-15693
>                 URL: https://issues.apache.org/jira/browse/SPARK-15693
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> Spark supports reading a variety of data formats, many of which don't have
> a self-describing schema. For these file formats, Spark can often infer the
> schema by going through all the data. However, schema inference is expensive
> and does not always infer the intended schema (for example, with JSON data
> Spark always infers integer types as long, rather than int).
> It would be great if Spark could write the schema definition out for
> file-based formats, so that when reading the data in, the schema can be
> "inferred" directly by reading the schema definition file without going
> through full schema inference. If the file does not exist, then the good old
> schema inference should be performed.
> This ticket certainly merits a design doc that should discuss the spec for
> the schema definition, as well as all the corner cases that this feature
> needs to handle (e.g. schema merging, schema evolution, partitioning). It
> would be great if the schema definition used a human-readable format
> (e.g. JSON).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15693) Write schema definition out for file-based data sources to avoid schema inference
[ https://issues.apache.org/jira/browse/SPARK-15693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-15693:
-------------------------------------
    Target Version/s: 2.3.0  (was: 2.2.0)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Updated] (SPARK-15693) Write schema definition out for file-based data sources to avoid schema inference
[ https://issues.apache.org/jira/browse/SPARK-15693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-15693:
--------------------------------
    Target Version/s: 2.2.0  (was: 2.1.0)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (SPARK-15693) Write schema definition out for file-based data sources to avoid schema inference
[ https://issues.apache.org/jira/browse/SPARK-15693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-15693:
--------------------------------
    Summary: Write schema definition out for file-based data sources to avoid schema inference  (was: Write schema definition out for file-based data sources)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)