Burak Yavuz created SPARK-30334:
-----------------------------------

             Summary: Add metadata around semi-structured columns to Spark
                 Key: SPARK-30334
                 URL: https://issues.apache.org/jira/browse/SPARK-30334
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 2.4.4
            Reporter: Burak Yavuz

Semi-structured data is widely used in the data industry for reporting events in a variety of formats. Click events in product analytics can be stored as JSON. Some application logs take the form of delimited key=value text. Other data may be in XML.

The goal of this project is to be able to signal to Spark that a column holds such data. This would then enable Spark to "auto-parse" these columns on the fly. The proposal is to store this information as part of the column metadata, in the fields:

 - format: The format of the semi-structured column, e.g. json, xml, avro
 - options: Options for parsing these columns

Then imagine having the following data:
{code:java}
+------------+-------+--------------------+
| ts         | event | raw                |
+------------+-------+--------------------+
| 2019-10-12 | click | {"field":"value"}  |
+------------+-------+--------------------+
{code}
SELECT raw.field FROM data

will return "value"

or the following data
{code:java}
+------------+-------+----------------------+
| ts         | event | raw                  |
+------------+-------+----------------------+
| 2019-10-12 | click | field1=v1|field2=v2  |
+------------+-------+----------------------+
{code}
SELECT raw.field1 FROM data

will return v1.
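
To make the proposal concrete, here is a minimal sketch in plain Python (no Spark involved) of how column metadata with `format` and `options` fields could drive on-the-fly parsing for the two example tables above. Everything in it is an assumption made for illustration: the format name `kv`, the option keys `pairDelimiter` and `keyValueDelimiter`, and the helpers `parse_semi_structured` and `select_field` do not come from the issue and are not Spark APIs.

```python
import json

# Hypothetical column metadata mirroring the two proposed fields.
# The "kv" format name and its option keys are illustrative assumptions.
JSON_META = {"format": "json", "options": {}}
KV_META = {
    "format": "kv",
    "options": {"pairDelimiter": "|", "keyValueDelimiter": "="},
}


def parse_semi_structured(raw, metadata):
    """Parse one raw cell into a dict, as directed by the column metadata."""
    fmt = metadata["format"]
    options = metadata.get("options", {})
    if fmt == "json":
        return json.loads(raw)
    if fmt == "kv":
        pair_delim = options.get("pairDelimiter", "|")
        kv_delim = options.get("keyValueDelimiter", "=")
        parsed = {}
        for item in raw.split(pair_delim):
            # partition() tolerates items that lack the key/value delimiter
            key, _, value = item.partition(kv_delim)
            if key:
                parsed[key] = value
        return parsed
    raise ValueError(f"unsupported format: {fmt}")


def select_field(rows, column, field, metadata):
    """Emulate `SELECT column.field FROM rows` over a list of dict rows."""
    return [
        parse_semi_structured(row[column], metadata).get(field)
        for row in rows
    ]


json_rows = [{"ts": "2019-10-12", "event": "click", "raw": '{"field":"value"}'}]
kv_rows = [{"ts": "2019-10-12", "event": "click", "raw": "field1=v1|field2=v2"}]

json_result = select_field(json_rows, "raw", "field", JSON_META)
kv_result = select_field(kv_rows, "raw", "field1", KV_META)
```

In this sketch the query layer never needs to know how `raw` is encoded; it only consults the metadata attached to the column, which is the behaviour the description asks Spark to provide.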