[ https://issues.apache.org/jira/browse/SPARK-30334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-30334:
------------------------------------

    Assignee:     (was: Apache Spark)

> Add metadata around semi-structured columns to Spark
> ----------------------------------------------------
>
>                 Key: SPARK-30334
>                 URL: https://issues.apache.org/jira/browse/SPARK-30334
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.4.4
>            Reporter: Burak Yavuz
>            Priority: Major
>
> Semi-structured data is used widely in the data industry for reporting events
> in a wide variety of formats. Click events in product analytics can be stored
> as json. Some application logs can be in the form of delimited key=value
> text. Some data may be in xml.
> The goal of this project is to be able to signal Spark that such a column
> exists. This will then enable Spark to "auto-parse" these columns on the fly.
> The proposal is to store this information as part of the column metadata, in
> the fields:
>  - format: The format of the semi-structured column, e.g. json, xml, avro
>  - options: Options for parsing these columns
> Then imagine having the following data:
> {code:java}
> +------------+-------+--------------------+
> | ts         | event | raw                |
> +------------+-------+--------------------+
> | 2019-10-12 | click | {"field":"value"}  |
> +------------+-------+--------------------+ {code}
> SELECT raw.field FROM data
> will return "value"
> or the following data
> {code:java}
> +------------+-------+----------------------+
> | ts         | event | raw                  |
> +------------+-------+----------------------+
> | 2019-10-12 | click | field1=v1|field2=v2  |
> +------------+-------+----------------------+ {code}
> SELECT raw.field1 FROM data
> will return v1.
>
> As a first step, we will introduce the function "as_json", which accomplishes
> this for JSON columns.
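
The idea described in the issue (column metadata carrying a format and parsing options, then resolving raw.field by parsing the column on the fly) can be sketched in plain Python. This is only an illustration of the concept, not Spark code and not the proposed implementation: the names resolve, PARSERS, parse_kv, the format name "kv", and the option keys pair_delimiter and kv_delimiter are all hypothetical and do not come from the issue.

```python
import json

# Hypothetical parsers keyed by the "format" metadata field.
# "json" is named in the issue; "kv" is an assumed name for the
# delimited key=value format that the issue only describes.


def parse_json(text, options):
    """Parse a JSON object string into a dict."""
    return json.loads(text)


def parse_kv(text, options):
    """Parse delimited key=value text, e.g. 'field1=v1|field2=v2'."""
    pair_delimiter = options.get("pair_delimiter", "|")  # assumed option name
    kv_delimiter = options.get("kv_delimiter", "=")      # assumed option name
    result = {}
    for pair in text.split(pair_delimiter):
        if not pair:
            continue
        key, _, value = pair.partition(kv_delimiter)
        result[key] = value
    return result


PARSERS = {"json": parse_json, "kv": parse_kv}


def resolve(row, path, metadata):
    """Resolve 'column' or 'column.field' against a single row.

    `metadata` maps column names to {"format": ..., "options": ...},
    mirroring the two fields proposed in the issue. A nested field is
    extracted by parsing the raw column value at lookup time.
    """
    column, _, field = path.partition(".")
    value = row[column]
    if not field:
        return value
    meta = metadata.get(column)
    if meta is None:
        raise KeyError(f"column {column!r} has no semi-structured metadata")
    parsed = PARSERS[meta["format"]](value, meta.get("options", {}))
    return parsed.get(field)


# The two example rows from the issue description.
json_row = {"ts": "2019-10-12", "event": "click", "raw": '{"field":"value"}'}
json_meta = {"raw": {"format": "json", "options": {}}}

kv_row = {"ts": "2019-10-12", "event": "click", "raw": "field1=v1|field2=v2"}
kv_meta = {"raw": {"format": "kv",
                   "options": {"pair_delimiter": "|", "kv_delimiter": "="}}}

print(resolve(json_row, "raw.field", json_meta))
print(resolve(kv_row, "raw.field1", kv_meta))
```

Keeping the format and options alongside the column, rather than in each query, is what lets a lookup such as raw.field1 stay the same regardless of how the underlying text is encoded.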