Julien Genini created SPARK-10869:
-------------------------------------

             Summary: Auto-normalization of semi-structured schema from a dataframe
                 Key: SPARK-10869
                 URL: https://issues.apache.org/jira/browse/SPARK-10869
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
    Affects Versions: 1.5.1
            Reporter: Julien Genini
            Priority: Minor

Today, a dataframe read from semi-structured data (XML, JSON, etc.) gives you a nested, multi-level schema. That is not easy to work with in data warehousing, where it is better to normalize the data.

I propose adding an option (linear, default False) to the call that returns the schema. When it is set, the result contains the full path of each field (pathName) and the list of the different node levels (nodes).

    df = sqlContext.read.json(jsonPath)
    jsonLinearSchema = df.schema.jsonValue(linear=True)

    >>
    {'fields': [{'metadata': {},
                 'name': 'BusinessDate',
                 'nullable': True,
                 'pathName': 'SiteXML.BusinessDate',
                 'type': 'string'},
                {'metadata': {},
                 'name': 'Id_Group',
                 'nullable': True,
                 'pathName': 'SiteXML.Site_List.Site.Id_Group',
                 'type': 'string'},
                {'metadata': {},
                 'name': 'Id_Site',
                 'nullable': True,
                 'pathName': 'SiteXML.Site_List.Site.Id_Site',
                 'type': 'string'},
                {'metadata': {},
                 'name': 'label',
                 'nullable': True,
                 'pathName': 'SiteXML.Site_List.Site.label',
                 'type': 'string'},
                {'metadata': {},
                 'name': 'label_group',
                 'nullable': True,
                 'pathName': 'SiteXML.Site_List.Site.label_group',
                 'type': 'string'},
                {'metadata': {},
                 'name': 'TimeStamp',
                 'nullable': True,
                 'pathName': 'SiteXML.TimeStamp',
                 'type': 'string'}],
     'nodes': [{'name': '', 'nbFields': 3},
               {'name': 'SiteXML', 'nbFields': 1},
               {'name': 'SiteXML.Site_List', 'nbFields': 0},
               {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}
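
For illustration only, here is a minimal sketch of how such a linear view could be computed outside of Spark, working on the plain dict that df.schema.jsonValue() already returns. The function name linearize_schema is hypothetical and not part of PySpark. Treating arrays of structs as nested nodes is an assumption, and so is the meaning of nbFields (here: the number of leaf fields directly under a node), so the counts will not necessarily match those in the example above.

```python
def linearize_schema(schema):
    """Flatten a nested schema dict (the shape returned by StructType.jsonValue()).

    Hypothetical helper, not a PySpark API. Returns a dict with:
      'fields': leaf fields, each with an added 'pathName' (dotted full path)
      'nodes':  every struct level in pre-order, with an assumed 'nbFields' count
    """
    fields, nodes = [], []

    def walk(struct, path):
        node = {"name": path, "nbFields": 0}
        nodes.append(node)
        for field in struct["fields"]:
            full_path = f"{path}.{field['name']}" if path else field["name"]
            inner = field["type"]
            # Assumption: look through arrays so an array of structs becomes a node.
            while isinstance(inner, dict) and inner.get("type") == "array":
                inner = inner["elementType"]
            if isinstance(inner, dict) and inner.get("type") == "struct":
                walk(inner, full_path)
            else:
                # Leaf field: keep its original type and record the full path.
                node["nbFields"] += 1
                fields.append({
                    "metadata": field.get("metadata", {}),
                    "name": field["name"],
                    "nullable": field.get("nullable", True),
                    "pathName": full_path,
                    "type": field["type"],
                })

    walk(schema, "")
    return {"fields": fields, "nodes": nodes}


def _leaf(name):
    # Small helper to build a nullable string field for the sample below.
    return {"name": name, "type": "string", "nullable": True, "metadata": {}}


# Hand-written sample schema (made up for this sketch, not real Spark output).
sample_schema = {
    "type": "struct",
    "fields": [{
        "name": "SiteXML", "nullable": True, "metadata": {},
        "type": {"type": "struct", "fields": [
            _leaf("BusinessDate"),
            {"name": "Site_List", "nullable": True, "metadata": {},
             "type": {"type": "struct", "fields": [
                 {"name": "Site", "nullable": True, "metadata": {},
                  "type": {"type": "array", "containsNull": True,
                           "elementType": {"type": "struct",
                                           "fields": [_leaf("Id_Site")]}}},
             ]}},
        ]},
    }],
}

linear = linearize_schema(sample_schema)
for f in linear["fields"]:
    print(f["pathName"], f["type"])
```

With a real dataframe the call would look like linearize_schema(df.schema.jsonValue()). A built-in linear=True option, as proposed, would avoid the need for a helper like this in user code.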