[ https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Genini updated SPARK-10869:
----------------------------------
    Target Version/s:   (was: 1.5.1)

> Auto-normalization of semi-structured schema from a dataframe
> -------------------------------------------------------------
>
>                 Key: SPARK-10869
>                 URL: https://issues.apache.org/jira/browse/SPARK-10869
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 1.5.1
>            Reporter: Julien Genini
>            Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Today, you can get a multi-depth schema from a semi-structured dataframe
> (XML, JSON, etc.).
> This is not easy to work with in data warehousing, where it is better to
> normalize the data.
> I propose adding an option when you get the schema (normalized, default False).
> The returned JSON schema will then contain the normalized path for each
> field, and the list of the different node levels.
> df = sqlContext.read.json(jsonPath)
> jsonLinearSchema = df.schema.jsonValue(normalized=True)
> >>
> {code}
> {'fields': [{'fullPathName': 'SiteXML.BusinessDate',
>              'metadata': {},
>              'name': 'BusinessDate',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.Id_Group',
>              'metadata': {},
>              'name': 'Id_Group',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.Id_Site',
>              'metadata': {},
>              'name': 'Id_Site',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.libelle',
>              'metadata': {},
>              'name': 'libelle',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.libelle_Group',
>              'metadata': {},
>              'name': 'libelle_Group',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.TimeStamp',
>              'metadata': {},
>              'name': 'TimeStamp',
>              'nullable': True,
>              'type': 'string'}],
>  'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate',
>                                    'SiteXML.TimeStamp'],
>             'fullPathName': 'SiteXML',
>             'nbFields': 2},
>            {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
>                                    'SiteXML.Site_List.Site.Id_Site',
>                                    'SiteXML.Site_List.Site.libelle',
>                                    'SiteXML.Site_List.Site.libelle_Group'],
>             'fullPathName': 'SiteXML.Site_List.Site',
>             'nbFields': 4}]}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
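
For illustration only, here is a minimal sketch of how the proposed normalization could be computed outside of Spark. The `normalized=True` argument is the reporter's proposal and is not assumed to exist in PySpark; the helper name `normalize_schema` is hypothetical. The input is assumed to be a plain dict shaped like the output of `df.schema.jsonValue()` (a `struct` with a `fields` list, where nested types are dicts). The example schema below is a guess at the nesting behind the quoted output, including the assumption that `Site` is an array of structs.

```python
def normalize_schema(schema_json):
    """Flatten a nested schema dict into leaf fields plus per-node groupings.

    Hypothetical helper: mirrors the 'fields' / 'nodes' layout proposed in
    the issue, but is not part of any Spark API.
    """
    fields = []
    nodes = {}  # parent path -> full paths of its leaf fields, in schema order

    def walk(struct_type, path):
        for field in struct_type["fields"]:
            child_path = path + [field["name"]]
            inner = field["type"]
            # Look through arrays so that array<struct<...>> is also flattened.
            while isinstance(inner, dict) and inner.get("type") == "array":
                inner = inner["elementType"]
            if isinstance(inner, dict) and inner.get("type") == "struct":
                walk(inner, child_path)
            else:
                full_path = ".".join(child_path)
                fields.append({
                    "fullPathName": full_path,
                    "metadata": field.get("metadata", {}),
                    "name": field["name"],
                    "nullable": field.get("nullable", True),
                    "type": field["type"],  # leaf type kept as declared
                })
                nodes.setdefault(".".join(path), []).append(full_path)

    walk(schema_json, [])
    return {
        "fields": fields,
        "nodes": [
            {"fieldsFullPathName": paths, "fullPathName": parent, "nbFields": len(paths)}
            for parent, paths in nodes.items()
        ],
    }


def _leaf(name):
    # Small constructor for a nullable string field (illustrative only).
    return {"name": name, "type": "string", "nullable": True, "metadata": {}}


# Assumed nesting for the example in the issue; the real schema is not shown there.
site_struct = {"type": "struct", "fields": [_leaf("Id_Group"), _leaf("Id_Site")]}
example_schema = {"type": "struct", "fields": [{
    "name": "SiteXML", "nullable": True, "metadata": {},
    "type": {"type": "struct", "fields": [
        _leaf("BusinessDate"),
        {"name": "Site_List", "nullable": True, "metadata": {},
         "type": {"type": "struct", "fields": [{
             "name": "Site", "nullable": True, "metadata": {},
             "type": {"type": "array", "elementType": site_struct,
                      "containsNull": True}}]}},
        _leaf("TimeStamp"),
    ]},
}]}

result = normalize_schema(example_schema)
```

In this sketch a node is any path that directly owns at least one leaf field, which is why `SiteXML.Site_List` does not appear in `nodes`, consistent with the quoted output. With a real DataFrame, the dict passed in would be `df.schema.jsonValue()`.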