[ https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Genini updated SPARK-10869:
----------------------------------
    Description:
Today, you can get a multi-depth schema from a semi-structured DataFrame (XML, JSON, etc.).
That is not easy to work with in data warehousing, where it is better to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field, and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)
>>
{code}
{'fields': [{'fullPathName': 'SiteXML.BusinessDate', 'metadata': {}, 'name': 'BusinessDate', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.Id_Group', 'metadata': {}, 'name': 'Id_Group', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.Id_Site', 'metadata': {}, 'name': 'Id_Site', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.libelle', 'metadata': {}, 'name': 'libelle', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.libelle_Group', 'metadata': {}, 'name': 'libelle_Group', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.TimeStamp', 'metadata': {}, 'name': 'TimeStamp', 'nullable': True, 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate', 'SiteXML.TimeStamp'],
            'fullPathName': 'SiteXML',
            'nbFields': 2},
           {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
                                   'SiteXML.Site_List.Site.Id_Site',
                                   'SiteXML.Site_List.Site.libelle',
                                   'SiteXML.Site_List.Site.libelle_Group'],
            'fullPathName': 'SiteXML.Site_List.Site',
            'nbFields': 4}]}
{code}

  was:
Today, you can get a multi-depth schema from a semi-structured DataFrame (XML, JSON, etc.).
That is not easy to work with in data warehousing, where it is better to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field, and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)
>>
{code:json}
{'fields': [{'fullPathName': 'SiteXML.BusinessDate', 'metadata': {}, 'name': 'BusinessDate', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.Id_Group', 'metadata': {}, 'name': 'Id_Group', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.Id_Site', 'metadata': {}, 'name': 'Id_Site', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.libelle', 'metadata': {}, 'name': 'libelle', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.libelle_Group', 'metadata': {}, 'name': 'libelle_Group', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.TimeStamp', 'metadata': {}, 'name': 'TimeStamp', 'nullable': True, 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate', 'SiteXML.TimeStamp'],
            'fullPathName': 'SiteXML',
            'nbFields': 2},
           {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
                                   'SiteXML.Site_List.Site.Id_Site',
                                   'SiteXML.Site_List.Site.libelle',
                                   'SiteXML.Site_List.Site.libelle_Group'],
            'fullPathName': 'SiteXML.Site_List.Site',
            'nbFields': 4}]}
{code}


> Auto-normalization of semi-structured schema from a dataframe
> -------------------------------------------------------------
>
>                 Key: SPARK-10869
>                 URL: https://issues.apache.org/jira/browse/SPARK-10869
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 1.5.1
>            Reporter: Julien Genini
>            Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Today, you can get a multi-depth schema from a semi-structured DataFrame (XML, JSON, etc.).
> That is not easy to work with in data warehousing, where it is better to normalize the data.
>
> I propose adding an option when you get the schema (normalized, default False).
> The returned JSON schema will then contain the normalized path for each field, and the list of the different node levels.
>
> df = sqlContext.read.json(jsonPath)
> jsonLinearSchema = df.schema.jsonValue(normalized=True)
> >>
> {code}
> {'fields': [{'fullPathName': 'SiteXML.BusinessDate',
>              'metadata': {},
>              'name': 'BusinessDate',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.Id_Group',
>              'metadata': {},
>              'name': 'Id_Group',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.Id_Site',
>              'metadata': {},
>              'name': 'Id_Site',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.libelle',
>              'metadata': {},
>              'name': 'libelle',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.libelle_Group',
>              'metadata': {},
>              'name': 'libelle_Group',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.TimeStamp',
>              'metadata': {},
>              'name': 'TimeStamp',
>              'nullable': True,
>              'type': 'string'}],
>  'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate',
>                                    'SiteXML.TimeStamp'],
>             'fullPathName': 'SiteXML',
>             'nbFields': 2},
>            {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
>                                    'SiteXML.Site_List.Site.Id_Site',
>                                    'SiteXML.Site_List.Site.libelle',
>                                    'SiteXML.Site_List.Site.libelle_Group'],
>             'fullPathName': 'SiteXML.Site_List.Site',
>             'nbFields': 4}]}
> {code}
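
For illustration only, the normalization described in this issue can be sketched in plain Python, outside of Spark. The `normalized` argument to `jsonValue()` is the proposal itself and is not an existing PySpark parameter, so the sketch uses a hypothetical helper named `normalize_schema` that works on the nested dict that `df.schema.jsonValue()` already returns. The sample schema below is hand-written to resemble the SiteXML example, and it assumes that `Site` is an array of structs. Array levels are assumed to add no path segment of their own.

```python
def normalize_schema(schema):
    """Flatten a nested schema dict into leaf fields with full paths.

    `schema` is assumed to have the shape returned by StructType.jsonValue():
    {'type': 'struct', 'fields': [{'name', 'type', 'nullable', 'metadata'}, ...]}
    This helper is hypothetical and is not part of PySpark.
    """
    fields = []
    nodes = {}  # parent path -> full paths of the leaf fields directly under it

    def visit(struct, parent_path):
        for field in struct["fields"]:
            name = field["name"]
            full_path = f"{parent_path}.{name}" if parent_path else name
            # Look through array wrappers to find the element type.
            inner = field["type"]
            while isinstance(inner, dict) and inner.get("type") == "array":
                inner = inner["elementType"]
            if isinstance(inner, dict) and inner.get("type") == "struct":
                # Nested struct: recurse, extending the path.
                visit(inner, full_path)
            else:
                # Leaf field: record it and attach it to its parent node.
                fields.append({
                    "fullPathName": full_path,
                    "metadata": field.get("metadata", {}),
                    "name": name,
                    "nullable": field["nullable"],
                    "type": field["type"],
                })
                nodes.setdefault(parent_path, []).append(full_path)

    visit(schema, "")
    return {
        "fields": fields,
        "nodes": [
            {"fieldsFullPathName": paths, "fullPathName": node, "nbFields": len(paths)}
            for node, paths in nodes.items()
        ],
    }


def field(name, data_type="string"):
    # Small builder for the hand-written sample schema below.
    return {"name": name, "type": data_type, "nullable": True, "metadata": {}}


def struct(*members):
    return {"type": "struct", "fields": list(members)}


# Assumed input: a schema shaped like the SiteXML example in this issue.
site = struct(field("Id_Group"), field("Id_Site"), field("libelle"), field("libelle_Group"))
sample_schema = struct(
    field("SiteXML", struct(
        field("BusinessDate"),
        field("Site_List", struct(
            field("Site", {"type": "array", "elementType": site, "containsNull": True}),
        )),
        field("TimeStamp"),
    )),
)

normalized = normalize_schema(sample_schema)
for node in normalized["nodes"]:
    print(node["fullPathName"], node["nbFields"])
```

With a real DataFrame, the same helper could be called as `normalize_schema(df.schema.jsonValue())`. Grouping leaf fields by parent path is one possible reading of "node levels"; map types are treated as leaves here.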