[jira] [Updated] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe
[ https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Genini updated SPARK-10869:
----------------------------------
    Target Version/s:   (was: 1.5.1)

> Auto-normalization of semi-structured schema from a dataframe
> -------------------------------------------------------------
>
>                 Key: SPARK-10869
>                 URL: https://issues.apache.org/jira/browse/SPARK-10869
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 1.5.1
>            Reporter: Julien Genini
>            Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Today, you can get a multi-depth schema from a semi-structured dataframe
> (XML, JSON, etc.).
> This is not so easy to deal with in data warehousing, where it's better to
> normalize the data.
> I propose adding an option when you get the schema (normalized, default False).
> The returned JSON schema will then contain the normalized path for each
> field, and the list of the different node levels.
> df = sqlContext.read.json(jsonPath)
> jsonLinearSchema = df.schema.jsonValue(normalized=True)
> >>
> {code}
> {'fields': [{'fullPathName': 'SiteXML.BusinessDate', 'metadata': {}, 'name': 'BusinessDate', 'nullable': True, 'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.Id_Group', 'metadata': {}, 'name': 'Id_Group', 'nullable': True, 'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.Id_Site', 'metadata': {}, 'name': 'Id_Site', 'nullable': True, 'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.libelle', 'metadata': {}, 'name': 'libelle', 'nullable': True, 'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.libelle_Group', 'metadata': {}, 'name': 'libelle_Group', 'nullable': True, 'type': 'string'},
>             {'fullPathName': 'SiteXML.TimeStamp', 'metadata': {}, 'name': 'TimeStamp', 'nullable': True, 'type': 'string'}],
>  'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate', 'SiteXML.TimeStamp'],
>             'fullPathName': 'SiteXML',
>             'nbFields': 2},
>            {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
>                                    'SiteXML.Site_List.Site.Id_Site',
>                                    'SiteXML.Site_List.Site.libelle',
>                                    'SiteXML.Site_List.Site.libelle_Group'],
>             'fullPathName': 'SiteXML.Site_List.Site',
>             'nbFields': 4}]}
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
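For readers who want to experiment with the idea before anything lands in Spark, the `fullPathName` layout quoted in this issue can be approximated in plain Python. The sketch below is only an illustration, not the reporter's implementation: `normalize_schema` and `sample_schema` are hypothetical names, the `normalized=True` argument proposed in the ticket is not used, and the input is assumed to be a dict shaped like the output of the existing no-argument `df.schema.jsonValue()` call (nested `struct` and `array` types). Map types are treated as leaves, which is an assumption of this sketch.

```python
def normalize_schema(schema):
    """Flatten a nested schema dict into leaf fields plus per-node summaries.

    Hypothetical helper: `schema` is assumed to look like the dict returned
    by StructType.jsonValue(), i.e. {'type': 'struct', 'fields': [...]}.
    """
    fields = []
    leaves_by_node = {}  # insertion-ordered: node path -> list of leaf paths

    def walk(struct, path):
        leaves = leaves_by_node.setdefault(path, [])
        for field in struct["fields"]:
            full_path = f"{path}.{field['name']}" if path else field["name"]
            inner = field["type"]
            # Look through arrays so array<struct<...>> is walked like a struct.
            while isinstance(inner, dict) and inner.get("type") == "array":
                inner = inner["elementType"]
            if isinstance(inner, dict) and inner.get("type") == "struct":
                walk(inner, full_path)
            else:
                # Leaf field: keep its original type and record its full path.
                fields.append({
                    "fullPathName": full_path,
                    "metadata": field.get("metadata", {}),
                    "name": field["name"],
                    "nullable": field.get("nullable", True),
                    "type": field["type"],
                })
                leaves.append(full_path)

    walk(schema, "")
    # Only nodes that directly own at least one leaf field are reported.
    nodes = [
        {"fieldsFullPathName": leaves, "fullPathName": path, "nbFields": len(leaves)}
        for path, leaves in leaves_by_node.items()
        if leaves
    ]
    return {"fields": fields, "nodes": nodes}


def _leaf(name):
    # Small helper for building the illustrative sample below.
    return {"name": name, "type": "string", "nullable": True, "metadata": {}}


# Made-up sample loosely modelled on the paths shown in the issue.
sample_schema = {
    "type": "struct",
    "fields": [{
        "name": "SiteXML", "nullable": True, "metadata": {},
        "type": {"type": "struct", "fields": [
            _leaf("BusinessDate"),
            {"name": "Site_List", "nullable": True, "metadata": {},
             "type": {"type": "struct", "fields": [
                 {"name": "Site", "nullable": True, "metadata": {},
                  "type": {"type": "array", "containsNull": True,
                           "elementType": {"type": "struct", "fields": [
                               _leaf("Id_Group"), _leaf("Id_Site")]}}}]}},
            _leaf("TimeStamp"),
        ]},
    }],
}

result = normalize_schema(sample_schema)
for node in result["nodes"]:
    print(node["fullPathName"], node["nbFields"])
```

In a Spark session, the same function could presumably be fed `df.schema.jsonValue()` directly; the resulting `fullPathName` values are dotted paths of the kind Spark accepts for selecting nested struct fields.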
[jira] [Created] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe
Julien Genini created SPARK-10869:
-------------------------------------

             Summary: Auto-normalization of semi-structured schema from a dataframe
                 Key: SPARK-10869
                 URL: https://issues.apache.org/jira/browse/SPARK-10869
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
    Affects Versions: 1.5.1
            Reporter: Julien Genini
            Priority: Minor


Today, you can get a multi-depth schema from a semi-structured dataframe (XML, JSON, etc.).
This is not so easy to deal with in data warehousing, where it's better to normalize the data.

I propose adding an option when you get the schema (linear, default False), with the path for each field and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(linear=True)
>>
{'fields': [{'metadata': {}, 'name': 'BusinessDate', 'nullable': True, 'pathName': 'SiteXML.BusinessDate', 'type': 'string'},
            {'metadata': {}, 'name': 'Id_Group', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.Id_Group', 'type': 'string'},
            {'metadata': {}, 'name': 'Id_Site', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.Id_Site', 'type': 'string'},
            {'metadata': {}, 'name': 'label', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.label', 'type': 'string'},
            {'metadata': {}, 'name': 'label_group', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.label_group', 'type': 'string'},
            {'metadata': {}, 'name': 'TimeStamp', 'nullable': True, 'pathName': 'SiteXML.TimeStamp', 'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
           {'name': 'SiteXML', 'nbFields': 1},
           {'name': 'SiteXML.Site_List', 'nbFields': 0},
           {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}
[jira] [Updated] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe
[ https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Genini updated SPARK-10869:
----------------------------------
    Description: 
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, JSON, etc.).
This is not so easy to deal with in data warehousing, where it's better to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field, and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)
>>
{'fields': [{'metadata': {}, 'name': 'BusinessDate', 'nullable': True, 'pathName': 'SiteXML.BusinessDate', 'type': 'string'},
            {'metadata': {}, 'name': 'Id_Group', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.Id_Group', 'type': 'string'},
            {'metadata': {}, 'name': 'Id_Site', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.Id_Site', 'type': 'string'},
            {'metadata': {}, 'name': 'label', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.label', 'type': 'string'},
            {'metadata': {}, 'name': 'label_group', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.label_group', 'type': 'string'},
            {'metadata': {}, 'name': 'TimeStamp', 'nullable': True, 'pathName': 'SiteXML.TimeStamp', 'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
           {'name': 'SiteXML', 'nbFields': 1},
           {'name': 'SiteXML.Site_List', 'nbFields': 0},
           {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}

  was:
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, JSON, etc.).
This is not so easy to deal with in data warehousing, where it's better to normalize the data.

I propose adding an option when you get the schema (linear, default False), with the path for each field and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(linear=True)
>>
{'fields': [{'metadata': {}, 'name': 'BusinessDate', 'nullable': True, 'pathName': 'SiteXML.BusinessDate', 'type': 'string'},
            {'metadata': {}, 'name': 'Id_Group', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.Id_Group', 'type': 'string'},
            {'metadata': {}, 'name': 'Id_Site', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.Id_Site', 'type': 'string'},
            {'metadata': {}, 'name': 'label', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.label', 'type': 'string'},
            {'metadata': {}, 'name': 'label_group', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.label_group', 'type': 'string'},
            {'metadata': {}, 'name': 'TimeStamp', 'nullable': True, 'pathName': 'SiteXML.TimeStamp', 'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
           {'name': 'SiteXML', 'nbFields': 1},
           {'name': 'SiteXML.Site_List', 'nbFields': 0},
           {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}


> Auto-normalization of semi-structured schema from a dataframe
> -------------------------------------------------------------
>
>                 Key: SPARK-10869
>                 URL: https://issues.apache.org/jira/browse/SPARK-10869
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 1.5.1
>            Reporter: Julien Genini
>            Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Today, you can get a multi-depth schema from a semi-structured dataframe
> (XML, JSON, etc.).
> This is not so easy to deal with in data warehousing, where it's better to
> normalize the data.
> I propose adding an option when you get the schema (normalized, default False).
> The returned JSON schema will then contain the normalized path for each
> field, and the list of the different node levels.
> df = sqlContext.read.json(jsonPath)
> jsonLinearSchema = df.schema.jsonValue(normalized=True)
> >>
> {'fields': [{'metadata': {},
>              'name': 'BusinessDate',
>              'nullable': True,
>              'pathName':
[jira] [Updated] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe
[ https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Genini updated SPARK-10869:
----------------------------------
    Description: 
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, JSON, etc.).
This is not so easy to deal with in data warehousing, where it's better to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field, and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)
>>
{code:json}
{'fields': [{'fullPathName': 'SiteXML.BusinessDate', 'metadata': {}, 'name': 'BusinessDate', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.Id_Group', 'metadata': {}, 'name': 'Id_Group', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.Id_Site', 'metadata': {}, 'name': 'Id_Site', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.libelle', 'metadata': {}, 'name': 'libelle', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.libelle_Group', 'metadata': {}, 'name': 'libelle_Group', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.TimeStamp', 'metadata': {}, 'name': 'TimeStamp', 'nullable': True, 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate', 'SiteXML.TimeStamp'],
            'fullPathName': 'SiteXML',
            'nbFields': 2},
           {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
                                   'SiteXML.Site_List.Site.Id_Site',
                                   'SiteXML.Site_List.Site.libelle',
                                   'SiteXML.Site_List.Site.libelle_Group'],
            'fullPathName': 'SiteXML.Site_List.Site',
            'nbFields': 4}]}
{code}

  was:
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, JSON, etc.).
This is not so easy to deal with in data warehousing, where it's better to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field, and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)
>>
{'fields': [{'fullPathName': 'SiteXML.BusinessDate', 'metadata': {}, 'name': 'BusinessDate', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.Id_Group', 'metadata': {}, 'name': 'Id_Group', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.Id_Site', 'metadata': {}, 'name': 'Id_Site', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.libelle', 'metadata': {}, 'name': 'libelle', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.libelle_Group', 'metadata': {}, 'name': 'libelle_Group', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.TimeStamp', 'metadata': {}, 'name': 'TimeStamp', 'nullable': True, 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate', 'SiteXML.TimeStamp'],
            'fullPathName': 'SiteXML',
            'nbFields': 2},
           {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
                                   'SiteXML.Site_List.Site.Id_Site',
                                   'SiteXML.Site_List.Site.libelle',
                                   'SiteXML.Site_List.Site.libelle_Group'],
            'fullPathName': 'SiteXML.Site_List.Site',
            'nbFields': 4}]}


> Auto-normalization of semi-structured schema from a dataframe
> -------------------------------------------------------------
>
>                 Key: SPARK-10869
>                 URL: https://issues.apache.org/jira/browse/SPARK-10869
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 1.5.1
>            Reporter: Julien Genini
>
[jira] [Updated] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe
[ https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Genini updated SPARK-10869:
----------------------------------
    Description: 
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, JSON, etc.).
This is not so easy to deal with in data warehousing, where it's better to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field, and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)
>>
{code}
{'fields': [{'fullPathName': 'SiteXML.BusinessDate', 'metadata': {}, 'name': 'BusinessDate', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.Id_Group', 'metadata': {}, 'name': 'Id_Group', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.Id_Site', 'metadata': {}, 'name': 'Id_Site', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.libelle', 'metadata': {}, 'name': 'libelle', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.libelle_Group', 'metadata': {}, 'name': 'libelle_Group', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.TimeStamp', 'metadata': {}, 'name': 'TimeStamp', 'nullable': True, 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate', 'SiteXML.TimeStamp'],
            'fullPathName': 'SiteXML',
            'nbFields': 2},
           {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
                                   'SiteXML.Site_List.Site.Id_Site',
                                   'SiteXML.Site_List.Site.libelle',
                                   'SiteXML.Site_List.Site.libelle_Group'],
            'fullPathName': 'SiteXML.Site_List.Site',
            'nbFields': 4}]}
{code}

  was:
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, JSON, etc.).
This is not so easy to deal with in data warehousing, where it's better to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field, and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)
>>
{code:json}
{'fields': [{'fullPathName': 'SiteXML.BusinessDate', 'metadata': {}, 'name': 'BusinessDate', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.Id_Group', 'metadata': {}, 'name': 'Id_Group', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.Id_Site', 'metadata': {}, 'name': 'Id_Site', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.libelle', 'metadata': {}, 'name': 'libelle', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.libelle_Group', 'metadata': {}, 'name': 'libelle_Group', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.TimeStamp', 'metadata': {}, 'name': 'TimeStamp', 'nullable': True, 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate', 'SiteXML.TimeStamp'],
            'fullPathName': 'SiteXML',
            'nbFields': 2},
           {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
                                   'SiteXML.Site_List.Site.Id_Site',
                                   'SiteXML.Site_List.Site.libelle',
                                   'SiteXML.Site_List.Site.libelle_Group'],
            'fullPathName': 'SiteXML.Site_List.Site',
            'nbFields': 4}]}
{code}


> Auto-normalization of semi-structured schema from a dataframe
> -------------------------------------------------------------
>
>                 Key: SPARK-10869
>                 URL: https://issues.apache.org/jira/browse/SPARK-10869
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 1.5.1
>            Reporter:
[jira] [Updated] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe
[ https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Genini updated SPARK-10869:
----------------------------------
    Description: 
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, JSON, etc.).
This is not so easy to deal with in data warehousing, where it's better to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field, and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)
>>
{'fields': [{'fullPathName': 'SiteXML.BusinessDate', 'metadata': {}, 'name': 'BusinessDate', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.Id_Group', 'metadata': {}, 'name': 'Id_Group', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.Id_Site', 'metadata': {}, 'name': 'Id_Site', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.libelle', 'metadata': {}, 'name': 'libelle', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.Site_List.Site.libelle_Group', 'metadata': {}, 'name': 'libelle_Group', 'nullable': True, 'type': 'string'},
            {'fullPathName': 'SiteXML.TimeStamp', 'metadata': {}, 'name': 'TimeStamp', 'nullable': True, 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate', 'SiteXML.TimeStamp'],
            'fullPathName': 'SiteXML',
            'nbFields': 2},
           {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
                                   'SiteXML.Site_List.Site.Id_Site',
                                   'SiteXML.Site_List.Site.libelle',
                                   'SiteXML.Site_List.Site.libelle_Group'],
            'fullPathName': 'SiteXML.Site_List.Site',
            'nbFields': 4}]}

  was:
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, JSON, etc.).
This is not so easy to deal with in data warehousing, where it's better to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field, and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)
>>
{'fields': [{'metadata': {}, 'name': 'BusinessDate', 'nullable': True, 'pathName': 'SiteXML.BusinessDate', 'type': 'string'},
            {'metadata': {}, 'name': 'Id_Group', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.Id_Group', 'type': 'string'},
            {'metadata': {}, 'name': 'Id_Site', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.Id_Site', 'type': 'string'},
            {'metadata': {}, 'name': 'label', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.label', 'type': 'string'},
            {'metadata': {}, 'name': 'label_group', 'nullable': True, 'pathName': 'SiteXML.Site_List.Site.label_group', 'type': 'string'},
            {'metadata': {}, 'name': 'TimeStamp', 'nullable': True, 'pathName': 'SiteXML.TimeStamp', 'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
           {'name': 'SiteXML', 'nbFields': 1},
           {'name': 'SiteXML.Site_List', 'nbFields': 0},
           {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}


> Auto-normalization of semi-structured schema from a dataframe
> -------------------------------------------------------------
>
>                 Key: SPARK-10869
>                 URL: https://issues.apache.org/jira/browse/SPARK-10869
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 1.5.1
>            Reporter: Julien Genini
>            Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Today, you can get a multi-depth schema from a semi-structured dataframe
> (XML, JSON, etc.).
> This is not so easy to deal with in data warehousing, where it's better to
> normalize the data.
> I propose adding an option when you get the schema (normalized, default False).
> Then the returned