[jira] [Created] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe

Julien Genini (JIRA) Tue, 29 Sep 2015 06:14:40 -0700

Julien Genini created SPARK-10869:
-------------------------------------

             Summary: Auto-normalization of semi-structured schema from a dataframe
                 Key: SPARK-10869
                 URL: https://issues.apache.org/jira/browse/SPARK-10869
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
    Affects Versions: 1.5.1
            Reporter: Julien Genini
            Priority: Minor



Today, reading semi-structured data (XML, JSON, etc.) into a dataframe gives you 
a nested, multi-level schema.
Such a schema is awkward to work with in data warehousing, where normalized data is preferable.

I propose adding an option (linear, default False) to the schema export. When set, it returns
a flat list of fields, each with its full path, together with the list of the distinct node levels:

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(linear=True)

>>
{'fields': [{'metadata': {},                                                    
             'name': 'BusinessDate',
             'nullable': True,
             'pathName': 'SiteXML.BusinessDate',
             'type': 'string'},
            {'metadata': {},
             'name': 'Id_Group',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.Id_Group',
             'type': 'string'},
            {'metadata': {},
             'name': 'Id_Site',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.Id_Site',
             'type': 'string'},
            {'metadata': {},
             'name': 'label',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.label',
             'type': 'string'},
            {'metadata': {},
             'name': 'label_group',
             'nullable': True,
             'pathName': 'SiteXML.Site_List.Site.label_group',
             'type': 'string'},
            {'metadata': {},
             'name': 'TimeStamp',
             'nullable': True,
             'pathName': 'SiteXML.TimeStamp',
             'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
           {'name': 'SiteXML', 'nbFields': 1},
           {'name': 'SiteXML.Site_List', 'nbFields': 0},
           {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

