[jira] [Updated] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe

2015-09-30 Thread Julien Genini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Genini updated SPARK-10869:
--
Target Version/s:   (was: 1.5.1)

> Auto-normalization of semi-structured schema from a dataframe
> -
>
> Key: SPARK-10869
> URL: https://issues.apache.org/jira/browse/SPARK-10869
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 1.5.1
> Reporter: Julien Genini
> Priority: Minor
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> Today, you can get a multi-depth schema from a semi-structured dataframe 
> (XML, JSON, etc.).
> Such a schema is not easy to work with in data warehousing, where it is 
> better to normalize the data.
> I propose adding an option when you get the schema (normalized, default False).
> The returned JSON schema will then contain the normalized path for each 
> field and the list of the different node levels.
> df = sqlContext.read.json(jsonPath)
> jsonLinearSchema = df.schema.jsonValue(normalized=True)
> >>
> {code}
> {'fields': [{'fullPathName': 'SiteXML.BusinessDate',
>              'metadata': {},
>              'name': 'BusinessDate',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.Id_Group',
>              'metadata': {},
>              'name': 'Id_Group',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.Id_Site',
>              'metadata': {},
>              'name': 'Id_Site',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.libelle',
>              'metadata': {},
>              'name': 'libelle',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.Site_List.Site.libelle_Group',
>              'metadata': {},
>              'name': 'libelle_Group',
>              'nullable': True,
>              'type': 'string'},
>             {'fullPathName': 'SiteXML.TimeStamp',
>              'metadata': {},
>              'name': 'TimeStamp',
>              'nullable': True,
>              'type': 'string'}],
>  'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate',
>                                    'SiteXML.TimeStamp'],
>             'fullPathName': 'SiteXML',
>             'nbFields': 2},
>            {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
>                                    'SiteXML.Site_List.Site.Id_Site',
>                                    'SiteXML.Site_List.Site.libelle',
>                                    'SiteXML.Site_List.Site.libelle_Group'],
>             'fullPathName': 'SiteXML.Site_List.Site',
>             'nbFields': 4}]}
> {code}
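
The `normalized` option above is only a proposal, so the following is a minimal sketch of how such a flattening could be done today in plain Python, not the implementation attached to this issue. It assumes the nested dict layout that `StructType.jsonValue()` already returns (`'type': 'struct'` with a `'fields'` list, and `'type': 'array'` with an `'elementType'`). The names `normalize_schema`, `string_field` and `sample_schema` are hypothetical, and the sample schema (including `Site` being an array of structs) is an assumption made up to mirror the output shown above.

```python
def normalize_schema(schema):
    """Flatten a nested schema dict into leaf fields plus node levels.

    `schema` is assumed to have the shape returned by
    pyspark.sql.types.StructType.jsonValue().
    """
    fields = []
    nodes = []

    def walk(struct, path):
        # Register the node first so that nodes come out in pre-order.
        node = {"fieldsFullPathName": [], "fullPathName": path, "nbFields": 0}
        nodes.append(node)
        for field in struct["fields"]:
            full = path + "." + field["name"] if path else field["name"]
            # Look through arrays to see whether the elements are structs.
            inner = field["type"]
            while isinstance(inner, dict) and inner.get("type") == "array":
                inner = inner["elementType"]
            if isinstance(inner, dict) and inner.get("type") == "struct":
                walk(inner, full)
            else:
                # Leaf field: keep its original type and record the full path.
                fields.append({
                    "fullPathName": full,
                    "metadata": field.get("metadata", {}),
                    "name": field["name"],
                    "nullable": field.get("nullable", True),
                    "type": field["type"],
                })
                node["fieldsFullPathName"].append(full)
        node["nbFields"] = len(node["fieldsFullPathName"])

    walk(schema, "")
    # Keep only the levels that directly hold at least one leaf field.
    return {"fields": fields,
            "nodes": [n for n in nodes if n["nbFields"] > 0]}


def string_field(name):
    # Hypothetical helper: a nullable string field in jsonValue() layout.
    return {"name": name, "type": "string", "nullable": True, "metadata": {}}


# Hypothetical schema, shaped like the SiteXML example in this issue.
sample_schema = {"type": "struct", "fields": [
    {"name": "SiteXML", "nullable": True, "metadata": {},
     "type": {"type": "struct", "fields": [
         string_field("BusinessDate"),
         {"name": "Site_List", "nullable": True, "metadata": {},
          "type": {"type": "struct", "fields": [
              {"name": "Site", "nullable": True, "metadata": {},
               "type": {"type": "array", "containsNull": True,
                        "elementType": {"type": "struct", "fields": [
                            string_field("Id_Group"),
                            string_field("Id_Site"),
                            string_field("libelle"),
                            string_field("libelle_Group")]}}}]}},
         string_field("TimeStamp")]}}]}

normalized = normalize_schema(sample_schema)
```

With a real dataframe the input would be `df.schema.jsonValue()`. Dropping the levels that hold no leaf field directly (the root and `SiteXML.Site_List` here) is a design choice made to match the `nodes` list in the description.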



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe

2015-09-29 Thread Julien Genini (JIRA)
Julien Genini created SPARK-10869:
-

Summary: Auto-normalization of semi-structured schema from a dataframe
Key: SPARK-10869
URL: https://issues.apache.org/jira/browse/SPARK-10869
Project: Spark
Issue Type: New Feature
Components: PySpark
Affects Versions: 1.5.1
Reporter: Julien Genini
Priority: Minor


Today, you can get a multi-depth schema from a semi-structured dataframe (XML, 
JSON, etc.).
Such a schema is not easy to work with in data warehousing, where it is better 
to normalize the data.

I propose adding an option when you get the schema (linear, default False)
that returns the path for each field and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(linear=True)

>>
{'fields': [{'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'pathName': 'SiteXML.BusinessDate',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Group',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Site',
 'type': 'string'},
{'metadata': {},
 'name': 'label',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label',
 'type': 'string'},
{'metadata': {},
 'name': 'label_group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label_group',
 'type': 'string'},
{'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'pathName': 'SiteXML.TimeStamp',
 'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
   {'name': 'SiteXML', 'nbFields': 1},
   {'name': 'SiteXML.Site_List', 'nbFields': 0},
   {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}
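
To illustrate why flat paths help with normalization, here is a small sketch, under stated assumptions, of how a consumer might group the `pathName` values by their parent node, for instance to derive one candidate table per node level. The function name `group_fields_by_node` is hypothetical, and the grouping rule (everything before the last dot is the parent) is an assumption, not something this issue defines.

```python
from collections import OrderedDict


def group_fields_by_node(path_names):
    """Map each parent path to the leaf field names found directly under it."""
    tables = OrderedDict()
    for path in path_names:
        # 'a.b.c' -> parent 'a.b', leaf 'c'; a path without a dot has parent ''.
        parent, _, leaf = path.rpartition(".")
        tables.setdefault(parent, []).append(leaf)
    return tables


# The pathName values from the example output in this message.
path_names = [
    "SiteXML.BusinessDate",
    "SiteXML.Site_List.Site.Id_Group",
    "SiteXML.Site_List.Site.Id_Site",
    "SiteXML.Site_List.Site.label",
    "SiteXML.Site_List.Site.label_group",
    "SiteXML.TimeStamp",
]

tables = group_fields_by_node(path_names)
for node, columns in tables.items():
    print(node, columns)
```

Each key could then become a table name and each list its columns. Handling arrays (one row per element) is left out of this sketch.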









[jira] [Updated] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe

2015-09-29 Thread Julien Genini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Genini updated SPARK-10869:
--
Description: 
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, 
JSON, etc.).
Such a schema is not easy to work with in data warehousing, where it is better 
to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field 
and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)

>>
{'fields': [{'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'pathName': 'SiteXML.BusinessDate',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Group',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Site',
 'type': 'string'},
{'metadata': {},
 'name': 'label',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label',
 'type': 'string'},
{'metadata': {},
 'name': 'label_group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label_group',
 'type': 'string'},
{'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'pathName': 'SiteXML.TimeStamp',
 'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
   {'name': 'SiteXML', 'nbFields': 1},
   {'name': 'SiteXML.Site_List', 'nbFields': 0},
   {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}




  was:
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, 
JSON, etc.).
Such a schema is not easy to work with in data warehousing, where it is better 
to normalize the data.

I propose adding an option when you get the schema (linear, default False)
that returns the path for each field and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(linear=True)

>>
{'fields': [{'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'pathName': 'SiteXML.BusinessDate',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Group',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Site',
 'type': 'string'},
{'metadata': {},
 'name': 'label',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label',
 'type': 'string'},
{'metadata': {},
 'name': 'label_group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label_group',
 'type': 'string'},
{'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'pathName': 'SiteXML.TimeStamp',
 'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
   {'name': 'SiteXML', 'nbFields': 1},
   {'name': 'SiteXML.Site_List', 'nbFields': 0},
   {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}






[jira] [Updated] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe

2015-09-29 Thread Julien Genini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Genini updated SPARK-10869:
--
Description: 
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, 
JSON, etc.).
Such a schema is not easy to work with in data warehousing, where it is better 
to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field 
and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)

>>
{code:json}
{'fields': [{'fullPathName': 'SiteXML.BusinessDate',
 'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Group',
 'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Site',
 'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle',
 'metadata': {},
 'name': 'libelle',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle_Group',
 'metadata': {},
 'name': 'libelle_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.TimeStamp',
 'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate',
   'SiteXML.TimeStamp'],
'fullPathName': 'SiteXML',
'nbFields': 2},
   {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
   'SiteXML.Site_List.Site.Id_Site',
   'SiteXML.Site_List.Site.libelle',
   'SiteXML.Site_List.Site.libelle_Group'],
'fullPathName': 'SiteXML.Site_List.Site',
'nbFields': 4}]}
{code}


  was:
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, 
JSON, etc.).
Such a schema is not easy to work with in data warehousing, where it is better 
to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field 
and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)

>>
{'fields': [{'fullPathName': 'SiteXML.BusinessDate',
 'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Group',
 'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Site',
 'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle',
 'metadata': {},
 'name': 'libelle',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle_Group',
 'metadata': {},
 'name': 'libelle_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.TimeStamp',
 'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate',
   'SiteXML.TimeStamp'],
'fullPathName': 'SiteXML',
'nbFields': 2},
   {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
   'SiteXML.Site_List.Site.Id_Site',
   'SiteXML.Site_List.Site.libelle',
   'SiteXML.Site_List.Site.libelle_Group'],
'fullPathName': 'SiteXML.Site_List.Site',
'nbFields': 4}]}





[jira] [Updated] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe

2015-09-29 Thread Julien Genini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Genini updated SPARK-10869:
--
Description: 
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, 
JSON, etc.).
Such a schema is not easy to work with in data warehousing, where it is better 
to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field 
and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)

>>
{code}
{'fields': [{'fullPathName': 'SiteXML.BusinessDate',
 'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Group',
 'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Site',
 'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle',
 'metadata': {},
 'name': 'libelle',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle_Group',
 'metadata': {},
 'name': 'libelle_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.TimeStamp',
 'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate',
   'SiteXML.TimeStamp'],
'fullPathName': 'SiteXML',
'nbFields': 2},
   {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
   'SiteXML.Site_List.Site.Id_Site',
   'SiteXML.Site_List.Site.libelle',
   'SiteXML.Site_List.Site.libelle_Group'],
'fullPathName': 'SiteXML.Site_List.Site',
'nbFields': 4}]}
{code}


  was:
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, 
JSON, etc.).
Such a schema is not easy to work with in data warehousing, where it is better 
to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field 
and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)

>>
{code:json}
{'fields': [{'fullPathName': 'SiteXML.BusinessDate',
 'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Group',
 'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Site',
 'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle',
 'metadata': {},
 'name': 'libelle',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle_Group',
 'metadata': {},
 'name': 'libelle_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.TimeStamp',
 'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate',
   'SiteXML.TimeStamp'],
'fullPathName': 'SiteXML',
'nbFields': 2},
   {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
   'SiteXML.Site_List.Site.Id_Site',
   'SiteXML.Site_List.Site.libelle',
   'SiteXML.Site_List.Site.libelle_Group'],
'fullPathName': 'SiteXML.Site_List.Site',
'nbFields': 4}]}
{code}




[jira] [Updated] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe

2015-09-29 Thread Julien Genini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Genini updated SPARK-10869:
--
Description: 
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, 
JSON, etc.).
Such a schema is not easy to work with in data warehousing, where it is better 
to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field 
and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)

>>
{'fields': [{'fullPathName': 'SiteXML.BusinessDate',
 'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Group',
 'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Site',
 'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle',
 'metadata': {},
 'name': 'libelle',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle_Group',
 'metadata': {},
 'name': 'libelle_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.TimeStamp',
 'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate',
   'SiteXML.TimeStamp'],
'fullPathName': 'SiteXML',
'nbFields': 2},
   {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
   'SiteXML.Site_List.Site.Id_Site',
   'SiteXML.Site_List.Site.libelle',
   'SiteXML.Site_List.Site.libelle_Group'],
'fullPathName': 'SiteXML.Site_List.Site',
'nbFields': 4}]}



  was:
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, 
JSON, etc.).
Such a schema is not easy to work with in data warehousing, where it is better 
to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field 
and the list of the different node levels.

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)

>>
{'fields': [{'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'pathName': 'SiteXML.BusinessDate',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Group',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Site',
 'type': 'string'},
{'metadata': {},
 'name': 'label',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label',
 'type': 'string'},
{'metadata': {},
 'name': 'label_group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label_group',
 'type': 'string'},
{'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'pathName': 'SiteXML.TimeStamp',
 'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
   {'name': 'SiteXML', 'nbFields': 1},
   {'name': 'SiteXML.Site_List', 'nbFields': 0},
   {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}




