[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-11-03  Michael Armbrust (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-2870:

Target Version/s:   (was: 1.6.0)

> Thorough schema inference directly on RDDs of Python dictionaries
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQLContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
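> For concreteness, here is a hedged sketch of that round trip against the PySpark 1.x shell (the {{sc}} and {{sqlContext}} bindings are the usual shell defaults; nothing here is a new API):
> {code}
> import json
>
> # An RDD of dicts whose elements disagree on the type of field "a".
> heterogeneous = sc.parallelize([{"a": 5}, {"a": "cow"}])
>
> # Serialize each dict back to JSON text so that jsonRDD() can scan the
> # whole data set and reconcile the conflicting value types for "a".
> schema_rdd = sqlContext.jsonRDD(heterogeneous.map(lambda x: json.dumps(x)))
> {code}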
> As of 1.0.2, [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] just looks at the first element in the data set. This won't help much when the structure of the elements in the target RDD is variable.
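> By contrast, a minimal sketch of that limitation, reusing the RDD from the sketch above (behavior as described in the linked 1.0.2 docs):
> {code}
> # inferSchema() peeks only at the first dict, so the resulting schema
> # claims "a" is an integer; rows like {"a": "cow"} then fail or come back
> # mangled when the SchemaRDD is queried.
> schema_rdd = sqlContext.inferSchema(heterogeneous)
> {code}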
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you first 
> need to filter out bad elements from the data (basically, some minimal 
> schema validation).
> * You deserialize the JSON objects to Python {{dict}}s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema of the whole 
> data set (see the sketch below).
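> A hedged end-to-end sketch of this use case (the input path and the {{is_valid}} check are illustrative, not an existing API):
> {code}
> import json
>
> # Hypothetical input: JSON text, one object per line.
> raw = sc.textFile("events.json")
>
> def is_valid(record):
>     # Minimal schema validation: keep only records that carry an "a" field.
>     return "a" in record
>
> # Deserialize and filter out the bad elements; this is now an RDD of dicts.
> dicts = raw.map(json.loads).filter(is_valid)
>
> # Today's workaround: round-trip through JSON text so the schema is
> # inferred from the whole data set rather than just the first element.
> schema_rdd = sqlContext.jsonRDD(dicts.map(json.dumps))
> schema_rdd.registerAsTable("events")
> sqlContext.sql("SELECT a FROM events")
> {code}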






[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-08-03  Reynold Xin (JIRA)


Reynold Xin updated SPARK-2870:
Parent Issue: SPARK-9576  (was: SPARK-6116)




[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-05-11  Reynold Xin (JIRA)


Reynold Xin updated SPARK-2870:
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-6116




[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-05-10  Reynold Xin (JIRA)


Reynold Xin updated SPARK-2870:
Target Version/s: 1.5.0  (was: 1.4.0)




[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-02-16  Michael Armbrust (JIRA)


Michael Armbrust updated SPARK-2870:

Target Version/s: 1.4.0  (was: 1.3.0)




[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2014-11-03  Michael Armbrust (JIRA)


Michael Armbrust updated SPARK-2870:

Target Version/s: 1.3.0  (was: 1.2.0)




[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2014-08-31  Michael Armbrust (JIRA)


Michael Armbrust updated SPARK-2870:

Target Version/s: 1.2.0




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org