[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-2870:
------------------------------------
    Target Version/s:   (was: 1.6.0)

Thorough schema inference directly on RDDs of Python dictionaries
------------------------------------------------------------------

                Key: SPARK-2870
                URL: https://issues.apache.org/jira/browse/SPARK-2870
            Project: Spark
         Issue Type: Sub-task
         Components: PySpark, SQL
           Reporter: Nicholas Chammas

h4. Background

I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. They process JSON text directly and infer a schema that covers the entire source data set.

This is very important with semi-structured data like JSON, since individual elements in the data set are free to have different structures. Matching fields across elements may even have different value types. For example:

{code}
{"a": 5}
{"a": "cow"}
{code}

To get a queryable schema that covers the whole data set, you need to infer a schema by looking at the whole data set. The aforementioned {{SQLContext.json...()}} methods do this very well.

h4. Feature Request

What we need is for {{SQLContext.inferSchema()}} to do this, too. Alternatively, we need a new {{SQLContext}} method that works on RDDs of Python dictionaries and does something functionally equivalent to this:

{code}
SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
{code}

As of 1.0.2, [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] just looks at the first element in the data set. This won't help much when the structure of the elements in the target RDD is variable.

h4. Example Use Case

* You have some JSON text data that you want to analyze using Spark SQL.
* You would use one of the {{SQLContext.json...()}} methods, but you first need to do some filtering on the data to remove bad elements -- basically, some minimal schema validation.
* You deserialize the JSON objects to Python {{dict}}s and filter out the bad ones. You now have an RDD of dictionaries.
* From this RDD, you want a SchemaRDD that captures the schema for the whole data set. (A sketch of this workflow appears after this list.)
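For reference, here is a minimal PySpark sketch of the use case above together with the JSON round-trip workaround the ticket describes. It targets the pre-DataFrame API that the ticket discusses ({{SQLContext.jsonRDD()}} and {{SchemaRDD}}); the sample data, {{raw_lines}}, and {{parse_or_none}} are hypothetical stand-ins for whatever input and validation an application actually has.

{code}
import json

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="schema-inference-sketch")
sqlContext = SQLContext(sc)

# Hypothetical input: JSON text with inconsistent structure plus a bad record.
raw_lines = sc.parallelize([
    '{"a": 5}',
    '{"a": "cow"}',
    'not json at all',   # the kind of "bad element" we want to filter out
])

def parse_or_none(line):
    """Deserialize one JSON line to a dict; return None for unparseable records."""
    try:
        return json.loads(line)
    except ValueError:
        return None

# Minimal schema validation: keep only records that parsed to a dict.
dicts = raw_lines.map(parse_or_none).filter(lambda d: isinstance(d, dict))

# What this ticket asks for, roughly:
#     sqlContext.inferSchema(dicts)   # should scan the whole data set
# As of 1.0.2, inferSchema() only samples the first element, so a field that is
# an int in one record and a string in another can get a misleading schema.

# Current workaround: serialize the dicts back to JSON text and let jsonRDD()
# infer a schema over the entire data set.
schema_rdd = sqlContext.jsonRDD(dicts.map(json.dumps))

# With both an integer and a string value for "a", the inferred schema widens
# the field so the whole data set stays queryable.
schema_rdd.printSchema()
{code}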
[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-2870:
-------------------------------
    Parent Issue: SPARK-9576  (was: SPARK-6116)
[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-2870:
-------------------------------
    Issue Type: Sub-task  (was: Improvement)
        Parent: SPARK-6116
[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-2870:
-------------------------------
    Target Version/s: 1.5.0  (was: 1.4.0)
[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-2870:
------------------------------------
    Target Version/s: 1.4.0  (was: 1.3.0)
[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-2870:
------------------------------------
    Target Version/s: 1.3.0  (was: 1.2.0)
[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-2870:
------------------------------------
    Target Version/s: 1.2.0