[jira] [Commented] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]
    [ https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15556510#comment-15556510 ]

Xiao Li commented on SPARK-4782:
--------------------------------

This should have been already resolved in the 2.0 release. Thanks!

> Add inferSchema support for RDD[Map[String, Any]]
> -------------------------------------------------
>
>                 Key: SPARK-4782
>                 URL: https://issues.apache.org/jira/browse/SPARK-4782
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Jianshi Huang
>            Priority: Minor
>
> The best way to convert RDD[Map[String, Any]] to SchemaRDD currently seems to
> be converting each Map to JSON String first and use JsonRDD.inferSchema on it.
> It's very inefficient.
> Instead of JsonRDD, RDD[Map[String, Any]] is a better common denominator for
> Schemaless data as adding Map like interface to any serialization format is
> easy.
> So please add inferSchema support to RDD[Map[String, Any]]. *Then for any new
> serialization format we want to support, we just need to add a Map interface
> wrapper to it*
> Jianshi

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]
    [ https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567279#comment-14567279 ]

Jianshi Huang commented on SPARK-4782:
--------------------------------------

Thanks, Luca, for the clever fix!

I also noticed that the schema inference in JsonRDD is too JSON-specific, since JSON's set of data types is quite limited.

Jianshi
[jira] [Commented] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]
    [ https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567254#comment-14567254 ]

Luca Rosellini commented on SPARK-4782:
---------------------------------------

Hi Jianshi,

I've just hit the same problem as you. It seems quite inefficient to have to serialize to JSON when you already have a {{Map\[String,Any\]}}.

I've opened a PR on GitHub that adds this feature in a generic way; check it out at [https://github.com/apache/spark/pull/6554]. Hopefully it will be merged into master.

The patch extends the {{inferSchema}} functionality to any RDD of type T for which you can provide a function mapping from {{RDD\[T\]}} to {{RDD\[Map\[String,Any\]\]}}.

In your case, you already have an {{RDD\[Map\[String,Any\]\]}}, so you can simply pass the identity function, something like this:

{{JsonRDD.inferSchema(json, 1.0, conf.columnNameOfCorruptRecord, \{ (a:RDD\[Map\[String,Any\]\],b:String) => a \})}}
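The idea Luca describes (infer a schema for any collection of T, given a function that turns it into a collection of Map[String, Any]) can be sketched without Spark. The following is only an illustration under stated assumptions: plain Scala Seq stands in for RDD, the second String argument from the comment is dropped, and every name (InferSchemaSketch, FieldType, typeOf, merge, inferSchema) is hypothetical. It is not the code from the pull request and not Spark's API.

```scala
// Illustrative sketch only: Seq stands in for RDD, and all names here are
// hypothetical. This is not the implementation from the pull request.
object InferSchemaSketch {
  sealed trait FieldType
  case object NullType extends FieldType
  case object BooleanType extends FieldType
  case object LongType extends FieldType
  case object DoubleType extends FieldType
  case object StringType extends FieldType

  // Map a single runtime value to a field type.
  def typeOf(value: Any): FieldType = value match {
    case null                                  => NullType
    case _: Boolean                            => BooleanType
    case _: Byte | _: Short | _: Int | _: Long => LongType
    case _: Float | _: Double                  => DoubleType
    case _                                     => StringType // fallback for anything else
  }

  // Merge two types observed for the same key into one common type.
  def merge(a: FieldType, b: FieldType): FieldType = (a, b) match {
    case (x, y) if x == y                                => x
    case (NullType, other)                               => other
    case (other, NullType)                               => other
    case (LongType, DoubleType) | (DoubleType, LongType) => DoubleType
    case _                                               => StringType
  }

  // Infer a schema for records of any type T, given a function that converts
  // the whole collection into a collection of Map[String, Any].
  def inferSchema[T](records: Seq[T])(
      toMaps: Seq[T] => Seq[Map[String, Any]]): Map[String, FieldType] =
    toMaps(records).foldLeft(Map.empty[String, FieldType]) { (schema, record) =>
      record.foldLeft(schema) { case (acc, (key, value)) =>
        val observed = typeOf(value)
        acc.updated(key, acc.get(key).map(merge(_, observed)).getOrElse(observed))
      }
    }

  def main(args: Array[String]): Unit = {
    val maps = Seq(
      Map[String, Any]("id" -> 1, "score" -> 2.5),
      Map[String, Any]("id" -> 2L, "name" -> "a", "score" -> 3)
    )
    // The records are already maps, so the identity function is enough.
    println(inferSchema(maps)(identity))
  }
}
```

Because the conversion function is the only format-specific part, supporting another record type in this sketch means supplying a different toMaps function, while the type-merging logic stays unchanged.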
[jira] [Commented] (SPARK-4782) Add inferSchema support for RDD[Map[String, Any]]
    [ https://issues.apache.org/jira/browse/SPARK-4782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567237#comment-14567237 ]

Apache Spark commented on SPARK-4782:
-------------------------------------

User 'lucarosellini' has created a pull request for this issue:
https://github.com/apache/spark/pull/6554