Wow, glad to know that it works well. And sorry, that Jira is another issue, not the same case as here.
From: Bagmeet Behera [mailto:bagme...@gmail.com]
Sent: Saturday, January 17, 2015 12:47 AM
To: Cheng, Hao
Subject: Re: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

Hi Cheng, Hao

An update: I installed the latest binaries of Spark 1.2.0 (prebuilt for Hadoop 2.4 and later) and tried your suggestion, and it *works* perfectly! I would therefore encourage you to post your reply on the archive for the benefit of all.

Thanks and best wishes,
BB (Bagmeet)

On Fri, Jan 16, 2015 at 11:20 AM, Bagmeet Behera <bagme...@gmail.com> wrote:

Hi Cheng, Hao

The awesome thing is: the query you suggest works perfectly on Spark 1.1.0. I am testing this on an old test installation with Spark 1.1.0 (installed from http://spark.apache.org/) with Scala 2.10.4.

Just FYI: this was because I could not create a HiveContext on the newer installation of Spark 1.2.0 (Scala 2.10.4) from the Cloudera CDH 5.3.0 release, which gave a strange error suggesting some incompatibility between the Hive and Spark libraries. I can create a post for this (if I find an appropriate user group, perhaps on the Cloudera side), but could this also be a result of the bug you mention?

BTW, your reply is not in the archives. I guess this is also because of the bug in the current version that you mentioned?

Many thanks for the reply.
Best,
BB

On Fri, Jan 16, 2015 at 3:24 AM, Cheng, Hao <hao.ch...@intel.com> wrote:

Hi, BB

Ideally you can do the query like:

select key, value.percent from mytable_data lateral view explode(audiences) f as key, value limit 3;

But there is a bug in HiveContext: https://issues.apache.org/jira/browse/SPARK-5237

I am working on it now; hopefully I can make a patch soon.
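To illustrate why the lateral-view form above works while `audiences['key']` comes back null: `'key'` in brackets is looked up as a literal map key, whereas `explode` turns each map entry into its own (key, value) row. A minimal plain-Python sketch of that behavior (no Spark required; the sample data and the `explode_map` helper are hypothetical, with field names taken from the thread's schema):

```python
# Toy rows shaped like the thread's schema: a long 'created' field and a
# map 'audiences' from string keys to {percent, cluster} structs.
rows = [
    {"created": 1421366400,
     "audiences": {"tg_loh": {"percent": 0.0, "cluster": 1},
                   "tg_co": {"percent": 0.0, "cluster": 1}}},
]

def explode_map(row, field):
    """Yield one (key, value) pair per map entry, like LATERAL VIEW explode."""
    for k, v in row[field].items():
        yield k, v

# Failing pattern: audiences['key'] looks up the literal string 'key',
# which is not an entry in the map, hence the [null,null] output.
print(rows[0]["audiences"].get("key"))

# Working pattern: select the exploded key alias and value.percent.
for key, value in explode_map(rows[0], "audiences"):
    print(key, value["percent"])
```

The same reasoning explains Cheng Hao's suggested query: `f as key, value` binds each exploded map entry to the aliases `key` and `value`, so `value.percent` reaches into the struct.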
Cheng Hao

-----Original Message-----
From: BB [mailto:bagme...@gmail.com]
Sent: Friday, January 16, 2015 12:52 AM
To: user@spark.apache.org
Subject: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

Hi all,

Any help on the following is very much appreciated.

=================================
Problem:
On a schemaRDD read from a parquet file (the data in the file uses an AVRO model) using the HiveContext, I can't figure out how to 'select', or use a 'where' clause to filter rows, on a field that has a Map AVRO data type. I want to filter using a given ('key' : 'value') pair. How could I do this?

Details:
* The printSchema of the loaded schemaRDD is like so:

------ output snippet -----
 |-- created: long (nullable = false)
 |-- audiences: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = false)
 |    |    |-- percent: float (nullable = false)
 |    |    |-- cluster: integer (nullable = false)
-----------------------------

* I don't get a result when I try to select on a specific value of the 'audiences' field like so:

"SELECT created, audiences FROM mytable_data LATERAL VIEW explode(audiences) adtab AS adcol WHERE audiences['key']=='tg_loh' LIMIT 10"

The sequence of commands on the spark-shell (a different query and its output) is:

------ code snippet -----
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val parquetFile2 = hiveContext.parquetFile("/home/myuser/myparquetfile")
scala> parquetFile2.registerTempTable("mytable_data")
scala> hiveContext.cacheTable("mytable_data")
scala> hiveContext.sql("SELECT audiences['key'], audiences['value'] FROM mytable_data LATERAL VIEW explode(audiences) adu AS audien LIMIT 3").collect().foreach(println)

------ output ---------
[null,null]
[null,null]
[null,null]
------------------------

gives a list of nulls.
I can see that there is data when I just do the following (output is truncated):

------ code snippet -----
scala> hiveContext.sql("SELECT audiences FROM mytable_data LATERAL VIEW explode(audiences) tablealias AS colalias LIMIT 1").collect().foreach(println)

---- output --------------
[Map(tg_loh -> [0.0,1,Map()], tg_co -> [0.0,1,Map(tg_co_petrol -> 0.0)], tg_wall -> [0.0,1,Map(tg_wall_poi -> 0.0)], ...
------------------------

Q1) What am I doing wrong?
Q2) How can I use 'where' in the query to filter on specific values?

What works: queries that filter and select on fields with simple AVRO data types, such as long or string, work fine.
===========================

I hope the explanation makes sense. Thanks.
Best,
BB

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/using-hiveContext-to-select-a-nested-Map-data-type-from-an-AVROmodel-parquet-file-tp21168.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
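On Q2, the filtering the thread converges on is: explode the map first, then apply the WHERE clause to the exploded key alias rather than to `audiences['key']`. A small plain-Python sketch of that idea (the sample `audiences` map mirrors the truncated output above; names are illustrative, not the actual data):

```python
# Model of filtering exploded map rows. Equivalent HiveQL (once the
# SPARK-5237 bug mentioned in the thread is fixed) would be roughly:
#   SELECT key, value.percent FROM mytable_data
#   LATERAL VIEW explode(audiences) f AS key, value
#   WHERE key = 'tg_loh'
audiences = {
    "tg_loh": {"percent": 0.0, "cluster": 1},
    "tg_co": {"percent": 0.0, "cluster": 1},
    "tg_wall": {"percent": 0.0, "cluster": 1},
}

# Explode the map into (key, value) rows, then filter on the exploded key.
matches = [(k, v["percent"]) for k, v in audiences.items() if k == "tg_loh"]
print(matches)  # [('tg_loh', 0.0)]
```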