Dear Spark Users,

I'd greatly appreciate any comments on whether there is a way to define an Avro-backed Hive table and then run "select *" on it without errors. Thank you very much in advance!
Ilya.

Attached please find a sample data directory and the Avro schema:
analytics_events2.zip <http://apache-spark-user-list.1001560.n3.nabble.com/file/n15070/analytics_events2.zip>
event.avsc <http://apache-spark-user-list.1001560.n3.nabble.com/file/n15070/event.avsc>

Creating a table in ways (1) and (2) below results in exceptions. In case (1) it complains of not being able to find any column in the table definition; in case (2) it cannot find the dummy column "notused". Why does it look at column definitions at all when the Avro schema is provided?

I thought I had actually found a workaround: first create the table in Hive, run "describe table analytics_events2" in Hive to get column definitions like (3), and finally use those to redefine the table as in (4). I am essentially replicating the Avro schema in the column definitions using Hive syntax. It works on most columns, but not on all. Replacing 'current' with 'currentt' (sigh) resolved the parse error in (4), but then I got error (5). Replacing some complex types with 'string' allows me to run queries on the remaining columns (e.g. select ts, host from analytics_events2 limit 10). Mentioning a column whose data type was replaced by 'string' results in errors, of course.
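For reference, here is roughly the translation I have been doing by hand (a quick sketch only — avro_to_hive and record_to_columns are illustrative helpers I made up, not from any library; note it also collapses multi-branch Avro unions to 'string', mirroring my workaround, since Hive's AvroSerDe would otherwise surface them as uniontype):

```python
# Sketch: turn a parsed Avro schema (json.load of event.avsc) into
# Hive column/type syntax for a CREATE TABLE statement.
PRIMITIVES = {"string": "string", "int": "int", "long": "bigint",
              "float": "float", "double": "double", "boolean": "boolean",
              "bytes": "binary"}

def avro_to_hive(t):
    """Map one Avro type (string, list, or dict from parsed JSON) to a Hive type string."""
    if isinstance(t, list):                      # Avro union
        branches = [b for b in t if b != "null"]
        if len(branches) == 1:                   # just a nullable wrapper
            return avro_to_hive(branches[0])
        return "string"                          # the 'replace with string' workaround
    if isinstance(t, dict):
        kind = t.get("type")
        if kind == "record":
            fields = ",".join("%s:%s" % (f["name"], avro_to_hive(f["type"]))
                              for f in t["fields"])
            return "struct<%s>" % fields
        if kind == "array":
            return "array<%s>" % avro_to_hive(t["items"])
        if kind == "map":
            return "map<string,%s>" % avro_to_hive(t["values"])
        return avro_to_hive(kind)                # e.g. {"type": "string"}
    return PRIMITIVES.get(t, "string")

def record_to_columns(schema):
    """Top-level Avro record -> list of 'name type' Hive column definitions."""
    return ["%s %s" % (f["name"], avro_to_hive(f["type"])) for f in schema["fields"]]
```

The output of record_to_columns can then be joined with commas into the column list of the CREATE EXTERNAL TABLE statement in (4).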
(1)

CREATE EXTERNAL TABLE analytics_events2
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='file:///Users/ilyagluhovsky/dev/databone/modeling/rally/data/event.avsc')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/Users/ilyagluhovsky/dev/databone/modeling/rally/data/analytics_events2'

The table gets created, but

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("select * from analytics_events2 limit 10").collect()

results in

/Users/ilyagluhovsky/dev/spark-1.1.0/python/pyspark/sql.pyc in <genexpr>((f,))
    883         """ Row in SchemaRDD """
    884         __DATATYPE__ = dataType
--> 885         __FIELDS__ = tuple(f.name for f in dataType.fields)
    886         __slots__ = ()
AttributeError: 'NoneType' object has no attribute 'name'

(2)

CREATE EXTERNAL TABLE analytics_events2(notused INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='file:///Users/ilyagluhovsky/dev/databone/modeling/rally/data/event.avsc')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/Users/ilyagluhovsky/dev/databone/modeling/rally/data/analytics_events'

Py4JJavaError: An error occurred while calling o27.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 (TID 1, localhost):
java.lang.RuntimeException: cannot find field notused from
[0:ts, 1:host, 2:env, 3:servicemodule, 4:userflow, 5:feature, 6:appendix, 7:caller, 8:callee, 9:experienceid, 10:profilecontext, 11:sessioncontext, 12:personid, 13:frontenddataunit, 14:goaldataunit, 15:coinrewarddataunit, 16:activityrewarddataunit, 17:missiondataunit, 18:logindataunit, 19:devicedataunit, 20:communitydataunit, 21:surveydataunit, 22:expert24dataunit, 23:articledataunit, 24:connectiondataunit, 25:coindataunit, 26:affiliationdataunit, 27:phrdataunit, 28:profiledataunit]

(3)

sessioncontext struct<extendedsessionid:string,userid:string,liftsessionid:string,displayname:string,os:string,browser:string,browserversion:double,referer:string,silverpopinfo:uniontype<struct<utmsource:string,utmmedium:string,utmcampaign:string,utmcontent:string,spmailingid:string,spuserid:string,spjobid:string,spreportid:string>>>

(4)

sqlContext.sql('drop table analytics_events2')
query = '''CREATE EXTERNAL TABLE analytics_events2(
  ts INT,
  host STRING,
  env STRING,
  servicemodule STRING,
  userflow STRING,
  feature STRING,
  appendix STRING,
  caller STRING,
  callee STRING,
  experienceid STRING,
  profilecontext struct<birthyear:int,zipcode:string,gender:string,partnersegmentationid:string,partner:string,client:string,roles:array<string>,displayname:string>,
  sessioncontext struct<extendedsessionid:string,userid:string,liftsessionid:string,displayname:string,os:string,browser:string,browserversion:double,referer:string,silverpopinfo:uniontype<struct<utmsource:string,utmmedium:string,utmcampaign:string,utmcontent:string,spmailingid:string,spuserid:string,spjobid:string,spreportid:string>>>,
  personid struct<userid:string,sessionid:string,cookieid:string>,
  frontenddataunit struct<category:string,label:string,action:string,value:bigint,count:bigint,useragent:string>,
  goaldataunit struct<goalid:string,goalname:string,instanceid:string,action:string,amountcheckin:double,booleancheckin:boolean>,
  coinrewarddataunit struct<rewardtype:string,id:string,description:string,units:int,maxunits:int,amount:int,prebalance:int>,
  activityrewarddataunit struct<rewardtype:string,id:string,description:string,units:int,maxunits:int,activity:string>,
  missiondataunit struct<missionid:string,tsreported:bigint,name:string,instanceid:string,recinstanceid:string,missionversion:string,action:string,checkintype:string,remindersettings:array<struct<day:string,time:string>>,successnumerator:int,successdenominator:int,amountcheckin:double,booleancheckin:boolean>,
  logindataunit struct<action:string>,
  devicedataunit struct<checkints:bigint,value:double,unit:string,devicetype:string,partner:string>,
  communitydataunit struct<communityid:string,discussionid:string,replyid:string,discussiontitle:string,discussiontext:string,replytext:string,action:string>,
  surveydataunit struct<surveyid:string,surveyversion:int,isretake:boolean,action:string,question:struct<id:string,text:string,questiontype:string,tags:array<string>,answers:array<struct<id:string,tag:string,text:string>>>>,
  expert24dataunit struct<surveyinstanceid:string,healthage:int,risks:array<struct<riskname:string,timeline:int,current:float,reduced:float,minimum:float>>,modifiablerisks:array<struct<condition:string,factor:string>>>,
  articledataunit struct<id:string,path:string,subpath:string,searchterms:array<string>,action:string,url:string,title:string,source:string,subsource:string,addedts:bigint,publishedts:bigint,tags:array<string>>,
  connectiondataunit struct<action:string,destuserid:string,destusertype:string>,
  coindataunit struct<vertical:string,description:string,amount:int,balance:int>,
  affiliationdataunit struct<affiliationname:string,partner:string,client:string,affiliationroles:array<string>,invitecode:string,partnerclientid:string,partnersegmentationid:string>,
  phrdataunit struct<salt:string,dataid:string,tsreported:string,name:string,quality:string,owner:string,partner:string,vitals:struct<gender:string,height:string,heightunits:string,weight:string,weightunits:string,waistcircumference:string,waistcircumferenceunits:string>,bloodpressure:struct<systolic:string,diastolic:string>,cholesterol:struct<total:string,hdl:string,ldl:string,triglycerides:string>,bloodsugar:struct<hemoglobina1c:string,glucose:string>>,
  profiledataunit struct<birthyear:int,gender:string,zipcode:string,about:string,location:string>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='file:///Users/ilyagluhovsky/dev/databone/modeling/rally/data/event.avsc')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/Users/ilyagluhovsky/dev/databone/modeling/rally/data/analytics_events2'
'''
sqlContext.sql(query)

...
Caused by: org.apache.hadoop.hive.ql.parse.ParseException: line 23:110 cannot recognize input near 'current' ':' 'float' in column specification
...

(5)

/Users/ilyagluhovsky/dev/spark-1.1.0/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298             raise Py4JJavaError(
    299                 'An error occurred while calling {0}{1}{2}.\n'.
--> 300                 format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(
Py4JJavaError: An error occurred while calling o52.javaToPython.
: java.lang.RuntimeException: Unsupported dataType: struct<extendedsessionid:string,userid:string,liftsessionid:string,displayname:string,os:string,browser:string,browserversion:double,referer:string,silverpopinfo:uniontype<struct<utmsource:string,utmmedium:string,utmcampaign:string,utmcontent:string,spmailingid:string,spuserid:string,spjobid:string,spreportid:string>>>

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Hive-Avro-table-fail-pyspark-tp15070.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
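P.S. To locate which columns from the Hive "describe" output will trip error (5), I just scan each type string for a uniontype before deciding which types to replace with 'string'. A trivial sketch (find_union_columns is an illustrative helper name):

```python
def find_union_columns(columns):
    """columns: list of (name, hive_type_string) pairs, e.g. from 'describe table'.

    Returns the names whose types contain a Hive uniontype anywhere in the
    (possibly deeply nested) type string -- the candidates to replace with
    'string' before querying the table from Spark SQL.
    """
    return [name for name, hive_type in columns if "uniontype<" in hive_type]
```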