Dear Spark Users,

I'd greatly appreciate any comments on whether there is a way to define an Avro-backed Hive table and then run "select *" on it without errors. Thank you very much in advance!
Ilya.

Attached please find a sample data directory and the Avro schema:
analytics_events2.zip <http://apache-spark-user-list.1001560.n3.nabble.com/file/n15070/analytics_events2.zip>
event.avsc <http://apache-spark-user-list.1001560.n3.nabble.com/file/n15070/event.avsc>

Creating a table in ways (1) and (2) below results in exceptions. In case (1) it complains of not being able to find any column in the table definition; in case (2) it cannot find the dummy column "notused". Why does it look at column definitions at all when the Avro schema is provided?

I thought I had actually found a workaround: first create the table in Hive, run "describe table analytics_events2" in Hive to get column definitions like (3), and finally use those to redefine the table as in (4). I am essentially replicating the Avro schema in the column definitions using Hive syntax. It works on most columns, but not on all. Replacing 'current' with 'currentt' (sigh) resolved the parse error in (4), but then I got error (5). Replacing some complex types with 'string' allows me to run queries on the remaining columns (e.g. select ts, host from analytics_events2 limit 10). Mentioning a column whose data type was replaced by 'string' results in errors, of course.
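For reference, here is roughly the translation I have been doing by hand (a quick sketch only — avro_to_hive and record_to_columns are illustrative helpers I made up, not from any library; note it also collapses multi-branch Avro unions to 'string', mirroring my workaround, since Hive's AvroSerDe would otherwise surface them as uniontype):

```python
# Sketch: turn a parsed Avro schema (json.load of event.avsc) into
# Hive column/type syntax for a CREATE TABLE statement.
PRIMITIVES = {"string": "string", "int": "int", "long": "bigint",
              "float": "float", "double": "double", "boolean": "boolean",
              "bytes": "binary"}

def avro_to_hive(t):
    """Map one Avro type (string, list, or dict from parsed JSON) to a Hive type string."""
    if isinstance(t, list):                      # Avro union
        branches = [b for b in t if b != "null"]
        if len(branches) == 1:                   # just a nullable wrapper
            return avro_to_hive(branches[0])
        return "string"                          # the 'replace with string' workaround
    if isinstance(t, dict):
        kind = t.get("type")
        if kind == "record":
            fields = ",".join("%s:%s" % (f["name"], avro_to_hive(f["type"]))
                              for f in t["fields"])
            return "struct<%s>" % fields
        if kind == "array":
            return "array<%s>" % avro_to_hive(t["items"])
        if kind == "map":
            return "map<string,%s>" % avro_to_hive(t["values"])
        return avro_to_hive(kind)                # e.g. {"type": "string"}
    return PRIMITIVES.get(t, "string")

def record_to_columns(schema):
    """Top-level Avro record -> list of 'name type' Hive column definitions."""
    return ["%s %s" % (f["name"], avro_to_hive(f["type"])) for f in schema["fields"]]
```

The output of record_to_columns can then be joined with commas into the column list of the CREATE EXTERNAL TABLE statement in (4).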
(1)

CREATE EXTERNAL TABLE analytics_events2
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='file:///Users/ilyagluhovsky/dev/databone/modeling/rally/data/event.avsc')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/Users/ilyagluhovsky/dev/databone/modeling/rally/data/analytics_events2'

The table gets created, but

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("select * from analytics_events2 limit 10").collect()

results in

/Users/ilyagluhovsky/dev/spark-1.1.0/python/pyspark/sql.pyc in <genexpr>((f,))
    883         """ Row in SchemaRDD """
    884         __DATATYPE__ = dataType
--> 885         __FIELDS__ = tuple(f.name for f in dataType.fields)
    886         __slots__ = ()
AttributeError: 'NoneType' object has no attribute 'name'

(2)

CREATE EXTERNAL TABLE analytics_events2(notused INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='file:///Users/ilyagluhovsky/dev/databone/modeling/rally/data/event.avsc')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/Users/ilyagluhovsky/dev/databone/modeling/rally/data/analytics_events'

Py4JJavaError: An error occurred while calling o27.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 (TID 1, localhost):
java.lang.RuntimeException: cannot find field notused from
[0:ts, 1:host, 2:env, 3:servicemodule, 4:userflow, 5:feature, 6:appendix, 7:caller, 8:callee, 9:experienceid, 10:profilecontext, 11:sessioncontext, 12:personid, 13:frontenddataunit, 14:goaldataunit, 15:coinrewarddataunit, 16:activityrewarddataunit, 17:missiondataunit, 18:logindataunit, 19:devicedataunit, 20:communitydataunit, 21:surveydataunit, 22:expert24dataunit, 23:articledataunit, 24:connectiondataunit, 25:coindataunit, 26:affiliationdataunit, 27:phrdataunit, 28:profiledataunit]

(3)

sessioncontext struct<extendedsessionid:string,userid:string,liftsessionid:string,displayname:string,os:string,browser:string,browserversion:double,referer:string,silverpopinfo:uniontype<struct<utmsource:string,utmmedium:string,utmcampaign:string,utmcontent:string,spmailingid:string,spuserid:string,spjobid:string,spreportid:string>>>

(4)

sqlContext.sql('drop table analytics_events2')
query = '''CREATE EXTERNAL TABLE analytics_events2(
  ts INT,
  host STRING,
  env STRING,
  servicemodule STRING,
  userflow STRING,
  feature STRING,
  appendix STRING,
  caller STRING,
  callee STRING,
  experienceid STRING,
  profilecontext struct<birthyear:int,zipcode:string,gender:string,partnersegmentationid:string,partner:string,client:string,roles:array<string>,displayname:string>,
  sessioncontext struct<extendedsessionid:string,userid:string,liftsessionid:string,displayname:string,os:string,browser:string,browserversion:double,referer:string,silverpopinfo:uniontype<struct<utmsource:string,utmmedium:string,utmcampaign:string,utmcontent:string,spmailingid:string,spuserid:string,spjobid:string,spreportid:string>>>,
  personid struct<userid:string,sessionid:string,cookieid:string>,
  frontenddataunit struct<category:string,label:string,action:string,value:bigint,count:bigint,useragent:string>,
  goaldataunit struct<goalid:string,goalname:string,instanceid:string,action:string,amountcheckin:double,booleancheckin:boolean>,
  coinrewarddataunit struct<rewardtype:string,id:string,description:string,units:int,maxunits:int,amount:int,prebalance:int>,
  activityrewarddataunit struct<rewardtype:string,id:string,description:string,units:int,maxunits:int,activity:string>,
  missiondataunit struct<missionid:string,tsreported:bigint,name:string,instanceid:string,recinstanceid:string,missionversion:string,action:string,checkintype:string,remindersettings:array<struct<day:string,time:string>>,successnumerator:int,successdenominator:int,amountcheckin:double,booleancheckin:boolean>,
  logindataunit struct<action:string>,
  devicedataunit struct<checkints:bigint,value:double,unit:string,devicetype:string,partner:string>,
  communitydataunit struct<communityid:string,discussionid:string,replyid:string,discussiontitle:string,discussiontext:string,replytext:string,action:string>,
  surveydataunit struct<surveyid:string,surveyversion:int,isretake:boolean,action:string,question:struct<id:string,text:string,questiontype:string,tags:array<string>,answers:array<struct<id:string,tag:string,text:string>>>>,
  expert24dataunit struct<surveyinstanceid:string,healthage:int,risks:array<struct<riskname:string,timeline:int,current:float,reduced:float,minimum:float>>,modifiablerisks:array<struct<condition:string,factor:string>>>,
  articledataunit struct<id:string,path:string,subpath:string,searchterms:array<string>,action:string,url:string,title:string,source:string,subsource:string,addedts:bigint,publishedts:bigint,tags:array<string>>,
  connectiondataunit struct<action:string,destuserid:string,destusertype:string>,
  coindataunit struct<vertical:string,description:string,amount:int,balance:int>,
  affiliationdataunit struct<affiliationname:string,partner:string,client:string,affiliationroles:array<string>,invitecode:string,partnerclientid:string,partnersegmentationid:string>,
  phrdataunit struct<salt:string,dataid:string,tsreported:string,name:string,quality:string,owner:string,partner:string,vitals:struct<gender:string,height:string,heightunits:string,weight:string,weightunits:string,waistcircumference:string,waistcircumferenceunits:string>,bloodpressure:struct<systolic:string,diastolic:string>,cholesterol:struct<total:string,hdl:string,ldl:string,triglycerides:string>,bloodsugar:struct<hemoglobina1c:string,glucose:string>>,
  profiledataunit struct<birthyear:int,gender:string,zipcode:string,about:string,location:string>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='file:///Users/ilyagluhovsky/dev/databone/modeling/rally/data/event.avsc')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/Users/ilyagluhovsky/dev/databone/modeling/rally/data/analytics_events2'
'''
sqlContext.sql(query)

...
Caused by: org.apache.hadoop.hive.ql.parse.ParseException: line 23:110 cannot recognize input near 'current' ':' 'float' in column specification
...

(5)

/Users/ilyagluhovsky/dev/spark-1.1.0/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298             raise Py4JJavaError(
    299                 'An error occurred while calling {0}{1}{2}.\n'.
--> 300                 format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(
Py4JJavaError: An error occurred while calling o52.javaToPython.
: java.lang.RuntimeException: Unsupported dataType: struct<extendedsessionid:string,userid:string,liftsessionid:string,displayname:string,os:string,browser:string,browserversion:double,referer:string,silverpopinfo:uniontype<struct<utmsource:string,utmmedium:string,utmcampaign:string,utmcontent:string,spmailingid:string,spuserid:string,spjobid:string,spreportid:string>>>

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Hive-Avro-table-fail-pyspark-tp15070.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
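P.S. To locate which columns from the Hive "describe" output will trip error (5), I just scan each type string for a uniontype before deciding which types to replace with 'string'. A trivial sketch (find_union_columns is an illustrative helper name):

```python
def find_union_columns(columns):
    """columns: list of (name, hive_type_string) pairs, e.g. from 'describe table'.

    Returns the names whose types contain a Hive uniontype anywhere in the
    (possibly deeply nested) type string -- the candidates to replace with
    'string' before querying the table from Spark SQL.
    """
    return [name for name, hive_type in columns if "uniontype<" in hive_type]
```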