Thanks Zheng! I was able to get this up and running, and it has been working out great so far.
On Tue, Oct 13, 2009 at 12:06 AM, Zheng Shao <[email protected]> wrote:

> Hi Ryan,
>
> Here is a list of commands to get you started along this route:
>
> CREATE TABLE apache_log (
>   a STRING,
>   b STRING,
>   c STRING,
>   extra MAP<STRING,STRING>
> ) ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY '\t'
>   COLLECTION ITEMS TERMINATED BY ' '
>   MAP KEYS TERMINATED BY '=';
>
> LOAD DATA LOCAL INPATH 'myapache.log' OVERWRITE INTO TABLE apache_log;
>
> SELECT a, b, c, extra['key1'], extra['key2'] FROM apache_log;
>
>
> Zheng
>
>
> On Mon, Oct 12, 2009 at 1:48 PM, Ashish Thusoo <[email protected]> wrote:
>
>> One issue could be the fact that the key names will be stored for every
>> entry in the map and that would increase the data sizes. A good compromise
>> is to have the common fields in the log as top level columns in hive and
>> then have a catch all map for the rest.
>>
>> Ashish
>>
>> ------------------------------
>> *From:* Ryan LeCompte [mailto:[email protected]]
>> *Sent:* Sunday, October 11, 2009 4:19 AM
>> *To:* [email protected]
>> *Subject:* Performance of using map column in schema
>>
>> Hello all,
>>
>> I was wondering if there are any performance hits in using a
>> map<string,string> column in a Hive schema to represent a line of an apache
>> log. My issue is that frequently new parameters are added to apache log
>> lines, and it would be nice to not have to always explicitly define these
>> new typed columns in the Hive schema table. If we could specify a single
>> column of map<string,string> that represented all of the param key=value
>> pairs of the apache log line, then we could write ad-hoc queries that
>> referenced whichever log params we wanted. However, it seems that Hive wants
>> typed columns for each parameter to perform well. Any thoughts?
>>
>> Thanks,
>> Ryan
>>
>>
>
>
> --
> Yours,
> Zheng
>
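
For anyone following the thread, here is a rough Python sketch of how one log line would be split under the delimiters in the quoted CREATE TABLE statement (tab between columns, space between map items, '=' between key and value). It only mimics the row format and is not Hive code. The function name parse_line and the sample line are made up for illustration, and the handling of short lines and of items without '=' is an assumption that may differ from what Hive actually does.

```python
# Rough stand-in for the quoted row format; this is NOT Hive's own parser.
def parse_line(line):
    # FIELDS TERMINATED BY '\t': the table declares four columns (a, b, c, extra).
    fields = line.rstrip("\n").split("\t")
    # Assumption for this sketch: columns missing from a short line become None.
    fields += [None] * (4 - len(fields))
    a, b, c, raw_extra = fields[:4]

    extra = {}
    # COLLECTION ITEMS TERMINATED BY ' ': each item is one key=value pair.
    for item in (raw_extra or "").split(" "):
        if not item:
            continue
        # MAP KEYS TERMINATED BY '=': split on the first '=' only.
        key, sep, value = item.partition("=")
        # Assumption: an item with no '=' keeps its key with a None value.
        extra[key] = value if sep else None
    return {"a": a, "b": b, "c": c, "extra": extra}


# Hypothetical sample line; the values are invented for illustration.
sample = "10.0.0.1\tGET\t/index.html\tkey1=foo key2=bar"
row = parse_line(sample)
print(row["a"], row["extra"].get("key1"), row["extra"].get("key2"))
```

In this sketch, row["extra"].get("key1") plays the role of extra['key1'] in the quoted SELECT. A parameter that is absent from a given line simply has no entry in the map, which is what lets new log parameters appear without a schema change.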
