Hi Ryan,

Here is a list of commands to get you started along this route:
CREATE TABLE apache_log (
  a STRING,
  b STRING,
  c STRING,
  extra MAP<STRING,STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ' '
  MAP KEYS TERMINATED BY '=';

LOAD DATA LOCAL INPATH 'myapache.log' OVERWRITE INTO TABLE apache_log;

SELECT a, b, c, extra['key1'], extra['key2'] FROM apache_log;

Zheng

On Mon, Oct 12, 2009 at 1:48 PM, Ashish Thusoo <[email protected]> wrote:

> One issue could be the fact that the key names will be stored for every
> entry in the map and that would increase the data sizes. A good compromise
> is to have the common fields in the log as top level columns in hive and
> then have a catch all map for the rest.
>
> Ashish
>
> ------------------------------
> *From:* Ryan LeCompte [mailto:[email protected]]
> *Sent:* Sunday, October 11, 2009 4:19 AM
> *To:* [email protected]
> *Subject:* Performance of using map column in schema
>
> Hello all,
>
> I was wondering if there are any performance hits in using a
> map<string,string> column in a Hive schema to represent a line of an apache
> log. My issue is that frequently new parameters are added to apache log
> lines, and it would be nice to not have to always explicitly define these
> new typed columns in the Hive schema table. If we could specify a single
> column of map<string,string> that represented all of the param key=value
> pairs of the apache log line, then we could write ad-hoc queries that
> referenced whichever log params we wanted. However, it seems that Hive wants
> typed columns for each parameter to perform well. Any thoughts?
>
> Thanks,
> Ryan
>

--
Yours,
Zheng
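As a rough illustration of the ad-hoc queries Ryan describes, here is a sketch against the apache_log table defined above. The sample input line and the parameter name key1 are placeholders, not taken from a real log, and the query assumes the file really is tab-separated with space-separated key=value pairs in the last field.

```sql
-- Hypothetical input line (<TAB> stands for a literal tab character):
--   valA<TAB>valB<TAB>valC<TAB>key1=foo key2=bar
-- With the delimiters in the table definition this would load as
--   a = 'valA', b = 'valB', c = 'valC',
--   extra = {'key1' -> 'foo', 'key2' -> 'bar'}

-- Count log lines per value of one parameter stored in the catch-all map.
-- Looking up a key that a row does not have returns NULL, so those rows
-- are filtered out first.
SELECT extra['key1'] AS key1_value, COUNT(1) AS hits
FROM apache_log
WHERE extra['key1'] IS NOT NULL
GROUP BY extra['key1'];
```

One caveat with this layout: because map entries are split on spaces, a parameter value that itself contains a space would be cut at that space, so such logs would need a different COLLECTION ITEMS delimiter.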
