Re: Performance of using map column in schema

Zheng Shao Mon, 12 Oct 2009 21:07:18 -0700

Hi Ryan,

Here is a list of commands to get you started along this route:


CREATE TABLE apache_log (
  a STRING,
  b STRING,
  c STRING,
  extra MAP<STRING,STRING>
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ' '
MAP KEYS TERMINATED BY '=';

LOAD DATA LOCAL INPATH 'myapache.log' OVERWRITE INTO TABLE apache_log;

SELECT a, b, c, extra['key1'], extra['key2'] FROM apache_log;
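
To make the input layout that the table definition implies more concrete, here is a minimal Python sketch of how one line of myapache.log would be split under those delimiters. It only illustrates the three delimiters in the DDL and is not Hive's actual deserializer; the function name parse_line and the sample values are made up for the example.

```python
def parse_line(line):
    """Split one log line using the delimiters from the CREATE TABLE above.

    Illustration only: parse_line is a hypothetical helper, not Hive code.
    """
    # FIELDS TERMINATED BY '\t': three string columns, then the map column.
    a, b, c, extra_raw = line.rstrip("\n").split("\t", 3)

    extra = {}
    # COLLECTION ITEMS TERMINATED BY ' ': one map entry per space-separated item.
    for item in extra_raw.split(" "):
        if not item:
            continue
        # MAP KEYS TERMINATED BY '=': key before the first '=', value after it.
        key, _, value = item.partition("=")
        extra[key] = value
    return a, b, c, extra


# Hypothetical sample line: tab-separated columns, then key=value pairs.
sample = "GET\t/index.html\t200\tkey1=foo key2=bar"
a, b, c, extra = parse_line(sample)
print(a, b, c, extra.get("key1"), extra.get("key2"))
```

One consequence of this layout is that a map value cannot itself contain a space, since the space ends the map entry. Log lines whose values may contain spaces would need different delimiters.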


Zheng

On Mon, Oct 12, 2009 at 1:48 PM, Ashish Thusoo <[email protected]> wrote:

>  One issue is that the key names are stored for every entry in the map,
> which increases the data size. A good compromise is to have the common
> fields in the log as top-level columns in Hive and then have a catch-all
> map for the rest.
>
> Ashish
>
>  ------------------------------
> *From:* Ryan LeCompte [mailto:[email protected]]
> *Sent:* Sunday, October 11, 2009 4:19 AM
> *To:* [email protected]
> *Subject:* Performance of using map column in schema
>
> Hello all,
>
> I was wondering if there are any performance hits in using a
> map<string,string> column in a Hive schema to represent a line of an apache
> log. My issue is that frequently new parameters are added to apache log
> lines, and it would be nice to not have to always explicitly define these
> new typed columns in the Hive schema table. If we could specify a single
> column of map<string,string> that represented all of the param key=value
> pairs of the apache log line, then we could write ad-hoc queries that
> referenced whichever log params we wanted. However, it seems that Hive wants
> typed columns for each parameter to perform well. Any thoughts?
>
> Thanks,
> Ryan
>
>


-- 
Yours,
Zheng
