One issue is that the key names are stored with every entry in the map, which 
increases the data size. A good compromise is to make the fields common to 
every log line top-level columns in Hive and keep a catch-all map for the 
rest.
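
As a rough sketch of that split, assuming each log line has already been 
reduced to space-separated key=value pairs (the field names host, status and 
bytes and the helper split_params are hypothetical, not taken from any real 
schema), a preprocessing step could look something like this:

```python
# Hypothetical fields promoted to top-level Hive columns; choose your own.
COMMON_FIELDS = ("host", "status", "bytes")


def split_params(line, common_fields=COMMON_FIELDS):
    """Split a line of space-separated key=value pairs into
    (values for the top-level columns, catch-all map for the rest)."""
    # Parse every token that contains '=' into a key/value pair.
    pairs = dict(
        token.split("=", 1) for token in line.split() if "=" in token
    )
    # Pull out the fixed columns; a missing key becomes None (NULL in Hive).
    columns = {name: pairs.pop(name, None) for name in common_fields}
    # Whatever remains would be loaded into the map<string,string> column.
    return columns, pairs


columns, extras = split_params(
    "host=10.0.0.1 status=200 bytes=512 campaign=fall09"
)
print(columns)
print(extras)
```

Only the rarely seen parameters then pay the per-entry cost of carrying their 
key names, and a new parameter shows up in the map without a schema change.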

Ashish

________________________________
From: Ryan LeCompte [mailto:[email protected]]
Sent: Sunday, October 11, 2009 4:19 AM
To: [email protected]
Subject: Performance of using map column in schema

Hello all,

I was wondering if there are any performance hits in using a 
map<string,string> column in a Hive schema to represent a line of an Apache 
log. My issue is that new parameters are frequently added to Apache log lines, 
and it would be nice not to have to explicitly define a new typed column in 
the Hive table schema each time. If we could specify a single column of 
map<string,string> that represented all of the param key=value pairs of the 
Apache log line, then we could write ad-hoc queries that referenced whichever 
log params we wanted. However, it seems that Hive wants typed columns for each 
parameter to perform well. Any thoughts?

Thanks,
Ryan
