One issue is that the key names are stored with every entry in the map, in every row, which increases the data size. A good compromise is to make the fields that are common to every log line top-level columns in Hive and keep a catch-all map for the rest.
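
As a rough sketch of that layout (the table name, column names, and delimiters here are only assumptions for illustration, not taken from your logs), something like:

```sql
-- Hypothetical table: common fields as typed top-level columns,
-- everything else in a catch-all map.
CREATE TABLE apache_log (
  host         STRING,
  request_time STRING,
  request      STRING,
  status       INT,
  params       MAP<STRING, STRING>   -- remaining key=value pairs
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY '&'
  MAP KEYS TERMINATED BY '=';

-- Ad-hoc query mixing a typed column with a map lookup.
-- 'session_id' is a made-up key; a missing key returns NULL.
SELECT host, params['session_id']
FROM apache_log
WHERE status = 500;
```

New parameters then show up as new keys in params without a schema change, and only the less common fields pay the cost of repeating the key name in every row.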
Ashish

________________________________
From: Ryan LeCompte [mailto:[email protected]]
Sent: Sunday, October 11, 2009 4:19 AM
To: [email protected]
Subject: Performance of using map column in schema

Hello all,

I was wondering if there are any performance hits in using a map<string,string> column in a Hive schema to represent a line of an apache log. My issue is that frequently new parameters are added to apache log lines, and it would be nice to not have to always explicitly define these new typed columns in the Hive schema table. If we could specify a single column of map<string,string> that represented all of the param key=value pairs of the apache log line, then we could write ad-hoc queries that referenced whichever log params we wanted. However, it seems that Hive wants typed columns for each parameter to perform well.

Any thoughts?

Thanks,
Ryan
