We're playing around with options for what I'm sure is a common
problem: changing schemas in our log data.

Specifically, we collect pixel data via nginx servers.  These pixels
currently have a fairly static list of parameters in the query string.
We eventually plan to change this and support many different types of
parameters.

Our current logs have a fixed number of fields separated by a \u0001
delimiter.  So to support "dynamic fields" we have two options (rough
Pig sketches follow the list):

   1. Store the data in a Pig map of chararray keys to chararray values
   2. Stick with static fields and version the log format, so that we
   know exactly how many fields to expect and what the schema is per line
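
For concreteness, here's roughly what each option would look like at
LOAD time in Pig.  This is just a sketch: the path and the field names
(ts, url, referrer) are made up, and option 1 assumes the maps are
written in PigStorage's [key#value,key#value] literal format:

    -- Option 1: the whole line is a single map field
    logs_map = LOAD '/logs/pixels' USING PigStorage('\u0001')
               AS (params:map[]);

    -- Option 2: fixed positional fields, version number up front
    logs_v2 = LOAD '/logs/pixels' USING PigStorage('\u0001')
              AS (version:int, ts:long, url:chararray, referrer:chararray);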

*Option 1 Pros:*
No versioning needed.  If we add a new param, it's automatically picked up
in the map and is available for all scripts to use.  Old scripts don't have
to worry about new params being added.
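
For example (hypothetical param names): a script that only cares about
"url" never has to change when we start sending a new "campaign" param,
and a newer script reading "campaign" just gets null on old lines:

    -- old script: pulls only the keys it cares about; new keys in
    -- the map are simply ignored
    pages = FOREACH logs_map GENERATE (chararray)params#'url' AS url;

    -- newer script: missing keys on older lines come back as null
    camps = FOREACH logs_map GENERATE (chararray)params#'campaign'
            AS campaign;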

*Option 1 Cons:*
Adds significantly to our file sizes.  Compression will help big time,
since the map keys are repeated string values that compress very well.
But when logs are eventually decompressed for analysis, they'll eat up
significantly more disk space.  Also, we're not sure about this, but
dealing with a ton of map objects in Pig could be way more inefficient
and carry more overhead than just a bunch of chararrays/Strings.
Anyone know if this is true?

*Option 2 Pros:*
The big one here is smaller file size, since we don't have to store
field names in our raw logs, only the values (plus, probably, a version
number).
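
To make that concrete (made-up values, with ^A standing in for \u0001),
the same hit might be stored as:

    option 1: [ts#1357843200,url#/article/123,ref#http://google.com]
    option 2: 2^A1357843200^A/article/123^Ahttp://google.com

The keys (ts, url, ref) repeat on every line in option 1, which is
exactly what compresses well but costs disk once decompressed.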

*Option 2 Cons:*
It becomes harder for scripts to work with different versions, and we
need to explicitly state somewhere which log file version each script
depends on.
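
Concretely, every script would need something like this guard before
touching positional fields (again a sketch, with hypothetical field
positions):

    raw = LOAD '/logs/pixels' USING PigStorage('\u0001');

    -- this script is written against v2 of the format and has to say
    -- so explicitly; other versions are dropped (or routed via SPLIT)
    v2 = FILTER raw BY (int)$0 == 2;

    logs = FOREACH v2 GENERATE (long)$1 AS ts, (chararray)$2 AS url;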

Was hoping to get a few opinions on this: what are people doing to
solve this in the wild?

-- 
Mike Sukmanowsky

Product Lead, http://parse.ly
989 Avenue of the Americas, 3rd Floor
New York, NY  10018
p: +1 (416) 953-4248
e: m...@parsely.com
