We're playing around with options for what I'm sure is a common problem - changing schemas in our log data.
Specifically, we collect pixel data via nginx servers. These pixels currently have a fairly static list of parameters in the query string. We eventually plan to change this and support many different types of parameters in the query string. Our current logs have a static number of fields separated by a \u0001 delimiter. So to support "dynamic fields" we have two options:

1. Store data using a Java/Pig Map of key:chararray and val:chararray
2. Stick with static fields, and version the log format so that we know exactly how many fields to expect and what the schema is per line

*Option 1 Pros:* No versioning needed. If we add a new param, it's automatically picked up in the map and available for all scripts to use. Old scripts don't have to worry about new params being added.

*Option 1 Cons:* Adds significantly to our file sizes. Compression will help a lot, since many of the keys in the map are repeated string values, but when logs are eventually decompressed for analysis they'll eat up significantly more disk space. Also, we're not sure about this, but dealing with a ton of Map objects in Pig could be much less efficient and carry more overhead than a bunch of plain chararrays/Strings. Does anyone know if this is true?

*Option 2 Pros:* Smaller file size is the big one here, since we don't have to store the field names in our raw logs, only the values (plus, probably, a version number).

*Option 2 Cons:* It becomes harder for scripts to work with different versions, and we need to explicitly state somewhere which log file version each script depends on.

Was hoping to get a few opinions on this - what are people doing to solve this in the wild?

-- 
Mike Sukmanowsky
Product Lead, http://parse.ly
989 Avenue of the Americas, 3rd Floor
New York, NY 10018
p: +1 (416) 953-4248
e: m...@parsely.com
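P.S. A quick back-of-the-envelope sketch in Python of where the size difference comes from (the field names, values, and the \u0002 key/value separator are made up for illustration, not our actual pixel params):

```python
# Hypothetical pixel record; real params differ.
record = {"ts": "1336428000", "apikey": "demo", "url": "http://example.com/a"}

DELIM = "\u0001"   # field delimiter, as in our current logs
KV_SEP = "\u0002"  # assumed key/value separator for the map-style format

# Option 1: self-describing map-style line (every key name stored per record)
map_line = DELIM.join(f"{k}{KV_SEP}{v}" for k, v in record.items())

# Option 2: versioned static fields -- say schema v1 is (ts, apikey, url),
# with the version number as the first field
SCHEMA_V1 = ("ts", "apikey", "url")
static_line = DELIM.join(("1",) + tuple(record[f] for f in SCHEMA_V1))

print(len(map_line), len(static_line))
```

The map line repeats every key name on every record, which is exactly the redundancy that compresses well on disk but reappears as extra bytes once the logs are decompressed for analysis.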