Thanks, Pradeep - none of our logs currently use Protocol Buffers/Thrift/Avro, and we were somewhat trying to stay away from them, but they may be a good option.

On Thu, Dec 12, 2013 at 6:35 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:

> It seems like what you're asking for is Versioned Schema management. Pig is
> not designed for that. Pig is only a scripting language to manipulate
> datasets.
>
> I'd recommend you look into Thrift, Protocol Buffers and Avro. They are
> compact serialization libraries that do versioned schema management.
>
>
> On Thu, Dec 12, 2013 at 2:06 PM, Mike Sukmanowsky <m...@parsely.com>
> wrote:
>
> > We're playing around with options to what I'm sure is a common problem -
> > changing schemas in our log data.
> >
> > Specifically we collect pixel data via nginx servers. These pixels
> > currently have a pretty static list of parameters in the query string. We
> > have eventual plans to change this and support many different types of
> > parameters in the query string.
> >
> > Our current logs have a static number of fields separated by a \u0001
> > delimiter. So to support "dynamic fields" we have two options:
> >
> >    1. Store data using a Java/Pig Map of key:chararray and val:chararray
> >    2. Stick with static fields, and version the log format so that we know
> >    exactly how many fields to expect and what the schema is per line
> >
> > *Option 1 Pros:*
> > No versioning needed. If we add a new param, it's automatically picked up
> > in the map and is available for all scripts to use. Old scripts don't have
> > to worry about new params being added.
> >
> > *Option 1 Cons:*
> > Adds significantly to our file sizes. Compression will help big time as
> > many of the keys in the map are repeated string values which will benefit
> > largely from compression. But eventually when logs are decompressed for
> > analysis, they'll eat up significantly more disk space. Also, we're not
> > sure about this but dealing with a ton of Map objects in Pig could be way
> > more inefficient and have more overhead than just a bunch of
> > chararrays/Strings. Anyone know if this is true?
> >
> > *Option 2 Pros:*
> > Basically smaller file size is the big one here since we don't have to
> > store the field name in our raw logs only the value and probably a version
> > number also.
> >
> > *Option 2 Cons:*
> > Becomes harder for scripts to work with different versions and we need to
> > explicitly state which log file version the script depends on somewhere.
> >
> > Was hoping to get a few opinions on this, what are people doing to solve
> > this in the wild?
> >
> > --
> > Mike Sukmanowsky
> >
> > Product Lead, http://parse.ly
> > 989 Avenue of the Americas, 3rd Floor
> > New York, NY 10018
> > p: +1 (416) 953-4248
> > e: m...@parsely.com
> >
>

--
Mike Sukmanowsky

Product Lead, http://parse.ly
989 Avenue of the Americas, 3rd Floor
New York, NY 10018
p: +1 (416) 953-4248
e: m...@parsely.com
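
For readers comparing the two options in the quoted message, here is a minimal sketch of both in Python (used as a neutral stand-in; this is not Pig code and not what anyone in the thread actually runs). The field names, the two schema versions, and the `key=value` map encoding are all hypothetical, chosen only to illustrate the trade-off. Only the \u0001 delimiter comes from the thread.

```python
DELIM = "\u0001"  # field delimiter mentioned in the thread

# Hypothetical schemas: version -> ordered field names (illustrative only).
SCHEMAS = {
    "1": ["ts", "url", "referrer"],
    "2": ["ts", "url", "referrer", "campaign"],
}


def parse_versioned(line):
    """Option 2: the first field is the version, the rest are positional values."""
    version, *values = line.rstrip("\n").split(DELIM)
    fields = SCHEMAS[version]
    if len(values) != len(fields):
        raise ValueError(
            f"v{version} expects {len(fields)} fields, got {len(values)}"
        )
    return dict(zip(fields, values))


def normalize(record, all_fields, default=""):
    """Give scripts one stable shape, whatever version the line came from."""
    return {name: record.get(name, default) for name in all_fields}


def encode_map(record):
    """Option 1: store every key next to its value (assumed key=value pairs)."""
    return DELIM.join(f"{key}={value}" for key, value in record.items())


def parse_map(line):
    """Inverse of encode_map; splits on the first '=' so values may contain '='."""
    return dict(pair.split("=", 1) for pair in line.rstrip("\n").split(DELIM))


# Two sample lines, one per hypothetical schema version.
v1 = DELIM.join(["1", "1386889500", "http://example.com/a", "http://example.org/"])
v2 = DELIM.join(["2", "1386889501", "http://example.com/b", "", "winter"])

ALL_FIELDS = SCHEMAS["2"]
rec1 = normalize(parse_versioned(v1), ALL_FIELDS)
rec2 = normalize(parse_versioned(v2), ALL_FIELDS)

# The map form repeats the field names on every line, so it is longer.
map_line = encode_map(rec2)
print(len(v2), len(map_line))
```

Keeping the version-to-fields table in one shared place is one way to address the "Option 2 Cons" point: each script asks for the fields it needs and `normalize` fills in defaults for lines written before a field existed.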