It sounds like what you're asking for is versioned schema management. Pig
isn't designed for that; it is only a scripting language for manipulating
datasets.

I'd recommend looking into Thrift, Protocol Buffers, and Avro. They are
compact serialization libraries that handle versioned schemas for you.
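
To make the idea concrete, here is a rough Python sketch of what versioned
schema handling looks like if you roll it by hand for \u0001-delimited lines
(your option 2). It is not the API of Thrift, Protocol Buffers, or Avro; the
field names, the version numbers, and the leading version column are all
assumptions made up for illustration. Those libraries automate this kind of
bookkeeping rather than leaving it to each script.

```python
# Hand-rolled sketch of versioned schemas for \u0001-delimited log lines.
# All field names and version numbers below are hypothetical.

DELIM = "\u0001"

# Fields written by each log format version, in on-disk order.
# Assumption: every line starts with its version number.
SCHEMAS = {
    1: ["ts", "url", "referrer"],
    2: ["ts", "url", "referrer", "campaign"],
}

# Values to use when an older line lacks a field the reader expects.
DEFAULTS = {"campaign": ""}

# Scripts always see the newest field list, whatever version was written.
READER_FIELDS = SCHEMAS[max(SCHEMAS)]


def parse_line(line):
    """Parse one log line into a dict shaped like the newest schema."""
    version_str, *values = line.rstrip("\n").split(DELIM)
    writer_fields = SCHEMAS[int(version_str)]
    if len(values) != len(writer_fields):
        raise ValueError(
            "version %s expects %d fields, got %d"
            % (version_str, len(writer_fields), len(values))
        )
    written = dict(zip(writer_fields, values))
    # Fill in defaults for fields the writer's version did not have.
    return {name: written.get(name, DEFAULTS.get(name)) for name in READER_FIELDS}


if __name__ == "__main__":
    old = DELIM.join(["1", "1386875160", "http://example.com/a", "http://example.com/"])
    new = DELIM.join(["2", "1386875161", "http://example.com/b", "", "winter"])
    print(parse_line(old))
    print(parse_line(new))
```

With something like this, old scripts keep working when a field is added,
because every record comes back in one shape and missing fields get a
default. The file size stays close to option 2, since only the values and a
version number are stored per line.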


On Thu, Dec 12, 2013 at 2:06 PM, Mike Sukmanowsky <m...@parsely.com> wrote:

> We're playing around with options to what I'm sure is a common problem -
> changing schemas in our log data.
>
> Specifically we collect pixel data via nginx servers.  These pixels
> currently have a pretty static list of parameters in the query string.  We
> have eventual plans to change this and support many different types of
> parameters in the query string.
>
> Our current logs have a static number of fields separated by a \u0001
> delimiter.  So to support "dynamic fields" we have two options:
>
>    1. Store data using a Java/Pig Map of key:chararray and val:chararray
>    2. Stick with static fields, and version the log format so that we know
>    exactly how many fields to expect and what the schema is per line
>
> *Option 1 Pros:*
> No versioning needed.  If we add a new param, it's automatically picked up
> in the map and is available for all scripts to use.  Old scripts don't have
> to worry about new params being added.
>
> *Option 1 Cons:*
> Adds significantly to our file sizes.  Compression will help big time as
> many of the keys in the map are repeated string values which will benefit
> largely from compression.   But eventually when logs are decompressed for
> analysis, they'll eat up significantly more disk space.  Also, we're not
> sure about this but dealing with a ton of Map objects in Pig could be way
> more inefficient and have more overhead than just a bunch of
> chararrays/Strings.  Anyone know if this is true?
>
> *Option 2 Pros:*
> Basically smaller file size is the big one here since we don't have to
> store the field name in our raw logs only the value and probably a version
> number also.
>
> *Option 2 Cons:*
> Becomes harder for scripts to work with different versions and we need to
> explicitly state which log file version the script depends on somewhere.
>
> Was hoping to get a few opinions on this, what are people doing to solve
> this in the wild?
>
> --
> Mike Sukmanowsky
>
> Product Lead, http://parse.ly
> 989 Avenue of the Americas, 3rd Floor
> New York, NY  10018
> p: +1 (416) 953-4248
> e: m...@parsely.com
>
