I am in a similar situation, with three similar (but differently used) types of log entries in my log files. Rather than write a UDF (I don't know Java, so I avoided that route), I went with SPLIT. I am not sure how feasible this is for you, but by way of example, this is what I did:
json = LOAD '$INPUT' USING com.twitter.elephantbird.pig.load.JsonLoader();

SPLIT json INTO app1 IF $0#'type' MATCHES 'i',
                app2 IF $0#'type' MATCHES 'c',
                app3 IF $0#'type' MATCHES 'b';

I'll grant that all my log entries are JSON, which may make parsing a little easier for me (since I ensured they all have a 'type' property), but the concept may be able to be extrapolated for your use case.

-e

On Sat, Mar 12, 2011 at 14:37, Marko Musnjak <marko.musn...@gmail.com> wrote:

> Hi,
> I have log files with a dozen different entry types, and I would like to
> have them loaded into several different relations.
> I couldn't figure out how to attach the schema to a single tuple, so now
> I'm loading into a tuple with a type id and a map of values in the UDF,
> and then split by type and create the final tuples in Pig. Is there a
> better/more efficient way to do this?
> I would like to avoid having loading logic in both the UDF and the Pig
> script, and generate all "final" tuples in the UDF, and then just use a
> split in Pig.
> Thanks,
> Marko

Eric Lubow
e: eric.lu...@gmail.com
w: eric.lubow.org
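For what it's worth, the route-by-'type' idea the SPLIT expresses can be sketched outside Pig too. Here is a minimal Python illustration of the same concept (the field names other than 'type', and the sample records, are invented for the example):

```python
import json

# Sample JSON log lines, each carrying a 'type' discriminator
# ('i', 'c', or 'b') like the property the SPLIT keys on.
log_lines = [
    '{"type": "i", "msg": "impression"}',
    '{"type": "c", "msg": "click"}',
    '{"type": "b", "msg": "beacon"}',
    '{"type": "c", "msg": "another click"}',
]

# Route each parsed record into a per-type bucket,
# analogous to SPLIT json INTO app1 ..., app2 ..., app3 ...
buckets = {"i": [], "c": [], "b": []}
for line in log_lines:
    record = json.loads(line)
    buckets[record["type"]].append(record)
```

The point is the same as in Pig: one pass over the input, with each record landing in exactly one relation/bucket based on its 'type' value.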