I am in a similar situation, with three types of log entries (similar in
shape, but obviously used differently) in my log files. Rather than write a
UDF (I don't know Java, so I avoided that), I went the SPLIT route. I'm not
sure how feasible this is for you, but by way of example, this is what I did:

json = LOAD '$INPUT' USING com.twitter.elephantbird.pig.load.JsonLoader();

SPLIT json INTO app1 IF $0#'type' MATCHES 'i',
                app2 IF $0#'type' MATCHES 'c',
                app3 IF $0#'type' MATCHES 'b';

I'll grant that all my log entries are JSON, which may make parsing a little
easier for me (since I ensured they all have a 'type' property), but the
concept should extrapolate to your use case.
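
For instance, once the SPLIT is done, you can attach a schema in the same
script by following each branch with a FOREACH that casts the map values
into typed fields. A rough sketch (the field names 'ts', 'user', and 'msg'
here are hypothetical; substitute whatever your entries actually carry):

-- hypothetical fields: cast the single map field ($0) into typed columns
app1_typed = FOREACH app1 GENERATE
    (long)      $0#'ts'   AS ts,
    (chararray) $0#'user' AS user,
    (chararray) $0#'msg'  AS msg;

Do the same for app2 and app3, and each relation ends up with a proper
schema while all the parsing logic stays in the one Pig script.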

-e

On Sat, Mar 12, 2011 at 14:37, Marko Musnjak <marko.musn...@gmail.com> wrote:

> Hi,
> I have log files with a dozen different entry types, and I would like to
> have them loaded into several different relations.
> I couldn't figure out how to attach the schema to a single tuple, so now
> I'm loading into a tuple with a type id and a map of values in the UDF,
> then splitting by type and creating the final tuples in Pig. Is there a
> better/more efficient way to do this?
> I would like to avoid having loading logic in both the UDF and the Pig
> script, and instead generate all "final" tuples in the UDF and then just
> use a SPLIT in Pig.
> Thanks,
> Marko
>

Eric Lubow
e: eric.lu...@gmail.com
w: eric.lubow.org
