Simple .py custom loader for slightly-nested input?

Dan Brickley Thu, 05 Jul 2012 13:21:38 -0700

Cutting this over from #hadoop-pig IRC:

hi Pig people. I have some TV viewing logs in a text format - example
http://pastebin.com/raw.php?i=HS4zy2pP - ... unfortunately it has some
nesting/list structure, so I can't see a way to read it with an 'out
of the box' Pig loader. Is the conventional practice to write a custom
loader? (Python? Java? anything?). The actual parsing is quite trivial
but I'm unsure how to hook into Pig infrastructure. Ideally it would
be a simple linked .py file, not messing around with complex Java
builds etc.


I found e.g. 
http://arunxjacob.blogspot.com/2010/12/writing-custom-pig-loader.html
(for a Java loader). I hate to sound ungrateful but this is looking a
bit heavy, compared to the simplicity of the task. Would a Python
loader be simpler? (ie. just a second .py script alongside my .pig
script). I was surprised that I wasn't able to find an example of
someone having done this.

Here's the target format, below. Each row is a TV-viewing session,
with a channel and total time, followed by a space-separated list of
item:minute pairs for a sequence of consecutive viewed items on that
channel making up that total.

Thanks for any pointers. I don't mind coding, I just want to find the
right framework to plug into...

cheers,

Dan

2012-03-01T00:00:29Z 1360015279 mychannela 0 asdfasdf:0
2012-03-01T00:04:23Z 0728509428 mychannelb 6 bsdf92c1:6
2012-03-01T00:01:23Z 0516050342 mchannela 20 b00s123k0:19 b0dfgdfgk1:1

(fields: timestamp userid channelid total_duration ... then a sequence
of {itemid}:{mins} for each item viewed in that session of viewing the
channel. These will sum to the total_duration.)
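
To make the parsing concrete, here is a minimal sketch in plain Python of how one line in the format above could be split into its flat fields plus a list of (itemid, mins) pairs. The function name parse_session and the returned tuple layout are hypothetical, chosen only for illustration. How this would be wired into Pig is not shown; one possibility (an assumption, not something tested here) is to load each line as plain text and call a function like this from a Jython UDF.

```python
def parse_session(line):
    """Parse one viewing-session line (hypothetical helper, not a Pig API).

    Returns (timestamp, userid, channelid, total_duration, items), where
    items is a list of (itemid, mins) tuples in viewing order.
    """
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 fields, got: %r" % line)

    timestamp, userid, channelid, total = fields[:4]

    items = []
    for pair in fields[4:]:
        # Split on the last colon so the minutes are always the final part.
        itemid, _, mins = pair.rpartition(":")
        items.append((itemid, int(mins)))

    # userid stays a string so that leading zeros are preserved.
    return timestamp, userid, channelid, int(total), items


if __name__ == "__main__":
    sample = "2012-03-01T00:01:23Z 0516050342 mchannela 20 b00s123k0:19 b0dfgdfgk1:1"
    print(parse_session(sample))
```

The nested part comes back as an ordinary list of tuples, which is roughly the shape a Pig bag of (itemid, mins) tuples would need; the per-item minutes can also be summed and compared against total_duration as a sanity check.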
