Not sure of your desired "final output", but below is pseudocode for how I
solved a similar problem with Pig and Python.
Use PigStorage with newline as the delimiter (or whatever you are using to
denote a new line) in order to throw Pig a "fakie" and have it load the whole
line as a single-field tuple.
tv_in = load '$tv_in_path' using PigStorage('\n') as (line:chararray);
Pass each line to a python UDF
tv_in2 = foreach tv_in generate udf.explode_tv(line);
That gets the whole line into the python UDF so that you can do your custom
parsing.
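As a rough illustration of that custom parsing, here is a minimal sketch in plain Python. The function name parse_tv_line is hypothetical, and the field layout is assumed from the sample rows in the original message. To use it from Pig, the script would also need to be registered (something like register 'udf.py' using jython as udf; where the file name is an assumption) and the function would need an output schema declared.

```python
def parse_tv_line(line):
    """Split one session line into its fixed fields and item:minute pairs.

    Assumed layout (taken from the sample data, not from any spec):
    timestamp userid channelid total_duration itemid:mins [itemid:mins ...]
    """
    parts = line.split()
    if len(parts) < 4:
        return None  # malformed line; the caller decides how to handle it
    timestamp, userid, channelid = parts[0], parts[1], parts[2]
    total_duration = int(parts[3])
    items = []
    for pair in parts[4:]:
        # rpartition splits on the last colon, in case an item id contains one
        itemid, _, minutes = pair.rpartition(":")
        items.append((itemid, int(minutes)))
    # userid stays a string so that leading zeros are preserved
    return (timestamp, userid, channelid, total_duration, items)
```

The userid is deliberately left as a string, since some of the sample ids start with a zero.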
Since you don't know the total number of item:minute pairs in advance, you
will have to decide what shape you want to return.
You could return a bag of item:minute pairs nested inside each session tuple,
something like:
R:bag{T:tuple(timestamp, userid, channelid, total_duration,
itemids:bag{iT:tuple(itemid, minutes)})}
or you could create one tuple for each item:minute pair:
R:bag{T:tuple(timestamp, userid, channelid, total_duration, itemid, minutes)}
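To make the two return shapes concrete, here is a small sketch in plain Python. The helper names nested_shape and flat_shape are hypothetical, and it assumes the usual Jython UDF convention that a Python list of tuples stands in for a Pig bag and a Python tuple for a Pig tuple.

```python
def nested_shape(timestamp, userid, channelid, total_duration, pairs):
    """One tuple per session, with the item:minute pairs as an inner bag."""
    return [(timestamp, userid, channelid, total_duration, list(pairs))]


def flat_shape(timestamp, userid, channelid, total_duration, pairs):
    """One tuple per item:minute pair, repeating the session fields."""
    return [
        (timestamp, userid, channelid, total_duration, itemid, minutes)
        for itemid, minutes in pairs
    ]


# Example session taken from the sample data in the original message.
pairs = [("b00s123k0", 19), ("b0dfgdfgk1", 1)]
nested = nested_shape("2012-03-01T00:01:23Z", "0516050342", "mchannela", 20, pairs)
flat = flat_shape("2012-03-01T00:01:23Z", "0516050342", "mchannela", 20, pairs)
```

The nested shape keeps one row per session, while the flat shape is easier to group or join by itemid at the cost of repeating the session fields on every row.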
Hope this helps.
Will Duckworth Senior Vice President, Software Engineering | comScore,
Inc.(NASDAQ:SCOR)
o +1 (703) 438-2108 | m +1 (301) 606-2977 | mailto:[email protected]
-----Original Message-----
From: Dan Brickley [mailto:[email protected]]
Sent: Thursday, July 05, 2012 4:21 PM
To: [email protected]
Subject: Simple .py custom loader for slightly-nested input?
Cutting this over from #hadoop-pig IRC:
hi Pig people. I have some TV viewing logs in a text format - example
http://pastebin.com/raw.php?i=HS4zy2pP - ... unfortunately it has some
nesting/list structure, so I can't see a way to read it with an 'out of the
box' Pig loader. Is the conventional practice to write a custom loader?
(Python? Java? anything?). The actual parsing is quite trivial but I'm unsure
how to hook into Pig infrastructure. Ideally it would be a simple linked .py
file, not messing around with complex java builds etc.
I found e.g.
http://arunxjacob.blogspot.com/2010/12/writing-custom-pig-loader.html
(for a Java loader). I hate to sound ungrateful but this is looking a bit
heavy, compared to the simplicity of the task. Would a Python loader be
simpler? (ie. just a second .py script alongside my .pig script). I was
surprised that I wasn't able to find an example of someone having done this.
Here's the target format, below. Each row is a TV-viewing session, with a
channel and total time, followed by a space-separated list of item:minute pairs
for a sequence of consecutive viewed items on that channel making up that total.
Thanks for any pointers. I don't mind coding, I just want to find the right
framework to plug into...
cheers,
Dan
2012-03-01T00:00:29Z 1360015279 mychannela 0 asdfasdf:0
2012-03-01T00:04:23Z 0728509428 mychannelb 6 bsdf92c1:6
2012-03-01T00:01:23Z 0516050342 mchannela 20 b00s123k0:19 b0dfgdfgk1:1
(fields: timestamp userid channelid total_duration ... then a sequence of
{itemid}:{mins} for each item viewed in that session of viewing the channel.
These will sum to the total_duration.)