I'm using org.apache.pig.piggybank.storage.XMLLoader from piggybank and that's
working well for me. I do something like this:
-- The analyze_src_recs.py script reads XML from stdin, and writes to
-- stdout comma-separated lines rec_type,...
--
define analyze_src `analyze_src_recs.py`
    input (stdin)
    output (stdout USING PigStreaming(','))
    ship ('$scriptDir/analyze_src_recs.py');

SrcLines = load '$src_xml/*.xml*'
    using org.apache.pig.piggybank.storage.XMLLoader('REC')
    as (doc:chararray);

ParseOut = stream SrcLines through analyze_src
    as (rec_type : int,
        -- other fields my parser pulled out of the XML
    );
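
For illustration, here is a minimal sketch of what a streaming script in the style of analyze_src_recs.py might look like. It is not the actual script. It assumes each record reaches stdin as a complete, non-nested <REC>...</REC> element (possibly spanning several lines), and the "type" attribute and <title> child are made-up field names standing in for whatever your XML really contains.

```python
#!/usr/bin/env python
"""Hypothetical sketch of a Pig streaming parser; field names are invented."""
import re
import sys
import xml.etree.ElementTree as ET

# Matches one complete <REC>...</REC> element, even across line breaks.
# Assumption: REC elements are never nested or self-closing.
REC_PATTERN = re.compile(r"<REC\b.*?</REC>", re.DOTALL)


def parse_records(text):
    """Yield (rec_type, title) for each REC element found in text."""
    for match in REC_PATTERN.finditer(text):
        rec = ET.fromstring(match.group(0))
        # "type" and "title" are placeholder names, not from a real schema.
        rec_type = int(rec.get("type", "0"))
        # Collapse internal whitespace so a value never contains a newline.
        title = " ".join((rec.findtext("title") or "").split())
        yield rec_type, title


def format_line(fields):
    """Join fields with commas, matching PigStreaming(',') on the Pig side."""
    # A comma inside a value would split the field, so replace it.
    return ",".join(str(f).replace(",", " ") for f in fields)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Reads everything at once for simplicity; a real script would likely
    # buffer incrementally to keep memory use bounded.
    for fields in parse_records(stdin.read()):
        stdout.write(format_line(fields) + "\n")


if __name__ == "__main__":
    main()
```

The point of this design is that the XML parsing happens in an ordinary program using an ordinary parser, and Pig only sees the flat comma-separated lines that come back, which it splits into the fields named in the "as" clause.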
William F Dowling
Senior Technologist
Thomson Reuters
0 +1 215 823 3853
-----Original Message-----
From: Rory McCann
Sent: Friday, January 13, 2012 7:12 AM
To: [email protected]
Subject: Custom Loaders that use Input Streams for reading data?
Hi all,
I'm new to Pig (and a bit rusty with Java!) and still just playing
around with it, nothing serious yet. I might be misunderstanding
something important here.
I'm trying to write a custom loader for a custom XML file format, i.e.
deserialize the XML into Pig data types. However, all the documentation
and other code is based on taking a RecordReader and spitting out things
from getNext().
Is there any way to make a custom loader that works on InputStreams or
more common java-io-y type stuff? I'd like to use more commonly
available XML parsers (which work on these). Since it's XML, line-by-line
parsing doesn't really work. I will just have one input file that
will be parsed. Is there some reason why there are no InputStreams?
I have also asked this question on StackOverflow:
http://stackoverflow.com/questions/8843790/custom-apache-pig-loadfunc-where-can-i-get-the-inputstream-on-the-file
--
Rory