Hi All,
I have a lot of small (~2 to 3 MB) XML files that I would like to process. I
was thinking along the following lines; please let me know if you have any
thoughts on this.
1. Create SequenceFiles such that each SequenceFile is between 60 and 64 MB and
no XML file is split across two SequenceFiles.
2. Write a Pig script that loads the SequenceFiles, then iterates over the
individual XML files and analyzes them.
I was planning to use Elephant-Bird to read the SequenceFiles. Here is what
their documentation says:
Hadoop SequenceFiles and Pig
Reading and writing Hadoop SequenceFiles with Pig is supported via the classes
SequenceFileLoader and SequenceFileStorage. These classes make use of a
WritableConverter interface, allowing pluggable conversion of key and value
instances to and from Pig data types.
Here's a short example: Suppose you have SequenceFile<Text, LongWritable> data
sitting beneath path input. We can load that data with the following Pig
script:
REGISTER '/path/to/elephant-bird.jar';
%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';

pairs = LOAD 'input' USING $SEQFILE_LOADER (
    '-c $TEXT_CONVERTER', '-c $LONG_CONVERTER'
) AS (key: chararray, value: long);
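
In my case both the key and the value would be Text (say, the file name and
the raw XML contents), so I imagine my load would look something like the
following; the path and field names are just placeholders:

xml_files = LOAD 'xml_input' USING $SEQFILE_LOADER (
    '-c $TEXT_CONVERTER', '-c $TEXT_CONVERTER'
) AS (filename: chararray, xml: chararray);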
I was also looking at XMLLoader from piggybank. Has anyone used XPath queries
in their Pig scripts?
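
For the analysis step, here is a rough sketch of what I had in mind, assuming
the XPath UDF that ships with recent piggybank builds
(org.apache.pig.piggybank.evaluation.xml.XPath) is available in your Pig
version; the 'book/title' element is made up, and xml_files is the relation
loaded above:

REGISTER '/path/to/piggybank.jar';
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

-- Pull one value out of each XML document with an XPath expression.
titles = FOREACH xml_files GENERATE
    filename,
    XPath(xml, 'book/title') AS title;
DUMP titles;

Does that look reasonable, or is there a better way to run XPath over whole
documents like this?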