Hi,
I am trying to parse XML using Pig. It works fine with simple XML, but I
run into a problem with nested tags. The script I tried is shown below; for
nested tags it returns only the first address of each student. If there is
a solution, please suggest one.
Input XML:
<students>
<student>
<rollno>1</rollno>
<name>Name1</name>
<addresses>
<address>
<type>Office</type>
<addressline1>1address1</addressline1>
</address>
<address>
<type>Resi</type>
<addressline1>1address1</addressline1>
</address>
</addresses>
</student>
<student>
<rollno>2</rollno>
<name>Name2</name>
<addresses>
<address>
<type>Office2</type>
<addressline1>2address2</addressline1>
</address>
<address>
<type>Resi1</type>
<addressline1>2address2</addressline1>
</address>
</addresses>
</student>
</students>
Pig script:
A = LOAD 'simple.xml' USING org.apache.pig.piggybank.storage.XMLLoader('student') AS (line:chararray);
B = FOREACH A GENERATE
    REGEX_EXTRACT(line, '<rollno>(.*)</rollno>', 1),
    REGEX_EXTRACT(line, '<name>(.*)</name>', 1),
    REGEX_EXTRACT(line, '<address>\\n\\s*<type>(.*)</type>\\n\\s*<addressline1>(.*)</addressline1>\\n\\s*</address>', 1);
My Output:
(1,Name1,Office)
(11,Name11,Office1)
Expected Output:
(1,Name1,Office)
(1,Name1,Resi)
(11,Name11,Office1)
(11,Name11,Resi1)
I also tried another technique, loading the students and the addresses as
two separate relations and combining them, but the following code fails
with an error:
A = LOAD 'simple.xml' USING org.apache.pig.piggybank.storage.XMLLoader('student') AS (line:chararray);
B = LOAD 'simple.xml' USING org.apache.pig.piggybank.storage.XMLLoader('address') AS (line:chararray);
C = FOREACH A GENERATE
    FLATTEN(REGEX_EXTRACT(line, '<rollno>(.*)</rollno>', 1)) AS (roll:chararray);
B = FOREACH B GENERATE
    REGEX_EXTRACT(line, '<address>\\n\\s*<type>(.*)</type>\\n\\s*<addressline1>(.*)</addressline1>\\n\\s*</address>', 1),
    C.roll;
I get the following error:
Scalar has more than one row in the output. 1st : (1), 2nd :(11)
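
For context on the two attempts above: REGEX_EXTRACT returns a single match per input string, so the first script can only ever see the first address of a student, and a reference such as C.roll inside a FOREACH over another relation is treated as a scalar, which requires C to contain exactly one row. One possible direction, given only as an untested sketch, is to split each student record into one chunk per address before extracting the type. It assumes that the installed Pig version supports the optional delimiter argument of TOKENIZE, that piggybank is already registered as in the scripts above, and that the character '|' never appears in the data. The aliases rollno, name, chunk and addr_type are illustrative names, not anything Pig requires.

```pig
-- Sketch only. Assumes piggybank.jar is already registered and that the
-- character '|' never occurs anywhere in the XML data.
A = LOAD 'simple.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('student')
    AS (line:chararray);

-- Replace every closing </address> tag with '|' and split on it, so that each
-- chunk holds at most one <address> block. FLATTEN turns the bag of chunks
-- into one row per chunk, repeating rollno and name on every row.
B = FOREACH A GENERATE
    REGEX_EXTRACT(line, '<rollno>(.*)</rollno>', 1) AS rollno,
    REGEX_EXTRACT(line, '<name>(.*)</name>', 1) AS name,
    FLATTEN(TOKENIZE(REPLACE(line, '</address>', '|'), '|')) AS (chunk:chararray);

-- Extract the address type from each chunk. The trailing chunk after the last
-- address contains no <type> tag, so it yields null and is filtered out.
C = FOREACH B GENERATE rollno, name,
    REGEX_EXTRACT(chunk, '<type>(.*)</type>', 1) AS addr_type;
D = FILTER C BY addr_type IS NOT NULL;
DUMP D;
```

If this works as intended, D should contain one (rollno, name, addr_type) row per address rather than one row per student. Because everything comes from a single load of the student records, no scalar reference to a second relation is needed. If '|' can occur in the data, another character that is guaranteed to be absent would have to be used instead.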