Hi,

I am trying to parse XML using Pig. It works fine with simple XML, but I
run into a problem with nested tags. The script I tried is shown below; it
does not work correctly for the nested tags. If there is a solution, please
suggest one.

Input XML:

<students>
  <student>
    <rollno>1</rollno>
    <name>Name1</name>
    <addresses>
      <address>
        <type>Office</type>
        <addressline1>1address1</addressline1>
      </address>
      <address>
        <type>Resi</type>
        <addressline1>1address1</addressline1>
      </address>
    </addresses>
  </student>
  <student>
    <rollno>2</rollno>
    <name>Name2</name>
    <addresses>
      <address>
        <type>Office2</type>
        <addressline1>2address2</addressline1>
      </address>
      <address>
        <type>Resi1</type>
        <addressline1>2address2</addressline1>
      </address>
    </addresses>
  </student>
</students>

Pig script:

-- Load each <student> element as one chararray.
A = LOAD 'simple.xml' USING org.apache.pig.piggybank.storage.XMLLoader('student') AS (line:chararray);
-- Extract roll number, name and address type from each record.
B = FOREACH A GENERATE
    REGEX_EXTRACT(line, '<rollno>(.*)</rollno>', 1),
    REGEX_EXTRACT(line, '<name>(.*)</name>', 1),
    REGEX_EXTRACT(line, '<address>\\n\\s*<type>(.*)</type>\\n\\s*<addressline1>(.*)</addressline1>\\n\\s*</address>', 1);

My Output:
(1,Name1,Office)
(11,Name11,Office1)


Expected Output:

(1,Name1,Office)
(1,Name1,Resi)
(11,Name11,Office1)
(11,Name11,Resi1)
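
To make the expected output concrete, here is a minimal sketch in Python (not Pig) of the flattening I am after, applied to the first student record above. The names `record` and `flatten_student` are only illustrative, and the sketch assumes each student record arrives as a single string with its tags intact. It is meant to show the desired result, one row per address, not how it should be done in Pig.

```python
import re

# One <student> record from the input above, held as a single string
# (assumption: this is roughly what the loader passes to the regexes).
record = """<student>
    <rollno>1</rollno>
    <name>Name1</name>
    <addresses>
      <address>
        <type>Office</type>
        <addressline1>1address1</addressline1>
      </address>
      <address>
        <type>Resi</type>
        <addressline1>1address1</addressline1>
      </address>
    </addresses>
  </student>"""


def flatten_student(xml_text):
    """Return one (rollno, name, type) tuple per <address> in the record."""
    # A single search returns only the first match, which is enough
    # for fields that occur once per student.
    rollno = re.search(r"<rollno>(.*)</rollno>", xml_text).group(1)
    name = re.search(r"<name>(.*)</name>", xml_text).group(1)
    # findall returns every match, so each address contributes its own type.
    types = re.findall(r"<address>\s*<type>(.*)</type>", xml_text)
    return [(rollno, name, t) for t in types]


rows = flatten_student(record)
for row in rows:
    print(row)
```

For this record the sketch yields one row for the Office address and one for the Resi address, both carrying roll number 1 and Name1, whereas my Pig script only returns the first of them.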


I also tried another approach, joining the two sets of tuples, but I got
an error with the following code:

-- Load <student> and <address> elements separately.
A = LOAD 'simple.xml' USING org.apache.pig.piggybank.storage.XMLLoader('student') AS (line:chararray);
B = LOAD 'simple.xml' USING org.apache.pig.piggybank.storage.XMLLoader('address') AS (line:chararray);
-- Roll number of each student.
C = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line, '<rollno>(.*)</rollno>', 1)) AS (roll:chararray);
-- Address type of each address, together with the roll number from C.
B = FOREACH B GENERATE
    REGEX_EXTRACT(line, '<address>\\n\\s*<type>(.*)</type>\\n\\s*<addressline1>(.*)</addressline1>\\n\\s*</address>', 1),
    C.roll;


I am getting the following error:

Scalar has more than one row in the output. 1st : (1), 2nd :(11)