Hello,
I have a Pig script that reads sequence files using the SequenceFileLoader()
function. I can extract the XML, but when I try to parse it using the
ElementTree.py or minidom.py scripts, I get an error stating 'an internal error
occurred inside the function while returning'. My question is: can the output
of SequenceFileLoader be parsed by feeding it directly to a UDF, or does the
string need to be formatted before it is passed as an argument? One option is
to store the output to HDFS as an .xml file and then use the XMLLoader function
in Pig to parse it, but I want to do it on the fly, bypassing the store step.

REGISTER /usr/lib/pig/piggybank.jar;
REGISTER /usr/lib64/python2.6/xml/etree/ElementTree.py USING jython AS myudf;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
a = LOAD '/data/appl/20142803/hq.seq' USING SequenceFileLoader('/u001')
    AS (key:chararray, value:chararray);
b = FILTER a BY key == 'crt.xml';
c = FOREACH b GENERATE myudf.fromstring(value);
DUMP c;
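
A minimal sketch of a wrapper UDF for this kind of on-the-fly parsing, as an
alternative to registering ElementTree.py itself. It rests on several
assumptions: the module name xml_udf.py and the function root_tag are
hypothetical, the Jython runtime used by Pig is assumed to be able to import
xml.etree.ElementTree, and Pig is assumed to supply the outputSchema decorator
when it loads the script. The idea is that the function returns a plain string
(a type Pig can map to chararray) rather than an Element object.

```python
# xml_udf.py: hypothetical wrapper module, not part of Pig or Piggybank.
import xml.etree.ElementTree as ET

try:
    outputSchema  # assumed to be provided by Pig when loading a Jython UDF
except NameError:
    # Fallback so the module can also be imported and tested outside Pig.
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("root_tag:chararray")
def root_tag(value):
    """Return the root element's tag name, or None if value is not valid XML."""
    if value is None:
        return None
    try:
        # Parse the XML string in memory; nothing is written to HDFS.
        return ET.fromstring(value).tag
    except Exception:
        # Malformed XML yields a null field instead of failing the job.
        return None

# Possible use from the Pig script (names as assumed above):
#   REGISTER 'xml_udf.py' USING jython AS myudf;
#   c = FOREACH b GENERATE myudf.root_tag(value);
```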

Please let me know whether the parsing can be done on the fly as above.

Thank you in advance for your help in this regard.

Thanks,
Debashish Dhar
