TFileLoader cannot parse XML files; it is meant for TFile-format data (such as aggregated YARN logs). The script posted here tries to load an XML file via TFileLoader, which is most likely what is causing the issue.
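
As a rough illustration of loading the same file with an XML-aware loader, here is a minimal, untested sketch using piggybank's XMLLoader. The piggybank.jar path, the alias names, and the regex-based field extraction are assumptions made for this example. Passing the namespace-prefixed tag 'en:ManagementNode' is also an assumption that should be checked against your Pig version.

```pig
-- Hypothetical jar location; adjust to wherever piggybank.jar lives on your system.
REGISTER /usr/lib/pig/piggybank.jar;

-- XMLLoader takes a tag name and returns each matching element as one chararray.
-- Assumption: the prefixed tag name is matched literally.
nodes = LOAD '/user/hduser/test/test01/ManagementNode.xml'
        USING org.apache.pig.piggybank.storage.XMLLoader('en:ManagementNode')
        AS (node:chararray);

-- Pull individual fields out of the raw element text with the built-in REGEX_EXTRACT.
details = FOREACH nodes GENERATE
    REGEX_EXTRACT(node, '<en:neId>(.*?)</en:neId>', 1)     AS neId,
    REGEX_EXTRACT(node, '<en:neName>(.*?)</en:neName>', 1) AS neName,
    REGEX_EXTRACT(node, '<en:neIp>(.*?)</en:neIp>', 1)     AS neIp;

distinctDetails = DISTINCT details;
DUMP distinctDetails;
```

Regex extraction is used here only because the sample elements are flat. A more deeply nested document would probably need a different approach.
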
XMLLoader in piggybank.jar might be useful for parsing XML content:
https://pig.apache.org/docs/r0.15.0/api/org/apache/pig/piggybank/storage/XMLLoader.html

For an example, you can refer to
https://github.com/apache/pig/blob/a44b85a0ab941cdd1d2d7f6e457303aef1e57501/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestXMLLoader.java

If you are interested in using Pig on Tez, you need to run "pig -x tez" to tell Pig to use the Tez execution engine instead of MapReduce.

~Rajesh.B

On Sun, Nov 15, 2015 at 1:11 AM, Julian Reyes wrote:

> Hi,
>
> I just was trying to get started using Pig and get familiar with it but I
> am getting problems while reading the XML.
>
> My XML looks like the following (of course, its much bigger, I just added
> first entries):
>
> <cn:bulkCmConfigDataFile xmlns:cn="details-CONFIG" xmlns:xt="nrmBase"
> xmlns:en="CLL-NB">
> <cn:fileHeader fileFormatVersion="2.0.0" senderName="senderName" vendorName
> ="vendorName"/>
> <cn:configData>
> <en:ManagementNode xmlns:en="CLL-NB">
> <en:neGroup>Group_1</en:neGroup>
> <en:neVersion>2.1.0</en:neVersion>
> <en:neId>100</en:neId>
> <en:neName>TK0005</en:neName>
> <en:neIp>192.168.0.2</en:neIp>
> </en:ManagementNode>
> <en:ManagementNode xmlns:en="CLL-NB">
> <en:neGroup>Group_1</en:neGroup>
> <en:neVersion>2.1.0</en:neVersion>
> <en:neId>101</en:neId>
> <en:neName>TK0002</en:neName>
> <en:neIp>192.168.0.3</en:neIp>
> </en:ManagementNode>
> </cn:configData>
> <cn:fileFooter dateTime="2013-12-20T03:40:15+00:00"/>
> </cn:bulkCmConfigDataFile>
>
> And the Pig script I am trying to use is the following:
>
>
> set pig.splitCombination false;
> set tez.grouping.min-size 5242880;
> set tez.grouping.max-size 5242880;
>
> register '/usr/lib/tez/tez-0.7.0/tez-tfile-parser-0.7.0.jar';
>
> DEFINE getDetails(raw) RETURNS void {
> details = FOREACH raw GENERATE configData;
> distinctDetails = DISTINCT details;
> STORE distinctDetails INTO '$DETAILS' USING PigStorage(',');;
> }
>
>
> rmf $NODE_DETAILS
> rawLogs = load '/user/hduser/test/test01/ManagementNode.xml' using
> org.apache.tez.tools.TFileLoader() as (configData:chararray, key:chararray,
> line:chararray);
> raw = FOREACH rawLogs GENERATE ManagementNode,key,line;
>
> getDetails(raw);
> exec;
>
> However, I am getting the following error:
>
> ERROR 2998: Unhandled internal error. null
>
> java.lang.StackOverflowError
> at org.apache.tez.tools.TFileLoader.hashCode(TFileLoader.java:148)
> at java.util.Arrays.hashCode(Arrays.java:3140)
> ...
>
> Could it be because of the XML file?
>
> Thanks.
>
>
> J. Reyes.
>

--
~Rajesh.B
