Hi,
I am just getting started with Pig and trying to get familiar with it, but I
am having problems reading an XML file.
My XML looks like the following (of course, it's much bigger; I have only
included the first entries):
<cn:bulkCmConfigDataFile xmlns:cn="details-CONFIG" xmlns:xt="nrmBase" xmlns:en="CLL-NB">
<cn:fileHeader fileFormatVersion="2.0.0" senderName="senderName" vendorName="vendorName"/>
<cn:configData>
<en:ManagementNode xmlns:en="CLL-NB">
<en:neGroup>Group_1</en:neGroup>
<en:neVersion>2.1.0</en:neVersion>
<en:neId>100</en:neId>
<en:neName>TK0005</en:neName>
<en:neIp>192.168.0.2</en:neIp>
</en:ManagementNode>
<en:ManagementNode xmlns:en="CLL-NB">
<en:neGroup>Group_1</en:neGroup>
<en:neVersion>2.1.0</en:neVersion>
<en:neId>101</en:neId>
<en:neName>TK0002</en:neName>
<en:neIp>192.168.0.3</en:neIp>
</en:ManagementNode>
</cn:configData>
<cn:fileFooter dateTime="2013-12-20T03:40:15+00:00"/>
</cn:bulkCmConfigDataFile>
And the Pig script I am trying to use is the following:
set pig.splitCombination false;
set tez.grouping.min-size 5242880;
set tez.grouping.max-size 5242880;
register '/usr/lib/tez/tez-0.7.0/tez-tfile-parser-0.7.0.jar';
DEFINE getDetails(raw) RETURNS void {
details = FOREACH raw GENERATE configData;
distinctDetails = DISTINCT details;
STORE distinctDetails INTO '$DETAILS' USING PigStorage(',');;
}
rmf $NODE_DETAILS
rawLogs = load '/user/hduser/test/test01/ManagementNode.xml'
    using org.apache.tez.tools.TFileLoader()
    as (configData:chararray, key:chararray, line:chararray);
raw = FOREACH rawLogs GENERATE ManagementNode,key,line;
getDetails(raw);
exec;
However, I am getting the following error:
ERROR 2998: Unhandled internal error. null
java.lang.StackOverflowError
at org.apache.tez.tools.TFileLoader.hashCode(TFileLoader.java:148)
at java.util.Arrays.hashCode(Arrays.java:3140)
...
Could it be because of the XML file?
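One way to rule the XML itself in or out would be to check that it is
well-formed outside of Pig. Below is a minimal sketch in Python (standard
library only), assuming the excerpt above is representative of the full file.
The function name list_management_nodes and the trimmed SAMPLE string are
hypothetical and only for illustration; the namespace URIs are copied from the
excerpt.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:en declaration in the excerpt.
EN_NS = "CLL-NB"

# Trimmed, hypothetical sample: one ManagementNode from the excerpt.
SAMPLE = """
<cn:bulkCmConfigDataFile xmlns:cn="details-CONFIG" xmlns:xt="nrmBase" xmlns:en="CLL-NB">
<cn:configData>
<en:ManagementNode xmlns:en="CLL-NB">
<en:neGroup>Group_1</en:neGroup>
<en:neVersion>2.1.0</en:neVersion>
<en:neId>100</en:neId>
<en:neName>TK0005</en:neName>
<en:neIp>192.168.0.2</en:neIp>
</en:ManagementNode>
</cn:configData>
</cn:bulkCmConfigDataFile>
"""


def list_management_nodes(xml_text):
    """Parse xml_text and return one dict per en:ManagementNode element."""
    # fromstring raises ET.ParseError if the document is not well-formed.
    root = ET.fromstring(xml_text)
    nodes = []
    for node in root.iter("{%s}ManagementNode" % EN_NS):
        # Child tags look like "{CLL-NB}neId"; keep only the local name.
        nodes.append({child.tag.split("}", 1)[1]: child.text for child in node})
    return nodes


if __name__ == "__main__":
    for record in list_management_nodes(SAMPLE):
        print(record)
```

If the full file parses this way without an error, the XML is at least
well-formed, which would suggest the problem lies in how the script loads it
rather than in the file.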
Thanks.
J. Reyes.