TFileLoader cannot parse XML files; it expects TFile-format input. The
script posted here tries to load an XML file through TFileLoader, which is
most likely what is causing the error.

The XMLLoader in piggybank.jar might be useful for parsing XML content:
https://pig.apache.org/docs/r0.15.0/api/org/apache/pig/piggybank/storage/XMLLoader.html

For an example of its use, you can refer to:
https://github.com/apache/pig/blob/a44b85a0ab941cdd1d2d7f6e457303aef1e57501/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestXMLLoader.java
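
As a rough sketch of how the posted script could load the file with
XMLLoader instead, something along these lines might work. The path to
piggybank.jar is a placeholder, the alias and field names are made up for
illustration, and passing the namespace-prefixed tag 'en:ManagementNode'
as the record tag is an assumption that has not been verified against
this file:

```pig
-- Placeholder path: point this at wherever piggybank.jar lives on your system.
REGISTER /path/to/piggybank.jar;

-- XMLLoader returns one record per matching element, as a single chararray
-- holding that element's XML text. The tag name here is an assumption.
nodes = LOAD '/user/hduser/test/test01/ManagementNode.xml'
        USING org.apache.pig.piggybank.storage.XMLLoader('en:ManagementNode')
        AS (node:chararray);

-- Pull individual values out of each element with the builtin REGEX_EXTRACT.
details = FOREACH nodes GENERATE
    REGEX_EXTRACT(node, '<en:neId>(.*?)</en:neId>', 1)     AS neId,
    REGEX_EXTRACT(node, '<en:neName>(.*?)</en:neName>', 1) AS neName,
    REGEX_EXTRACT(node, '<en:neIp>(.*?)</en:neIp>', 1)     AS neIp;

distinctDetails = DISTINCT details;
DUMP distinctDetails;
```

Regular expressions are only a quick way to get at simple leaf values like
these; for deeply nested XML a proper parser in a UDF would be more robust.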


If you are interested in running Pig on Tez, you need to run "pig -x tez" so
that Pig uses the Tez execution engine instead of MapReduce.

~Rajesh.B

On Sun, Nov 15, 2015 at 1:11 AM, Julian Reyes <[email protected]>
wrote:

> Hi,
>
> I was just trying to get started with Pig and get familiar with it, but I
> am running into problems while reading an XML file.
>
> My XML looks like the following (of course, it's much bigger; I only
> included the first entries):
>
> <cn:bulkCmConfigDataFile xmlns:cn="details-CONFIG" xmlns:xt="nrmBase"
> xmlns:en="CLL-NB">
> <cn:fileHeader fileFormatVersion="2.0.0" senderName="senderName" vendorName="vendorName"/>
> <cn:configData>
> <en:ManagementNode xmlns:en="CLL-NB">
> <en:neGroup>Group_1</en:neGroup>
> <en:neVersion>2.1.0</en:neVersion>
> <en:neId>100</en:neId>
> <en:neName>TK0005</en:neName>
> <en:neIp>192.168.0.2</en:neIp>
> </en:ManagementNode>
> <en:ManagementNode xmlns:en="CLL-NB">
> <en:neGroup>Group_1</en:neGroup>
> <en:neVersion>2.1.0</en:neVersion>
> <en:neId>101</en:neId>
> <en:neName>TK0002</en:neName>
> <en:neIp>192.168.0.3</en:neIp>
> </en:ManagementNode>
> </cn:configData>
> <cn:fileFooter dateTime="2013-12-20T03:40:15+00:00"/>
> </cn:bulkCmConfigDataFile>
>
> And the Pig script I am trying to use is the following:
>
>
> set pig.splitCombination false;
> set tez.grouping.min-size 5242880;
> set tez.grouping.max-size 5242880;
>
> register '/usr/lib/tez/tez-0.7.0/tez-tfile-parser-0.7.0.jar';
>
> DEFINE getDetails(raw) RETURNS void {
>         details = FOREACH raw GENERATE configData;
>         distinctDetails = DISTINCT details;
>         STORE distinctDetails INTO '$DETAILS' USING PigStorage(',');;
> }
>
>
> rmf $NODE_DETAILS
> rawLogs = load '/user/hduser/test/test01/ManagementNode.xml' using
> org.apache.tez.tools.TFileLoader() as (configData:chararray, key:chararray,
> line:chararray);
> raw = FOREACH rawLogs GENERATE ManagementNode,key,line;
>
> getDetails(raw);
> exec;
>
> However, I am getting the following error:
>
> ERROR 2998: Unhandled internal error. null
>
> java.lang.StackOverflowError
>         at org.apache.tez.tools.TFileLoader.hashCode(TFileLoader.java:148)
>         at java.util.Arrays.hashCode(Arrays.java:3140)
> ...
>
> Could it be because of the XML file?
>
> Thanks.
>
>
> J. Reyes.
>



-- 
~Rajesh.B
