Hello,

I was able to parse the XML by modifying XMLLoader.java.

I set XMLTagNameRegExp as follows: "[a-zA-Z:\\_][0-9a-zA-Z:\\-_]+" and
it now seems to be working.
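
A minimal sketch of why the colon matters, using plain java.util.regex outside of Pig. The class name TagNameCheck and its helper are made up for illustration; only the two pattern strings come from this thread, and the sketch assumes the loader requires the whole identifier to match. A colon has no special meaning in a Java regex, so it needs no escaping.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of piggybank: checks a tag identifier
// against a candidate value for XMLTagNameRegExp.
final class TagNameCheck {
    // Pattern quoted in the exception message (no colon allowed).
    static final String ORIGINAL = "[a-zA-Z\\_][0-9a-zA-Z\\-_]+";
    // Modified pattern from this message (colon allowed).
    static final String MODIFIED = "[a-zA-Z:\\_][0-9a-zA-Z:\\-_]+";

    // Returns true only if the whole tag matches the given pattern.
    static boolean isValidTag(String regex, String tag) {
        return Pattern.matches(regex, tag);
    }

    public static void main(String[] args) {
        String tag = "en:ManagementNode";
        System.out.println("original: " + isValidTag(ORIGINAL, tag));
        System.out.println("modified: " + isValidTag(MODIFIED, tag));
    }
}
```

With the original pattern the prefixed name is rejected because ':' is in neither character class; the modified pattern accepts it.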

However, the output looks like this:

((Group_1),(2.1.0),(100),(TK0005))
((Group_1),(2.1.0),(101),(TK0002))

But I would like to store it in a CSV file that looks like this:

Group_1,2.1.0,100,TK0005
Group_1,2.1.0,101,TK0002
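
To make the desired transformation concrete, here is a small sketch that turns one dumped tuple line into the CSV form shown above. It is a post-processing illustration in plain Java, not a Pig-side fix; the class name TupleLineToCsv is made up, and the sketch assumes field values never contain parentheses or commas.

```java
// Hypothetical post-processing helper, not part of Pig.
final class TupleLineToCsv {
    // Removes every '(' and ')' from a dumped tuple line.
    // Assumes field values never contain parentheses or commas.
    static String toCsv(String tupleLine) {
        return tupleLine.replace("(", "").replace(")", "");
    }

    public static void main(String[] args) {
        System.out.println(toCsv("((Group_1),(2.1.0),(100),(TK0005))"));
    }
}
```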

I also need to open more XML files, but the names of those files depend on
the third column: 100.xml, 101.xml, etc.

How could I open those files in the same Pig script and generate different
outputs?
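
The naming rule itself can be sketched as follows. This only illustrates how the file names follow from the third column of the CSV rows; it does not show how to load them from within the same Pig script. The class name NodeFileNames is made up, and the sketch assumes comma-separated rows with at least three columns.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: derives per-node XML file names from CSV rows.
final class NodeFileNames {
    // Takes the third column (neId) of each row and appends ".xml".
    // Duplicates are dropped; first-seen order is kept.
    static List<String> fromCsvRows(List<String> rows) {
        Set<String> names = new LinkedHashSet<>();
        for (String row : rows) {
            String[] cols = row.split(",");
            if (cols.length >= 3) {
                names.add(cols[2].trim() + ".xml");
            }
        }
        return new ArrayList<>(names);
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        rows.add("Group_1,2.1.0,100,TK0005");
        rows.add("Group_1,2.1.0,101,TK0002");
        System.out.println(fromCsvRows(rows));
    }
}
```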

My Pig script:

rmf $NODE_DETAILS
rawData = load '$INPUT_LOGS'
    using org.apache.pig.piggybank.storage.XMLLoader('en:ManagementNode')
    as (doc:chararray);
raw = FOREACH rawData GENERATE
    XPath(doc,'ManagementNode/neGroup'),
    XPath(doc,'ManagementNode/neVersion'),
    XPath(doc,'ManagementNode/neId'),
    XPath(doc,'ManagementNode/neName');

--getNodeDetails(raw);
--exec;
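
For comparison, here is a sketch of how a single namespaced <en:ManagementNode> fragment can be queried with the standard javax.xml.xpath API. This is plain JAXP, not the piggybank XPath UDF, so it says nothing certain about how the UDF treats prefixes; the class name NodeXPathDemo is made up. Using local-name() selects elements regardless of their prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: reads one child of a ManagementNode fragment.
final class NodeXPathDemo {
    // Returns the text of the child element with the given local name,
    // or an empty string if no such element exists.
    static String field(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            String expr = "/*[local-name()='ManagementNode']"
                    + "/*[local-name()='" + localName + "']";
            return xpath.evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<en:ManagementNode xmlns:en=\"CLL-NB\">"
                + "<en:neGroup>Group_1</en:neGroup>"
                + "<en:neId>100</en:neId>"
                + "</en:ManagementNode>";
        System.out.println(field(xml, "neGroup") + "," + field(xml, "neId"));
    }
}
```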

I also tried the following macro to get rid of the parentheses, but I am
getting exceptions:

-- Read through the node details to find out enbId
DEFINE getNodeDetails(raw) RETURNS void {
        details = FOREACH raw GENERATE FLATTEN(neGroup,neVersion,neId,neName);
        distinctDetails = DISTINCT details PARALLEL 1;
        STORE distinctDetails INTO '$NODE_DETAILS' USING PigStorage('\t');
}


Regards,
Thanks.


J. Reyes.



On 16 November 2015 at 17:50, Julian Reyes <[email protected]>
wrote:

> Hi
>
> I see. Thanks.
>
> I just changed it and used XMLLoader as follows:
>
> rawData = load '$INPUT' using
> org.apache.pig.piggybank.storage.XMLLoader('en:ManagementNode') as
> (doc:chararray);
> raw = FOREACH rawData GENERATE doc;
>
> However, I am getting this exception:
>
> java.lang.RuntimeException: XML tag identifier 'en:ManagementNode' does
> not match the regular expression /[a-zA-Z\_][0-9a-zA-Z\-_]+/
>
> It has to be because of my XML file:
>
> <cn:bulkCmConfigDataFile xmlns:cn="details-CONFIG" xmlns:xt="nrmBase"
> xmlns:en="CLL-NB">
> <cn:fileHeader fileFormatVersion="2.0.0" senderName="senderName"
> vendorName="vendorName"/>
> <cn:configData>
> <en:ManagementNode xmlns:en="CLL-NB">
> <en:neGroup>Group_1</en:neGroup>
> <en:neVersion>2.1.0</en:neVersion>
> <en:neId>100</en:neId>
> <en:neName>TK0005</en:neName>
> <en:neIp>192.168.0.2</en:neIp>
> </en:ManagementNode>
> <en:ManagementNode xmlns:en="CLL-NB">
> <en:neGroup>Group_1</en:neGroup>
> <en:neVersion>2.1.0</en:neVersion>
> <en:neId>101</en:neId>
> <en:neName>TK0002</en:neName>
> <en:neIp>192.168.0.3</en:neIp>
> </en:ManagementNode>
> </cn:configData>
> <cn:fileFooter dateTime="2013-12-20T03:40:15+00:00"/>
> </cn:bulkCmConfigDataFile>
>
> I was looking at XMLLoader.java and I see the string that should match
>
> private static final String XMLTagNameRegExp = "[a-zA-Z\\_][0-9a-zA-Z\\-_]+";
>
> So I was thinking of maybe changing that String to
> "[a-zA-Z\\_:][0-9a-zA-Z\\-_:]+" and redeploying?
>
> Also, how could I use XPath?
>
> raw = FOREACH rawData GENERATE
>     XPath(doc,'en:ManagementNode/en:neGroup'),
>     XPath(doc,'en:ManagementNode/en:neVersion'),
>     XPath(doc,'en:ManagementNode/en:neId'),
>     XPath(doc,'en:ManagementNode/en:neName');
>
> My command looks like
>
> pig -x tez -m /home/hduser/test/param.txt -f /home/hduser/test/script.pig
>
>
> Thanks.
>
>
>
>
> J. Reyes.
>
>
>
> On 15 November 2015 at 22:55, Rajesh Balamohan <[email protected]> wrote:
>
>> TFileLoader cannot parse XML files. The script posted here tries to parse
>> an XML file via TFileLoader, which could be causing the issue.
>>
>> XMLLoader in piggybank.jar might be useful for parsing XML contents:
>>
>> https://pig.apache.org/docs/r0.15.0/api/org/apache/pig/piggybank/storage/XMLLoader.html
>>
>> You can refer to TestXMLLoader.java for an example:
>>
>> https://github.com/apache/pig/blob/a44b85a0ab941cdd1d2d7f6e457303aef1e57501/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestXMLLoader.java
>>
>>
>> If you are interested in using Pig with Tez, you need to run "pig -x tez"
>> to tell Pig to use the Tez execution engine instead of MR.
>>
>> ~Rajesh.B
>>
>> On Sun, Nov 15, 2015 at 1:11 AM, Julian Reyes <[email protected]>
>> wrote:
>>
>> > Hi,
>> >
>> > I was just trying to get started with Pig and get familiar with it, but
>> > I am having problems reading the XML.
>> >
>> > My XML looks like the following (of course, it's much bigger; I just
>> > added the first entries):
>> >
>> > <cn:bulkCmConfigDataFile xmlns:cn="details-CONFIG" xmlns:xt="nrmBase"
>> > xmlns:en="CLL-NB">
>> > <cn:fileHeader fileFormatVersion="2.0.0" senderName="senderName"
>> > vendorName="vendorName"/>
>> > <cn:configData>
>> > <en:ManagementNode xmlns:en="CLL-NB">
>> > <en:neGroup>Group_1</en:neGroup>
>> > <en:neVersion>2.1.0</en:neVersion>
>> > <en:neId>100</en:neId>
>> > <en:neName>TK0005</en:neName>
>> > <en:neIp>192.168.0.2</en:neIp>
>> > </en:ManagementNode>
>> > <en:ManagementNode xmlns:en="CLL-NB">
>> > <en:neGroup>Group_1</en:neGroup>
>> > <en:neVersion>2.1.0</en:neVersion>
>> > <en:neId>101</en:neId>
>> > <en:neName>TK0002</en:neName>
>> > <en:neIp>192.168.0.3</en:neIp>
>> > </en:ManagementNode>
>> > </cn:configData>
>> > <cn:fileFooter dateTime="2013-12-20T03:40:15+00:00"/>
>> > </cn:bulkCmConfigDataFile>
>> >
>> > And the Pig script I am trying to use is the following:
>> >
>> >
>> > set pig.splitCombination false;
>> > set tez.grouping.min-size 5242880;
>> > set tez.grouping.max-size 5242880;
>> >
>> > register '/usr/lib/tez/tez-0.7.0/tez-tfile-parser-0.7.0.jar';
>> >
>> > DEFINE getDetails(raw) RETURNS void {
>> >         details = FOREACH raw GENERATE configData;
>> >         distinctDetails = DISTINCT details;
>> >         STORE distinctDetails INTO '$DETAILS' USING PigStorage(',');
>> > }
>> >
>> >
>> > rmf $NODE_DETAILS
>> > rawLogs = load '/user/hduser/test/test01/ManagementNode.xml'
>> >     using org.apache.tez.tools.TFileLoader()
>> >     as (configData:chararray, key:chararray, line:chararray);
>> > raw = FOREACH rawLogs GENERATE ManagementNode,key,line;
>> >
>> > getDetails(raw);
>> > exec;
>> >
>> > However, I am getting the following error:
>> >
>> > ERROR 2998: Unhandled internal error. null
>> >
>> > java.lang.StackOverflowError
>> >         at org.apache.tez.tools.TFileLoader.hashCode(TFileLoader.java:148)
>> >         at java.util.Arrays.hashCode(Arrays.java:3140)
>> > ...
>> >
>> > Could it be because of the XML file?
>> >
>> > Thanks.
>> >
>> >
>> > J. Reyes.
>> >
>>
>>
>>
>> --
>> ~Rajesh.B
>>
>
>
