This is an example for a page with one revision (other cases are more
complex, but it is possible):

-- Piggybank provides XMLLoader and RegexExtractAll.
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
DEFINE XMLLoader org.apache.pig.piggybank.storage.XMLLoader();
DEFINE RegexExtractAll org.apache.pig.piggybank.evaluation.string.RegexExtractAll();

-- Each record is one <page> element, loaded as a single chararray.
revisionXML = LOAD 'Revision.xml' USING XMLLoader('page') AS (revision:chararray);

-- Capture page id, revision id and username (one revision per page).
-- The leading (?s).*? and trailing .* assume the UDF has to match the whole record.
rev = FOREACH revisionXML GENERATE FLATTEN(
    RegexExtractAll(revision, '(?s).*?<id>([^<]*)</id>\\s*<revision>\\s*<id>([^<]*)</id>\\s*<username>([^<]*)</username>\\s*</revision>.*')
) AS (
    page: chararray,
    id_revision: chararray,
    username: chararray
);

DUMP rev;
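
For pages with several revisions a single regex over the whole page is not enough, because the number of capture groups is fixed. Below is a minimal sketch of the per-page logic in plain Python (not Pig), only to illustrate the idea: read the page id once, then emit one row per <revision>. The names page_rows, PAGE_ID and REVISION are made up for this example, and the tag layout is assumed to be exactly the one in the question. How this would be wired into Pig (for example as a UDF) is left open here.

```python
import re

# Hypothetical helpers; the tag layout is assumed to match the sample in the question.
PAGE_ID = re.compile(r"<page>\s*<id>([^<]*)</id>")
REVISION = re.compile(
    r"<revision>\s*<id>([^<]*)</id>\s*<username>([^<]*)</username>\s*</revision>"
)


def page_rows(page_xml):
    """Return one (page_id, revision_id, username) tuple per <revision>."""
    page = PAGE_ID.search(page_xml)
    if page is None:
        return []  # not a <page> record in the expected layout
    page_id = page.group(1)
    # findall yields one (revision_id, username) pair per revision block
    return [(page_id, rev_id, user) for rev_id, user in REVISION.findall(page_xml)]


sample = """<page>
   <id>1</id>
<revision>
     <id>1</id>
     <username>muehlburger</username>
</revision>
<revision>
     <id>3</id>
     <username>user1</username>
</revision>
</page>"""

for row in page_rows(sample):
    print(" ".join(row))
```

Each tuple corresponds to one line of the requested output (page_id, revision_id, username).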



2012/5/17, Herbert Mühlburger <herbert.muehlbur...@gmail.com>:
> Hi list,
>
> I would like to parse the following XML file using Pig:
>
> <page>
>    <id>1</id>
> <revision>
>      <id>1</id>
>      <username>muehlburger</username>
> </revision>
> <revision>
>      <id>2</id>
>      <username>muehlburger</username>
> </revision>
> <revision>
>      <id>3</id>
>      <username>user1</username>
> </revision>
> ...
> <revision>
>      <id>34334398</id>
>      <username>muehlburger</username>
> </revision>
> </page>
> <page>
>    <id>2</id>
> <revision>
>      <id>343434</id>
>      <username>muehlburger</username>
> </revision>
> <revision>
>      <id>25343232</id>
>      <username>muehlburger</username>
> </revision>
> <revision>
>      <id>43434333</id>
>      <username>user2</username>
> </revision>
> ...
> <revision>
>      <id>5409589854</id>
>      <username>user5</username>
> </revision>
> </page>
> ...
>
> I would like to produce the following kind of csv output:
>
> page_id revision_id username
> 1 1 muehlburger
> 1 2 muehlburger
> 1 3 user1
> 1 34334398 muehlburger
> 2 343434 muehlburger
> 2 25343232 muehlburger
> 2 43434333 user2
> 2 5409589854 user5
>
> How can I accomplish this using Pig?
>
> Thank you very much for your help!
>
> Kind regards,
> Herbert
> --
> =================================================================
> Herbert Muehlburger  Software Development and Business Management
>                                      Graz University of Technology
> www.muehlburger.at                   www.twitter.com/hmuehlburger
> =================================================================
>
