Re: Json Parsing in Apache Pig

Ryan Prociuk Fri, 25 Jul 2014 17:09:57 -0700

I would recommend using the elephant-bird-pig JsonLoader.

I have used it quite extensively to parse nested JSON datasets with no issues.


You can download the jar files from Maven and REGISTER them in your script.

http://mvnrepository.com/artifact/com.twitter.elephantbird/elephant-bird-pig

https://github.com/kevinweil/elephant-bird/

You will need the following jars:
json-simple-1.1.x.jar
elephant-bird-pig-4.x.jar
elephant-bird-hadoop-compat-4.x.jar
elephant-bird-core-4.x.jar
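
As a rough sketch of the registration step: the /path/to/ directory below is a placeholder, and the "x" version suffixes are kept exactly as listed above, so substitute the versions you actually downloaded.

```pig
-- Hypothetical jar locations; adjust to wherever the downloads were saved.
REGISTER /path/to/json-simple-1.1.x.jar;
REGISTER /path/to/elephant-bird-core-4.x.jar;
REGISTER /path/to/elephant-bird-hadoop-compat-4.x.jar;
REGISTER /path/to/elephant-bird-pig-4.x.jar;
```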

Then parse the file:

A = LOAD '/hdfs-directory/' USING
com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS
 (json:map[]);

B = FOREACH A GENERATE
       json#'col1' AS col1;
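
Applied to the sample JSON from the original question, the same pattern might look like the sketch below. The input path and the alias names (cfg, wanted) are made up for illustration, and it assumes each line of the file holds one JSON object and that the jars are already registered.

```pig
-- Assumes the elephant-bird and json-simple jars have already been REGISTERed.
-- '/user/sree/config.json' is a hypothetical file with one JSON object per line.
cfg = LOAD '/user/sree/config.json' USING
    com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);

-- "fraction" sits inside the nested "elementInfo" object, so the map
-- lookup is chained; the other two keys are at the top level.
wanted = FOREACH cfg GENERATE
    (chararray) json#'elementInfo'#'fraction' AS fraction,
    (chararray) json#'destination'            AS destination,
    (chararray) json#'source'                 AS source;

DUMP wanted;
```

Note that this yields the values as fields of a relation. Pig substitutes $input-style parameters before the script runs, so these fields cannot be used directly in place of -param values.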

Ryan




On Fri, Jul 25, 2014 at 4:55 PM, Satish Kolli <[email protected]> wrote:

> Did you try the standard JsonLoader? I haven't personally used it, but it
> looks like you can specify the schema to extract/parse from your JSON.
>
> http://pig.apache.org/docs/r0.13.0/func.html#jsonloadstore
>
> If not, you can also look at the following example I found googling:
>
> https://gist.github.com/kimsterv/601331
>
>
> Thanks.
>
>
>
>
> On Fri, Jul 25, 2014 at 8:01 AM, praveenesh kumar <[email protected]>
> wrote:
>
> > One simple way is to write a UDF that acts as a JSON parser. Load your
> > data and then call your UDF to parse and extract whatever you want from
> > the JSON. You need to build what you want to get. Pig doesn't do that
> > for you; it gives you the capability to do that. How you do it is up to
> > you.
> >
> >
> > On Fri, Jul 25, 2014 at 12:09 PM, unmesha sreeveni <[email protected]>
> > wrote:
> >
> > > Hi
> > >
> > > This is my code for sampling
> > >
> > > --Sampling.pig
> > > --pig -x mapreduce -f Sampling.pig -param input=foo.csv -param output=OUT/pig -param delimiter="," -param fraction='0.05'
> > >
> > > --Load data
> > > inputdata = LOAD '$input' using PigStorage('$delimiter');
> > >
> > > --Group data
> > > groupedByAll = group inputdata all;
> > >
> > > --output into hdfs
> > > sampled = SAMPLE inputdata $fraction;
> > > store sampled into '$output' using PigStorage('$delimiter');
> > >
> > > I am passing the input parameters on the command line:
> > > pig -x mapreduce -f Sampling.pig -param input=foo.csv -param output=OUT/pig -param delimiter="," -param fraction='0.05'
> > >
> > > I would like to modify this so that it takes its input as JSON
> > > instead.
> > >
> > > sample json:
> > >
> > > {"Name":"sampling","elementInfo":{"fraction":"3"},"destination":"/user/sree/OUT","source":"/user/sree/foo.txt"}
> > >
> > > Now I need to parse the above JSON and extract the parameters I need.
> > > I know we can load JSON in Apache Pig, but how do I extract specific
> > > fields from it?
> > >
> > > From this JSON I only need
> > > fraction, destination, source
> > >
> > > Please suggest a way.
> > >
> > > --
> > > Thanks & Regards
> > >
> > > Unmesha Sreeveni U.B
> > > Hadoop, Bigdata Developer
> > > Center for Cyber Security | Amrita Vishwa Vidyapeetham
> > > http://www.unmeshasreeveni.blogspot.in/
> > >
> >
>



-- 
Ryan Prociuk | Engineering Distributed Data
