Bill, thanks.

So that is a confirmation: people have rolled their own, and it's not in
piggybank.
I would absolutely be willing to work with you to get a contribution going,
but (as a warning) I am extremely new to Pig.

I was looking at this:
http://wiki.apache.org/pig/UDFManual
to wrap my mind around the framework. I also discovered this:
https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java
(I am assuming this was the UDF you mentioned that inspired you).

A quick question about the UDFs registered at the top of a Pig script:

Does
REGISTER myJar.jar
distribute the jar across HDFS (like a Hadoop job jar) so that the
distribution of the code to the cluster nodes is transparent?
In other words, do we NOT have to distribute myJar.jar to each node on the
cluster?
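
To make the question concrete, here is a rough sketch of the kind of script
I have in mind. The class name com.example.pig.udf.JsonToTuple, the table
'docs' and the column 'content:json' are all hypothetical placeholders, not
anything that exists yet:

```
-- myJar.jar exists only on the machine where the script is launched
REGISTER myJar.jar;

-- hypothetical UDF class packaged inside myJar.jar
DEFINE jsonToTuple com.example.pig.udf.JsonToTuple();

-- hypothetical HBase table and column holding the raw JSON doc
rawdocs = LOAD 'hbase://docs'
     USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('content:json')
     AS (json:chararray);

doc = FOREACH rawdocs GENERATE jsonToTuple(json) AS doc;
```

The question is whether the REGISTER line alone is enough for the tasks
running on the cluster nodes to find the JsonToTuple class.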

thanks more,
daniel



On Tue, Apr 19, 2011 at 1:57 PM, Bill Graham <billgra...@gmail.com> wrote:

> We're doing the same thing using a JsonToMap UDF followed by a
> MapToBag UDF. The former was similarly inspired by the elephant bird
> JSONLoader. I'd be glad to collaborate on a contribution if you'd
> like.
>
> Here's what our scripts look like:
>
> define mapToBag cnwk.hadoop.mapreduce.pig.udf.MapToBag();
> define jsonToMap cnwk.hadoop.mapreduce.pig.udf.JsonToMap();
> define concat org.apache.pig.builtin.StringConcat();
>
> raw = LOAD 'hbase://user_info'
>      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage( 'events:*')
>      AS (events_map:map[]);
>
> -- Convert our maps to bags so we can flatten them out
> B = FOREACH raw GENERATE mapToBag(events_map) AS event_bag;
>
> C = FOREACH B GENERATE FLATTEN(event_bag)
>      AS (event_k:chararray, event_v:chararray);
>
> -- Convert the JSON events into maps
> D = FOREACH C GENERATE event_k, jsonToMap(event_v) AS event_map:map[];
>
> -- Example showing how to filter on a given field
> E = FILTER D BY (event_map#'levt.astid' IS NOT NULL AND
> event_map#'levt.asid' IS NOT NULL);
>
> -- Example showing how to pull data out of a map
> F = FOREACH E GENERATE event_map#'levt.asid' AS asid,
>                        event_map#'levt.astid' AS astid;
>
>
> thanks,
> Bill
>
> On Tue, Apr 19, 2011 at 10:08 AM, Daniel Eklund <doekl...@gmail.com> wrote:
> > I noticed that there is a Pig JSON Loader (which might or might not be in
> > piggybank).
> > Could anyone confirm the existence or absence of a JSONToTuple UDF (not a
> > loader)?
> >
> > I am inspired by the UDF mentioned on Slide 23 here:
> > http://www.slideshare.net/danharvey/hbase-at-mendeley
> >
> >  doc = FOREACH rawdocs GENERATE DocumentProtobufBytesToTuple(protodoc) as DOC;
> >
> > My desire is to store a raw JSON doc in a cell in HBase and run Pig queries
> > against the tuples generated by the UDF.
> > I used the HBase Loader already to get the cell data, and now I need a
> > JSON deserializer.
> >
> > I would be willing to roll my own (and contribute), but I figured I'd see if
> > there was anything out there first.
> >
> > thanks,
> > daniel
> >
>
