Re: elephantbird JsonLoader doesn't like gz?

2011-05-19 Thread Eric Lubow
> > >> Again this fails:
> > >>
> > >> raw_json = LOAD 'cc.json.gz' USING
> > >> com.twitter.elephantbird.pig.load.JsonLoader();
> > >>
> > >> this works:
> > >>
> > >> $ gunzip cc.json.gz
> > >> raw_json = LOAD 'cc.json' USING
> > >> com.twitter.elephantbird.pig.load.JsonLoader();
> > >>
> > >> Any suggestions for this? Or is there any other json loader library out
> > >> there? I can write my own but would rather use one if already exists.
> > >>
> > >> Thanks,
> > >>
> > >> Dexin
> > >>
> > >
> >
>
Eric Lubow
e: eric.lu...@gmail.com
w: eric.lubow.org

Re: Loader UDF with variable schema

2011-03-13 Thread Eric Lubow
ples in pig. Is there a better/more
> efficient way to do this?
> I would like to avoid having loading logic in both the udf and the pig
> script, and generate all "final" tuples in the udf, and then just use a
> split in pig.
> Thanks,
> Marko
>
Eric Lubow
e: eric.lu...@gmail.com
w: eric.lubow.org

Re: Limiting output

2011-03-09 Thread Eric Lubow
xpression I want to get 'x' number of
> urls matching the regex pattern. I have written a UDF to filter out
> urls based on regular expression. Is there a way in Pig script to
> limit the number of results to 'x' ? ( 'x' is some configurable value)
>
> T

Re: [DISCUSSION] Pig.next

2011-03-03 Thread Eric Lubow
>>> (1) We are mature enough and produce good quality releases
>>> (2) Our interface no longer change in major ways
>>> (3) We have a growing user community and we want the newcomers to know
>>> that our releases are stable
>>> (4) If the next release is 0.10 and we decide that we should switch on
>>> the following release going from 0.10 to 1.0 will generate a lot of
>>> confusion.
>>>
>>> I wanted to start this conversation and see what others think before
>>> deciding if it is worth while to call a vote.
>>>
>>> Olga
>>>
>>>
>
Eric Lubow
e: eric.lu...@gmail.com
w: eric.lubow.org

Re: Reading Gzip Files

2011-02-22 Thread Eric Lubow
4)
> (98390,572)
> (98391,567)
>
> Looks great. I'm going to blame it on your version? I'm using pig-0.8
> and hadoop 0.20.2.
>
> --jacob
> @thedatachef
>
>
> On Tue, 2011-02-22 at 08:21 -0500, Eric Lubow wrote:
> > I apologize for the double mailing:
> >

Re: Reading Gzip Files

2011-02-22 Thread Eric Lubow
I apologize for the double mailing:

grunt> Y = LOAD 'hdfs:///mnt/test.log.gz' AS (line:chararray);
grunt> foo = LIMIT Y 5;
grunt> dump foo
<0\Mtest.log?]?o?H??}?)

It didn't work out of HDFS.

-e

On Tue, Feb 22, 2011 at 08:18, Eric Lubow wrote:
> I'm n
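
The `dump` output in this message looks like raw gzip bytes (a gzip header can carry the original file name, which would explain the visible `test.log`), so the file was apparently read without being decompressed. A minimal sketch for checking a file outside Pig, assuming plain Java with only `java.util.zip`: the class name `GzipCheck` and its helper methods are hypothetical and are not part of Pig, Hadoop, or elephant-bird.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not an API of Pig or Hadoop.
public class GzipCheck {

    // Returns true if the data starts with the gzip magic bytes 0x1f 0x8b.
    static boolean looksLikeGzip(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    // Decompresses the stream and returns at most `limit` lines,
    // similar in spirit to LOAD followed by LIMIT in the grunt session.
    static List<String> firstLines(InputStream gz, int limit) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gz), StandardCharsets.UTF_8))) {
            String line;
            while (lines.size() < limit && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Builds an in-memory gzip sample so the sketch runs without any files.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = gzip("line one\nline two\nline three\n");
        System.out.println(looksLikeGzip(sample));
        System.out.println(firstLines(new ByteArrayInputStream(sample), 2));
    }
}
```

If a file passes this check but Pig still dumps binary data, the file itself is probably fine and the loader or input format is the more likely place to look.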

Re: Reading Gzip Files

2011-02-22 Thread Eric Lubow
tFormat();
>>    } else {
>>        return new PigTextInputFormat();
>>    }
>> }
>>
>> And in my custom loader was :
>>
>> public InputFormat getInputFormat() {
>>     return new TextInputFormat();
>> }
>>
>>
>> I just co

Reading Gzip Files

2011-02-21 Thread Eric Lubow
not compressed. Since the logs are compressed, my hands are tied. Any suggestions to get me moving in the right direction?

Thanks.

-e
--
Eric Lubow
e: eric.lu...@gmail.com
w: eric.lubow.org

JSON Loading on EMR

2011-02-17 Thread Eric Lubow
":"(.*[^"])","logged_at":"(.*[^"])"}')) AS (exchange_id:chararray,exchange_user_id:chararray,bid_id:chararray,bid_amount:float,win_amount:float,ad_ids:chararray,wv:int,logged_at:chararray);
WIDGET_VERSION_ONLY = FOREACH LOGS_BASE GENERATE wv;
WIDGET_VERSION_COUNT = FOREACH (GROUP WIDGET_VERSION_ONLY BY $0) GENERATE $0, COUNT($1) as num;
WIDGET_VERSION_SORTED_COUNT = LIMIT(ORDER WIDGET_VERSION_COUNT BY num DESC) 5;

Any help that would push me in the right direction would be greatly appreciated.

-e
--
Eric Lubow
e: eric.lu...@gmail.com
w: eric.lubow.org