Re: Congratulations to Cheolsoo Park the new Apache Pig project chair

2014-03-20 Thread Aniket Mokashi
Woo!! Congrats Cheolsoo...


On Thu, Mar 20, 2014 at 4:25 AM, Rohini Palaniswamy  wrote:

> Thanks Julien. Great job last year.
>
> Congratulations, Cheolsoo!!!  Well deserved. Great job past 2 years with
> awesome number of commits and reviews.
>
>
> On Thu, Mar 20, 2014 at 2:07 AM, Lorand Bendig  wrote:
>
> > Congratulations, Cheolsoo!
> >
> > --Lorand
> >
> >
> > On 03/20/2014 02:03 AM, Julien Le Dem wrote:
> >
> >> Congrats Cheolsoo,
> >> This is well deserved.
> >> Julien
> >> .
> >>
> >>
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Welcome to the new Pig PMC member Aniket Mokashi

2014-01-25 Thread Aniket Mokashi
Thanks everyone for the wishes.

(My filters did a trick on me and I did not notice this email until today).
Thanks everyone for helping me out over all these years.

~Aniket


On Wed, Jan 15, 2014 at 11:33 AM, Daniel Dai  wrote:

> Congratulation!
>
>
> On Wed, Jan 15, 2014 at 10:12 AM, Mona Chitnis  wrote:
>
> > Congrats Aniket! Good work!
> >
> > --
> >
> > Mona Chitnis
> > Software Engineer, Hadoop Team
> > Yahoo!
> >
> >
> >
> > On Wednesday, January 15, 2014 9:17 AM, Xuefu Zhang  >
> > wrote:
> >
> > Congratulations, Aniket!
> >
> > --Xuefu
> >
> >
> >
> > On Tue, Jan 14, 2014 at 11:54 PM, Prasanth Jayachandran <
> > pjayachand...@hortonworks.com> wrote:
> >
> > > Congrats Aniket!
> > >
> > > Thanks
> > > Prasanth Jayachandran
> > >
> > > On Jan 15, 2014, at 10:30 AM, Bill Graham 
> wrote:
> > >
> > > > Woo! Congrats Aniket!
> > > >
> > > >
> > > > On Tue, Jan 14, 2014 at 8:47 PM, Olga Natkovich <
> onatkov...@yahoo.com
> > > >wrote:
> > > >
> > > >> Congrats, Aniket!
> > > >>
> > > >>
> > > >>
> > > >> On Tuesday, January 14, 2014 8:32 PM, Tongjie Chen <
> > > tongjie.c...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> Congrats Aniket!
> > > >>
> > > >>
> > > >>
> > > >> On Tue, Jan 14, 2014 at 8:12 PM, Cheolsoo Park <
> piaozhe...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Congrats Aniket!
> > > >>>
> > > >>>
> > > >>> On Tue, Jan 14, 2014 at 7:01 PM, Jarek Jarcec Cecho <
> > jar...@apache.org
> > > >>>> wrote:
> > > >>>
> > > >>>> Congratulations Aniket, good work!
> > > >>>>
> > > >>>> Jarcec
> > > >>>>
> > > >>>> On Tue, Jan 14, 2014 at 06:52:10PM -0800, JULIEN LE DEM wrote:
> > > >>>>> It's my pleasure to announce that Aniket Mokashi became the
> newest
> > > >>>> addition to the Pig PMC.
> > > >>>>> Aniket has been actively contributing to Pig for years.
> > > >>>>> Please join me in congratulating Aniket!
> > > >>>>>
> > > >>>>> Julien
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > *Note that I'm no longer using my Yahoo! email address. Please email
> me
> > > > at billgra...@gmail.com  going forward.*
> > >
> > >
> >
>
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Python UDFs with Pig (support for Filter functions?)

2013-09-05 Thread Aniket Mokashi
https://cwiki.apache.org/confluence/display/PIG/UDFsUsingScriptingLanguages

Now that we have a boolean datatype, a filter function is just an EvalFunc that
returns a boolean.
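
For example, a minimal sketch of wiring one up (the file name udfs.py and the
function is_valid are hypothetical; on the Python side the function would be
annotated with an output schema of boolean, per the wiki page above):

REGISTER 'udfs.py' USING jython AS udfs;  -- hypothetical Jython UDF file
raw = LOAD 'dirty_data' USING PigStorage(',') AS (field1:chararray, field2:int);
clean = FILTER raw BY udfs.is_valid(field1, field2);
DUMP clean;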

~Aniket


On Wed, Sep 4, 2013 at 12:01 PM, Serega Sheypak wrote:

> It should work.
> filtered_result = FILTER dirty_data udf.my_python_filter_func(field1,
> field2);
>
>
> 2013/9/4 Max Von Tilden 
>
> > Quick question from a Pig noob...in 0.11 does Pig support developing
> > filter functions developed in Python? Is there any documentation or
> > examples that anyone knows of?
> > thx,
> > John
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: ERROR: java.lang.Long cannot be cast to java.lang.String

2013-08-18 Thread Aniket Mokashi
Hi Sonia,

Try adding another pair of parentheses, e.g.:

((int)(RegexMatch((chararray) genre_id, '\\d+')) == 1 ? (chararray)genre_id : '-1001') as genre_id


On Thu, Aug 15, 2013 at 4:28 PM, sonia gehlot wrote:

> Hi,
>
> I have pigscript in which I am flattening it and assign schema to it and
> trying to do some REGEX matching on top of it after that converting it to
> INT. But its giving me error "ERROR: java.lang.Long cannot be cast to java.
> lang.String"
>
> Here is a snippet of code where I am getting error:
> 
>
> final_flatten = foreach flattened_further generate .. watched_evidence,
> flatten(myop) as (rank:int,list:chararray), row..device_type_id;
>
>
> final_cast = foreach final_flatten generate
>
> (int)dateint,
>
> (long)event_utc_ms,
>
> (int)hour,
>
> (long)(RegexMatch((chararray) account_id, '\\d+') == 1 ?
> (chararray)account_id
> : '-1001') as account_id,
>
> request_data_type,
>
> client_request_id,
>
> (int)(RegexMatch((chararray) device_type_id, '\\d+') == 1 ?
> (chararray)device_type_id
> : '-1001') as device_type_id,
>
> (int)(RegexMatch((chararray) max_list_index, '\\d+') == 1 ?
> (chararray)max_list_index
> : '-1001') as max_list_index,
>
> esn,
>
> (long)(RegexMatch((chararray) epoch_create_ts, '\\d+') == 1 ?
> (chararray)epoch_create_ts
> : '-1001') as request_create_ts,
>
> socially_connected,
>
> gps_model,
>
> country_iso_code,
>
> status_code,
>
> uuid,
>
> (long)(RegexMatch((chararray) visitorid, '\\d+') == 1 ?
> (chararray)visitorid: '-1001') as account_profile_id,
>
> (int)(RegexMatch((chararray) track_id, '\\d+') == 1 ? (chararray)track_id :
> '-1001') as location_id,
>
> sub_root_uuid,
>
> list_type,
>
> item_type,
>
> hasevidence,
>
> listContext,
>
> (int)(RegexMatch((chararray) genre_id, '\\d+') == 1 ? (chararray)genre_id :
> '-1001') as genre_id,
>
> taste_evidence,
>
> rated_evidence,
>
> watched_evidence,
>
> *(int)(RegexMatch((chararray) list, '\\d+') == 1 ? (chararray)list :
> '-1001') as source_title_id,*
>
> row as presentation_row_number,
>
> rank as presentation_rank_number;
>
> z = limit final_cast 10;
>
> dump z;
>
> 
>
> It is returning correct results other than one field
>
> "*(int)(RegexMatch((chararray) list, '\\d+') == 1 ? (chararray)list :
> '-1001') as source_title_id,*"
>
> for this I am getting error *"ERROR: java.lang.Long cannot be cast to java.
> lang.String"*
>
> I tried explicit casting, but it is still giving me error.
>
> Any idea what I am doing wrong here.
>
> Thanks,
>
> Sonia
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Pig and Storm

2013-07-23 Thread Aniket Mokashi
The following projects might interest you:

Pig and Spark: https://github.com/twitter/pig/tree/spork
Storm and Hadoop:
https://speakerdeck.com/sritchie/summingbird-streaming-mapreduce-at-twitter

Thanks,
Aniket


On Tue, Jul 23, 2013 at 11:18 PM, Russell Jurney
wrote:

> I think a Storm backend for Pig would be AWESOME. Btw, check out
> HStreaming. It's not FOSS, but shows there is demand.
> http://www.hstreaming.com/products/community/
>
> Russell Jurney http://datasyndrome.com
>
> On Jul 23, 2013, at 9:53 AM, Pradeep Gollakota 
> wrote:
>
> Hi Pig Developers,
>
> I wanted to reach out to you all and ask for you opinion on something.
>
> As a Pig user, I have come to love Pig as a framework. Pig provides a great
> set of abstractions that make working with large datasets easy. Currently
> Pig is only backed by hadoop. However, with the new rise of Twitter Storm
> as a distributed real time processing engine, Pig users are missing out on
> a great opportunity to be able to work with Pig in Storm. As a user of Pig,
> Hadoop and Storm, and keeping with the Pig philosophy of "Pigs live
> anywhere," I'd like to get your thoughts on starting the implementation of
> a Pig backend for Storm.
>
> Thanks
> Pradeep
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Replicated Join and OOM errors

2013-07-21 Thread Aniket Mokashi
Pig does not currently have a way to do this. Development of features like
this is tracked at https://issues.apache.org/jira/browse/PIG-2784.
Feel free to add a subtask and take a stab at it.
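
In the meantime, one manual workaround (along the lines of the reply quoted
below) is to partition both inputs on the join key so that each replicated join
only has to load a fraction of the small table. A rough Pig sketch, with
hypothetical aliases and an integer join key named key:

-- 2-way split shown; use more ways if the small table is still too large
SPLIT small INTO small_0 IF key % 2 == 0, small_1 IF key % 2 == 1;
SPLIT big INTO big_0 IF key % 2 == 0, big_1 IF key % 2 == 1;
-- each replicated join now only loads part of the small table into memory
j0 = JOIN big_0 BY key, small_0 BY key USING 'replicated';
j1 = JOIN big_1 BY key, small_1 BY key USING 'replicated';
joined = UNION j0, j1;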

~Aniket


On Fri, Jul 19, 2013 at 12:58 PM, Mehmet Tepedelenlioglu <
mehmets...@yahoo.com> wrote:

> You can always split your tables such that same keys end up in same
> splits. Then you replicated join the corresponding splits and take the
> union.
>
> On Jul 19, 2013, at 12:26 PM, Arun Ahuja  wrote:
>
> > I have been using a replicated join to join on very large set of data
> with
> > another one that is about 1000x smaller.  Generally seen large
> performance
> > gains.
> >
> > However, they do scale together, so that now  even though the RHS table
> is
> > still 1000x smaller, it is too large to fit into memory.  There will
> happen
> > on only every 20th or so dataset that join is performed on, but I'd like
> to
> > have something robust built to handle this.
> >
> > Is there anyway to setup the replicated join to back to a regular join
> only
> > on memory issues?  Or any type of conditional I could set to check the
> > dataset size first?  Willing to even dig into the Pig could and implement
> > this if anyone has ideas.
> >
> > Thanks
> >
> > Arun
>
>


-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Single Output file from STORE command

2013-06-03 Thread Aniket Mokashi
You can use Pig to do what "hadoop fs -getmerge" does, in a separate Pig
script. It will still use a single reducer, though.
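
A minimal sketch of such a merge script (paths are placeholders; assumes
tab-delimited output from the first job):

parts = LOAD '/path/to/job/output' USING PigStorage('\t');
grouped = GROUP parts ALL;                -- forces everything through one reducer
merged = FOREACH grouped GENERATE FLATTEN($1);
STORE merged INTO '/path/to/merged' USING PigStorage('\t');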


On Tue, May 28, 2013 at 8:29 AM, Alan Gates  wrote:

> Nothing that uses MapReduce as an underlying execution engine creates a
> single file when running multiple reducers because MapReduce doesn't.  The
> real question is if you want to keep the file on Hadoop, why worry about
> whether it's a single file?  Most applications on Hadoop will take a
> directory as an input and read all the files contained in it.
>
> Alan.
>
> On May 24, 2013, at 12:11 PM, Mix Nin wrote:
>
> > STORE command produces multiple output files. I want a single output file
> > and I tried using command as below
> >
> > STORE (foreach (group NoNullData all) generate flatten($1))  into '';
> >
> > This command produces one single file but at the same time forces to use
> > single reducer which kills performance.
> >
> > How do I overcome the scenario?
> >
> > Normally   STORE command produces multiple output files, apart from that
> I
> > see another file
> > "_SUCCESS" in output directory. I ma generating metadata file  ( using
> > PigStorage('\t', '-schema') ) in output directory
> >
> > I thought of using  getmerge as follows
> >
> > *hadoop* fs -*getmerge*
> >
> > But this requires
> > 1)eliminating files other than data files in HDFS directory
> > 2)It creates a single file in local directory but not in HDFS directory
> > 3)I need to again move file from local directory to HDFS directory which
> > may  take additional time , depending on size of single file
> > 4)I need to agin place the files which I eliminated in Step 1
> >
> >
> > Is there an efficient way for my problem?
> >
> > Thanks
>
>


-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Pig architecture explanation?

2013-03-21 Thread Aniket Mokashi
Also-
https://cwiki.apache.org/confluence/display/PIG/Guide+for+new+contributors

~Aniket


On Sun, Mar 17, 2013 at 4:37 PM, Prashant Kommireddi wrote:

> Hi Gardner,
>
> This paper would be a good starting point
> http://infolab.stanford.edu/~olston/publications/vldb09.pdf
>
> Additionally, you could check out some other material here
> https://cwiki.apache.org/confluence/display/PIG/PigTalksPapers
>
>
> On Mar 17, 2013, at 4:26 PM, Gardner Pomper 
> wrote:
>
> > Hello all,
> >
> > When I first saw pig, I was under the impressing that it generated java
> > code for a series of map/reduce jobs and then submitted that to hadoop. I
> > have since seen messages that indicate the is not the way it works.
> >
> > I have been trying to find a document (preferably with diagrams) that
> shows
> > what the pig architecture is and how the various mappers/reducers are
> > defined and spawned.
> >
> > I would appreciate it if someone could point me to that documentation.
> >
> > Sincerely,
> >
> > - Gardner
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Unable to upload to S3

2013-03-04 Thread Aniket Mokashi
What's BCCKIAJV5KGMZVA:xmw5F7I4AWd6rDRA@?

To work with S3:
1. Your path should be s3n://bucket-name/key
2. Have your AWS keys in core-site.xml (see the sketch below)
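
For example, a sketch with a placeholder bucket/path (the credentials go into
core-site.xml as fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey, not into
the URI):

A = LOAD '/data/input' USING PigStorage();          -- hypothetical HDFS input
STORE A INTO 's3n://my-bucket/1/2/a' USING PigStorage();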



On Mon, Mar 4, 2013 at 3:32 PM, Mohit Anchlia wrote:

> I am trying to upload to S3 using pig but I get:
>
> grunt> store A into 's3://BCCKIAJV5KGMZVA:xmw5F7I4AWd6rDRA@
> /bucket/1/2/a';
> 2013-03-04 18:24:39,475 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> script: UNKNOWN
> 2013-03-04 18:24:39,528 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1002: Unable to store alias A
> Details at logfile: /data-ebs/misc/pig/pig_1362439271484.log
> (java.lang.IllegalArgumentException: Invalid hostname in URI
> s3://BCCKIAJV5KGMZVA:xmw5F7I4AWd6rDRA@/bucket/1/2/a)
> (,at
> org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:41))
> (,at
>
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:478))
> (,at
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1453))
> (,at org.apache.hadoop.fs.FileSystem.access$100(FileSystem.java:69))
> (,at
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1487))
> (,at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1469))
> (,at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:235))
> (,at org.apache.hadoop.fs.Path.getFileSystem(Path.java:191))
> (,at
>
> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131))
> (,at
>
> org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:80))
> (,at
> org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:77))
> (,at
>
> org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64))
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Fwd: Replicated join: is there a setting to make this better?

2013-02-21 Thread Aniket Mokashi
I think the email was filtered out. Resending.

-- Forwarded message --
From: Aniket Mokashi 
Date: Wed, Feb 20, 2013 at 1:18 PM
Subject: Replicated join: is there a setting to make this better?
To: "d...@pig.apache.org" 


Hi devs,

I was looking into the size/record limitations of the fragment replicated join
(map join) in Pig. To test this, I loaded a map (a.k.a. fragment) of longs into
one alias and joined it with another alias that had a few other columns. With a
map file of 50 MB I saw GC-overhead errors on the mappers. I took a heap dump
of a mapper to look into what was causing the GC overhead and found that the
memory footprint of the fragment itself was high.

[image: Inline image 1]

Note that the hashmap was only able to load about 1.8 million records:
[image: Inline image 2]
The reason is that every map record has an overhead of about 1.5 KB. Most of it
is part of the retained heap, but it still needs to be garbage collected.
[image: Inline image 3]

So, it turns out:

Size of heap required by a map join (from above) = 1.5 KB * number of records
+ size of input (uncompressed DataByteArray), assuming the key is a long.

So, to run your replicated join, you need to satisfy the following criterion:

1.5 KB * number of records + size of input (uncompressed) < estimated free
memory in the mapper (total heap - io.sort.mb - some minor constant, etc.)

Is that the right conclusion? Is there a setting/way to make this better?

Thanks,

Aniket




-- 
"...:::Aniket:::... Quetzalco@tl"


Re: [ANNOUNCE] Welcome Bill Graham to join Pig PMC

2013-02-20 Thread Aniket Mokashi
Congrats Bill !!


On Wed, Feb 20, 2013 at 9:44 AM, Julien Le Dem  wrote:

> Congrats!
>
>
> On Wed, Feb 20, 2013 at 6:45 AM, Gianmarco De Francisci Morales <
> g...@gdfm.me> wrote:
>
> > Congrats Bill! :)
> >
> > --
> > Gianmarco
> >
> >
> > On Wed, Feb 20, 2013 at 10:00 AM, Jonathan Coveney  > >wrote:
> >
> > > congrats :)
> > >
> > >
> > > 2013/2/20 Jarek Jarcec Cecho 
> > >
> > > > Congratulations Bill, good job!
> > > >
> > > > Jarcec
> > > >
> > > > On Tue, Feb 19, 2013 at 01:48:18PM -0800, Daniel Dai wrote:
> > > > > Please welcome Bill Graham as our latest Pig PMC member.
> > > > >
> > > > > Congrats Bill!
> > > >
> > >
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Welcome our newest committer Cheolsoo Park

2012-10-31 Thread Aniket Mokashi
Congrats Cheolsoo...


On Fri, Oct 26, 2012 at 4:26 PM, Santhosh M S wrote:

> Congratulations Cheolsoo! Looking forward to more from you.
>
> Regards,
> Santhosh
>
>
> 
>  From: Julien Le Dem 
> To: d...@pig.apache.org; user@pig.apache.org
> Sent: Friday, October 26, 2012 2:54 PM
> Subject: Welcome our newest committer Cheolsoo Park
>
> All,
>
> Please join me in welcoming Cheolsoo Park as our newest Pig committer.
> He's been contributing to Pig for a while now, helping fixing the
> build and improve Pig. We look forward to him being a part of the
> project.
>
> Julien
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: [ANNOUNCE] Welcome new Apache Pig Committers Rohini Palaniswamy

2012-10-31 Thread Aniket Mokashi
Congrats Rohini...


On Mon, Oct 29, 2012 at 11:31 AM, Julien Le Dem  wrote:

> Congrats Rohini !
>
>
> On Sun, Oct 28, 2012 at 9:42 AM, Bill Graham  wrote:
> > Congrats Rohini! Great news indeed.
> >
> > On Saturday, October 27, 2012, Jon Coveney wrote:
> >
> >> Wonderful news!
> >>
> >> On Oct 26, 2012, at 9:51 PM, Gianmarco De Francisci Morales <
> >> g...@apache.org > wrote:
> >>
> >> > Congratulations Rohini!
> >> > Welcome onboard :)
> >> > --
> >> > Gianmarco
> >> >
> >> >
> >> > On Fri, Oct 26, 2012 at 7:32 PM, Prasanth J <
> buckeye.prasa...@gmail.com>
> >> wrote:
> >> >> Congrats Rohini!
> >> >>
> >> >> Thanks
> >> >> -- Prasanth
> >> >>
> >> >> On Oct 26, 2012, at 10:21 PM, Santhosh Srinivasan <
> >> santhosh_mut...@yahoo.com > wrote:
> >> >>
> >> >>> Congrats Rohini! Full speed ahead now :)
> >> >>>
> >> >>> On Oct 26, 2012, at 4:37 PM, Daniel Dai  >
> >> wrote:
> >> >>>
> >>  Here is another Pig committer announcement today. Please welcome
> >>  Rohini Palaniswamy to be a Pig committer!
> >> 
> >>  Thanks,
> >>  Daniel
> >> >>
> >>
> >
> >
> > --
> > Sent from Gmail Mobile
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: dryrun versus grunt

2012-10-09 Thread Aniket Mokashi
The fix is-
grunt> run -param key=value script.pig

~Aniket

On Mon, Oct 8, 2012 at 5:56 PM, Russell Jurney wrote:

> I did not know about the run command from inside grunt, but generally
> speaking grunt does not yet support macros or parameters. I am eager
> to get this fixed, myself.
>
> Russell Jurney http://datasyndrome.com
>
> On Oct 8, 2012, at 5:53 PM, Lauren Blau
>  wrote:
>
> > I have a script something like
> >
> > DEFINE udf ..
> > DEFINE udf2 ..
> >
> > IMPORT 'macros.pig'
> >
> > rel = calltomacro('string',$keyparam);
> > rel2 = calltomacro('string2',$keyparam);
> > 
> >
> >
> > if I run this with pig -p keyparam=testparam --dryrun script.pig
> > I get a valid script.pig.expanded created.
> >
> > but if I run pig -p keyparam=testparam
> > grunt> run script.pig (or IMPORT 'pig.script';)
> >
> > it fails with an error 'undefined param $keyparam'
> > why does this behave differently during --dryrun and inside grunt?
> >
> > Thanks,
> > lauren
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Input and output path

2012-09-14 Thread Aniket Mokashi
You can do something similar to
https://cwiki.apache.org/PIG/faq.html#FAQ-Q%253AIloaddatafromadirectorywhichcontainsdifferentfile.HowdoIfindoutwherethedatacomesfrom%253F

Get the input path from Pig and then substitute the values for date, hour, etc.
You also have to override the getSchema method so that Pig gets to see these
fields.

Just beware of https://issues.apache.org/jira/browse/PIG-2462

Thanks,
Aniket

On Thu, Sep 13, 2012 at 2:04 PM, Ruslan Al-Fakikh wrote:

> MiaoMiao, Mohit,
>
> If we are talking about embedding Pig into Python, I'd like to add
> that we can also embed Pig into Java using PigServer
> http://wiki.apache.org/pig/EmbeddedPig
>
> MiaoMiao, what's the purpose of embedding here (if we already have
> parameter substitution feature)? I guess Pig embedding is mostly
> suitable in case we want to add IF/ELSE or LOOP functionality
>
> Thanks
>
> On Thu, Sep 13, 2012 at 6:31 AM, MiaoMiao  wrote:
> > I wrote a python script to do this
> >
> > import sys
> > mmddhh = sys.argv[1]
> > inputPath = getInputPath(mmddhh) #mmddhh to "/MM/DD/HH/input"
> > outputPath = getOutputPath(mmddhh) #mmddhh to
> "/MM/DD/HH/output"
> > pigScript = '''
> > some = load '$input' using PigStorage(',')
> > as(
> > id:INT,
> > value:INT
> > );
> > final = . ;
> > STORE final INTO '$output' using PigStorage(',');
> > '''
> > P = Pig.compile(pigScript)
> > result = P.bind({'input':inputPath, 'output':outputPath}).runSingle()
> > if result.isSuccessful() :
> > print 'Pig job succeeded'
> > else :
> > raise 'Pig job failed'
> >
> > Then you can run it with pig
> > pig -x local pig.py 2012091108
> >
> > On Tue, Sep 11, 2012 at 7:11 AM, Mohit Anchlia 
> wrote:
> >> Our input path is something like /MM/DD/HH/input and we like to
> write
> >> to /MM/DD/HH/output . Is it possible to get the input path as a
> String
> >> and convert it to /MM/DD/HH/output that I can use in "store into"
> >> clause?
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Reading BytesWritable in sequence file

2012-09-14 Thread Aniket Mokashi
For a simpler use case, something similar to the following should work:

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.pig.builtin.PigStorage;

public class PigSequenceFileLoader extends PigStorage {
    @SuppressWarnings("rawtypes")
    @Override
    public InputFormat getInputFormat() {
        return new SequenceFileInputFormat();
    }
}

Thanks,

Aniket

On Thu, Sep 13, 2012 at 1:24 PM, Dmitriy Ryaboy  wrote:

> Install protocol buffers 2.3 and thrift 0.5
>
> From the readme:
>
> Protocol Buffer and Thrift compiler dependencies
> Elephant Bird requires Protocol Buffer compiler version 2.3 at build
> time, as generated classes are used internally. Thrift compiler is
> required to generate classes used in tests. As these are native-code
> tools they must be installed on the build machine (java library
> dependencies are pulled from maven repositories during the build).
>
>
>
> D
>
> On Wed, Sep 12, 2012 at 9:01 PM, Mohit Anchlia 
> wrote:
> > I got this error when I ran mvn package
> >
> > [ERROR] Failed to execute goal
> > com.github.igor-petruk.protobuf:protobuf-maven-pl
> > ugin:0.4:run (default) on project elephant-bird-core: Unable to find
> > 'protoc' ->
> >  [Help 1]
> > [ERROR]
> >
> > On Tue, Sep 11, 2012 at 4:24 PM, Mohit Anchlia  >wrote:
> >
> >> Thanks! I'll try it out.
> >>
> >>
> >> On Tue, Sep 11, 2012 at 4:21 PM, Dmitriy Ryaboy  >wrote:
> >>
> >>> Yup:
> >>> https://github.com/kevinweil/elephant-bird
> >>>
> >>> D
> >>>
> >>> On Tue, Sep 11, 2012 at 4:00 PM, Mohit Anchlia  >
> >>> wrote:
> >>> > Is it the code that I checkout and build?
> >>> >
> >>> > On Tue, Sep 11, 2012 at 3:27 PM, Dmitriy Ryaboy 
> >>> wrote:
> >>> >
> >>> >> Try the one in Elephant-Bird.
> >>> >>
> >>> >> On Tue, Sep 11, 2012 at 11:22 AM, Mohit Anchlia <
> >>> mohitanch...@gmail.com>
> >>> >> wrote:
> >>> >> > Is there a way to read BytesWritable using sequence file loader
> from
> >>> >> > piggybank? If not then how should I go about implementing one?
> >>> >>
> >>>
> >>
> >>
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Counters from Python UDF

2012-08-24 Thread Aniket Mokashi
I used following in my python udf (on pig 0.9) after referring to -
http://squarecog.wordpress.com/2010/12/24/incrementing-hadoop-counters-in-apache-pig/


from org.apache.pig.tools.pigstats import PigStatusReporter
reporter = PigStatusReporter.getInstance();

But it looks like the context is not set in the reporter when the UDF is
invoked, so it fails. I think we need some caching logic similar to
PigCountersHelper, until something sets the context. I wonder how this
works.

We could add a helper UDF in the JythonScriptEngine.init (or some such) method
to expose these more elegantly. Thoughts?

~Aniket

On Thu, Aug 23, 2012 at 2:43 PM, Jonathan Coveney wrote:

> In trunk this should be possible (it's possible in 0.10 as well, I just am
> not sure if PigCountersHelper is there). Either way, take a look at
> PigCountersHelper. All you have to do is instantiate a copy in your UDF and
> use it from there.
>
> This hinges on all of the static stuff that Pig relies on working... I
> think that the way that we invoke these scripting languages should work,
> but this will verify that :)
>
> 2012/8/23 Duckworth, Will 
>
> > This may be a better question for the DEV list but ... Is it even
> possible
> > / feasible.  Could it be done by calling the Java classes from within
> > Jython?
> >
> > I guess I would ask the same about algebraic and accumulator UDF which I
> > know are available in Ruby.
> >
> > -Original Message-
> > From: Aniket Mokashi [mailto:aniket...@gmail.com]
> > Sent: Friday, August 17, 2012 5:54 PM
> > To: user@pig.apache.org
> > Subject: Re: Counters from Python UDF
> >
> > I dont think there is a way at this point. You may have to open a jira.
> >
> > Thanks,
> > Aniket
> >
> > On Fri, Aug 17, 2012 at 7:03 AM, Duckworth, Will <
> wduckwo...@comscore.com
> > >wrote:
> >
> > > Has anyone poked around to see if there is there a way to create /
> > > increment counters from a Python UDFs?  Thanks.
> > >
> > >
> > >
> > > Will Duckworth Senior Vice President, Software Engineering | comScore,
> > > Inc. (NASDAQ:SCOR)
> > >
> > > o +1 (703) 438-2108 | m +1 (301) 606-2977 | wduckwo...@comscore.com
> > > <mailto:wduckwo...@comscore.com>
> > >
> > >
> > >
> >
> ...
> > >
> > > Introducing Mobile Metrix 2.0 - The next generation of mobile
> > > behavioral measurement www.comscore.com/MobileMetrix<
> > > http://www.comscore.com/Products_Services/Product_Index/Mobile_Metrix_
> > > 2.0>
> > >
> > >
> > >
> >
> >
> > --
> > "...:::Aniket:::... Quetzalco@tl"
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Counters from Python UDF

2012-08-17 Thread Aniket Mokashi
I don't think there is a way at this point. You may have to open a JIRA.

Thanks,
Aniket

On Fri, Aug 17, 2012 at 7:03 AM, Duckworth, Will wrote:

> Has anyone poked around to see if there is there a way to create /
> increment counters from a Python UDFs?  Thanks.
>
>
>
> Will Duckworth Senior Vice President, Software Engineering | comScore,
> Inc. (NASDAQ:SCOR)
>
> o +1 (703) 438-2108 | m +1 (301) 606-2977 | wduckwo...@comscore.com
> 
>
>
> ...
>
> Introducing Mobile Metrix 2.0 - The next generation of mobile behavioral
> measurement
> www.comscore.com/MobileMetrix<
> http://www.comscore.com/Products_Services/Product_Index/Mobile_Metrix_2.0>
>
>
>


-- 
"...:::Aniket:::... Quetzalco@tl"


Re: https://cwiki.apache.org/PIG/how-to-set-up-eclipse-environment.html did not work for eclipse on windows

2012-08-14 Thread Aniket Mokashi
I remember debugging this earlier. It looks like grunt gets an EOF on the
Windows machine. I am not sure why, either.

Thanks,
Aniket

On Fri, Aug 10, 2012 at 3:25 AM, lulynn_2008  wrote:

>  Hi,
> I can run pig main successfully in eclipse on linux. But I find I can not
> run pig main in eclipse on windows. Please help to check. Thanks.
>
> Run Pig Main
> Create a new Run Configurations
> Pick "org.apache.pig.Main" as the Main class  -->  After this, the shell
> immediately terminates in the console.
>
>


-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Import libraries in Jython UDFs

2012-07-25 Thread Aniket Mokashi
Hi Russell,

Check if your Hadoop has MAPREDUCE-967. If not, Pig's job.jar will be
unjarred on the tasks. In that case, Jython's static methods do not put /Lib
on the Python sys.path. You can use the instrumentation code I shared in your
Python UDF file and check sys.path on the task nodes to confirm this. If
/Lib is on sys.path, all imports will work.

One workaround is to put the Jython jar on all nodes of the cluster under the
Hadoop lib directory.

~Aniket

On Tue, Jul 24, 2012 at 5:21 PM, Russell Jurney wrote:

> Thanks, but that isn't my issue. I am unable to import any packages.
> Trying to get the path right...
>
> Russell Jurney http://datasyndrome.com
>
> On Jul 24, 2012, at 10:00 AM, Chun Yang 
> wrote:
>
> > Hi Russell,
> >
> > Are you able to import other modules beside email? If not, maybe this is
> > related to your problem: https://issues.apache.org/jira/browse/PIG-2665
> >
> > -Chun
> >
> > On 7/23/12 11:26 PM, "Russell Jurney"  wrote:
> >
> >> ls /me/jython2.5.2/Lib/
> >>
> >> tons of class files...
> >> email/
> >>
> >>
> >> This is in local mode, atm. I add this directory to my java classpath,
> >> check.
> >>
> >> On Mon, Jul 23, 2012 at 11:10 PM, Aniket Mokashi  >wrote:
> >>
> >>> jar tf jython.jar | grep email
> >>>
> >>> Having jar in PIG_CLASSPATH would work if you have
> >>> https://issues.apache.org/jira/browse/MAPREDUCE-967.
> >>>
> >>> You can use following to debug the sys.path on tasknodes-
> >>>
> >>> from java.lang import System
> >>> print "python.home "
> >>> print System.getProperties().getProperty("python.home")
> >>> print "java.class.path "
> >>> print System.getProperties().getProperty("java.class.path")
> >>> print "install.root "
> >>> print System.getProperties().getProperty("install.root")
> >>> print "python.home "
> >>> print System.getProperties().getProperty("python.home")
> >>>
> >>> ~Aniket
> >>>
> >>> On Mon, Jul 23, 2012 at 6:33 PM, Russell Jurney <
> russell.jur...@gmail.com
> >>>> wrote:
> >>>
> >>>> No, how do I find which jar the email package is in?
> >>>>
> >>>> On Mon, Jul 23, 2012 at 6:02 PM, Norbert Burger <
> >>> norbert.bur...@gmail.com
> >>>>> wrote:
> >>>>
> >>>>> Have you registered the JAR in your Pig script (for local mode) and
> >>>>> also added it to PIG_CLASSPATH (for remote mode, to get it into the
> >>>>> distributed cache)?
> >>>>>
> >>>>> Norbert
> >>>>>
> >>>>> On Mon, Jul 23, 2012 at 8:33 PM, Russell Jurney
> >>>>>  wrote:
> >>>>>> The email package is a part of Jython, I believe:
> >>>>>> http://www.jython.org/docs/library/email.html
> >>>>>>
> >>>>>> However, when I 'import email' in udfs.py, I get this error:
> >>>>>>
> >>>>>> 2012-07-23 17:32:51,027 [main] ERROR
> >>> org.apache.pig.tools.grunt.Grunt -
> >>>>>> ERROR 1121: Python Error. Traceback (most recent call last):
> >>>>>>  File "/Users/rjurney/Collecting-Data/src/pig/udfs.py", line 1, in
> >>>>> 
> >>>>>>import email
> >>>>>> ImportError: No module named email
> >>>>>>
> >>>>>>
> >>>>>> How do I import and use built-in packages in Jython?
> >>>>>>
> >>>>>> --
> >>>>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
> >>>>> datasyndrome.com
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
> >>>> datasyndrome.com
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> "...:::Aniket:::... Quetzalco@tl"
> >>>
> >>
> >>
> >>
> >> --
> >> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
> datasyndrome.com
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Import libraries in Jython UDFs

2012-07-23 Thread Aniket Mokashi
jar tf jython.jar | grep email

Having jar in PIG_CLASSPATH would work if you have
https://issues.apache.org/jira/browse/MAPREDUCE-967.

You can use following to debug the sys.path on tasknodes-

from java.lang import System
print "python.home "
print System.getProperties().getProperty("python.home")
print "java.class.path "
print System.getProperties().getProperty("java.class.path")
print "install.root "
print System.getProperties().getProperty("install.root")
print "python.home "
print System.getProperties().getProperty("python.home")

~Aniket

On Mon, Jul 23, 2012 at 6:33 PM, Russell Jurney wrote:

> No, how do I find which jar the email package is in?
>
> On Mon, Jul 23, 2012 at 6:02 PM, Norbert Burger  >wrote:
>
> > Have you registered the JAR in your Pig script (for local mode) and
> > also added it to PIG_CLASSPATH (for remote mode, to get it into the
> > distributed cache)?
> >
> > Norbert
> >
> > On Mon, Jul 23, 2012 at 8:33 PM, Russell Jurney
> >  wrote:
> > > The email package is a part of Jython, I believe:
> > > http://www.jython.org/docs/library/email.html
> > >
> > > However, when I 'import email' in udfs.py, I get this error:
> > >
> > > 2012-07-23 17:32:51,027 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> > > ERROR 1121: Python Error. Traceback (most recent call last):
> > >   File "/Users/rjurney/Collecting-Data/src/pig/udfs.py", line 1, in
> > 
> > > import email
> > > ImportError: No module named email
> > >
> > >
> > > How do I import and use built-in packages in Jython?
> > >
> > > --
> > > Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
> > datasyndrome.com
> >
>
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
> datasyndrome.com
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Number of reduce tasks

2012-06-18 Thread Aniket Mokashi
Pankaj, are you using hcatalog?

On Fri, Jun 1, 2012 at 5:24 PM, Prashant Kommireddi wrote:

> Right. And the documentation provides a list of operations that can be
> parallelized.
>
> On Jun 1, 2012, at 4:50 PM, Dmitriy Ryaboy  wrote:
>
> > That being said, some operators such as "group all" and limit, do
> require using only 1 reducer, by nature. So it depends on what your script
> is doing.
> >
> > On Jun 1, 2012, at 12:26 PM, Prashant Kommireddi 
> wrote:
> >
> >> Automatic Heuristic works the same in 0.9.1
> >> http://pig.apache.org/docs/r0.9.1/perf.html#parallel, but you might be
> >> better off setting it manually looking at job tracker counters.
> >>
> >> You should be fine with using PARALLEL for any of the operators
> mentioned
> >> on the doc.
> >>
> >> -Prashant
> >>
> >>
> >> On Fri, Jun 1, 2012 at 12:19 PM, Pankaj Gupta 
> wrote:
> >>
> >>> Hi Prashant,
> >>>
> >>> Thanks for the tips. We haven't moved to Pig 0.10.0 yet, but seems
> like a
> >>> very useful upgrade. For the moment though it seems that I should be
> able
> >>> to use the 1GB per reducer heuristic and specify the number of
> reducers in
> >>> Pig 0.9.1 by using the PARALLEL clause in the Pig script. Does this
> sound
> >>> right?
> >>>
> >>> Thanks,
> >>> Pankaj
> >>>
> >>>
> >>> On Jun 1, 2012, at 12:03 PM, Prashant Kommireddi wrote:
> >>>
>  Also, please note default number of reducers are based on input
> dataset.
> >>> In
>  the basic case, Pig will "automatically" spawn a reducer for each GB
> of
>  input, so if your input dataset size is 500 GB you should see 500
> >>> reducers
>  being spawned (though this is excessive in a lot of cases).
> 
>  This document talks about parallelism
>  http://pig.apache.org/docs/r0.10.0/perf.html#parallel
> 
>  Setting the right number of reducers (PARALLEL or set
> default_parallel)
>  depends on what you are doing with it. If the reducer is CPU intensive
> >>> (may
>  be a complex UDF running on reducer side), you would probably spawn
> more
>  reducers. Otherwise (in most cases), the suggestion in the doc (1 GB
> per
>  reducer) holds good for regular aggregations (SUM, COUNT..).
> 
> 
>  1. Take a look at Reduce Shuffle Bytes for the job on JobTracker
>  2. Re-run the job by setting default_parallel to -> 1 reducer per 1 GB
>  of reduce shuffle bytes and see if it performs well
>  3. If not, adjust it according to your Reducer heap size. More the
> >>> heap,
>  less is the data spilled to disk.
> 
>  There are a few more properties on the Reduce side (buffer size etc)
> but
>  that probably is not required to start with.
> 
>  Thanks,
> 
>  Prashant
> 
> 
> 
> 
>  On Fri, Jun 1, 2012 at 11:49 AM, Jonathan Coveney   wrote:
> 
> > Pankaj,
> >
> > What version of pig are you using? In later versions of pig, it
> should
> >>> have
> > some logic around automatically setting parallelisms (though
> sometimes
> > these heuristics will be wrong).
> >
> > There are also some operations which will force you to use 1
> reducer. It
> > depends on what your script is doing.
> >
> > 2012/6/1 Pankaj Gupta 
> >
> >> Hi,
> >>
> >> I just realized that one of my large scale pig jobs that has 100K
> map
> > jobs
> >> actually only has one reduce task. Reading the documentation I see
> that
> > the
> >> number of reduce tasks is defined by the PARALLEL clause whose
> default
> >> value is 1. I have a few questions around this:
> >>
> >> # Why is the default value of reduce tasks 1?
> >> # (Related to first question) Why aren't reduce tasks parallelized
> >> automatically in Pig?
> >> # How do I choose a good value of reduce tasks for my pig jobs?
> >>
> >> Thanks in Advance,
> >> Pankaj
> >
> >>>
> >>>
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Pig out of memory error

2012-06-18 Thread Aniket Mokashi
export HADOOP_HEAPSIZE=

Thanks,
Aniket

On Sun, Jun 17, 2012 at 11:16 PM, Pankaj Gupta wrote:

> Hi,
>
> I am getting an out of memory error while running Pig. I am running a
> pretty big job with one master node and over 100 worker nodes. Pig divides
> the execution in two map-reduce jobs. Both the jobs succeed with a small
> data set. With a large data set I get an out of memory error at the end of
> the first job. This happens right after the all the mappers and reducers of
> the first job are done and the second job hasn't started. Here is the error:
>
> 2012-06-18 03:15:29,565 [Low Memory Detector] INFO
>  org.apache.pig.impl.util.SpillableMemoryManager - first memory handler
> call - Collection threshold init = 187039744(182656K) used =
> 390873656(381712K) committed = 613744640(599360K) max = 699072512(682688K)
> 2012-06-18 03:15:31,137 [Low Memory Detector] INFO
>  org.apache.pig.impl.util.SpillableMemoryManager - first memory handler
> call- Usage threshold init = 187039744(182656K) used = 510001720(498048K)
> committed = 613744640(599360K) max = 699072512(682688K)
> Exception in thread "IPC Client (47) connection to /10.217.23.253:9001from 
> hadoop" java.lang.RuntimeException:
> java.lang.reflect.InvocationTargetException
> Caused by: java.lang.reflect.InvocationTargetException
> Caused by: java.lang.OutOfMemoryError: Java heap space
>at org.apache.hadoop.mapred.TaskReport.(TaskReport.java:46)
>at sun.reflect.GeneratedConstructorAccessor31.newInstance(Unknown
> Source)
>at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:113)
>at
> org.apache.hadoop.io.WritableFactories.newInstance(WritableFactories.java:53)
>at
> org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:236)
>at
> org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:171)
>at
> org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:219)
>at
> org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66)
>at
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:807)
>at org.apache.hadoop.ipc.Client$Connection.run(Client.java:742)
> Exception in thread "Low Memory Detector" java.lang.OutOfMemoryError: Java
> heap space
>at
> sun.management.MemoryUsageCompositeData.getCompositeData(MemoryUsageCompositeData.java:40)
>at
> sun.management.MemoryUsageCompositeData.toCompositeData(MemoryUsageCompositeData.java:34)
>at
> sun.management.MemoryNotifInfoCompositeData.getCompositeData(MemoryNotifInfoCompositeData.java:42)
>at
> sun.management.MemoryNotifInfoCompositeData.toCompositeData(MemoryNotifInfoCompositeData.java:36)
>at sun.management.MemoryImpl.createNotification(MemoryImpl.java:168)
>at
> sun.management.MemoryPoolImpl$CollectionSensor.triggerAction(MemoryPoolImpl.java:300)
>at sun.management.Sensor.trigger(Sensor.java:120)
>
> I will really appreciate and suggestions on how to go about debugging and
> rectifying this issue.
>
> Thanks,
> Pankaj




-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Copying files to Amazon S3 using Pig is slow

2012-06-08 Thread Aniket Mokashi
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

On Fri, Jun 8, 2012 at 4:40 AM, James Newhaven wrote:

> I want to copy 26,000 HDFS files generated by a pig script to Amazon S3.
>
> I am using the copyToLocal command, but I noticed the copy throughput is
> only one file per second - so it is going to take about 7 hours to copy all
> the files.
>
> The command I am using is: copyToLocal /tmp/files/ s3://my-bucket/
>
> Does anyone have any ideas how I could speed this up?
>
> Thanks,
> James
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: While/CROSS/FOREACH loop

2012-05-25 Thread Aniket Mokashi
This might be helpful for this use case -
http://hortonworks.com/blog/new-apache-pig-features-part-2-embedding/

On Tue, May 22, 2012 at 11:31 PM, Russell Jurney
wrote:

> I need to repeatedly CROSS a data set, then FOREACH it, reduce it with
> a filter, then group/test it to test if it's done yet, then repeat
> until it is baked.
>
> How do I do that with pig, and maybe some other tool? Twitter has some
> ruby stuff that can do this, I think, but is there some way with
> nested foreach?
>
> Russell Jurney http://datasyndrome.com
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Load Pig metadata from file?

2012-05-15 Thread Aniket Mokashi
I think you need to play with the quotes; it's more likely a bash problem.

One way to debug is: bash -x pig -f script.pig -param md=$(cat
metadata.dat), and check what the hadoop jar command gets in the end.

Try md="$(cat metadata.dat)"
or md="'$(cat metadata.dat)'" (single quotes inside double quotes),
and so on.

Thanks,
Aniket

On Tue, May 15, 2012 at 3:34 PM, Saurabh S  wrote:

>
> Here is a sample LOAD statement from Programming Pig book:
>
> daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
>date:chararray, open:float, high:float, low:float, close:float,
>volume:int, adj_close:float);
>
> In my case, there are around 250 columns to load. So, I created a file,
> say, metadata.dat with its contents as follows:
>
>  (exchange:chararray, symbol:chararray,
>
>date:chararray, open:float, high:float, low:float, close:float,
>
>volume:int, adj_close:float)
>
> My load statement now looks like
>
> daily = load 'NYSE_daily' as $md;
>
> and the execution looks like.
>
> pig -f script.pig -param md=$(cat metadata.dat)
>
> However, I get the following error in this method:
>
> ERROR 1000: Error during parsing. Lexical error at line 9, column 0.
>  Encountered:  after : ""
>
> Copying the contents of the file in appropriate place works fine. But the
> pig script is cluttered with the metdata and I would like to separate it
> from the script. Any ideas?
>
> HCatLoader() does not seem to be available on my system.
>
>
>
>




-- 
"...:::Aniket:::... Quetzalco@tl"


Re: "Exploding" a Hive array in Pig from an RCFile

2012-04-12 Thread Aniket Mokashi
Hi Malcolm,

Arrays are converted to tuples, and FLATTEN should work on them directly. I
think you need not worry about the delimiter (assuming Hive knows how to
deserialize it). By the way, does RCFile require a delimiter to store arrays?
I am not sure about that.

Thanks,
Aniket


On Wed, Apr 11, 2012 at 8:14 PM, Norbert Burger wrote:

> A little wonky, but try wrapping the flattened tuple elements in a bag, and
> then re-flattening that:
>
> A = LOAD 'test.txt' USING PigStorage(',') AS
> (C_SUB_ID:chararray,seg_ids:chararray);
> B = FOREACH A GENERATE C_SUB_ID,FLATTEN(STRSPLIT(seg_ids,':'));
> C = FOREACH B GENERATE $0,FLATTEN(TOBAG($1..));
>
> Only flattened bags generate the cols -> rows transformation that you're
> trying to make.  Flattened tuples, on the other hand, simply explode the
> tuple into its composite elements, but without creating the multiple rows
> ("cross product') in your relation.  A custom UDF would be another option
> here.
>
> Norbert
>
> On Wed, Apr 11, 2012 at 6:59 PM, Malcolm Tye  >wrote:
>
> > Hi Norbert,
> >I don't seem to be getting what I'm after. If my data looks
> like
> > this
> >
> > 1133957209,61:0:1
> > 4524524233,21:0
> >
> > I want to produce
> >
> > 1133957209,61
> > 1133957209,0
> > 1133957209,1
> > 4524524233,21
> > 4524524233,0
> >
> > I changed the LOAD statement to
> >
> > mt = LOAD '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> > org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID
> > string,seg_ids
> > array');
> > opt = foreach mt generate C_SUB_ID, FLATTEN(STRSPLIT(seg_ids,':')) as
> > s_seg_id;
> >
> > I don't seem to be getting the cross product, just something like the
> > following
> >
> > 1133957209,61,0,1
> > 4524524233,21,0
> >
> > Any ideas ?
> >
> >
> > Thanks
> >
> > Malc
> >
> >
> > -Original Message-
> > From: Norbert Burger [mailto:norbert.bur...@gmail.com]
> > Sent: 06 April 2012 16:01
> > To: user@pig.apache.org
> > Subject: Re: "Exploding" a Hive array in Pig from an RCFile
> >
> > Malcolm -- typically, you'd use a STRSPLIT and optional FLATTEN to
> tokenize
> > a chararray on some delimeter.  So the following should work:
> >
> > opt = foreach mt generate C_SUB_ID, flatten(STRSPLIT(seg_ids,':')) as
> > s_seg_id;
> >
> > Norbert
> >
> > On Thu, Apr 5, 2012 at 8:58 AM, Malcolm Tye
> > wrote:
> >
> > > Hi,
> > >I'm storing data into a partitioned table using Hive in RCFile
> > > format, but I want to use Pig to do the aggregation of that data.
> > >
> > > In my array  in Hive, I have colon delimited data, E.g.
> > >
> > > :0:12:21:99:
> > >
> > > With the lateral view and explode functions in Hive, I can output each
> > > value as a separate row.
> > >
> > > In Pig, I think I need to use flatten, but it just outputs the array
> > > as a single field, and I can't see where to specify that the delimiter
> > > is the delimiter/value separator
> > >
> > > register /opt/pig/trunk/bin/piggybank.jar mt = LOAD
> > > '/hrly_sub_smry/year_month_day=20120329/hour=04/*' USING
> > > org.apache.pig.piggybank.storage.HiveColumnarLoader('C_SUB_ID
> > > string,seg_ids
> > > array');
> > > opt = foreach mt generate C_SUB_ID, flatten(seg_ids) as s_seg_id; dump
> > > opt;
> > >
> > >
> > >
> > > Thanks
> > >
> > > Malc
> > >
> > >
> > >
> >
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Welcome Pig's newest committer, Bill Graham!

2012-04-05 Thread Aniket Mokashi
Congrats Bill...

On Thu, Apr 5, 2012 at 3:04 PM, Prashant Kommireddi wrote:

> Congrats Bill.
>
> Sent from my iPhone
>
> On Apr 5, 2012, at 2:55 PM, Dmitriy Ryaboy  wrote:
>
> > Hi all,
> > On behalf of the Pig PMC, I'm very happy to announce that Bill Graham
> > has been invited to become a Pig committer.
> >
> > Bill's been involved in the Pig project for a long time now, and has
> > made a number of significant contributions -- big improvements to
> > HBase and Avro support, memory leak fixes in PigServer, as well as a
> > number of general usability improvements. He's helpful on the mailing
> > lists, provides great feedback and discussion on the  JIRAs, and is a
> > great advocate for Pig.
> >
> > Looking forward to more great work from Bill!
> >
> > -Dmitriy
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: [ANNOUNCE] Welcome new Apache Pig Committers and PMC members

2012-03-20 Thread Aniket Mokashi
Congrats Jonathan and Julien! :)

On Mon, Mar 19, 2012 at 6:36 PM, Russell Jurney wrote:

> congratulations!
>
> On Mon, Mar 19, 2012 at 5:03 PM, Daniel Dai  wrote:
>
> > Pig users and developers,
> >
> > The Apache Pig PMCs is pleased to announce the new additions to Pig
> > project:
> > * Jonathan Coveney is now Apache Pig committer
> > * Julien Le Dem is now Apache Pig PMC member
> >
> > Thanks for their work for the Apache Pig project in the past and look
> > forward for their future contributions.
> >
> > Thanks,
> > Daniel
> >
>
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
> datasyndrome.com
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: python modules

2012-03-13 Thread Aniket Mokashi
The root cause of this problem was the backward incompatibility around
https://issues.apache.org/jira/browse/MAPREDUCE-967.

Without it, Hadoop unpacks job.jar and puts the jobCacheDir on the classpath,
so Jython is not able to find its jar (job.jar) on the classpath (as it is
unpacked).

With MAPREDUCE-967, Hadoop puts job.jar itself on the classpath, so Jython
finds its jar (= job.jar) and updates sys.path.

Stan, I think if you put jython.jar on the classpath for the TaskTracker, it
will work (you have to restart the TT). PIG-2010 is another possible option.

Thanks,
Aniket

On Mon, Mar 12, 2012 at 1:16 AM, Aniket Mokashi  wrote:

> This looks like a bug to me. Jython cuts out jython.jar location from
> classpath and appends Lib to it. But, in general on TT jython,jar is not
> available and its "merged" into job.jar by pig. Hence, imports will always
> fail.
>
> ~Aniket
>
>
> On Mon, Mar 12, 2012 at 12:54 AM, Aniket Mokashi wrote:
>
>> I spent some time debugging this. The reason is --
>>
>> Sys.path on TT for jython is - ['__classpath__', '__pyclasspath__/']
>>
>> And for client is ['', '/users/lib/Lib',
>> '/users/lib/jython_simplejson.jar/Lib', '__classpath__', '__pyclasspath__/']
>>
>> I am still figuring out why CLASSPATH (java.class.path property) on
>> tasktracker doesn't have job.jar on it. Hints anyone?
>>
>>
>> Thanks,
>>
>> Aniket
>>
>> On Tue, Oct 18, 2011 at 9:54 AM, Stan Rosenberg <
>> srosenb...@proclivitysystems.com> wrote:
>>
>>> Hi Clay,
>>>
>>> I am running a very recent version (one that contains this patch) of
>>> pig which was compiled from the trunk.
>>> How can I examine the jar file to determine which jython modules have
>>> been added?
>>>
>>> Thanks,
>>>
>>> stan
>>>
>>> On Tue, Oct 18, 2011 at 12:38 PM, Clay B.  wrote:
>>> > Hi Stan,
>>> >
>>> > I believe you are hitting
>>> https://issues.apache.org/jira/browse/PIG-1824
>>> >
>>> > -Clay
>>> >
>>> > On Mon, 17 Oct 2011, Stan Rosenberg wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> What's a proper way to deploy python udfs? I've dropped the latest
>>> >> version of jython.jar in $PIG_HOME/lib.
>>> >> Things work in "local" mode, but when I run on a cluster, built-in
>>> >> python modules cannot be found. E.g., urlparse cannot be located:
>>> >>
>>> >> ImportError: No module named urlparse
>>> >>
>>> >>   at
>>> org.python.core.PyException.fillInStackTrace(PyException.java:70)
>>> >>   at java.lang.Throwable.(Throwable.java:181)
>>> >>   at java.lang.Exception.(Exception.java:29)
>>> >>   at java.lang.RuntimeException.(RuntimeException.java:32)
>>> >>   at org.python.core.PyException.(PyException.java:46)
>>> >>   at org.python.core.PyException.(PyException.java:43)
>>> >>   at org.python.core.PyException.(PyException.java:61)
>>> >>   at org.python.core.Py.ImportError(Py.java:290)
>>> >>   at org.python.core.imp.import_first(imp.java:750)
>>> >>   at org.python.core.imp.import_name(imp.java:834)
>>> >>   at org.python.core.imp.importName(imp.java:884)
>>> >>   at
>>> org.python.core.ImportFunction.__call__(__builtin__.java:1220)
>>> >>   at org.python.core.PyObject.__call__(PyObject.java:357)
>>> >>   at org.python.core.__builtin__.__import__(__builtin__.java:1173)
>>> >>   at org.python.core.imp.importFromAs(imp.java:978)
>>> >>   at org.python.core.imp.importFrom(imp.java:954)
>>> >>   at org.python.pycode._pyx3.f$0(udfs.py:40)
>>> >>   at org.python.pycode._pyx3.call_function(udfs.py)
>>> >>   at org.python.core.PyTableCode.call(PyTableCode.java:165)
>>> >>   at org.python.core.PyCode.call(PyCode.java:18)
>>> >>   at org.python.core.Py.runCode(Py.java:1261)
>>> >>   at
>>> >> org.python.util.PythonInterpreter.execfile(PythonInterpreter.java:235)
>>> >>   at
>>> >>
>>> org.apache.pig.scripting.jython.JythonScriptEngine$Interpreter.execfile(JythonScriptEngine.java:176)
>>> >>   ... 15 more
>>> >>
>>> >> Thanks,
>>> >>
>>> >> stan
>>> >>
>>> >
>>>
>>
>>
>>
>> --
>> "...:::Aniket:::... Quetzalco@tl"
>>
>
>
>
> --
> "...:::Aniket:::... Quetzalco@tl"
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: python modules

2012-03-12 Thread Aniket Mokashi
This looks like a bug to me. Jython cuts the jython.jar location out of the
classpath and appends Lib to it. But, in general, jython.jar is not available
on the TaskTracker and is "merged" into job.jar by Pig. Hence, the imports
will always fail.

~Aniket

On Mon, Mar 12, 2012 at 12:54 AM, Aniket Mokashi wrote:

> I spent some time debugging this. The reason is --
>
> Sys.path on TT for jython is - ['__classpath__', '__pyclasspath__/']
>
> And for client is ['', '/users/lib/Lib',
> '/users/lib/jython_simplejson.jar/Lib', '__classpath__', '__pyclasspath__/']
>
> I am still figuring out why CLASSPATH (java.class.path property) on
> tasktracker doesn't have job.jar on it. Hints anyone?
>
>
> Thanks,
>
> Aniket
>
> On Tue, Oct 18, 2011 at 9:54 AM, Stan Rosenberg <
> srosenb...@proclivitysystems.com> wrote:
>
>> Hi Clay,
>>
>> I am running a very recent version (one that contains this patch) of
>> pig which was compiled from the trunk.
>> How can I examine the jar file to determine which jython modules have
>> been added?
>>
>> Thanks,
>>
>> stan
>>
>> On Tue, Oct 18, 2011 at 12:38 PM, Clay B.  wrote:
>> > Hi Stan,
>> >
>> > I believe you are hitting
>> https://issues.apache.org/jira/browse/PIG-1824
>> >
>> > -Clay
>> >
>> > On Mon, 17 Oct 2011, Stan Rosenberg wrote:
>> >
>> >> Hi,
>> >>
>> >> What's a proper way to deploy python udfs? I've dropped the latest
>> >> version of jython.jar in $PIG_HOME/lib.
>> >> Things work in "local" mode, but when I run on a cluster, built-in
>> >> python modules cannot be found. E.g., urlparse cannot be located:
>> >>
>> >> ImportError: No module named urlparse
>> >>
>> >>   at
>> org.python.core.PyException.fillInStackTrace(PyException.java:70)
>> >>   at java.lang.Throwable.(Throwable.java:181)
>> >>   at java.lang.Exception.(Exception.java:29)
>> >>   at java.lang.RuntimeException.(RuntimeException.java:32)
>> >>   at org.python.core.PyException.(PyException.java:46)
>> >>   at org.python.core.PyException.(PyException.java:43)
>> >>   at org.python.core.PyException.(PyException.java:61)
>> >>   at org.python.core.Py.ImportError(Py.java:290)
>> >>   at org.python.core.imp.import_first(imp.java:750)
>> >>   at org.python.core.imp.import_name(imp.java:834)
>> >>   at org.python.core.imp.importName(imp.java:884)
>> >>   at org.python.core.ImportFunction.__call__(__builtin__.java:1220)
>> >>   at org.python.core.PyObject.__call__(PyObject.java:357)
>> >>   at org.python.core.__builtin__.__import__(__builtin__.java:1173)
>> >>   at org.python.core.imp.importFromAs(imp.java:978)
>> >>   at org.python.core.imp.importFrom(imp.java:954)
>> >>   at org.python.pycode._pyx3.f$0(udfs.py:40)
>> >>   at org.python.pycode._pyx3.call_function(udfs.py)
>> >>   at org.python.core.PyTableCode.call(PyTableCode.java:165)
>> >>   at org.python.core.PyCode.call(PyCode.java:18)
>> >>   at org.python.core.Py.runCode(Py.java:1261)
>> >>   at
>> >> org.python.util.PythonInterpreter.execfile(PythonInterpreter.java:235)
>> >>   at
>> >>
>> org.apache.pig.scripting.jython.JythonScriptEngine$Interpreter.execfile(JythonScriptEngine.java:176)
>> >>   ... 15 more
>> >>
>> >> Thanks,
>> >>
>> >> stan
>> >>
>> >
>>
>
>
>
> --
> "...:::Aniket:::... Quetzalco@tl"
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: python modules

2012-03-12 Thread Aniket Mokashi
I spent some time debugging this. The reason is:

sys.path on the TaskTracker for Jython is ['__classpath__', '__pyclasspath__/']

and for the client it is ['', '/users/lib/Lib',
'/users/lib/jython_simplejson.jar/Lib', '__classpath__', '__pyclasspath__/']

I am still figuring out why the CLASSPATH (java.class.path property) on the
tasktracker doesn't have job.jar on it. Hints, anyone?


Thanks,

Aniket

On Tue, Oct 18, 2011 at 9:54 AM, Stan Rosenberg <
srosenb...@proclivitysystems.com> wrote:

> Hi Clay,
>
> I am running a very recent version (one that contains this patch) of
> pig which was compiled from the trunk.
> How can I examine the jar file to determine which jython modules have
> been added?
>
> Thanks,
>
> stan
>
> On Tue, Oct 18, 2011 at 12:38 PM, Clay B.  wrote:
> > Hi Stan,
> >
> > I believe you are hitting https://issues.apache.org/jira/browse/PIG-1824
> >
> > -Clay
> >
> > On Mon, 17 Oct 2011, Stan Rosenberg wrote:
> >
> >> Hi,
> >>
> >> What's a proper way to deploy python udfs? I've dropped the latest
> >> version of jython.jar in $PIG_HOME/lib.
> >> Things work in "local" mode, but when I run on a cluster, built-in
> >> python modules cannot be found. E.g., urlparse cannot be located:
> >>
> >> ImportError: No module named urlparse
> >>
> >>   at
> org.python.core.PyException.fillInStackTrace(PyException.java:70)
> >>   at java.lang.Throwable.(Throwable.java:181)
> >>   at java.lang.Exception.(Exception.java:29)
> >>   at java.lang.RuntimeException.(RuntimeException.java:32)
> >>   at org.python.core.PyException.(PyException.java:46)
> >>   at org.python.core.PyException.(PyException.java:43)
> >>   at org.python.core.PyException.(PyException.java:61)
> >>   at org.python.core.Py.ImportError(Py.java:290)
> >>   at org.python.core.imp.import_first(imp.java:750)
> >>   at org.python.core.imp.import_name(imp.java:834)
> >>   at org.python.core.imp.importName(imp.java:884)
> >>   at org.python.core.ImportFunction.__call__(__builtin__.java:1220)
> >>   at org.python.core.PyObject.__call__(PyObject.java:357)
> >>   at org.python.core.__builtin__.__import__(__builtin__.java:1173)
> >>   at org.python.core.imp.importFromAs(imp.java:978)
> >>   at org.python.core.imp.importFrom(imp.java:954)
> >>   at org.python.pycode._pyx3.f$0(udfs.py:40)
> >>   at org.python.pycode._pyx3.call_function(udfs.py)
> >>   at org.python.core.PyTableCode.call(PyTableCode.java:165)
> >>   at org.python.core.PyCode.call(PyCode.java:18)
> >>   at org.python.core.Py.runCode(Py.java:1261)
> >>   at
> >> org.python.util.PythonInterpreter.execfile(PythonInterpreter.java:235)
> >>   at
> >>
> org.apache.pig.scripting.jython.JythonScriptEngine$Interpreter.execfile(JythonScriptEngine.java:176)
> >>   ... 15 more
> >>
> >> Thanks,
> >>
> >> stan
> >>
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: how to set one var equal to another

2012-03-10 Thread Aniket Mokashi
Hi Colleen,

I'm not sure what your use case is, but you may want to watch
https://issues.apache.org/jira/browse/PIG-438.

Thanks,
Aniket


On Sat, Mar 10, 2012 at 11:33 AM, Jonathan Coveney wrote:

> It's important to remember that the aliases to the left of the equals are
> not variables, per se, they represent relations, which is basically a bunch
> of data. You can achieve what you want by doing:
>
> subjects2 = foreach subjects1 generate *;
>
> As far as doing something like:
>
> subjects 2 = dump subject1;
>
> dump, store, etc do not generate relations, they are what generate M/R
> jobs.
>
> It can all be pretty opaque. Post if you have more questions.
> Jon
>
> 2012/3/9 Colleen Ross 
>
> > I tried to subscribe but a mail client box came up, not what I wanted, so
> > we'll see if this works.
> >
> > I wrote this script:
> >
> > register s3n://uw-cse344-code/myudfs.jar
> >
> >
> > -- load the test file into Pig
> > --raw = LOAD 's3n://uw-cse344-test/cse344-test-file' USING TextLoader as
> > (line:chararray);
> > -- later you will load to other files, example:
> > raw = LOAD 's3n://uw-cse344/btc-2010-chunk-000' USING TextLoader as
> > (line:chararray);
> >
> > -- parse each line into ntriples
> > ntriples = foreach raw generate FLATTEN(myudfs.RDFSplit3(line)) as
> > (subject:chararray,predicate:chararray,object:chararray);
> >
> > --filter 1
> > subjects1 = filter ntriples by subject matches '.*rdfabout\\.com.*'
> > PARALLEL 50;
> > --filter 2
> > subjects2 = subjects1;
> >
> > but I got the error:
> > 2012-03-10 01:19:18,039 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> > ERROR 1200:   mismatched input ';' expecting
> LEFT_PAREN
> > Details at logfile: /home/hadoop/pig_1331342327467.log
> >
> > how do I simply set one variable equal to another?
> >
> > I also tried subjects2 = dump subjects1;
> >
> > thanks!
> >
> >
> > --
> > ~Colleen Ross
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: View Map-Reduce payload

2012-03-06 Thread Aniket Mokashi
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#EXPLAIN
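
As a quick sketch of what EXPLAIN gives you (it prints the logical, physical
and MapReduce plans for an alias; Pig does not emit generated map/reduce Java
source code -- the relation and field names below are illustrative):

A = load 'input' as (x:chararray, y:int);
B = group A by x;
C = foreach B generate group, SUM(A.y);
explain C;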

On Tue, Mar 6, 2012 at 5:28 AM, shan shan  wrote:

> Hi
> Can  I see the user-payload for the MapReduce job that is created by Pig.
> How?
> i.e. the Map and Reduce function code that is generated by Pig script..
>
> Thanks,
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Scalars can only be used within projections

2012-03-01 Thread Aniket Mokashi
I think you are looking for-

C = join FILTERED_A by key1, B by key1;
C1 = filter C by ;

if key1 equality is not your join condition, you may have to go for a CROSS.
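
For instance, a minimal sketch for the use case described in the quoted mail
below (aliases and the date-range condition are illustrative, and it assumes
dates are stored in a lexicographically comparable format):

A = load 'a.txt' as (date:chararray, key1:chararray, key2:chararray);
B = load 'b.txt' as (startdate:chararray, enddate:chararray, key1:chararray, key3:chararray);
FILTERED_A = filter A by key2 == 'my_value';
C = join FILTERED_A by key1, B by key1;
C1 = filter C by (FILTERED_A::date >= B::startdate) and (FILTERED_A::date <= B::enddate);
RESULT = foreach C1 generate B::key3;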

Thanks,
Aniket

On Thu, Mar 1, 2012 at 4:26 AM, mete  wrote:

> Hello folks,
>
> i am new to pig-latin and i am trying to implement a use case as poc.
>
> I have 2 files that i should correlate, similar to this:
>
> A (date,key1,key2)
> B (startdate,enddate,key1,key3)
>
> so what i am trying to do is:
> query for key2
> for all the matches
> find key3 from B if the date range matches
>
> So this is what i have come up with so far:
>
> A = LOAD ...;
> B = LOAD ...;
>
> FILTERED_A = FILTER A BY key2="my_value";
> XX = FOREACH FILTERED_A {
>RESULT= FILTER B BY ( some conditions .)
>DUMP RESULT;
> };
>
> But this just gives me the error in subject without pointers to any
> line/char. I am using 0.8.1-cdh3u3.
> Any ideas?
>
>
> As a side question, i could not figure out howto provide multiple input
> files for pigunit for  a case like the above,
> Is anyone familiar with pigunit?
>
>
>
> Thanks in advance
> Mete
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Jython UDF problem

2012-02-05 Thread Aniket Mokashi
Looks like this is a Jython bug.

Btw, AFAIK, the return type of this function would be a bytearray if the
output schema decorator is not specified.

Thanks,
Aniket

On Sat, Feb 4, 2012 at 9:39 PM, Russell Jurney wrote:

> Why am I having tuple objects in my python udfs?  This isn't how the
> examples work.
>
> Error:
>
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error
> executing function
> at
>
> org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:106)
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:275)
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:320)
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNext(POLimit.java:85)
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
> at
>
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
> at
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
> at
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
> at
>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: Traceback (most recent call last):
>  File "udfs.py", line 27, in hour
>return tuple_time.tm_hour
> AttributeError: 'tuple' object has no attribute 'tm_hour'
>
>
> udfs.py:
>
> #!/usr/bin/python
>
> import time
>
> def hour(iso_string):
>  tuple_time = time.strptime(iso_string, "%Y-%m-%dT%H:%M:%S")
>  return str(tuple_time.tm_hour)
>
>
> my.pig:
>
> register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
> register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
> register /me/pig/contrib/piggybank/java/piggybank.jar
> register /me/pig/build/ivy/lib/Pig/jackson-core-asl-1.7.3.jar
> register /me/pig/build/ivy/lib/Pig/jackson-mapper-asl-1.7.3.jar
>
> define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
> define CustomFormatToISO
> org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
> define substr org.apache.pig.piggybank.evaluation.string.SUBSTRING();
>
> register 'udfs.py' using jython as agiledata;
>
> rmf /tmp/sent_distribution.txt
>
> /* Get email address pairs for each type of connection, and union them
> together */
> emails = load '/me/tmp/test_inbox' using AvroStorage();
>
> /* Filter emails according to existence of header pairs, from and [to, cc,
> bcc]
> project the pairs (may be more than one to/cc/bcc), then emit them,
> lowercased. */
> filtered = FILTER emails BY (from is not null) and (to is not null) and
> (date is not null);
> flat = FOREACH filtered GENERATE flatten(from) as from,
> flatten(to) as to,
> agiledata.hour(date) as date;
> a = limit flat 10;
> dump a
>
>
>
> --
> Russell Jurney
> twitter.com/rjurney
> russell.jur...@gmail.com
> datasyndrome.com
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: LOWER

2012-02-04 Thread Aniket Mokashi
I think Pig UDFs are resolved purely by class name, which is case sensitive
(the built-in is LOWER, all capitals). Are you suggesting adding something
like a function registry to Pig? That would be a good idea. As a workaround
(or solution), there is pigrc/pigbootup, which can be used to alias functions.
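
For example, a sketch of that workaround (assuming the bootup-file mechanism
is available in your build; the DEFINE simply aliases the builtin under a
lowercase name, and the FOREACH line mirrors the script quoted below):

-- in ~/.pigbootup (or wherever default statements are picked up)
DEFINE lower org.apache.pig.builtin.LOWER();

pairs = FOREACH flat GENERATE lower(from) AS from, lower(to) AS to, date;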

Thanks,
Aniket

On Sat, Feb 4, 2012 at 5:59 PM, Russell Jurney wrote:

> Is it me, or is it weird that builtins like LOWER only work in lowercase?
>
> pairs = FOREACH flat GENERATE lower(from) AS from, lower(to) AS to, date;
>
> 2012-02-04 17:57:53,851 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1070: Could not resolve lower using imports: [,
> org.apache.pig.builtin., org.apache.pig.impl.builtin.]
>
>
> I don't like.  I have taken a moral oauth to move my Pig to lowercase, and
> this is messing it up.
>
> --
> Russell Jurney
> twitter.com/rjurney
> russell.jur...@gmail.com
> datasyndrome.com
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: issue with partioning sdf

2012-02-01 Thread Aniket Mokashi
I think Pig uses Hadoop's default (hash) partitioner on the group key in that
case.

To plug in your own, you can use the following syntax:

A = load 'input_data';
B = group A by $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;

Take a look-
https://issues.apache.org/jira/browse/PIG-282

Thanks,
Aniket

On Wed, Feb 1, 2012 at 5:04 PM, Aleksandr Elbakyan wrote:

> Hello All,
>
> I am trying to understand how does pig group partitioning work, I was not
> able to find any documentation regarding what happen under the hood.
>
>
> For example
>
> B = GROUP A BY age;
>
> Does pig partition data by age? Or it will partition by something else?
>
>
> Other question:
> If I want to create custom partitioner can I pass fields I want data be
> partition by or it will be the same as group by key?
>
>
> Regards,
> Aleksandr
>
>


-- 
"...:::Aniket:::... Quetzalco@tl"


Re: explode operation

2012-01-29 Thread Aniket Mokashi
Isn't FLATTEN similar to explode?
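
For example, a minimal sketch along the lines of Jonathan's snippet quoted
below (field names are illustrative):

X = load 'input.txt' using PigStorage(',') as (id1:chararray, id2:chararray, id3:chararray);
Y = foreach X generate FLATTEN(TOBAG(*)) as id;   -- one row (x,y,z) becomes rows (x),(y),(z)
Y1 = filter Y by id is not null;
Z = distinct Y1;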

On Sun, Jan 29, 2012 at 5:46 PM, Stan Rosenberg <
srosenb...@proclivitysystems.com> wrote:

> Hi Jonathan,
>
> What you recommended below is not quite right.  The right solution
> would need to do something similar to 'explode'.
>
> Thanks,
>
> stan
>
> On Thu, Jan 26, 2012 at 3:04 PM, Jonathan Coveney 
> wrote:
> > I think this might give you what you want
> >
> > X = LOAD 'input.txt' using PigStorage(',') AS (id1:chararray,
> > id2:chararray, id3:chararray, id4:chararray, id5:chararray);
> > Y_0 = foreach X generate FLATTEN(TOBAG(*));
> > Y = filter Y_0 by $0 is not null;
> >
> > 2012/1/25 Prashant Kommireddi 
> >
> >> Sorry I misunderstood your initial question. You would have to write a
> >> custom UDF to do this.
> >>
> >> Thanks,
> >> Prashant
> >>
> >> On Jan 25, 2012, at 7:32 PM, Stan Rosenberg
> >>  wrote:
> >>
> >> > To clarify, here is our input:
> >> >
> >> > X = LOAD 'input.txt' AS (id1:chararray, id2:charrarray,
> >> > id3:charrarray, id4:chararray, id5:chararray);
> >> >
> >> > We want to compute Y that consists of a single column denoting the set
> >> > of all (non-null) ids coming from X.
> >> >
> >> > stan
> >> >
> >> >
> >> > On Wed, Jan 25, 2012 at 10:26 PM, Stan Rosenberg
> >> >  wrote:
> >> >> I don't see how flatten would help in this case.
> >> >>
> >> >> On Wed, Jan 25, 2012 at 10:19 PM, Prashant Kommireddi
> >> >>  wrote:
> >> >>> Hi Stan,
> >> >>>
> >> >>> Would using FLATTEN and then DISTINCT work?
> >> >>>
> >> >>> Thanks,
> >> >>> Prashant
> >> >>>
> >> >>> On Wed, Jan 25, 2012 at 7:11 PM, Stan Rosenberg <
> >> >>> srosenb...@proclivitysystems.com> wrote:
> >> >>>
> >>  Hi Guys,
> >> 
> >>  I came across a use case that seems to require an 'explode'
> operation
> >>  which to my knowledge is not currently available.
> >>  That is, given a tuple (x,y,z), 'explode' would generate the tuples
> >>  (x), (y), (z).
> >> 
> >>  E.g., consider a relation that contains an arbitrary number of
> >>  different identifier columns, say,
> >>  social security id, student id, etc.  We want to compute the set of
> >>  all distinct identifiers.  Assume that the number of identifier
> >>  columns is large and intermingled with other
> >>  columns that should be projected out; this is to avoid a solution
> >>  using 'SPLIT', e.g.
> >> 
> >>  To be concrete, if X = {(..., 2, 4, ..., 3), (..., 2,,...,5)} is
> such
> >>  a relation, then the answer we want is
> >>  Y={2,3,4,5}.
> >> 
> >>  Any suggestions?
> >> 
> >>  Thanks,
> >> 
> >>  stan
> >> 
> >>
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Concat multiple strings

2012-01-22 Thread Aniket Mokashi
Thanks Prashant. I just realized that!
I thought we had it in 0.8, good stuff to know. :)
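
For reference, a quick sketch of both forms discussed below (alias and field
names are illustrative):

-- Pig 0.9 and later: CONCAT takes more than two arguments
full = foreach data generate CONCAT(str1, str2, str3, str4);

-- Pig 0.8 and earlier: nest the two-argument form
full = foreach data generate CONCAT(str1, CONCAT(str2, CONCAT(str3, str4)));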

Thanks,
Aniket

On Sun, Jan 22, 2012 at 5:53 PM, Prashant Kommireddi wrote:

> Aniket, if you read through the comments you would notice the feature
> was actually added in 0.9. The one in 0.8 had an issue.
>
> Thanks,
> Prashant
>
> Sent from my iPhone
>
> On Jan 22, 2012, at 5:44 PM, Aniket Mokashi  wrote:
>
> > Alan, I just noticed its Pig 0.8 and later.
> > https://issues.apache.org/jira/browse/PIG-1420
> > Am I missing something?
> >
> > Thanks,
> > Aniket
> >
> > On Thu, Jan 19, 2012 at 8:04 AM, Alan Gates 
> wrote:
> >
> >> In Pig 0.9 and later CONCAT accepts more than two strings or bytearrays.
> >>
> >> Alan.
> >>
> >> On Jan 18, 2012, at 11:39 PM, Michael Lok wrote:
> >>
> >>> Hi folks,
> >>>
> >>> Is there an another way to perform string concat on multiple columns
> >>> instead of using the built in CONCAT function which only takes 2
> >>> arguments?
> >>>
> >>> I can do CONCAT(str1, CONCAT(str2, str3)), but that's really
> >>> stretching it if I have more than 4 fields :)
> >>>
> >>>
> >>> Thanks!
> >>
> >>
> >
> >
> > --
> > "...:::Aniket:::... Quetzalco@tl"
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Concat multiple strings

2012-01-22 Thread Aniket Mokashi
Alan, I just noticed its Pig 0.8 and later.
https://issues.apache.org/jira/browse/PIG-1420
Am I missing something?

Thanks,
Aniket

On Thu, Jan 19, 2012 at 8:04 AM, Alan Gates  wrote:

> In Pig 0.9 and later CONCAT accepts more than two strings or bytearrays.
>
> Alan.
>
> On Jan 18, 2012, at 11:39 PM, Michael Lok wrote:
>
> > Hi folks,
> >
> > Is there an another way to perform string concat on multiple columns
> > instead of using the built in CONCAT function which only takes 2
> > arguments?
> >
> > I can do CONCAT(str1, CONCAT(str2, str3)), but that's really
> > stretching it if I have more than 4 fields :)
> >
> >
> > Thanks!
>
>


-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Choosing output directory based on field value

2012-01-09 Thread Aniket Mokashi
Pig has MultiStorage in piggybank.

https://github.com/apache/pig/blob/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/MultiStorage.java

I think it has some limitations. You can check the javadoc/JIRAs for details.
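
For reference, a minimal usage sketch (the '2' is a placeholder for whichever
field index holds the value you want to partition the output by):

register piggybank.jar;
store xyz into '/base/output'
    using org.apache.pig.piggybank.storage.MultiStorage('/base/output', '2');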

Thanks,
Aniket

On Mon, Jan 9, 2012 at 10:21 PM, IGZ Nick  wrote:

> I am able to group the tuples by date. But the problem I am facing is how
> do I ensure that when I finally STORE it, it is stored in separate folders?
>
> On Tue, Jan 10, 2012 at 11:27 AM, Daniel Dai 
> wrote:
>
> > You can use custom partitioner. Check
> > http://pig.apache.org/docs/r0.9.1/basic.html#partitionby.
> >
> > Daniel
> >
> > On Mon, Jan 9, 2012 at 9:39 PM, IGZ Nick  wrote:
> >
> > > Hi,
> > >
> > > What I would like to do is to store outputs to different directories
> > based
> > > on record value. Essentially I want to read the date from a field and
> > store
> > > the output in /mm/dd directory structure. How should I go about
> > this? I
> > > want to use AvroStorage for storing the stuff. I want to specify STORE
> > xyz
> > > INTO '$location' USING MyStorage(); where $location would be the base
> > > output directory. MyStorage() would be the modified version of
> > AvroStorage
> > > which stores the values in $location//mm/dd/part-abc files, reading
> > the
> > > mmdd from a particular field in the input records.
> > >
> > > What is the way to achieve this with minimal changes?
> > >
> > > Nick
> > >
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: getWrappedSplit() is incorrectly returning the first split

2012-01-09 Thread Aniket Mokashi
The change was added as part of PIG-1518, whose release notes say:

"This change will not cause any backward compatibility issue except if a
loader implementation makes use of the PigSplit object passed through the
prepareToRead method where a rebuild of the loader might be necessary as
PigSplit's definition has been modified. However, currently we know of no
external use of the object.

This change also requires the loader to be stateless across the invocations
to the prepareToRead method. That is, the method should reset any internal
states that are not affected by the RecordReader argument.
Otherwise, this feature should be disabled."

It looks like returning the 0th split was done deliberately. Comments?

Thanks,
Aniket

On Mon, Jan 9, 2012 at 9:10 PM, Alex Rovner  wrote:

> I have already created the patch and tested with some of my jobs. I ran
> into unit tests failure issues though as well. I can attach the patch to
> Jira tomorrow anyways to be applied once things are straightened out.
>
> Alex R
>
> On Mon, Jan 9, 2012 at 8:07 PM, Jonathan Coveney 
> wrote:
>
> > If it is affecting production jobs, I see no reason why we can't put the
> > fix into 0.9.2, though I sense that a vote will be coming soon for a
> 0.9.2
> > release, so a fix would have to come soon..the issues running the tests
> > brought up in Bill's thread will have to be fixed before we can, though.
> I
> > have a patch that's completely stopped because I can develop any new
> tests,
> > and so on.
> >
> > 2012/1/9 Prashant Kommireddi 
> >
> > > Is this critical enough to make it back into 0.9.1?
> > >
> > > -Prashant
> > >
> > > On Mon, Jan 9, 2012 at 4:44 PM, Aniket Mokashi 
> > > wrote:
> > >
> > > > Thanks so much for finding this out.
> > > >
> > > > I was using
> > > >
> > > > @Override
> > > >
> > > > public void prepareToRead(@SuppressWarnings("rawtypes")
> > > > RecordReaderreader, PigSplit split)
> > > >
> > > >  throws IOException {
> > > >
> > > >  this.in = reader;
> > > >
> > > >  partValues =
> > > >
> > > >
> > >
> >
> ((DataovenSplit)split.getWrappedSplit()).getPartitionInfo().getPartitionValues();
> > > >
> > > >
> > > > in my loader that behaves like hcatalog for delimited text in hive.
> > That
> > > > returns me same partvalues for all the values. I hacked it with
> > something
> > > > else. But, I think I must have hit this case. I will confirm. Thanks
> > > again
> > > > for reporting this.
> > > >
> > > > Thanks,
> > > >
> > > > Aniket
> > > >
> > > > On Mon, Jan 9, 2012 at 11:06 AM, Daniel Dai 
> > > wrote:
> > > >
> > > > > Yes, please. Thanks!
> > > > >
> > > > > On Mon, Jan 9, 2012 at 10:48 AM, Alex Rovner  >
> > > > wrote:
> > > > >
> > > > > > Jira opened.
> > > > > >
> > > > > > I can attempt to submit a patch as this seems like a fairly
> > straight
> > > > > > forward fix.
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/PIG-2462
> > > > > >
> > > > > >
> > > > > > Thanks
> > > > > > Alex R
> > > > > >
> > > > > > On Sat, Jan 7, 2012 at 6:14 PM, Daniel Dai <
> da...@hortonworks.com>
> > > > > wrote:
> > > > > >
> > > > > > > Sounds like a bug. I guess no one ever rely on specific split
> > info
> > > > > > before.
> > > > > > > Please open a Jira.
> > > > > > >
> > > > > > > Daniel
> > > > > > >
> > > > > > > On Fri, Jan 6, 2012 at 10:21 PM, Alex Rovner <
> > alexrov...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Additionally it looks like PigRecordReader is not
> incrementing
> > > the
> > > > > > index
> > > > > > > in
> > > > > > > > the PigSplit when dealing with CombinedInputFormat thus the
> > index
> > > > > will
> > > > > > be
> > > > > > > > incorrect in either case.
> > > > &g

Re: getWrappedSplit() is incorrectly returning the first split

2012-01-09 Thread Aniket Mokashi
Thanks so much for finding this out.

I was using

@Override
public void prepareToRead(@SuppressWarnings("rawtypes") RecordReader reader, PigSplit split)
        throws IOException {
    this.in = reader;
    partValues = ((DataovenSplit) split.getWrappedSplit())
            .getPartitionInfo().getPartitionValues();
}


in my loader, which behaves like HCatalog for delimited text in Hive. That
returns the same partValues for all the records. I hacked around it with
something else, but I think I must have hit this case. I will confirm. Thanks
again for reporting this.

Thanks,

Aniket

On Mon, Jan 9, 2012 at 11:06 AM, Daniel Dai  wrote:

> Yes, please. Thanks!
>
> On Mon, Jan 9, 2012 at 10:48 AM, Alex Rovner  wrote:
>
> > Jira opened.
> >
> > I can attempt to submit a patch as this seems like a fairly straight
> > forward fix.
> >
> > https://issues.apache.org/jira/browse/PIG-2462
> >
> >
> > Thanks
> > Alex R
> >
> > On Sat, Jan 7, 2012 at 6:14 PM, Daniel Dai 
> wrote:
> >
> > > Sounds like a bug. I guess no one ever rely on specific split info
> > before.
> > > Please open a Jira.
> > >
> > > Daniel
> > >
> > > On Fri, Jan 6, 2012 at 10:21 PM, Alex Rovner 
> > wrote:
> > >
> > > > Additionally it looks like PigRecordReader is not incrementing the
> > index
> > > in
> > > > the PigSplit when dealing with CombinedInputFormat thus the index
> will
> > be
> > > > incorrect in either case.
> > > >
> > > > On Fri, Jan 6, 2012 at 4:50 PM, Alex Rovner 
> > > wrote:
> > > >
> > > > > Ran into this today. Using trunk (0.11)
> > > > >
> > > > > If you are using a custom loader and are trying to get input split
> > > > > information In prepareToRead(), getWrappedSplit() is providing the
> > fist
> > > > > split instead of current.
> > > > >
> > > > > Checking the code confirms the suspicion:
> > > > >
> > > > > PigSplit.java:
> > > > >
> > > > > public InputSplit getWrappedSplit() {
> > > > > return wrappedSplits[0];
> > > > > }
> > > > >
> > > > > Should be:
> > > > > public InputSplit getWrappedSplit() {
> > > > > return wrappedSplits[splitIndex];
> > > > > }
> > > > >
> > > > >
> > > > > The side effect is that if you are trying to retrieve the current
> > split
> > > > > when pig is using CombinedInputFormat it incorrectly always returns
> > the
> > > > > first file in the list instead of the current one that its
> reading. I
> > > > have
> > > > > also confirmed it by outputing a log statement in the
> > prepareToRead():
> > > > >
> > > > > @Override
> > > > > public void prepareToRead(@SuppressWarnings("rawtypes")
> > > RecordReader
> > > > > reader, PigSplit split)
> > > > > throws IOException {
> > > > > String path =
> > > > >
> > > >
> > >
> >
> ((FileSplit)split.getWrappedSplit(split.getSplitIndex())).getPath().toString();
> > > > > partitions = getPartitions(table, path);
> > > > > log.info("Preparing to read: " + path);
> > > > > this.reader = reader;
> > > > > }
> > > > >
> > > > > 2012-01-06 16:27:24,165 INFO
> > > >
> > >
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader:
> > > > Current split being processed
> > > >
> > >
> >
> hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-5:0+61870852012-01-06
> > > > 16:27:24,180 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader:
> > Loaded
> > > > native gpl library2012-01-06 16:27:24,183 INFO
> > > > com.hadoop.compression.lzo.LzoCodec: Successfully loaded &
> initialized
> > > > native-lzo library [hadoop-lzo rev
> > > > 2dd49ec41018ba4141b20edf28dbb43c0c07f373]2012-01-06 16:27:24,189 INFO
> > > > com.proclivitysystems.etl.pig.udf.loaders.HiveLoader: Preparing to
> > read:
> > > >
> > >
> >
> hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-52012-01-06
> > > > 16:27:28,053 INFO
> > > >
> > >
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader:
> > > > Current split being processed
> > > >
> > >
> >
> hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-6:0+61814752012-01-06
> > > > 16:27:28,056 INFO
> com.proclivitysystems.etl.pig.udf.loaders.HiveLoader:
> > > > Preparing to read:
> > > >
> > >
> >
> hdfs://tuamotu:9000/user/hive/warehouse/cobra_client_consumer_cag/client_tid=3/cag_tid=150/150-r-5
> > > > >
> > > > >
> > > > > Notice how the pig is correctly reporting the split but my "info"
> > > > > statement is always reporting the first input split vs current.
> > > > >
> > > > > Bug? Jira? Patch?
> > > > >
> > > > > Thanks
> > > > > Alex R
> > > > >
> > > >
> > >
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Jira is down?

2012-01-02 Thread Aniket Mokashi
Looks like ASF JIRA is down. Is this scheduled downtime? Where should I
subscribe to get status updates about it?

https://issues.apache.org/jira

Thanks,
Aniket

-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Possible Pig 9.1 globing bug in parameter substitution

2011-12-27 Thread Aniket Mokashi
I tried
pig --param "input=s3n://bucket_path/*/" test.pig

It worked for me. I am on EMR Pig 0.9.1.

Thanks,
Aniket

On Tue, Dec 27, 2011 at 3:35 PM, Corbin Hoenes  wrote:

> I am not sure Ayon doesn't have something here.  I am seeing a similar
> problem with the 0.9.1 build of pig.  But when I run with 0.9.0 it doesn't
> have that problem.
>
> Did something with pattern substitution change from 0.9.0 --> 0.9.1?
>  Haven't run it through a debugger yet but that is the next step tomorrow
> if someone doesn't know of some patch I'm missing?
>
> On Dec 15, 2011, at 12:25 PM,  <
> william.dowl...@thomsonreuters.com> wrote:
>
> > If
> >  -param input=s3n://foo/bar/baz/*/ blah.pig
> > is part of a command line, you'd have to add quotes:
> >  -param 'input=s3n://foo/bar/baz/*/' blah.pig
> > to inhibit your shell from trying to interpret the *.
> >
> >
> > William F Dowling
> > Senior Technologist
> > Thomson Reuters
> > 0 +1 215 823 3853
> >
> >
> > -Original Message-
> > From: Ayon Sinha [mailto:ayonsi...@yahoo.com]
> > Sent: Thursday, December 15, 2011 2:18 PM
> > To: Pig Mailinglist
> > Subject: Possible Pig 9.1 globing bug in parameter substitution
> >
> > when using -param input=s3n://foo/bar/baz/*/ blah.pig
> > it throws
> >
> > java.lang.NullPointerException
> > at
> org.apache.pig.tools.parameters.ParameterSubstitutionPreprocessor.genSubstitutedFile(ParameterSubstitutionPreprocessor.java:79)
> > at org.apache.pig.Main.runParamPreprocessor(Main.java:710)
> > at org.apache.pig.Main.run(Main.java:517)
> > at org.apache.pig.Main.main(Main.java:108)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > at java.lang.reflect.Method.invoke(Method.java:597)
> > at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >
> > It works when my load statement is changed from:
> > a = load '$input' using PigStorage();
> >
> > to
> >
> > a = load 's3n://foo/bar/baz/*/' using PigStorage();
> >
> > (I'm under a deadline so can't file a JIRA bug rightaway)
> >
> > -Ayon
> > See My Photos on Flickr
> > Also check out my Blog for answers to commonly asked questions.
>
>


-- 
"...:::Aniket:::... Quetzalco@tl"


Re: macros and global variables

2011-12-27 Thread Aniket Mokashi
Thanks Daniel, I will create a jira and submit a patch soon.

I was wondering if there is a way to register (define) a set of jars and
macros by default every time the pig command is run. Please let me know.

AFAIK, for jars I can put them into pig.additional.jars, and for params
(defines) I can do --param key=value (macros? not sure). I feel it is too
clumsy to pass these as command-line params.
Also, is there a way to have
logs = load 'standard_location' using LogLoader();
as a default statement that is run every time pig is started (similar to
bashrc)? If not, should we support this? Thoughts?
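
Something like the sketch below is what I have in mind (hypothetical -- it
assumes a bootup-file mechanism that runs before every grunt session or
script; the jar path and com.example.logging.LogLoader are placeholders for
whatever a team actually shares):

-- ~/.pigbootup
register /path/to/piggybank.jar;
DEFINE LogLoader com.example.logging.LogLoader();
logs = load 'standard_location' using LogLoader();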

Thanks,
Aniket

On Fri, Dec 23, 2011 at 4:10 PM, Daniel Dai  wrote:

> Aniket,
> We don't have a macro repository yet, but seems it's a good time to create
> one. Make a macro directory inside piggybank seems to be a good place. It
> would be great if you can open a Jira and attach a patch.
>
> Thanks,
> Daniel
>
> On Thu, Dec 22, 2011 at 3:39 PM, Aniket Mokashi 
> wrote:
>
> > Hi,
> >
> > I was wondering if there is a place to store common macros and global
> > parameters in pig (pigrc?). This should be available to all the users
> > accessing pig via grunt or script.
> > Please let me know if you have any pointers.
> >
> > Thanks,
> > Aniket
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"


macros and global variables

2011-12-22 Thread Aniket Mokashi
Hi,

I was wondering if there is a place to store common macros and global
parameters in pig (pigrc?). This should be available to all the users
accessing pig via grunt or script.
Please let me know if you have any pointers.

Thanks,
Aniket


Re: My notes for running Pig from EC2 to EMR

2011-12-16 Thread Aniket Mokashi
Amazon supports pig 0.9.1 now. Take a look-
http://aws.amazon.com/releasenotes/Elastic-MapReduce/1044996466833146

Also, I am not very sure about copying EMR jars to EC2. You should check
that with Amazon.

Thanks,
Aniket

On Fri, Dec 16, 2011 at 12:02 PM, Ayon Sinha  wrote:

> This might get outdated quickly as EMR upgrades the Pig version and Pig
> 0.9.1 is being used by everyone anyway. But here is my write-up for your
> review:
>
> The main obstacles for running Pig on Elastic MapReduce (EMR) are:
>
>* Pig version installed on EMR is older than 0.8.1. (By some
> accounts EMR just upgraded their Pig version to 0.9.1)
>* Hadoop Version on EMR might not match the one Pig is using.
>* The user you’re running Pig as might not have permissions on the
> HDFS on the EMR cluster.
>
> How to solve each one of these issues:
>1. We will not be using Pig that is installed on EMR. We will use
> an EC2 instance as the Pig client which compiles the Pig Scripts and
> submits MapReduce jobs to the Hadoop on EMR. For this to work, the Hadoop
> version that Pig is using and whats installed on EMR must match (or at
> least be backward compatible). i.e. EMR hadoop version should be >= Pig’s
> Hadoop version.
>2. The best way to do this is to copy over the Hadoop directory
> from one of the EMR instances to the Pig client EC2 machine. The next
> problem is to make Pig use this hadoop rather than the one its been using.
> For Pig version 8.1 or earlier Pig jar has hadoop classes bundled within so
> any attempt at making Pig use the jars downloaded from EMR fails. The
> solution was to use Pig 0.9.1 which had a pigwithouthadoop.jar. When you
> use this it will use whichever hadoop you make HADOOP_HOME point to, which
> in this case will be the directory where you downloaded the EMR classes and
> configs.
>3. Now that you are using Pig 0.9.1 your version might have a big
> in the pig executable (in /bin )script where it does not
> respect the HADOOP_HOME. So patch the script.
>4. Now you want Pig to be using the Jobtracker and Namenode of the
> EMR cluster you want the computation to be on. Follow one of the usual ways
> to do this:
>1. -Dmapred.job.tracker= -Dfs.default.name=. The
> jt & nn IP will be the internal 10.xxx.xxx.xxx IP of the master EMR node.
> ports are 9000 and 9001 for the NN & JT respectively.
>2. pig.properties file in conf dir.
>3. change core-site.xml & mapred-site.xml in the local
> $HADOOP_HOME/conf dir.
>
> The precedence is a > b > c
>1. Now Pig will start but will fail if the use you are running Pig
> as does not match default EMR user which is hadoop. So this is what I do on
> the EMR:
>1. hadoop dfs -fs hdfs://:9000
> -mkdir /user/piguser;hadoop dfs -fs hdfs:// 10.xxx.xxx.xxx>:9000 -chmod -R 777  /
>2. You can argue that 777 is too generous, but I don't care as its
> the temporary files that are stored and they are gone once my instance is
> gone. All my real data is on S3.
> Now you should be all set.
> Only steps 4 & 5 need to be done every time you start you new EMR cluster.
>
>
>
>  -Ayon
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Pig counters with PigServer strange behavior

2011-12-06 Thread Aniket Mokashi
There is a good blog article on this-
http://squarecog.wordpress.com/2010/12/24/incrementing-hadoop-counters-in-apache-pig/

Thanks,
Aniket

On Tue, Dec 6, 2011 at 1:49 PM, Charles Menguy <
cmen...@proclivitysystems.com> wrote:

> Hi All,
>
> I'm trying to play with counters with PigServer and have a couple issues.
>
> First, I've found very little documentation on how to do this, so I'm not
> sure if the method I'm trying is the good one, any feedback would be
> appreciated.
>
> From what I understand, we need a PigStats in order to be able to retrieve
> the counters from it.
> To get this PigStats from a PigServer instance, here is what I do :
>
> pigServer.setBatchOn(); // needed to enable batch mode, which seems to be
> the only way to get the ExecJob instances needed to get the stats
> pigServer.registerScript(pigScript, params); // register the script i want
> to run
> List execJobs = pigServer.executeBatch(); // get the ExecJobs
> associated with the script i just ran
>
> Now I am supposed to be able to get the counters from this ExecJob class.
>
> for (ExecJob execJob : execJobs) {
> for (JobStats jobStats : execJob.getStatistics().getJobGraph()) { // not
> sure why we need to use the job graph to get the stats but that seems to be
> the only solution i found
>Counters counters = jobStats.getHadoopCounters(); // this is always
> NULL !
>  for (Group group : counters) {
> for (Counter counter : group) {
>
> Now the strange thing is that every time I call the getHadoopCounters(),
> the resulting Counters object is null, and thus I cannot get any counter at
> all.
>
> This happens in local and mapreduce mode, and I checked that execJob and
> jobStats are indeed not null.
>
> Am I doing something wrong here to get the counters, or forgetting
> something? I'm using pig 0.8.1 from cdh3u1
>
> Thanks for your help !
>
> --
> Charles Menguy | Senior Software Engineer
> Proclivity Systems
> 22 West 19th Street | 9th Floor
> New York, NY 10011
> cmen...@proclivitysystems.com
> www.proclivitysystems.com
>
> Proclivity® | We Value Your Customers™
>
> This message is the property of Proclivity Systems, Inc. and is intended
> only for the use of the addressee(s), and may contain material that is
> confidential and privileged for the sole use of the intended recipient. If
> you are not the intended recipient, reliance or forwarding without express
> permission is strictly prohibited; please contact the sender and delete all
> copies.
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Welcome Daniel Dai - our new PMC Chair!

2011-11-03 Thread Aniket Mokashi
Congrats Daniel!! That's great news :)

On Thu, Nov 3, 2011 at 9:31 AM, Ashutosh Chauhan wrote:

> Congratulations, Daniel!
>
> Ashutosh
> On Thu, Nov 3, 2011 at 04:18, Gianmarco De Francisci Morales <
> g...@apache.org> wrote:
>
> > Congrats Daniel, that's great!
> > --
> > Gianmarco
> >
> >
> >
> > On Wed, Nov 2, 2011 at 23:08, Olga Natkovich 
> wrote:
> >
> > > It is my pleasure to announce that Apache Board has approved the
> > > nomination of Daniel Dai as our new PMC Chair. Congrats Daniel, well
> > > deserved!
> > >
> > >
> > >
> > > Olga
> > >
> > >
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Flatten a bag to a specific datatype

2011-06-22 Thread Aniket Mokashi
Hi,

I think a BagToTuple UDF should do it for you.
From some old email thread, I found the version below; it used the old
getBagField/getAtomField API, so here it is updated to the current EvalFunc
API (imports from org.apache.pig and org.apache.pig.data omitted):

public class BagToTuple extends EvalFunc<Tuple> {
  @Override
  public Tuple exec(Tuple input) throws IOException {
    DataBag bag = (DataBag) input.get(0);                 // e.g. {(12),(4),(7),(190)}
    Tuple output = TupleFactory.getInstance().newTuple();
    for (Tuple t : bag) {
      output.append(t.get(0));                            // append the single field of each inner tuple
    }
    return output;                                        // (12,4,7,190)
  }
}
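
A usage sketch (the jar name is illustrative; it assumes the class above has
been compiled and registered):

register myudfs.jar;
B = foreach A generate BagToTuple(somebag);   -- {(12),(4),(7),(190)} becomes (12,4,7,190)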

Thanks,
Aniket

On Wed, Jun 22, 2011 at 9:28 AM, Jonathan Holloway <
jonathan.hollo...@gmail.com> wrote:

> I'm having trouble trying to flatten a bag to a tuple of int's in Pig,
>
> e.g.
>
> {(12),(4),(7),(190)}
>
> to:
>
> (12,4,7,190)
>
> It seems like it should be trivial to do, but not quite sure how to do it.
>  Can this by done with inbuilt Pig
> commands or do i need a custom UDF or an exec?
>
> Many thanks,
> Jon.
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: viewing current relationships loaded on the grunt shell

2011-06-13 Thread Aniket Mokashi
Hi Jeremy,

If I understand correctly, you want a list of all the aliases currently
defined in grunt. Is there a particular use case for such a command?
.pig_history shows the last few commands issued in grunt, and running
EXPLAIN on the most downstream alias will show all of the aliases it depends
on within the plan.

Thanks,
Aniket

On Sat, Jun 11, 2011 at 9:19 AM, Jeremy Hanna wrote:

> I looked through the help and the docs pages but couldn't find anything
> that did this.  Is there any way to show a list of current relations loaded
> while on the grunt shell?  It would seem that the information is available,
> just not exposed via a command.
>
> Thanks!
>
> Jeremy




-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Welcome to Aniket Mokashi

2011-05-22 Thread Aniket Mokashi
Hi,

Thank you everyone for all your support. It has been a very enjoyable
experience to work with pig community.

I plan to get involved through the GSoC platform to contribute to the Pig
project. I will be working on adding support for nested foreach, and I will
also try to work on JIRAs related to this support (please assign related
JIRAs to me). My GSoC proposal can be found at
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/aniket486/1

I worked on a couple of interesting projects at Yahoo last summer to learn
about the internals of the Pig parser, logical plan building, and the
construction of the physical and MR plans from the logical plan. While
working on support for scalars, I learnt about the various passes Pig makes
to rewrite plans for optimized execution, and their limitations. In Pig 0.9
a few things have changed with the parsers and optimizers, so it would be
very helpful if you could share any comments and remarks on my approach.

Here are my current thoughts on support of Nested Foreach -(
https://issues.apache.org/jira/browse/PIG-1631)
Pig currently supports nested projection (nested_proj), which internally
streams the bag. This support can be extended by attaching an inner plan to
that streaming via nested_foreach. The first step is to add parser support
for this, but further changes will be needed to restrict the generic inner
plan according to Pig's limitations. Currently, I am exploring various
possibilities for adding buildNestedForeachOp to the logical plan builder,
with or without using the existing "generate_clause". I will upload a patch
to JIRA once I get projection support working through nested foreach.
Please let me know your comments on the same.

Thanks,
Aniket

On Thu, May 19, 2011 at 1:19 PM, Ashutosh Chauhan wrote:

> Congratulations, Aniket!
> Hoping to see many more contributions in Pig from you.
>
> Ashutosh
> On Thu, May 19, 2011 at 10:08, Alan Gates  wrote:
> > Please join me in welcoming Aniket Mokashi as a new committer on Pig.
> >  Aniket has been contributing to Pig since last summer.  He wrote or
> helped
> > shepherd several major features in 0.8, including the Python UDF work,
> the
> > new mapreduce functionality, and the custom partitioner.  We look forward
> to
> > more great work from him in the future.
> >
> > Alan.
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Understanding incompatibilities with different versions of hadoop?

2011-05-05 Thread Aniket Mokashi
Hi Jonathan,

I compiled the Pig trunk pigwithouthadoop.jar and it works fine with CDH3
(add the CDH3 libs to the classpath).
I think the Pig version bundled with CDH3 is 0.7.

Thanks,
Aniket

On Tue, May 3, 2011 at 9:34 AM, Jonathan Coveney  wrote:

> Thanks Alan
>
> 2011/5/3 Alan Gates 
>
> > We, the Yahoo Pig team, test Pig against 0.20.2 Hadoop and the internal
> > Yahoo version of Hadoop (hopefully soon to be released through Apache as
> > 0.20.203).  My impression of CHD3 was that it was very close to 0.20.203
> > with HDFS append added.  The Cloudera guys would better be able to answer
> > what changes in CDH3 make it not compatible with Pig trunk.  They also
> give
> > a list of patches that they apply that makes their release different from
> > the official Apache version.  You could look through the patches they
> > applied to Pig 0.8 to see what they needed to make it work with their
> > version of 0.20.
> >
> > Alan.
> >
> >
> > On May 3, 2011, at 8:13 AM, Jonathan Coveney wrote:
> >
> >  I was wondering if there was any documentation around (or if anyone
> simply
> >> knew) which versions of pig work with which versions of Hadoop? We are
> now
> >> using 0.20.2 CDH3 and it is not compatible with the pig trunk...I'm
> hoping
> >> to understand why.
> >>
> >> Thanks
> >> Jon
> >>
> >
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"


Re: Pig FILTER with INDEXOF not working

2011-04-22 Thread Aniket Mokashi
I think the fix is to change

tuple.set(0, new DataByteArray(url));

to

tuple.set(0, url);

so that the field declared as chararray in the AS clause actually holds a
String rather than a DataByteArray.

Thanks,
Aniket

On Fri, April 22, 2011 8:30 pm, Steve Watt wrote:
> Richard, if you're coming to OSCON or Hadoop Summit, please let me know
> so I can buy you a beer. Thanks for the help. This now works for with the
> excite log using PigStorage();
>
> It is however still not working with my custom LoadFunc and data. For
> reference, I am using Pig 0.8. I have written a custom LoadFunc for Apache
>  Nutch Segments that reads in each page that is crawled and represents it
> as a Tuple of (Url, ContentType, PageContent) as shown in the script
> below:
>
>
> webcrawl = load 'crawled/segments/20110404124435/content/part-0/data'
>  using com.hp.demo.SegmentLoader() AS (url:chararray, type:chararray,
> content:chararray);
> companies = FILTER webcrawl BY (INDEXOF(url,'comp') >= 0); dump companies;
>
> This keeps failing with ERROR 1071: Cannot convert a
> generic_writablecomparable to a String. However, If I change the script to
>  the following (remove schema type & straight dump after load), it works:
>
>
> webcrawl = load 'crawled/segments/20110404124435/content/part-0/data'
>  using com.hp.demo.SegmentLoader() AS (url, type, content); dump webcrawl;
>
>
> Clearly, as soon as I inject types into the Load Schema it starts
> bombing. Can anyone tell me what I am doing wrong? I have attached my
> Nutch LoadFunc
> below for reference:
>
> public class SegmentLoader extends FileInputLoadFunc {
>
>     private SequenceFileRecordReader reader;
>     protected static final Log LOG = LogFactory.getLog(SegmentLoader.class);
>
>     @Override
>     public void setLocation(String location, Job job) throws IOException {
>         FileInputFormat.setInputPaths(job, location);
>     }
>
>     @SuppressWarnings("unchecked")
>     @Override
>     public InputFormat getInputFormat() throws IOException {
>         return new SequenceFileInputFormat();
>     }
>
>     @SuppressWarnings("unchecked")
>     @Override
>     public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
>         this.reader = (SequenceFileRecordReader) reader;
>     }
>
>     @Override
>     public Tuple getNext() throws IOException {
>         try {
>             if (!reader.nextKeyValue()) {
>                 return null;
>             }
>             Content value = (Content) reader.getCurrentValue();
>             String url = value.getUrl();
>             String type = value.getContentType();
>             String content = value.getContent().toString();
>             Tuple tuple = TupleFactory.getInstance().newTuple(3);
>             tuple.set(0, new DataByteArray(url));
>             tuple.set(1, new DataByteArray(type));
>             tuple.set(2, new DataByteArray(content));
>             return tuple;
>         } catch (InterruptedException e) {
>             throw new ExecException(e);
>         }
>     }
> }
>
>
> On Fri, Apr 22, 2011 at 5:17 PM, Richard Ding 
> wrote:
>
>
>> raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
>>  query:chararray);
>>
>>
>> queries = FILTER raw BY (INDEXOF(query,'yahoo') >= 0); dump queries;
>>
>>
>> On 4/22/11 2:25 PM, "Steve Watt"  wrote:
>>
>>
>> Hi Folks
>>
>>
>> I've done a load of a dataset and I am attempting to filter out
>> unwanted records by checking that one of my tuple fields contains a
>> particular string. I've distilled this issue down to the sample
>> excite.log that ships with Pig for easy recreation. I've read through
>> the INDEXOF code and I think this should work (lots of queries that
>> contain the word yahoo) but my queries dump always contains zero
>> records. Can anyone tell me what I am doing wrong?
>>
>> raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
>>  query); queries = FILTER raw BY (INDEXOF(query,'yahoo') > 0); dump
>> queries;
>>
>> Regards
>> Steve Watt
>>
>>
>>
>




Re: Filter on contents of other dataset

2011-04-14 Thread Aniket Mokashi
Thanks Mridul,

(Although 'small' might grow bigger.) For instance, let's assume 'small' fits
in memory and is stored in a local file.

When does my UDF load the data from the file? Earlier, I wrote a bag
loader that returns a bag of the small data (e.g. load 'smalldata' using
BagLoader() as (smallbag)), but then I had to write CONTAINSBAG(hdata,
smallbag) to make this work.

I think your solution would solve my problem, but how do I make my UDF
read the file? Can you give me some pointers?

Thanks,
Aniket


On Thu, April 14, 2011 11:29 pm, Mridul Muralidharan wrote:
>

> The way you described it, it does look like an application of cross.
>
>
> How 'small' is small ?
> If it is pretty small, you can avoid the shuffle/reduce phase and
> directly stream huge through a udf which does a task local cross with
> 'small' (assuming it fits in memory).
>
>
>
> %define my_udf MYUDF('smalldata')
>
>
> huge = load 'mydata' as (hkey:chararray, hdata:chararray); filtered =
> FILTER huge BY my_udf(hkey, hdata);
>
>
>
>
> Where my_udf returns true if there exists some skey in smalldata for
> which F(hdata, skey) is true - as you defined.
>
>
> Regards,
> Mridul
>
>
> On Friday 15 April 2011 08:51 AM, Aniket Mokashi wrote:
>
>> Hi,
>>
>>
>> What would be the best way to write this script?
>> I have two datasets - huge (hkey, hdata), small(skey). I want to filter
>> all the data from huge dataset for which F(hdata, skey) is true. Please
>> advise.
>>
>> For example,
>> huge = load 'mydata' as (key:chararray, value:chararray); small = load
>> 'smalldata' as skey:chararray;
>> h_s_cross = cross huge, small; filtered = foreach h_s_cross generate
>> CONTAINS(value, skey);
>>
>>
>> Thanks,
>> Aniket
>>
>>
>
>
>




Filter on contents of other dataset

2011-04-14 Thread Aniket Mokashi
Hi,

What would be the best way to write this script?
I have two datasets - huge (hkey, hdata), small(skey). I want to filter
all the data from huge dataset for which F(hdata, skey) is true.
Please advise.

For example,
huge = load 'mydata' as (key:chararray, value:chararray);
small = load 'smalldata' as skey:chararray;
h_s_cross = cross huge, small;
filtered = filter h_s_cross by CONTAINS(value, skey);

Thanks,
Aniket



Re: CDH3 fail python udf

2011-04-01 Thread Aniket Mokashi
Method.java:597) at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1849) at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1947)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1871) at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
> at java.util.ArrayList.readObject(ArrayList.java:593) at
> sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorI
> mpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597) at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1849) at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
> at java.util.HashMap.readObject(HashMap.java:1030) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java
> :39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorI
> mpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597) at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1849) at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1947)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1871) at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1947)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1871) at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
> at
> org.apache.pig.impl.util.ObjectSerializer.deserialize(ObjectSerializer.ja
> va:53)
> ... 9 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAc
> cessorImpl.java:39)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConst
> ructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at
> org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:47
> 0)
> ... 87 more
> Caused by: java.lang.IllegalStateException: Could not initialize:
> /home/shawn/TESS/code/mypyudfs.py
> at
> org.apache.pig.scripting.jython.JythonFunction.(JythonFunction.java
> :86)
> ... 92 more
> 2011-04-01 14:31:40,977 INFO org.apache.hadoop.mapred.Task: Runnning
> cleanup for the task
>
> Thanks!
>
>
> Regards,
> Shawn
>
>
> On Fri, Apr 1, 2011 at 2:24 PM, Aniket Mokashi 
> wrote:
>
>> Hi Shawn,
>>
>>
>> Every time we throw an Exception with 'could not instantiate ..' error
>> message, we also pass down the real exception instance, this might be
>> able to point to the reason why we fail in this scenario. Can you provide
>> details of your exception message from the log?
>>
>> The way this works is, when you register the myudf.py script we
>> register all the function names inside script to pig and when we use
>> these functions, we parse and construct them with JythonFunction
>> constructor.
>>
>> Thanks,
>> Aniket
>>
>>
>> On Fri, April 1, 2011 12:06 pm, Xiaomeng Wan wrote:
>>
>>> Hi Aniket,
>>>
>>>
>>>
>>> We put both jython.jar and myudf.py in classpath and also register
>>> jython.jar in our pig script. It worked well before the upgrading,
>>> only failed after.
>>>
>>> Regards,
>>> Shawn
>>>
>>>
>>>
>>> On T

Re: CDH3 fail python udf

2011-04-01 Thread Aniket Mokashi
Hi Shawn,

Every time we throw an exception with a 'could not instantiate ..' error
message, we also pass along the underlying exception instance, which should
point to the real reason for the failure in this scenario.
Can you provide the details of the exception message from the log?

The way this works is: when you register the myudf.py script, we register
all the function names inside the script with Pig, and when those functions
are used we parse and construct them with the JythonFunction constructor.
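
For reference, this is the registration I would expect to work once the jar
is actually visible to the job (paths and the function name are illustrative):

register /path/to/jython.jar;
register 'myudf.py' using jython as myfuncs;
B = foreach A generate myfuncs.somefunction($0);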

Thanks,
Aniket

On Fri, April 1, 2011 12:06 pm, Xiaomeng Wan wrote:
> Hi Aniket,
>
>
> We put both jython.jar and myudf.py in classpath and also register
> jython.jar in our pig script. It worked well before the upgrading, only
> failed after.
>
> Regards,
> Shawn
>
>
> On Thu, Mar 31, 2011 at 4:38 PM, Aniket Mokashi 
> wrote:
>
>> I think this might be because when you start in hadoop mode, your
>> classpath configuration does not have jython.jar. Can you put that
>> explicitly in classpath and check it out?
>>
>> Thanks,
>> Aniket
>>
>>
>> On Thu, March 31, 2011 6:07 pm, Xiaomeng Wan wrote:
>>
>>> Hi,
>>> We recently updated our hadoop from CDH2 to CDH3b4, and had problems
>>> using some old python udfs. Runing in local mode still works, but in
>>> hadoop mode, it gives errors like "could not instantiate
>>> 'org.apache.pig.scripting.jython.JythonFunction' with arguments...".
>>> Anyone see similar error with python udf on this hadoop distribution?
>>> We are using pig 0.8.0. Thanks!
>>>
>>>
>>>
>>> Regards
>>> Shawn
>>>
>>>
>>>
>>>
>>
>>
>>
>
>




Re: Custom Storage Functions - MultiStorage

2011-03-31 Thread Aniket Mokashi
In my opinion, MultiStorage should work just fine if you have a small number
of buckets (on the order of 100+, not sure of the exact limit, but definitely
not 512), even if you have a large number of records in one bucket.
But I think this method is error-prone in the face of task failures. A more
scalable way is to generate files with tagged names and then move them into
one directory.
If you take a bag of grouped tuples and change your partitioner so that more
than one reducer writes into one directory, it should work too. But this is
only useful if your bucket sizes are uniformly distributed (and again there
is a limit on the number of buckets).

~Aniket
On Thu, March 31, 2011 5:17 pm, Dmitriy Ryaboy wrote:
> I think the problem there is # of unique keys -- one winds up creating
> way too many filehandles all at the same time. I may be misunderstanding
> the nature of the bug. If I do understand it correctly, it's endemic to the
> whole concept of MultiStorage; creating 7K files * # reducers sounds like
> a really bad thing to do; if you are running into the problem, you
> probably shouldn't be using MultiStorage.
>
>
> Or am I misreading what's happening?
>
>
> D
>
>
> On Thu, Mar 31, 2011 at 9:12 AM, Jonathan Holloway <
> jonathan.hollo...@gmail.com> wrote:
>
>> Hi all,
>>
>>
>> I'm working with some data at the moment, for which I needed to
>> generate multiple reports for a given grouped set of data by name. I
>> wasn't initially sure about how to do this, I came across MultiStorage
>> in Pig contrib, but a little worried about the 7k limit there at
>> the moment due to a bug:
>>
>> https://issues.apache.org/jira/browse/PIG-1547
>>
>>
>> Does anybody know what the issue here is - I can take a look at this if
>>  necessary and someone can point me in the right way in terms of fixing
>> it?  I've currently hacked MultiStorage to take a bag and the contained
>> tuples and spit out the tuples with a tab delimiter between them.  Is
>> this the best way to go?
>>
>> Just looking for some feedback.
>>
>>
>> Cheers,
>> Jon.
>>
>>
>




Re: CDH3 fail python udf

2011-03-31 Thread Aniket Mokashi
I think this might be because, when you start in hadoop mode, your
classpath configuration does not include jython.jar. Can you put it
explicitly on the classpath and check?
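
A minimal way to check (paths are placeholders): either register the jar in
the script, which should also ship it with the job,

register '/path/to/jython.jar';

or export PIG_CLASSPATH=/path/to/jython.jar before launching pig, which
affects the client-side classpath.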

Thanks,
Aniket

On Thu, March 31, 2011 6:07 pm, Xiaomeng Wan wrote:
> Hi,
> We recently updated our hadoop from CDH2 to CDH3b4, and had problems
> using some old python udfs. Running in local mode still works, but in hadoop
> mode, it gives errors like "could not instantiate
> 'org.apache.pig.scripting.jython.JythonFunction' with arguments...".
> Anyone see similar error with python udf on this hadoop distribution?
> We are using pig 0.8.0. Thanks!
>
>
> Regards
> Shawn
>
>
>




Re: UDF problem: Java Heap space

2011-02-24 Thread Aniket Mokashi
Thanks everyone for helping me out. I figured out it was one of those logical
errors that lead to infinite loops. Actually, the indexOf operation doesn't
always return -1 on failure, which was causing this to get into an infinite
loop (I should have thought about this); i.e. indexOf('[', 187) would return
187 and the loop would continue forever.
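
For the record, a minimal sketch of a terminating version of the loop (same
variable names as before; this is just one way to make startInd always move
forward):

int startInd = 0;
int endInd;
while ((startInd = someLog.indexOf('[', startInd)) >= 0) {
    endInd = someLog.indexOf(']', startInd);
    if (endInd < 0) {
        break;  // no closing bracket left, stop instead of spinning
    }
    cats.add(someLog.substring(startInd, endInd + 1));
    startInd = endInd + 1;  // resume the search after the ']'
}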
Thanks again,
Aniket

On Thu, February 24, 2011 7:47 pm, Aniket Mokashi wrote:
> This is a map side udf.
> pig script loads a log file and grabs contents inside angle brackets. a =
> load; b = foreach a generate F(a); dump b;
>
> I see following on tasktrackers-
> 2011-02-23 18:01:25,992 INFO
> org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call
>  - Collection threshold init = 5439488(5312K) used = 409337824(399743K)
> committed = 534118400(521600K) max = 715849728(699072K) 2011-02-23
> 18:01:26,102 INFO
> org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> call- Usage threshold init = 5439488(5312K) used = 546751088(533936K)
> committed = 671547392(655808K) max = 715849728(699072K)
>
> I am trying out some changes in udf to see if they work.
>
>
> Thanks,
> Aniket
>
>
> On Thu, February 24, 2011 7:25 pm, Daniel Dai wrote:
>
>> Hi, Aniket,
>> What is your Pig script? Is the UDF in map side or reduce side?
>>
>>
>>
>> Daniel
>>
>>
>>
>> Dmitriy Ryaboy wrote:
>>
>>
>>> That's a max of 3.3K single-character strings. Even with the java
>>> overhead that shouldn't be more than a meg right? none of these should
>>>  make it out of young gen assuming the list "cats" doesn't stick
>>> around outside the udf.
>>>
>>> On Thu, Feb 24, 2011 at 3:49 PM, Aniket Mokashi
>>> wrote:
>>>
>>>
>>>
>>>
>>>> Hi Jai,
>>>>
>>>>
>>>>
>>>> Thanks for your email. I suspect that its the Strings in tight loop
>>>>  reason as you have suggested. I have a loop in my udf that does
>>>> the following.
>>>>
>>>> while((startInd = someLog.indexOf('[',startInd)) > 0) { endInd =
>>>> someLog.indexOf(']', startInd); if(endInd > 0) { category =
>>>> someLog.substring(startInd, endInd+1); cats.add(category); }
>>>> startInd = endInd; }
>>>>
>>>>
>>>> My jobs are failing in both local and mr mode. UDF works fine for a
>>>>  smaller input (a few lines). Also, I checked that sizeof someLog
>>>> doesnt exceed a 1.
>>>>
>>>> Thanks,
>>>> Aniket
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, February 24, 2011 3:58 am, Jai Krishna wrote:
>>>>
>>>>
>>>>
>>>>> Sharing the code would be useful as mentioned. Also of help would
>>>>>  the heap settings that the JVM had.
>>>>>
>>>>> However, off the top of my head, one common situation (esp. in
>>>>> text processing/tokenizing) is instantiating Strings in a tight
>>>>> loop.
>>>>>
>>>>> Besides you could also exercise your UDF in a local JVM and take
>>>>> a heap dump / profile it. If your heap is less than 512M, you
>>>>> could use basic profiling via hprof/hat (see
>>>>> http://java.sun.com/developer/technicalArticles/Programming/HPROF
>>>>> .h
>>>>> tml).
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Jai
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2/24/11 9:26 AM, "Dmitriy Ryaboy"  wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Aniket, share the code?
>>>>> It really depends on how you create them.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -D
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 23, 2011 at 7:49 PM, Aniket Mokashi
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> I ve written a simple UDF that parses a chararray (which looks
>>>>>> like ...[a].[b]...[a]...) to capture stuff inside brackets
>>>>>> and return them as String a=2;b=1; and so on. The input
>>>>>> chararray are rarely more than 1000 characters and are not more
>>>>>> than 10 (I ve added log.warn in my udf to ensure this). But,
>>>>>> I still see java
>>>>>> heap error while running this udf (even in local mode, the job
>>>>>> simply fails). My assumption is maps and lists that I use
>>>>>> locally will be recollected by gc. Am I missing something?
>>>>>>
>>>>>> Thanks,
>>>>>> Aniket
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>
>>
>>
>
>
>
>




Re: UDF problem: Java Heap space

2011-02-24 Thread Aniket Mokashi
This is a map-side UDF.
The Pig script loads a log file and grabs the contents inside the brackets:
a = load; b = foreach a generate F(a); dump b;

I see following on tasktrackers-
2011-02-23 18:01:25,992 INFO
org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call
- Collection threshold init = 5439488(5312K) used = 409337824(399743K)
committed = 534118400(521600K) max = 715849728(699072K)
2011-02-23 18:01:26,102 INFO
org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
call- Usage threshold init = 5439488(5312K) used = 546751088(533936K)
committed = 671547392(655808K) max = 715849728(699072K)

I am trying out some changes in the UDF to see if they work.

Thanks,
Aniket

On Thu, February 24, 2011 7:25 pm, Daniel Dai wrote:
> Hi, Aniket,
> What is your Pig script? Is the UDF in map side or reduce side?
>
>
> Daniel
>
>
> Dmitriy Ryaboy wrote:
>
>> That's a max of 3.3K single-character strings. Even with the java
>> overhead that shouldn't be more than a meg right? none of these should
>> make it out of young gen assuming the list "cats" doesn't stick around
>> outside the udf.
>>
>> On Thu, Feb 24, 2011 at 3:49 PM, Aniket Mokashi
>> wrote:
>>
>>
>>
>>> Hi Jai,
>>>
>>>
>>> Thanks for your email. I suspect that its the Strings in tight loop
>>> reason as you have suggested. I have a loop in my udf that does the
>>> following.
>>>
>>> while((startInd = someLog.indexOf('[',startInd)) > 0) { endInd =
>>> someLog.indexOf(']', startInd); if(endInd > 0) { category =
>>> someLog.substring(startInd, endInd+1); cats.add(category); }
>>> startInd = endInd; }
>>>
>>>
>>> My jobs are failing in both local and mr mode. UDF works fine for a
>>> smaller input (a few lines). Also, I checked that sizeof someLog
>>> doesnt exceed a 1.
>>>
>>> Thanks,
>>> Aniket
>>>
>>>
>>>
>>> On Thu, February 24, 2011 3:58 am, Jai Krishna wrote:
>>>
>>>
>>>> Sharing the code would be useful as mentioned. Also of help would
>>>> the heap settings that the JVM had.
>>>>
>>>> However, off the top of my head, one common situation (esp. in text
>>>>  processing/tokenizing) is instantiating Strings in a tight loop.
>>>>
>>>> Besides you could also exercise your UDF in a local JVM and take a
>>>> heap dump / profile it. If your heap is less than 512M, you could
>>>> use basic profiling via hprof/hat (see
>>>> http://java.sun.com/developer/technicalArticles/Programming/HPROF.h
>>>> tml).
>>>>
>>>>
>>>> Thanks,
>>>> Jai
>>>>
>>>>
>>>>
>>>>
>>>> On 2/24/11 9:26 AM, "Dmitriy Ryaboy"  wrote:
>>>>
>>>>
>>>>
>>>> Aniket, share the code?
>>>> It really depends on how you create them.
>>>>
>>>>
>>>>
>>>> -D
>>>>
>>>>
>>>>
>>>> On Wed, Feb 23, 2011 at 7:49 PM, Aniket Mokashi
>>>> wrote:
>>>>
>>>>
>>>>
>>>>
>>>>> I ve written a simple UDF that parses a chararray (which looks
>>>>> like ...[a].[b]...[a]...) to capture stuff inside brackets and
>>>>> return them as String a=2;b=1; and so on. The input chararray are
>>>>> rarely more than 1000 characters and are not more than 10 (I
>>>>> ve added log.warn in my udf to ensure this). But, I still see java
>>>>> heap error while running this udf (even in local mode, the job
>>>>> simply fails). My assumption is maps and lists that I use locally
>>>>> will be recollected by gc. Am I missing something?
>>>>>
>>>>> Thanks,
>>>>> Aniket
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>
>
>




Re: UDF problem: Java Heap space

2011-02-24 Thread Aniket Mokashi
Hi Jai,

Thanks for your email. I suspect it's the Strings-in-a-tight-loop issue, as
you have suggested. I have a loop in my UDF that does the following.

while ((startInd = someLog.indexOf('[', startInd)) > 0) {
    endInd = someLog.indexOf(']', startInd);
    if (endInd > 0) {
        category = someLog.substring(startInd, endInd + 1);
        cats.add(category);
    }
    startInd = endInd;
}

My jobs are failing in both local and MR mode. The UDF works fine for a
smaller input (a few lines). Also, I checked that the size of someLog
doesn't exceed a 1.

Thanks,
Aniket


On Thu, February 24, 2011 3:58 am, Jai Krishna wrote:
> Sharing the code would be useful as mentioned. Also of help would the
> heap settings that the JVM had.
>
> However, off the top of my head, one common situation (esp. in text
> processing/tokenizing) is instantiating Strings in a tight loop.
>
> Besides you could also exercise your UDF in a local JVM and take a heap
> dump / profile it. If your heap is less than 512M, you could use basic
> profiling via hprof/hat (see
> http://java.sun.com/developer/technicalArticles/Programming/HPROF.html ).
>
>
> Thanks,
> Jai
>
>
>
> On 2/24/11 9:26 AM, "Dmitriy Ryaboy"  wrote:
>
>
> Aniket, share the code?
> It really depends on how you create them.
>
>
> -D
>
>
> On Wed, Feb 23, 2011 at 7:49 PM, Aniket Mokashi
> wrote:
>
>
>> I ve written a simple UDF that parses a chararray (which looks like
>> ...[a].[b]...[a]...) to capture stuff inside brackets and return
>> them as String a=2;b=1; and so on. The input chararray are rarely more
>> than 1000 characters and are not more than 10 (I ve added log.warn
>> in my udf to ensure this). But, I still see java heap error while
>> running this udf (even in local mode, the job simply fails). My
>> assumption is maps and lists that I use locally will be recollected by
>> gc. Am I missing something?
>>
>> Thanks,
>> Aniket
>>
>>
>>
>
>




UDF problem: Java Heap space

2011-02-23 Thread Aniket Mokashi
I've written a simple UDF that parses a chararray (which looks like
...[a].[b]...[a]...) to capture the stuff inside the brackets and return it
as a String, a=2;b=1; and so on. The input chararrays are rarely more than
1000 characters and are not more than 10 (I've added log.warn in my
UDF to ensure this). But I still see a Java heap error while running this
UDF (even in local mode, the job simply fails). My assumption is that the
maps and lists I use locally will be collected by the GC. Am I missing
something?

Thanks,
Aniket



FLATTEN custom bags

2011-02-16 Thread Aniket Mokashi
Hi,

I have a custom loader that creates and returns a tuple of (id, bags). I
want to open these bags and get their contents.
For example-
data = load 'loc' using myLoader() as (id, bag1, bag2);
bag1Content = foreach data generate FLATTEN(bag1);
This works.

But when I do bag1Content = foreach data generate id, FLATTEN(bag1) -- it
fails.
How should I fix it?

Thanks,
Aniket




Re: Using a UDF written in Python

2010-12-27 Thread Aniket Mokashi
I think the decorator used here is incorrect.
In general, "output:chararray" needs to be schema-string-compatible. Also,
you are using "outputSchemaFunction", which is meant for a UDF whose output
schema depends on its input schema (e.g. square), and it should be paired
with a function carrying the "schemaFunction" decorator (named "output" in
your case). I think using the "outputSchema" decorator would fix the
problem here.

More details can be found at-
http://wiki.apache.org/pig/UDFsUsingScriptingLanguages
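
Following the examples on that page, a minimal sketch of the two styles (the
function bodies here are placeholders):

# fixed output schema
@outputSchema("output:chararray")
def format_name(name):
    return name.upper()

# output schema derived from the input schema
@outputSchemaFunction("squareSchema")
def square(num):
    return num * num

@schemaFunction("squareSchema")
def squareSchema(input):
    return input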

Thanks,
Aniket

On Mon, December 27, 2010 4:30 pm, Jonathan Coveney wrote:
> so I have module.py, and I want to be able to use it in a pig script. It
> has no special imports or anything. I do have
> @outputSchemaFunction("output:chararray)
>
>
> In my pig script, I have this
>
>
> register '/my/udf/location/udf.py' using jython as myfunc;
>
> is there any reason why this wouldn't work? here is the error I get:
>
> 2010-12-27 16:29:41,288 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 2998: Unhandled internal error. org/python/util/PythonInterpreter
>
>
> Not the most instructive error, but is there anything more I need to be
> doing to be able to use a python UDF?
>
> As an aside, are simply python UDF's as efficient as Java ones? I like
> Python a lot and love the idea of being able to UDF in it, but can use
> java if necessary.
>




Re: Using alias results in future calculations?

2010-10-20 Thread Aniket Mokashi
In Pig 0.8, you can say:
P = foreach G generate G.$0 * C.$0, G.$1;

Other methods are discussed here-
https://issues.apache.org/jira/browse/PIG-1434
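
A slightly fuller sketch of the same idea (field names and load paths are
placeholders; this assumes C really holds exactly one row):

C = load 'c_data' as (factor:int);        -- the single row, e.g. (5)
G = load 'g_data' as (num:int, name:chararray);
H = foreach G generate num * C.factor, name;
dump H;

If C turns out to have more than one row, the scalar reference fails at
runtime.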

Thanks,
Aniket

On Wed, October 20, 2010 3:41 pm, Matt Tanquary wrote:
> I need to use the output of one alias in a future calculation:
>
>
> Suppose I have:
>
>
> C=(5)
>
>
> and then later, I have
>
> G=(1,A)
> (3,B)
> (5,C)
>
>
> then I want to do a foreach on G where I multiply each G.$0 by C.$0,
> ending up with
>
> H=(5,A)
> (15,B)
> (25,C)
>
>
> how can I do that?
>
> I did accomplish it by doing a CROSS against the two tables and then
> running foreach against that, but I want to avoid using CROSS.
>
> Thanks for any ideas!
> -M@
>
>
> --
> Have you thanked a teacher today? ---> http://www.liftateacher.org
>
>