Re: [ANNOUNCE] Congratulations to our new PMC members Rohini Palaniswamy and Cheolsoo Park

2013-09-13 Thread Jonathan Coveney
Exciting time for Pig!! 2013/9/12 ajay kumar > congratulations guys ...! > > > On Thu, Sep 12, 2013 at 11:54 PM, Bill Graham > wrote: > > > Congrats guys! Well deserved indeed. > > > > > > On Wed, Sep 11, 2013 at 10:58 PM, Jarek Jarcec Cecho > >wrote: > > > > > Congratulations Rohini and Cheo

Re: Welcome new Pig Committer - Koji Noguchi

2013-09-13 Thread Jonathan Coveney
Very well deserved!! 2013/9/12 Thejas Nair > Congrats Koji! Very well deserved! > > > On Wed, Sep 11, 2013 at 9:49 AM, Daniel Dai wrote: > > Congratulation! You are well deserved. > > > > > > > > > > On Wed, Sep 11, 2013 at 6:33 AM, Miguel Angel Martin junquera < > > mianmarjun.mailingl...@gma

Re: SchemaTuple doesn't seem to work on YARN

2013-09-04 Thread Jonathan Coveney
Hello! I implemented the SchemaTuple stuff. Glad to hear you're trying it out! I did not test it with YARN at all. It looks like the way that the filesystem and distributed cache work have changed. I myself am not super up on that, but perhaps there is known documentation on how it differs? The wa

Re: Introducing rPig

2013-06-17 Thread Jonathan Coveney
Very cool! 2013/6/17 Russell Jurney > Awesome! > > > On Sun, Jun 16, 2013 at 3:15 PM, Connor Woodson >wrote: > > > I mentioned a few months ago that I was interested in creating a new > > Scripting Engine for Pig based off of the R language. I have finally > gotten > > that project to a point

Re: problems with .gz

2013-06-12 Thread Jonathan Coveney
s (since gz doesn't allow splitting). > > In the uncompressed case, blocks before AND AFTER the nulls were ok and > contributed data to my COUNT(*). > > In the compressed case, only data before the nulls contributed data to my > COUNT(*). > > will > > > On Tue,

Re: problems with .gz

2013-06-11 Thread Jonathan Coveney
William, It would be really awesome if you could furnish a file that replicates this issue that we can attach to a bug in jira. A long time ago I had a very weird issue with some gzip files and never got to the bottom of it...I'm wondering if this could be it! 2013/6/10 Niels Basjes > Bzip2 is

Re: A UDF that is both Algebraic and Accumulator

2013-06-04 Thread Jonathan Coveney
It uses the best one it can. Algebraic is generally better than Accumulator, and if it can use Algebraic it will. If it can't use either, it will use the default EvalFunc. In Pig, there aren't too many cases where an Algebraic/Accumulator EvalFunc will have to be evaluated as an Accumulator...in i
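To make the two evaluation styles above concrete, here is a minimal Python sketch of a COUNT-like function written both ways. It is an illustration only: the names (AccumulatorCount, initial, intermed, final, algebraic_count) are hypothetical stand-ins loosely modeled on the ideas behind Pig's Java interfaces, not Pig's actual API.

```python
class AccumulatorCount:
    """Accumulator style: the bag arrives in chunks and state is kept between calls."""

    def __init__(self):
        self.count = 0

    def accumulate(self, chunk):
        # Called once per chunk of the bag, so the whole bag never has to be in memory.
        self.count += len(chunk)

    def get_value(self):
        return self.count

    def cleanup(self):
        self.count = 0


# Algebraic style: the work is split into three stages that can run in
# different places (hypothetically: map, combiner, reduce).
def initial(record):
    return 1  # each record contributes a partial count of 1


def intermed(partials):
    return sum(partials)  # combine partial counts into a bigger partial count


def final(partials):
    return sum(partials)  # combine whatever partials are left into the answer


def algebraic_count(splits):
    # Each split is processed independently, then the partial results are merged.
    per_split = [intermed([initial(r) for r in split]) for split in splits]
    return final(per_split)


splits = [["a", "b", "c"], ["d", "e"], ["f"]]

acc = AccumulatorCount()
for chunk in splits:
    acc.accumulate(chunk)
accumulated = acc.get_value()

algebraic = algebraic_count(splits)
```

Both styles give the same count here; the algebraic form can additionally pre-aggregate each split before anything is merged, which is the usual reason it is preferred when available.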

Re: Synthetic keys

2013-05-24 Thread Jonathan Coveney
You can do this, but pig has a CROSS keyword that you can use. 2013/5/23 Mehmet Tepedelenlioglu > Hi, > > I am using this: > > x = join a by 1, b by 1 using 'replicated'; > > with the hope that it generates some synthetic key '1' on both a and b and > joins it on that key, thereby, in this cas

Re: Cross product bug pig 0.10?

2013-05-21 Thread Jonathan Coveney
Any chance you could replicate this for us? Ideally some dummy data and a script? 2013/5/19 Mehmet Tepedelenlioglu > Hi, > > Recently I was taking the cross product between 2 bags of tuples one of > which has only one tuple, to append the one with one element to all the > others (I know this is

Re: Nb of reduce tasks when GROUPing

2013-05-19 Thread Jonathan Coveney
Also, look into the TOP udf instead of doing the limit. It can potentially be a lot faster and is cleaner, IMHO. 2013/5/19 Norbert Burger > Take a look at the PARALLEL clause: > > http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+PARALLEL+Clause > > On Fri, May 17, 2013 at 10:48 AM, Vince

Re: does pig support loop and branching now?

2013-05-07 Thread Jonathan Coveney
pig latin does not support it, but it is pretty easy to do it by using the python control flow. this or java is the preferred way of doing it. 2013/5/7 yonghu > Dear all, > > I wonder if someone can tell me if the current version of pig support loop > and branching? > > regards! > > Yong >

Re: Pig question

2013-05-07 Thread Jonathan Coveney
cdh-user to bcc Your question doesn't make much sense...I think you may have left a piece off? 2013/5/7 abhishek > Hi all, > > In my script > > a = load 'data' using PigStorage(); > > b = foreach a generate > 342 as col1, > substring(x,0,4) as col2, > ; > > I want to use col2 later in foreach

Re: Hbase Hex Values

2013-05-06 Thread Jonathan Coveney
= load 'hbase://data' using > org.apache.pig.backend.hadoop.hbase.HBaseStorage('1:*', ' -loadKey true') > AS (id:chararray, data:map[]); > > Would i call the invoke after the load? > > > thanks. > > JM > > > > > > > > ---

Re: Hbase Hex Values

2013-05-06 Thread Jonathan Coveney
You could also use the following (in trunk): https://issues.apache.org/jira/browse/PIG-3198 so you'd do: invoke&Integer.valueOf(x, 16); where x would be the hex string 2013/5/6 Alan Gates > I am not aware of any built in or Piggybank UDF that converts Hex to Int, > but it would be a welcome co
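For reference, the conversion that Integer.valueOf(x, 16) performs can be sketched in Python as below. This only illustrates the base-16 parse itself; hex_to_int is a hypothetical helper name, and Python's int() is a little more lenient than the Java call (it also accepts a "0x" prefix).

```python
def hex_to_int(x):
    """Parse a hex string such as 'ff' or '1A' as a base-16 integer."""
    return int(x, 16)


values = [hex_to_int(s) for s in ["ff", "1A", "0"]]
```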

Re: Pig Unique Counts on Multiple Subsets of a Large Input

2013-05-06 Thread Jonathan Coveney
Are you familiar with the CUBE keyword that was relatively recently added? This sounds like a perfect use case for it. Furthermore, how are you splitting on activity? There is a SPLIT operator which is perfect for this, as you can have a different relation for each one. What I would do would be to

Re: PigServer, load query fails, passes on grunt

2013-05-04 Thread Jonathan Coveney
Why do you have an "as" statement with the store? The schema should come down with the script. That's probably the issue. 2013/5/4 ÐΞ€ρ@Ҝ (๏̯͡๏) > Ignore above query. Its incorrect. > > I have following pig script > A = LOAD 'textinput' using PigStorage() as (a0:chararray, a1:chararray, > a2:ch

Re: Ruby 1.9 and jRuby 1.7

2013-04-11 Thread Jonathan Coveney
Dan, I implemented most of the jruby stuff. Glad to hear you're trying it out! Please let us know what your experience is like. I definitely had plans to upgrade to jruby 1.7, and am not sure why I never did...hmm. Ahh, ok, it's part of this patch...which still isn't committed...you should bump

Re: How is a union of multiple primitives handled?

2013-04-05 Thread Jonathan Coveney
woops, wrong listserv :) 2013/4/5 Jonathan Coveney > The following gist illustrates my question: > > https://gist.github.com/jcoveney/5320422 > > It seems pretty surprising to me that all of these cases all return 1.0, > at least in python (I will now do this in Java, it'

How is a union of multiple primitives handled?

2013-04-05 Thread Jonathan Coveney
The following gist illustrates my question: https://gist.github.com/jcoveney/5320422 It seems pretty surprising to me that all of these cases all return 1.0, at least in python (I will now do this in Java, it's just more verbose). Is this an issue with python? Is this an issue period? Is this une

Re: Sorting/Partitioning of Pig output

2013-03-27 Thread Jonathan Coveney
as far as when the storefunc works, it depends on whether the job is map only or map/reduce. It'll work on the last phase. Generally this is the reduce phase. As far as how pig knows where to send its output, there are keys in pig. Basically, a reduce job is necessary any time you have a group, j

Re: EvalFunc finish() closing connections prematurely

2013-03-26 Thread Jonathan Coveney
> > When initialized, the ParselyMetadataService creates a new Mongo and > > Jedis > > > instance which the EvalFunc queries using a public method fetch(). > > > Instance of ParselyMetadataService also have a close() function which > > > simply calls: > >

Re: String Representation of DataBag and its Schema

2013-03-19 Thread Jonathan Coveney
Ack, hit enter. I'd look at the LoadFunc interface, the PigStorage class, and if you can't make it work without playing a little, let me know. 2013/3/19 Jonathan Coveney > doing "new PigStorage()" is possible, but tricky. Maybe some of the other > contributors have a

Re: String Representation of DataBag and its Schema

2013-03-19 Thread Jonathan Coveney
https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html > >, > would you know how to construct this process from a baseline PigStorage > Object, such as: > > PigStorage pigstorage = new PigStorage(); > > Any ideas? > > -Dan > > On Tue, Mar 1

Re: String Representation of DataBag and its Schema

2013-03-19 Thread Jonathan Coveney
l also make the process more approachable for > another programmer to write additional unit tests. > > -Dan > > On Tue, Mar 19, 2013 at 11:43 AM, Jonathan Coveney >wrote: > > > How are you planning on generating these cases? By hand? Or automated? > > > > >

Re: String Representation of DataBag and its Schema

2013-03-19 Thread Jonathan Coveney
esting of my UDFs. > > -Dan > > On Tue, Mar 19, 2013 at 11:27 AM, Jonathan Coveney >wrote: > > > how was string_databag generated? > > > > > > 2013/3/19 Dan DeCapria, CivicScience > > > > > Expanding upon this, the follow

Re: String Representation of DataBag and its Schema

2013-03-19 Thread Jonathan Coveney
ach > nesting > > from Tuple and DataBag factories, append data, and next them manually. > For > > larger unit tests, this process becomes unwieldy (hundreds of lines per > > method, non-dynamic), and it would be much simpler to go directly from a > > String and a Sc

Re: String Representation of DataBag and its Schema

2013-03-18 Thread Jonathan Coveney
Why not just use PigStorage? This is essentially what it does. It saves a bag as text, and then loads it again. I suppose the question becomes: why do you need to do this? 2013/3/18 Dan DeCapria, CivicScience > In Java, I am trying to convert a DataBag from it's String representation > with it

Re: UDF that takes bag as input and returns another bag

2013-03-18 Thread Jonathan Coveney
andling) > > -Kris > > On Mon, Mar 18, 2013 at 11:19:17AM +0100, Jonathan Coveney wrote: > > Absolutely. > > > > public class MyUdf extends EvalFunc { > > public DataBag exec(Tuple input) throws IOException { > > return (DataBag)input.get(0); > > }

Re: UDF that takes bag as input and returns another bag

2013-03-18 Thread Jonathan Coveney
Absolutely. public class MyUdf extends EvalFunc<DataBag> { public DataBag exec(Tuple input) throws IOException { return (DataBag)input.get(0); } } A dummy example, but there you go. DataBag is a valid pig type like any other, so you just return it like you would normally. 2013/3/18 pranjal rajpu

Re: Loader partitioning on field

2013-03-15 Thread Jonathan Coveney
ystem. This time is > passed in by the system to pig when the job is launched. Since I > partition files by time field, a user could filter based on the result > of this UDF. > > > > On Thu, Mar 14, 2013 at 3:15 PM, Jonathan Coveney > wrote: > > No, it is not.

Re: How to control a number of reducers in Apache Pig

2013-03-15 Thread Jonathan Coveney
The script you posted wouldn't have any reducers, so it wouldn't matter. It's a map only job. 2013/3/15 > Dear Apache Pig Users, > > It is easy to control a number of reducers in JOIN, GROUP, COGROUP, > etc. statements by a general "set default_parallel $NUM" command or > "parallel $NUM" info i

Re: Loader partitioning on field

2013-03-14 Thread Jonathan Coveney
No, it is not. But if it knew that, how would that filter be meaningful? What do you have in mind? 2013/3/14 Jeff Yuan > Rohini, I see your point. > > One followup question: it's possible for the result of a UDF to be > constant and not dependent on the tuples of each record, right? Is Pig > ab

Re: EvalFunc finish() closing connections prematurely

2013-03-14 Thread Jonathan Coveney
Can you perhaps share more of your implementation? I can imagine a couple of things which would cause errors like this. Are you making sure that each instance of EvalFunc is dealing with a different connection? That's what I'd take a look at first...if that isn't the issue, I can look into how fin

Re: Introducing Parquet: efficient columnar storage for Hadoop.

2013-03-12 Thread Jonathan Coveney
ncoding, and RLE encoding of > > > data (Cloudera and Twitter) > > > * Further improvements to Pig support (Twitter) > > > > > > Company names in parenthesis indicate whose engineers signed up to do > the > > > work -- others can feel free to jump in too, of course. > > > > > > We've also heard requests to provide an Avro container layer, similar > to > > > what we do with Thrift. Seeking volunteers! > > > > > > We welcome all feedback, patches, and ideas; to foster community > > > development, we plan to contribute Parquet to the Apache Incubator when > > the > > > development is farther along. > > > > > > Regards, > > > Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy, > > > Jonathan Coveney, and friends. > > >

Re: Pig script compiling too slow with pig-0.10 and pig-0.11

2013-03-06 Thread Jonathan Coveney
Can you try this on trunk and let me know if you have a similar error? Also can you turn on DEBUG and say if it is taking forever during or after parsing? 2013/3/6 Haitao Yao > Hi all > I have a big pig script running under pig-0.9.2. While upgrading > to pig 0.11 or 0.10, the script n

Re: avoiding Group by or filter

2013-03-05 Thread Jonathan Coveney
There have been a number of explanations on the topic before, so I would prefer to point at one of them (or ensure we document it better), but basically all of the aggregation functions we use (sum, avg, etc) all function on bags of stuff. This is actually true in SQL as well (it just hides the "gr

Re: Pig job result output and schema

2013-03-05 Thread Jonathan Coveney
if you use the alias "@", it should properly dump etc the last alias. If not file a JIRA. 2013/3/5 Jeff Yuan > Thanks for your suggestions, they work very well. One follow up question: > > Is there a way to dynamically strip STORE and DUMP commands from a > loaded in script? So everything work

Re: UDF to calculate Average of whole dataset

2013-03-05 Thread Jonathan Coveney
dividends = load 'try.txt'; a = foreach dividends generate FLATTEN(TOBAG(*)); b = foreach (group a all) generate CalculateAvg($1); I think that should work 2013/3/5 pablomar > what is the error ? > function not found or something like that ? > > what about this ? > avg = generate myudfs.C

Re: avoiding Group by or filter

2013-03-05 Thread Jonathan Coveney
Why don't you want to group? 2013/3/5 Preeti Gupta > I want to compute the Average for 1 column dataset > 1 > 2 > 3 > 4 > 5 > > and I am not able to do without grouping. > > However I got an average with > > avg = foreach (group dividends all) generate AVG(dividends); > > But > > avg = fo

Re: Multiple CurrentTime calls return the same timestamp

2013-02-27 Thread Jonathan Coveney
it correctly, for the same run, all the CurrentTime() > will return the same timestamp. I wonder if there any udf can provide > runtime timestamp. > > Thanks. > Dan > > -Original Message- > From: Jonathan Coveney [mailto:jcove...@gmail.com] > Sent: Wednesday,

Re: Multiple CurrentTime calls return the same timestamp

2013-02-27 Thread Jonathan Coveney
This is by design, as the notion of a CurrentTime() in a Pig job is a bit poorly specified, so we went with something "unremarkable." What do you think it should be? 2013/2/27 Cheolsoo Park > Hi Dan, > > Are you using 0.11 or trunk? > > If you're using trunk, please take a look at PIG-3014. > h

Re: Changing the hadoop jobs generation logic

2013-02-27 Thread Jonathan Coveney
What do you have in mind? 2013/2/26 Preeti Gupta > Hello Everyone, > > I want to make some changes in the way Pig generates Hadoop jobs. Any one > got some idea on how to do this? > > regards > > Preeti

Re: reading input parameters in a pig script

2013-02-21 Thread Jonathan Coveney
me:chararray, description:chararray has to be dynamically > created based on the parameters passed. > Is there any way of getting this done? > > -Original Message- > From: Jonathan Coveney [mailto:jcove...@gmail.com] > Sent: Wednesday, February 20, 2013 10:48 PM > To: us

Re: reading input parameters in a pig script

2013-02-20 Thread Jonathan Coveney
be achieved in a pig script? > > Also depending on the output file format, I need to invoke the > corresponding exporter script (html or csv) from my wrapper script. I don’t > see any conditional operators available (if/else) in pig. Any idea how this > can be achieved? > > ---

Re: reading input parameters in a pig script

2013-02-20 Thread Jonathan Coveney
c, in my pig script I will not be able to > > refrence the parameters as '$param1' . Is there any way to access these > > params in the script without referring to the param name? > > > > > > From: Jonathan Coveney [jcove...@

Re: [ANNOUNCE] Welcome Bill Graham to join Pig PMC

2013-02-20 Thread Jonathan Coveney
congrats :) 2013/2/20 Jarek Jarcec Cecho > Congratulations Bill, good job! > > Jarcec > > On Tue, Feb 19, 2013 at 01:48:18PM -0800, Daniel Dai wrote: > > Please welcome Bill Graham as our latest Pig PMC member. > > > > Congrats Bill! >

Re: reading input parameters in a pig script

2013-02-19 Thread Jonathan Coveney
Can you give an example of what you'd like this to look like? 2013/2/19 Siddhi Borkar > Hi , > > I need to pass parameters dynamically to a pig script. Is there any way to > read the parameters passed and their corresponding values without giving > the parameter names in the pig script? > > Tha

Re: run pig explain command over the entire script in java

2013-02-18 Thread Jonathan Coveney
produce the compact execution plan for the whole script and not > the several separate ones (one for each alias). > > > > On 2/18/2013 10:21 PM, Jonathan Coveney wrote: > >> I guess I'm confused at what you want then. >> >> So we have a script: >> >>

Re: run pig explain command over the entire script in java

2013-02-18 Thread Jonathan Coveney
command > > $ pig -x local -e 'explain -script Temp1/TPC_test.pig -out > explain-out9.txt' > it will not give the same output as if we did it for each operation > separately. > > > On 2/18/2013 7:04 PM, Jonathan Coveney wrote: > >> Hacky way: grep

Re: run pig explain command over the entire script in java

2013-02-18 Thread Jonathan Coveney
Hacky way: grep for "^\S =", pull out the names, and then do the explains. Why is doing the progressive explains useful? it wouldn't be too hard to build this into pig but the results would be pretty unwieldy, it'd be really big, and pretty redundant. 2013/2/18 Petar Jovanovic > Hi, > I am try
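The "hacky way" above could be sketched roughly as follows in Python. The regular expression is an assumption of mine (a slightly more permissive pattern than the one quoted, so that multi-character aliases match), and find_aliases is a hypothetical helper; it is a line-based heuristic, so it will also pick up assignments inside nested foreach blocks.

```python
import re

# Assumed pattern: an identifier at the start of a line followed by a single '='.
ALIAS_ASSIGNMENT = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=[^=]")


def find_aliases(script_text):
    """Return alias names in order of first assignment."""
    aliases = []
    for line in script_text.splitlines():
        match = ALIAS_ASSIGNMENT.match(line)
        if match and match.group(1) not in aliases:
            aliases.append(match.group(1))
    return aliases


script = """
a = load 'data' as (x:int, y:int);
b = filter a by x == 1;
c = group b by y;
store c into 'out';
"""

aliases = find_aliases(script)
# One explain statement per alias, to be fed to grunt or a PigServer.
explain_commands = ["explain %s;" % alias for alias in aliases]
```

As noted above, the combined output of running every one of these would be large and quite redundant, since each plan contains the plans of the aliases it depends on.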

Re: $HOME

2013-02-11 Thread Jonathan Coveney
Prashant: not sure. It probably isn't. We should make a ticket for that if it isn't. Russell: can you close the ticket you made, with the answer? 2013/2/11 Russell Jurney > Well, I don't know if that works? But Software/ isn't assumed either. > Really I just need a home directory... I've set mi

Re: $HOME

2013-02-11 Thread Jonathan Coveney
Bill was right, he just forgot an escape: %default HOME `echo \$HOME` I believe that should work 2013/2/11 Russell Jurney > Yes, I agree. Please edit! :) > > > On Mon, Feb 11, 2013 at 1:31 AM, Prashant Kommireddi >wrote: > > > Only suggesting a workaround, not implying it's the best solution

Re: How to refer field name in Jython UDF

2013-02-07 Thread Jonathan Coveney
Sorry, hit enter prematurely. Although in this particular case, it's a little janky, but you could have a helper which takes the thrift class i.e. get_name(some_field, 'SomeClass') and could use that SomeClass to let you refer by name. 2013/2/8 Jonathan Coveney > Curren

Re: How to refer field name in Jython UDF

2013-02-07 Thread Jonathan Coveney
Currently, the answer to this is no. In Javaland in 0.11.0 you can get the schema in an EvalFunc, and it would not be hard to make this available from a Jython UDF, though we'd need a patch. 2013/2/7 Stanley Xu > Dear All, > > We are using pig with elephant-bird thrift to process structured rec

Re: Is there a size limit on tuple or a field in tuple?

2013-02-07 Thread Jonathan Coveney
A tuple must fit in memory. That is the only bound. 2013/2/6 Dexin Wang > I'm writing a UDF of my own that would produce tuples, each tuple has a > string field that could be real large. I did a quick test and the current > size of the field is 146,447 characters and it doesn't seem to have any

Re: reference tuple field by name in UDF

2013-02-02 Thread Jonathan Coveney
Similar question for Python UDF. In my Python UDF, is referencing field by > index (instead of alias) is the only option I have? > > > On Tue, Jan 15, 2013 at 2:20 PM, Jonathan Coveney >wrote: > > > Another way to do it would be to make a helper function that does

Re: Ignore missing paths on load

2013-02-02 Thread Jonathan Coveney
AFAIK, this is a Hadoop issue and not a pig issue. That said, we could make this a configurable thing to overload and do something more reasonable. Feel free to open a JIRA and suggest that (or else someone will see it and say exactly that it is a Hadoop issue and not a Pig issue) 2013/2/2 Benja

Re: Some optimization advices

2013-01-31 Thread Jonathan Coveney
Even better, push the tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity'); as high as possible. 2013/1/31 Cheolsoo Park > Hi Jerome, > > Try this: > > XmlTag = FOREACH xmlToTuple GENERATE FLATTEN ($0); > XmlTag2 = FOREACH XmlTag { > tag_with_amenity = FILTER tag BY (tag_attr_k == 'am

Re: How many nested bincond operator does pig support?

2013-01-29 Thread Jonathan Coveney
There was an issue in the parser that has been resolved in trunk (I forget if it went into 0.11 or not). Can you test your script on trunk and see if you still have the issue? 2013/1/28 Dongliang Sun > Hi All, > > When there are too many nested bincond operators (more than 10), it's > frozen th

Re: Run a job async

2013-01-25 Thread Jonathan Coveney
else. However, I think > that making the front-end thread-safe is an achievable goal. > > Thanks, > Cheolsoo > > > > On Thu, Jan 24, 2013 at 11:18 PM, Ramakrishna Nalam > wrote: > > > That clarifies it for me, thanks a lot. > > > > Regards, > > Rama. >

Re: Run a job async

2013-01-24 Thread Jonathan Coveney
> > pig job submitting thread waiting until the job completes? > > > > Is this just a shortcoming today or are there more concrete reasons > against > > providing with a pigserver which can submit to the cluster in mapreduce > > mode async? > > > > Thanks,

Re: Run a job async

2013-01-23 Thread Jonathan Coveney
by deploying daemons that run pig jobs as local processes. 2013/1/23 Prashant Kommireddi > Both. Think of it as an app server handling all of these requests. > > Sent from my iPhone > > On Jan 23, 2013, at 9:09 PM, Jonathan Coveney wrote: > > > Thousands of requests,

Re: Run a job async

2013-01-23 Thread Jonathan Coveney
Thousands of requests, or thousands of Pig jobs? Or both? 2013/1/23 Prashant Kommireddi > Did not want to have several threads launched for this. We might have > thousands of requests coming in, and the app is doing a lot more than only > Pig. > > On Wed, Jan 23, 2013 at 5:

Re: Run a job async

2013-01-23 Thread Jonathan Coveney
start a separate Process which runs Pig? 2013/1/23 Prashant Kommireddi > Hey guys, > > I am trying to do the following: > >1. Launch a pig job asynchronously via Java program >2. Get a notification once the job is complete (something similar to >Hadoop callback with a servlet) > > I

Re: Pig APIs

2013-01-22 Thread Jonathan Coveney
s if this > is not possible at the moment. > > -Prashant > > On Mon, Jan 21, 2013 at 5:47 PM, Prashant Kommireddi >wrote: > > > At the moment, basically info on I/O paths, operators used (group by, > > foreach ..), job level info such as number of reducers etc. > >

Re: Shared script commands

2013-01-22 Thread Jonathan Coveney
At Twitter, we have a lightweight framework that handles stitching code together, so I think with pig, stitching stuff together in some organized way is the current best practice. 2013/1/22 Cheolsoo Park > Hi Eric, > > You can move REGISTER and SET to a properties file and DECLARE and DEFAULT

Re: Syntax for dereferenced project-range

2013-01-22 Thread Jonathan Coveney
I do not believe that this is currently supported for nested projections, though it should be. Feel free to make a JIRA ticket, I do not think it would be hard. 2013/1/22 Uri Laserson > I have tuple like so: > > (a: (b:int, c:int, d:int, e:int)) > > I would like to call a UDF and pass a ran

Re: Pig APIs

2013-01-21 Thread Jonathan Coveney
What level of information would you like? IE if you do "explain relation," which of the three do you want to hook into? 2013/1/21 Prashant Kommireddi > Been coding with the APIs and wondering if there is anything that allows > you to only retrieve the operators, I/O paths etc without actually i

Re: reference tuple field by name in UDF

2013-01-15 Thread Jonathan Coveney
Another way to do it would be to make a helper function that does the following: input.get(getInputSchema().getPosition(alias)); Only available in 0.10 and later (I think getInputSchema is in 0.10, at least...may only be in 0.11) 2013/1/15 Dexin Wang > Hi, > > In my own UDF, is reference a fi
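The idea behind that helper (look up the position of an alias in the schema, then index into the tuple) can be modeled in plain Python as below. Here the schema is just a list of field names, and get_position / get_by_alias are hypothetical names for illustration; this is not the Pig or Jython UDF API.

```python
def get_position(schema, alias):
    """Return the index of `alias` in the schema (raises ValueError if absent)."""
    return schema.index(alias)


def get_by_alias(tup, schema, alias):
    """Tuples are indexed by number only, so translate the name to a position first."""
    return tup[get_position(schema, alias)]


schema = ["id", "name", "score"]
row = (7, "pig", 0.5)
name = get_by_alias(row, schema, "name")
```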

Re: Pig error

2013-01-14 Thread Jonathan Coveney
Can you share a script which replicates this? Ideally one that isolates the issue, if it is quite long... 2013/1/14 abhishek > >> Hi all, > >> > >> When am using JOIN operator in pig, am getting following error > >> > >> Pig joins inner plans can only have one output leaf? > >> > >> Can any one

Re: Making Pig run faster in local mode

2013-01-04 Thread Jonathan Coveney
How long is it taking? 2013/1/4 Malcolm Tye > Hi, > > Any ideas on how to make Pig run quicker when running it in > local mode ? > > > > I'm processing 3 files of about 13MB each with 3 group by statements in my > script which seem to suck up the time. There's no joins > > > > I

Re: Group by with count

2012-12-27 Thread Jonathan Coveney
a = load 'tab1' as (col1, col2, col3); b = group a by (col1, col2, col3); c = foreach b generate FLATTEN(group), COUNT_STAR(a); 2012/12/26 abhishek > Hi all, > > How can I achieve above hive query in pig > > Create table x as select y.col1,y.col2,y.col3,count(*) as count from tab1 > y group by
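As a rough model of what that script computes, here is the same group-and-count in plain Python on made-up sample rows (the data is invented for illustration only): group on the full (col1, col2, col3) key, then emit the key fields followed by the group size.

```python
from collections import Counter

# Hypothetical sample rows standing in for 'tab1'.
rows = [("x", 1, "a"), ("x", 1, "a"), ("y", 2, "b")]

# group a by (col1, col2, col3) + COUNT_STAR(a): count occurrences of each key.
counts = Counter(rows)

# FLATTEN(group) puts the key fields back as top-level columns next to the count.
result = sorted(key + (n,) for key, n in counts.items())
```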

Re: what happens under the hood

2012-12-19 Thread Jonathan Coveney
This is a very broad question. On the Pig website you can find some papers on how Pig was implemented, and this should give you a high level view of what is going on. For this code, you can use the explain command (explain in; instead of dump in;) to see the 3 plans that this code generates (logic

Re: PhysicalPlan leaves

2012-12-19 Thread Jonathan Coveney
> Peace be on you, Jonathan, > It gives one leaf also > > -- > Regards, > Sarah M. Hassan > > > > On Tue, Dec 18, 2012 at 4:24 AM, Jonathan Coveney >wrote: > > > Try it with joins, I think > > > > > > 2012/12/16 Sarah Mohamed > > &

Re: PhysicalPlan leaves

2012-12-17 Thread Jonathan Coveney
Try it with joins, I think 2012/12/16 Sarah Mohamed > PhysicalPlan.getLeaves() return a list of leaves, Most of the cases it's > only one"the root", is there any cases that the physical plan will have > more than one leaf ? > > Thanks > Sarah >

Re: Join Multiple Relations by Different Fields

2012-12-14 Thread Jonathan Coveney
it's a little confusing, but the following is a tuple: (key1,foo,) it's just not the tuple you want. it is a tuple where the first field is "key1,foo" and the second field is null. The printing makes this ambiguous 2012/12/14 Thomas Bach > (key1,foo,)
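The ambiguity can be reproduced with a small Python sketch. The render function below only mimics the comma-separated, null-as-empty style of output described above; it is an assumption for illustration, not Pig's actual printing code.

```python
def render(tup):
    """Print fields separated by commas, with null (None) shown as nothing."""
    return "(" + ",".join("" if f is None else str(f) for f in tup) + ")"


three_fields = ("key1", "foo", None)   # the tuple you probably wanted
two_fields = ("key1,foo", None)        # one field containing a comma, then a null

same_output = render(three_fields) == render(two_fields)
```

Both tuples print as (key1,foo,), so the printed form alone cannot tell you how many fields there are.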

Re: pig support for in operator

2012-12-13 Thread Jonathan Coveney
This is a join. This is equivalent to: A = load 'test_data' as (value); B = load 'filter_data' as (x:int); C = join A by value, B by x using 'replicated'; D = foreach C generate value as value; One thing pig does not currently do nicely is let you create a relation from nothing (ie define the
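To see why an IN filter amounts to a join against a small relation, here is a minimal Python sketch of the idea behind a replicated join: the small side is held in memory and each row of the large side is checked against it. The function name and sample data are hypothetical, and the sketch assumes the filter values are distinct (a real join would emit duplicates if the small side repeated a value).

```python
def in_filter_via_join(a_rows, b_rows):
    """Keep rows of A whose value appears in B, with B held in memory."""
    small = set(b_rows)  # the 'replicated' side: small enough to load everywhere
    return [value for value in a_rows if value in small]


a_rows = [1, 2, 3, 2, 5]   # stands in for 'test_data'
b_rows = [2, 5, 9]         # stands in for 'filter_data'
d_rows = in_filter_via_join(a_rows, b_rows)
```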

Re: RESOLVED: ERROR 2999: Unexpected internal error. null

2012-12-12 Thread Jonathan Coveney
build/lib/jars/xmlenc-0.52.jar:/Library/apache-cassand > >r > >a-1.1.7-src/build/apache-cassandra-1.1.7-SNAPSHOT.jar:/Library/apache-cass > >a > >ndra-1.1.7-src/build/apache-cassandra-clientutil-1.1.7-SNAPSHOT.jar:/Libra > >r > >y/apache-cassandra-1.1.7-src/build/apache

Re: Parsing variable schema

2012-12-12 Thread Jonathan Coveney
I'm a little vague on what you want to do. Can you provide an example? 2012/12/11 Prashant Kommireddi > Here is a snippet of how schema is applied to tuples > > String serializedSchema = p.getProperty(signature + SCHEMA_FILE); > if (serializedSchema != null) { >

Re: ERROR 2999: Unexpected internal error. null

2012-12-11 Thread Jonathan Coveney
str), > making your code path impossible... > > will > > > On Tue, Dec 11, 2012 at 1:00 PM, Jonathan Coveney >wrote: > > > If I were debugging this (note, I know nothing about cassandra), I would > > put a flag in my ide on cassandra storage and see what is goi

Re: ERROR 2999: Unexpected internal error. null

2012-12-11 Thread Jonathan Coveney
If I were debugging this (note, I know nothing about cassandra), I would put a flag in my ide on cassandra storage and see what is going on in there, and why it is erroring out. Then I would follow that backwards into whatever in Pig was generating that issue. That's pretty vague but can't really s

Re: Piggybank date time functions

2012-12-11 Thread Jonathan Coveney
I did not implement those UDF's... I imagine the reason for rigorously using UTC instead of system time is because that can introduce subtle bugs where your servers have a different time than your client and it can be hard to debug, etc. It would be pretty easy to add support for timezone to those

Re: PIG script - PIGStorage

2012-12-10 Thread Jonathan Coveney
The default loader can't handle this. You would need a custom InputFormat, which isn't too bad. 2012/12/9 L N > Hi, > > > > > I have an unstructured file format. Assume below is the data in a file > > > > > > > > abxcd xyxc > > > > > > > > > I need to process the data in between < >

Re: Reconstruct the PhysicalPlan

2012-11-28 Thread Jonathan Coveney
he physical plan as a > tree/graph structure. > > What I did that I implemented the PigProgressNotificationListener interface > and I built it myself, and this is what you mean right? > > Thanks you for your help. > > -- > Regards, > Sarah M. Hassan > > >

Re: Request for suggestions

2012-11-26 Thread Jonathan Coveney
Can you flesh out what you want it to do a little more? Maybe some example queries? 2012/11/26 > Hi, > > > We have a scenario where we want a single Hadoop job to create/manage > multiple mapper tasks where each mapper task will query a subset of columns > in a relational database table. We loo

Re: Reconstruct the PhysicalPlan

2012-11-26 Thread Jonathan Coveney
What is your goal? When you say reconstruct, do you just mean get a handle on the physical plan? You can make your own execution flag (ie extend the interface behind local mode etc) and that method gives you the physical plan. 2012/11/24 Sarah Mohamed > Peace be on you, > > Is there a way to re

Re: Re: Re: Pig UT last nearly 8 hours and TestEvalPipeline2 lasts for 37 minutes

2012-11-20 Thread Jonathan Coveney
Pig is very much not thread safe. It uses static methods to add stuff to contexts all over the place. It would be a ton of work to fix this. 2012/11/20 Cheolsoo Park > Hi, > > I actually tried to run entire unit test suite in multiple threads, and I > used this junit extension: > http://tempusf

Re: need help about pig script on this case

2012-11-19 Thread Jonathan Coveney
In pure Pig, you wouldn't do something like this. However, Pig supports control flow in Python (I really should get on making the JRuby wrapper, but I digress). You can find docs for this on the pig website. Basically the control flow is in Python, and you launch jobs from there. 2012/11/19 Sheng

Re: PigStorage

2012-11-19 Thread Jonathan Coveney
Make a JIRA and attach the patch, please. 2012/11/19 pablomar > hi all, > > I did it as simple as I could. What about this changes ? > > > PigStorage.java > original: > private void readField(byte[] buf, int start, int end) { > if (start == end) { > // NULL value >

Re: Accessing tuple field names from within a python udf

2012-11-16 Thread Jonathan Coveney
be there is an easier way I am missing here. If people have any ideas > for a more elegant solution I would be happy to develop it and > contribute the code. > > Martin > > > > > > > > On 15 November 2012 20:20, Jonathan Coveney wrote: > >

Re: Accessing tuple field names from within a python udf

2012-11-15 Thread Jonathan Coveney
Martin, That is a reasonable workaround. Even in Java UDFs, you can't directly access fields by name. Tuples are indexed only by numbers. Using the Schema is how I would do it. 2012/11/14 Martin Goodson > Sorry to reply to my question post but I've found a workaround that I > thought I should
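
The schema-based lookup mentioned above could be sketched roughly as follows. This is only an illustration in plain Python, assuming the schema is available as an ordered list of field names; the field names and the helpers `field_index` and `get_field` are hypothetical and not part of Pig.

```python
def field_index(schema_names):
    """Map each field name to its position in the tuple."""
    return {name: i for i, name in enumerate(schema_names)}


def get_field(tup, index, name):
    """Look up a tuple field by name via the positional index."""
    return tup[index[name]]


# Hypothetical schema and row, for illustration only.
schema_names = ["user_id", "country", "clicks"]
row = (42, "NL", 7)

index = field_index(schema_names)
country = get_field(row, index, "country")
```

The tuple itself is still accessed by number; the name-to-position map is built once from the schema and reused for every row.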

Re: The essence from hundreds of posts from Apache Pig user mailing list

2012-11-15 Thread Jonathan Coveney
This is great! PS "I really liked the simple explanation of FLATTEN: it turns Tuples into columns (because Tuples contain columns) and turns Bags into rows (because Bags contain rows)." I'm so glad someone appreciated that :D I put a lot of effort into that portion of it... 2012/11/14 Steve Bern
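
The quoted explanation of FLATTEN can be modeled in a few lines of plain Python. This is a sketch of the semantics only, not Pig code: rows are Python tuples, a bag is a list of tuples, and `flatten_tuple` and `flatten_bag` are hypothetical helper names.

```python
def flatten_tuple(row, pos):
    """Splice the fields of the tuple at row[pos] into the row as columns."""
    return row[:pos] + tuple(row[pos]) + row[pos + 1:]


def flatten_bag(row, pos):
    """Emit one output row per tuple in the bag at row[pos]."""
    return [row[:pos] + tuple(t) + row[pos + 1:] for t in row[pos]]


tuple_row = ("a", (1, 2))
bag_row = ("a", [(1,), (2,), (3,)])

wide = flatten_tuple(tuple_row, 1)  # still one row, but with more columns
tall = flatten_bag(bag_row, 1)      # one row per element of the bag
```

Flattening the tuple keeps a single row and widens it; flattening the bag multiplies the row by the number of tuples in the bag.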

Re: Re: Pig UT last nearly 8 hours and TestEvalPipeline2 lasts for 37 minutes

2012-11-14 Thread Jonathan Coveney
There are a couple of ways to shorten the time... one (super helpful one) would be to look at tests using the MiniCluster, and convert them to use local mode. A lot of tests are run using a full MR job when they aren't testing a piece of Pig relevant to that interop. Another way is to split up the

Re: Dynamically generating load/store path

2012-11-13 Thread Jonathan Coveney
If it's a parameter, it could just be passed in as a $var 2012/11/13 Miki Tebeka > Greetings, > > Is there a way to dynamically generate (maybe via UDF) the path to > load/store data? (something like "A = LOAD InputPath() USING > PigStorage();") > > Currently we calculate the load/store path ou

Re: Boolean pig UDF constructor

2012-11-08 Thread Jonathan Coveney
UDF's can only be given String arguments, period. So you can pass it a boolean in String form and parse it. 2012/11/8 meghana narasimhan > Hi All, > > Can I pass in a boolean value to Pig UDF constructor with Pig 0.9.2? > > I have a constructor : > > public GenStartEndDate(boolean mtdNoGlob) {

Re: CONCAT(null, "something") == NULL ?

2012-11-05 Thread Jonathan Coveney
I agree with Alan on all counts. I think the confusing part is that null is overloaded. Alas. 2012/11/5 Alan Gates > Better in terms of semantics or terms of documentation? We can't change > the semantics of null in Pig; it's been that way the whole time. Plus this > concept of unknown data i
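
The behavior in the subject line, where a null input makes the whole result null, can be modeled like this. It is a plain-Python sketch that uses `None` to stand in for Pig's null; it is not Pig's implementation of CONCAT.

```python
def concat(a, b):
    """Return None if either input is None, otherwise the concatenation."""
    if a is None or b is None:
        return None  # an unknown input gives an unknown result
    return a + b


known = concat("some", "thing")
unknown = concat(None, "something")
```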

Re: Welcome our newest committer Cheolsoo Park

2012-10-26 Thread Jonathan Coveney
Now is when the real fun starts, Cheolsoo. Congrats :) 2012/10/26 Alan Gates > Welcome Cheolsoo, and well deserved. > > Alan. > > On Oct 26, 2012, at 2:54 PM, Julien Le Dem wrote: > > > All, > > > > Please join me in welcoming Cheolsoo Park as our newest Pig committer. > > He's been contributing

Re: Error while exporting the data from pig latin to RDBMS

2012-10-17 Thread Jonathan Coveney
I have not used DBStorage myself and the comments are lacking, but there is a syntactical issue here. All store statements have to be in the following form: store relation into 'location' using storefunc(args); So your case needs to be STORE data INTO 'location' using DBStorage ('com.mysql.jdb

Re: Group Data By UDF Result?

2012-10-16 Thread Jonathan Coveney
Howdy Joshua. This question comes up a fair amount, in various forms, and here is the answer: unless you can figure out a way to reduce this to an equi-join, then it is going to be tough. Why is that? Because of how joining in map-reduce land works. The way joining generally works is by hashing th
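
A rough sketch of why equality on the key matters for a hash-based join, in plain Python. This is an assumption-laden toy, not how Pig or Hadoop implement joins: `hash_join` and `num_reducers` are hypothetical names, and the buckets stand in for reducers.

```python
from collections import defaultdict


def hash_join(left, right, num_reducers=4):
    """Join two lists of (key, value) pairs by routing equal keys to the same bucket."""
    buckets = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]
    for key, value in left:
        buckets[hash(key) % num_reducers][key][0].append(value)
    for key, value in right:
        buckets[hash(key) % num_reducers][key][1].append(value)

    joined = []
    for bucket in buckets:
        # Each bucket only ever sees its own keys, so matches are found locally.
        for key, (lvals, rvals) in bucket.items():
            joined.extend((key, l, r) for l in lvals for r in rvals)
    return joined


pairs = hash_join([(1, "a"), (2, "b")], [(1, "x"), (1, "y"), (3, "z")])
```

Equal keys always hash to the same bucket, so each bucket can join independently. A condition that is not an equality on the key gives no such guarantee: two records that should match may land in different buckets.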

Re: Parallel Join with Pig

2012-10-15 Thread Jonathan Coveney
M/R is a useful but sometimes leaky abstraction. :) 2012/10/15 Alberto Cordioli > Ok, I've found > > I was using values that return all the same value % number of reducers. > For an unfortunate case I tested always multiple values..Ohhh, my > fault. > > > Cheers, > Alberto > > > On 13 Oc
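
The effect Alberto describes, where every test key gives the same value modulo the number of reducers, can be reproduced with a small Python sketch. It assumes integer keys whose hash is the value itself and a modulo-style partitioner; `partition` is a hypothetical helper, and the key values are made up.

```python
def partition(key, num_reducers):
    """Pick a reducer for an integer key, in the style of a hash partitioner."""
    return (key & 0x7FFFFFFF) % num_reducers


num_reducers = 10
keys = [10, 20, 30, 40]  # all multiples of the reducer count
assignments = [partition(k, num_reducers) for k in keys]
spread = {partition(k, num_reducers) for k in range(10)}
```

With keys that are all multiples of the reducer count, every record lands on the same reducer, while keys 0 through 9 spread across all ten.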
