Re: Avoiding serialization/de-serialization in pig

2010-06-30 Thread Thejas Nair
On 6/28/10 5:51 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: I have a feeling that propagating schemas when known, and using them for (de)serialization instead of reflecting every field, would also be a big win. Thoughts on just using Avro for the internal PigStorage? When I

Avoiding serialization/de-serialization in pig

2010-06-28 Thread Thejas Nair
I have created a wiki page that puts together some ideas that can help improve performance by avoiding/delaying serialization/de-serialization. http://wiki.apache.org/pig/AvoidingSedes These are ideas that don't involve changes to the optimizer. Most of them involve changes in the load/store

Re: Begin a discussion about Pig as a top level project

2010-04-02 Thread Thejas Nair
I agree with Alan and Dmitriy - Pig is tightly coupled with hadoop, and heavily influenced by its roadmap. I think it makes sense to continue as a sub-project of hadoop. -Thejas On 3/31/10 4:04 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Over time, Pig is increasing its coupling to Hadoop

Re: LoadFunc.skipNext() function for faster sampling ?

2009-11-03 Thread Thejas Nair
will have access to the InputFormat instance, correct? Can it not call InputFormat.getNext the desired number of times (which will not parse the tuple) and then call LoadFunc.getNext to get the next parsed tuple? Alan. On Nov 3, 2009, at 4:28 PM, Thejas Nair wrote: In the new

LoadFunc.skipNext() function for faster sampling ?

2009-11-03 Thread Thejas Nair
In the new implementation of SampleLoader subclasses (used by order-by, skew-join, ...) as part of the loader redesign, we are not only reading all the input records but also parsing them as pig tuples. This is because the SampleLoaders are wrappers around the actual input loaders specified in the

Definition of equality of bags

2009-11-02 Thread Thejas Nair
I could not find any documentation (in the piglatin manual) on what the definition of equality of bags is (or what it should be). Does the order of tuples in the bag matter? But the definition of a bag does not imply any ordering. This has implications for the definition of join/cogroup/group on bags.

Re: Definition of equality of bags

2009-11-02 Thread Thejas Nair
fix it, I am not filing a jira. -Thejas On 11/2/09 9:19 AM, Thejas Nair te...@yahoo-inc.com wrote: I could not find any documentation (in the piglatin manual) on what the definition of equality of bags is (or what it should be). Does the order of tuples in the bag matter? But the definition

Re: switching to different parser in Pig

2009-08-25 Thread Thejas Nair
Jflex is covered by GPL, but code generated by it is not. Only the code that is generated by Jflex goes into pig.jar. We can't check Jflex.jar into svn; ivy will be set up to download it from the maven repository. -Thejas On 8/25/09 11:57 AM, Dmitriy Ryaboy dvrya...@cloudera.com wrote: Santosh,

Re: Proposal to create a branch for contrib project Zebra

2009-08-18 Thread Thejas Nair
I think we are creating unnecessary bureaucratic hurdles here by preventing a contrib project from having a branch. I don't see why zebra has to use the pig release branch, as the new pig release does not include it. The decisions are supposed to help keep things open, but this seems to be forcing

Re: [Pig Wiki] Update of ProposedProjects by AlanGates

2009-04-16 Thread Thejas Nair
This paper seems very relevant to the proposal - Compiled Query Execution Engine using JVM http://www2.computer.org/portal/web/csdl/doi/10.1109/ICDE.2006.40 From the abstract - Our experimental results on the TPC-H data set show that, despite both engines benefiting from JIT, the compiled engine

Re: scope string in OperatorKey

2009-03-11 Thread Thejas Nair
without it? IIRC the OperatorKey includes an operator number. When looking at the explain plans this is useful for cases where there is more than one of a given type of operator and you want to be able to distinguish between them. Alan. On Mar 6, 2009, at 3:14 PM, Thejas Nair wrote: What

scope string in OperatorKey

2009-03-06 Thread Thejas Nair
What is the purpose of the scope string in org.apache.pig.impl.plan.OperatorKey? Is it meant to be used if we have a pig daemon process? Is it ok to stop printing the scope part in explain output? It does not seem to add value to it and makes the output more verbose. Thanks, Thejas