On 6/28/10 5:51 PM, "Dmitriy Ryaboy" wrote:
>
> I have a feeling that propagating schemas when known, and using them for
> (de)serialization instead of reflecting every field, would also be a big
> win.
>
> Thoughts on just using Avro for the internal PigStorage?
When I profiled pig quer
I have created a wiki page that puts together some ideas that can help in
improving performance by avoiding/delaying serialization/de-serialization.
http://wiki.apache.org/pig/AvoidingSedes
These are ideas that don't involve changes to the optimizer. Most of them
involve changes in the load/store functi
I agree with Alan and Dmitriy - Pig is tightly coupled with hadoop, and
heavily influenced by its roadmap. I think it makes sense to continue as a
sub-project of hadoop.
-Thejas
On 3/31/10 4:04 PM, "Dmitriy Ryaboy" wrote:
> Over time, Pig is increasing its coupling to Hadoop (for good reasons
In the new implementation of SampleLoader subclasses (used by order-by,
skew-join ..) as part of the loader redesign, we are not only reading all
the input records but also parsing them as pig tuples.
This is because the SampleLoaders are wrappers around the actual input
loaders specified in the q
? Pig will have access
> to the InputFormat instance, correct? Can it not call
> InputFormat.getNext the desired number of times (which will not parse
> the tuple) and then call LoadFunc.getNext to get the next parsed tuple?
>
> Alan.
>
> On Nov 3, 2009, at 4:28 PM, Thejas Nair w
fix it, I am not filing a jira.
-Thejas
On 11/2/09 9:19 AM, "Thejas Nair" wrote:
> I could not find any documentation (in piglatin manual) on what the
> definition of equality of bags is (or what it should be), does the order of
> tuples in the bag matter ? But the definitio
I could not find any documentation (in the piglatin manual) on what the
definition of equality of bags is (or what it should be). Does the order of
tuples in the bag matter? The definition of a bag does not imply any
ordering.
This has implications for the definition of join/cogroup/group on bags.
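To make the question concrete, here is a minimal sketch of an order-insensitive equality check, assuming a bag is treated as a multiset of tuples. It does not use Pig's own DataBag or Tuple classes: the BagEquality class, the List-based tuple model, and the method names are all hypothetical and only illustrate one possible definition.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Pig's API.
public class BagEquality {

    // Count how many times each tuple occurs in the bag.
    // A tuple is modeled here as a List<Object> (an assumption for this sketch).
    static Map<List<Object>, Integer> counts(List<List<Object>> bag) {
        Map<List<Object>, Integer> result = new HashMap<>();
        for (List<Object> tuple : bag) {
            result.merge(tuple, 1, Integer::sum);
        }
        return result;
    }

    // Two bags are equal if they hold the same tuples with the same
    // multiplicities, regardless of the order in which the tuples appear.
    static boolean bagsEqual(List<List<Object>> a, List<List<Object>> b) {
        return a.size() == b.size() && counts(a).equals(counts(b));
    }

    public static void main(String[] args) {
        List<Object> t1 = Arrays.<Object>asList(1, "a");
        List<Object> t2 = Arrays.<Object>asList(2, "b");
        // Same tuples in a different order.
        System.out.println(bagsEqual(Arrays.asList(t1, t2), Arrays.asList(t2, t1)));
        // Same distinct tuples, different multiplicities.
        System.out.println(bagsEqual(Arrays.asList(t1, t1, t2), Arrays.asList(t1, t2, t2)));
    }
}
```

Under this definition the first comparison holds and the second does not, since duplicates count. An order-sensitive definition would instead compare the tuples position by position.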
I think we should include the fix for PIG-1048 (skew join incorrect results) in
the release. There is already a patch for it.
-Thejas
On 10/29/09 1:54 PM, "Olga Natkovich" wrote:
> With 3 +1s from Hadoop PMC (Alan Gates, Chris Douglas, and Olga
> Natkovich) and no -1s, the release passed the vote
Jflex is covered by GPL, but code generated by it is not. Only the code that
is generated by Jflex goes into pig.jar.
We can't check Jflex.jar into svn; ivy will be set up to download it from
the maven repository.
-Thejas
On 8/25/09 11:57 AM, "Dmitriy Ryaboy" wrote:
> Santosh,
> Am I missing some
I think we are creating unnecessary bureaucratic hurdles here by preventing
a contrib project from having a branch. I don't see why zebra has to use the pig
release branch, as the new pig release does not include it.
The decisions are supposed to help keep things open, but this seems to be
forcing Ra
With a constraint that all scalar values in a tuple should fit into a single
buffer, the values will always have to be copied whenever a tuple's contents
need to be copied to a new tuple after a relational operation.
The overhead of copying is not large for numeric types compared to the
existing imp
This paper seems very relevant to the proposal -
"Compiled Query Execution Engine using JVM"
http://www2.computer.org/portal/web/csdl/doi/10.1109/ICDE.2006.40
From the abstract -
"Our experimental results on the TPC-H data set show that, despite both
engines benefiting from JIT, the compiled engi
Pig users might not know enough to decide on a good default parallelism,
especially when running ad hoc queries.
Instead of defaulting to 1, if a user does not specify the parallelism, we
should use as default a higher number which does not have a negative impact on
the throughput of the system.
Ha
I will create a JIRA for this change.
-Thejas
-- Forwarded Message
From: Alan Gates
Date: Mon, 16 Mar 2009 07:56:32 -0700
To: Thejas Nair
Subject: Re: scope string in OperatorKey
+1.
Alan.
On Mar 11, 2009, at 11:53 AM, Thejas Nair wrote:
> The id in OperatorKey helps distingu
easy to
> distinguish operators without it? IIRC the OperatorKey includes an
> operator number. When looking at the explain plans this is useful for
> cases where there is more than one of a given type of operator and you
> want to be able to distinguish between them.
>
> Alan.
What is the purpose of the scope string in org.apache.pig.impl.plan.OperatorKey?
Is it meant to be used if we have a pig daemon process?
Is it ok to stop printing the scope part in explain output? It does not seem
to add value to it and makes the output more verbose.
Thanks,
Thejas