Sure, I can do that. Isn't this something that should be done already? Or
does it not work if the filter is working on a field that is part of the
group?
On Wed, Mar 21, 2012 at 11:02 PM, Dmitriy Ryaboy wrote:
> Prashant, mind filing a jira with this example? Technically, this is
> something we could do automatically.
Prashant, mind filing a jira with this example? Technically, this is
something we could do automatically.
On Wed, Mar 21, 2012 at 5:02 PM, Prashant Kommireddi wrote:
> Please pull your FILTER out of GROUP BY and do it earlier
> http://pig.apache.org/docs/r0.9.1/perf.html#filter
>
> In this case, you could use a FILTER followed by a bincond to introduce a
> new field "employerOrLocation", then do a group by and include the new
> field in the GROUP BY clause.
The numbers 100 and 20 denote metadata counts; the actual data instances are
large. Moreover, given the denormalized form, it can't take advantage of
indexes.
The data is currently denormalized, in the sense that instead of having 100
parse columns, the data is stored as key-value pairs in a 3-column table.
One r
What about denormalizing and just representing these as 4-tuples of (id,
type, name, value) in a text file? You could always then group by type if
you need to get back to distinct types.
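For instance, a quick untested sketch of what that could look like (the file
name and field names here are just placeholders):

  attrs = LOAD 'attributes.txt' USING PigStorage('\t')
      AS (id:chararray, type:chararray, name:chararray, value:chararray);
  -- recover the distinct types by grouping on the type column
  by_type = GROUP attrs BY type;
  type_counts = FOREACH by_type GENERATE group AS type, COUNT(attrs) AS cnt;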
Are you joining against a larger dataset? I ask just because 10x200 values
is not a lot and can be done without
Hi All,
I am using pig-0.8.1-cdh3u2 and hadoop-0.20.2-cdh3u1 on a Linux box.
Once in a while, I get the following exception:
WARN : org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher -
There is no log file to write to.
ERROR: org.apache.pig.backend.hadoop.executionengine.ma
There is a patch for AvroStorage which computes a union schema, thereby
allowing input Avro files to have different schemas, specifically
(un-nested) records with different fields.
https://issues.apache.org/jira/browse/PIG-2579
Best,
stan
On Wed, Mar 21, 2012 at 8:31 PM, Jonathan Coveney wrote:
A question about this: does Avro have clear-cut rules for how to
essentially merge two arbitrary JSON schemas?
2012/3/21 Jonathan Coveney
> ATM, there is no quick and easy solution short of patching Pig... feel
> free to make a ticket.
>
> Short of that, what you can do is load each relation with a different
> schema separately, and then do a union of them.
Please pull your FILTER out of GROUP BY and do it earlier
http://pig.apache.org/docs/r0.9.1/perf.html#filter
In this case, you could use a FILTER followed by a bincond to introduce a
new field "employerOrLocation", then do a group by and include the new
field in the GROUP BY clause.
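Roughly like this (an untested sketch; the load path and field names are
made up for illustration):

  data = LOAD 'input' AS (userId:chararray, type:chararray, value:chararray);
  -- drop unwanted records before the group, so less data hits the reducers
  filtered = FILTER data BY type == 'employer' OR type == 'location';
  -- the bincond tags each record with the new field
  tagged = FOREACH filtered GENERATE userId, value,
      (type == 'employer' ? 'employer' : 'location') AS employerOrLocation;
  grouped = GROUP tagged BY (userId, employerOrLocation);
  counts = FOREACH grouped GENERATE FLATTEN(group), COUNT(tagged) AS cnt;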
Thanks,
Prash
My input data size is 9GB and I am using 20 machines.
My grouping criteria has two cases, so I want 1) counts by the criteria I
have grouped on, and 2) counts of the two individual cases in each of my
groups.
So my script in detail is:
counts = FOREACH grouped {
selectedFields1 = FILTER
You are not doing grouping followed by counting. You are doing grouping
followed by filtering followed by counting.
Try filtering before grouping.
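Something like this, instead of the nested FILTER inside the FOREACH (a
rough sketch, untested; relation and field names are made up):

  raw = LOAD 'input' AS (key:chararray, category:chararray);
  -- the filter now happens map-side, before the shuffle
  caseA = FILTER raw BY category == 'A';
  groupedA = GROUP caseA BY key;
  countsA = FOREACH groupedA GENERATE group AS key, COUNT(caseA) AS cntA;
  -- repeat for the second case, or tag rows and group by (key, category)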
D
On Wed, Mar 21, 2012 at 12:34 PM, Rohini U wrote:
> Hi,
>
> I have a pig script which does a simple GROUPing followed by counting and I
> get this error.
Hi Rohini,
Can you provide some details on how big the input dataset is, the data volume
that reducers receive from mappers, and the number of reducers you are using?
Thanks,
Prashant
On Wed, Mar 21, 2012 at 12:34 PM, Rohini U wrote:
> Hi,
>
> I have a pig script which does a simple GROUPing followed by counting
Hi,
I have a pig script which does a simple GROUPing followed by counting and I
get this error. My data is certainly not big enough to cause this
out-of-memory error. Is there a chance that this is because of some bug?
Did anyone come across this kind of error before?
I am using pig 0.9.1
unbelievable!
https://twitter.com/#!/mcuban/status/182273293347328000
Anyone have more scoop on this?
ATM, there is no quick and easy solution short of patching Pig... feel free
to make a ticket.
Short of that, what you can do is load each relation with a different
schema separately, and then do a union of them. Given that there might be a
lot of different relations and schemas involved, you could p
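For what it's worth, the union part might look roughly like this (untested;
assumes piggybank's AvroStorage, Pig 0.8+ for UNION ONSCHEMA, and
placeholder paths):

  REGISTER piggybank.jar;  -- plus the avro/json jars AvroStorage depends on
  a = LOAD 'data_v1.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
  b = LOAD 'data_v2.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
  -- ONSCHEMA merges by field name and pads missing fields with nulls
  merged = UNION ONSCHEMA a, b;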
Hi guys,
Thanks again for your awesome hint about Sqoop.
I have another question: the data I'm working with is stored as Avro
files in Hadoop. When I try to glob them, everything works just
perfectly. But when I add something to the schema of a single data file
while the others remain, every
Hi Prashant -- yes, 8 GB total RAM, but we're seeing 300-400 MB heap
consumption per Pig invocation client-side.
We're also migrating soon to Azkaban, but it doesn't seem like it'd resolve
this issue, since from what I understand it simply wraps Grunt.
Norbert
On Wed, Mar 21, 2012 at 10:18 AM, P
Norbert,
You mean 8GB of memory on the client side to launch Pig, right? That seems
like a lot for simply spawning jobs. We use Azkaban to schedule jobs,
and there are tens of jobs spawned at once. Pig by itself should not be
so memory intensive.
Thanks,
Prashant
On Mar 21, 2012, at 6:50 AM, Norbert Burge
Folks -- how are you handling the "productionalization" of your Pig
submit nodes?
For our PROD environment, I originally thought we'd just have a few VMs
from which Pig jobs would be submitted onto our cluster. But on our 8GB
VMs, I found that we were often hitting heap OOM errors on a relativ