Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-21 Thread Prashant Kommireddi
Sure, I can do that. Isn't this something that should be done already? Or does it not work if the filter is working on a field that is part of the group? On Wed, Mar 21, 2012 at 11:02 PM, Dmitriy Ryaboy wrote: > Prashant, mind filing a jira with this example? Technically, this is > something we c

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-21 Thread Dmitriy Ryaboy
Prashant, mind filing a jira with this example? Technically, this is something we could do automatically. On Wed, Mar 21, 2012 at 5:02 PM, Prashant Kommireddi wrote: > Please pull your FILTER out of GROUP BY and do it earlier > http://pig.apache.org/docs/r0.9.1/perf.html#filter > > In this case,

Re: how to best process key-value pairs with Pig

2012-03-21 Thread shan s
The numbers 100 and 20 denote metadata counts; the number of data instances is large. Moreover, given the denormalized form, it can't take advantage of indexes. The data is currently denormalized, in the sense that instead of having 100 sparse columns, the data is stored as key-value pairs in a 3-column table. One r

Re: how to best process key-value pairs with Pig

2012-03-21 Thread Bill Graham
What about denormalizing and just representing these as 4-tuples of (id, type, name, value) in a text file? You could always then group by type if you need to get back to distinct types. Are you joining against a larger dataset? I ask just because 10x200 values is not a lot and can be done without
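
A minimal Pig sketch of what Bill describes, with hypothetical file and field names (the actual layout isn't shown in the thread):

    -- load the flattened key-value data as 4-tuples
    kv = LOAD 'kv_data.txt' USING PigStorage('\t')
         AS (id:chararray, type:chararray, name:chararray, value:chararray);
    -- recover the per-type view when needed
    by_type = GROUP kv BY type;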

Exception during map-reduce

2012-03-21 Thread rakesh sharma
Hi All, I am using pig-0.8.1-cdh3u2 and hadoop-0.20.2-cdh3u1 on a Linux box. Once in a while, I get the following exception: WARN: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - There is no log file to write to. ERROR: org.apache.pig.backend.hadoop.executionengine.ma

Re: Globbing several AVRO files with different (extended) schemes

2012-03-21 Thread Stan Rosenberg
There is a patch for AvroStorage which computes a union schema, thereby allowing input Avro files with different schemas, specifically (un-nested) records with different fields. https://issues.apache.org/jira/browse/PIG-2579 Best, stan On Wed, Mar 21, 2012 at 8:31 PM, Jonathan Coveney wrote:
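
How this would look in use, assuming the PIG-2579 patch is applied (the path and the DESCRIBE behavior here are illustrative, not taken from the patch itself):

    -- glob several .avro files whose record schemas differ
    records = LOAD '/data/part*.avro'
              USING org.apache.pig.piggybank.storage.avro.AvroStorage();
    DESCRIBE records;  -- with the patch, shows the computed union schema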

Re: Globbing several AVRO files with different (extended) schemes

2012-03-21 Thread Jonathan Coveney
A question about this: does Avro have clear-cut rules for how to essentially merge two arbitrary JSON schemas? 2012/3/21 Jonathan Coveney > ATM, there is no quick and easy solution short of patching Pig... feel > free to make a ticket. > > Short of that, what you can do is load each relation wit

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-21 Thread Prashant Kommireddi
Please pull your FILTER out of GROUP BY and do it earlier: http://pig.apache.org/docs/r0.9.1/perf.html#filter In this case, you could use a FILTER followed by a bincond to introduce a new field "employerOrLocation", then do a group by and include the new field in the GROUP BY clause. Thanks, Prash
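
A rough Pig sketch of this suggestion; the field names are made up, since the original script isn't shown:

    -- filter early, then tag each row with a bincond instead of
    -- filtering inside the GROUP
    raw    = LOAD 'input' AS (name:chararray, field:chararray);
    wanted = FILTER raw BY field == 'employer' OR field == 'location';
    tagged = FOREACH wanted GENERATE name,
                 (field == 'employer' ? 'employer' : 'location')
                     AS employerOrLocation;
    grouped = GROUP tagged BY (name, employerOrLocation);
    counts  = FOREACH grouped GENERATE FLATTEN(group), COUNT(tagged) AS cnt;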

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-21 Thread Rohini U
My input data size is 9GB and I am using 20 machines. My grouping criteria has two cases, so I want 1) counts by the criteria I have grouped on, and 2) counts of the two individual cases in each of my groups. So my script in detail is: counts = FOREACH grouped { selectedFields1 = FILTER
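
The script is cut off above; judging from the fragment, it likely has the shape below (a guess at the pattern, not Rohini's actual code), with FILTER running inside a nested FOREACH after the GROUP:

    -- filtering inside the nested FOREACH means each group's whole bag
    -- is materialized on the reducer before counting
    grouped = GROUP data BY key;
    counts  = FOREACH grouped {
                  selectedFields1 = FILTER data BY kind == 'A';
                  selectedFields2 = FILTER data BY kind == 'B';
                  GENERATE group, COUNT(selectedFields1), COUNT(selectedFields2);
              };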

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-21 Thread Dmitriy Ryaboy
You are not doing grouping followed by counting. You are doing grouping followed by filtering followed by counting. Try filtering before grouping. D On Wed, Mar 21, 2012 at 12:34 PM, Rohini U wrote: > Hi, > > I have a pig script which does a simple GROUPing followed by counting and I > get this
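
A minimal sketch of the reordering Dmitriy suggests, with placeholder names:

    -- drop unwanted rows before the shuffle, so reducers see smaller bags
    filtered = FILTER data BY kind == 'A';
    grouped  = GROUP filtered BY key;
    counts   = FOREACH grouped GENERATE group, COUNT(filtered);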

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-21 Thread Prashant Kommireddi
Hi Rohini, Can you provide some details on how big the input dataset is, the data volume the reducers receive from the mappers, and the number of reducers you are using? Thanks, Prashant On Wed, Mar 21, 2012 at 12:34 PM, Rohini U wrote: > Hi, > > I have a pig script which does a simple GROUPing foll

Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-21 Thread Rohini U
Hi, I have a pig script which does a simple GROUPing followed by counting and I get this error. My data is certainly not big enough to cause this out-of-memory error. Is there a chance that this is because of some bug? Did anyone come across this kind of error before? I am using pig 0.9.1

most high profile user

2012-03-21 Thread Raghu Angadi
Unbelievable! https://twitter.com/#!/mcuban/status/182273293347328000 Does anyone have more scoop on this?

Re: Globbing several AVRO files with different (extended) schemes

2012-03-21 Thread Jonathan Coveney
ATM, there is no quick and easy solution short of patching Pig... feel free to make a ticket. Short of that, what you can do is load each relation with a different schema separately, and then do a union of them. Given that there might be a lot of different relations and schemas involved, you could p
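
A sketch of this workaround, assuming just two files (the file names are placeholders; UNION ONSCHEMA merges relations by field name):

    a = LOAD 'one.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
    b = LOAD 'two.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
    -- fields present in only one relation come through as nulls
    merged = UNION ONSCHEMA a, b;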

Globbing several AVRO files with different (extended) schemes

2012-03-21 Thread Markus Resch
Hi guys, Thanks again for your awesome hint about Sqoop. I have another question: the data I'm working with is stored as Avro files in Hadoop. When I try to glob them, everything works just perfectly. But when I add something to the schema of a single data file while the others remain, every

Re: Pig submit nodes

2012-03-21 Thread Norbert Burger
Hi Prashant -- yes, 8 GB total RAM, but we're seeing 300-400 MB heap consumption per Pig invocation client-side. We're also migrating soon to Azkaban, but it doesn't seem like it'd resolve this issue, since from what I understand it simply wraps Grunt. Norbert On Wed, Mar 21, 2012 at 10:18 AM, P

Re: Pig submit nodes

2012-03-21 Thread Prashant Kommireddi
Norbert, you mean 8GB memory on the client side to launch Pig, right? That seems like a lot for simply spawning jobs. We use Azkaban to schedule jobs and there are tens of jobs spawned at once. Pig by itself should not be so memory intensive. Thanks, Prashant On Mar 21, 2012, at 6:50 AM, Norbert Burge

Pig submit nodes

2012-03-21 Thread Norbert Burger
Folks -- how are you handling the "productionalization" of your Pig submit nodes? For our PROD environment, I originally thought we'd just have a few VMs from which Pig jobs would be submitted onto our cluster. But on our 8GB VMs, I found that we were often hitting heap OOM errors on a relativ