Pig load to HBase not invoking coprocessor

2012-03-22 Thread Nick
I'm having a possible issue with a simple pig load that writes to an HBase table. The issue is that when I run the test pig script it does not invoke the region observer coprocessor on the table. I have verified that my coprocessor executes when I use the HBase client API to do a simple put(). S

Re: IllegalArgumentException: Not a host:port pair

2012-03-22 Thread Ryan Cole
Hmm. The data in my tables is not important, so I dropped the table and recreated it. This doesn't seem to have resolved the issue, though. Is there perhaps a Pig query I can run that would use a built-in HBase table, like the .META. table, and see if it works? I don't know if that'd help or

Re: IllegalArgumentException: Not a host:port pair

2012-03-22 Thread Norbert Burger
Actually on second glance, this seems like an issue not with the HBase config, but with the server:port info inside your .META. table. Have you tried LOADing from a different table besides "events"? From the HBase shell, you can use the following command to extract server hostnames for each of yo
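The snippet is cut off before the command itself; in the HBase shell of that era, a scan along these lines lists the assigned server for each region (syntax shown is a sketch, not quoted from the original message):

```
hbase> scan '.META.', {COLUMNS => 'info:server'}
```

Comparing the hostnames returned there against what the client can actually resolve is a common way to track down "Not a host:port pair" errors.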

Re: IllegalArgumentException: Not a host:port pair

2012-03-22 Thread Ryan Cole
I was thinking that maybe it was because I did not have HBase config path on PIG_CLASSPATH, so I added it. This did not help, though. Ryan On Thu, Mar 22, 2012 at 9:07 PM, Ryan Cole wrote: > Norbert, > > I have confirmed that this is indeed an issue connecting to HBase. I tried > just running a

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-22 Thread Prashant Kommireddi
Rohini, Here is the JIRA. https://issues.apache.org/jira/browse/PIG-2610 Can you please post the stacktrace as a comment to it? Thanks, Prashant On Thu, Mar 22, 2012 at 2:37 PM, Jonathan Coveney wrote: > Rohini, > > In the meantime, something like the following should work: > > aw = LOAD 'inpu

Re: IllegalArgumentException: Not a host:port pair

2012-03-22 Thread Ryan Cole
Norbert, I have confirmed that this is indeed an issue connecting to HBase. I tried just running a Pig script that did not use HBaseStorage, and it works. Here is my hbase-site.xml config file, as well as my query that I'm running: https://gist.github.com/2166187 Also, for ease of reference, her

Re: IllegalArgumentException: Not a host:port pair

2012-03-22 Thread Norbert Burger
You're encountering problems connecting to HBase (presumably your Pig script uses HBaseStorage). How does your hbase/conf/hbase-site.xml look? Norbert On Thu, Mar 22, 2012 at 9:16 PM, Ryan Cole wrote: > Hello, > > I'm new to these lists. I'm trying to get Pig working, for my first time. I > ha

IllegalArgumentException: Not a host:port pair

2012-03-22 Thread Ryan Cole
Hello, I'm new to these lists. I'm trying to get Pig working for my first time. I have set up Hadoop and HBase (on HDFS) using the pseudo-distributed setup, all on one machine. I am able to run MapReduce jobs, using the example.jar file included with the Hadoop release. Whenever I try to run even

Re: Inserting date

2012-03-22 Thread Mohit Anchlia
you write to populate the date for you!) 2012/3/22 Mohit Anchlia On Thu, Mar 22, 2012 at 2:34 PM, Thejas Nair wrote: Is this what you are looking for? -

Re: Inserting date

2012-03-22 Thread Thejas Nair
se you write to populate the date for you!) 2012/3/22 Mohit Anchlia On Thu, Mar 22, 2012 at 2:34 PM, Thejas Nair wrote: Is this what you are looking for ? - A = LOAD '$in' USING PigStorage('\t') AS (... B = foreach A generate *, '20120322' as date; STORE B i

Re: Inserting date

2012-03-22 Thread Mohit Anchlia
wrote: Is this what you are looking for? - A = LOAD '$in' USING PigStorage('\t') AS (... B = foreach A generate *, '20120322' as date;

Re: Exception during map-reduce

2012-03-22 Thread Alex Rovner
You are over-allocating memory per Java child process in Hadoop. Memory allocation = (mappers + reducers) * child.java.opts memory setting. This would only happen when your node is fully utilized. Alex Rovner Sent from my iPhone On Mar 21, 2012, at 10:41 PM, rakesh sharma wrote: > > Hi Al
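The formula above can be made concrete with a quick back-of-the-envelope calculation; the slot counts and heap size below are hypothetical, not taken from the thread:

```python
# Hypothetical node: 8 map slots, 4 reduce slots, 1 GB heap per child JVM
# (e.g. mapred.child.java.opts = -Xmx1024m). All values are assumptions.
mappers = 8
reducers = 4
child_heap_mb = 1024

# Worst case when the node is fully utilized: every slot runs a child JVM.
total_heap_mb = (mappers + reducers) * child_heap_mb
print(total_heap_mb)  # 12288 MB, i.e. 12 GB of heap demand on one node
```

If that total exceeds the node's physical RAM, a fully loaded node will start swapping or thrashing the garbage collector, which is consistent with the GC overhead errors described in this thread.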

Re: Inserting date

2012-03-22 Thread Jonathan Coveney
te to populate the date for you!) 2012/3/22 Mohit Anchlia On Thu, Mar 22, 2012 at 2:34 PM, Thejas Nair wrote: Is this what you are looking for? - A = LOAD '$in' USING PigStorage('\t') AS (...

Re: Inserting date

2012-03-22 Thread Mohit Anchlia
On Thu, Mar 22, 2012 at 2:34 PM, Thejas Nair wrote: > Is this what you are looking for ? - > > > A = LOAD '$in' USING PigStorage('\t') AS (... > > B = foreach A generate *, '20120322' as date; > > STORE B into ... > > Thanks, > Thej

Re: Could not infer the matching function for org.apache.pig.builtin.COUNT

2012-03-22 Thread Jason Alexander
Jonathan, Prashant, you guys are awesome! Thanks for the explanation! It's much clearer now! On Mar 22, 2012, at 4:40 PM, Prashant Kommireddi wrote: > Aggregation functions (COUNT, SUM, AVG..) work on bags. Since you are > counting on the entire relation in this case you did a GROUP ALL, in whi

Re: Inserting date

2012-03-22 Thread Prashant Kommireddi
Mohit, Is date a field in your dataset, or the current date, or something else? A few options: 1. You could let the database implicitly create a date field if you need the INSERT date. 2. As Thejas suggested, simply insert it as '20120322' as date. I don't think the DB has an

Re: Could not infer the matching function for org.apache.pig.builtin.COUNT

2012-03-22 Thread Prashant Kommireddi
Aggregation functions (COUNT, SUM, AVG..) work on bags. Since you are counting on the entire relation in this case you did a GROUP ALL, in which case, as you said, forms a bag out of all tuples. grunt> A = load 'data' as (a:int, b:int); grunt> describe A; A: {a: int,b: int} Now, once the GROUP op
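Continuing the grunt session above, a sketch of what the GROUP ALL step typically produces (the describe output shown is illustrative of Pig 0.9 behavior, not quoted from the original message):

```
grunt> B = GROUP A ALL;
grunt> describe B;
B: {group: chararray,A: {(a: int,b: int)}}
grunt> C = FOREACH B GENERATE COUNT(A);
```

GROUP ALL collapses the whole relation into a single row whose second field is a bag holding every tuple, which is exactly the shape COUNT expects.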

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-22 Thread Jonathan Coveney
Rohini, In the meantime, something like the following should work: raw = LOAD 'input' using MyCustomLoader(); searches = FOREACH raw GENERATE day, searchType, FLATTEN(impBag) AS (adType, clickCount) ; searches_2 = foreach searches generate *, ( adType ==
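The message is truncated mid-expression; a sketch of where the workaround appears to be heading, using indicator columns that keep the aggregation algebraic (the adType values and field names beyond those shown are assumptions):

```
-- sketch only: 'EMPLOYER' and 'LOCATION' are assumed adType values
raw       = LOAD 'input' USING MyCustomLoader();
searches  = FOREACH raw GENERATE day, searchType,
                     FLATTEN(impBag) AS (adType, clickCount);
searches_2 = FOREACH searches GENERATE *,
                     (adType == 'EMPLOYER' ? clickCount : 0) AS employerClicks,
                     (adType == 'LOCATION' ? clickCount : 0) AS locationClicks;
grouped   = GROUP searches_2 BY (day, searchType) PARALLEL 50;
counts    = FOREACH grouped GENERATE FLATTEN(group),
                     SUM(searches_2.employerClicks) AS employerTotal,
                     SUM(searches_2.locationClicks) AS locationTotal;
```

Because SUM is algebraic, the combiner can run on the map side, so the reducers never need to hold the raw records in memory.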

Re: Inserting date

2012-03-22 Thread Thejas Nair
Is this what you are looking for ? - A = LOAD '$in' USING PigStorage('\t') AS (... B = foreach A generate *, '20120322' as date; STORE B into ... Thanks, Thejas On 3/22/12 1:13 PM, Mohit Anchlia wrote: Yes that's exactly what I am asking. Reading from f
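The snippet is cut off at the STORE line; a complete sketch of the pattern (the schema, '$out' path, and delimiter are placeholders, not from the original message):

```
-- sketch: schema and output path are assumptions
A = LOAD '$in' USING PigStorage('\t') AS (f1:chararray, f2:chararray);
B = FOREACH A GENERATE *, '20120322' AS date;
STORE B INTO '$out' USING PigStorage('\t');
```

GENERATE * followed by a constant appends the literal date as a trailing column on every row before the store.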

Re: Could not infer the matching function for org.apache.pig.builtin.COUNT

2012-03-22 Thread Jonathan Coveney
Whoops, fat-fingered it. Part two: grunt> d = foreach c generate SUM($0); Wait a second... this doesn't make much sense. Foreaches work on columns in rows, not on relations (nothing works on relations). So how do we count things? We need to put everything in one row. grunt> d = group c all; grunt

Re: Could not infer the matching function for org.apache.pig.builtin.COUNT

2012-03-22 Thread Jonathan Coveney
The reason can be a little hard to grok at first, but it's core to Pig... perhaps we need a tutorial explaining the model a bit more clearly. The foundation of Pig is a relation, i.e., scans. What does this mean? It means that you have a bunch of rows, and these rows have things. I'm going to diverg

Re: Could not infer the matching function for org.apache.pig.builtin.COUNT

2012-03-22 Thread Jason Alexander
Very nice, worked like a champ, Prashant. Any chance you could explain why? I'd love to be taught to fish, not just given the fish to eat. ;-) GROUP ALL, as I read it, pulls the tuples into a single group. But, FOREACH'ing on each group, and counting against productscans is where my brain start

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-22 Thread Rohini U
Thanks Prashant, I am using Pig 0.9.1 and hadoop 0.20.205 Thanks, Rohini On Thu, Mar 22, 2012 at 1:27 PM, Prashant Kommireddi wrote: > This makes more sense, grouping and filter are on different columns. I will > open a JIRA soon. > > What version of Pig and Hadoop are you using? > > Thanks, > P

Re: Could not infer the matching function for org.apache.pig.builtin.COUNT

2012-03-22 Thread Prashant Kommireddi
Hi Jason, Are you trying to count the number of records in the relation 'productscans'? In which case you would have to use GROUP http://pig.apache.org/docs/r0.9.1/basic.html#GROUP grpd = GROUP productscans ALL; scancount = FOREACH grpd GENERATE COUNT(productscans); DUMP scancount; Thanks, Prash

Could not infer the matching function for org.apache.pig.builtin.COUNT

2012-03-22 Thread Jason Alexander
Hey all, I'm trying to write a script to pull the count of a dataset that I've filtered. Here's the script so far: /* scans by title */ scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
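The script is truncated before the counting step; a sketch of the full flow, following the GROUP ALL pattern suggested elsewhere in the thread (the FILTER condition is a hypothetical stand-in for whatever criterion was actually used):

```
-- sketch: the title filter is a placeholder, not from the original script
scans        = LOAD '/hive/scans/*' USING PigStorage(',')
               AS (thetime:long, product_id:long, lat:double, lon:double,
                   user:chararray, category:chararray, title:chararray);
productscans = FILTER scans BY title == 'Some Title';
grpd         = GROUP productscans ALL;
scancount    = FOREACH grpd GENERATE COUNT(productscans);
DUMP scancount;
```

Calling COUNT directly on the filtered relation fails with "Could not infer the matching function" because COUNT takes a bag, not a relation; GROUP ALL supplies that bag.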

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-22 Thread Prashant Kommireddi
This makes more sense, grouping and filter are on different columns. I will open a JIRA soon. What version of Pig and Hadoop are you using? Thanks, Prashant On Thu, Mar 22, 2012 at 1:12 PM, Rohini U wrote: > Hi Prashant, > > Here is my script in full. > > > raw = LOAD 'input' using MyCustomLoa

Re: Inserting date

2012-03-22 Thread Mohit Anchlia
Yes, that's exactly what I am asking: reading from a flat file and then inserting it into the database. And I want to insert the date before storing. E.g., I want to add the date before A gets stored: A = LOAD '$in' USING PigStorage('\t') AS (... STORE A into ... On Thu, Mar 22, 2012 at 12:54 PM, Jonath

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-22 Thread Rohini U
Hi Prashant, Here is my script in full. raw = LOAD 'input' using MyCustomLoader(); searches = FOREACH raw GENERATE day, searchType, FLATTEN(impBag) AS (adType, clickCount) ; groupedSearches = GROUP searches BY (day, searchType) PARALLEL 50; counts =

Re: Inserting date

2012-03-22 Thread Jonathan Coveney
Do you mean you're reading a relation from Hadoop, and want to append the date to the row before inserting it? I'm not quite sure what you're asking for. 2012/3/22 Mohit Anchlia > Sorry I mean to ask if there is any way to insert date into the ALIAS so > that I can use it before storing it into

Re: Inserting date

2012-03-22 Thread Mohit Anchlia
Sorry, I meant to ask if there is any way to insert the date into the ALIAS so that I can use it before storing it into the DB. On Thu, Mar 22, 2012 at 12:47 PM, Mohit Anchlia wrote: > I am reading bunch of columns from a flat file and inserting it into the > database. Is there a way to also insert date? >

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-22 Thread Prashant Kommireddi
Hi Rohini, From your query it looks like you are already grouping it by TYPE, so not sure why you would want the SUM of, say, "EMPLOYER" type in "LOCATION" and vice-versa. Your output is already broken down by TYPE. Thanks, Prashant On Thu, Mar 22, 2012 at 9:03 AM, Rohini U wrote: > Thanks for

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-22 Thread Dmitriy Ryaboy
It's done for some cases, but this one is different since the group key needs to change. D On Wed, Mar 21, 2012 at 11:41 PM, Prashant Kommireddi wrote: > Sure I can do that. Isn't this something that should be done already? Or > does it not work if the filter is working on a field that is part o

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-22 Thread Dmitriy Ryaboy
So, as explained earlier, the reason you are running out of memory is that you are loading all records into memory when you want to do non-algebraic things to results of grouping. Can you come up with ways to achieve what you need without having to have the raw records at the reducer? One way has

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-22 Thread Rohini U
Has a Jira been filed for this? I can send my example I am trying if that helps. Thanks, Rohini On Wed, Mar 21, 2012 at 11:41 PM, Prashant Kommireddi wrote: > Sure I can do that. Isn't this something that should be done already? Or > does it not work if the filter is working on a field that is p

Re: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

2012-03-22 Thread Rohini U
Thanks for the suggestion, Prashant. However, that will not work in my case. If I filter before the group and include the new field in the group as you suggested, I get the individual counts broken down by the select-field criteria. However, I want the totals also, without taking the select fields into acco
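One way to get both the per-type breakdown and the overall totals is to aggregate the same relation twice; a sketch under the field names used earlier in the thread (treat them as assumptions):

```
-- sketch: per-(day, searchType) counts plus day-level totals
byType  = GROUP searches BY (day, searchType);
perType = FOREACH byType GENERATE FLATTEN(group) AS (day, searchType),
                                  SUM(searches.clickCount) AS clicks;
byDay   = GROUP searches BY day;
totals  = FOREACH byDay GENERATE group AS day,
                                 SUM(searches.clickCount) AS clicks;
```

Both aggregations use SUM, so each stays algebraic and combiner-friendly, avoiding the need to materialize raw records at the reducers.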