Replicated Join and OOM errors

2013-07-19 Thread Arun Ahuja
I have been using a replicated join to join a very large data set with another one that is about 1000x smaller, and have generally seen large performance gains. However, the two scale together, so even though the RHS table is still 1000x smaller, it is now too large to fit into memory. There wi
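
For reference, a minimal sketch of the replicated join syntax under discussion. The relation names, paths, and schemas are hypothetical and not taken from the original message. The point is only that the relation listed after the first one is loaded into memory, which is why the size of the RHS is the limit here.

```pig
-- Hypothetical relations: 'big' is the large LHS, 'small' is the ~1000x smaller RHS.
big   = LOAD 'big_input'   AS (key:chararray, value:chararray);
small = LOAD 'small_input' AS (key:chararray, attr:chararray);

-- With USING 'replicated', the relation listed after the first one is
-- loaded into memory, so 'small' has to fit in each map task's memory.
joined = JOIN big BY key, small BY key USING 'replicated';
```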

Re: count duplicate entries

2013-04-02 Thread Arun Ahuja
You can solve this using the DISTINCT operator. It will give you only the unique entries, and then you can count them. Example: data = LOAD '...' USING PigStorage() as (id:int, field1:chararray, field2:chararray); unique_data = DISTINCT data; unique_count = FOREACH (GROUP unique_data
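
The preview above is cut off, so what follows is only a sketch of one common way to count distinct rows, not necessarily the statement from the original message. The input path is a placeholder, and the use of GROUP ... ALL is an assumption about how the count is taken.

```pig
-- 'input' is a placeholder path; the schema follows the example in the message.
data = LOAD 'input' USING PigStorage() AS (id:int, field1:chararray, field2:chararray);

-- Drop duplicate rows.
unique_data = DISTINCT data;

-- Assumption: collect every remaining row into one group and count it.
unique_count = FOREACH (GROUP unique_data ALL) GENERATE COUNT(unique_data);
DUMP unique_count;
```

COUNT skips rows whose first field is null; COUNT_STAR can be used instead if those rows should be counted too.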

Re: question

2012-10-12 Thread Arun Ahuja
Instead of count = foreach perCust generate group, COUNT(filtered_times.movie); use count = foreach perCust generate FLATTEN(group), COUNT(filtered_times.movie); FLATTEN is a special operator that replaces a tuple with the elements inside the tuple. On Thu, Oct 11, 2012 at 4:36 PM, jamal sasha
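
A small sketch of the difference FLATTEN makes. The load statement, field names, and grouping keys are assumptions for illustration, since the original script is not shown in this preview. Only the perCust and filtered_times aliases come from the message.

```pig
-- Hypothetical input: one (customer, day, movie) row per viewing.
filtered_times = LOAD 'input' USING PigStorage(',')
                 AS (customer:chararray, day:chararray, movie:chararray);

-- Grouping by two keys makes 'group' a tuple of (customer, day).
perCust = GROUP filtered_times BY (customer, day);

-- Without FLATTEN, each output row has the shape ((customer, day), count).
nested = FOREACH perCust GENERATE group, COUNT(filtered_times.movie);

-- With FLATTEN, the tuple is unpacked and the shape is (customer, day, count).
flat = FOREACH perCust GENERATE FLATTEN(group), COUNT(filtered_times.movie);
```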

Re: Small question

2012-10-12 Thread Arun Ahuja
From my interpretation, Hive coalesce returns the first non-null value. So it seems you are just doing a null check on x and returning y if it is null and z otherwise? In Pig you could do something like (x is null ? y : z). This is a standard ternary if/else. Don't see if the 0.00 actually plays
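
A minimal sketch of the ternary (bincond) expression inside a FOREACH. The relation name, path, and column types are hypothetical. The second statement shows how a two-argument coalesce could be written the same way, as an illustration rather than something taken from the original question.

```pig
-- Hypothetical relation with three numeric columns.
data = LOAD 'input' AS (x:double, y:double, z:double);

-- The null check described above: y when x is null, otherwise z.
checked = FOREACH data GENERATE (x is null ? y : z) AS value;

-- Two-argument coalesce(x, y): x when it is non-null, otherwise y.
coalesced = FOREACH data GENERATE (x is not null ? x : y) AS value;
```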

Re: Counting elements in a bag

2012-09-21 Thread Arun Ahuja
le system at: file:///
>> grunt> a = load 'input';
>> grunt> b = group a by $0;
>> grunt> c = foreach b generate group, COUNT(a);
>> grunt> dump c;
>> (John,3)
>> (Lisa,2)
>> (James,2)
>> (Larry,1)
>> (Amanda,2)

Re: Two or more arguments in udf

2012-09-19 Thread Arun Ahuja
Just some obvious checks - I assume there is some register statement at the top of the script and you have the proper package name in the function call "org.apache..udfs.MyUdf" or use a DEFINE statement above? What are the asterisks for? On Wed, Sep 19, 2012 at 2:11 PM, Dipesh Kumar Singh wrote

Counting elements in a bag

2012-09-19 Thread Arun Ahuja
Looking for an elegant way to do this: Suppose there is a bag with names { James, John, Lisa, Larry, Amanda, Amanda, John, James, Lisa, John} I'd like to get something back along the lines of a tuple (2, 2, 3, 1, 2) where those are the counts for Amanda, James, John, Larry, Lisa respectively. Obv