I have been using a replicated join to join a very large data set with
another one that is about 1000x smaller, and have generally seen large
performance gains.
However, the two scale together, so now, even though the RHS table is
still 1000x smaller, it is too large to fit into memory. There wi
You can solve this with the DISTINCT operator: it will give you only the
unique entries, and then you can count them.
Example:
data = LOAD '...' USING PigStorage() AS (id:int, field1:chararray, field2:chararray);
unique_data = DISTINCT data;
unique_count = FOREACH (GROUP unique_data ALL) GENERATE COUNT(unique_data);
Instead of
count = foreach perCust generate group, COUNT(filtered_times.movie);
use
count = foreach perCust generate FLATTEN(group), COUNT(filtered_times.movie);
FLATTEN is a special operator that un-nests a tuple, replacing it with the
elements inside the tuple.
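A minimal sketch of the difference, assuming a hypothetical relation
filtered_times with fields (customer, day, movie) grouped by two keys (with a
single grouping key, group is already a scalar and FLATTEN makes no
difference; with a composite key, group is a tuple):

```pig
-- Hypothetical data: (customer, day, movie) viewing records
perCust = GROUP filtered_times BY (customer, day);
-- Without FLATTEN the key stays nested:  ((cust1,Mon), 5)
-- With FLATTEN the key's fields are hoisted to the top level: (cust1, Mon, 5)
count = FOREACH perCust GENERATE FLATTEN(group), COUNT(filtered_times.movie);
```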
On Thu, Oct 11, 2012 at 4:36 PM, jamal sasha
wrote:
From my interpretation, Hive coalesce returns the first non-null value.
So it seems you are just doing a null check on x, returning y if it is
null and z otherwise?
In Pig you could do something like "(x is null ? y : z)". This is a
standard ternary if/else. Don't see if the 0.00 actually plays
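A minimal sketch of the Pig bincond, assuming a hypothetical relation with
fields x, y, and z:

```pig
-- Hypothetical relation with fields x, y, z
src = LOAD 'input' AS (x:double, y:double, z:double);
-- Bincond: if x is null, emit y; otherwise emit z
result = FOREACH src GENERATE (x is null ? y : z) AS picked;
```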
>> grunt> a = load 'input';
>> grunt> b = group a by $0;
>> grunt> c = foreach b generate group, COUNT(a);
>> grunt> dump c;
>> (John,3)
>> (Lisa,2)
>> (James,2)
>> (Larry,1)
>> (Amanda,2)
Just some obvious checks -
I assume there is a REGISTER statement at the top of the script, and that
you have the proper package name in the function call
"org.apache..udfs.MyUdf", or a DEFINE statement above? What are
the asterisks for?
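For reference, the usual shape is below; the jar path, package name, and
field names here are placeholders, not the poster's actual values:

```pig
-- Hypothetical jar and fully qualified class name
REGISTER 'myudfs.jar';
DEFINE MyUdf org.apache.myproject.udfs.MyUdf();

data = LOAD 'input' AS (f1:chararray);
-- After DEFINE, the short alias can be used directly
out = FOREACH data GENERATE MyUdf(f1);
```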
On Wed, Sep 19, 2012 at 2:11 PM, Dipesh Kumar Singh wrote:
Looking for an elegant way to do this:
Suppose there is a bag with names { James, John, Lisa, Larry, Amanda,
Amanda, John, James, Lisa, John}
I'd like to get something back along the lines of a tuple (2, 2, 3, 1,
2) where those are the counts for Amanda, James, John, Larry, Lisa
respectively.
Obv
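The grouping approach shown in the grunt session above can be sketched as a
script; the relation and field names are assumptions. Note it yields
(name, count) pairs rather than a single bare tuple of counts:

```pig
-- Hypothetical input: one name per line
names = LOAD 'input' AS (name:chararray);
grouped = GROUP names BY name;
counts = FOREACH grouped GENERATE group AS name, COUNT(names) AS n;
-- Alphabetical order matches the desired Amanda, James, John, Larry, Lisa
ordered = ORDER counts BY name;
```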