I can share demo data to go with the script. Does anyone have a clue?

On 24 March 2015 at 14:04, Ronald Green <green.ron...@gmail.com> wrote:

> Hi,
>
> I stumbled upon a case where MIN/MAX on strings returns values that
> are definitely not the minimum or the maximum.
>
> When executed on 1 million records, the following script produces wrong
> values for MIN/MAX:
>
> ```
> src = LOAD 's3n://.../' USING PigStorage('\t','-noschema') AS (field1:int,
> field2:int, field3:int, field4:chararray, field5:chararray,
> field6:chararray, field7:chararray, field8:chararray);
> agg = GROUP src BY (field3);
> proj = FOREACH agg GENERATE group AS field3, COUNT_STAR(src) AS
> countme, datafu.pig.stats.HyperLogLogPlusPlus(src.field5) AS HLL1,
> MIN(src.field8) AS Minval, MAX(src.field8) AS Maxval;
> STORE proj INTO 's3n://...' USING PigStorage('\t');
> ```
>
> If I make the following changes, the results for MIN and MAX are as
> expected:
>
> 1. Remove the use of HyperLogLogPlusPlus
> 2. Treat field8 as a datetime field instead of a chararray
> 3. Execute the script on only 1/100 of the data
>
> Note that the script runs as a single map/reduce job with a single
> map task and a single reduce task.
>
> Any ideas?
>
> Thanks,
> Ron
>
