I can share demo data to go with the script. Does anyone have a clue?

On 24 March 2015 at 14:04, Ronald Green <green.ron...@gmail.com> wrote:
> Hi,
>
> I stumbled upon a case where MIN/MAX on strings returns values that
> are definitely not the minimum or the maximum.
>
> When executed on 1 million records, the following script produces wrong
> values for MIN/MAX:
>
> ```
> src = LOAD 's3n://.../' USING PigStorage('\t','-noschema') AS (field1:int,
>     field2:int, field3:int, field4:chararray, field5:chararray,
>     field6:chararray, field7:chararray, field8:chararray);
> agg = GROUP src BY (field3);
> -- after GROUP, the bag of grouped rows is named after the input alias (src)
> proj = FOREACH agg GENERATE group AS field3, COUNT_STAR(src) AS countme,
>     datafu.pig.stats.HyperLogLogPlusPlus(src.field5) AS HLL1,
>     MIN(src.field8) AS Minval, MAX(src.field8) AS Maxval;
> STORE proj INTO 's3n://...' USING PigStorage('\t');
> ```
>
> If I make any of the following changes, the results for MIN and MAX are
> as expected:
>
> 1. Remove the use of HyperLogLogPlusPlus
> 2. Treat field8 as a datetime field instead of chararray
> 3. Execute the script on only 1/100 of the data
>
> Note that the job consists of a single map/reduce job with a single
> map task and a single reduce task.
>
> Any idea?
>
> Thanks,
> Ron
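
For anyone who wants to cross-check the Pig output against the demo data, here is a minimal Python sketch that computes a reference minimum and maximum of field8 per field3 value. It is only an illustration, not part of the original script. The function name `min_max_by_group` and the inline sample rows are hypothetical, and it assumes tab-separated input with field3 in the third column and field8 in the eighth, plain ASCII values compared as ordinary strings, and empty values skipped on the assumption that they load as nulls.

```python
def min_max_by_group(lines, group_col=2, value_col=7):
    """Return {group: (min_value, max_value)} using plain string comparison.

    group_col and value_col are zero-based column indexes; the defaults
    correspond to field3 and field8 in the script's schema (an assumption).
    """
    stats = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(group_col, value_col):
            continue  # skip rows that are too short to hold both columns
        key, value = fields[group_col], fields[value_col]
        if value == "":
            continue  # assumption: empty values are treated as nulls and ignored
        if key not in stats:
            stats[key] = (value, value)
        else:
            lo, hi = stats[key]
            stats[key] = (min(lo, value), max(hi, value))
    return stats


# Hypothetical sample rows laid out like the eight-field schema in the script.
sample = [
    "1\t2\t10\ta\tb\tc\td\t2015-03-01\n",
    "1\t2\t10\ta\tb\tc\td\t2015-01-15\n",
    "1\t2\t20\ta\tb\tc\td\t2015-02-20\n",
]
result = min_max_by_group(sample)
for group, (lo, hi) in sorted(result.items()):
    print(group, lo, hi)
```

Passing an open file handle for a local copy of the input instead of `sample` should give per-group values to compare against Minval and Maxval in the stored output.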