Hi all,

I have an activity log stored in an external Hive table, LZO-compressed, and 
partitioned by 'dt', the date on which the data was recorded. Because of time 
zones and when we dump the data into HDFS, each partition holds about 22 hours 
of one day and about 2 hours of the following day. In the example below, about 
95% of the rows fall on the partition's date.

The first column is a timestamp ("ts", bigint). I want to count the number of 
actions on each day within a single partition with this query:

    SELECT TO_DATE(FROM_UNIXTIME(ts)) AS day, COUNT(*)
    FROM activity_log
    WHERE dt = "2012-05-29"
    GROUP BY TO_DATE(FROM_UNIXTIME(ts))

The correct results are (I confirmed these by decompressing the files locally 
and counting with a simple python script):

    NULL 201
    2012-05-29 80677204
    2012-05-30 3826101
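
For concreteness, here is a sketch of that sort of counting script (it assumes 
tab-delimited rows with the epoch-seconds timestamp in the first field; rows 
whose first field fails to parse are tallied under None, matching Hive's NULL):

```python
from collections import Counter
from datetime import datetime, timezone

def count_by_day(lines):
    """Tally rows per calendar day, using the first tab-delimited field
    as a Unix timestamp in seconds; corrupt rows are tallied under None."""
    counts = Counter()
    for line in lines:
        try:
            ts = int(line.split('\t', 1)[0])
            # UTC here for determinism; Hive's FROM_UNIXTIME uses the
            # cluster's local timezone, so match that when comparing.
            day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d')
        except ValueError:
            day = None  # corrupt row -> the NULL bucket in the Hive output
        counts[day] += 1
    return counts
```

I feed it the locally decompressed part files line by line and compare the 
tallies against Hive's output.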
    
The NULL is caused by a relatively tiny number of corrupt rows. I sometimes get 
this with the exact same query:

    NULL 201
    2012-05-29 84503305
    
It seems to convert all of the non-null timestamps to 2012-05-29. I'd say that 
it returns the wrong value 1 out of every 5 times.

I've tried other things like

    SELECT COUNT(*) FROM activity_log
    WHERE dt = "2012-05-29" AND ts > 1338267600 AND ts < 1338354000

but it seems to have the same problem with interpreting the ts value 
consistently.
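
As a quick sanity check on those bounds (assuming the cluster runs on US 
Eastern time, which is my guess): interpreted as UTC they both land at 05:00, 
i.e. local midnight during DST, so the range does cover one local calendar day.

```python
from datetime import datetime, timezone

# The range bounds from the query above, viewed in UTC. Both fall at
# 05:00 UTC, i.e. midnight US Eastern (UTC-4 during DST) -- assuming
# the cluster is on Eastern time.
lo = datetime.fromtimestamp(1338267600, tz=timezone.utc)
hi = datetime.fromtimestamp(1338354000, tz=timezone.utc)
print(lo.isoformat())  # 2012-05-29T05:00:00+00:00
print(hi.isoformat())  # 2012-05-30T05:00:00+00:00
```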

Any tips or explanations would be greatly appreciated. I'm not sure if the 
corrupt rows have something to do with it, but I still don't get why that only 
happens sometimes. 

Thanks,
Adam
