Hi all, I have an activity log stored in an external Hive table, LZO-compressed, and partitioned by 'dt', the date on which the data was recorded. Because of time zones and when we dump the data into HDFS, each partition holds roughly 22 hours of one day and 2 hours of the following day. In the example below, about 95% of the rows fall on the partition's own date.
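To make the day-boundary shift concrete, here is a tiny Python illustration: an epoch second is timezone-free, so the calendar day it lands on depends entirely on the offset applied when formatting it. The UTC-5 offset below is made up for the example; it is not a claim about our cluster's actual timezone.

```python
from datetime import datetime, timedelta, timezone

ts = 1338249600  # example epoch seconds: midnight UTC on 2012-05-29

utc = datetime.fromtimestamp(ts, tz=timezone.utc)
local = datetime.fromtimestamp(ts, tz=timezone(timedelta(hours=-5)))

print(utc.date())    # 2012-05-29
print(local.date())  # 2012-05-28 -- same instant, previous calendar day
```

This is why a partition cut by local-time dump boundaries ends up containing a couple of hours that belong to the neighboring date.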
The first column is a timestamp ("ts", bigint). I want to count the number of actions on each day within a single partition with this query:

    SELECT TO_DATE(FROM_UNIXTIME(ts)) AS day, COUNT(*)
    FROM activity_log
    WHERE dt = "2012-05-29"
    GROUP BY TO_DATE(FROM_UNIXTIME(ts))

The correct results (I confirmed these by decompressing the files locally and counting with a simple Python script) are:

    NULL        201
    2012-05-29  80677204
    2012-05-30  3826101

The NULL comes from a relatively tiny number of corrupt rows. But with the exact same query I sometimes get:

    NULL        201
    2012-05-29  84503305

It seems to convert all of the non-null timestamps to 2012-05-29. I'd say it returns the wrong value about 1 time in 5. I've tried other approaches, such as:

    SELECT COUNT(*)
    FROM activity_log
    WHERE dt = "2012-05-29" AND ts > 1338267600 AND ts < 1338354000

but they seem to have the same problem interpreting the ts value consistently. Any tips or explanations would be greatly appreciated. I'm not sure whether the corrupt rows have something to do with it, but I still don't see why the wrong result appears only some of the time.

Thanks,
Adam
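For reference, the local sanity check mentioned above could look something like the sketch below. The tab separator, the leading-field layout, and the use of UTC are all assumptions for illustration (Hive's FROM_UNIXTIME formats using the cluster's default timezone, so a real comparison should match that); the sample rows are made up.

```python
from collections import Counter
from datetime import datetime, timezone

def count_by_day(lines, sep="\t"):
    """Bucket rows by the calendar date (UTC here) of the leading epoch-seconds field."""
    counts = Counter()
    for line in lines:
        field = line.split(sep, 1)[0]
        try:
            day = datetime.fromtimestamp(int(field), tz=timezone.utc).strftime("%Y-%m-%d")
        except ValueError:
            day = "NULL"  # corrupt rows: non-numeric ts, mirroring Hive's NULL bucket
        counts[day] += 1
    return counts

# Tiny made-up sample standing in for the decompressed partition files:
sample = ["1338267600\tclick", "1338354000\tview", "garbage\trow"]
for day, n in sorted(count_by_day(sample).items()):
    print(day, n)  # one line per day with its row count
```

In practice the decompressed files would be streamed in (e.g. `lzop -dc ... | python count.py`) rather than held in a list.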