Hi all,

I'm seeing some data loss/corruption in Hive. This isn't HDFS-level corruption - HDFS reports that the files and blocks are healthy.
I'm using managed ORC tables. Normally we write once an hour to each table, with occasional concatenations through Hive. We perform the writes using Spark 1.3.1 (through the Spark SQL interface), running either locally or over YARN. Occasionally we run many insertion jobs against a table at once, generally when backfilling data.

The data loss seems to happen more frequently when we are doing frequent concatenations and running multiple insertion jobs at the same time. The problem goes away when we drop the table and re-ingest. It also appears to be localised to specific ORC files within the table: if we delete the affected files (detectable by trying to orcdump each one; I've pasted a rough sketch of the check below), the rest are just fine.

Has anyone seen this? Any suggestions for avoiding it or chasing down a root cause?

Thanks,
Marcin
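
P.S. Since the orcdump check came up, here is roughly what I run to find the bad files. Treat it as a sketch rather than our exact tooling: the table path is a placeholder, it assumes the hdfs and hive CLIs are on the PATH, and it assumes orcfiledump exits non-zero when it can't read a file (if your version doesn't, grep its output for exceptions instead).

#!/usr/bin/env python
# Sketch: flag ORC files under a table directory that orcfiledump can't read.
# Assumptions: `hdfs` and `hive` CLIs are on the PATH; TABLE_DIR is a placeholder;
# the table directory holds the ORC files directly (no partition subdirectories).
import os
import subprocess

TABLE_DIR = "/user/hive/warehouse/mydb.db/mytable"  # placeholder path

def list_files(hdfs_dir):
    """Return file paths directly under an HDFS directory (skips subdirs)."""
    out = subprocess.check_output(["hdfs", "dfs", "-ls", hdfs_dir]).decode("utf-8")
    paths = []
    for line in out.splitlines():
        fields = line.split()
        # Real entries have 8 fields; skip the "Found N items" header and directories.
        if len(fields) >= 8 and not fields[0].startswith("d"):
            paths.append(fields[-1])
    return paths

def main():
    devnull = open(os.devnull, "w")
    bad = []
    for path in list_files(TABLE_DIR):
        # orcfiledump should exit non-zero on files it can't parse.
        rc = subprocess.call(["hive", "--orcfiledump", path],
                             stdout=devnull, stderr=subprocess.STDOUT)
        if rc != 0:
            bad.append(path)
    for path in bad:
        print("unreadable ORC file: %s" % path)

if __name__ == "__main__":
    main()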