We are planning to enable ad-hoc querying on our Hive warehouse. While testing concurrent queries, we found the following issue:

Query 1 does an insert overwrite into a partition of table yyy, reading from that same partition:

    insert overwrite table yyy .... partition (dateint = xxx) select ... from yyy where dateint = xxx

This is done to merge small files within a partition of table yyy.

Query 2 does a select on the same table, joined with another table.

What we found is that query 2 fails with the following exception in multiple reducers:

java.io.FileNotFoundException: File does not exist: hdfs://ip-10-251-98-80.ec2.internal:9000/user/hive/dataeng/warehouse/nccp_session_facts/dateint=20090908/hour=9/sessionsFacts_P20090909T021823L20090908T09-r-00006
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
    at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:671)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:63)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:236)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:336)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

Is this expected? If so, is there a JIRA for it, or is a fix planned? We are trying to think of a workaround but haven't come up with a good one, since swapping the files would ideally be handled inside Hive.

Please let us know your feedback.

Thanks,
Eva