We believe the problem is that one of the queries is doing an ‘insert
overwrite ... select from’, which actually deletes and merges the small
files. The other query then couldn’t find files that it had seen earlier,
and failed. So it looks like a concurrency issue.
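
To illustrate what we think is happening, here is a minimal sketch of the race. It is plain Python against a local temp directory, not Hive or HDFS code; the function names and the part-/merged- file names are made up for the illustration. The reader snapshots the file list first, the writer then replaces the small files with one merged file, and the reader fails when it opens the paths it planned earlier.

```python
import os
import tempfile


def plan_inputs(partition_dir):
    """Reader, step 1: snapshot the file list (stand-in for planning inputs)."""
    return sorted(os.path.join(partition_dir, n) for n in os.listdir(partition_dir))


def merge_small_files(partition_dir):
    """Writer: replace every small file with one merged file
    (stand-in for the 'insert overwrite ... select from' merge)."""
    names = sorted(os.listdir(partition_dir))
    chunks = []
    for name in names:
        with open(os.path.join(partition_dir, name), "rb") as f:
            chunks.append(f.read())
    for name in names:
        os.remove(os.path.join(partition_dir, name))
    # Hypothetical output name; the merged file is new to the reader's plan.
    with open(os.path.join(partition_dir, "merged-00000"), "wb") as f:
        f.write(b"".join(chunks))


def simulate_race():
    """Run reader planning, then the writer, then the reader's actual reads."""
    with tempfile.TemporaryDirectory() as d:
        for i in range(3):
            with open(os.path.join(d, f"part-{i:05d}"), "wb") as f:
                f.write(b"row\n")

        planned = plan_inputs(d)   # query 2 decides which files to read
        merge_small_files(d)       # query 1 overwrites the partition

        failures = []
        for path in planned:       # query 2 tasks now open the planned files
            try:
                with open(path, "rb") as f:
                    f.read()
            except FileNotFoundError:
                failures.append(os.path.basename(path))
        return failures, sorted(os.listdir(d))


if __name__ == "__main__":
    failed, remaining = simulate_race()
    print("files the reader could not open:", failed)
    print("files now in the partition:", remaining)
```

In this sketch every planned file is gone by the time the reader opens it, which matches the "File does not exist" symptom we see in the tasks of query 2.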

Could you elaborate a bit on why you say this is not a bug?


On 9/9/09 9:55 AM, "Prasad Chakka" <> wrote:

> If a certain input file/dir does not exist then the job can’t be submitted.
> Since only a few reducers are failing, the problem could be something else.
> Eva, does the same job succeed on a second try? I.e., is the file/dir available
> eventually? What is the replication factor?
> Prasad
> From: Yongqiang He <>
> Reply-To: <>
> Date: Wed, 9 Sep 2009 04:07:31 -0700
> To: <>
> Subject: Re: Files does not exist error: concurrency control on hive
> queries...
> Hi Eva,
>    After a close look at the code, I think this is not a bug. We need to find out
> how to avoid this.
> Thanks,
> Yongqiang
> On 09-9-9 1:31 PM, "He Yongqiang" <> wrote:
>> Hi Eva,
>>     Can you open a new jira for this?  And let’s discuss and resolve this
>> issue. 
>> I guess this is because the partition metadata is added before the data is
>> available. 
>> Thanks
>> Yongqiang
>> On 09-9-9 1:18 PM, "Eva Tse" <> wrote:
>>> We are planning to start enabling ad-hoc querying on our hive warehouse and
>>> we tested some of the concurrent queries and found the following issue:
>>> Query 1 – doing ‘insert overwrite table yyy .... partition (dateint = xxx)
>>> select ...  from yyy where dateint = xxx’  This is done to merge small files
>>> within a partition in table yyy
>>> Query 2 – doing some select on the same table joining another table.
>>> What we found is that query 2 would fail with the following exceptions in
>>> multiple reducers.
>>> File does not exist:
>>> hdfs://xxxxxxxxxxxxx.ec2.internal:9000/user/hive/dataeng/warehouse/nccp_session_facts/dateint=20090908/hour=9/sessionsFacts_P20090909T021823L20090908T09-r-00006
>>>  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSy
>>>  at org.apache.hadoop.fs.FileSystem.getLength(
>>>  at$Reader.(
>>>  at$Reader.(
>>>  at org.apache.hadoop.mapred.SequenceFileRecordReader.(SequenceFileRecordReader.java:43)
>>>  at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFil
>>>  at .java:236)
>>>  at org.apache.hadoop.mapred.MapTask.runOldMapper(
>>>  at
>>>  at org.apache.hadoop.mapred.Child.main(
>>> Is this expected? If so, is there a jira or is it planned to be addressed?
>>> We are trying to think of a workaround, but haven’t come up with a good one,
>>> as swapping of files would ideally be handled inside hive.
>>> Please let us know your feedback.
>>> Thanks,
>>> Eva.
