On Fri, Oct 7, 2011 at 12:40 PM, Chris Curtin <curtin.ch...@gmail.com> wrote:
> hi Todd,
>
> Thanks for the reply.
>
> Yes, I'm seeing > 30,000 ms a couple of times a day, though it looks like
> 4,000 ms is average. I also see 150,000+ ms and lots of 50,000 ms.
>
> Is there anything I can do about this? The bug is still open in JIRA.

Currently the following workarounds may be effective:
- Schedule a cron job that runs once every couple of minutes: find
/data/1/hdfs /data/2/hdfs/ ... > /dev/null    (this pages your
inodes and dentries into cache so the block report runs quickly)
- Tune /proc/sys/vm/vfs_cache_pressure to a lower value (this
encourages Linux to keep inodes and dentries in cache)
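A minimal sketch of both workarounds (the /data/N/hdfs paths, the 5-minute interval, and the value 50 are illustrative assumptions, not tested recommendations -- substitute your actual dfs.data.dir directories and tune to taste):

```shell
# Cron entry to warm the dentry/inode cache periodically
# (replace /data/N/hdfs with your real dfs.data.dir list):
#   */5 * * * * find /data/1/hdfs /data/2/hdfs > /dev/null 2>&1

# One-off demonstration of the warming command against a scratch directory:
mkdir -p /tmp/hdfs-warm-demo/subdir
touch /tmp/hdfs-warm-demo/subdir/blk_0001
find /tmp/hdfs-warm-demo > /dev/null && echo "cache warmed"

# Lowering vfs_cache_pressure (kernel default is 100) makes the kernel
# less eager to reclaim inode/dentry caches; requires root:
#   sysctl -w vm.vfs_cache_pressure=50
```

The sysctl change can be made persistent across reboots by adding `vm.vfs_cache_pressure=50` to /etc/sysctl.conf.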

Both have some associated costs, but at least one of our customers has
found the above workarounds effective. I'm currently waiting on review
of HDFS-2379; if you are adventurous, you could consider building your
own copy of Hadoop with the patch applied. I've tested it on a cluster
and am fairly confident it is safe.
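If you do go the patch route, the general shape is the standard patch-then-rebuild flow. Everything below is a sketch under assumptions (the actual patch filename, attachment, and build target must be checked against the HDFS-2379 JIRA page and your branch's build docs); the demonstration uses a toy file rather than the real Hadoop tree:

```shell
# Toy demonstration of the patch(1) workflow; for the real thing you
# would download the attachment from HDFS-2379 and run patch from the
# Hadoop source root.
mkdir -p /tmp/patch-demo && cd /tmp/patch-demo
printf 'old line\n' > example.txt
cat > fix.patch <<'EOF'
--- example.txt
+++ example.txt
@@ -1 +1 @@
-old line
+new line
EOF
patch -p0 --dry-run < fix.patch   # verify it applies cleanly first
patch -p0 < fix.patch             # then apply for real
cat example.txt                   # now contains "new line"
# For Hadoop 0.20-era source trees the rebuild step was typically ant-based,
# e.g.:  ant binary
```

Always run the `--dry-run` pass first so a partially applied patch never leaves the tree in a mixed state.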

Thanks
-Todd

> On Fri, Oct 7, 2011 at 2:15 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> Hi Chris,
>>
>> You may be hitting HDFS-2379.
>>
>> Can you grep your DN logs for the string "BlockReport" and see if you
>> see any taking more than 30000ms or so?
>>
>> -Todd
>>
>> On Fri, Oct 7, 2011 at 6:31 AM, Chris Curtin <curtin.ch...@gmail.com>
>> wrote:
>> > Sorry to bring this back from the dead, but we're having the issues
>> again.
>> >
>> > This is on a NEW cluster, using Cloudera 0.20.2-cdh3u0 (old was stock
>> Apache
>> > 0.20.2). Nothing carried over from the old cluster except data in HDFS
>> > (copied from old cluster). Bigger/more machines, more RAM, faster disks
>> etc.
>> > And it is back.
>> >
>> > Confirmed that all the disks setup for HDFS are 'deadline'.
>> >
>> > Runs fine for a few days, then hangs again with the 'Could not complete'
>> error
>> > in the JobTracker log until we kill the cluster.
>> >
>> > 2011-09-09 08:04:32,429 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>> > complete file
>> >
>> /log/hadoop/tmp/flow_BYVMTA_family_BYVMTA_72751_8284775/_logs/history/10.120.55.2_1311201333949_job_201107201835_13900_deliv_flow_BYVMTA%2Bflow_BYVMTA*family_B%5B%284%2F5%29+...UNCED%27%2C+
>> > retrying...
>> >
>> > Found HDFS-148 (https://issues.apache.org/jira/browse/HDFS-148) which
>> looks
>> > like what could be happening to us. Anyone found a good workaround?
>> > Any other ideas?
>> >
>> > Also, does the HDFS system try to do 'du' on disks not assigned to it?
>> The
>> > HDFS disks are separate from the root and OS disks. Those disks are NOT
>> > setup to be 'deadline'. Should that matter?
>> >
>> > Thanks,
>> >
>> > Chris
>> >
>>
>



-- 
Todd Lipcon
Software Engineer, Cloudera
