Hi Wei-Chiu,

I came across HDFS-14476 while searching to see if anyone else is hitting the
same issues as us. I didn't see it merged, so I assumed other fixes have been
made around what you mention in there.

HDFS-11187 was already applied in 2.9.1, and we are on 2.9.2, so it might not
be impacting us, though I need to be 100% sure of that.

I will grab a jstack as soon as I am on the server again (a rough sketch of
what I plan to capture is below), and underneath that is a WARN statement that
I feel could point to the issue. By the way, we have tested the disk I/O, and
when these slowdowns happen the disks are hardly under any I/O pressure. On
restart of the datanode process everything goes back to normal, i.e. the disks
are able to sustain high I/O again.
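
For reference, this is roughly what I mean by a thread dump: a minimal,
hypothetical Java sketch (not part of Hadoop or our tooling) that captures the
same kind of information as running jstack against the DataNode pid, including
which thread owns which lock.

// Hypothetical helper, purely illustrative: dump all thread stacks from inside
// a JVM, similar to what jstack reports, so we can see which thread is holding
// the FsDatasetImpl lock while the disks sit idle.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDump {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // lockedMonitors=true, lockedSynchronizers=true so the dump includes
        // which locks each thread owns or is blocked on.
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.out.print(info);
        }
    }
}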

2019-09-27 00:48:14,101 WARN
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Lock held time above threshold: lock identifier:
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl
lockHeldTimeMs=12628 ms. Suppressed 1 lock warnings. The stack trace is:
java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1021)
org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:143)
org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186)
org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133)
org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84)
org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96)
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:1781)
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.finalizeBlock(BlockReceiver.java:1517)
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1474)
java.lang.Thread.run(Thread.java:748)
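
In case it helps interpret the WARN: my understanding is that InstrumentedLock
times how long the dataset lock is held and, when it is released, logs the
releasing thread's stack if the held time crossed the warning threshold. So the
trace above is the PacketResponder that had just spent ~12.6 seconds inside
FsDatasetImpl.finalizeBlock while holding the lock. A minimal sketch of that
pattern (my own simplification, leaving out the warning suppression, not the
actual Hadoop implementation):

// Simplified illustration of a "warn if held too long" lock, not Hadoop's
// InstrumentedLock: wrap a lock, time how long it is held, and log the
// unlocking thread's stack when the held time exceeds a threshold.
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class HeldTimeWarningLock {
    private final Lock delegate = new ReentrantLock();
    private final long warnThresholdMs; // warning threshold in milliseconds
    private long lockAcquiredAtMs;      // written and read by the owning thread only

    public HeldTimeWarningLock(long warnThresholdMs) {
        this.warnThresholdMs = warnThresholdMs;
    }

    public void lock() {
        delegate.lock();
        lockAcquiredAtMs = System.currentTimeMillis();
    }

    public void unlock() {
        long heldMs = System.currentTimeMillis() - lockAcquiredAtMs;
        delegate.unlock();
        if (heldMs > warnThresholdMs) {
            // The logged stack belongs to the thread that released the lock,
            // which is why the WARN above points at finalizeBlock in the
            // PacketResponder thread.
            StringBuilder sb = new StringBuilder(
                "Lock held time above threshold: " + heldMs + " ms\n");
            for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
                sb.append("    ").append(frame).append('\n');
            }
            System.err.println(sb);
        }
    }
}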


Thanks,
Viral






On Thu, Sep 26, 2019 at 7:03 PM Wei-Chiu Chuang <weic...@cloudera.com>
wrote:

> or maybe https://issues.apache.org/jira/browse/HDFS-14476
>
> I reverted this fix and I've not looked at it further. But take a look.
> It's disappointing to me that none of the active Hadoop contributors seem
> to understand DirectoryScanner well enough.
>
> or HDFS-11187 <https://issues.apache.org/jira/browse/HDFS-11187>
>
> If you can post a jstack snippet I might be able to help out.
> On Thu, Sep 26, 2019 at 9:48 PM Viral Bajaria <viral.baja...@gmail.com>
> wrote:
>
>> Thanks for the quick response Jonathan.
>>
>> Honestly, I am not sure if 2.10.0 will fix my issue, but it looks similar to
>> https://issues.apache.org/jira/browse/HDFS-14536, which is not fixed yet, so
>> we will probably not see the benefit.
>>
>> We need to dig more into the logs and the jstack and create a JIRA so the
>> developers can comment on what's going on. A datanode restart fixes the
>> latency issue, so it is something that builds up over time and needs the
>> right instrumentation to figure out!
>>
>> Thanks,
>> Viral
>>
>>
>> On Thu, Sep 26, 2019 at 6:41 PM Jonathan Hung <jyhung2...@gmail.com>
>> wrote:
>>
>> > Hey Viral, yes. We're working on a 2.10.0 release, I'm the release manager
>> > for that. I can't comment on the particular issue you're seeing, but I plan
>> > to start the release process for 2.10.0 early next week, then 2.10.0 will
>> > be released shortly after that (assuming all goes well).
>> >
>> > Thanks,
>> > Jonathan Hung
>> >
>> >
>> > On Thu, Sep 26, 2019 at 6:34 PM Viral Bajaria <viral.baja...@gmail.com>
>> > wrote:
>> >
>> >> (Cross posting from user list based on feedback by Sean Busbey)
>> >>
>> >> All,
>> >>
>> >> Just saw the announcement of the new Hadoop 3.2.1 release. Congratulations
>> >> to the team, and thanks for all the hard work!
>> >>
>> >> Are we going to see a new release in the 2.x.x line?
>> >>
>> >> I noticed that a bunch of tickets resolved in the last year have been
>> >> tagged with 2.10.0 as the fix version, and it's been a while since 2.9.2
>> >> was released, so I was wondering: are we going to see a 2.10.0 release
>> >> soon, or should we start looking to upgrade to the 3.x line?
>> >>
>> >> The reason I ask is that we are seeing very high ReadBlockOp latency on
>> >> our datanodes and believe the issue is due to some locking going on
>> >> between the VolumeScanner, DirectoryScanner, RW block operations, and the
>> >> MetricsRegistry. Looking at a few JIRAs it looks like 2.10.0 might have
>> >> some fixes we should try, though we're not fully sure yet!
>> >>
>> >> Thanks,
>> >> Viral
>> >>
>> >
>>
>
