Hey Austin, Sanjeev,
The ports defined are as follows in hdfs-site.xml:
[root@cm-r01wn01 …/run/cloudera-scm-agent/process]# grep -Ei \
"dfs.datanode.http.address|dfs.datanode.https.address" -A 2 \
./3370-hdfs-DATANODE/hdfs-site.xml
<name>dfs.datanode.http.address</name>
<value>cm-r01wn01.mws.mds.xyz:1006</value>
</property>
--
<name>dfs.datanode.https.address</name>
<value>cm-r01wn01.mws.mds.xyz:9865</value>
</property>
[root@cm-r01wn01 …/run/cloudera-scm-agent/process]#
Checking the ports used:
[root@cm-r01wn01 ~]# netstat -pnltu | grep -Ei \
"9866|1004|9864|9865|1006|9867"
tcp 0 0 10.3.0.160:9867 0.0.0.0:* LISTEN
30096/jsvc.exec
tcp 0 0 10.3.0.160:1004 0.0.0.0:* LISTEN
30096/jsvc.exec
tcp 0 0 10.3.0.160:1006 0.0.0.0:* LISTEN
30096/jsvc.exec
[root@cm-r01wn01 ~]# hdfs getconf -confKey dfs.datanode.address
0.0.0.0:9866
[root@cm-r01wn01 ~]# hdfs getconf -confKey dfs.datanode.http.address
0.0.0.0:9864
[root@cm-r01wn01 ~]# hdfs getconf -confKey dfs.datanode.https.address
0.0.0.0:9865
[root@cm-r01wn01 ~]# hdfs getconf -confKey dfs.datanode.ipc.address
0.0.0.0:9867
[root@cm-r01wn01 ~]#
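One note on the numbers above, since they look contradictory at first glance: as far as I can tell, hdfs getconf reads the local client (gateway) configuration, so it reports the stock defaults (9864/9866), while the running DataNode uses the per-role configuration pushed by Cloudera Manager (1004/1006). A Kerberized DataNode started via jsvc deliberately binds ports below 1024, since only root may claim them. A tiny sketch of that distinction, using the port numbers from the transcript above:

```python
# Port numbers taken from the transcript above: per-role values pushed
# by Cloudera Manager (seen in netstat) vs. the client-config defaults
# reported by "hdfs getconf".
observed = {
    "dfs.datanode.address": 1004,       # secure streaming port (jsvc)
    "dfs.datanode.http.address": 1006,  # secure HTTP port (jsvc)
    "dfs.datanode.ipc.address": 9867,   # IPC port, default unchanged
}

# Ports below 1024 are privileged: only root may bind them, which is
# what makes a Kerberized DataNode started via jsvc "secure".
privileged = {key: port for key, port in observed.items() if port < 1024}
print(sorted(privileged))
```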
The scanner looks to be initialized:
[root@cm-r01wn01 /var/log/hadoop-hdfs]# grep -EiR "Periodic block scanner is not running" *
[root@cm-r01wn01 /var/log/hadoop-hdfs]# grep -EiR "Initialized block scanner with targetBytesPerSec" * | wc -l
32
[root@cm-r01wn01 /var/log/hadoop-hdfs]#
And yes, indeed it is started up. It kicked off around the time when I
restarted the DataNode service.
[root@cm-r01wn01 /var/log/hadoop-hdfs]# vi \
hadoop-cmf-hdfs-DATANODE-cm-r01wn01.mws.mds.xyz.log.out
STARTUP_MSG: build = http://github.com/cloudera/hadoop -r
7f07ef8e6df428a8eb53009dc8d9a249dbbb50ad; compiled by 'jenkins' on
2019-07-18T17:09Z
STARTUP_MSG: java = 1.8.0_181
************************************************************/
2020-10-22 20:54:58,488 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: registered UNIX signal
handlers for [TERM, HUP, INT]
2020-10-22 20:54:59,762 INFO
org.apache.hadoop.security.UserGroupInformation: Login successful for
user hdfs/cm-r01wn01.mws.mds....@mws.mds.xyz using keytab file
hdfs.keytab. Keytab auto renewal enabled : false
2020-10-22 20:55:00,265 INFO
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker:
Scheduling a check for [DISK]file:/hdfs/1/dfs/dn
2020-10-22 20:55:00,295 INFO
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker:
Scheduling a check for [DISK]file:/hdfs/2/dfs/dn
2020-10-22 20:55:00,296 INFO
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker:
Scheduling a check for [DISK]file:/hdfs/3/dfs/dn
2020-10-22 20:55:00,297 INFO
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker:
Scheduling a check for [DISK]file:/hdfs/4/dfs/dn
2020-10-22 20:55:00,521 INFO
org.apache.hadoop.metrics2.impl.MetricsConfig: Loaded properties from
hadoop-metrics2.properties
2020-10-22 20:55:00,723 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric
snapshot period at 10 second(s).
2020-10-22 20:55:00,723 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics
system started
2020-10-22 20:55:00,947 INFO org.apache.hadoop.hdfs.server.common.Util:
dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling
file IO profiling
2020-10-22 20:55:00,953 INFO
org.apache.hadoop.hdfs.server.datanode.BlockScanner: *Initialized block
scanner with targetBytesPerSec 1048576*
2020-10-22 20:55:00,961 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: File descriptor passing
is enabled.
2020-10-22 20:55:00,963 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Configured hostname is
cm-r01wn01.mws.mds.xyz
2020-10-22 20:55:00,965 INFO org.apache.hadoop.hdfs.server.common.Util:
dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling
file IO profiling
2020-10-22 20:55:00,995 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Starting DataNode with
maxLockedMemory = 299892736
2020-10-22 20:55:01,018 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Opened streaming server
at /10.3.0.160:1004
2020-10-22 20:55:01,023 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwidth is
10485760 bytes/s
2020-10-22 20:55:01,024 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Number threads for
balancing is 50
2020-10-22 20:55:01,029 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwidth is
10485760 bytes/s
2020-10-22 20:55:01,029 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Number threads for
balancing is 50
2020-10-22 20:55:01,029 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Listening on UNIX
domain socket: /var/run/hdfs-sockets/dn
2020-10-22 20:55:01,304 INFO org.eclipse.jetty.util.log: Logging
initialized @8929ms
2020-10-22 20:55:01,559 INFO org.apache.hadoop.http.HttpRequestLog: Http
request log for http.requests.datanode is not defined
2020-10-22 20:55:01,585 INFO org.apache.hadoop.http.HttpServer2: Added
global filter 'safety'
(class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2020-10-22 20:55:01,589 INFO org.apache.hadoop.http.HttpServer2: Added
filter authentication
(class=org.apache.hadoop.security.authentication.server.AuthenticationFilter)
to context datanode
This answers another question I had: under what conditions does the
block / volume scanner kick off? It appears that when a DataNode is
removed and added back in, the scanner is kicked off on that worker
at restart time.
Only the secure DataNode web port returns a login prompt, as is to
be expected. ( http://cm-r01wn01.mws.mds.xyz:1006/ )
Thx,
TK
On 10/22/2020 11:56 AM, संजीव (Sanjeev Tripurari) wrote:
Hi Tom,
Can you start your datanode service, and share the datanode logs,
check if it is started properly or not.
Regards
-Sanjeev
On Thu, 22 Oct 2020 at 20:33, Austin Hackett <hacketta...@me.com
<mailto:hacketta...@me.com>> wrote:
Hi Tom
It might be worth restarting the DataNode process? I didn’t think
you could disable the DataNode Web UI as such, but I could be
wrong on this point. Out of interest, what does hdfs-site.xml say
with regards to dfs.datanode.http.address/dfs.datanode.https.address?
Regarding the logs, a quick look on GitHub suggests there may be a
couple of useful log messages:
https://github.com/apache/hadoop/blob/88a9f42f320e7c16cf0b0b424283f8e4486ef286/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockScanner.java
For example, LOG.warn(“Periodic block scanner is not running”) or
LOG.info(“Initialized block scanner with targetBytesPerSec {}”).
Of course, you’d need to make sure those LOG statements are present
in the Hadoop version included with CDH 6.3. Git “blame” suggests
the LOG statements were added 6 years ago, so chances are you have them...
Thanks
Austin
On 22 Oct 2020, at 14:44, TomK <tomk...@mdevsys.com
<mailto:tomk...@mdevsys.com>> wrote:
Thanks Austin. However none of these are open on a standard
Cloudera 6.3 build.
# netstat -pnltu|grep -Ei "9866|1004|9864|9865|1006|9867"
#
Would there be anything in the logs to indicate whether or not
the block / volume scanner is running?
Thx,
TK
On 10/22/2020 3:09 AM, Austin Hackett wrote:
Hi Tom
I'm not too familiar with the CDH distribution, but this page has
the default ports used by the DataNode:
https://docs.cloudera.com/documentation/enterprise/latest/topics/cdh_ports.html
I believe it’s the settings for
dfs.datanode.http.address/dfs.datanode.https.address that you’re
interested in (9864/9865)
Since the data block scanner related config parameters are not
set, the defaults of 3 weeks and 1MB should be applied.
Thanks
Austin
On 22 Oct 2020, at 06:35, TomK <tomk...@mdevsys.com>
<mailto:tomk...@mdevsys.com> wrote:
Hey Austin, Sanjeev,
Thanks once more! Took some time to review the pages. That
was certainly very helpful. Appreciated!
However, I tried to access https://dn01/blockScannerReport on a
test Cloudera 6.3 cluster. It didn't work. Tried the following as
well:
http://dn01:50075/blockscannerreport?listblocks
https://dn01:50075/blockscannerreport
https://dn01:10006/blockscannerreport
Checked whether port 50075 is up ( netstat -pnltu ). There's no
service on that port on the workers. Checked the pages:
https://docs.cloudera.com/documentation/enterprise/5-14-x/topics/cdh_ig_ports_cdh5.html
It is defined on the pages. Checked if the following is set:
The following 2 configurations in hdfs-site.xml are the most
used for block scanners:
* *dfs.block.scanner.volume.bytes.per.second* to throttle the
  scan bandwidth to configurable bytes per second. *Default
  value is 1M*. Setting this to 0 will disable the block scanner.
* *dfs.datanode.scan.period.hours* to configure the scan
  period, which defines how often a whole scan is performed.
  This should be set to a long enough interval to really take
  effect, for the reasons explained above. *Default value is
  3 weeks (504 hours)*. Setting this to 0 will use the
  default value. Setting this to a negative value will
  disable the block scanner.
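Putting those two defaults together, a quick back-of-the-envelope calculation (my own arithmetic, not from the blog post) shows why the throttle matters on big disks: at 1 MiB/s the scanner covers only about 1.9 TB per 504-hour period, so a full 10 TB disk takes roughly 110 days to get through at the default rate:

```python
# Defaults quoted above from the Cloudera blog post.
BYTES_PER_SEC = 1048576   # dfs.block.scanner.volume.bytes.per.second (1 MiB/s)
PERIOD_HOURS = 504        # dfs.datanode.scan.period.hours (3 weeks)

bytes_per_period = BYTES_PER_SEC * PERIOD_HOURS * 3600
print(f"scanned per 3-week period: {bytes_per_period / 1e12:.2f} TB")

# Time to cover one full 10 TB data disk at the default throttle.
disk_bytes = 10 * 10**12
days_for_full_scan = disk_bytes / BYTES_PER_SEC / 86400
print(f"full scan of a 10 TB disk: {days_for_full_scan:.0f} days")
```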
These are NOT explicitly set. Checked hdfs-site.xml. Nothing
defined there. Checked the Configuration tab in the cluster.
It's not defined either.
Does this mean that the defaults are applied OR does it mean
that the block / volume scanner is disabled? I see the pages
detail what values for these settings mean but I didn't see any
notes pertaining to the situation where both values are not
explicitly set.
Thx,
TK
On 10/21/2020 1:34 PM, संजीव (Sanjeev Tripurari) wrote:
Yes Austin,
you are right, every datanode will do its block verification,
which is sent as a health check report to the namenode
Regards
-Sanjeev
On Wed, 21 Oct 2020 at 21:53, Austin Hackett
<hacketta...@me.com <mailto:hacketta...@me.com>> wrote:
Hi Tom
It is my understanding that in addition to block
verification on client reads, each data node runs a
DataBlockScanner in a background thread that periodically
verifies all the blocks stored on the data node. The
dfs.datanode.scan.period.hours property controls how often
this verification occurs.
I think the reports are available via the data node
/blockScannerReport HTTP endpoint, although I’m not sure I
ever actually looked at one. (add ?listblocks to get the
verification status of each block).
More info here:
https://blog.cloudera.com/hdfs-datanode-scanners-and-disk-checker-explained/
Thanks
Austin
On 21 Oct 2020, at 16:47, TomK <tomk...@mdevsys.com
<mailto:tomk...@mdevsys.com>> wrote:
Hey Sanjeev,
Alright. Thank you once more. This is clear.
However, this poses an issue then. If during the two
years, disk drives develop bad blocks but do not
necessarily fail to the point that they cannot be
mounted, that checksum would have changed since those
filesystem blocks can no longer be read. However, from an
HDFS perspective, since no checks are done regularly,
that is not known. So HDFS still reports that the file
is fine, in other words, no missing blocks. For example,
if a disk is going bad, but those files are not read for
two years, the system won't know that there is a
problem. Even when removing a data node temporarily and
re-adding the datanode, HDFS isn't checking because that
HDFS file isn't read.
So let's assume this scenario. Data nodes *dn01* to
*dn10* exist. Each data node has 10 x 10TB drives.
And let's assume that there is one large file on those
drives and it's replicated to factor of X3.
If during the two years the file isn't read, and 10 of
those drives develop bad blocks or other underlying
hardware issues, then it is possible that HDFS will still
report everything fine, even with a replication factor of
3. Because with 10 disks failing, it's possible a block
or sector has failed under each of the 3 copies of the
data. But HDFS would NOT know since nothing triggered a
read of that HDFS file. Based on everything below, corruption is
very much possible even with a replication factor of 3. At this
point the file is unreadable, but HDFS still reports no missing
blocks.
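To put a rough number on that scenario, here is a toy model (my own, using the counts from the example above, and deliberately ignoring HDFS's rack/node-aware placement, which puts replicas on distinct nodes rather than on arbitrary disks) of the chance that all three replicas of a given block sit on silently failing disks:

```python
from math import prod

TOTAL_DISKS = 100   # 10 datanodes x 10 disks each (scenario above)
BAD_DISKS = 10      # disks that silently developed bad sectors
REPLICAS = 3

# Probability that all 3 replicas of one block landed on bad disks,
# assuming replicas go to distinct, uniformly chosen disks.
p_block_lost = prod(
    (BAD_DISKS - i) / (TOTAL_DISKS - i) for i in range(REPLICAS)
)
print(f"per-block chance of silent loss: {p_block_lost:.6f}")
```

Small per block, but across millions of blocks that are never read, the expected number of silently unrecoverable blocks becomes non-trivial, which is exactly the point of the periodic scanner.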
Similarly, if I take a data node out and adjust one of the files
on its data disks, HDFS will not know and will still report
everything fine. That is, until someone reads the file.
Sounds like this is a very real possibility.
Thx,
TK
On 10/21/2020 10:26 AM, संजीव (Sanjeev Tripurari) wrote:
Hi Tom
Therefore, if I write a file to HDFS but access it two
years later, then the checksum will be computed only
twice, at the beginning of the two years and again at
the end when a client connects? Correct? As long as no
process ever accesses the file between now and two years
from now, the checksum is never redone and compared to
the two year old checksum in the fsimage?
Yes, exactly: unless the data is read, the checksum is not
verified (it is checked when the data is written and when it is read).
If the checksum is mismatched, there is no way to correct it; you
will have to re-write that file.
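For what it's worth, the checksum in question is CRC-32C (Castagnoli) by default (dfs.checksum.type, computed per 512-byte chunk as set by dfs.bytes-per-checksum) and, as far as I understand, is stored alongside each block replica in its .meta file on the datanode. A minimal bit-by-bit sketch, just to make the algorithm concrete; real clients use an optimized or native implementation:

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), HDFS's default chunk checksum."""
    poly = 0x82F63B78              # reflected Castagnoli polynomial
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Test the low bit before shifting; xor in the polynomial
            # when it was set (reflected, LSB-first algorithm).
            crc = (crc >> 1) ^ (poly if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Standard check value for the "123456789" test vector.
print(hex(crc32c(b"123456789")))  # 0xe3069283
```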
When datanode is added back in, there is no real read
operation on the files themselves. The datanode just
reports the blocks but doesn't really read the blocks
that are there to re-verify the files and ensure
consistency?
Yes, exactly: the datanode maintains the list of files and their
blocks, which it reports, along with total disk size and used size.
The namenode only has the list of blocks; unless the datanodes are
connected, it won't know where the blocks are stored.
Regards
-Sanjeev
On Wed, 21 Oct 2020 at 18:31, TomK <tomk...@mdevsys.com
<mailto:tomk...@mdevsys.com>> wrote:
Hey Sanjeev,
Thank you very much again. This confirms my suspicion.
Therefore, if I write a file to HDFS but access it
two years later, then the checksum will be computed
only twice, at the beginning of the two years and
again at the end when a client connects? Correct?
As long as no process ever accesses the file between
now and two years from now, the checksum is never
redone and compared to the two year old checksum in
the fsimage?
When datanode is added back in, there is no real
read operation on the files themselves. The
datanode just reports the blocks but doesn't really
read the blocks that are there to re-verify the
files and ensure consistency?
Thx,
TK
On 10/21/2020 12:38 AM, संजीव (Sanjeev Tripurari) wrote:
Hi Tom,
Every datanode sends a heartbeat to the namenode with the list
of blocks it has.
When a datanode that has been disconnected for a while
reconnects, it will send a heartbeat to the namenode with the
list of blocks it has (until then the namenode will show
under-replicated blocks).
As soon as the datanode is connected to the namenode, it will
clear the under-replicated blocks.
*When a client connects to read or write a file, it
will run checksum to validate the file.*
There is no independent process running to do
checksum, as it will be heavy process on each node.
Regards
-Sanjeev
On Wed, 21 Oct 2020 at 00:18, Tom <t...@mdevsys.com
<mailto:t...@mdevsys.com>> wrote:
Thank you. That part I understand and am Ok
with it.
What I would like to know next is: when is the CRC32C
checksum run again and checked against the fsimage to
confirm that the block file has not changed or become
corrupted?
For example, if I take a datanode out and, within 15
minutes, plug it back in, does HDFS rerun the CRC32C on
all data disks on that node to make sure the blocks are ok?
Cheers,
TK
Sent from my iPhone
On Oct 20, 2020, at 1:39 PM, संजीव (Sanjeev
Tripurari) <sanjeevtripur...@gmail.com
<mailto:sanjeevtripur...@gmail.com>> wrote:
It's done as soon as a file is stored on disk.
Sanjeev
On Tuesday, 20 October 2020, TomK
<tomk...@mdevsys.com
<mailto:tomk...@mdevsys.com>> wrote:
Thanks again.
At what points is the checksum validated
(checked) after that? For example, is it
done on a daily basis or is it done only
when the file is accessed?
Thx,
TK
On 10/20/2020 10:18 AM, संजीव (Sanjeev
Tripurari) wrote:
As soon as the file is written the first time, the
checksum is calculated and updated in the fsimage (first
in the edit logs), and the same is replicated to the
other replicas.
On Tue, 20 Oct 2020 at 19:15, TomK
<tomk...@mdevsys.com
<mailto:tomk...@mdevsys.com>> wrote:
Hi Sanjeev,
Thank you. It does help.
At what points is the checksum
calculated?
Thx,
TK
On 10/20/2020 3:03 AM, संजीव (Sanjeev
Tripurari) wrote:
For missing blocks and corrupted blocks, do check that
all the datanode services are up, that none of the disks
where hdfs data is stored is inaccessible or has
issues, and that the hosts are reachable from the
namenode.
If you are able to re-generate the data and write it,
great; otherwise hadoop cannot correct itself.
Could you please elaborate on this?
Does it mean I have to continuously
access a file for HDFS to be able to
detect corrupt blocks and correct itself?
*"Does HDFS check that the data node
is up, data disk is mounted, path to
the file exists and file can be read?"*
-- yes, only after a read fails will it say missing
blocks.
*Or does it also do a filesystem
check on that data disk as well as
perhaps a checksum to ensure block
integrity?*
-- yes, every file checksum is maintained and
cross-checked; if it fails it will say corrupted blocks.
hope this helps.
-Sanjeev
On Tue, 20 Oct 2020 at 09:52, TomK
<tomk...@mdevsys.com
<mailto:tomk...@mdevsys.com>> wrote:
Hello,
HDFS Missing Blocks / Corrupt
Blocks Logic: What are the
specific
checks done to determine a block
is bad and needs to be replicated?
Does HDFS check that the data
node is up, data disk is
mounted, path to
the file exists and file can be
read?
Or does it also do a filesystem
check on that data disk as well as
perhaps a checksum to ensure
block integrity?
I've googled on this quite a
bit. I don't see the exact
answer I'm
looking for. I would like to
know exactly what happens during
file
integrity verification that then
constitutes missing blocks or
corrupt
blocks in the reports.
--
Thank You,
TK.
---------------------------------------------------------------------
To unsubscribe, e-mail:
user-unsubscr...@hadoop.apache.org
<mailto:user-unsubscr...@hadoop.apache.org>
For additional commands, e-mail:
user-h...@hadoop.apache.org
<mailto:user-h...@hadoop.apache.org>
--
Thx,
TK.