[jira] [Created] (HDFS-11755) Underconstruction blocks can be considered missing
Nathan Roberts created HDFS-11755:
-------------------------------------

             Summary: Underconstruction blocks can be considered missing
                 Key: HDFS-11755
                 URL: https://issues.apache.org/jira/browse/HDFS-11755
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 3.0.0-alpha2, 2.8.1
            Reporter: Nathan Roberts
            Assignee: Nathan Roberts

The following sequence of events can lead to an under-construction block being considered missing:

- Pipeline of 3 DNs: DN1->DN2->DN3.
- DN3 has a failing disk, so some updates take a long time.
- The client writes the entire block and is waiting for the final ack.
- DN1, DN2, and DN3 have all received the block.
- DN1 is waiting for an ACK from DN2, which is waiting for an ACK from DN3.
- DN3 is having trouble finalizing the block due to the failing drive. It does eventually succeed, but it is VERY slow at doing so.
- DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so DN1 notices and does the same. Neither DN1 nor DN2 finalized the block.
- DN3 finally sends an IBR to the NN indicating the block has been received.
- The drive containing the block on DN3 fails badly enough that the DN takes it offline and notifies the NN of the failed volume.
- The NN removes DN3's replica from the triplets and then declares the block missing, because there are no other replicas.

It seems like we shouldn't consider uncompleted blocks for replication.
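A minimal sketch of the guard the last sentence suggests. The names loosely echo HDFS's BlockInfo/BlockUCState, but this is illustrative, not the actual NameNode code:

{code}
// Illustrative only: a block with zero live replicas should count as
// "missing" only once it has reached the COMPLETE state. An
// under-construction block may still be finalized by a DN that has not
// yet reported (as DN3 eventually did here), so declaring it missing
// before completion is premature.
enum BlockUCState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

class BlockInfoSketch {
  BlockUCState state = BlockUCState.UNDER_CONSTRUCTION;
  int liveReplicas = 0;

  boolean isComplete() { return state == BlockUCState.COMPLETE; }

  /** Eligible for the missing/under-replicated queues? */
  boolean considerForReplication() {
    return isComplete();
  }

  boolean isMissing() {
    return considerForReplication() && liveReplicas == 0;
  }
}
{code}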
[jira] [Created] (HDFS-11661) GetContentSummary uses excessive amounts of memory
Nathan Roberts created HDFS-11661:
-------------------------------------

             Summary: GetContentSummary uses excessive amounts of memory
                 Key: HDFS-11661
                 URL: https://issues.apache.org/jira/browse/HDFS-11661
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.8.0
            Reporter: Nathan Roberts
            Priority: Blocker

ContentSummaryComputationContext::nodeIncluded() is being used to keep track of all INodes visited during the current content summary calculation. This can be all of the INodes in the filesystem, making for a VERY large hash table. This simply won't work on large filesystems. We noticed this after upgrading: a namenode with ~100 million filesystem objects was spending significantly more time in GC. Fortunately this system had some memory breathing room; other clusters we have will not run with this additional demand on memory.

This was added as part of HDFS-10797 as a way of keeping track of INodes that have already been accounted for, to avoid double counting.
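A rough sketch of where the memory goes. The tracker below is a simplification of what nodeIncluded() implies (the real context tracks INodes, not raw ids), and the per-entry costs are ballpark assumptions:

{code}
import java.util.HashSet;
import java.util.Set;

// Illustrative: remembering every visited INode id for one
// getContentSummary call. A HashSet<Long> costs very roughly 40-50 bytes
// per entry (boxed Long + hash table node + table slot), so ~100M entries
// is on the order of 4-5 GB of heap for a single traversal.
class VisitedInodes {
  private final Set<Long> visited = new HashSet<>();

  /** Returns true if this inode was already counted; records it otherwise. */
  boolean alreadyIncluded(long inodeId) {
    return !visited.add(inodeId);
  }
}
{code}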
[jira] [Reopened] (HDFS-4946) Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable
[ https://issues.apache.org/jira/browse/HDFS-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nathan Roberts reopened HDFS-4946:
----------------------------------

[~jrkinley], re-opening because this is a very useful patch. Let me know if you disagree or would like me to assign it to myself to close out any remaining issues.

> Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable
> -----------------------------------------------------------------------
>
>                 Key: HDFS-4946
>                 URL: https://issues.apache.org/jira/browse/HDFS-4946
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.0.0-alpha
>            Reporter: James Kinley
>            Assignee: James Kinley
>         Attachments: HDFS-4946-1.patch, HDFS-4946-2.patch
>
> Allow preferLocalNode in BlockPlacementPolicyDefault to be disabled in
> configuration to prevent a client from writing the first replica of every
> block (i.e. the entire file) to the local DataNode.
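For reference, a minimal sketch of the configurability being restored here. The config key name is my assumption for illustration; check the attached patches for the actual key:

{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative: gate the "first replica goes to the local DN" preference
// behind a boolean config. The key name is hypothetical.
class PreferLocalNodeConfig {
  static final String KEY =
      "dfs.namenode.block-placement-policy.default.prefer-local-node";

  static boolean preferLocalNode(Configuration conf) {
    // defaulting to true preserves the existing placement behavior
    return conf.getBoolean(KEY, true);
  }
}
{code}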
[jira] [Created] (HDFS-8894) Set SO_KEEPALIVE on DN server sockets
Nathan Roberts created HDFS-8894:
------------------------------------

             Summary: Set SO_KEEPALIVE on DN server sockets
                 Key: HDFS-8894
                 URL: https://issues.apache.org/jira/browse/HDFS-8894
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.7.1
            Reporter: Nathan Roberts

SO_KEEPALIVE is not set on things like datastreamer sockets, which can cause lingering ESTABLISHED sockets when there is a network glitch.
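A minimal sketch of the change being asked for, using the standard java.net API (where exactly this would be wired into the DN's data transfer server is an assumption):

{code}
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative: enable TCP keepalive on accepted connections so the kernel
// eventually reaps connections whose peer disappeared in a network glitch,
// instead of leaving them ESTABLISHED indefinitely.
class KeepAliveAccept {
  static Socket accept(ServerSocket server) throws IOException {
    Socket s = server.accept();
    s.setKeepAlive(true); // sets SO_KEEPALIVE on the socket
    return s;
  }
}
{code}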
[jira] [Created] (HDFS-8873) throttle directoryScanner
Nathan Roberts created HDFS-8873:
------------------------------------

             Summary: throttle directoryScanner
                 Key: HDFS-8873
                 URL: https://issues.apache.org/jira/browse/HDFS-8873
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.7.1
            Reporter: Nathan Roberts

The new 2-level directory layout can make directory scans expensive in terms of disk seeks (see HDFS-8791 for details). It would be good if the directoryScanner had a configurable duty cycle that would reduce its impact on disk performance (much like the approach in HDFS-8617). Without such a throttle, disks can go 100% busy for many minutes at a time. Assuming the common case of all inodes in cache but no directory blocks cached, 64K seeks are required for a full directory listing, which at roughly 10ms per seek translates to 655 seconds.
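A minimal sketch of what a duty-cycle throttle could look like; the "run N ms out of each second" scheme is an assumption, not the committed implementation:

{code}
// Illustrative duty-cycle throttle: let the scanner run for at most
// runMsPerSec milliseconds of each wall-clock second, sleeping the rest,
// so the disk is never pegged at 100% for minutes at a time.
class ScanThrottle {
  private final long runMsPerSec; // e.g. 500 => 50% duty cycle

  ScanThrottle(long runMsPerSec) {
    this.runMsPerSec = runMsPerSec;
  }

  /** Call periodically from the scan loop. */
  void maybePause() throws InterruptedException {
    long withinSecond = System.currentTimeMillis() % 1000;
    if (withinSecond >= runMsPerSec) {
      Thread.sleep(1000 - withinSecond); // yield the remainder of this second
    }
  }
}
{code}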
[jira] [Created] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
Nathan Roberts created HDFS-8791:
------------------------------------

             Summary: block ID-based DN storage layout can be very slow for datanode on ext4
                 Key: HDFS-8791
                 URL: https://issues.apache.org/jira/browse/HDFS-8791
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.6.1
            Reporter: Nathan Roberts
            Priority: Critical

We are seeing cases where the new directory layout causes the datanode to basically make the disks seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool, and that's very expensive in the new layout.

The new layout creates 256 subdirs, each with 256 subdirs: essentially 64K leaf directories where block files are placed. So, what we have on disk is:
- 256 inodes for the first-level directories
- 256 directory blocks for the first-level directories
- 256*256 inodes for the second-level directories
- 256*256 directory blocks for the second-level directories
- Then the inodes and blocks to store the HDFS blocks themselves

The main problem is the 256*256 directory blocks. inodes and dentries will be cached by linux, and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks, and I'm not aware of any way to tell linux to favor buffer cache pages (even if it did, I'm not sure I would want it to in general). Also, ext4 tries hard to spread directories evenly across the entire volume, which basically means the 64K directory blocks are probably randomly spread across the entire disk. A du-type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far.

In a system I was using to diagnose this, I had 60K blocks. A du when things are hot is less than 1 second. When things are cold, about 20 minutes.

How do things get cold?
- A large set of tasks run on the node. This pushes almost all of the buffer cache out, causing the next du to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster.

Why didn't the previous layout see this?
- It might have, but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few hundred seeks, which would mean single-digit seconds.
- With only a few hundred directories, the odds of the directory blocks getting modified is quite high; this keeps those blocks hot and much less likely to be evicted.
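The 20-minute cold du is consistent with one far seek per leaf-directory block; a quick sanity check of that arithmetic (the per-seek cost is an assumed figure, chosen to show the math):

{code}
// Back-of-the-envelope check of the cold-du time reported above.
class SeekMath {
  public static void main(String[] args) {
    int dirBlocks = 256 * 256;   // 64K leaf-directory blocks, all cold
    double seekMs = 18.0;        // assumed cost of one far random seek + read
    double totalSec = dirBlocks * seekMs / 1000.0;
    System.out.printf("%d seeks x %.0f ms = %.0f s (~%.0f min)%n",
        dirBlocks, seekMs, totalSec, totalSec / 60.0);
    // 65536 x 18 ms is about 1180 s, i.e. roughly the 20-minute cold du
  }
}
{code}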
[jira] [Created] (HDFS-8404) pending block replication can get stuck using older genstamp
Nathan Roberts created HDFS-8404:
------------------------------------

             Summary: pending block replication can get stuck using older genstamp
                 Key: HDFS-8404
                 URL: https://issues.apache.org/jira/browse/HDFS-8404
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.7.0, 2.6.0
            Reporter: Nathan Roberts
            Assignee: Nathan Roberts

If an under-replicated block gets into the pending-replication list, but the genstamp of that block later ends up being newer than the one originally submitted for replication, the block will fail replication until the NN is restarted. It would be safer if processPendingReplications() got up-to-date block info before resubmitting replication work.
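A minimal sketch of the suggested fix. The types below are stand-ins for the BlockManager internals, not the actual patch:

{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative: when pending replication times out, re-resolve each block
// against the current blocks map so we resubmit the *current* generation
// stamp instead of the stale one captured at submission time.
class PendingReplicationSketch {
  static class Block { long id; long genStamp; }
  interface BlocksMap { Block getStoredBlock(long blockId); }

  static List<Block> resubmit(List<Block> timedOut, BlocksMap map) {
    List<Block> work = new ArrayList<>();
    for (Block stale : timedOut) {
      Block current = map.getStoredBlock(stale.id); // latest genstamp, or null if deleted
      if (current != null) {
        work.add(current);
      }
    }
    return work;
  }
}
{code}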
[jira] [Created] (HDFS-7742) favoring decommissioning node for replication can cause a block to stay underreplicated for long periods
Nathan Roberts created HDFS-7742:
------------------------------------

             Summary: favoring decommissioning node for replication can cause a block to stay underreplicated for long periods
                 Key: HDFS-7742
                 URL: https://issues.apache.org/jira/browse/HDFS-7742
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.6.0
            Reporter: Nathan Roberts
            Assignee: Nathan Roberts

When choosing a source node to replicate a block from, a decommissioning node is favored. The reason for the favoritism is that decommissioning nodes aren't servicing any writes, so in theory they are less loaded. However, the same selection algorithm also tries to make sure it doesn't get "stuck" on any particular node:

{noformat}
// switch to a different node randomly
// this to prevent from deterministically selecting the same node even
// if the node failed to replicate the block on previous iterations
{noformat}

Unfortunately, the decommissioning check happens before this randomness, so the algorithm can get stuck trying to replicate from a decommissioning node. We've seen this in practice, where a decommissioning datanode was failing to replicate a block for many days while other viable replicas of the block were available.

Given that we limit the number of streams we'll assign to a given node (default soft limit of 2, hard limit of 4), it doesn't seem like favoring a decommissioning node has significant benefit. I.e., when there is significant replication work to do, we'll quickly hit the stream limit of the decommissioning nodes and use other nodes in the cluster anyway; when there isn't significant replication work, then in theory we've got plenty of replication bandwidth available, so choosing a decommissioning node isn't much of a win.

I see two choices:
1) Change the algorithm to still favor decommissioning nodes, but with some level of randomness that will avoid always selecting the decommissioning node (see the sketch below)
2) Remove the favoritism for decommissioning nodes

I prefer #2. It simplifies the algorithm, and given the other throttles we have in place, I'm not sure there is a significant benefit to selecting decommissioning nodes.
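To make choice #1 concrete, a hypothetical tweak; this is not the real chooseSourceDatanode(), and the names and 50% bias are assumptions:

{code}
import java.util.List;
import java.util.Random;

// Illustrative: still lean toward decommissioning replicas (they serve no
// writes), but only probabilistically, so a decommissioning node that keeps
// failing to replicate cannot be selected deterministically forever.
class SourceChooser {
  interface Node { boolean isDecommissionInProgress(); }

  private final Random rand = new Random();

  /** Assumes a non-empty candidate list. */
  Node choose(List<Node> candidates) {
    for (Node n : candidates) {
      // favor a decommissioning node only ~50% of the time
      if (n.isDecommissionInProgress() && rand.nextBoolean()) {
        return n;
      }
    }
    // otherwise fall back to a uniformly random candidate
    return candidates.get(rand.nextInt(candidates.size()));
  }
}
{code}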
[jira] [Created] (HDFS-7645) Rolling upgrade is restoring blocks from trash multiple times
Nathan Roberts created HDFS-7645:
------------------------------------

             Summary: Rolling upgrade is restoring blocks from trash multiple times
                 Key: HDFS-7645
                 URL: https://issues.apache.org/jira/browse/HDFS-7645
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.6.0
            Reporter: Nathan Roberts

When performing an HDFS rolling upgrade, the trash directory is getting restored twice, when under normal circumstances it shouldn't need to be restored at all. IIUC, the only time these blocks should be restored is if we need to rollback a rolling upgrade. On a busy cluster, this can cause significant and unnecessary block churn, both on the datanodes and, more importantly, in the namenode.

The two times this happens are:

1) Restart of the DN onto new software:
{code}
  private void doTransition(DataNode datanode, StorageDirectory sd,
      NamespaceInfo nsInfo, StartupOption startOpt) throws IOException {
    if (startOpt == StartupOption.ROLLBACK && sd.getPreviousDir().exists()) {
      Preconditions.checkState(!getTrashRootDir(sd).exists(),
          sd.getPreviousDir() + " and " + getTrashRootDir(sd) + " should not " +
          " both be present.");
      doRollback(sd, nsInfo); // rollback if applicable
    } else {
      // Restore all the files in the trash. The restored files are retained
      // during rolling upgrade rollback. They are deleted during rolling
      // upgrade downgrade.
      int restored = restoreBlockFilesFromTrash(getTrashRootDir(sd));
      LOG.info("Restored " + restored + " block files from trash.");
    }
{code}

2) When the heartbeat response no longer indicates a rolling upgrade is in progress:
{code}
  /**
   * Signal the current rolling upgrade status as indicated by the NN.
   * @param inProgress true if a rolling upgrade is in progress
   */
  void signalRollingUpgrade(boolean inProgress) throws IOException {
    String bpid = getBlockPoolId();
    if (inProgress) {
      dn.getFSDataset().enableTrash(bpid);
      dn.getFSDataset().setRollingUpgradeMarker(bpid);
    } else {
      dn.getFSDataset().restoreTrash(bpid);
      dn.getFSDataset().clearRollingUpgradeMarker(bpid);
    }
  }
{code}

HDFS-6800 and HDFS-6981 were modifying this behavior, making it not completely clear whether this is somehow intentional.
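One hypothetical way to avoid the duplicate restore (not a committed fix): make the restore idempotent per block pool, so whichever of the two paths runs second becomes a no-op:

{code}
import java.util.concurrent.Callable;

// Illustrative: remember that trash was already restored for this block
// pool, so the heartbeat-driven restoreTrash() after the startup-time
// restore does no extra work.
class TrashRestoreGuard {
  private boolean restored = false;

  synchronized int restoreOnce(Callable<Integer> restoreOp) throws Exception {
    if (restored) {
      return 0; // already restored by the other path
    }
    restored = true;
    return restoreOp.call(); // e.g. restoreBlockFilesFromTrash(...)
  }
}
{code}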
[jira] [Created] (HDFS-6407) new namenode UI, lost ability to sort columns in datanode tab
Nathan Roberts created HDFS-6407:
------------------------------------

             Summary: new namenode UI, lost ability to sort columns in datanode tab
                 Key: HDFS-6407
                 URL: https://issues.apache.org/jira/browse/HDFS-6407
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.4.0
            Reporter: Nathan Roberts
            Priority: Minor

The old UI supported clicking on a column header to sort on that column. The new UI seems to have dropped this very useful feature.
[jira] [Created] (HDFS-6166) revisit balancer so_timeout
Nathan Roberts created HDFS-6166:
------------------------------------

             Summary: revisit balancer so_timeout
                 Key: HDFS-6166
                 URL: https://issues.apache.org/jira/browse/HDFS-6166
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: balancer
    Affects Versions: 2.3.0, 3.0.0
            Reporter: Nathan Roberts
            Assignee: Nathan Roberts
            Priority: Blocker

HDFS-5806 changed the socket read timeout for the balancer's connection to the DN to 60 seconds. This works as long as balancer bandwidth is such that it's safe to assume the DN will easily complete the operation within this time. Obviously this isn't a good assumption. When the assumption isn't valid, the balancer will time out the command BUT will then be out of sync with the datanode (the balancer thinks the DN has room to do more work; the DN is still working on the request and will fail any subsequent requests with "threads quota exceeded" errors). This causes expensive NN traffic via getBlocks() and also causes lots of WARNs in the balancer log.

Unfortunately the protocol is such that it's impossible to tell whether the DN is busy working on replacing the block, OR is in bad shape and will never finish. So, in the interest of a small change to deal with both situations, I propose the following two changes:
* Crank up the socket read timeout to 20 minutes
* Delay looking at a node for a bit if we did time out in this way (the DN could still have xceiver threads working on the replace)
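A sketch of how the two proposals could fit together on the balancer side. The backoff duration is an assumption, and the class is illustrative rather than the actual Dispatcher code:

{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.HashMap;
import java.util.Map;

// Illustrative: a long read timeout plus a per-DN cool-down after any
// timeout, so we don't immediately hammer a DN that may still be working
// on the replace (and would answer "threads quota exceeded").
class BalancerDnPolicy {
  static final int READ_TIMEOUT_MS = 20 * 60 * 1000; // proposed 20 minutes
  static final long BACKOFF_MS = 60_000;             // assumed cool-down

  private final Map<InetSocketAddress, Long> blockedUntil = new HashMap<>();

  Socket connect(InetSocketAddress dn) throws IOException {
    Socket s = new Socket();
    s.connect(dn);
    s.setSoTimeout(READ_TIMEOUT_MS);
    return s;
  }

  void onReadTimeout(InetSocketAddress dn) {
    blockedUntil.put(dn, System.currentTimeMillis() + BACKOFF_MS);
  }

  boolean isUsable(InetSocketAddress dn) {
    return System.currentTimeMillis() >= blockedUntil.getOrDefault(dn, 0L);
  }
}
{code}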
[jira] [Created] (HDFS-5806) balancer should set SoTimeout to avoid indefinite hangs
Nathan Roberts created HDFS-5806:
------------------------------------

             Summary: balancer should set SoTimeout to avoid indefinite hangs
                 Key: HDFS-5806
                 URL: https://issues.apache.org/jira/browse/HDFS-5806
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: balancer
    Affects Versions: 2.2.0, 3.0.0
            Reporter: Nathan Roberts
            Assignee: Nathan Roberts

Simple patch to avoid the balancer hanging when a datanode stops responding to requests.
[jira] [Created] (HDFS-5788) listLocatedStatus response can be very large
Nathan Roberts created HDFS-5788:
------------------------------------

             Summary: listLocatedStatus response can be very large
                 Key: HDFS-5788
                 URL: https://issues.apache.org/jira/browse/HDFS-5788
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: namenode
    Affects Versions: 2.2.0, 0.23.10, 3.0.0
            Reporter: Nathan Roberts
            Assignee: Nathan Roberts

Currently we limit the size of listStatus requests to a default of 1000 entries. This works fine except in the case of listLocatedStatus, where the location information can be quite large. As an example, for a directory with 7000 entries, 4 blocks each, and 3-way replication, a listLocatedStatus response is over 1MB. This can chew up very large amounts of memory in the NN if lots of clients try to do this simultaneously.

It seems it would be better if we also considered the amount of location information being returned when deciding how many files to return. Patch will follow shortly.
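A sketch of the size-aware batching idea. The byte budget and the notion of estimating per-file location size are assumptions for illustration, not the actual patch:

{code}
// Illustrative: cap each listLocatedStatus batch by an estimated response
// size as well as by entry count, so directories with many block locations
// don't produce multi-MB responses.
class LocatedListingLimiter {
  static final int MAX_ENTRIES = 1000;               // existing default limit
  static final int MAX_LOCATION_BYTES = 512 * 1024;  // hypothetical budget

  /** @param perFileLocationBytes estimated location size of each file, in order */
  static int entriesToReturn(int[] perFileLocationBytes) {
    int bytes = 0, n = 0;
    for (int fileBytes : perFileLocationBytes) {
      if (n >= MAX_ENTRIES || bytes + fileBytes > MAX_LOCATION_BYTES) {
        break; // client continues from here with its next batch
      }
      bytes += fileBytes;
      n++;
    }
    return Math.max(n, 1); // always return at least one entry to make progress
  }
}
{code}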
[jira] [Created] (HDFS-5535) Umbrella jira for improved HDFS rolling upgrades
Nathan Roberts created HDFS-5535:
------------------------------------

             Summary: Umbrella jira for improved HDFS rolling upgrades
                 Key: HDFS-5535
                 URL: https://issues.apache.org/jira/browse/HDFS-5535
             Project: Hadoop HDFS
          Issue Type: New Feature
          Components: datanode, ha, hdfs-client, namenode
    Affects Versions: 2.2.0, 3.0.0
            Reporter: Nathan Roberts

In order to roll a new HDFS release through a large cluster quickly and safely, a few enhancements are needed in HDFS. An initial high-level design document will be attached to this jira, and sub-jiras will itemize the individual tasks.
[jira] [Created] (HDFS-5446) Consider supporting a mechanism to allow datanodes to drain outstanding work during rolling upgrade
Nathan Roberts created HDFS-5446:
------------------------------------

             Summary: Consider supporting a mechanism to allow datanodes to drain outstanding work during rolling upgrade
                 Key: HDFS-5446
                 URL: https://issues.apache.org/jira/browse/HDFS-5446
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: datanode
    Affects Versions: 2.2.0
            Reporter: Nathan Roberts

Rebuilding write pipelines is expensive, and this can happen many times during a rolling restart of datanodes (i.e. during a rolling upgrade). It seems like it might help if datanodes could be told to drain current work while rejecting new requests, possibly with a new response indicating the node is temporarily unavailable (it's not broken, it's just going through a maintenance phase where it shouldn't accept new work). Waiting just a few seconds is normally enough to clear up a good percentage of the open requests without error, thus reducing the overhead associated with restarting lots of datanodes in rapid succession. Obviously this would need a timeout to make sure the datanode doesn't wait forever.
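A rough sketch of the drain mechanism being floated. The "temporarily unavailable" response and the timings are assumptions; this is a thought experiment, not an HDFS API:

{code}
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative: a DN-side draining flag. New requests are refused (the
// caller would map this to a hypothetical "temporarily unavailable"
// response), while in-flight requests get a bounded window to finish.
class DrainState {
  private volatile boolean draining = false;
  private final AtomicInteger inFlight = new AtomicInteger();

  void drainThenShutdown(long maxWaitMs) throws InterruptedException {
    draining = true;
    long deadline = System.currentTimeMillis() + maxWaitMs;
    while (inFlight.get() > 0 && System.currentTimeMillis() < deadline) {
      Thread.sleep(100); // a few seconds usually clears most open requests
    }
  }

  boolean tryBegin() {          // call at the start of each new request
    if (draining) return false; // reject: node is temporarily unavailable
    inFlight.incrementAndGet();
    return true;
  }

  void end() { inFlight.decrementAndGet(); } // call when a request completes
}
{code}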