[jira] [Created] (HDFS-11755) Underconstruction blocks can be considered missing

2017-05-04 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-11755:
-

 Summary: Underconstruction blocks can be considered missing
 Key: HDFS-11755
 URL: https://issues.apache.org/jira/browse/HDFS-11755
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0-alpha2, 2.8.1
Reporter: Nathan Roberts
Assignee: Nathan Roberts


The following sequence of events can lead to an under-construction block being
considered missing.

- pipeline of 3 DNs, DN1->DN2->DN3
- DN3 has a failing disk so some updates take a long time
- Client writes entire block and is waiting for final ack
- DN1, DN2 and DN3 have all received the block 
- DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3
- DN3 is having trouble finalizing the block due to the failing drive. It does 
eventually succeed but it is VERY slow at doing so. 
- DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so 
DN1 notices and does the same. Neither DN1 nor DN2 finalized the block.
- DN3 finally sends an IBR to the NN indicating the block has been received.
- The drive containing the block on DN3 fails badly enough that the DN takes it
offline and notifies the NN of the failed volume
- The NN removes DN3's replica from the triplets and then declares the block
missing because there are no other replicas

It seems like we shouldn't consider uncompleted blocks for replication.
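A minimal sketch of the kind of guard being suggested (the enum and names are
illustrative stand-ins, not the actual BlockManager internals):

{code}
// Hypothetical sketch: an under-construction block with no finalized replicas
// isn't "missing" yet; only completed blocks should qualify for that state.
enum SketchBlockState { COMPLETE, COMMITTED, UNDER_CONSTRUCTION }

class MissingBlockCheckSketch {
  static boolean isMissing(SketchBlockState state, int liveReplicas) {
    if (state != SketchBlockState.COMPLETE) {
      return false;   // don't count uncompleted blocks as missing
    }
    return liveReplicas == 0;
  }
}
{code}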






[jira] [Created] (HDFS-11661) GetContentSummary uses excessive amounts of memory

2017-04-17 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-11661:
-

 Summary: GetContentSummary uses excessive amounts of memory
 Key: HDFS-11661
 URL: https://issues.apache.org/jira/browse/HDFS-11661
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.8.0
Reporter: Nathan Roberts
Priority: Blocker


ContentSummaryComputationContext::nodeIncluded() is being used to keep track of 
all INodes visited during the current content summary calculation. This can be 
all of the INodes in the filesystem, making for a VERY large hash table. This 
simply won't work on large filesystems. 

We noticed this after upgrading: a namenode with ~100 million filesystem
objects was spending significantly more time in GC. Fortunately this system had
some memory breathing room; other clusters we have will not run with this
additional demand on memory.
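As a rough back-of-envelope (assuming a plain java.util.HashSet<Long> of INode
IDs at ~48 bytes per entry; the actual data structure and overhead may differ),
tracking ~100 million visited INodes costs several GB:

{code}
public class VisitedSetEstimate {
  public static void main(String[] args) {
    // Rough assumption: ~48 bytes per HashSet<Long> entry
    // (HashMap.Node + boxed Long + table slot); not an exact measurement.
    long inodes = 100_000_000L;
    long bytesPerEntry = 48L;
    System.out.printf("~%.1f GB just to remember visited INodes%n",
        inodes * bytesPerEntry / 1e9);   // prints ~4.8 GB
  }
}
{code}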

This was added as part of HDFS-10797 as a way of keeping track of INodes that 
have already been accounted for - to avoid double counting.






[jira] [Reopened] (HDFS-4946) Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable

2016-02-18 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts reopened HDFS-4946:
--

[~jrkinley], re-opening because this is a very useful patch. Let me know if you 
disagree or would like me to assign it to myself to close out any remaining 
issues.

> Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable
> ---
>
> Key: HDFS-4946
> URL: https://issues.apache.org/jira/browse/HDFS-4946
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: James Kinley
>Assignee: James Kinley
> Attachments: HDFS-4946-1.patch, HDFS-4946-2.patch
>
>
> Allow preferLocalNode in BlockPlacementPolicyDefault to be disabled in 
> configuration to prevent a client from writing the first replica of every 
> block (i.e. the entire file) to the local DataNode.





[jira] [Created] (HDFS-8894) Set SO_KEEPALIVE on DN server sockets

2015-08-13 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-8894:


 Summary: Set SO_KEEPALIVE on DN server sockets
 Key: HDFS-8894
 URL: https://issues.apache.org/jira/browse/HDFS-8894
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.7.1
Reporter: Nathan Roberts


SO_KEEPALIVE is not set on things like DataStreamer sockets, which can cause
lingering ESTABLISHED sockets when there is a network glitch.
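A minimal sketch of the kind of change intended (plain java.net here; the
actual DN/DataStreamer socket setup differs):

{code}
import java.net.Socket;
import java.net.SocketException;

public class KeepAliveSketch {
  // Enable TCP keep-alive so the OS eventually notices a dead peer and the
  // connection doesn't sit in ESTABLISHED forever after a network glitch.
  static void configure(Socket s) throws SocketException {
    s.setKeepAlive(true);
  }
}
{code}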





[jira] [Created] (HDFS-8873) throttle directoryScanner

2015-08-07 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-8873:


 Summary: throttle directoryScanner
 Key: HDFS-8873
 URL: https://issues.apache.org/jira/browse/HDFS-8873
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.7.1
Reporter: Nathan Roberts


The new 2-level directory layout can make directory scans expensive in terms of
disk seeks (see HDFS-8791 for details).

It would be good if the directoryScanner() had a configurable duty cycle that 
would reduce its impact on disk performance (much like the approach in 
HDFS-8617). 

Without such a throttle, disks can go 100% busy for many minutes at a time
(assuming the common case of all inodes in cache but no directory blocks
cached, 64K seeks are required for a full directory listing, which translates
to roughly 655 seconds).
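For reference, the 655-second figure assumes roughly 10ms per cold seek
(65,536 x 0.01s). A minimal sketch of the kind of duty-cycle throttle being
suggested (names and numbers are illustrative, not the actual DirectoryScanner
API):

{code}
public class DutyCycleThrottle {
  private final double dutyCycle;          // e.g. 0.25 = scan 25% of the time
  private final long sliceMillis = 100;    // work slice before pausing

  public DutyCycleThrottle(double dutyCycle) {
    this.dutyCycle = dutyCycle;
  }

  // Called from the scan loop: after sliceMillis of work, sleep long enough
  // that work / (work + sleep) equals the configured duty cycle.
  void maybePause(long workedMillis) throws InterruptedException {
    if (workedMillis >= sliceMillis && dutyCycle < 1.0) {
      long sleep = (long) (workedMillis * (1.0 - dutyCycle) / dutyCycle);
      Thread.sleep(sleep);
    }
  }
}
{code}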







[jira] [Created] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2015-07-16 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-8791:


 Summary: block ID-based DN storage layout can be very slow for 
datanode on ext4
 Key: HDFS-8791
 URL: https://issues.apache.org/jira/browse/HDFS-8791
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.1
Reporter: Nathan Roberts
Priority: Critical


We are seeing cases where the new directory layout causes the datanode to keep
the disks seeking for tens of minutes. This can happen when the datanode is
running du, and also when it is performing a checkDirs(). Both of these
operations currently scan all directories in the block pool, and that's very
expensive in the new layout.

The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf 
directories where block files are placed.

So, what we have on disk is:
- 256 inodes for the first level directories
- 256 directory blocks for the first level directories
- 256*256 inodes for the second level directories
- 256*256 directory blocks for the second level directories
- Then the inodes and blocks to store the HDFS blocks themselves.

The main problem is the 256*256 directory blocks. 
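As a rough back-of-envelope (the ~18ms cold random read per directory block is
an assumption, not a measurement), the 64K leaf directories alone account for
the multi-minute scans:

{code}
public class ColdScanEstimate {
  public static void main(String[] args) {
    // Each of the 256*256 leaf directories needs at least one cold random
    // read of its directory block, at roughly 15-20ms per seek on a SATA disk.
    int leafDirs = 256 * 256;               // 65,536
    double seekSeconds = 0.018;
    System.out.printf("cold full scan ~= %.0f seconds (~%.0f minutes)%n",
        leafDirs * seekSeconds, leafDirs * seekSeconds / 60);  // ~1180s, ~20min
  }
}
{code}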

Inodes and dentries will be cached by Linux, and one can configure how likely
the system is to prune those entries (vfs_cache_pressure). However, ext4 relies
on the buffer cache to cache the directory blocks, and I'm not aware of any way
to tell Linux to favor buffer cache pages (even if there were, I'm not sure I
would want it to in general).

Also, ext4 tries hard to spread directories evenly across the entire volume,
which basically means the 64K directory blocks are probably randomly spread
across the entire disk. A du-type scan looks at directories one at a time, so
the I/O scheduler can't optimize the corresponding seeks, meaning the seeks
will be random and far apart.

In a system I was using to diagnose this, I had 60K blocks. A du when things
are hot takes less than 1 second; when things are cold, about 20 minutes.

How do things get cold?
- A large set of tasks run on the node. This pushes almost all of the buffer 
cache out, causing the next DU to hit this situation. We are seeing cases where 
a large job can cause a seek storm across the entire cluster.

Why didn't the previous layout see this?
- It might have, but it wasn't nearly as pronounced. The previous layout would
be a few hundred directory blocks. Even when completely cold, these would only
take a few hundred seeks, which would mean single-digit seconds.
- With only a few hundred directories, the odds of a directory block getting
modified are quite high; this keeps those blocks hot and much less likely to be
evicted.






[jira] [Created] (HDFS-8404) pending block replication can get stuck using older genstamp

2015-05-14 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-8404:


 Summary: pending block replication can get stuck using older 
genstamp
 Key: HDFS-8404
 URL: https://issues.apache.org/jira/browse/HDFS-8404
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.7.0, 2.6.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts


If an under-replicated block gets into the pending-replication list, but the
genstamp of that block later ends up being newer than the one originally
submitted for replication, the block will fail replication until the NN is
restarted.

It would be safer if processPendingReplications() got up-to-date block info
before resubmitting replication work.
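A minimal sketch of the intended behavior (the types and lookup method are
hypothetical stand-ins, not the actual BlockManager API):

{code}
// Hypothetical sketch: re-resolve the block before resubmitting it so the
// replication command carries the current generation stamp.
interface BlockMap {
  StoredBlock getStoredBlock(long blockId);   // current NN-side block info
}

class StoredBlock {
  long blockId;
  long genStamp;
}

class PendingResubmitSketch {
  // Instead of reusing the (possibly stale) block object that timed out,
  // look up the latest block info and use that for the new replication work.
  static StoredBlock resolveForResubmit(BlockMap map, long timedOutBlockId) {
    return map.getStoredBlock(timedOutBlockId);
  }
}
{code}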








[jira] [Created] (HDFS-7742) favoring decommissioning node for replication can cause a block to stay underreplicated for long periods

2015-02-06 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-7742:


 Summary: favoring decommissioning node for replication can cause a 
block to stay underreplicated for long periods
 Key: HDFS-7742
 URL: https://issues.apache.org/jira/browse/HDFS-7742
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.6.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts


When choosing a source node to replicate a block from, a decommissioning node 
is favored. The reason for the favoritism is that decommissioning nodes aren't
servicing any writes, so in theory they are less loaded.

However, the same selection algorithm also tries to make sure it doesn't get 
"stuck" on any particular node:
{noformat}
  // switch to a different node randomly
  // this to prevent from deterministically selecting the same node even
  // if the node failed to replicate the block on previous iterations
{noformat}
Unfortunately, the decommissioning check happens before this randomness, so the
algorithm can get stuck trying to replicate from a decommissioning node. We've
seen this in practice, where a decommissioning datanode failed to replicate a
block for many days while other viable replicas of the block were available.

Given that we limit the number of streams we'll assign to a given node (default
soft limit of 2, hard limit of 4), it doesn't seem like favoring a
decommissioning node has significant benefit. That is, when there is significant
replication work to do, we'll quickly hit the stream limit of the
decommissioning nodes and use other nodes in the cluster anyway; when there
isn't significant replication work, then in theory we've got plenty of
replication bandwidth available, so choosing a decommissioning node isn't much
of a win.

I see two choices:
1) Change the algorithm to still favor decommissioning nodes but with some 
level of randomness that will avoid always selecting the decommissioning node
2) Remove the favoritism for decommissioning nodes
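A rough sketch of what choice 1 could look like (types and the 3-in-4
preference are illustrative, not the actual chooseSourceDatanode() code):

{code}
import java.util.List;
import java.util.Random;

// Keep a mild preference for decommissioning sources but break it with
// randomness so a single failing node can't be selected forever.
class SourceChoiceSketch {
  static <T> T choose(List<T> decommissioning, List<T> others, Random r) {
    boolean preferDecom = !decommissioning.isEmpty() && r.nextInt(4) != 0;
    List<T> pool = preferDecom ? decommissioning : others;
    if (pool.isEmpty()) {
      pool = preferDecom ? others : decommissioning;
    }
    return pool.isEmpty() ? null : pool.get(r.nextInt(pool.size()));
  }
}
{code}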

I prefer #2. It simplifies the algorithm, and given the other throttles we have 
in place, I'm not sure there is a significant benefit to selecting 
decommissioning nodes. 







[jira] [Created] (HDFS-7645) Rolling upgrade is restoring blocks from trash multiple times

2015-01-20 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-7645:


 Summary: Rolling upgrade is restoring blocks from trash multiple 
times
 Key: HDFS-7645
 URL: https://issues.apache.org/jira/browse/HDFS-7645
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Nathan Roberts


When performing an HDFS rolling upgrade, the trash directory is getting
restored twice, when under normal circumstances it shouldn't need to be restored
at all. IIUC, the only time these blocks should be restored is if we need to
roll back a rolling upgrade.

On a busy cluster, this can cause significant and unnecessary block churn both 
on the datanodes, and more importantly in the namenode.

The two times this happens are:
1) restart of DN onto new software
{code}
  private void doTransition(DataNode datanode, StorageDirectory sd,
  NamespaceInfo nsInfo, StartupOption startOpt) throws IOException {
if (startOpt == StartupOption.ROLLBACK && sd.getPreviousDir().exists()) {
  Preconditions.checkState(!getTrashRootDir(sd).exists(),
  sd.getPreviousDir() + " and " + getTrashRootDir(sd) + " should not " +
  " both be present.");
  doRollback(sd, nsInfo); // rollback if applicable
} else {
  // Restore all the files in the trash. The restored files are retained
  // during rolling upgrade rollback. They are deleted during rolling
  // upgrade downgrade.
  int restored = restoreBlockFilesFromTrash(getTrashRootDir(sd));
  LOG.info("Restored " + restored + " block files from trash.");
}
{code}

2) When the heartbeat response no longer indicates a rolling upgrade is in progress
{code}
  /**
   * Signal the current rolling upgrade status as indicated by the NN.
   * @param inProgress true if a rolling upgrade is in progress
   */
  void signalRollingUpgrade(boolean inProgress) throws IOException {
String bpid = getBlockPoolId();
if (inProgress) {
  dn.getFSDataset().enableTrash(bpid);
  dn.getFSDataset().setRollingUpgradeMarker(bpid);
} else {
  dn.getFSDataset().restoreTrash(bpid);
  dn.getFSDataset().clearRollingUpgradeMarker(bpid);
}
  }
{code}

HDFS-6800 and HDFS-6981 modified this behavior, making it not completely clear
whether this is somehow intentional.





[jira] [Created] (HDFS-6407) new namenode UI, lost ability to sort columns in datanode tab

2014-05-16 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-6407:


 Summary: new namenode UI, lost ability to sort columns in datanode 
tab
 Key: HDFS-6407
 URL: https://issues.apache.org/jira/browse/HDFS-6407
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Nathan Roberts
Priority: Minor


The old UI supported clicking on a column header to sort on that column. The
new UI seems to have dropped this very useful feature.








[jira] [Created] (HDFS-6166) revisit balancer so_timeout

2014-03-27 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-6166:


 Summary: revisit balancer so_timeout 
 Key: HDFS-6166
 URL: https://issues.apache.org/jira/browse/HDFS-6166
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 2.3.0, 3.0.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
Priority: Blocker


HDFS-5806 changed the socket read timeout for the balancer's connection to the
DN to 60 seconds. This works as long as the balancer bandwidth is such that it's
safe to assume the DN will easily complete the operation within this time.
Obviously this isn't a good assumption. When the assumption isn't valid, the
balancer will time out the command but will then be out of sync with the
datanode (the balancer thinks the DN has room to do more work, while the DN is
still working on the request and will fail any subsequent requests with "threads
quota exceeded" errors). This causes expensive NN traffic via getBlocks() and
also causes lots of WARNs in the balancer log.

Unfortunately, the protocol is such that it's impossible to tell whether the DN
is busy working on replacing the block or is in bad shape and will never finish.

So, in the interest of a small change that deals with both situations, I
propose the following two changes (a rough sketch follows below):
* Crank up the socket read timeout to 20 minutes
* Delay looking at a node for a bit if we did time out in this way (the DN could
still have xceiver threads working on the replacement)
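A minimal sketch of the two behaviors (plain java.net APIs; the deferral window
and how the balancer tracks nodes are illustrative assumptions):

{code}
import java.net.Socket;
import java.net.SocketException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

class BalancerTimeoutSketch {
  // 1) Much longer read timeout, so a slow-but-working replaceBlock isn't
  //    abandoned while the DN is still making progress on it.
  static void configure(Socket s) throws SocketException {
    s.setSoTimeout((int) TimeUnit.MINUTES.toMillis(20));
  }

  // 2) If we did time out, avoid scheduling work against that DN for a while;
  //    its xceiver threads may still be busy with the old request.
  private final Map<String, Long> deferredUntil = new ConcurrentHashMap<>();

  void markTimedOut(String datanode) {
    deferredUntil.put(datanode,
        System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(10));
  }

  boolean isUsable(String datanode) {
    return System.currentTimeMillis() >= deferredUntil.getOrDefault(datanode, 0L);
  }
}
{code}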





[jira] [Created] (HDFS-5806) balancer should set SoTimeout to avoid indefinite hangs

2014-01-21 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-5806:


 Summary: balancer should set SoTimeout to avoid indefinite hangs
 Key: HDFS-5806
 URL: https://issues.apache.org/jira/browse/HDFS-5806
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 2.2.0, 3.0.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts


Simple patch to avoid the balancer hanging when a datanode stops responding to
requests.





[jira] [Created] (HDFS-5788) listLocatedStatus response can be very large

2014-01-16 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-5788:


 Summary: listLocatedStatus response can be very large
 Key: HDFS-5788
 URL: https://issues.apache.org/jira/browse/HDFS-5788
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.2.0, 0.23.10, 3.0.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts


Currently we limit listStatus to a default of 1000 entries per response. This
works fine except in the case of listLocatedStatus, where the location
information can be quite large. As an example, for a directory with 7000
entries, 4 blocks each, and 3-way replication, a listLocatedStatus response is
over 1MB. This can chew up very large amounts of memory in the NN if lots of
clients try to do this simultaneously.
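A rough back-of-envelope on the 1MB figure (the bytes-per-location number is an
illustrative assumption, not a wire measurement):

{code}
public class ListLocatedStatusSizeEstimate {
  public static void main(String[] args) {
    // Even a conservative ~12 bytes per serialized replica location (before
    // any per-file status overhead) pushes the response past 1MB.
    int entries = 7000, blocksPerFile = 4, replication = 3;
    long locations = (long) entries * blocksPerFile * replication;  // 84,000
    System.out.printf("%,d locations -> ~%.2f MB at 12 bytes each%n",
        locations, locations * 12 / 1e6);   // ~1.0 MB of location data alone
  }
}
{code}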

It seems like it would be better if we also considered the amount of location
information being returned when deciding how many files to return.

Patch will follow shortly.





[jira] [Created] (HDFS-5535) Umbrella jira for improved HDFS rolling upgrades

2013-11-20 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-5535:


 Summary: Umbrella jira for improved HDFS rolling upgrades
 Key: HDFS-5535
 URL: https://issues.apache.org/jira/browse/HDFS-5535
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, ha, hdfs-client, namenode
Affects Versions: 2.2.0, 3.0.0
Reporter: Nathan Roberts


In order to roll a new HDFS release through a large cluster quickly and safely, 
a few enhancements are needed in HDFS. An initial high-level design document
will be attached to this jira, and sub-jiras will itemize the individual tasks.





[jira] [Created] (HDFS-5446) Consider supporting a mechanism to allow datanodes to drain outstanding work during rolling upgrade

2013-10-31 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-5446:


 Summary: Consider supporting a mechanism to allow datanodes to 
drain outstanding work during rolling upgrade
 Key: HDFS-5446
 URL: https://issues.apache.org/jira/browse/HDFS-5446
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.2.0
Reporter: Nathan Roberts


Rebuilding write pipelines is expensive, and this can happen many times during a
rolling restart of datanodes (i.e. during a rolling upgrade). It seems like it 
might help if datanodes could be told to drain current work while rejecting new 
requests - possibly with a new response indicating the node is temporarily 
unavailable (it's not broken, it's just going through a maintenance phase where 
it shouldn't accept new work). 

Waiting just a few seconds is normally enough to clear up a good percentage of 
the open requests without error, thus reducing the overhead associated with 
restarting lots of datanodes in rapid succession.

Obviously this would need a timeout to make sure the datanode doesn't wait
forever.
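A minimal sketch of the drain-with-deadline idea (names, the bounded wait, and
how outstanding work is counted are illustrative assumptions, not an actual
DataNode API):

{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class DrainSketch {
  private final AtomicInteger activeRequests = new AtomicInteger();
  private volatile boolean draining = false;

  // New requests are rejected while draining, so clients can fail over early
  // instead of having pipelines torn down mid-write by the restart.
  boolean tryAccept() {
    if (draining) {
      return false;            // caller replies "temporarily unavailable"
    }
    activeRequests.incrementAndGet();
    return true;
  }

  void finish() {
    activeRequests.decrementAndGet();
  }

  // Wait (bounded) for in-flight work to complete before shutting down.
  void drainAndWait(long maxWaitSeconds) throws InterruptedException {
    draining = true;
    long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(maxWaitSeconds);
    while (activeRequests.get() > 0 && System.nanoTime() < deadline) {
      Thread.sleep(100);
    }
  }
}
{code}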



