Andre Araujo created HDFS-12528:
-----------------------------------

             Summary: Short-circuit reads getting disabled frequently in certain scenarios
                 Key: HDFS-12528
                 URL: https://issues.apache.org/jira/browse/HDFS-12528
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs-client, performance
    Affects Versions: 2.6.0
            Reporter: Andre Araujo


We have scenarios where data ingestion makes use of the -appendToFile operation 
to add new data to existing HDFS files. In these situations, we're frequently 
running into the problem described below.

We're using Impala to query the HDFS data with short-circuit reads (SCR) 
enabled. After each file read, Impala calls "unbuffer" on the HDFS file to 
reduce the memory footprint. In some cases, though, Impala still keeps the HDFS 
file handle open for reuse.

The "unbuffer" call, however, causes the file's current block reader to be 
closed, which makes the associated ShortCircuitReplica evictable from the 
ShortCircuitCache. When the cluster is under load, the ShortCircuitReplica can 
be purged from the cache quickly, which closes the file descriptor to the 
underlying storage file.
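
The eviction sequence described above can be illustrated with a small, 
self-contained model. This is only a sketch and not the real 
ShortCircuitCache: the class ToyReplicaCache, its methods, and the simple 
size-based purge are hypothetical stand-ins chosen to show why a replica whose 
block reader has been closed can lose its file descriptor under load.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of replica eviction. Hypothetical; not the real ShortCircuitCache. */
final class ToyReplicaCache {
    static final class Replica {
        final String blockId;
        int refCount = 0;
        boolean fdOpen = true; // stands in for the open storage file descriptors
        Replica(String blockId) { this.blockId = blockId; }
    }

    private final int maxEvictable; // assumed size limit for unreferenced replicas
    private final Map<String, Replica> inUse = new HashMap<>();
    // Insertion-ordered, so the oldest unreferenced replica is purged first.
    private final LinkedHashMap<String, Replica> evictable = new LinkedHashMap<>();

    ToyReplicaCache(int maxEvictable) { this.maxEvictable = maxEvictable; }

    /** A block reader starts using the replica for this block. */
    Replica acquire(String blockId) {
        Replica r = evictable.remove(blockId);   // reuse a cached replica if present
        if (r == null) r = inUse.get(blockId);
        if (r == null) r = new Replica(blockId); // models re-opening the storage files
        r.refCount++;
        inUse.put(blockId, r);
        return r;
    }

    /** The block reader is closed, e.g. as a side effect of "unbuffer". */
    void release(Replica r) {
        if (--r.refCount > 0) return;
        inUse.remove(r.blockId);
        evictable.put(r.blockId, r);             // no readers left: now evictable
        while (evictable.size() > maxEvictable) {
            Iterator<Replica> it = evictable.values().iterator();
            Replica oldest = it.next();
            it.remove();
            oldest.fdOpen = false;               // purge closes the file descriptor
        }
    }
}
```

In this model, with maxEvictable set to 1, releasing a replica for one block 
and then acquiring and releasing a replica for a second block purges the first 
one, so the next read of the first block has to re-open its storage files.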

That means that when Impala re-reads the file, it has to re-open the storage 
files associated with the ShortCircuitReplicas that were evicted from the 
cache. If there were no appends to those blocks, the re-open will succeed 
without problems. If a block was appended to since the ShortCircuitReplica was 
created, the re-open will fail with the following error:

{code}
Meta file for BP-810388474-172.31.113.69-1499543341726:blk_1074012183_273087 
not found
{code}

This error is handled as an "unknown response" by the BlockReaderFactory [1], 
which disables short-circuit reads for 10 minutes [2] for the client.

These 10 minutes without SCR can have a significant performance impact on 
client operations. In this particular case ("Meta file not found") it would suffice to 
return null without disabling SCR. This particular block read would fall back 
to the normal, non-short-circuited, path and other SCR requests would continue 
to work as expected.
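
A minimal sketch of the behaviour proposed above, written as a standalone 
model rather than a patch. It does not use the real BlockReaderFactory or 
DomainSocketFactory code: the class ScrClientModel, the Outcome enum, and the 
string match on the error message are all hypothetical and only illustrate the 
idea of falling back for a single read without disabling SCR for the client.

```java
import java.util.function.LongSupplier;

/** Toy model of a client's SCR state. All names here are hypothetical. */
final class ScrClientModel {
    enum Outcome { SHORT_CIRCUIT, FALLBACK_THIS_READ_ONLY, FALLBACK_SCR_DISABLED }

    // The 10 minute disable window described in this report.
    private static final long DISABLE_MILLIS = 10 * 60 * 1000L;

    private final LongSupplier clockMillis;
    private long disabledUntilMillis = Long.MIN_VALUE;

    ScrClientModel(LongSupplier clockMillis) { this.clockMillis = clockMillis; }

    boolean scrEnabled() { return clockMillis.getAsLong() >= disabledUntilMillis; }

    /** @param error error text from the short-circuit request, or null on success */
    Outcome handleResponse(String error) {
        if (!scrEnabled()) return Outcome.FALLBACK_SCR_DISABLED;
        if (error == null) return Outcome.SHORT_CIRCUIT;
        // Assumption: the "Meta file ... not found" case is recognisable here.
        if (error.startsWith("Meta file for ") && error.endsWith("not found")) {
            return Outcome.FALLBACK_THIS_READ_ONLY; // SCR stays enabled
        }
        // Any other unexpected response keeps today's behaviour.
        disabledUntilMillis = clockMillis.getAsLong() + DISABLE_MILLIS;
        return Outcome.FALLBACK_SCR_DISABLED;
    }
}
```

With this shape, a "Meta file not found" response only affects the read that 
triggered it, while any other unexpected response still disables SCR for the 
full window.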

It would also be useful to be able to configure how long SCR is disabled in 
the "unknown response" case. 10 minutes seems too long, and not being able to 
change it is a problem.
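
As a rough illustration of what a configurable disable window could look like, 
here is a standalone sketch. It is not Hadoop code and does not correspond to 
any existing configuration key: the class ConfigurableScrDisabler and its 
constructor parameter are assumptions made purely for this example, with the 
default mirroring the 10 minutes mentioned above.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

/** Toy model of a configurable SCR disable window. Hypothetical; not Hadoop code. */
final class ConfigurableScrDisabler {
    /** Default mirrors the 10 minutes described in this report. */
    static final Duration DEFAULT_DISABLE_INTERVAL = Duration.ofMinutes(10);

    private final long disableMillis;
    private final LongSupplier clockMillis;
    private long disabledUntilMillis = Long.MIN_VALUE;

    ConfigurableScrDisabler(Duration disableInterval, LongSupplier clockMillis) {
        if (disableInterval.isNegative()) {
            throw new IllegalArgumentException("disable interval must not be negative");
        }
        this.disableMillis = disableInterval.toMillis();
        this.clockMillis = clockMillis;
    }

    /** Called when the client sees an "unknown response". */
    void onUnknownResponse() {
        disabledUntilMillis = clockMillis.getAsLong() + disableMillis;
    }

    boolean isShortCircuitAllowed() {
        return clockMillis.getAsLong() >= disabledUntilMillis;
    }
}
```

A deployment that wanted a shorter penalty could then construct the disabler 
with, say, Duration.ofSeconds(30) instead of the default.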

[1] 
https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/client/impl/BlockReaderFactory.java#L646

[2] 
https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/shortcircuit/DomainSocketFactory.java#L97



