Jack Fan created HDFS-13807:
-------------------------------

             Summary: Large overhead when seeking and reading only a small piece of a file
                 Key: HDFS-13807
                 URL: https://issues.apache.org/jira/browse/HDFS-13807
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode, hdfs-client
    Affects Versions: 2.7.6, 2.8.4
         Environment: HDFS server is 2.8.2
HDFS client is 2.7.1
I use `pyarrow` with both `libhdfs` and `libhdfs3`; I observe the same behavior with both drivers.
            Reporter: Jack Fan


I'm storing small files (~500KB in size) inside big chunk files (256MB~2GB) in HDFS. I maintain a separate index file that records the offset and length of each small file within its chunk file.

When I randomly read these small files, for each one I open the corresponding chunk file, seek to its `offset`, and read `length` bytes (see the sketch at the end of this description).

However, I noticed that when I read a small piece of data (say, 500KB), the datanode transfers considerably more data (~4MB) than that to the HDFS client.

I originally thought this was the readahead feature on the datanode, which sends extra data to the client in advance to speed up streaming reads. However, I tried setting `dfs.client.cache.readahead` to 0 in the client configuration (see the configuration sketch below) and the behavior still persists.

I also used `tcpdump` to capture packets and discovered that the datanode keeps sending data after the HDFS client closes the TCP connection for the RPC (I observed a bunch of RST packets sent by the HDFS client).

It seems the datanode spontaneously sends more data than requested to the HDFS client; I want to know how to stop this behavior.
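For reference, a minimal sketch of the access pattern described above, using pyarrow's libhdfs-backed filesystem API. The namenode host/port, chunk path, offset, and length are hypothetical placeholders, and the index lookup is assumed to happen elsewhere:

```python
# Minimal sketch of the seek-and-read access pattern described above.
# Host, port, path, offset and length are hypothetical placeholders.
import pyarrow as pa

fs = pa.hdfs.connect('namenode-host', 8020)

def read_small_file(chunk_path, offset, length):
    """Read one small file out of a large chunk file via seek + read."""
    with fs.open(chunk_path, 'rb') as f:
        f.seek(offset)          # jump to the small file's start offset
        return f.read(length)   # read only the small file's bytes

# e.g. a ~500KB small file stored 1MiB into a chunk file
data = read_small_file('/data/chunks/chunk-0001', 1024 * 1024, 500 * 1024)
```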
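And a sketch of how the readahead setting can be passed from the client side. This assumes the `extra_conf` parameter of `pyarrow.hdfs.connect` forwards Hadoop configuration keys to the driver; equivalently, the property can be set in the hdfs-site.xml on the client's CLASSPATH:

```python
# Sketch: pass dfs.client.cache.readahead=0 to the client-side driver.
# The extra_conf parameter is an assumption here; setting the property
# in the client's hdfs-site.xml should have the same effect.
import pyarrow as pa

fs = pa.hdfs.connect(
    'namenode-host', 8020,   # hypothetical namenode host/port
    driver='libhdfs',
    extra_conf={'dfs.client.cache.readahead': '0'},
)
```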