Jack Fan created HDFS-13807:
-------------------------------

             Summary: Large overhead when seeking and reading only a small piece of a file
                 Key: HDFS-13807
                 URL: https://issues.apache.org/jira/browse/HDFS-13807
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode, hdfs-client
    Affects Versions: 2.7.6, 2.8.4
         Environment: HDFS server is 2.8.2
HDFS client is 2.7.1
I use `pyarrow` with both `libhdfs` and `libhdfs3`; I observe the same behavior with both drivers.
            Reporter: Jack Fan


I'm storing small files (~500KB in size) inside big chunk files (256MB~2GB) in HDFS. I maintain a separate index file that records the offset and length of each small file within its chunk file.

When I randomly read these small files, for each one I open the corresponding chunk file, seek to its `offset`, and read `length` bytes (see the sketch at the end of this description).

However, I noticed that when I read a small piece of data (say, 500KB), the datanode transfers considerably more data (~4MB) than that to the HDFS client.

I originally thought this was the readahead feature on the datanode, which sends extra data to the client in advance to speed up streaming reads. However, I tried setting `dfs.client.cache.readahead` to 0 in the client configuration (see the configuration sketch below) and the behavior still persists.

I also used `tcpdump` to capture packets and discovered that the datanode keeps sending data after the HDFS client closes the TCP connection for the RPC (I observed a bunch of RST packets sent by the HDFS client).

It seems the datanode spontaneously sends more data than requested to the HDFS client; I want to know how to stop this behavior.
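For reference, a minimal sketch of the access pattern described above, using pyarrow's libhdfs-backed filesystem API. The namenode host/port, chunk path, offset, and length are hypothetical placeholders, and the index lookup is assumed to happen elsewhere:

```python
# Minimal sketch of the seek-and-read access pattern described above.
# Host, port, path, offset and length are hypothetical placeholders.
import pyarrow as pa

fs = pa.hdfs.connect('namenode-host', 8020)

def read_small_file(chunk_path, offset, length):
    """Read one small file out of a large chunk file via seek + read."""
    with fs.open(chunk_path, 'rb') as f:
        f.seek(offset)          # jump to the small file's start offset
        return f.read(length)   # read only the small file's bytes

# e.g. a ~500KB small file stored 1MiB into a chunk file
data = read_small_file('/data/chunks/chunk-0001', 1024 * 1024, 500 * 1024)
```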
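And a sketch of how the readahead setting can be passed from the client side. This assumes the `extra_conf` parameter of `pyarrow.hdfs.connect` forwards Hadoop configuration keys to the driver; equivalently, the property can be set in the hdfs-site.xml on the client's CLASSPATH:

```python
# Sketch: pass dfs.client.cache.readahead=0 to the client-side driver.
# The extra_conf parameter is an assumption here; setting the property
# in the client's hdfs-site.xml should have the same effect.
import pyarrow as pa

fs = pa.hdfs.connect(
    'namenode-host', 8020,   # hypothetical namenode host/port
    driver='libhdfs',
    extra_conf={'dfs.client.cache.readahead': '0'},
)
```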