[ https://issues.apache.org/jira/browse/HDFS-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ramtin updated HDFS-1950:
-------------------------
    Assignee: Uma Maheswara Rao G

> Blocks that are under construction are not getting read if the blocks are
> more than 10. Only complete blocks are read properly.
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1950
>                 URL: https://issues.apache.org/jira/browse/HDFS-1950
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client, namenode
>    Affects Versions: 0.20.205.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: Uma Maheswara Rao G
>            Priority: Blocker
>         Attachments: HDFS-1950-2.patch, HDFS-1950.1.patch,
> hdfs-1950-0.20-append-tests.txt, hdfs-1950-trunk-test.txt,
> hdfs-1950-trunk-test.txt
>
>
> Before going to the root cause, let's look at the read behavior for a file
> having more than 10 blocks in the append case.
> Logic:
> ====
> There is a prefetch size, dfs.read.prefetch.size, for the DFSInputStream,
> which has a default value of 10.
> This prefetch size is the number of block locations the client fetches from
> the namenode at a time while reading a file.
> For example, assume a file X with 22 blocks resides in HDFS.
> The reader first fetches the first 10 block locations from the namenode and
> starts reading.
> After the above step, the reader fetches the next 10 block locations from
> the NN and continues reading.
> Then the reader fetches the remaining 2 block locations from the NN and
> completes the read.
> Cause:
> =======
> Now let's look at the cause of this issue.
> The scenario that fails is: "Writer wrote 10+ blocks and a partial block and
> called sync. A reader trying to read the file will not get the last partial
> block."
> The client first gets 10 block locations from the NN. It then checks whether
> the file is under construction; if so, it gets the size of the last partial
> block from the datanode and reads the full file.
> However, when the number of blocks is more than 10, the last block will not
> be in the first fetch. It will be in a later fetch (the last block will be
> in the (num of blocks / 10)th fetch).
> The problem is that in DFSClient there is no logic to get the size of the
> last partial block (as is done for the first fetch) for any of the later
> fetches, so the reader will not be able to read all of the data that was
> synced.
> Also, the InputStream.available API uses the length obtained from the first
> fetch to iterate; ideally this size has to be updated as well.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
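The batched fetch arithmetic described above can be sketched as follows. This is a minimal illustration, not the actual DFSClient/DFSInputStream code; the class and method names are hypothetical:

```java
// Sketch of the prefetch batching described in the issue.
// PREFETCH_SIZE models the default dfs.read.prefetch.size of 10.
public class BlockFetchSketch {
    static final int PREFETCH_SIZE = 10;

    // 1-based index of the namenode fetch that contains the last block.
    // For 22 blocks: fetch 1 returns blocks 1-10, fetch 2 blocks 11-20,
    // fetch 3 blocks 21-22, so the last (possibly partial) block only
    // arrives in fetch 3.
    static int fetchContainingLastBlock(int numBlocks) {
        return (numBlocks + PREFETCH_SIZE - 1) / PREFETCH_SIZE;
    }

    public static void main(String[] args) {
        int numBlocks = 22; // the file X from the example above
        System.out.println("Last block arrives in fetch #"
                + fetchContainingLastBlock(numBlocks));
        // The under-construction check (asking the datanode for the
        // partial block's length) runs only after the first fetch,
        // which is why the reader misses the synced partial block.
    }
}
```

With more than 10 blocks the last-block index exceeds the first batch, so the "ask the datanode for the partial block length" step never sees it.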