Hi Chao,

As far as I know, if client B opens a file that is under construction, the DFSInputStream gets a LocatedBlocks object, which contains a member variable called "underConstruction" that marks the file as under construction. If the file is reopened, the client will get a different length. I think this makes sense, because the file is no longer the old one but one with newly appended data.
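To make the behaviour concrete, here is a toy model in plain Java. It is only a sketch of the idea that a reader fixes the visible length when it opens the file; it is not the actual DFSInputStream code, and the names GrowingFile and SnapshotReader are hypothetical and do not exist in Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.io.ByteArrayOutputStream;

class SnapshotDemo {
    public static void main(String[] args) {
        GrowingFile f = new GrowingFile();
        f.appendAndFlush("ab".getBytes(StandardCharsets.US_ASCII)); // client A writes

        SnapshotReader b = f.open();      // client B opens: length is fixed at 2
        b.read();
        b.read();
        f.appendAndFlush("c".getBytes(StandardCharsets.US_ASCII));  // A appends more
        System.out.println(b.read());     // -1: B still uses the old length

        SnapshotReader reopened = f.open(); // reopening picks up the new length
        reopened.seek(2);
        System.out.println((char) reopened.read()); // c
    }
}

/** Toy stand-in for a file that a writer keeps appending to (assumption, not HDFS code). */
class GrowingFile {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();

    /** Models "append then hflush": the bytes become visible to readers opened later. */
    synchronized void appendAndFlush(byte[] bytes) {
        data.write(bytes, 0, bytes.length);
    }

    synchronized long visibleLength() {
        return data.size();
    }

    /** Copies the buffer on every call; acceptable for a toy model only. */
    synchronized byte byteAt(long pos) {
        return data.toByteArray()[(int) pos];
    }

    /** Opening a reader captures the visible length at that moment. */
    SnapshotReader open() {
        return new SnapshotReader(this, visibleLength());
    }
}

/** Toy reader whose notion of the file length never changes after open. */
class SnapshotReader {
    private final GrowingFile file;
    private final long lengthAtOpen;
    private long pos = 0;

    SnapshotReader(GrowingFile file, long lengthAtOpen) {
        this.file = file;
        this.lengthAtOpen = lengthAtOpen;
    }

    /** Returns the next byte, or -1 once the position reaches the length seen at open. */
    int read() {
        if (pos >= lengthAtOpen) {
            return -1;
        }
        return file.byteAt(pos++) & 0xff;
    }

    void seek(long newPos) {
        pos = newPos;
    }
}
```

In this model the only way for B to see the byte appended after its open is to open the file again, which mirrors the behaviour described above.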
When HBase writes edits to its log, it appends them to the end of the WAL file rather than reopening the HDFS file a second time.

2013/12/27 Chao Shi <stepi...@live.com>

> Hi users,
>
> Suppose a client A opens /f and keeps appending data and then hflushing.
> Another client B opens this file for read. I found that B can only see the
> snapshot of data at the time he opens the file. (After B's opening, A may
> continue to write more data. B cannot see it unless he reopens the file.)
>
> Looking into the code, I think this is because DFSInputStream maintains a
> file length and simply reports EOF when we read beyond the file length. The
> file length is updated, and thus the client has a chance to see a longer
> file, when:
> 1) the file is opened
> 2) there are no live DNs to read from (correct? not very sure.)
>
> I think such behaviour is inconsistent. Clients may see a sudden change of
> file length. I guess a better behaviour is to always try to read beyond the
> known file length at client-side and let the DN return EOF if there is no
> more data. In this way, the client B can continue to see what A wrote and
> hflushed.
>
> A real use case for this is HBase log replication. In the region server,
> there is a background thread that keeps polling for new HLog entries. It has
> to reopen every second. This may put pressure on the NN if the number of
> region servers gets larger.
>
> Please correct me if there is anything wrong.
>
> Thanks,
> Chao
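The alternative proposed in the quoted message, where the client does not cache a length and lets the data source report EOF, can be sketched with a toy model in plain Java. This only illustrates the idea and says nothing about how it could be implemented in DFSInputStream; FlushedLog and ReadThroughReader are hypothetical names, with FlushedLog playing the role of the DN.

```java
import java.util.ArrayList;
import java.util.List;

class TailDemo {
    public static void main(String[] args) {
        FlushedLog log = new FlushedLog();
        log.appendAndFlush("edit-1");

        ReadThroughReader reader = new ReadThroughReader(log);
        System.out.println(reader.poll()); // [edit-1]

        log.appendAndFlush("edit-2");      // writer continues after the reader opened
        System.out.println(reader.poll()); // [edit-2], seen without reopening
        System.out.println(reader.poll()); // [], nothing new yet
    }
}

/** Toy stand-in for data that a writer has hflushed (assumption, not an HDFS class). */
class FlushedLog {
    private final List<String> entries = new ArrayList<>();

    synchronized void appendAndFlush(String entry) {
        entries.add(entry);
    }

    /** Plays the role of the DN: returns null to signal EOF at this index. */
    synchronized String entryAt(int index) {
        return index < entries.size() ? entries.get(index) : null;
    }
}

/** Toy reader that keeps only its own position and asks the source on every read. */
class ReadThroughReader {
    private final FlushedLog log;
    private int next = 0;

    ReadThroughReader(FlushedLog log) {
        this.log = log;
    }

    /** Returns every entry flushed since the last call, stopping when the source reports EOF. */
    List<String> poll() {
        List<String> out = new ArrayList<>();
        String entry;
        while ((entry = log.entryAt(next)) != null) {
            out.add(entry);
            next++;
        }
        return out;
    }
}
```

In this model a polling thread, such as the one described for log replication, would call poll() repeatedly on the same reader instead of reopening the file every second.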