Hi users,

Suppose client A opens /f and keeps appending data, calling hflush after each append. Another client B opens the same file for reading. I found that B can only see a snapshot of the data as of the time it opened the file. (After B opens the file, A may continue to write more data, but B cannot see it unless it reopens the file.)
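
To make the observation concrete, here is a minimal toy model in plain Java. It does not use the HDFS API: SharedFile and SnapshotReader are hypothetical names, and the reader is written to fix its visible length at open time, which is an assumption chosen to match what I observe and not a copy of the real client code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy in-memory model of the observed behaviour; no HDFS classes are involved. */
class SnapshotReadDemo {

    /** Stands in for /f: client A appends, and flushed bytes become visible to readers opened later. */
    static class SharedFile {
        private byte[] bytes = new byte[0];

        /** Models "append then hflush" by client A. */
        synchronized void appendAndFlush(byte[] more) {
            byte[] grown = Arrays.copyOf(bytes, bytes.length + more.length);
            System.arraycopy(more, 0, grown, bytes.length, more.length);
            bytes = grown;
        }

        synchronized int length() {
            return bytes.length;
        }

        synchronized int byteAt(int pos) {
            return bytes[pos] & 0xff;
        }
    }

    /** Stands in for client B's stream (assumption: the length is captured once, at open). */
    static class SnapshotReader {
        private final SharedFile file;
        private final int lengthAtOpen;
        private int pos = 0;

        SnapshotReader(SharedFile file) {
            this.file = file;
            this.lengthAtOpen = file.length();
        }

        /** Returns the next byte, or -1 once the length known at open time is reached. */
        int read() {
            return pos < lengthAtOpen ? file.byteAt(pos++) : -1;
        }

        /** Reads until EOF and returns how many bytes were consumed. */
        int drain() {
            int n = 0;
            while (read() != -1) {
                n++;
            }
            return n;
        }
    }

    public static void main(String[] args) {
        SharedFile f = new SharedFile();
        f.appendAndFlush("hello".getBytes(StandardCharsets.US_ASCII));

        SnapshotReader b = new SnapshotReader(f);
        System.out.println(b.drain());   // prints 5

        f.appendAndFlush(" world".getBytes(StandardCharsets.US_ASCII));
        System.out.println(b.drain());   // prints 0: B still stops at the old length

        SnapshotReader reopened = new SnapshotReader(f);
        System.out.println(reopened.drain());   // prints 11: only a reopen sees the new data
    }
}
```

In this model the only way for B to see A's later writes is to construct a new reader, which corresponds to reopening the file.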
Looking into the code, I think this is because DFSInputStream maintains a file length and simply reports EOF when a read goes beyond that length. The file length is updated, and thus the client gets a chance to see a longer file, when: 1) the file is opened; 2) there are no live DNs to read from (is that correct? I am not sure).

I think this behaviour is inconsistent: clients may see a sudden change in file length. I guess a better behaviour would be for the client to always try to read beyond its known file length and let the DN return EOF if there is no more data. That way, client B could continue to see what A has written and hflushed.

A real use case for this is HBase log replication. In the region server, a background thread keeps polling for new HLog entries, and it has to reopen the file every second. This may put pressure on the NN as the number of region servers grows.

Please correct me if anything here is wrong.

Thanks,
Chao
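
P.S. A minimal sketch of the behaviour I am proposing, written as a toy model in plain Java and not as a patch to DFSInputStream. FlushedData stands in for the DN side and ReadThroughReader for the client; both names are hypothetical, and the real read path (blocks, pipelines, retries) is deliberately left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy model of "read past the known length and let the data side decide EOF". */
class ReadThroughDemo {

    /** Stands in for the DN: it holds whatever the writer has hflushed so far. */
    static class FlushedData {
        private byte[] bytes = new byte[0];

        synchronized void appendAndFlush(byte[] more) {
            byte[] grown = Arrays.copyOf(bytes, bytes.length + more.length);
            System.arraycopy(more, 0, grown, bytes.length, more.length);
            bytes = grown;
        }

        synchronized long length() {
            return bytes.length;
        }

        /** Returns the byte at pos, or -1 if nothing has been flushed at that offset yet. */
        synchronized int byteAt(long pos) {
            return pos < bytes.length ? (bytes[(int) pos] & 0xff) : -1;
        }
    }

    /** Reader that treats the length learned at open as a lower bound, not as EOF. */
    static class ReadThroughReader {
        private final FlushedData source;
        private long knownLength;
        private long pos = 0;

        ReadThroughReader(FlushedData source) {
            this.source = source;
            this.knownLength = source.length();
        }

        /** Always asks the data side, even at or beyond knownLength. */
        int read() {
            int b = source.byteAt(pos);
            if (b == -1) {
                return -1;               // EOF is reported by the data side
            }
            pos++;
            if (pos > knownLength) {
                knownLength = pos;       // the client learns the file has grown
            }
            return b;
        }

        /** Reads everything currently available and returns it as ASCII text. */
        String readAvailable() {
            StringBuilder sb = new StringBuilder();
            int b;
            while ((b = read()) != -1) {
                sb.append((char) b);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        FlushedData dn = new FlushedData();
        dn.appendAndFlush("entry1;".getBytes(StandardCharsets.US_ASCII));

        ReadThroughReader b = new ReadThroughReader(dn);
        System.out.println(b.readAvailable());   // prints entry1;

        dn.appendAndFlush("entry2;".getBytes(StandardCharsets.US_ASCII));
        System.out.println(b.readAvailable());   // prints entry2; without reopening
    }
}
```

With a reader like this, a polling thread such as the HLog tailer could keep one stream open and retry the read, so in this model no reopen is needed to pick up new data.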