[ https://issues.apache.org/jira/browse/HADOOP-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548911 ]
Konstantin Shvachko commented on HADOOP-1707:
---------------------------------------------

I think this patch has been tested quite thoroughly, and I don't see any algorithmic flaws in it. The logic is fairly complicated, though, so imo:
# we need better documentation, either in JavaDoc or at least in Jira;
# it would be good if you could extract the actions common to the client and the data-node into separate classes, not inner ones.

===========
DFSClient.java
- DFSClient: 4 unused variables/members.
- DFSOutputStream.lb should be a local variable.
- processDatanodeError() and DFSOutputStream.close() have common code.
- BlockReader.readChunk():
{code}
07/12/04 18:36:22 INFO fs.FSInputChecker: DFSClient readChunk got seqno 14 offsetInBlock 7168
{code}
should be logged at DEBUG level (see the first sketch after this review).
- More comments are needed: what are dataQueue, ackQueue, bytesCurBlock, etc.?
- Some of the new members in DFSOutputStream can be calculated from the others; there is no need to store them all (see the sketch below). E.g.:
{code}
private int packetSize = 0;
private int chunksPerPacket = 0;
private int chunksPerBlock = 0;
private int chunkSize = 0;
{code}
- In the line below, "8" should be defined as a constant; otherwise its meaning is not clear (a sketch naming the constants follows this review):
{code}
chunkSize = bytesPerChecksum + 8; // user data + checksum
{code}
- currentPacket should be a local variable of writeChunk().
- The 4 in the code snippet below looks mysterious:
{code}
if (len + cklen + 4 > chunkSize) {
{code}
- Why start ResponseProcessor in processDatanodeError()?
- Some methods should be moved into new inner classes; e.g. nextBlockOutputStream() should be a part of DataStreamer.
- Packet should be factored out into a separate class (probably named DataPacket). It should have serialization/deserialization methods for the packet header, which should be reused in DFSClient and the DataNodes for consistency in data transfer. It should also have readPacket() and writePacket() methods (see the DataPacket sketch below).

===========
DataNode.java
- import org.apache.hadoop.io.Text; is redundant.
- My Eclipse shows 5 variables that are "never read".
- Rather than using "4" on several occasions, a constant should be defined
{code}
SIZE_OF_INTEGER = Integer.SIZE / Byte.SIZE;
{code}
and used wherever required.
- lastDataNodeRun() should not be public.

===========
FSDataset.java
- writeToBlock(): these are two searches in the map instead of one (see the single-lookup sketch below):
{code}
if (ongoingCreates.containsKey(b)) {
  ActiveFile activeFile = ongoingCreates.get(b);
{code}
- unfinalizeBlock(): I kinda find the name funny.

===========
General
- Convert comments like // .......... to JavaDoc /** ... */ style comments when they are used as method or class headers, even for private members (see the last sketch below).
- Formatting: tabs should be replaced by 2 spaces, e.g. in ResponseProcessor.run() and DataStreamer.run().
- Formatting: long lines.
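On the readChunk log level: a minimal sketch of the DEBUG form, assuming the usual commons-logging LOG object that FSInputChecker already uses; the isDebugEnabled() guard avoids building the message string when DEBUG is off.
{code}
if (LOG.isDebugEnabled()) {
  LOG.debug("DFSClient readChunk got seqno " + seqno +
            " offsetInBlock " + offsetInBlock);
}
{code}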
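On the redundant DFSOutputStream members: a sketch of storing only the independent values and deriving the rest on demand. The exact arithmetic (e.g. whether packetSize includes header bytes) has to be taken from the patch, so treat the relationship below as an assumption.
{code}
private int chunkSize = 0;        // bytes per chunk, user data + checksum
private int chunksPerPacket = 0;  // the only other independent value

// Derived instead of stored (assumed relationship; verify against the patch):
private int packetSize() {
  return chunkSize * chunksPerPacket;
}
{code}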
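On the magic 8 and 4 in DFSClient: one way to name them, reusing the SIZE_OF_INTEGER constant suggested above for DataNode.java. Whether the 8 really decomposes as two integer-sized words is a guess; the names are illustrative only.
{code}
private static final int SIZE_OF_INTEGER = Integer.SIZE / Byte.SIZE; // == 4
// Guess: the 8 is a length word plus a checksum word; verify against the patch.
private static final int CHECKSUM_OVERHEAD = 2 * SIZE_OF_INTEGER;    // == 8

chunkSize = bytesPerChecksum + CHECKSUM_OVERHEAD;  // user data + checksum
if (len + cklen + SIZE_OF_INTEGER > chunkSize) {   // no more bare 4s
{code}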
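On factoring Packet out: a self-contained sketch of what a shared DataPacket header could look like. The field set (seqno, offsetInBlock, last-packet flag, data length) is inferred from the log line quoted above and is an assumption, not the patch's actual wire layout.
{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** Sketch of the proposed DataPacket class; fields are illustrative. */
class DataPacket {
  long seqno;                 // sequence number within the block
  long offsetInBlock;         // where this packet's data starts
  boolean lastPacketInBlock;  // true for the final packet of the block
  int dataLen;                // bytes of user data plus checksums

  /** Serialize the header; to be reused by DFSClient in writePacket(). */
  void writeHeader(DataOutput out) throws IOException {
    out.writeLong(seqno);
    out.writeLong(offsetInBlock);
    out.writeBoolean(lastPacketInBlock);
    out.writeInt(dataLen);
  }

  /** Deserialize the header; to be reused by the DataNode in readPacket(). */
  void readHeader(DataInput in) throws IOException {
    seqno = in.readLong();
    offsetInBlock = in.readLong();
    lastPacketInBlock = in.readBoolean();
    dataLen = in.readInt();
  }
}
{code}
Keeping both directions in one class is what gives the client and the data-node the consistent view of the transfer format that the review asks for.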
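On the double lookup in writeToBlock(): the single-lookup form, relying on Map.get() returning null for an absent key.
{code}
ActiveFile activeFile = ongoingCreates.get(b); // one search, not two
if (activeFile != null) {
  // ... same body as the containsKey() branch ...
}
{code}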
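On the comment style: the conversion is mechanical; processResponse() is a made-up name, used here only to show before and after.
{code}
// Before: invisible to javadoc and IDE tooltips
// handles responses from the datanodes
private void processResponse() { ... }

// After: picked up by javadoc and IDEs, even for private members
/** Handles responses from the datanodes. */
private void processResponse() { ... }
{code}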
> Remove the DFS Client disk-based cache
> --------------------------------------
>
>                 Key: HADOOP-1707
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1707
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>             Fix For: 0.16.0
>
>         Attachments: clientDiskBuffer.patch, clientDiskBuffer10.patch, clientDiskBuffer11.patch, clientDiskBuffer2.patch, clientDiskBuffer6.patch, clientDiskBuffer7.patch, clientDiskBuffer8.patch, clientDiskBuffer9.patch, DataTransferProtocol.doc, DataTransferProtocol.html
>
>
> The DFS client currently uses a staging file on local disk to cache all user-writes to a file. When the staging file accumulates 1 block worth of data, its contents are flushed to a HDFS datanode. These operations occur sequentially.
> A simple optimization of allowing the user to write to another staging file while simultaneously uploading the contents of the first staging file to HDFS will improve file-upload performance.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.