[
https://issues.apache.org/jira/browse/HADOOP-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548911
]
Konstantin Shvachko commented on HADOOP-1707:
---------------------------------------------
I think this patch has been tested quite thoroughly, and I don't see any
algorithmic flaws in it.
The logic is fairly complicated though, so in my opinion
# we need better documentation, either in JavaDoc or at least in Jira.
# it would be good if you could extract the actions common to the client and the
data-node into separate classes rather than inner ones.
=========== DFSClient.java
- DFSClient: 4 unused variables/members.
- DFSOutputStream.lb should be a local variable.
- processDatanodeError() and DFSOutputStream.close() have common code.
- BlockReader.readChunk()
{code}
07/12/04 18:36:22 INFO fs.FSInputChecker: DFSClient readChunk got seqno 14
offsetInBlock 7168
{code}
Should be logged at DEBUG level.
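For example (a sketch, assuming the message goes through DFSClient's existing
commons-logging LOG):
{code}
if (LOG.isDebugEnabled()) {
  LOG.debug("DFSClient readChunk got seqno " + seqno
            + " offsetInBlock " + offsetInBlock);
}
{code}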
- More comments needed: what are, e.g., dataQueue, ackQueue, bytesCurBlock?
- Some new members in DFSOutputStream can be calculated from the others, so there is
no need to store them all. See e.g.
{code}
private int packetSize = 0;
private int chunksPerPacket = 0;
private int chunksPerBlock = 0;
private int chunkSize = 0;
{code}
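For instance, something like the following (a sketch only; it assumes blockSize and
bytesPerChecksum are available as members, which may not match the patch exactly):
{code}
// Keep chunkSize and chunksPerPacket; derive the rest instead of storing them.
private int getPacketSize() {
  return chunksPerPacket * chunkSize;          // plus any per-packet header bytes
}
private int getChunksPerBlock() {
  return (int) (blockSize / bytesPerChecksum); // the last chunk may be partial
}
{code}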
- In the line below, "8" should be defined as a named constant. Otherwise its meaning
is not clear.
{code}
chunkSize = bytesPerChecksum + 8; // user data + checksum
{code}
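For example (the constant name is mine; whether the 8 bytes are checksum only or
checksum plus a length field should be confirmed against the patch):
{code}
// Per-chunk bytes stored in addition to the user data.
private static final int CHUNK_OVERHEAD = 8;

chunkSize = bytesPerChecksum + CHUNK_OVERHEAD; // user data + checksum
{code}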
- currentPacket should be a local variable of writeChunk()
- The 4 in the code snippet below looks mysterious:
{code}
if (len + cklen + 4 > chunkSize) {
{code}
- Why start ResponseProcessor in processDatanodeError()?
- Some methods should be moved into the new inner classes; e.g. nextBlockOutputStream()
should be part of DataStreamer.
- Packet should be factored out into a separate class (probably named DataPacket).
It should have serialization/deserialization methods for the packet header, which
should be reused in DFSClient and the DataNodes for consistency in data transfer.
It should also have readPacket() and writePacket() methods.
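Roughly along these lines (a sketch only; the header layout is illustrative and may
not match the patch exactly):
{code}
class DataPacket {
  long seqno;                // packet sequence number within the block
  long offsetInBlock;        // offset of this packet's data within the block
  boolean lastPacketInBlock; // true for the final packet of the block
  byte[] buf;                // e.g. checksums followed by user data

  /** Serialize header and payload; shared by DFSClient and DataNode. */
  void writePacket(java.io.DataOutputStream out) throws java.io.IOException {
    out.writeInt(buf.length);
    out.writeLong(offsetInBlock);
    out.writeLong(seqno);
    out.writeBoolean(lastPacketInBlock);
    out.write(buf);
  }

  /** Deserialize a packet written by writePacket(). */
  void readPacket(java.io.DataInputStream in) throws java.io.IOException {
    buf = new byte[in.readInt()];
    offsetInBlock = in.readLong();
    seqno = in.readLong();
    lastPacketInBlock = in.readBoolean();
    in.readFully(buf);
  }
}
{code}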
=========== DataNode.java
- import org.apache.hadoop.io.Text; is redundant.
- My Eclipse shows 5 variables that are "never read".
- Rather than using "4" in several places, a constant should be defined
{code}
SIZE_OF_INTEGER = Integer.SIZE / Byte.SIZE;
{code}
and used whenever required.
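A possible declaration (and, if the 4 in the DFSClient snippet above is also the size
of a serialized int, the same constant could be reused there):
{code}
static final int SIZE_OF_INTEGER = Integer.SIZE / Byte.SIZE;  // == 4

// e.g. in DFSClient, assuming that 4 is indeed an int-sized field:
//   if (len + cklen + SIZE_OF_INTEGER > chunkSize) {
{code}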
- lastDataNodeRun() should not be public
=========== FSDataset.java
- writeToBlock(): this does two map lookups where one would suffice.
{code}
if (ongoingCreates.containsKey(b)) {
ActiveFile activeFile = ongoingCreates.get(b);
{code}
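A single get() avoids the double lookup (assuming null values are never put into
ongoingCreates):
{code}
ActiveFile activeFile = ongoingCreates.get(b);
if (activeFile != null) {
  // ... use activeFile ...
}
{code}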
- unfinalizeBlock(): I find the name a bit funny.
=========== General
- Convert comments like // .......... to JavaDoc /** ... */ style comments when used
as method or class headers, even if they are private.
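For example (method name and description here are placeholders):
{code}
/** Placeholder JavaDoc header instead of a //---------- style comment. */
private void examplePrivateMethod() {
  // body unchanged
}
{code}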
- Formatting: tabs should be replaced by 2 spaces, e.g. in ResponseProcessor.run() and
DataStreamer.run().
- Formatting: long lines should be wrapped.
> Remove the DFS Client disk-based cache
> --------------------------------------
>
> Key: HADOOP-1707
> URL: https://issues.apache.org/jira/browse/HADOOP-1707
> Project: Hadoop
> Issue Type: Improvement
> Components: dfs
> Reporter: dhruba borthakur
> Assignee: dhruba borthakur
> Fix For: 0.16.0
>
> Attachments: clientDiskBuffer.patch, clientDiskBuffer10.patch,
> clientDiskBuffer11.patch, clientDiskBuffer2.patch, clientDiskBuffer6.patch,
> clientDiskBuffer7.patch, clientDiskBuffer8.patch, clientDiskBuffer9.patch,
> DataTransferProtocol.doc, DataTransferProtocol.html
>
>
> The DFS client currently uses a staging file on local disk to cache all
> user-writes to a file. When the staging file accumulates 1 block worth of
> data, its contents are flushed to a HDFS datanode. These operations occur
> sequentially.
> A simple optimization of allowing the user to write to another staging file
> while simultaneously uploading the contents of the first staging file to HDFS
> will improve file-upload performance.