[ https://issues.apache.org/jira/browse/HADOOP-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548911 ]
Konstantin Shvachko commented on HADOOP-1707:
---------------------------------------------

I think this patch has been tested quite thoroughly, and I don't see any algorithmic flaws in it. The logic is fairly complicated, though, so imo:
# we need better documentation, either in JavaDoc or at least in Jira;
# it would be good if you could extract the actions common to the client and the data-node into separate classes, not inner ones.

===========
DFSClient.java
- DFSClient: 4 unused variables/members.
- DFSOutputStream.lb should be a local variable.
- processDatanodeError() and DFSOutputStream.close() have common code.
- BlockReader.readChunk():
{code}
07/12/04 18:36:22 INFO fs.FSInputChecker: DFSClient readChunk got seqno 14 offsetInBlock 7168
{code}
should be logged at DEBUG level (see the first sketch after this review).
- More comments are needed: what are dataQueue, ackQueue, bytesCurBlock, etc.?
- Some of the new members in DFSOutputStream can be calculated from the others; there is no need to store them all (see the sketch below). E.g.:
{code}
private int packetSize = 0;
private int chunksPerPacket = 0;
private int chunksPerBlock = 0;
private int chunkSize = 0;
{code}
- In the line below, "8" should be defined as a constant; otherwise its meaning is not clear (a sketch naming the constants follows this review):
{code}
chunkSize = bytesPerChecksum + 8; // user data + checksum
{code}
- currentPacket should be a local variable of writeChunk().
- The 4 in the code snippet below looks mysterious:
{code}
if (len + cklen + 4 > chunkSize) {
{code}
- Why start ResponseProcessor in processDatanodeError()?
- Some methods should be moved into new inner classes; e.g. nextBlockOutputStream() should be a part of DataStreamer.
- Packet should be factored out into a separate class (probably named DataPacket). It should have serialization/deserialization methods for the packet header, which should be reused in DFSClient and the DataNodes for consistency in data transfer. It should also have readPacket() and writePacket() methods (see the DataPacket sketch below).

===========
DataNode.java
- import org.apache.hadoop.io.Text; is redundant.
- My Eclipse shows 5 variables that are "never read".
- Rather than using "4" on several occasions, a constant should be defined
{code}
SIZE_OF_INTEGER = Integer.SIZE / Byte.SIZE;
{code}
and used wherever required.
- lastDataNodeRun() should not be public.

===========
FSDataset.java
- writeToBlock(): these are two searches in the map instead of one (see the single-lookup sketch below):
{code}
if (ongoingCreates.containsKey(b)) {
  ActiveFile activeFile = ongoingCreates.get(b);
{code}
- unfinalizeBlock(): I kinda find the name funny.

===========
General
- Convert comments like // .......... to JavaDoc /** ... */ style comments when they are used as method or class headers, even for private members (see the last sketch below).
- Formatting: tabs should be replaced by 2 spaces, e.g. in ResponseProcessor.run() and DataStreamer.run().
- Formatting: long lines.
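On the readChunk log level: a minimal sketch of the DEBUG form, assuming the usual commons-logging LOG object that FSInputChecker already uses; the isDebugEnabled() guard avoids building the message string when DEBUG is off.
{code}
if (LOG.isDebugEnabled()) {
  LOG.debug("DFSClient readChunk got seqno " + seqno +
            " offsetInBlock " + offsetInBlock);
}
{code}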
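On the redundant DFSOutputStream members: a sketch of storing only the independent values and deriving the rest on demand. The exact arithmetic (e.g. whether packetSize includes header bytes) has to be taken from the patch, so treat the relationship below as an assumption.
{code}
private int chunkSize = 0;        // bytes per chunk, user data + checksum
private int chunksPerPacket = 0;  // the only other independent value

// Derived instead of stored (assumed relationship; verify against the patch):
private int packetSize() {
  return chunkSize * chunksPerPacket;
}
{code}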
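On the magic 8 and 4 in DFSClient: one way to name them, reusing the SIZE_OF_INTEGER constant suggested above for DataNode.java. Whether the 8 really decomposes as two integer-sized words is a guess; the names are illustrative only.
{code}
private static final int SIZE_OF_INTEGER = Integer.SIZE / Byte.SIZE; // == 4
// Guess: the 8 is a length word plus a checksum word; verify against the patch.
private static final int CHECKSUM_OVERHEAD = 2 * SIZE_OF_INTEGER;    // == 8

chunkSize = bytesPerChecksum + CHECKSUM_OVERHEAD;  // user data + checksum
if (len + cklen + SIZE_OF_INTEGER > chunkSize) {   // no more bare 4s
{code}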
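On factoring Packet out: a self-contained sketch of what a shared DataPacket header could look like. The field set (seqno, offsetInBlock, last-packet flag, data length) is inferred from the log line quoted above and is an assumption, not the patch's actual wire layout.
{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** Sketch of the proposed DataPacket class; fields are illustrative. */
class DataPacket {
  long seqno;                 // sequence number within the block
  long offsetInBlock;         // where this packet's data starts
  boolean lastPacketInBlock;  // true for the final packet of the block
  int dataLen;                // bytes of user data plus checksums

  /** Serialize the header; to be reused by DFSClient in writePacket(). */
  void writeHeader(DataOutput out) throws IOException {
    out.writeLong(seqno);
    out.writeLong(offsetInBlock);
    out.writeBoolean(lastPacketInBlock);
    out.writeInt(dataLen);
  }

  /** Deserialize the header; to be reused by the DataNode in readPacket(). */
  void readHeader(DataInput in) throws IOException {
    seqno = in.readLong();
    offsetInBlock = in.readLong();
    lastPacketInBlock = in.readBoolean();
    dataLen = in.readInt();
  }
}
{code}
Keeping both directions in one class is what gives the client and the data-node the consistent view of the transfer format that the review asks for.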
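On the double lookup in writeToBlock(): the single-lookup form, relying on Map.get() returning null for an absent key.
{code}
ActiveFile activeFile = ongoingCreates.get(b); // one search, not two
if (activeFile != null) {
  // ... same body as the containsKey() branch ...
}
{code}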
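On the comment style: the conversion is mechanical; processResponse() is a made-up name, used here only to show before and after.
{code}
// Before: invisible to javadoc and IDE tooltips
// handles responses from the datanodes
private void processResponse() { ... }

// After: picked up by javadoc and IDEs, even for private members
/** Handles responses from the datanodes. */
private void processResponse() { ... }
{code}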
> Remove the DFS Client disk-based cache
> --------------------------------------
>
>                 Key: HADOOP-1707
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1707
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>             Fix For: 0.16.0
>
>         Attachments: clientDiskBuffer.patch, clientDiskBuffer10.patch, clientDiskBuffer11.patch, clientDiskBuffer2.patch, clientDiskBuffer6.patch, clientDiskBuffer7.patch, clientDiskBuffer8.patch, clientDiskBuffer9.patch, DataTransferProtocol.doc, DataTransferProtocol.html
>
>
> The DFS client currently uses a staging file on local disk to cache all user-writes to a file. When the staging file accumulates 1 block worth of data, its contents are flushed to a HDFS datanode. These operations occur sequentially.
> A simple optimization of allowing the user to write to another staging file while simultaneously uploading the contents of the first staging file to HDFS will improve file-upload performance.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.