[jira] [Comment Edited] (HDFS-13102) Implement SnapshotSkipList class to store Multi level DirectoryDiffs

Shashikant Banerjee (JIRA) Tue, 20 Feb 2018 09:38:13 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-13102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370330#comment-16370330
 ]


Shashikant Banerjee edited comment on HDFS-13102 at 2/20/18 5:37 PM:
---------------------------------------------------------------------

Thanks [~szetszwo], for the review comments.
{code:java}
TestDirectoryDiffList does not test remove(..). As mentioned, remove(..) seems 
having some bugs.{code}
I agree that TestDirectoryDiffList does not test removes. Removes will be 
handled as a part HDFS-13171.Removes need to balance the list and hence, i 
would like to update it in a different Jira.
{code:java}
diffSetIndexList does not seem useful since it is the same as the nodes in 
level 1. BTW, diffSetIndexList is not updated when remove an element so that it 
seems a bug. I suggest removing diffSetIndexList since it can be computed if 
necessary.{code}
diffSetIndexList is a list which maintains the indices for all the 
multiLevelNodes. I think diffSetIndexList should be kept as otherwise,

for determining multiLevel nodes when sequence of delete happen will be 
cumbersome. For example,

let's say we have snapshotDiff stored for a directory  as follows with a skip 
interval of 3:

s0-->s1->s2>s3>s4>s5->s6-

In this case, s0 and s3 will be multiLevel nodes. Let's say s2 gets deleted 
followed by s3. Now locating the previous multiLevel node and next 
multiLevelNode might be a little cumbersome as we need to iterate through the 
list again and check which node is actually a multiLevel node. DiffsetIndexList 
will simplify the balancing the skipList in case of deletions. It will be 
updated accordingly when we handle deletes with HDFS-13171.
{code:java}
Pass INodeDirectory as a parameter in getSumForRange(..). Then, we could remove 
INodeDirectory dir from DirectoryDiffList{code}
We need to have the INodeDirectory passed to DirectoryDiffList, as with 
INodeDirectory reference itself, we will be able to read the configured value 
of SkipInterval. This will be a part of HDFS-13173.
{code:java}
Let's replace getSumForRange with getMinListForRange in DiffList so that we may 
implement it DiffListByArrayList using subList.{code}
getMinListForRange actually gives a list of childrenDiff(not DirectoryDiffs 
which is the basic element stored in the List). Putting this API in the 
diffList interface method might not make much sense. In case, this API seems 
suitable for DiffList interface method, we need to change the API to return 
list of DirectoryDiffs rather than childrenDiff here.

For DiffListByArrayList (which will be used to store FileDiffs in general), 
does not have childrenDiff element.


was (Author: shashikant):
Thanks [~szetszwo], for the review comments.
{code:java}
TestDirectoryDiffList does not test remove(..). As mentioned, remove(..) seems 
having some bugs.{code}
I agree that TestDirectoryDiffList does not test removes. Removes will be 
handled as a part HDFS-13171.Removes need to balance the list and hence, i 
would like to update it in a different Jira.
{code:java}
diffSetIndexList does not seem useful since it is the same as the nodes in 
level 1. BTW, diffSetIndexList is not updated when remove an element so that it 
seems a bug. I suggest removing diffSetIndexList since it can be computed if 
necessary.{code}
diffSetIndexList is a list which maintains the indices for all the 
multiLevelNodes. I think diffSetIndexList should be kept as otherwise,

for determining multiLevel nodes when sequence of delete happen will be 
cumbersome. For example,

let's say we have snapshotDiff stored for a directory  as follows with a skip 
interval of 3:

s0-->s1->s2>s3>s4>s5->s6-

In this case, s0 and s3 will be multiLevel nodes. Let's say s2 gets deleted 
followed by s3. Now locating the previous multiLevel node and

next multiLevelNode might be a little cumbersome as we need to iterate through 
the list again and check which node is actually a multiLevel node. 
DiffsetIndexList will simplify the balancing the skipList in case of deletions. 
It will be updated accordingly when we handle deletes with HDFS-13171.
{code:java}
Pass INodeDirectory as a parameter in getSumForRange(..). Then, we could remove 
INodeDirectory dir from DirectoryDiffList{code}
We need to have the INodeDirectory passed to DirectoryDiffList, as with 
INodeDirectory reference itself, we will be able to read the configured value 
of SkipInterval. This will be a part of HDFS-13173.
{code:java}
Let's replace getSumForRange with getMinListForRange in DiffList so that we may 
implement it DiffListByArrayList using subList.{code}
getMinListForRange actually gives a list of childrenDiff(not DirectoryDiffs 
which is the basic element stored in the List). Putting this API

in the diffList interface method might not make much sense. In case, this API 
seems suitable for DiffList interface method, we need to change the API to 
return list of DirectoryDiffs rather than childrenDiff here.

For DiffListByArrayList (which will be used to store FileDiffs in general), 
does not have childrenDiff element.

> Implement SnapshotSkipList class to store Multi level DirectoryDiffs
> --------------------------------------------------------------------
>
>                 Key: HDFS-13102
>                 URL: https://issues.apache.org/jira/browse/HDFS-13102
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Shashikant Banerjee
>            Assignee: Shashikant Banerjee
>            Priority: Major
>         Attachments: HDFS-13102.001.patch, HDFS-13102.002.patch, 
> HDFS-13102.003.patch
>
>
> HDFS-11225 explains an issue where deletion of older snapshots can take a 
> very long time in case the no of snapshot diffs is quite large for 
> directories. For any directory under a snapshot, to construct the children 
> list , it needs to combine all the diffs from that particular snapshot to the 
> last snapshotDiff record and reverseApply to the current children list of the 
> directory on live fs. This can take  a significant time if the no of snapshot 
> diffs are quite large and changes per diff is significant.
> This Jira proposes to store the Directory diffs in a SnapshotSkip list, where 
> we store multi level DirectoryDiffs. At each level, the Directory Diff will 
> be cumulative diff of k snapshot diffs,
> where k is the level of a node in the list. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (HDFS-13102) Implement SnapshotSkipList class to store Multi level DirectoryDiffs

Reply via email to