[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-12-31 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976379#action_12976379
 ] 

dhruba borthakur commented on HDFS-898:
---

I think we should do sequential-ids only for newly formatted clusters and not 
require that existing upgrade their disk formats.

> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
> Fix For: 0.23.0
>
> Attachments: DuplicateBlockIds.patch, FreeBlockIds.pdf, 
> HighBitProjection.pdf
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-03-01 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839870#action_12839870
 ] 

Konstantin Shvachko commented on HDFS-898:
--

Great! Yet another cluster would have had its block ids converted using 8 bit 
projection collision free. Thanks Dmytro.

> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: DuplicateBlockIds.patch, FreeBlockIds.pdf, 
> HighBitProjection.pdf
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-03-01 Thread Dmytro Molkov (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839805#action_12839805
 ] 

Dmytro Molkov commented on HDFS-898:


I just ran a tool on FB cluster. Here is the output:

Bit map applied: ff00
Number of collisions = 0
=
Number of blocks = 57909756
Number of negtive ids  = 57909756
Number of positive ids = 0
Largest segment = (-277768208, 9223372036854775807)
Segment size = 9.223372037132544E18
Expected max = 318542942464


> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: DuplicateBlockIds.patch, FreeBlockIds.pdf, 
> HighBitProjection.pdf
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-02-05 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829998#action_12829998
 ] 

Konstantin Shvachko commented on HDFS-898:
--

Yes the birthday problem is similar, but not the same. In BDP you project 
/mm/dd to . 
But in BDP years are arbitrary, while in our case -s are limited to a 
certain number as Nicholas mentioned. 
Also birthdays do not uniquely identify a person, while in our case all blocks 
have unique ids.

> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: DuplicateBlockIds.patch, HighBitProjection.pdf
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-02-04 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829897#action_12829897
 ] 

Todd Lipcon commented on HDFS-898:
--

You're completely right, now that I think more about it. However, it's a good 
sanity check since we know the probability of duplicates to be relatively 
small. Using the correct numbers, the birthday problem approximation is 0.0307 
which lines up with yours very closely. Thanks, and apologies for the junk on 
the jira.

> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: DuplicateBlockIds.patch, HighBitProjection.pdf
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-02-04 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829894#action_12829894
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-898:
-


This is not a birthday problem.  There is no limitation on the number of people 
having the same birthday but there is a limitation on block ids.

For 365 days and 2^26 people, all 2^26 people could have the same birthday.

However, for 2^(64-b) id space and B=2^26 blocks, at most two blocks could 
share the same block id after removing 1 bit.

> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: DuplicateBlockIds.patch, HighBitProjection.pdf
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-02-04 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829893#action_12829893
 ] 

Todd Lipcon commented on HDFS-898:
--

Wait, I just realized my mistake... I had used ^ instead of **. Please ignore!

> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: DuplicateBlockIds.patch, HighBitProjection.pdf
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-02-04 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829890#action_12829890
 ] 

Todd Lipcon commented on HDFS-898:
--

I get slightly different figures than you guys... I am looking at this as 
identical to the well-known Birthday Problem: 
http://en.wikipedia.org/wiki/Birthday_problem

In our case, we have 2^(64-b) "days" and 2^26 "people"

We have 2^(64-b) "days" and B=2^26 "people". Following the formula on Wikipedia:

{noformat}
In [21]: n = 2^26
In [22]: d = 2^(64-8)
In [23]: reduce(operator.mul, [(1 - float(i)/d)  for i in xrange(0, n)])
Out[23]: 0.0037908372356959502
{noformat}

whereas you've calculated 0.03065 for this case.

The python above agrees with Wikipedia for the birthday example, so I think the 
code is correct:

{noformat}
In [25]: d = 365
In [26]: n = 23
In [27]: reduce(operator.mul, [(1 - float(i)/d)  for i in xrange(0, n)])
Out[27]: 0.49270276567601451
{noformat}

Wary of floating point math, I also checked using int math to calculate 
numerator and denominator, then int division to make them smaller, then float 
division to get a fraction:
{noformat}
In [70]: num,denom = (reduce(operator.mul, [d - i for i in xrange(0, n)])), 
(d**(n))
In [71]: float(num/1)/float(denom/1)
Out[71]: 0.0037908372356959502
{noformat}

So where are our numbers diverging?

> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: DuplicateBlockIds.patch, HighBitProjection.pdf
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-02-04 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829860#action_12829860
 ] 

Konstantin Shvachko commented on HDFS-898:
--

h2. Approach #2

Here is another idea initiated by Rob Chansler, which will potentially let us 
free all positive longs for future block id generation.
Let's set 8 or less high bits of existing block ids to 1. There is a high 
probability there will be no id collisions in the resulting set.
More formally. We have a collection of blocks .  And we build a new 
collection , where Mask = ff00   . Collection 
 does not have repetitions, and there is a very good chance that 
collection  also does not have them.
The intuition here is that if you have randomly distributed points in 
n-dimensional space and you project them into (n-1)-dimensional sub-space, say 
along one of the axis, the probability that two points will be projected into 
the same image is low.

I am attaching a document which derives a formula for the probability. The 
formula was independently confirmed by Nicholas. The bottom line is this:
- The probability of getting a collision when setting the first 8 bits to 1 is 
< 0.03065
- The probability of getting a collision when setting only one high bit to 1 is 
< 0.0001221

>From practical point of view it is enough to set at least one highest bit. 
>Then we'll free the entire segment of positive longs for new block id 
>generation.

I implemented a block id analyzing routine, which combines the two approaches 
described in the issue. The routine reads the image using OIV and first tries 
to reset bits starting from the high 8 and going down to 1 highest bit whenever 
resetting more bits leads to collisions. If the routine fails to reset any bits 
collision-free, then it uses the initial approach, that is, just finds the 
biggest free gap in existing block ids.

If the bit reset approach succeeds then block replicas need to be renamed on 
data-nodes. This will be done via an upgrade. During the upgrade data-nodes 
hardlink replicas into the new directory (automatically handled by the upgrade 
framework), and then rename the newly created links reflecting the new block 
ids (the specific part of this upgrade).
In order to avoid any mishaps I also propose to assign new generation stamp to 
all blocks renamed. So that when data-nodes that missed the upgrade wake up 
they will not report old blocks. The latter is impossible, because data-nodes 
cannot join the cluster until they upgrade, but we don't want to take any 
chances.

I tested the routine on 6 images so far, containing from 150,000 to 40,000,000 
blocks. All of them successfully tolerated the reset of 8 bits without ANY 
collisions.

I would like to ask the community for some help. If people could run the tool 
on their images and post some stats or just send the results to me I'll take 
care of summarizing them.


> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
> Attachments: HighBitProjection.pdf
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-01-20 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803142#action_12803142
 ] 

Konstantin Shvachko commented on HDFS-898:
--

I ran some experiments on some large and small images as a proof of concept. 
Here is the table. 
- First line is the number blocks in the file system. The largest I had is 40 
million blocks. 
- Second line is the largest hole free of block ids.
- Third line is the minimum segment that we expect to find which is calculated 
as the ration 2 ^64^ / num_blocks.

I don't know how to right align numbers, so I used leading zeroes, hope it is 
not confusing.

| Number of blocks | 40,509,569 | 31,959,139 | 241,777 | 178,278 | 148,035 
| 
| Largest segment size | 8,623,203,281,141 | 10,662,709,581,709 | 
889,137,135,725,504 | 1,324,814,576,358,595 | 1,849,602,429,191,491 |
| Expected minimum | 0,455,367,560,644 | 00,577,197,761,694 | 
076,296,205,914,968 | 0,103,471,211,268,346 | 0,124,609,852,155,620 |

We see that selected segments are larger than the expected minimums and larger 
than 2 ^38^ = 274,877,906,944.
This speaks of the quality of the random generator, but also projects longer 
than 43 years life span with the first segment we choose.

> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-01-20 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803126#action_12803126
 ] 

Konstantin Shvachko commented on HDFS-898:
--

My main motivation is (3) - multiple name-nodes should have disjoint segments 
of ids. It would be good to do (2) - reduce the size of generation stamp. Many 
people will be happy to know some weird scenarios (1.1) like I described do not 
apply to HDFS. We cannot do anything about pre-generation-stamp blocks, I am 
just saying it exists.
In a sense current generation stamps play the role of sequential ids, wouldn't 
it be good to finally turn it right?

> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-01-20 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803092#action_12803092
 ] 

dhruba borthakur commented on HDFS-898:
---

Thanks Konstantin for the explanation. I understand the scenario much better 
now.

My view is that we do not need to solve either of the problems (1) or (2) that 
you have listed below. Is it not? Is there a use case for (1)?

> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-01-20 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803044#action_12803044
 ] 

Konstantin Shvachko commented on HDFS-898:
--

 (1) I have a block (id = 1001, gs = ), which belongs to an existing file, 
replication factor = 3. I have two replicas of this block (id = 1001, gs = 
7775) and (id = 1001, gs = 7770). This makes the block corrupt, and we don't 
remove replicas of a corrupt block. Then a third data-node starts, which was 
down for long time, it has replica (id = 1001, gs = 7), which at some point 
belonged to a different file, which was removed from the system since then. So 
I have three replicas, and the name-node does not have info to determine which 
replica is stale, and which one belonged to the removed file.
Does that make sense?

(2) Yes, all old blocks have gs=0. So if a have an old block with three valid 
replicas, and a DN with a prehistoric replica comes up, then you will not be 
able to distinguish between them. I mean that introduction of generation stamps 
does not solve the prehistoric block problem for blocks that existed before 
blocks had generation stamps.

> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-01-20 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803028#action_12803028
 ] 

dhruba borthakur commented on HDFS-898:
---

> 1. In case of corrupt blocks, when all generation stamps of its replicas are 
> less than the blocks' gen stamp

I think the Namenode can handle both stale replicas as well as prehistoric 
blocks the same. It will assume that the block does not belong to any files in 
the namespace and can safely be deleted?

> 2. The really old blocks, created before introduction of generation stamps, 
> can still cause the problem
Those datanodes will go through the upgrade process that will convert all those 
old blocks to have a generation stamp of 0, isn't it?

> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-01-20 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803024#action_12803024
 ] 

Konstantin Shvachko commented on HDFS-898:
--

You are absolutely right if everything goes well generation stamps help 
recognizing prehistoric replicas of newly created blocks.
When I say the problem is relevant I mean to emphasize two cases:
# In case of corrupt blocks, when all generation stamps of its replicas are 
less than the block's gen. stamp on the NN, we do not have means to recognize 
whether the replica is just a stale replica of this particular block or some 
other (prehistoric) block. This happens when you don't have up to date  
replicas of the block and a data-node with a prehistoric replica of previously 
removed block with the same id starts up.
# The really old blocks, created before introduction of generation stamps, can 
still cause the problem.

I agree this is not exactly the problem as you defined it, but they are still 
related problems.

> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-01-15 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800978#action_12800978
 ] 

dhruba borthakur commented on HDFS-898:
---

>The prehistoric block problem is still relevant, since the name-node's 
>block-map keys blocks by their ids, 
> see HDFS-512. Although now there is less chance to corrupt a block with a 
> stale replica, 
> because stale replicas will have older generation stamps.

Can you pl explain why this is so? I thought that the list of blocks that hang 
off a Inode still has the generation stamp in it. If a datanode sends a block 
report with a pre-historic blockId, it can be detected at the NN because the 
genstamp of that block will not match the generation stamp of the blocks that 
hang off the inode.

> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-898) Sequential generation of block ids

2010-01-13 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800083#action_12800083
 ] 

Konstantin Shvachko commented on HDFS-898:
--

h3. Problem
Currently HDFS generates ids for new blocks by randomly selecting a 64-bit 
number and verifying this id is not in the system yet. If the id is assigned to 
one of the blocks the procedure repeats again.
This was discussed in several issues before: HADOOP-2656, HADOOP-146, 
HADOOP-158.
The problem with this approach is that data-nodes that had been offline for a 
long time getting back online may bring in blocks that have already been 
eleased by the system and then their ids were regenerated and reused again for 
different blocks in different files. These are so called prehistoric blocks, 
see 
[here|https://issues.apache.org/jira/browse/HADOOP-2656?focusedCommentId=12590551&comment-tabpanel#action_12590551].
 Bringing up a data-node with a prehistoric replica can potentially corrupt the 
block.

h3. Motivation
# The prehistoric block problem is still relevant, since the name-node's 
block-map keys blocks by their ids, see HDFS-512. Although now there is less 
chance to corrupt a block with a stale replica, because stale replicas will 
have older generation stamps.
# Non-reusable block ids may let us convert generation stamps to 32-bit 
numbers. Currently generation stamp is 64-bit and is a global file system 
variable. If we are sure block ids are not reusable, then we can implement 
per-file generation stamps and 32-bit will be enough for that.
# A forward looking reason for turning into sequential ids is that if we ever 
introduce multiple or distributed name-nodes, then it will not be feasible to 
verify the existence of a particular block id across these name-nodes. 

h3. Solution
This simple solution was born some time ago in a discussion with Nicholas and 
Rob.
Suppose that you have a cluster with 64 million blocks (N = 2 ^26^). Block ids 
are 64-bit numbers there is 2 ^64^ of such numbers. If I order the existing 
block ids b ~1~ , ..., b ~N~ , where b ~i~ < b ~i+1~ . Then each i defines a 
contiguous segment (b ~i~ , b ~i+1~ ) of numbers, which does not contain other 
block ids inside it. It is easy to see that at least one segment will be of 
size 2 ^38^ = 2 ^64^ / 2 ^26^ .
So the proposal is to find such a segment and use it for generating block ids 
starting from b ~i~ + 1 by incrementing the previous generation stamp until it 
reaches b ~i+1~ .
Currently we increment the generation stamp every time the file is created or a 
block write fails. Suppose we do 200 new-generation-stamps per second. This is 
rather optimistic: I think in practice the number is lower. With that pace we 
can keep generating new stamps for 43 years. Then we will find another large 
gap in what will remain out of those 43-year-old blocks and start using this 
new segment. I don't think anybody in this community will care what may happen 
after the third segment is comsumed, but HDFS will keep looking for new 
segments and the expectation is that very few blocks will outlive their 
creators.

As the next step I'll look at some real HDFS images and verify that practice 
confirms the theory.


> Sequential generation of block ids
> --
>
> Key: HDFS-898
> URL: https://issues.apache.org/jira/browse/HDFS-898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: name-node
>Affects Versions: 0.20.1
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
> Fix For: 0.22.0
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.