[ 
https://issues.apache.org/jira/browse/HDFS-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813784#comment-13813784
 ] 

Konstantin Shvachko commented on HDFS-2832:
-------------------------------------------

> UUID#randomUUID generates RFC-4122 compliant UUIDs which are unique *for all 
> practical purposes*

RFC-4122 has a special note about "distributed applications". But let's just 
think about it in general. 
randomUUID is based on a pseudo-random sequence of numbers, which is like a 
Mobius strip, or simply a loop. It works well if you generate IDs on a 
single node, because the sequence runs for a long time without repeating. In 
our case we start thousands of pseudo-random sequences (one per node), each 
beginning at a random number. Let's mark those starting numbers on the loop. 
We have now actually decreased the probability of uniqueness, because to get a 
collision one of the nodes only needs to reach the starting point of another 
node, rather than go all the way around the loop. So in a distributed 
environment the probability of collision increases with each new node added. 
And when you add more storage types per node you increase the collision 
probability further.
"for all practical purposes", as I understand it in this case, means that the 
probability of non-unique IDs is low. But low does not mean impossible. The 
consequences of a storageID collision are pretty bad: hard to detect and hard 
to recover from. At the same time {{DataNode.createNewStorageId()}} generates 
unique IDs as of today. Why change it to a problematic approach?
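
To make the loop picture concrete, here is a toy sketch in Java. It models only 
the argument above (each node walks consecutive positions on one shared loop); 
it is not a model of the internals of UUID#randomUUID. The class name, method 
names and parameters are all hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

final class LoopOverlapSketch {

    /**
     * Toy model: every node starts at starts[i] (assumed to lie in [0, period))
     * and emits idsPerNode consecutive positions on a loop of length period.
     * Returns true if any position is emitted more than once.
     */
    static boolean hasCollision(long period, long[] starts, long idsPerNode) {
        if (starts.length == 1) {
            // A single node collides only with itself, after a full lap.
            return idsPerNode > period;
        }
        long[] sorted = starts.clone();
        Arrays.sort(sorted);
        for (int i = 0; i < sorted.length; i++) {
            long next = sorted[(i + 1) % sorted.length];
            // Distance along the loop from this start to the next one.
            long gap = Math.floorMod(next - sorted[i], period);
            if (gap < idsPerNode) {
                return true; // this node reaches its neighbour's starting point
            }
        }
        return false;
    }

    /** Fraction of random trials in which at least one collision occurs. */
    static double estimateCollisionRate(long period, int nodes, long idsPerNode,
                                        int trials, long seed) {
        Random rng = new Random(seed);
        int collisions = 0;
        for (int t = 0; t < trials; t++) {
            long[] starts = new long[nodes];
            for (int i = 0; i < nodes; i++) {
                starts[i] = Math.floorMod(rng.nextLong(), period);
            }
            if (hasCollision(period, starts, idsPerNode)) {
                collisions++;
            }
        }
        return (double) collisions / trials;
    }
}
```

In this toy model a single node needs a full lap to collide, while with several 
nodes the smallest gap between neighbouring starting points is what matters, 
and that gap tends to shrink as nodes are added. The loop lengths used here are 
tiny compared to any real generator, so the rates say nothing about absolute 
probabilities in practice.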

> Part of the rationale is in HDFS-5115. Making them UUIDs simplifies the 
> generation logic.

Looks like HDFS-5115 was based on an incomplete assumption:
bq. The Storage ID is currently generated from the DataNode's IP+Port+Random 
components
while in fact it also includes currentTime, which guarantees the uniqueness of 
IDs generated on the same node, unless somebody sets the machine clock back to 
the past.
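
As a rough illustration of the two schemes being compared, here is a minimal 
sketch in Java. It is not the actual {{DataNode.createNewStorageId()}} code: 
the "DS-" prefix, the field order, the separator and all names are assumptions 
made for the example. Only the four components (IP, port, random, currentTime) 
come from the discussion above.

```java
import java.security.SecureRandom;
import java.util.UUID;

final class StorageIdSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    /**
     * Composite ID built from random + IP + port + timestamp.
     * The "DS-" prefix and the field order are illustrative assumptions.
     */
    static String compositeStorageId(String ip, int port, int rand, long timeMillis) {
        return "DS-" + rand + "-" + ip + "-" + port + "-" + timeMillis;
    }

    /** Draws the random and time components itself (hypothetical helper). */
    static String newCompositeStorageId(String ip, int port) {
        return compositeStorageId(ip, port,
                RANDOM.nextInt(Integer.MAX_VALUE), System.currentTimeMillis());
    }

    /** The UUID-based alternative: a random (version 4) UUID as a string. */
    static String newUuidStorageId() {
        return UUID.randomUUID().toString();
    }
}
```

With the composite form, two IDs created on the same IP and port differ as 
soon as either the timestamp or the random component differs, which is why the 
clock going backwards is the case to worry about on a single node.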

> Enable support for heterogeneous storages in HDFS
> -------------------------------------------------
>
>                 Key: HDFS-2832
>                 URL: https://issues.apache.org/jira/browse/HDFS-2832
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>    Affects Versions: 0.24.0
>            Reporter: Suresh Srinivas
>            Assignee: Suresh Srinivas
>         Attachments: 20130813-HeterogeneousStorage.pdf, h2832_20131023.patch, 
> h2832_20131023b.patch, h2832_20131025.patch, h2832_20131028.patch, 
> h2832_20131028b.patch, h2832_20131029.patch, h2832_20131103.patch, 
> h2832_20131104.patch
>
>
> HDFS currently supports configuration where storages are a list of 
> directories. Typically each of these directories corresponds to a volume with 
> its own file system. All these directories are homogeneous and therefore 
> identified as a single storage at the namenode. I propose changing the 
> current model, where a Datanode *is a* storage, to one where a Datanode *is a 
> collection of* storages. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)
