[ https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317035#comment-15317035 ]

Anu Engineer commented on HDFS-7240:
------------------------------------

Hi [~eddyxu], thank you for reviewing the design doc and for your comments. Please 
see my responses below.


bq. Since Ozone is decided to use range partition, how would key / data 
distribution achieve balancing from initial state? For example, a user Foo runs 
Hive and creates 10GB of data, these data are distributed to up to 6 
(containers) DNs?

You bring up a very valid point. This was the most contentious issue in the Ozone 
world for a while. We originally went with a hash partition scheme and a secondary 
index because of exactly these concerns. The problem with that approach (and a very 
valid one) was that the secondary index is eventually consistent, which makes it 
hard to use. So we switched over to the current scheme.

Our current thinking is this: each container reports its size, number of 
operations, and number of keys to the SCM. This allows the SCM to balance the 
allocation of the key space. So if you have a large number of reads and writes 
that are completely independent, they will fill up the cluster/container space 
evenly.

But we have an opposing requirement here: generally there is locality of access 
in the namespace. In most cases, if you are reading from and writing to a bucket, 
it is most efficient to keep that data together.

Now let us look at this specific case: if containers are configured to, say, 
2 GB, then 10 GB of data maps to 5 containers. These containers will be spread 
across a set of machines by the SCM's location-choosing algorithms.
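
Just to make the sizing concrete, here is a rough sketch (the names are made up 
for illustration; this is not the real SCM API) of the kind of per-container 
report described above and of how the 10 GB / 2 GB example works out to 5 
containers:

{code:java}
// Illustrative only: a stand-in for the per-container summary the SCM would
// use for balancing, plus the container-count arithmetic from the example.
public class ContainerPlacementSketch {

  /** What a container reports to the SCM in this sketch. */
  record ContainerReport(long containerId, long usedBytes, long opCount, long keyCount) {}

  static final long CONTAINER_SIZE_BYTES = 2L * 1024 * 1024 * 1024; // 2 GB

  /** Number of containers needed for a dataset, e.g. 10 GB -> 5 containers. */
  static long containersNeeded(long datasetBytes) {
    return (datasetBytes + CONTAINER_SIZE_BYTES - 1) / CONTAINER_SIZE_BYTES;
  }

  public static void main(String[] args) {
    long tenGb = 10L * 1024 * 1024 * 1024;
    System.out.println(containersNeeded(tenGb)); // prints 5

    // A sample of the per-container summary used for balancing decisions.
    ContainerReport r = new ContainerReport(1L, CONTAINER_SIZE_BYTES, 42_000L, 1_000L);
    System.out.println(r);
  }
}
{code}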

bq. Would you explain what is the benefit of recovering failure pipeline by 
using a parallel writes to all 3 containers? It is not very clear in the design.

The point I was trying to make is that the pipeline relies on a quorum as defined 
by the replicated state machine (RSM). So if we decide to use this pipeline with 
Raft, the pipeline can be broken and we will not attempt to heal it. Please let 
me know if this makes sense.

bq. How does ozone differentiate a recover write from a malicious (or buggy) 
re-write?

Thanks for flagging this; right now we do not. We can always prevent it in the 
container layer. It is a small extension to make: write to a temporary file and 
replace the original if and only if the hashes match. I will file a work item to 
fix this.
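
Just to illustrate the idea (a rough sketch, not the actual container code; the 
names are made up), the check could look something like this:

{code:java}
// Illustrative sketch: a re-write is staged in a temporary file and only
// replaces the existing chunk if its hash matches the hash recorded when the
// chunk was first committed, so a recovery write succeeds but a malicious or
// buggy re-write with different content is rejected.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class SafeChunkRewrite {

  /** Replaces chunkFile with newData only if newData hashes to expectedHash. */
  public static void rewriteIfHashMatches(Path chunkFile, byte[] newData,
      byte[] expectedHash) throws IOException, NoSuchAlgorithmException {
    byte[] newHash = MessageDigest.getInstance("SHA-256").digest(newData);
    if (!Arrays.equals(newHash, expectedHash)) {
      // Content differs from what was originally committed: reject it.
      throw new IOException("Hash mismatch, refusing to overwrite " + chunkFile);
    }
    // Recovery write: stage in a temp file, then replace the original.
    Path tmp = chunkFile.resolveSibling(chunkFile.getFileName() + ".tmp");
    Files.write(tmp, newData);
    Files.move(tmp, chunkFile, StandardCopyOption.REPLACE_EXISTING);
  }
}
{code}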

bq. You mentioned that KMS/SCM separation is for future scalability. Do KMS / 
SCM maintains 1:1, 1:n or n:m relationship? Though it is not in this phase. I'd 
like to know whether it is considered. Btw, they are also Raft replicated?

KSM:SCM is an n:m relationship, even though in the easiest deployment 
configuration it is 1:1. So yes, it is defined that way. They are always Raft 
replicated.

bq. The raft ring / leader is per-container?

Yes and no. Let me explain this a little more. If you think only in terms of the 
Raft protocol, then the Raft leader is per machine set. That is, we are going to 
have one leader for 3 machines (assuming a 3-machine Raft ring). Now let us 
switch over to a developer's point of view. Someone like me who is writing code 
against containers thinks strictly in terms of containers. So from an Ozone 
developer's point of view, we have a Raft leader for a container. In other words, 
containers provide an abstraction that makes you think the Raft protocol is per 
container, whereas in reality it is a shared ring used by the many containers 
that share those 3 machines. This might be something that we want to explore in 
greater depth during the call.
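
Here is a toy illustration of that abstraction (not Ozone code; everything below 
is made up): several containers map to one shared Raft ring over the same 3 
datanodes, while callers only ever ask for the leader of a container.

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ContainerRingDemo {

  /** A Raft ring over a fixed set of datanodes; one of them is the leader. */
  record RaftRing(List<String> datanodes, String leader) {}

  public static void main(String[] args) {
    RaftRing sharedRing = new RaftRing(List.of("dn1", "dn2", "dn3"), "dn2");

    // Several containers are hosted on the same 3 machines and therefore
    // share the same underlying Raft ring.
    Map<Long, RaftRing> containerToRing = new HashMap<>();
    containerToRing.put(101L, sharedRing);
    containerToRing.put(102L, sharedRing);
    containerToRing.put(103L, sharedRing);

    // From the developer's point of view, each container "has" a leader,
    // even though all three answers come from the same shared ring.
    long containerId = 102L;
    System.out.println("Leader for container " + containerId + ": "
        + containerToRing.get(containerId).leader());
  }
}
{code}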

bq. For pipeline, say if we have a pipeline A->B->C, if the data is written 
successfully on A->B, and the metadata Raft writes succeed on B,C, IIUC, what 
would be the result for a read request sent to A or C?

I am going to walk through this in a little more detail, so that we are all on 
the same page.

What you are describing is a situation where the Raft leader is either B or C 
(since Raft is an active-leader protocol). For the sake of this illustration, 
let us assume we are talking about two scenarios: one where data is written to 
the leader and one other datanode, and a second where data is written to the 
followers but not to the leader.

Let us look at both in greater detail.

Case 1: Data is written to Machine B (the leader) and Machine A. But when the 
Raft commit happens, Machine A is off-line, so the Raft metadata is written to 
Machine B and Machine C.

So we have a situation where B is the only machine with both the metadata and 
the data. We deal with this issue in two ways. First, when the commit callback 
happens on C, C will check whether it has the data block; since it does not, it 
will attempt to copy that block from either B or A.

Second, when A's Raft ring comes back up, it will catch up with the Raft log, 
and the data is already available on Machine A.

So in both cases we replicate data/metadata as soon as we can. Now let us look 
at the case where a client goes to C and asks for this data block before the 
copy is done: the client will just see a slightly slower read, since Machine C 
will copy the data from Machine B or A, write it to local storage, and then 
return the data.

Case 2: Data is written to the 2 followers and the leader does not have the data 
block. The workflow is identical; the leader will copy the block from another 
machine before returning it.

Case 3: I also want to illustrate an extreme case. Let us say a client did NOT 
write any data blocks but attempted to write a key. The key will get committed, 
but the container will not be able to find the data blocks at all. Since no data 
blocks were written by the client, the copy attempt will fail, and the Raft 
leader will learn that this is a block with no replicas. This would be similar 
in nature to HDFS.
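
To tie the three cases together, here is a hedged sketch (purely illustrative 
names, not the actual container classes) of the commit-callback behaviour 
described above: when the key commit is applied on a datanode, it checks whether 
it has the data block locally and, if not, tries to copy it from the other ring 
members; if no replica exists anywhere, we are in Case 3.

{code:java}
import java.util.List;
import java.util.Optional;

public class CommitCallbackSketch {

  interface BlockStore {
    boolean hasBlock(String blockId);
    void putBlock(String blockId, byte[] data);
  }

  interface Peer {
    /** Returns the block's bytes if this peer has it, empty otherwise. */
    Optional<byte[]> fetchBlock(String blockId);
  }

  private final BlockStore localStore;
  private final List<Peer> otherRingMembers;

  CommitCallbackSketch(BlockStore localStore, List<Peer> otherRingMembers) {
    this.localStore = localStore;
    this.otherRingMembers = otherRingMembers;
  }

  /** Invoked when the Raft log entry for a key commit is applied locally. */
  void onKeyCommitted(String blockId) {
    if (localStore.hasBlock(blockId)) {
      return; // Normal case: the data arrived via the data pipeline.
    }
    // Cases 1 and 2: metadata committed here but the data is missing; pull it.
    for (Peer peer : otherRingMembers) {
      Optional<byte[]> data = peer.fetchBlock(blockId);
      if (data.isPresent()) {
        localStore.putBlock(blockId, data.get());
        return;
      }
    }
    // Case 3: no replica anywhere, e.g. the client never wrote the data.
    System.err.println("Block " + blockId + " committed but has no replicas");
  }
}
{code}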

bq. How to handle split (merge, migrate) container during writes?

I have made an eloquent argument about why we don't need to do merge in the 
first release of Ozone. 

When a split is happening, the easiest way to deal with it is to pause the 
writes.

If you don't mind, could you please take a look at 
http://schd.ws/hosted_files/apachebigdata2016/fc/Hadoop%20Object%20Store%20-%20Ozone.pdf
 , slides 38-42? 
I avoided repeating that material in the design doc, since it was already quite 
large. We can go over this in detail during the call if you like.

bq. Since container size is determined by the space usage instead of # of keys, 
would that result in large performance variance on listing operations?

You are absolutely right; it can result in variation in performance. The 
alternative we have is to use hash partitioning with secondary indices. If you 
like, we can revisit hash vs. range partitioning on the conference call. 
Last time, we decided to make range the primary method, but reserved the option 
of bringing hash partitioning back at a later stage.




> Object store in HDFS
> --------------------
>
>                 Key: HDFS-7240
>                 URL: https://issues.apache.org/jira/browse/HDFS-7240
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jitendra Nath Pandey
>            Assignee: Jitendra Nath Pandey
>         Attachments: Ozone-architecture-v1.pdf, Ozonedesignupdate.pdf, 
> ozone_user_v0.pdf
>
>
> This jira proposes to add object store capabilities into HDFS. 
> As part of the federation work (HDFS-1052) we separated block storage as a 
> generic storage layer. Using the Block Pool abstraction, new kinds of 
> namespaces can be built on top of the storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode 
> storage, but independent of namespace metadata.
> I will soon update with a detailed design document.


