[ https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285948#comment-15285948 ]

Anu Engineer commented on HDFS-7240:
------------------------------------

[~andrew.wang] Thank you for your comments. They are well thought out and 
extremely valuable questions. I will make sure that all the areas you are 
asking about are discussed in the next update of the design doc.

bq. Anu said he'd be posting a new design doc soon to address my questions.
 
I am working on that, but just to make sure your questions are not lost in the 
big picture of the design doc, I am answering them individually here.
 
bq. My biggest concern is that erasure coding is not a first-class 
consideration in this system.
Nothing in Ozone prevents a chunk from being EC-encoded. In fact, Ozone makes 
no assumptions about the location or the type of a chunk at all, so it is quite 
trivial to create a new chunk type and write it into containers. We are focused 
on the overall picture of Ozone right now, and I would welcome any contribution 
you can make on EC and Ozone chunks if that is a concern you would like us to 
address earlier. From the architecture point of view I do not see any issues.

bq. Since LevelDB is being used for metadata storage and separately being 
replicated via Raft, are there concerns about metadata write amplification?
Metadata is such a small slice of the information about a block – really, what 
you are saying is that the block name and the hash for the block get written 
twice, once through the Raft log and a second time when Raft commits that 
information. Since the data we are talking about is so small, I am not worried 
about it at all.
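To put a rough, purely illustrative number on it: if a block entry is a block 
name plus a hash – call it on the order of a hundred bytes – then writing it 
twice (once to the Raft log and once on commit) costs a few hundred bytes of 
metadata per block, against the megabytes of actual block data that entry 
describes.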
 
bq. Can we re-use the QJM code instead of writing a new replicated log 
implementation? QJM is battle-tested, and consensus is a known hard problem to 
get right.
We considered this; however, the consensus is to write a *consensus protocol* 
that is easier to understand and easier for more contributors to work on. The 
fact that QJM was not written as a library makes it very hard for us to pull it 
out in a clean fashion. Again, if you feel very strongly about it, please feel 
free to move QJM into a library which can be reused, and all of us will benefit 
from it.

bq. Are there concerns about storing millions of chunk files per disk? Writing 
each chunk as a separate file requires more metadata ops and fsyncs than 
appending to a file. We also need to be very careful to never require a full 
scan of the filesystem. The HDFS DN does full scans right now (DU, volume 
scanner).
Nothing in the chunk architecture assumes that chunks are stored as separate 
files. The fact that a chunk is a triplet \{FileName, Offset, Length\} gives 
you the flexibility to store thousands of chunks in a single physical file.
 
bq. Any thoughts about how we go about packing multiple chunks into a larger 
file?
Yes, write the first chunk and then write the second chunk to the same file. In 
fact, chunks are specifically designed to address the small-file problem, so 
two keys can point to the same file. For example:
KeyA -> \{File, 0, 100\}
KeyB -> \{File, 101, 1000\}
is a perfectly valid layout under the container architecture.
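
To make that layout concrete, here is a minimal sketch in Java; the class and 
field names (ChunkInfo, keyTable, container_0001.data) are purely illustrative 
and are not the actual Ozone code:
{code:java}
// Illustrative sketch only -- shows the (FileName, Offset, Length) idea.
import java.util.HashMap;
import java.util.Map;

public class ChunkLayoutExample {

  /** A chunk is just a triplet: which file, where it starts, how long it is. */
  static final class ChunkInfo {
    final String fileName;
    final long offset;
    final long length;

    ChunkInfo(String fileName, long offset, long length) {
      this.fileName = fileName;
      this.offset = offset;
      this.length = length;
    }
  }

  public static void main(String[] args) {
    // Two keys whose data is packed into the same physical container file.
    Map<String, ChunkInfo> keyTable = new HashMap<>();
    keyTable.put("KeyA", new ChunkInfo("container_0001.data", 0, 100));
    keyTable.put("KeyB", new ChunkInfo("container_0001.data", 101, 1000));

    // Reading KeyB means seeking to offset 101 of container_0001.data and
    // reading 1000 bytes; no per-key file is ever created.
    ChunkInfo b = keyTable.get("KeyB");
    System.out.println(b.fileName + " @" + b.offset + " +" + b.length);
  }
}
{code}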
 
bq. Merges and splits of containers. We need nice large 5GB containers to hit 
the SCM scalability targets. Together, these factors mean Ozone will likely be 
doing many more merges and splits than HBase to keep the container size high
  
Ozone actively tries to avoid merges and tries to split only when needed. A 
container can be thought of as a really large block, so I am not expecting to 
see anything other than the standard block workload on containers. The fact 
that containers can be split is something that allows us to avoid 
pre-allocation of container space. That is merely a convenience, and if you 
think of these as blocks, you will see that it is very similar.
 
Ozone will never try to do merges and splits at the level HBase does. From the 
container and Ozone perspective we are more focused on a good data distribution 
across the cluster – aka what the balancer does today – and containers are a 
flat namespace, just like blocks, which we allocate when needed.
 
So, once more, just to make sure we are on the same page: merges are rare (not 
generally required) and splits happen if we want to redistribute data on the 
same machine.
 
bq. What kind of sharing do we get with HDFS, considering that HDFS doesn't use 
block containers, and the metadata services are separate from the NN? not 
shared?

Great question. We initially started off by attacking the scalability question 
of Ozone and soon realized that HDFS scalability and Ozone scalability have to 
solve the same problems. So the container infrastructure that we have built is 
something that can be used by both Ozone and HDFS. Currently we are focused on 
Ozone, and containers will co-exist on datanodes with block pools. That is, 
Ozone should be, and will be, deployable on a vanilla HDFS cluster. In the 
future, if we want to scale HDFS, containers might be an easy way to do it.
 
bq. Any thoughts on how we will transition applications like Hive and HBase to 
Ozone? These apps use rename and directories for synchronization, which are not 
possible on Ozone.
These applications are written with the assumption of a POSIX file system, so 
migrating them to Ozone does not make much sense. The work we are doing in 
Ozone, especially the container layer, is useful in HDFS, and these 
applications might benefit indirectly.
 
bq. Have you experienced data loss from independent node failures, thus 
motivating the need for copysets? 
 
Yes, we have seen this issue in the real world. We had to pick an allocation 
algorithm, and copysets offered a set of advantages with very minimal work; 
hence the allocation choice. If you look at the paper you will see this was 
done on HDFS itself with very minimal cost, and this work originally came from 
Facebook. So it is useful in both normal HDFS and in RAMCloud.
 
bq. It's also not clear how this type of data placement meshes with EC, or the 
other quite sophisticated types of block placement we currently support in HDFS
 
Chunks will support remote blocks. That is a notion we did not discuss in the 
presentation due to time constraints. In the presentation we just showed chunks 
as files on the same machine, but a chunk could live on an EC block and a key 
could point to it.
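
Purely as an illustration of that idea – none of these types or names exist in 
the Ozone code today – a chunk reference could carry a location kind, so the 
same key table could point at a local file or at a remote, EC-encoded block 
group:
{code:java}
// Hypothetical sketch -- a chunk need not be a local file.
public class RemoteChunkSketch {

  enum ChunkKind { LOCAL_FILE, REMOTE_EC_BLOCK }

  /** A chunk reference that may point at a local file or a remote EC block. */
  static final class ChunkRef {
    final ChunkKind kind;
    final String location; // local file name, or an EC block-group identifier
    final long offset;
    final long length;

    ChunkRef(ChunkKind kind, String location, long offset, long length) {
      this.kind = kind;
      this.location = location;
      this.offset = offset;
      this.length = length;
    }
  }

  public static void main(String[] args) {
    // The same key table could hold either kind of chunk.
    ChunkRef local = new ChunkRef(ChunkKind.LOCAL_FILE, "container_0001.data", 0, 4096);
    ChunkRef ec = new ChunkRef(ChunkKind.REMOTE_EC_BLOCK, "ec-block-group-42", 0, 4096);
    System.out.println(local.kind + ", " + ec.kind);
  }
}
{code}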
 
bq. How do you plan to handle files larger than 5GB?
Large files will live on a set of containers; again, the easiest way to reason 
about containers is to think of them as blocks. In other words, just like 
normal HDFS.
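
As a hedged sketch of that model (all names here are invented for 
illustration), a large object's key simply maps to an ordered list of 
containers, much like an HDFS file maps to an ordered list of blocks:
{code:java}
// Illustrative sketch only: a large object spread over several containers.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LargeObjectSketch {
  public static void main(String[] args) {
    // A hypothetical ~12 GB object split across three containers, in order.
    Map<String, List<String>> keyToContainers = new HashMap<>();
    keyToContainers.put("bigVideo.mp4",
        Arrays.asList("container-11", "container-29", "container-37"));

    // Reading the object means reading its pieces from each container in turn.
    for (String container : keyToContainers.get("bigVideo.mp4")) {
      System.out.println("read next piece from " + container);
    }
  }
}
{code}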
 
bq. Large files right now are also not spread across multiple nodes and disks, 
limiting IO performance.
Are you saying that when you write a file of, say, 256 MB you would be 
constrained to the same machine since the block size is much larger? That is 
why we have chunks. With chunks we can easily distribute this data and leave 
pointers to those chunks. I agree with the concern, and we might have to tune 
the placement algorithms once we see more real-world workloads.

bq. Are all reads and writes served by the container's Raft master? IIUC 
that's how you get strong consistency, but it means we don't have the same 
performance benefits we have now in HDFS from 3-node replication.
Yes, and I don't think it will be a bottleneck, since we are only reading and 
writing *metadata* – which is very small – and data can be read from any 
machine.
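
As a sketch of that read path – the interfaces and method names below are 
invented for illustration, not the actual client API – only the small metadata 
lookup has to go through the Raft leader, while the bulk data read can go to 
whichever replica is closest:
{code:java}
// Hypothetical sketch of the read path described above.
public class ReadPathSketch {

  /** Stand-in for the container's Raft leader, which serves the metadata. */
  interface ContainerLeader {
    /** Returns a chunk descriptor (file, offset, length) for the key. */
    String lookupChunk(String key);
  }

  /** Stand-in for any replica of the container, which can serve the bytes. */
  interface ContainerReplica {
    byte[] readChunk(String chunkDescriptor);
  }

  static byte[] readKey(String key, ContainerLeader leader, ContainerReplica nearestReplica) {
    // 1. The small, strongly consistent metadata lookup goes to the Raft leader.
    String chunk = leader.lookupChunk(key);
    // 2. The actual data bytes can be streamed from any replica.
    return nearestReplica.readChunk(chunk);
  }
}
{code}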

bq. I also ask that more of this information and decision making be shared on 
public mailing lists and JIRA.

Absolutely, we have been working actively and posting code patches to the ozone 
branch. Feel free to ask us anything. It is kind of sad that we have to wait 
for an ApacheCon for someone to ask me these questions. I would encourage you 
to follow the JIRAs tagged with ozone to keep up with what is happening in the 
Ozone world. I can assure you all the work we are doing is always tagged with 
_ozone:_ so that when you get the mail from JIRA it is immediately visible.

bq. The KSM is not mentioned in the architecture document.
The namespace management component did not have a separate name; it was called 
SCM in the original architecture doc. Thanks to [~arpitagarwal] – he came up 
with this name (just like Ozone) and argued that we should make it a separate 
component, which will make it easier to understand and maintain. If you look at 
the code you will see that the KSM functionality currently lives in SCM. I am 
planning to move it out into KSM.

bq. the fact that the Ozone metadata is being replicated via Raft rather than 
stored in containers. 
Metadata is stored in containers, but containers need replication. Replication 
via Raft is something the design evolved to, and we are surfacing that to the 
community. Like all proposals under consideration, we will update the design 
doc and listen to community feedback. You just happened to hear our design (a 
little earlier) in person since we got a chance to meet. Writing documents 
takes far more energy and time than chatting face to face. We will soon update 
the design docs.

bq. I was not aware that there is already progress internally at Hortonworks on 
a Raft implementation.
We have been prototyping various parts of the system, including KSM, SCM, and 
other parts of Ozone.

bq. We've previously expressed interest in being involved in the design and 
implementation of Ozone, but we can't meaningfully contribute if this work is 
being done privately.
I feel that this is a completely uncalled-for statement / misplaced sentiment. 
We have 54 JIRAs on Ozone so far, and you are always welcome to ask questions 
or comment. I have personally posted a 43-page document explaining what Ozone 
would look like to users, how they can manage it, and the REST interfaces 
offered by Ozone. Unfortunately I have not seen much engagement. While I am 
sorry that you feel left out, I do not see how you can attribute it to the 
engineers who work on Ozone. We have been more than liberal about sharing 
internals, including this specific presentation we are discussing here, so I am 
completely lost when you say that this work is being done privately. In fact, I 
would request you to be active on the Ozone JIRAs: start by asking questions 
and making comments (just like you did with this one, except for this last 
remark), and I would love working with you on Ozone. It is never our intent to 
abandon our fellow contributors on this journey, but unless you participate in 
the JIRAs it is very difficult for us to know that you have more than a 
fleeting interest.
 

> Object store in HDFS
> --------------------
>
>                 Key: HDFS-7240
>                 URL: https://issues.apache.org/jira/browse/HDFS-7240
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jitendra Nath Pandey
>            Assignee: Jitendra Nath Pandey
>         Attachments: Ozone-architecture-v1.pdf, ozone_user_v0.pdf
>
>
> This jira proposes to add object store capabilities into HDFS. 
> As part of the federation work (HDFS-1052) we separated block storage as a 
> generic storage layer. Using the Block Pool abstraction, new kinds of 
> namespaces can be built on top of the storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode 
> storage, but independent of namespace metadata.
> I will soon update with a detailed design document.


