[ https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285372#comment-15285372 ]

Andrew Wang commented on HDFS-7240:
-----------------------------------

Hi all, I had the opportunity to hear more about Ozone at Apache Big Data, and 
chatted with Anu afterwards. Quite interesting, I learned a lot. Thanks Anu for 
the presentation and fielding my questions.

I'm re-posting my notes and questions here. Anu said he'd be posting a new 
design doc soon to address my questions.

Notes:

* Key Space Manager and Storage Container Manager are the "master" services in 
Ozone, and are the equivalent of FSNamesystem and the BlockManager in HDFS. 
Both are Raft-replicated services. There is a new Raft implementation being 
worked on internally.
* The block container abstraction is a mutable range of KV pairs. It's 
essentially a ~5GB LevelDB for metadata + on-disk files for the data. Container 
metadata is replicated via Raft. Container data is replicated via chain 
replication.
* Since containers are mutable and the replicas are independent, the on-disk 
state will be different. This means we need to do logical rather than physical 
replication.
* Container data is stored as chunks, where a chunk is maybe 4-8MB. Chunks are 
immutable and identified by a (file, offset, length) triplet. Currently each 
chunk is stored as a separate file (see the sketch after this list).
* Use of copysets to reduce the risk of data loss due to independent node 
failures.
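
To make the container layout concrete, here is a minimal sketch of the chunk 
metadata as I understood it from the talk. The class and the key/value 
encoding are my own illustrative assumptions, not actual Ozone code:

    // Hypothetical sketch of per-container chunk metadata (not Ozone code).
    import java.nio.charset.StandardCharsets;

    public class ChunkInfo {
        // A chunk is an immutable (file, offset, length) triplet.
        final String file;  // on-disk file holding the chunk bytes
        final long offset;  // byte offset of the chunk within that file
        final long length;  // chunk size, reportedly ~4-8 MB

        ChunkInfo(String file, long offset, long length) {
            this.file = file;
            this.offset = offset;
            this.length = length;
        }

        // The ~5GB per-container LevelDB would map object keys to lists of
        // these triplets. Only this metadata goes through Raft; the chunk
        // bytes themselves are chain-replicated outside LevelDB.
        byte[] toDbValue() {
            return (file + ":" + offset + ":" + length)
                    .getBytes(StandardCharsets.UTF_8);
        }
    }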

Questions:

* My biggest concern is that erasure coding is not a first-class consideration 
in this system, and it seems like it will be quite difficult to implement. EC 
is table stakes in the blobstore world; it's implemented by all the cloud 
blobstores I'm aware of (S3, WASB, etc.). Since containers are mutable, we are 
not able to erasure-code containers together, or else we suffer from the 
equivalent of the RAID-5 write hole (a toy illustration follows this list). 
It's the same issue we're dealing with on HDFS-7661 for hflush/hsync EC 
support. There's also the complexity that a container is replicated to 3 nodes 
via Raft, but EC data is typically stored across 14 nodes.
* Since LevelDB is used for metadata storage and is separately replicated via 
Raft, are there concerns about metadata write amplification? (A back-of-envelope 
sketch follows this list.)
* Can we re-use the QJM code instead of writing a new replicated log 
implementation? QJM is battle-tested, and consensus is a known hard problem to 
get right.
* Are there concerns about storing millions of chunk files per disk? Writing 
each chunk as a separate file requires more metadata ops and fsyncs than 
appending to a file. We also need to be very careful to never require a full 
scan of the filesystem. The HDFS DN does full scans right now (DU, volume 
scanner).
* Any thoughts about how we'd go about packing multiple chunks into a larger 
file? (One possible approach is sketched after this list.)
* Merges and splits of containers. We need nice large 5GB containers to hit the 
SCM scalability targets. However, I think we're going to have a harder time 
with this than a system like HBase. HDFS sees a relatively high delete rate for 
recently written data, e.g. intermediate data in a processing pipeline. HDFS 
also sees a much higher variance in key/value size. Together, these factors 
mean Ozone will likely be doing many more merges and splits than HBase to keep 
the container size high. This is concerning since splits and merges are 
expensive operations, and based on HBase's experience, are hard to get right.
* What kind of code sharing do we get with HDFS, considering that HDFS doesn't 
use block containers and the Ozone metadata services are separate from the NN? 
Or is nothing shared?
* Any thoughts on how we will transition applications like Hive and HBase to 
Ozone? These apps use rename and directories for synchronization, neither of 
which is possible on Ozone.
* Have you experienced data loss from independent node failures, thus 
motivating the need for copysets? I think the idea is cool (a toy version is 
sketched after this list), but the RAMCloud network hardware expectations are 
quite different from ours. Limiting the set of nodes used for re-replication 
means less flexibility to avoid top-of-rack switches and less re-replication 
parallelism. It's also not clear how this type of data placement meshes with 
EC, or with the other quite sophisticated types of block placement we 
currently support in HDFS.
* How do you plan to handle files larger than 5GB? Large files right now are 
also not spread across multiple nodes and disks, limiting IO performance.
* Are all reads and writes served by the container's Raft master? IIUC that's 
how you get strong consistency, but it means we don't have the same performance 
benefits we have now in HDFS from 3-node replication.
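
On the EC question, here's a toy illustration of the RAID-5 write hole with a 
single XOR-parity stripe. This is my own example to show why mutable 
containers and parity don't mix, not anything from the Ozone design:

    // Toy illustration of the RAID-5 write hole: updating data in place
    // without atomically updating parity corrupts later reconstruction.
    public class WriteHoleDemo {
        public static void main(String[] args) {
            int d1 = 5, d2 = 7;
            int parity = d1 ^ d2;  // parity written when the stripe is created

            // A mutable container lets d1 be rewritten in place. Suppose the
            // node crashes after the data write but before the parity write:
            d1 = 9;                // new data reached disk
            // parity = d1 ^ d2;   // ...this update never happened

            // Now lose d2 and reconstruct it from d1 and the stale parity:
            int recovered = d1 ^ parity;  // 9 ^ (5 ^ 7) = 11, not 7
            System.out.println("expected 7, got " + recovered);
        }
    }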
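
On write amplification, this is the back-of-envelope accounting I have in 
mind. Every count below is an assumption for illustration, not a measurement:

    // Back-of-envelope write amplification for one metadata put; all
    // counts here are assumptions, not measured numbers.
    public class WriteAmpEstimate {
        public static void main(String[] args) {
            int raftLog = 1;      // fsync'd append to the Raft log
            int leveldbWal = 1;   // LevelDB's own write-ahead log
            int flushToL0 = 1;    // memtable flushed to an L0 SSTable
            int compactions = 4;  // rewritten roughly once per level descended
            int perReplica = raftLog + leveldbWal + flushToL0 + compactions;
            int replicas = 3;     // Raft replication factor
            // ~21x in this sketch: each metadata byte is written many times.
            System.out.println("amplification: " + perReplica * replicas + "x");
        }
    }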
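
On packing chunks, a minimal sketch of what I mean by appending into a shared 
container file instead of creating one file per chunk. ChunkPacker and its 
behavior are hypothetical:

    // Hypothetical sketch: append chunks into one shared container file and
    // record (offset, length), rather than creating one file per chunk.
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class ChunkPacker {
        private final RandomAccessFile containerFile;

        public ChunkPacker(String path) throws IOException {
            this.containerFile = new RandomAccessFile(path, "rw");
        }

        // Returns the offset where the chunk landed; together with the
        // shared file name and the chunk length this forms the chunk's
        // (file, offset, length) triplet. One shared file avoids the
        // per-chunk create+fsync and the full-directory scans that
        // millions of small files would force on us.
        public synchronized long append(byte[] chunk) throws IOException {
            long offset = containerFile.length();
            containerFile.seek(offset);
            containerFile.write(chunk);
            containerFile.getFD().sync();  // durability point; could batch
            return offset;
        }
    }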
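
And on copysets, a toy version of the permutation-based construction from the 
copysets paper, to make the trade-off concrete. The parameters are 
illustrative:

    // Toy copyset construction (after Cidon et al.): shuffle the nodes and
    // cut each permutation into replica-sized groups. Data loss then
    // requires all members of one copyset to fail together, instead of any
    // R nodes out of N under random placement.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class CopysetDemo {
        static List<List<Integer>> build(int numNodes, int r, int permutations) {
            List<Integer> nodes = new ArrayList<>();
            for (int i = 0; i < numNodes; i++) nodes.add(i);
            List<List<Integer>> copysets = new ArrayList<>();
            Random rnd = new Random(42);
            for (int p = 0; p < permutations; p++) {
                Collections.shuffle(nodes, rnd);
                for (int i = 0; i + r <= numNodes; i += r) {
                    copysets.add(new ArrayList<>(nodes.subList(i, i + r)));
                }
            }
            return copysets;
        }

        public static void main(String[] args) {
            // 9 nodes, 3 replicas, 2 permutations -> 6 copysets, instead of
            // C(9,3) = 84 possible replica sets under random placement.
            build(9, 3, 2).forEach(System.out::println);
        }
    }

Fewer copysets lowers the odds that a simultaneous 3-node failure hits a full 
replica set, but it also restricts which nodes can receive re-replicated data, 
which is exactly the parallelism concern above.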

I also ask that more of this information and decision making be shared on 
public mailing lists and JIRA. The KSM is not mentioned in the architecture 
document, nor is the fact that the Ozone metadata is being replicated via Raft 
rather than stored in containers. I was not aware that there is already 
progress internally at Hortonworks on a Raft implementation. We've previously 
expressed interest in being involved in the design and implementation of 
Ozone, but we can't meaningfully contribute if this work is being done 
privately.

> Object store in HDFS
> --------------------
>
>                 Key: HDFS-7240
>                 URL: https://issues.apache.org/jira/browse/HDFS-7240
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jitendra Nath Pandey
>            Assignee: Jitendra Nath Pandey
>         Attachments: Ozone-architecture-v1.pdf, ozone_user_v0.pdf
>
>
> This jira proposes to add object store capabilities into HDFS. 
> As part of the federation work (HDFS-1052) we separated block storage as a 
> generic storage layer. Using the Block Pool abstraction, new kinds of 
> namespaces can be built on top of the storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode 
> storage, but independent of namespace metadata.
> I will soon update with a detailed design document.


