[ https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325389#comment-15325389 ]
Anu Engineer commented on HDFS-7240:
------------------------------------

*Ozone meeting notes – June 9th, 2016*

Attendees: Thomas Demoor, Arpit Agarwal, JV Jujjuri, Jing Zhao, Andrew Wang, Lei Xu, Aaron Myers, Colin McCabe, Aaron Fabbri, Lars Francke, Stiwari, Anu Engineer

We started the discussion with how erasure coding will be supported in ozone. This was quite a lengthy discussion, taking over half the meeting time. Jing Zhao explained the high-level architecture and pointed to similar work done by Dropbox. We then dove into the details of this problem, since we wanted to make sure that supporting erasure coding will be easy and efficient in ozone. Here are the major points: SCM currently supports a simple replicated container. To support erasure coding, SCM will have to return more than 3 machines; say we were using a 6 + 3 erasure coding model, then a container is spread across nine machines. Once we modify SCM to support this model, the container client will have to write data to those locations and update the Raft state with the metadata of this block. When a file read happens in ozone, the container client will go to KSM/SCM and find out the container to read the metadata from. The metadata will tell the client where the actual data resides, and the client will reconstruct the data from the EC-coded blocks. We all agreed that getting EC done for ozone is an important goal, and to get to that objective, we will need to get the SCM and KSM done first. We also discussed how small files will cause an issue with EC, since a container would pack lots of them together, and how this would lead to requiring compaction after deletes.

Eddy brought up the issue of making sure that data is spread evenly across the cluster. Currently our plan is to maintain a list of machines based on container reports. The container reports would contain the number of keys, bytes stored, and the number of accesses to that container.
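The 6 + 3 write path discussed above can be sketched roughly as follows. This is only an illustration, not ozone code: the function and machine names are made up, and the parity step is a placeholder XOR (which survives only one loss) rather than the real Reed-Solomon coding an actual implementation would use.

```python
# Hedged sketch of a 6 + 3 erasure-coded container write.
# Names (split_block, place_chunks, "dn0".."dn8") are illustrative only.

DATA_CHUNKS = 6
PARITY_CHUNKS = 3

def split_block(block: bytes) -> list[bytes]:
    """Split a block into 6 equal-size data chunks, zero-padding the tail."""
    size = -(-len(block) // DATA_CHUNKS)  # ceiling division
    padded = block.ljust(size * DATA_CHUNKS, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(DATA_CHUNKS)]

def xor_parity(chunks: list[bytes]) -> bytes:
    """Placeholder parity: bytewise XOR of all data chunks.
    A real implementation would use Reed-Solomon so that any 6 of the
    9 chunks suffice to rebuild the block."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

def place_chunks(block: bytes, machines: list[str]) -> dict[str, bytes]:
    """SCM hands back 9 machines; the client writes one chunk to each
    and would then record the chunk locations in the container's Raft
    state as the block's metadata."""
    assert len(machines) == DATA_CHUNKS + PARITY_CHUNKS
    data = split_block(block)
    parity = [xor_parity(data)] * PARITY_CHUNKS  # placeholder, see above
    return dict(zip(machines, data + parity))
```

On read, as described above, the client would ask KSM/SCM for the container, fetch this chunk map from the container metadata, and rebuild the block from a sufficient subset of chunks.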
Based on these container reports, SCM would be able to maintain a list that allows it to pick under-utilized machines from the cluster, thus ensuring a good data spread. Andrew Wang pointed out that counting I/O requests is not good enough and we actually need the number of bytes read/written. That is an excellent suggestion; we will modify container reports to carry this information and use it in SCM's allocation decisions.

Eddy followed up with the question of how something like Hive would behave over ozone. Say Hive creates a bucket, creates lots of tables, and after the work deletes all the tables. Ozone would have allocated containers to accommodate the overflowing bucket, so it is possible to have many empty containers on an ozone cluster. SCM is free to delete any container that does not have a key. This is because in the ozone world, metadata exists inside a container; therefore, if a container is empty, we know that no objects (ozone volume, bucket, or key) exist in that container. This gives SCM the freedom to delete any empty container, and this is how containers would be removed in the ozone world.

Andrew Wang pointed out that it is possible to create thousands of volumes and map them to a similar number of containers. He was worried that this would become a scalability bottleneck. While this is possible if you have a cluster with only volumes, the KSM is free to map as many ozone volumes to a single container as it likes. We agreed that if this indeed becomes a problem, we can write a simple compaction tool for KSM that moves these volumes to a few containers; SCM's container deletion would then kick in and clean up the cluster. We reiterated through all the scenarios for merge and concluded that for v1, ozone can live without needing to support merges of containers.

Then Eddy pointed out that by switching from hash partitions to range partitions we have introduced variability into the list operations for a container.
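Before moving on to partitioning: the report-driven allocation and empty-container cleanup described above can be sketched as a toy SCM. The class and field names here are illustrative, not the real ozone API; following Andrew's suggestion, the report carries bytes read/written rather than a raw request count.

```python
from dataclasses import dataclass

@dataclass
class ContainerReport:
    # Illustrative fields per the notes: key count, bytes stored, and
    # bytes read/written instead of a raw I/O request count.
    container_id: str
    datanode: str
    num_keys: int
    bytes_stored: int
    bytes_read: int
    bytes_written: int

class ToySCM:
    """Toy Storage Container Manager: tracks container reports, prefers
    under-utilized machines for new containers, and finds empty
    containers that are safe to delete."""

    def __init__(self):
        self.reports: dict[str, ContainerReport] = {}

    def handle_report(self, report: ContainerReport) -> None:
        self.reports[report.container_id] = report

    def _node_load(self) -> dict[str, int]:
        """Rough per-datanode load: stored bytes plus observed traffic."""
        load: dict[str, int] = {}
        for r in self.reports.values():
            load[r.datanode] = (load.get(r.datanode, 0)
                                + r.bytes_stored + r.bytes_read + r.bytes_written)
        return load

    def pick_nodes(self, nodes: list[str], count: int) -> list[str]:
        """Pick the least-loaded machines, ensuring a good data spread."""
        load = self._node_load()
        return sorted(nodes, key=lambda n: load.get(n, 0))[:count]

    def empty_containers(self) -> list[str]:
        """Since metadata lives inside the container, num_keys == 0 means
        no volume, bucket, or key references it; it is safe to delete."""
        return [cid for cid, r in self.reports.items() if r.num_keys == 0]
```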
Since it is not documented on JIRA why we switched to using range partitions, we discussed the issue which caused us to switch over. The original design called for hash partitioning, with operations like list relying on a secondary index. This would create an eventual consistency model where you might create a key, but it becomes visible in the namespace only after the secondary index is updated. Colin argued that it is easier for our users to see consistent namespace operations; this is the core reason why we moved to range partitions. However, range partitions do pose the issue that a bucket might be split across a large number of containers, so the list operation does not have fixed-time guarantees. The worst-case scenario is a bucket with thousands of 5 GB objects, which causes the bucket to be mapped over a large set of containers. This would imply that a list operation could have to read sequentially from many containers to build the list. We discussed many solutions to this problem:
• In the original design, we had proposed separate metadata and data containers. We can follow the same model, with the assumption that the data container and metadata container are on the same machine. Both Andrew and Thomas seemed to think that is a good idea.
• Anu argued that this may not be an issue, since the datanode front ends would be able to cache lots of this information as well as pre-fetch lists, since listing is a forward iteration.
• Arpit pointed out that while this is an issue that we need to tackle, we would need to build the system, measure, and choose the appropriate solution based on data.
• In an off-line conversation after the call, Jitendra pointed out that this will not have any performance impact: since each split point is well known in KSM, it is trivial to add hints/caching in the KSM layer itself to address this issue. In other words, we can issue parallel reads to all the containers; if the client wants 1000 keys and we know that we need to reach out to 3 containers to get that many keys, KSM would give us that hint.

While we agree that this is an issue that we might have to tackle eventually in the ozone world, we were not able to converge on an exact solution since we ran out of time at this point. ATM mentioned that we would benefit from getting together and doing some whiteboarding of ozone's design, and we intend to do that soon.

This was a very productive discussion and I want to thank all participants. It was a pleasure talking to all of you. Please feel free to add/edit these notes for completeness or corrections.

> Object store in HDFS
> --------------------
>
> Key: HDFS-7240
> URL: https://issues.apache.org/jira/browse/HDFS-7240
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Jitendra Nath Pandey
> Assignee: Jitendra Nath Pandey
> Attachments: Ozone-architecture-v1.pdf, Ozonedesignupdate.pdf, ozone_user_v0.pdf
>
> This jira proposes to add object store capabilities into HDFS.
> As part of the federation work (HDFS-1052) we separated block storage as a generic storage layer. Using the Block Pool abstraction, new kinds of namespaces can be built on top of the storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode storage, but independent of namespace metadata.
> I will soon update with a detailed design document.