[ https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288087#comment-15288087 ]
Andrew Wang commented on HDFS-7240:
-----------------------------------

Thanks for the reply Anu, I'd like to follow up on some points.

bq. Nothing in ozone prevents a chunk being EC encoded. In fact ozone makes no assumptions about the location or the types of chunks at all...

bq. The chunks will support remote blocks...

My understanding was that the SCM was the entity responsible for the equivalent of BlockPlacementPolicy, operating on containers. It sounds like that's incorrect, and each container is independently doing chunk placement. That raises a number of questions:

* How are we coordinating distributed data placement and replication? Are all containers heartbeating to other containers to determine liveness? Giving up global coordination of replication makes it hard to do throttling and to control use of top-of-rack switches. It also makes it harder to understand the operation of the system.
* Aren't 4MB chunks a rather small unit for cross-machine replication? We've been growing the HDFS block size over the years as networks get faster, since larger blocks amortize overheads.
* Does this also mean we have a "chunk report" from the remote chunk servers to the master?

I also still have the same questions about mutability of an EC group requiring the parities to be rewritten. How are we forming and potentially rewriting EC groups?

bq. The fact that QJM was not written as a library makes it very hard for us to pull it out in a clean fashion. Again if you feel very strongly about it, please feel free to move QJM to a library which can be reused and all of us will benefit from it.

I don't follow the argument that a new consensus implementation is more understandable than the one we've been supporting and using for years. Working with QJM, and adding support for missing functionality like multiple logs and dynamic quorum membership, would also have benefits in HDFS. I'm also just asking questions here; I'm not required to refactor QJM into a library to discuss the merits of code reuse.

bq. Nothing in the chunk architecture assumes that chunk files are separate files. The fact that a chunk is a triplet {FileName, Offset, Length} gives you the flexibility to store 1000s of chunks in a physical file.

Understood, but in this scenario how do you plan to handle compaction? We essentially need to implement mutability on top of immutability. The traditional answer here is an LSM tree, a la Kudu or HBase. If this is important, it really should be discussed. One easy option would be storing the data in LevelDB as well. I'm not sure about the performance though, and it also doesn't blend well with the mutability of EC groups.

bq. So once more – just make sure we are on the same page – Merges are rare(not required generally) and splits happen if we want to re-distribute data on a same machine.

I think I didn't explain my two points thoroughly enough. Let me try again.

The first problem is the typical write/delete pattern for a system like HDFS. IIUC in Ozone, each container is allocated a contiguous range of the keyspace by the KSM. As an example, perhaps the KSM decides to allocate the range {{(i,j]}} to a container. Then, the user decides to kick off a job that writes a whole bunch of files with the format {{ingest/file_N}}. Until we do a split, all those files are landing in that {{(i,j]}} container. So we split. Then, it's common for ingested data to be ETL'd and deleted. If we split earlier, that means we now have a lot of very small containers.
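To make that concern concrete, here's a minimal, self-contained sketch (plain Java, not Ozone code; the {{KeyRange}} class and the example ranges are hypothetical, and the KSM's actual partitioning may differ) of how a lexicographically partitioned keyspace funnels a sequential {{ingest/file_N}} workload into a single container:

{code:java}
import java.util.Arrays;
import java.util.List;

/**
 * Minimal sketch (not Ozone code) of the key-range hotspotting concern,
 * assuming the KSM hands each container a half-open lexicographic key
 * range. All class and variable names here are hypothetical.
 */
public class RangeHotspotSketch {

    /** A container owning keys in (lowExclusive, highInclusive]. */
    static class KeyRange {
        final String lowExclusive;
        final String highInclusive;

        KeyRange(String lowExclusive, String highInclusive) {
            this.lowExclusive = lowExclusive;
            this.highInclusive = highInclusive;
        }

        boolean contains(String key) {
            return key.compareTo(lowExclusive) > 0
                && key.compareTo(highInclusive) <= 0;
        }
    }

    public static void main(String[] args) {
        // Suppose the keyspace is split evenly across four containers.
        List<KeyRange> containers = Arrays.asList(
            new KeyRange("", "h"),
            new KeyRange("h", "p"),
            new KeyRange("p", "x"),
            new KeyRange("x", "\uffff"));

        // An ingest job writes 10,000 keys that share a common prefix.
        int[] hits = new int[containers.size()];
        for (int n = 0; n < 10_000; n++) {
            String key = "ingest/file_" + n;
            for (int c = 0; c < containers.size(); c++) {
                if (containers.get(c).contains(key)) {
                    hits[c]++;
                    break;
                }
            }
        }

        // Every key lands in the ("h", "p"] container: that one container
        // absorbs the entire write burst until it splits, and a later bulk
        // delete leaves the resulting split halves nearly empty.
        for (int c = 0; c < containers.size(); c++) {
            System.out.println("container " + c + ": " + hits[c] + " keys");
        }
    }
}
{code}

The prefix-ordered keys all fall into one range, so one container takes the whole write burst, splits, and then shrinks to a set of tiny containers once the ingested data is deleted.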
This kind of hotspotting is less common in HBase, since DB users aren't encoding this type of nested structure in their keys.

The other problem is that files can be pretty big. 1GB is common for data warehouses. If we have a 5GB container, a few deletes could quickly drop us below that target size. Similarly, a few additions can quickly raise us past it.

I'd appreciate an answer in light of the above concerns.

bq. So the container infrastructure that we have built is something that can be used by both ozone and HDFS...In future, if we want to scale HDFS, containers might be an easy way to do it.

This sounds like a major refactoring of HDFS. We'd need to start by splitting the FSNS and BM locks, which is a massive undertaking, and possibly incompatible for operations like setrep. Moving the BM across an RPC boundary is also a demanding task. I think a split FSN / BM is a great architecture, but it's also something that has been attempted unsuccessfully a number of times in the community.

bq. These applications are written with the assumption of a Posix file system, so migrating them to Ozone does not make much sense.

If we do not plan to support Hive and HBase, what is the envisioned set of target applications for Ozone?

bq. Yes, we have seen this issue in real world.

That's a very interesting datapoint. Could you give any more details, e.g. circumstances and frequency? I asked around internally, and AFAIK we haven't encountered this issue before.

bq. It is kind of sad that we have to wait for an apachecon for someone to ask me these questions.

bq. I feel that this is a completely uncalled for statement / misplaced sentiment here. We have 54 JIRAs on ozone so far. You are always welcome to ask questions or comment.

bq. We have been more than liberal about the sharing internals including this specific presentation we are discussing here. So I am completely lost when you say that this work is being done privately.

bq. but unless you participate in JIRAs it is very difficult for us to know that you have more than a fleeting interest.

As I said in my previous comment, it's very hard for the community to contribute meaningfully if the design is being done internally. I wouldn't have had the context to ask any of the above questions. None of the information about the KSM, Raft, or remote chunks has been mentioned in JIRA. For all I knew, we were still progressing down the design proposed in the architecture doc.

I think community interest in Ozone has also been very clearly expressed. Speaking for myself, you can see my earlier comments on this JIRA, as well as my attendance at previous community phone calls. Really though, even if the community *hadn't* explicitly expressed interest, all of this activity should *still* have been done in a public forum. It's very hard for newcomers to ramp up unless design discussions are being done publicly.

This "newcomer" issue is basically what's happening right now with our conversation. I'm sure you've already discussed many of the points I'm raising now with your colleagues. I would actually be delighted to commit my time and energy to Ozone development if I believed it were the solution to our HDFS scalability issues. However, since this work is going on internally at HWX, it's hard for me to assess, assist, or advocate for the project.
> Object store in HDFS
> --------------------
>
>                 Key: HDFS-7240
>                 URL: https://issues.apache.org/jira/browse/HDFS-7240
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jitendra Nath Pandey
>            Assignee: Jitendra Nath Pandey
>         Attachments: Ozone-architecture-v1.pdf, ozone_user_v0.pdf
>
> This jira proposes to add object store capabilities into HDFS.
> As part of the federation work (HDFS-1052) we separated block storage as a
> generic storage layer. Using the Block Pool abstraction, new kinds of
> namespaces can be built on top of the storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode
> storage, but independent of namespace metadata.
> I will soon update with a detailed design document.