[ 
https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288087#comment-15288087
 ] 

Andrew Wang commented on HDFS-7240:
-----------------------------------

Thanks for the reply, Anu. I'd like to follow up on some points.

bq. Nothing in ozone prevents a chunk being EC encoded. In fact ozone makes no 
assumptions about the location or the types of chunks at all...
bq. The chunks will support remote blocks...

My understanding was that the SCM was the entity responsible for the equivalent 
of BlockPlacementPolicy, operating on containers rather than blocks. It sounds 
like that's incorrect, and that each container independently places its own 
chunks. That raises a number of questions:

* How are we coordinating distributed data placement and replication? Are all 
containers heartbeating to other containers to determine liveness? Giving up 
global coordination of replication makes it hard to do throttling and control 
use of top-of-rack switches. It also makes it harder to understand the 
operation of the system.
* Aren't 4MB chunks a rather small unit for cross-machine replication? We've 
been growing the HDFS block size over the years as networks get faster, since 
larger units amortize per-transfer overheads (rough sketch of the arithmetic 
after this list).
* Does this also mean we have a "chunk report" from the remote chunk servers to 
the master?
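
To put rough numbers on the amortization point, here is a back-of-the-envelope 
sketch (not Ozone code; the 10 GbE link and the 10 ms of fixed per-transfer 
overhead are assumptions on my part):

{code:java}
// Toy model: amortizing a fixed per-transfer cost (connection setup, metadata
// and report traffic, seeks) over the size of the replication unit.
public class ChunkOverheadSketch {
    public static void main(String[] args) {
        double linkBytesPerSec = 1.25e9;   // ~10 GbE, assumed
        double fixedOverheadSec = 0.010;   // assumed 10 ms of fixed cost per transfer
        long[] unitSizes = {4L << 20, 64L << 20, 256L << 20}; // 4MB, 64MB, 256MB
        for (long size : unitSizes) {
            double transferSec = size / linkBytesPerSec;
            double efficiency = transferSec / (transferSec + fixedOverheadSec);
            System.out.printf("%4d MB unit -> %.0f%% of the time spent moving data%n",
                    size >> 20, efficiency * 100);
        }
    }
}
{code}

With those assumed numbers a 4MB unit spends only about a quarter of its time 
actually moving data, while a 256MB unit spends over 95%, which is why the 
block size keeps creeping up.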

I also still have the same questions about mutability of an EC group requiring 
the parities to be rewritten. How are we forming and potentially rewriting EC 
groups?

bq. The fact that QJM was not written as a library makes it very hard for us to 
pull it out in a clean fashion. Again if you feel very strongly about it, 
please feel free to move QJM to a library which can be reused and all of us 
will benefit from it.

I don't follow the argument that a new consensus implementation is more 
understandable than the one we've been supporting and using for years. Working 
with QJM, and adding support for missing functionality like multiple logs and 
dynamic quorum membership, would also have benefits in HDFS.

I'm also just asking questions here. I'm not required to refactor QJM into a 
library to discuss the merits of code reuse.

bq. Nothing in the chunk architecture assumes that chunk files are separate 
files. The fact that a chunk is a triplet {FileName, Offset, Length} gives you 
the flexibility to store 1000s of chunks in a physical file.

Understood, but in this scenario how do you plan to handle compaction? We 
essentially need to implement mutability on top of immutable storage. The 
traditional answer here is an LSM tree, à la Kudu or HBase. If this is 
important, it really should be discussed.
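
For concreteness, this is how I am picturing the triplet; the class and field 
names below are mine, not from the design, and the closing comment marks where 
the compaction question bites:

{code:java}
// Illustrative only -- an immutable (FileName, Offset, Length) chunk descriptor.
public final class ChunkInfo {
    private final String fileName; // physical file that may hold thousands of chunks
    private final long offset;     // byte offset of this chunk within that file
    private final long length;     // chunk length in bytes

    public ChunkInfo(String fileName, long offset, long length) {
        this.fileName = fileName;
        this.offset = offset;
        this.length = length;
    }

    public String getFileName() { return fileName; }
    public long getOffset()     { return offset; }
    public long getLength()     { return length; }

    // Deleting or rewriting a chunk can only invalidate its (offset, length)
    // range; the bytes stay behind in the shared physical file until some
    // compaction pass rewrites that file and updates every surviving chunk's
    // offset -- which is exactly the LSM-style machinery I'm asking about.
}
{code}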

One easy option would be storing the data in LevelDB as well. I'm not sure 
about the performance though, and it also doesn't blend well with the 
mutability of EC groups.
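
A minimal sketch of what that option could look like, assuming the 
org.iq80.leveldb Java API (the same interface Hadoop already pulls in via 
leveldbjni); the paths and key layout are made up:

{code:java}
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

import java.io.File;

import static org.iq80.leveldb.impl.Iq80DBFactory.bytes;
import static org.iq80.leveldb.impl.Iq80DBFactory.factory;

public class ChunkInLevelDbSketch {
    public static void main(String[] args) throws Exception {
        Options options = new Options().createIfMissing(true);
        try (DB db = factory.open(new File("/data/container-0001"), options)) {
            byte[] chunkData = new byte[4 << 20];               // 4MB chunk payload
            db.put(bytes("bucket1/key1#chunk-0"), chunkData);   // chunk bytes as the value
            byte[] readBack = db.get(bytes("bucket1/key1#chunk-0"));
            assert readBack.length == chunkData.length;
            // LevelDB's own background compaction absorbs deletes and overwrites,
            // which is the appeal -- but rewriting a value also invalidates any
            // parity computed over the old bytes, which is the EC concern above.
        }
    }
}
{code}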

bq. So once more – just make sure we are on the same page – Merges are rare(not 
required generally) and splits happen if we want to re-distribute data on a 
same machine.

I think I didn't explain my two points thoroughly enough. Let me try again:

The first problem is the typical write/delete pattern for a system like HDFS. 
IIUC in Ozone, each container is allocated a contiguous range of the keyspace 
by the KSM. As an example, perhaps the KSM decides to allocate the range 
{{(i,j]}} to a container. Then, the user decides to kick off a job that writes 
a whole bunch of files with the format {{ingest/file_N}}. Until we do a split, 
all those files are landing in that {{(i,j]}} container. So we split. Then, 
it's common for ingested data to be ETL'd and deleted. If we split earlier, 
that means we now have a lot of very small containers. This kind of hotspotting 
is less common in HBase, since DB users aren't encoding this type of nested 
structure in their keys.
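
To spell the hotspot scenario out with a toy example (the half-open 
lexicographic range semantics are my assumption; none of this is actual KSM 
code):

{code:java}
public class HotspotSketch {
    // true if key falls in the half-open range (low, high] assigned to one container
    static boolean inRange(String key, String low, String high) {
        return key.compareTo(low) > 0 && key.compareTo(high) <= 0;
    }

    public static void main(String[] args) {
        String low = "i", high = "j";            // the container that was handed (i, j]
        for (int n = 0; n < 5; n++) {
            String key = "ingest/file_" + n;     // every key shares the "ingest/" prefix
            System.out.println(key + " -> in (i,j]? " + inRange(key, low, high));
        }
        // Every key lands in the same container, which forces a split under the
        // ingest load; once the ETL job deletes the files we are left holding
        // several near-empty containers.
    }
}
{code}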

The other problem is that files can be pretty big. 1GB is common for data 
warehouses. If we have a 5GB container, a few deletes could quickly drop us 
below that target size. Similarly, a few additions can quickly raise us past it.
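
To put rough numbers on it: a container targeted at 5GB holds only about five 
1GB warehouse files, so deleting two of them leaves it at 3GB (40% under 
target) and writing two more pushes it to 7GB (40% over). The split/merge 
policy ends up reacting to a handful of individual file operations.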

I'd appreciate an answer in light of the above concerns.

bq. So the container infrastructure that we have built is something that can be 
used by both ozone and HDFS...In future, if we want to scale HDFS, containers 
might be an easy way to do it.

This sounds like a major refactoring of HDFS. We'd need to start by splitting 
the FSNamesystem and BlockManager locks, which is a massive undertaking and 
possibly incompatible for operations like setrep that span both. Moving the BM 
across an RPC boundary is also a demanding task.

I think a split FSN / BM is a great architecture, but it's also something that 
has been attempted unsuccessfully a number of times in the community.

bq. These applications are written with the assumption of a Posix file system, 
so migrating them to Ozone does not make much sense.

If we do not plan to support Hive and HBase, what is the envisioned set of 
target applications for Ozone?

bq. Yes, we have seen this issue in real world.

That's a very interesting datapoint. Could you give any more details, e.g. 
circumstances and frequency? I asked around internally, and AFAIK we haven't 
encountered this issue before.

bq. It is kind of sad that we have to wait for an apachecon for someone to ask 
me these questions.
bq. I feel that this is a completely uncalled for statement / misplaced 
sentiment here. We have 54 JIRAs on ozone so far. You are always welcome to ask 
questions or comment.
bq. We have been more than liberal about the sharing internals including this 
specific presentation we are discussing here. So I am completely lost when you 
say that this work is being done privately.
bq.  but unless you participate in JIRAs it is very difficult for us to know 
that you have more than a fleeting interest.

As I said in my previous comment, it's very hard for the community to 
contribute meaningfully if the design is being done internally. I wouldn't have 
had the context to ask any of the above questions. None of the information 
about the KSM, Raft, or remote chunks has been mentioned in JIRA. For all I 
knew, we were still progressing down the design proposed in the architecture 
doc.

I think community interest in Ozone has also been very clearly expressed. 
Speaking for myself, you can see my earlier comments on this JIRA, as well as 
my attendance at previous community phone calls. Really though, even if the 
community *hadn't* explicitly expressed interest, all of this activity should 
*still* have been done in a public forum. It's very hard for newcomers to ramp 
up unless design discussions are being done publicly.

This "newcomer" issue is basically what's happening right now with our 
conversation. I'm sure you've already discussed many of the points I'm raising 
now with your colleagues. I actually would be delighted to commit my time and 
energy to Ozone development if I believed it was the solution to our HDFS 
scalability issues. However, since this work is going on internally at HWX, 
it's hard for me to assess, assist, or advocate for the project.

> Object store in HDFS
> --------------------
>
>                 Key: HDFS-7240
>                 URL: https://issues.apache.org/jira/browse/HDFS-7240
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jitendra Nath Pandey
>            Assignee: Jitendra Nath Pandey
>         Attachments: Ozone-architecture-v1.pdf, ozone_user_v0.pdf
>
>
> This jira proposes to add object store capabilities into HDFS. 
> As part of the federation work (HDFS-1052) we separated block storage as a 
> generic storage layer. Using the Block Pool abstraction, new kinds of 
> namespaces can be built on top of the storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode 
> storage, but independent of namespace metadata.
> I will soon update with a detailed design document.


