[ https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487558#comment-14487558 ]

Charles Lamb commented on HDFS-7240:
------------------------------------

[~jnp] et al,

This is very interesting. Thanks for posting it.

Is the 1KB key size limit a hard limit or just a design/implementation target? 
There will be users who want keys that can be arbitrarily large (e.g., 10s to 
100s of KB). So although it may be acceptable to degrade above 1KB, I don't 
think you want to make it a hard limit. You could argue that they could just 
hash their keys, or that they could keep some sort of key map, but then it 
would be hard to do secondary indices in the future.
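
To be concrete about the hashing workaround: something like the sketch below 
(SHA-256 is just my example digest, not anything from the doc) gives a 
fixed-size key, but it destroys the key ordering you'd want for those future 
secondary indices.

{code:java}
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hash an arbitrarily large key down to a fixed 32-byte digest. This
// sidesteps a 1KB key limit, but the original key ordering is lost,
// which is what makes secondary indexing hard later.
public class KeyDigest {
  static byte[] fixedSizeKey(byte[] largeKey) throws NoSuchAlgorithmException {
    return MessageDigest.getInstance("SHA-256").digest(largeKey);
  }

  public static void main(String[] args) throws Exception {
    byte[] bigKey = new byte[50 * 1024]; // a 50KB key, well over a 1KB limit
    System.out.println(fixedSizeKey(bigKey).length + " bytes"); // "32 bytes"
  }
}
{code}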

The details of partitions are kind of lacking beyond the second-to-last 
paragraph on page 4. Are partitions and storage containers 1:1? ("A storage 
container can contain a maximum of one partition...") Obviously a storage 
container holds more than just a partition. Perhaps a little more detail about 
partitions, how they are located, etc. is warranted.

In the call flow diagram on page 6, it looks like there's a lot going on in 
terms of network traffic. There's the initial REST call, then an RPC to get the 
bucket metadata, then one to read the bucket metadata, then another to get the 
object's container location, and then back to the client, who gets redirected. 
That's a lot of REST/RPC round trips just to get to the actual data. Will any 
of this be cached, perhaps in the Ozone Handler or maybe even on the client 
(I realize that's a bit hard with a REST-based protocol)? For instance, if it 
were possible to cache some of the hash in the client, then that would cut some 
RPCs to the Ozone Handler. If the cache were out of date, then the actual call 
to the data (step (9) in the diagram) could be rejected, the cache invalidated, 
and the entire call sequence (1)-(8) re-executed to get the right location.
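
Here's a rough sketch of the invalidate-and-retry scheme I have in mind. The 
Locator and DataCall interfaces are stand-ins I invented for the (1)-(8) 
lookup path and the step (9) data read; they're not real Ozone APIs:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ContainerLocationCache {
  interface Locator { String resolve(String objectKey); }    // full (1)-(8) lookup
  interface DataCall {
    byte[] read(String location, String key) throws StaleLocationException;
  }                                                          // step (9) data read
  static class StaleLocationException extends Exception {}

  private final Map<String, String> cache = new ConcurrentHashMap<>();
  private final Locator locator;
  private final DataCall data;

  ContainerLocationCache(Locator locator, DataCall data) {
    this.locator = locator;
    this.data = data;
  }

  byte[] get(String key) throws StaleLocationException {
    String loc = cache.computeIfAbsent(key, locator::resolve);
    try {
      return data.read(loc, key);  // go straight to the data
    } catch (StaleLocationException e) {
      cache.remove(key);           // invalidate the stale entry
      loc = locator.resolve(key);  // re-run the full lookup sequence
      cache.put(key, loc);
      return data.read(loc, key);
    }
  }
}
{code}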

IWBNI there were some description of the protocols used between all these 
moving parts. I know that it's REST from client to Ozone Handler, but what 
about the other network calls in the diagram? Will they be more REST, or Hadoop 
RPC, or something else? You talk about security at the end, so I guess the 
authentication will be Kerberos-based? Or will you allow more authentication 
options, such as those that HDFS currently has?

Hash partitioning can also suffer from hotspots depending on the semantics of 
the key. That's not to say that it's the wrong decision to use it, only that it 
can have similar drawbacks to key partitioning. Since it looks like you have 
two separate hashes, one for buckets and then one for the object key within 
the bucket, it is possible that there could be hotspots on a particular 
bucket. Presumably some sort of caching would help here since the bucket 
mapping is relatively immutable.
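
A toy illustration of the hot-bucket concern, with hash functions and counts 
that are entirely mine, not from the doc: no matter how well the object keys 
themselves hash, every access to a popular bucket lands behind that one 
first-level mapping.

{code:java}
public class TwoLevelHash {
  // First-level hash: picks the partition for a volume/bucket pair.
  static int bucketPartition(String volume, String bucket, int numPartitions) {
    return Math.floorMod((volume + "/" + bucket).hashCode(), numPartitions);
  }

  // Second-level hash: places the object key within the bucket.
  static int objectSlot(String objectKey, int slotsPerPartition) {
    return Math.floorMod(objectKey.hashCode(), slotsPerPartition);
  }

  public static void main(String[] args) {
    // All traffic to one hot bucket hits the same partition, regardless
    // of how evenly the object keys spread across slots.
    int p = bucketPartition("vol1", "hot-bucket", 64);
    for (String key : new String[] {"a", "b", "c"}) {
      System.out.printf("partition=%d slot=%d%n", p, objectSlot(key, 1024));
    }
  }
}
{code}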

Secondary indexing will not be easy in a distributed, sharded system, 
especially given the consistency issues in dealing with updates. That said, I 
am reasonably certain that many users will need this feature soon enough that 
it should be high on the roadmap.

You don't say much about ACLs other than to include them in the REST API. I 
suppose they'll be implemented in the Ozone Handler, but what will they look 
like? HDFS/Linux ACLs?

In the Cluster Level APIs, presumably DELETE Storage Volume is restricted to 
the admin. What about GET?

How are quotas enabled and set? I don't see it in the API anywhere. There's 
mention early on that they're set up by the administrator. Perhaps it's via 
some HTTP/JSP interface to the Ozone Handler or Storage Container Manager? Who 
enforces them?
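
Just to illustrate what I'm asking for, an admin quota call might look 
something like the sketch below, but I want to stress that the endpoint, 
verb, and body here are entirely invented; nothing in the doc defines them.

{code:java}
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SetVolumeQuota {
  public static void main(String[] args) throws Exception {
    // Hypothetical endpoint on the Ozone Handler; not defined in the doc.
    URL url = new URL("http://ozone-handler:9864/vol1?quota");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/json");
    byte[] body = "{\"spaceQuota\":\"10TB\"}".getBytes(StandardCharsets.UTF_8);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body); // made-up JSON payload for the quota value
    }
    System.out.println("HTTP " + conn.getResponseCode());
  }
}
{code}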

"no guarantees on partially written objects" - Does this also mean that there 
are no block-order guarantees during write? Are "holey" objects allowed or will 
the only inconsistencies be at the tail of an object. This is obviously 
important for log-based storage systems.

In the "Size requirements" section on page 3 you say "Number of objects per 
bucket: 1 million", and then later on you say "A bucket can have millions of 
objects". You may want to shore that up a little.

Also in the Size requirements section you say "Object Size: 5G", but then later 
it says "The storage container needs to store object data that can vary from a 
few hundred KB to hundreds of megabytes". I'm not sure those are necessarily 
inconsistent, but I'm also not sure how to reconcile them.

Perhaps you could include a diagram showing how an object maps to partitions 
and storage containers and then onto DNs. In other words, a general diagram 
showing all the various storage concepts (objects, partitions, storage 
containers, hash tables, transactions, etc.).

"We plan to re-use Namenode's block management implementation for container 
management, as much as possible." I'd love to see more detail on what can be 
reused, what high level changes to the BlkMgr code will be needed, what 
existing APIs (RPCs) you'll continue to use or need to be changed, etc.

wrt the storage container prototype using leveldbjni: where would this fit 
into the scheme of things? I get the impression that it would just be a backend 
storage component for the storage container manager. Would it use HDFS blocks 
as its persistent storage? Presumably not. Maybe a little bit more detail here?
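
For reference, here's the kind of minimal leveldbjni usage I'm picturing: a 
LevelDB instance on the DN's local filesystem rather than on HDFS blocks. The 
path and keys are made up; only the leveldbjni calls themselves are real.

{code:java}
import static org.fusesource.leveldbjni.JniDBFactory.bytes;
import static org.fusesource.leveldbjni.JniDBFactory.factory;

import java.io.File;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

public class ContainerStoreSketch {
  public static void main(String[] args) throws Exception {
    Options options = new Options();
    options.createIfMissing(true);
    // Local directory on the DN, outside HDFS block storage.
    DB db = factory.open(new File("/data/dn1/container-0042"), options);
    try {
      // Key -> object-data mapping held inside the container.
      db.put(bytes("bucket1/photo.jpg"), new byte[] {1, 2, 3});
      byte[] value = db.get(bytes("bucket1/photo.jpg"));
      System.out.println("read " + value.length + " bytes");
    } finally {
      db.close();
    }
  }
}
{code}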

s/where where/where/


> Object store in HDFS
> --------------------
>
>                 Key: HDFS-7240
>                 URL: https://issues.apache.org/jira/browse/HDFS-7240
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jitendra Nath Pandey
>            Assignee: Jitendra Nath Pandey
>         Attachments: Ozone-architecture-v1.pdf
>
>
> This jira proposes to add object store capabilities into HDFS. 
> As part of the federation work (HDFS-1052) we separated block storage as a 
> generic storage layer. Using the Block Pool abstraction, new kinds of 
> namespaces can be built on top of the storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode 
> storage, but independent of namespace metadata.
> I will soon update with a detailed design document.


