[ 
https://issues.apache.org/jira/browse/HDFS-12589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191054#comment-16191054
 ] 

Ewan Higgs commented on HDFS-12589:
-----------------------------------

Some discussion happened off JIRA, but we'd much prefer these discussions to be 
in the open and tracked:

From [~ehiggs]:
{quote}
Regarding the BlockAlias, we suggested that we get rid of it since the 
interface is insufficient to work with and it’s not clear how it should be used 
to e.g. dispatch writes to the correct ProvidedVolumeImpl. We proposed to 
replace this with having new styles of retrieving data use their own URI scheme 
(e.g. myformat://). Also, if there are other requirements, an extra byte[] in 
the FileRegion could hold additional information for a custom 
ProvidedVolumeImpl to use.
{quote}
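
To make the proposal above concrete, here is a minimal sketch of dispatching on the URI scheme. It is not the HDFS code: {{FileRegionSketch}}, {{ProvidedStoreSketch}} and {{SchemeDispatcher}} are hypothetical names invented for illustration, and the real {{FileRegion}} and {{ProvidedVolumeImpl}} carry more state than is shown here.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical, simplified stand-in for FileRegion: a location in the remote
// store plus an opaque payload that a custom implementation may interpret.
final class FileRegionSketch {
    final long blockId;
    final URI location;
    final long offset;
    final long length;
    final byte[] extra; // the proposed extra byte[]; empty when unused

    FileRegionSketch(long blockId, URI location, long offset, long length, byte[] extra) {
        this.blockId = blockId;
        this.location = location;
        this.offset = offset;
        this.length = length;
        this.extra = extra.clone();
    }
}

// Hypothetical stand-in for whatever a ProvidedVolumeImpl would expose.
interface ProvidedStoreSketch {
    String describeRead(FileRegionSketch region);
}

// Picks a handler from the URI scheme alone, so no separate alias type is needed.
final class SchemeDispatcher {
    private final Map<String, ProvidedStoreSketch> handlers = new HashMap<>();

    void register(String scheme, ProvidedStoreSketch handler) {
        handlers.put(scheme.toLowerCase(Locale.ROOT), handler);
    }

    ProvidedStoreSketch lookup(FileRegionSketch region) {
        String scheme = region.location.getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("URI has no scheme: " + region.location);
        }
        ProvidedStoreSketch handler = handlers.get(scheme.toLowerCase(Locale.ROOT));
        if (handler == null) {
            throw new IllegalArgumentException("No provided store registered for scheme: " + scheme);
        }
        return handler;
    }
}

class SchemeDispatchDemo {
    public static void main(String[] args) {
        SchemeDispatcher dispatcher = new SchemeDispatcher();
        dispatcher.register("myformat",
            r -> "myformat read of " + r.length + " bytes at offset " + r.offset);
        FileRegionSketch region = new FileRegionSketch(
            1L, URI.create("myformat://bucket/part-0"), 0L, 1024L, new byte[0]);
        System.out.println(dispatcher.lookup(region).describeRead(region));
    }
}
```

The sketch keys the lookup on the scheme only. Whether authority or principal should also take part in the lookup is left open here.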

From [~chris.douglas]:
{quote}
There’s a long tradition of stuffing metadata into URIs, so I won’t argue that 
this restricts possible implementations. As we discussed during the call, if 
there are a set of possible providers, an alias doesn’t contain enough 
information to dispatch among them. Since Hadoop already has a mechanism, 
kludgy as it may be, for looking up different FileSystems based on Path/URIs, 
we could use the existing scheme/authority/principal cache instead of layering 
another layer of indirection on top of it.
 
I’ll outline my reservations. The existing object store “FileSystem” 
implementations already manage some impedance mismatches, translating 
hierarchical operations into those stores. Moreover, the layers people are 
adding to HDFS in HBase, Hive/LLAP, etc. are working around the namesystem, 
mostly treating HDFS as if it were a (not particularly good) object store. If 
we make everything into a FileRegion, we’re baking in the FileSystem coupling 
between the HDFS block layer and the provided store. We’re baking in its 
versatility, which is likely sufficient, but also its disadvantages.
 
For example, there are no good batch APIs to FileSystem. There are no 
reasonable async APIs, and the ones being built have no consistency guarantees. 
We’ve been trying to introduce an API providing the most basic consistency 
guarantee, and that’s taken a year of negotiation and prototyping.
 
Thomas/Ewan, you guys are more familiar with the limitations of S3Guard than I 
am. If those won’t materially affect the implementation of future provided 
stores (or those invariants are useful to their implementation) then I won’t 
insist on an abstraction that only gets in the way of implementation. -C
{quote}

> [DISCUSS] Provided Storage BlockAlias Refactoring
> -------------------------------------------------
>
>                 Key: HDFS-12589
>                 URL: https://issues.apache.org/jira/browse/HDFS-12589
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs
>            Reporter: Ewan Higgs
>            Priority: Minor
>
> A BlockAlias is an interface used by the Datanode to determine where to 
> retrieve data from. It currently has a single implementation: {{FileRegion}} 
> which contains the Block, BlockPoolID, Provided URL for the FileRegion (i.e. 
> block); and length and offset of the FileRegion in the remote storage.
> The BlockAlias currently has a single method: {{getBlock}}. This is not 
> particularly useful since we can't ask it meaningful questions like 'how do 
> we retrieve the data from the external storage system?' or 'is the version 
> of the block in the external storage system up to date?'. Either we can do 
> away with the BlockAlias altogether and work with FileRegion, or the 
> BlockAlias needs to be made more robust.
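
The limitation in the description can be illustrated with a small Java sketch. The types are simplified stand-ins with hypothetical names ({{BlockSketch}}, {{BlockAliasSketch}}, {{FileRegionShape}}, {{AliasReader}}), not the actual HDFS classes. Only the single {{getBlock}} method and the listed FileRegion fields are taken from the description.

```java
import java.net.URI;

// Simplified stand-in for the HDFS Block type.
final class BlockSketch {
    final long blockId;

    BlockSketch(long blockId) {
        this.blockId = blockId;
    }
}

// The interface as described: one method, and nothing about where the data lives.
interface BlockAliasSketch {
    BlockSketch getBlock();
}

// The single implementation, carrying the fields listed in the description.
final class FileRegionShape implements BlockAliasSketch {
    final BlockSketch block;
    final String blockPoolId;
    final URI location;
    final long offset;
    final long length;

    FileRegionShape(BlockSketch block, String blockPoolId, URI location, long offset, long length) {
        this.block = block;
        this.blockPoolId = blockPoolId;
        this.location = location;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public BlockSketch getBlock() {
        return block;
    }
}

final class AliasReader {
    // With only getBlock() on the interface, a caller has to downcast to reach
    // the location. Any other alias type leaves it with nothing to read from.
    static String locate(BlockAliasSketch alias) {
        if (alias instanceof FileRegionShape) {
            FileRegionShape region = (FileRegionShape) alias;
            return region.location + "@" + region.offset + "+" + region.length;
        }
        throw new UnsupportedOperationException(
            "Cannot locate data for block " + alias.getBlock().blockId);
    }
}
```

Under these assumptions the interface adds nothing a caller can use without knowing the concrete type. That is the argument for either dropping it in favour of FileRegion or giving it more methods.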



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

