[jira] [Commented] (HDFS-13186) [PROVIDED Phase 2] Multipart Uploader API

Ewan Higgs (JIRA) Thu, 07 Mar 2019 01:01:15 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786534#comment-16786534
 ]


Ewan Higgs commented on HDFS-13186:
-----------------------------------

{quote}The HADOOP-15691 PathCapabilities patch is intended to allow callers to 
probe for a feature being available before making the API Call. This'd let you 
go{quote}
A capability model is much better, indeed.

{quote}Bear in mind I also want to move the MPU API to being async block 
uploads, complete calls. For the classic local and HDFS stores, these would 
actually be done in the current thread. For S3 they'd run in the thread pool, 
so you could trivially kick off a parallel upload of blocks from a single 
thread without even knowing that the FS impl worked that way.{quote}
This is a good idea. When designing the API, I thought we wanted to stick to a 
synchronous model for consistency with the rest of the APIs, but async is a 
much better fit here: these are remote calls, and we don't do anything like 
locking (which can be hairy in async code).
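
To make the async idea concrete, here is a minimal sketch of how a caller 
could start several part uploads from one thread without knowing whether the 
store does the work inline or in a pool. Everything in it is an assumption for 
illustration: AsyncUploader, SameThreadUploader, PooledUploader and the String 
part handles are hypothetical names, not the Hadoop MultipartUploader API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative sketch only; all types here are hypothetical. */
public class AsyncUploadSketch {

  /** Hypothetical async uploader: putPart returns a future for the part's handle. */
  interface AsyncUploader {
    CompletableFuture<String> putPart(int partNumber, byte[] data);
  }

  /** Local/HDFS-style store: the work is done in the calling thread. */
  static class SameThreadUploader implements AsyncUploader {
    final Map<Integer, byte[]> parts = new ConcurrentHashMap<>();

    public CompletableFuture<String> putPart(int partNumber, byte[] data) {
      parts.put(partNumber, data);
      // The future is already complete when it is returned.
      return CompletableFuture.completedFuture("part-" + partNumber);
    }
  }

  /** Object-store-style store: the work runs in a thread pool. */
  static class PooledUploader implements AsyncUploader {
    final Map<Integer, byte[]> parts = new ConcurrentHashMap<>();
    private final ExecutorService pool;

    PooledUploader(ExecutorService pool) {
      this.pool = pool;
    }

    public CompletableFuture<String> putPart(int partNumber, byte[] data) {
      return CompletableFuture.supplyAsync(() -> {
        parts.put(partNumber, data);
        return "part-" + partNumber;
      }, pool);
    }
  }

  /** Caller code is identical for both implementations. */
  static List<String> uploadAll(AsyncUploader uploader, List<byte[]> blocks) {
    List<CompletableFuture<String>> pending = new ArrayList<>();
    for (int i = 0; i < blocks.size(); i++) {
      // Start every part before waiting on any of them.
      pending.add(uploader.putPart(i + 1, blocks.get(i)));
    }
    List<String> handles = new ArrayList<>();
    for (CompletableFuture<String> future : pending) {
      handles.add(future.join()); // wait for each part, in part order
    }
    return handles;
  }

  public static void main(String[] args) {
    List<byte[]> blocks = Arrays.asList(
        "aaa".getBytes(StandardCharsets.UTF_8),
        "bbb".getBytes(StandardCharsets.UTF_8));
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      System.out.println(uploadAll(new SameThreadUploader(), blocks));
      System.out.println(uploadAll(new PooledUploader(pool), blocks));
    } finally {
      pool.shutdown();
    }
  }
}
```

The caller holds no locks and shares no mutable state with the uploader beyond 
the returned futures, which is why the async shape stays simple here.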

> [PROVIDED Phase 2] Multipart Uploader API
> -----------------------------------------
>
>                 Key: HDFS-13186
>                 URL: https://issues.apache.org/jira/browse/HDFS-13186
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Ewan Higgs
>            Assignee: Ewan Higgs
>            Priority: Major
>             Fix For: 3.2.0
>
>         Attachments: HDFS-13186.001.patch, HDFS-13186.002.patch, 
> HDFS-13186.003.patch, HDFS-13186.004.patch, HDFS-13186.005.patch, 
> HDFS-13186.006.patch, HDFS-13186.007.patch, HDFS-13186.008.patch, 
> HDFS-13186.009.patch, HDFS-13186.010.patch
>
>
> To write files in parallel to an external storage system as in HDFS-12090, 
> there are two approaches:
>  # Naive approach: use a single datanode per file that copies blocks locally 
> as it streams data to the external service. This requires a copy for each 
> block inside the HDFS system and then a copy for the block to be sent to the 
> external system.
>  # Better approach: a single point (e.g. Namenode or SPS style external 
> client) and Datanodes coordinate in a multipart/multinode upload.
> This system needs to work with multiple back ends and needs to coordinate 
> across the network. So we propose an API that resembles the following:
> {code:java}
> public UploadHandle multipartInit(Path filePath) throws IOException;
> public PartHandle multipartPutPart(InputStream inputStream,
>     int partNumber, UploadHandle uploadId) throws IOException;
> public void multipartComplete(Path filePath,
>     List<Pair<Integer, PartHandle>> handles, 
>     UploadHandle multipartUploadId) throws IOException;{code}
> Here, UploadHandle and PartHandle are opaque handles in the vein of 
> PathHandle, so they can be serialized and deserialized in the hadoop-hdfs 
> project without knowledge of how to deserialize e.g. S3A's version of an 
> UploadHandle and PartHandle.
> In an object store such as S3A, the implementation is straightforward. In 
> the case of writing multipart/multinode to HDFS, we can write each block as a 
> file part. The complete call will perform a concat on the blocks.
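
The init / put-part / complete flow described above can be sketched with an 
in-memory stand-in. This is only an illustration under stated assumptions: the 
class InMemoryMultipartSketch, its String-based handles, the use of a String 
path instead of Path, and Map.Entry instead of Pair are all hypothetical 
simplifications, not the code in the attached patches.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** In-memory stand-in for the proposed three-call API; names are hypothetical. */
public class InMemoryMultipartSketch {

  /** Opaque to callers: they only pass it back to the store. */
  static final class UploadHandle {
    final String id;
    UploadHandle(String id) { this.id = id; }
  }

  /** Opaque to callers, like UploadHandle. */
  static final class PartHandle {
    final String id;
    PartHandle(String id) { this.id = id; }
  }

  private final Map<String, Map<String, byte[]>> uploads = new HashMap<>();
  private final Map<String, byte[]> files = new HashMap<>();

  /** Starts an upload; this sketch does not need filePath until complete. */
  public UploadHandle multipartInit(String filePath) {
    String id = UUID.randomUUID().toString();
    uploads.put(id, new HashMap<>());
    return new UploadHandle(id);
  }

  /** Stores one part; parts may arrive in any order. */
  public PartHandle multipartPutPart(InputStream in, int partNumber,
      UploadHandle uploadId) throws IOException {
    Map<String, byte[]> parts = uploads.get(uploadId.id);
    if (parts == null) {
      throw new IOException("Unknown upload: " + uploadId.id);
    }
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    int read;
    while ((read = in.read(chunk)) != -1) {
      buffer.write(chunk, 0, read);
    }
    String partId = uploadId.id + "/" + partNumber;
    parts.put(partId, buffer.toByteArray());
    return new PartHandle(partId);
  }

  /** Concatenates the parts in part-number order into the final file. */
  public void multipartComplete(String filePath,
      List<Map.Entry<Integer, PartHandle>> handles,
      UploadHandle uploadId) throws IOException {
    Map<String, byte[]> parts = uploads.get(uploadId.id);
    if (parts == null) {
      throw new IOException("Unknown upload: " + uploadId.id);
    }
    List<Map.Entry<Integer, PartHandle>> sorted = new ArrayList<>(handles);
    sorted.sort((a, b) -> Integer.compare(a.getKey(), b.getKey()));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (Map.Entry<Integer, PartHandle> entry : sorted) {
      byte[] data = parts.get(entry.getValue().id);
      if (data == null) {
        throw new IOException("Unknown part: " + entry.getValue().id);
      }
      out.write(data);
    }
    files.put(filePath, out.toByteArray());
    uploads.remove(uploadId.id);
  }

  public byte[] read(String filePath) {
    return files.get(filePath);
  }

  private static InputStream stream(String text) {
    return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
  }

  /** Uploads two parts out of order and returns the assembled content. */
  static String demo() {
    try {
      InMemoryMultipartSketch store = new InMemoryMultipartSketch();
      UploadHandle upload = store.multipartInit("/tmp/example");
      PartHandle second = store.multipartPutPart(stream("world"), 2, upload);
      PartHandle first = store.multipartPutPart(stream("hello "), 1, upload);
      List<Map.Entry<Integer, PartHandle>> handles = new ArrayList<>();
      handles.add(new SimpleEntry<>(2, second));
      handles.add(new SimpleEntry<>(1, first));
      store.multipartComplete("/tmp/example", handles, upload);
      return new String(store.read("/tmp/example"), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints "hello world"
  }
}
```

Because the handles are only passed back to the store that issued them, the 
parts could in principle be written by different nodes and completed by a 
single coordinator, which is the multinode case the description has in mind.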



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
