[ https://issues.apache.org/jira/browse/HDFS-12090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079375#comment-16079375 ]

Virajith Jalaparti commented on HDFS-12090:
-------------------------------------------

Thanks for taking a look at the design document and for your comments, 
[~rakeshr]. Responses below.

# Setting the {{StoragePolicy}} to include the PROVIDED storage type will 
result in data movement only if there is an external store to which the 
file(s) can be moved. We don't necessarily have to enforce an order between 
the two operations. However, it must then be understood that a policy with 
PROVIDED cannot be satisfied until a mount is defined that includes the file 
in question. Enforcing an order between the operations (i.e., mount points 
must be specified before setting a policy that includes PROVIDED) would make 
this explicit and avoid possible confusion. Do you have a particular 
preference here? In either case, as you point out, as long as the 
{{-createMountOnly}} flag is specified (ref: Section 1.1 in the document), the 
movement will be triggered when the user invokes HDFS-10285. If the flag is 
absent, as the document mentions, the movement will be triggered by the 
{{MountManager}}, but it will still use HDFS-10285.
# For backup mounts, we do not want to write any new data directly to the 
PROVIDED store (i.e., as part of the write pipeline), but only write it to 
PROVIDED lazily (related to your point 6). This case arises when appending to 
files that have the {{PROVIDED}} storage policy set, or when creating new 
files under directories with the {{PROVIDED}} policy. In these cases, I think 
we would have to change the write pipeline to not choose a PROVIDED location 
but one of its fallbacks instead (i.e., modify 
{{BlockPlacementPolicyDefault}}; see the placement sketch after this list). 
Later on, when the {{MountTask}} detects that the new data has been written, 
it will try to satisfy the storage policy.
# No, it will be done by the {{MountTask}}s in the {{MountManager}}.
# Agreed, the recovery mechanism has to be pluggable, and the implementation 
will be vendor-specific.
# Yes, ideally, admins shouldn't do that. For example, with S3, you would 
want to use different buckets for different HDFS clusters. However, even if 
the same store is used, the conflict resolution policy (Section 2) can be 
applied.
# Yes, I would also agree that lazy write-back is better to avoid the latency 
overheads. The suggested flow for ephemeral mounts would be: (a) create and 
write locally using a storage policy that does not involve PROVIDED, and (b) 
once the data is written, change its policy to include PROVIDED so that it 
can be written back lazily (see the write-back sketch after this list). Our 
initial implementation will be geared towards supporting lazy write-backs. 
However, our idea is to design this in such a way that we can support 
synchronous writes if needed.
# So, your suggestion is to include another flag that makes {{unmount}} work 
the way you mentioned?
# No, we don't have to change SPS to do a recursive traversal -- that will be 
done in the {{MountTask}}. 
# That seems like a reasonable approach for EC files. Are you interested in 
writing EC files to PROVIDED storage too?
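
To make point 2 above concrete, here is a minimal, hypothetical sketch of the 
kind of substitution a modified {{BlockPlacementPolicyDefault}} would perform: 
when the storage types requested for a new block include PROVIDED, the 
pipeline substitutes the fallback type so that new replicas land on local 
media and are only written back by the {{MountTask}} later. It assumes the 
{{StorageType.PROVIDED}} enum value from the HDFS-9806 branch; the helper name 
and the fallback choice are illustrative, not from the design doc.

{code:java}
import org.apache.hadoop.fs.StorageType;

import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative helper: rewrite the storage types requested for a new
 * block so the write pipeline never targets PROVIDED directly.
 */
public final class ProvidedFallbackChooser {

  private ProvidedFallbackChooser() {}

  /**
   * @param requested storage types chosen from the file's storage policy
   * @param fallback  local type to use in place of PROVIDED (e.g. DISK)
   * @return the requested list with every PROVIDED entry replaced by fallback
   */
  public static List<StorageType> replaceProvided(
      List<StorageType> requested, StorageType fallback) {
    List<StorageType> result = new ArrayList<>(requested.size());
    for (StorageType t : requested) {
      result.add(t == StorageType.PROVIDED ? fallback : t);
    }
    return result;
  }
}
{code}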
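
And to illustrate the two-step flow in point 6, here is a minimal sketch 
against the {{DistributedFileSystem}} API. It assumes the 
{{satisfyStoragePolicy}} call from HDFS-10285 and a PROVIDED-backed policy 
named "PROVIDED"; the policy name is a placeholder, as the design doc does not 
fix it.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class EphemeralMountWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes fs.defaultFS points at an HDFS cluster.
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    Path file = new Path("/mnt/ephemeral/data.bin");
    dfs.mkdirs(file.getParent());

    // (a) Write locally under a policy with no PROVIDED replicas
    //     (HOT = all replicas on DISK).
    dfs.setStoragePolicy(file.getParent(), "HOT");
    try (FSDataOutputStream out = dfs.create(file)) {
      out.writeBytes("payload");
    }

    // (b) Once the data is written, switch to a PROVIDED-backed policy
    //     ("PROVIDED" is a placeholder name) and let the SPS
    //     (HDFS-10285) write it back lazily.
    dfs.setStoragePolicy(file, "PROVIDED");
    dfs.satisfyStoragePolicy(file);
  }
}
{code}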

> Handling writes from HDFS to Provided storages
> ----------------------------------------------
>
>                 Key: HDFS-12090
>                 URL: https://issues.apache.org/jira/browse/HDFS-12090
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Virajith Jalaparti
>         Attachments: HDFS-12090-design.001.pdf
>
>
> HDFS-9806 introduces the concept of {{PROVIDED}} storage, which makes data in 
> external storage systems accessible through HDFS. However, HDFS-9806 is 
> limited to data being read through HDFS. This JIRA will deal with how data 
> can be written to such {{PROVIDED}} storages from HDFS.


