[
https://issues.apache.org/jira/browse/HDFS-12090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079375#comment-16079375
]
Virajith Jalaparti commented on HDFS-12090:
-------------------------------------------
Thanks for taking a look at the design document and for your comments, [~rakeshr].
Responses below:
# Setting the {{StoragePolicy}} to include the PROVIDED storage type will result
in data movement only if there is an external store to which the file(s) can be
moved. We don't necessarily have to enforce an order between the two
operations. However, it must then be understood that a policy with PROVIDED
cannot be satisfied until a mount is defined that includes the file in
question. Enforcing an order between the operations (i.e., mount points must be
specified before setting a policy that includes PROVIDED) would make this less
implicit and avoid possible confusion. Do you have a particular preference here? In
either case, as you point out, as long as the {{-createMountOnly}} flag is
specified (ref: Section 1.1 in the document), the movement will be triggered
when the user invokes HDFS-10285. If the flag is absent, as the document
mentions, the movement will be triggered by the {{MountManager}}, which will
still use HDFS-10285.
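The "policy cannot be satisfied until a mount covers the file" constraint can be sketched as a prefix check over configured mount points. This is only a model — the class and method names ({{MountRegistry}}, {{coversPath}}) are illustrative, not APIs from the design document:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a PROVIDED-bearing policy is only satisfiable once some mount
// point covers the file. Names here are illustrative, not actual HDFS APIs.
public class MountRegistry {
    private final List<String> mountPoints = new ArrayList<>();

    public void addMount(String hdfsPathPrefix) {
        // Normalize so that "/data" covers "/data/x" but not "/database/x".
        mountPoints.add(hdfsPathPrefix.endsWith("/") ? hdfsPathPrefix
                                                     : hdfsPathPrefix + "/");
    }

    // True if the file falls under a defined mount, i.e., movement to the
    // external store can actually be triggered.
    public boolean coversPath(String filePath) {
        for (String m : mountPoints) {
            if (filePath.startsWith(m)) {
                return true;
            }
        }
        return false;
    }

    public boolean canSatisfyProvidedPolicy(String filePath) {
        // Without a covering mount, setting a PROVIDED policy stays
        // unsatisfied until a mount is later defined for this subtree.
        return coversPath(filePath);
    }
}
```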
# For backup mounts, we do not want to write any new data directly to the
PROVIDED store (i.e., as part of the write pipeline), but only write it to
PROVIDED lazily (related to your point 6). This case arises when appending to
files that have the {{PROVIDED}} storage policy set, or when creating new files
under directories with a {{PROVIDED}} policy. In these cases, I think we would
have to change the write pipeline to not choose a PROVIDED location but one of
its fallbacks (i.e., modify {{BlockPlacementPolicyDefault}}). Later, when the
{{MountTask}} detects that the new data has been written, it will try to
satisfy the storage policy.
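The fallback selection on the write path could look like the sketch below. The storage-type names mirror HDFS's, but the selection logic is a deliberate simplification of what a modified {{BlockPlacementPolicyDefault}} would do (no rack awareness, no node load):

```java
import java.util.List;

public class ProvidedWriteFallback {
    public enum StorageType { PROVIDED, DISK, SSD, ARCHIVE, RAM_DISK }

    // For backup mounts, the write pipeline must never place a replica on
    // PROVIDED directly; pick the first non-PROVIDED type from the policy's
    // preference list instead. A simplified stand-in for the fallback logic
    // a modified BlockPlacementPolicyDefault would apply.
    public static StorageType chooseForWrite(List<StorageType> policyPreference) {
        for (StorageType t : policyPreference) {
            if (t != StorageType.PROVIDED) {
                return t; // first usable fallback
            }
        }
        throw new IllegalStateException("policy has no non-PROVIDED fallback");
    }
}
```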
# No, it will be done by the {{MountTask}}s in the {{MountManager}}.
# Agreed, the recovery mechanism has to be pluggable, and the implementation
will be vendor-specific.
# Yes, ideally, the admins shouldn't do that. For example, for S3, you would
want to use different buckets for different HDFS clusters. However, even if the
same store is used, the conflict resolution policy can be used (Section 2).
# Yes, I would also agree that lazy write-back is better to avoid the latency
overheads. The suggested operation for ephemeral mounts would be: (a) create
and write locally using a storage policy that does not involve PROVIDED, and
(b) once the data is written, change its policy to include PROVIDED so that it
can be written back lazily. Our initial implementation will be geared towards
supporting lazy write-backs. However, our idea is to design this in such a way
that we can support synchronous writes if needed.
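Steps (a) and (b) amount to a two-phase policy change. The sketch below models that with illustrative names ({{LazyWriteBack}}, {{markForWriteBack}}); these are not actual {{MountManager}} APIs:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class LazyWriteBack {
    // Files whose policy now includes PROVIDED and which are awaiting
    // lazy write-back to the external store.
    private final Queue<String> pending = new ArrayDeque<>();

    // Step (a): data is created and written locally under a non-PROVIDED
    // policy, so the client sees no latency from the remote store.
    public void writeLocally(String path) {
        // local write only; nothing is queued for the external store yet
    }

    // Step (b): once the write completes, flip the policy to include
    // PROVIDED; the MountTask later satisfies it asynchronously.
    public void markForWriteBack(String path) {
        pending.add(path);
    }

    // Consumed by the MountTask when it performs the actual write-back.
    public String nextToWriteBack() {
        return pending.poll();
    }
}
```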
# So, your suggestion is to include another flag that makes {{unmount}} work
the way you mentioned?
# No, we don't have to change SPS to do a recursive traversal -- that will be
done in the {{MountTask}}.
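The recursive traversal would live in the {{MountTask}}, which then hands each file to SPS (HDFS-10285) individually, so SPS itself stays non-recursive. As a local-filesystem stand-in ({{java.nio}} in place of the NameNode's namespace), the walk could look like:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class MountTraversal {
    // Recursively collect the files under a mount root; each one would then
    // be submitted to SPS separately by the MountTask. java.nio stands in
    // for the NameNode's INode tree in this sketch.
    public static List<Path> filesUnder(Path mountRoot) throws IOException {
        List<Path> files = new ArrayList<>();
        try (Stream<Path> s = Files.walk(mountRoot)) {
            s.filter(Files::isRegularFile).forEach(files::add);
        }
        return files;
    }
}
```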
# That seems like a reasonable approach for EC files. Are you interested in
writing EC files to PROVIDED storage too?
> Handling writes from HDFS to Provided storages
> ----------------------------------------------
>
> Key: HDFS-12090
> URL: https://issues.apache.org/jira/browse/HDFS-12090
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Virajith Jalaparti
> Attachments: HDFS-12090-design.001.pdf
>
>
> HDFS-9806 introduces the concept of {{PROVIDED}} storage, which makes data in
> external storage systems accessible through HDFS. However, HDFS-9806 is
> limited to data being read through HDFS. This JIRA will deal with how data
> can be written to such {{PROVIDED}} storages from HDFS.