[jira] [Work logged] (HADOOP-13327) Add OutputStream + Syncable to the Filesystem Specification

ASF GitHub Bot (Jira) Tue, 09 Feb 2021 07:00:21 -0800


     [ https://issues.apache.org/jira/browse/HADOOP-13327?focusedWorklogId=550273&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-550273 ]


ASF GitHub Bot logged work on HADOOP-13327:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 09/Feb/21 14:59
            Start Date: 09/Feb/21 14:59
    Worklog Time Spent: 10m 
      Work Description: steveloughran commented on a change in pull request #2587:
URL: https://github.com/apache/hadoop/pull/2587#discussion_r572957621



##########
File path: hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md
##########
@@ -0,0 +1,1002 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+<!-- MACRO{toc|fromDepth=1|toDepth=3} -->
+
+# Output: `OutputStream`, `Syncable` and `StreamCapabilities`
+
+## Introduction
+
+This document covers the Output Streams within the context of the
+[Hadoop File System Specification](index.html).
+
+It uses the filesystem model defined in [A Model of a Hadoop Filesystem](model.html)
+with the notation defined in [notation](Notation.md).
+
+The target audiences are:
+1. Users of the APIs. While `java.io.OutputStream` is a standard interface,
+this document clarifies how it is implemented in HDFS and elsewhere.
+The Hadoop-specific interfaces `Syncable` and `StreamCapabilities` are new;
+`Syncable` is notable in offering durability and visibility guarantees which
+exceed those of `OutputStream`.
+1. Implementors of File Systems and clients.
+
+## How data is written to a filesystem
+
+The core mechanism to write data to files through the Hadoop FileSystem APIs
+is through `OutputStream` subclasses obtained through calls to
+`FileSystem.create()`, `FileSystem.append()`,
+or `FSDataOutputStreamBuilder.build()`.
+
+These all return instances of `FSDataOutputStream`, through which data
+can be written through various `write()` methods.
+After a stream's `close()` method is called, all data written to the
+stream MUST be persisted to the filesystem and visible to all other
+clients attempting to read data from that path via `FileSystem.open()`.
+
+As well as operations to write the data, Hadoop's `OutputStream` implementations
+provide methods to flush buffered data back to the filesystem,
+so as to ensure that the data is reliably persisted and/or visible
+to other callers. This is done via the `Syncable` interface. It was
+originally intended that the presence of this interface could be interpreted
+as a guarantee that the stream supported its methods. However, this has proven
+impossible to guarantee as the static nature of the interface is incompatible
+with filesystems whose syncability semantics may vary on a store/path basis.
+As an example, erasure coded files in HDFS do not support the Sync operations,
+even though they are implemented as a subclass of an output stream which is `Syncable`.
+
+A new interface, `StreamCapabilities`, addresses this. It allows callers
+to probe the exact capabilities of a stream, even transitively
+through a chain of streams.
+
+## Output Stream Model
+
+For this specification, an output stream can be viewed as a list of bytes
+stored in the client; the `hsync()` and `hflush()` operations are the actions
+which propagate the data so that it is visible to other readers of the file
+and/or made durable.
+
+```python
+buffer: List[byte]
+```
+
+A flag, `open`, tracks whether the stream is open: after the stream
+is closed no more data may be written to it:
+
+```python
+open: bool
+buffer: List[byte]
+```
+
+The destination path of the stream, `path`, can be tracked to form a triple
+`path, open, buffer`:
+
+```python
+Stream = (path: Path, open: bool, buffer: List[byte])
+```
+
+#### Visibility of Flushed Data
+
+(Immediately) after `Syncable` operations which flush data to the filesystem,
+the data at the stream's destination path MUST match that of
+`buffer`. That is, the following condition MUST hold:
+
+```python
+FS'.Files(path) == buffer
+```
+
+Any client reading the data at the path MUST see the new data.

Review comment:
       up to the implementation. `close()` must pass all its data to the shared FS.
   
   Now, if you want some fun, look at [NFS client side caching](https://docstore.mik.ua/orelly/networking_2ndEd/nfs/ch07_04.htm). 
It dates from the era of diskless Sun workstations and was optimised for 
short-lived files which would only be used by the workstation itself: copying 
them over a 1MB/s shared Ethernet to an even slower shared HDD would hurt the 
rest of the cluster. 
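
To make the condition `FS'.Files(path) == buffer` concrete, here is a minimal Python sketch of the `(path, open, buffer)` model from the spec text above. It is an illustration only and not part of the patch: the dict standing in for `FS.Files` and the helper names `write`, `flush_to_fs` and `close` are assumptions made for this example, and real streams may buffer and flush differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Stream:
    """Illustrative model of the (path, open, buffer) triple."""
    path: str
    open: bool = True
    buffer: List[int] = field(default_factory=list)


def write(stream: Stream, data: bytes) -> None:
    """Append bytes to the client-side buffer; fails once the stream is closed."""
    if not stream.open:
        raise IOError("stream is closed")
    stream.buffer.extend(data)


def flush_to_fs(files: Dict[str, bytes], stream: Stream) -> None:
    """Hypothetical stand-in for a Syncable flush: afterwards files[path] == buffer."""
    files[stream.path] = bytes(stream.buffer)


def close(files: Dict[str, bytes], stream: Stream) -> None:
    """Pass all buffered data to the shared store, then mark the stream closed."""
    if stream.open:
        flush_to_fs(files, stream)
        stream.open = False
```

In this sketch nothing is visible in `files` until `flush_to_fs()` or `close()` runs, which matches the point that what happens before then is up to the implementation.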






Issue Time Tracking
-------------------

    Worklog Id:     (was: 550273)
    Time Spent: 9h 50m  (was: 9h 40m)

> Add OutputStream + Syncable to the Filesystem Specification
> -----------------------------------------------------------
>
>                 Key: HADOOP-13327
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13327
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HADOOP-13327-002.patch, HADOOP-13327-003.patch, 
> HADOOP-13327-branch-2-001.patch
>
>          Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> Write down what a Filesystem output stream should do. While the core API is 
> defined in Java, that doesn't say what's expected about visibility, 
> durability, etc., and the Hadoop Syncable interface is entirely ours to define.


