That's right, but is there a filesystem that allows files of unbounded size? If there is always an upper size limit, does that mean you cannot use the order of elements in the file as-is? You might need to transfer the offset from one file to another (that's how Kafka does it), but that implies you are not using what the batch storage gives you natively; you are storing the offset yourself (as metadata).
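
To make "transfer the offset" concrete, here is a minimal sketch (plain Java, hypothetical names, not actual Kafka or Beam code): the writer keeps a global sequence number as its own metadata and rolls it over into each new file, the way Kafka names log segments by their base offset. Each file is bounded, but the logical sequence keeps increasing.

import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: the global offset survives file boundaries only
// because the writer stores it itself (here as the file's base-offset
// name), not because the filesystem provides it.
public class RollingSequenceWriter implements AutoCloseable {
  private final Path dir;
  private final long maxRecordsPerFile;
  private long nextOffset;            // global, monotonically increasing
  private long recordsInCurrentFile;
  private Writer current;

  public RollingSequenceWriter(Path dir, long maxRecordsPerFile,
                               long startOffset) throws IOException {
    this.dir = dir;
    this.maxRecordsPerFile = maxRecordsPerFile;
    this.nextOffset = startOffset;
    roll();
  }

  public void append(String record) throws IOException {
    if (recordsInCurrentFile >= maxRecordsPerFile) {
      roll();                         // file reached its size limit
    }
    // Write the global offset next to the record, so readers can
    // restore the total order across file boundaries.
    current.write(nextOffset + "\t" + record + "\n");
    nextOffset++;
    recordsInCurrentFile++;
  }

  private void roll() throws IOException {
    if (current != null) {
      current.close();
    }
    // Like a Kafka log segment, the new file is named by its base
    // offset, linking it into the global sequence.
    current = Files.newBufferedWriter(
        dir.resolve(String.format("%020d.log", nextOffset)));
    recordsInCurrentFile = 0;
  }

  @Override
  public void close() throws IOException {
    if (current != null) {
      current.close();
    }
  }
}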

Either way, maybe the discussion is not that important, because the invariant requirement persists - there has to be a sequential observer of the data that creates a sequence of updates in the order the data was observed and persists this order. If you have two observers of the data, each storing its own (even unbounded) file, then (if partitioning by key is not enforced) I'd say the ordering cannot be used.
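
A toy illustration of that invariant (names made up): a single observer can stamp a total order on what it sees, but two independent observers produce sequence numbers that are incomparable across their files.

import java.util.concurrent.atomic.AtomicLong;

// Toy illustration: sequence numbers are meaningful only relative to
// the single observer that assigned them.
public class SequentialObserver {
  private final AtomicLong sequence = new AtomicLong();

  // All updates pass through this one observer, so the assigned
  // numbers reproduce the observation order exactly.
  public long observe(String update) {
    return sequence.getAndIncrement();
  }

  public static void main(String[] args) {
    SequentialObserver a = new SequentialObserver();
    SequentialObserver b = new SequentialObserver();
    long x = a.observe("update-1");  // 0, persisted in a's file
    long y = b.observe("update-2");  // also 0, persisted in b's file
    // x == y, yet these are distinct events: offsets from two observers
    // carry no cross-file ordering. Only if the data is partitioned by
    // key, so that each key is owned by exactly one observer, does the
    // per-observer sequence stay meaningful (per key).
    System.out.println(x + " vs " + y);
  }
}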

This mechanism seems to me related to what limits parallelism in streaming sources, and to why batch sources generally parallelise better.
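
A back-of-the-envelope sketch of that intuition (all numbers invented): a bounded file can be split into as many ranges as there are workers, because no cross-range order has to survive the split, whereas a partitioned stream gets at most one sequential reader per partition.

// Invented numbers, just to show the shape of the limit.
public class SourceParallelism {
  public static void main(String[] args) {
    long fileBytes = 1_000_000_000L;
    int workers = 64;
    // Batch: any split count works; no ordering is promised anyway.
    long bytesPerSplit = fileBytes / workers;
    System.out.println("batch: " + workers + " splits of "
        + bytesPerSplit + " bytes");

    // Streaming: each partition needs one sequential reader to keep
    // its offsets increasing, so parallelism is capped by the
    // partition count.
    int topicPartitions = 8;
    int streamReaders = Math.min(workers, topicPartitions);
    System.out.println("stream: " + streamReaders
        + " readers (bounded by partitions)");
  }
}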

Jan

On 5/30/19 1:35 PM, Reuven Lax wrote:
Files can grow (depending on the filesystem), and tailing growing files is a valid use case.

On Wed, May 29, 2019 at 3:23 PM Jan Lukavský <[email protected]> wrote:

     > Offsets within a file, unordered between files, seem exactly
     > analogous to offsets within a partition, unordered between
     > partitions, right?

     Not exactly. The key difference is that partitions in streaming
     stores are defined (on purpose, and with key impact on this
     discussion) as unbounded sequences of appends. Files, on the other
     hand, are always of finite size. This difference makes the
     semantics of offsets in a partitioned stream useful, because they
     are guaranteed to only increase. On batch stores such as files,
     these offsets would have to restart from zero after some (finite)
     time, which makes them useless for comparison.

    On 5/29/19 2:44 PM, Robert Bradshaw wrote:
     > On Tue, May 28, 2019 at 12:18 PM Jan Lukavský <[email protected]> wrote:
     >> As I understood it, Kenn was supporting the idea that sequence
     >> metadata is preferable over FIFO. I was trying to point out that
     >> it should even provide the same functionality as FIFO, plus one
     >> important thing more - reproducibility and the ability to be
     >> persisted and reused the same way in batch and streaming.
    >>
     >> There is no doubt that sequence metadata can be stored in every
     >> storage. But regarding some implicit ordering that sources might
     >> have - yes, of course, data written into HDFS or Cloud Storage
     >> has an ordering, but only a partial one - inside some bulk (e.g.
     >> a file), and the ordering is not defined across the boundaries
     >> of these bulks (between files). That is why I'd say that the
     >> ordering of sources is relevant only for (partitioned!)
     >> streaming sources and generally always reduces to sequence
     >> metadata (e.g. offsets).
     > Offsets within a file, unordered between files, seem exactly
     > analogous to offsets within a partition, unordered between
     > partitions, right?
    >
    >> On 5/28/19 11:43 AM, Robert Bradshaw wrote:
    >>> Huge +1 to all Kenn said.
    >>>
     >>> Jan, batch sources can have orderings too, just like Kafka. I
     >>> think it's reasonable (for both batch and streaming) that if a
     >>> source has an ordering that is an important part of the data,
     >>> it should preserve this ordering in the data itself (e.g. as
     >>> sequence numbers, offsets, etc.)
    >>>
     >>> On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles
     >>> <[email protected]> wrote:
     >>>> I strongly prefer explicit sequence metadata over FIFO
     >>>> requirements, because:
     >>>>
     >>>>    - FIFO is complex to specify: for example, Dataflow has
     >>>>    "per stage key-to-key" FIFO today, but it is not guaranteed
     >>>>    to remain so (plus "stage" is not a portable concept, nor
     >>>>    even guaranteed to remain a Dataflow concept)
     >>>>    - complex specifications are by definition poor usability
     >>>>    (if necessary, then it is what it is)
     >>>>    - it overly restricts the runner and reduces parallelism:
     >>>>    for example, any non-stateful ParDo has per-element
     >>>>    parallelism, not per-"key" parallelism
     >>>>    - another perspective on that: FIFO makes everyone pay,
     >>>>    rather than just the transform that requires exact
     >>>>    sequencing
     >>>>    - previous implementation details like reshuffles become
     >>>>    part of the model
     >>>>    - I'm not even convinced the use cases involved are
     >>>>    addressed by some careful FIFO restrictions; many sinks
     >>>>    re-key, and they would all have to become aware of how the
     >>>>    keying of a sequence of "stages" affects the end-to-end
     >>>>    FIFO
    >>>>
     >>>> A noop becoming a non-noop is essentially the mathematical
     >>>> definition of moving from higher-level to lower-level
     >>>> abstraction.
    >>>>
     >>>> So this strikes at the core question of what level of
     >>>> abstraction Beam aims to represent. Lower-level means there
     >>>> are fewer possible implementations and it is more tied to the
     >>>> underlying architecture, and anything that is not a near-exact
     >>>> match pays a huge penalty. Higher-level means there are more
     >>>> implementations possible with different tradeoffs, though they
     >>>> may all pay a minor penalty.
    >>>>
     >>>> I could be convinced to change my mind, but it needs some
     >>>> extensive design, examples, etc. I think it is probably about
     >>>> the most consequential design decision in the whole Beam
     >>>> model, around the same level as the decision to use ParDo and
     >>>> GBK as the primitives, IMO.
    >>>>
    >>>> Kenn
