Files can grow (depending on the filesystem), and tailing growing files is a valid use case.
On Wed, May 29, 2019 at 3:23 PM Jan Lukavský <[email protected]> wrote:
>
> > Offsets within a file, unordered between files seems exactly
> > analogous with offsets within a partition, unordered between
> > partitions, right?
>
> Not exactly. The key difference is that partitions in streaming
> stores are defined (on purpose, and with key impact on this discussion)
> as an unbounded sequence of appends. Files, on the other hand, are
> always of finite size. This difference makes the semantics of offsets
> in a partitioned stream useful, because they are guaranteed to only
> increase. On batch stores such as files, these offsets would have to
> start from zero after some (finite) time, which makes them useless for
> comparison.
>
> On 5/29/19 2:44 PM, Robert Bradshaw wrote:
> > On Tue, May 28, 2019 at 12:18 PM Jan Lukavský <[email protected]> wrote:
> >> As I understood it, Kenn was supporting the idea that sequence metadata
> >> is preferable over FIFO. I was trying to point out that it should even
> >> provide the same functionality as FIFO, plus one more important thing -
> >> reproducibility and the ability to be persisted and reused the same way
> >> in batch and streaming.
> >>
> >> There is no doubt that sequence metadata can be stored in every
> >> storage. But, regarding some implicit ordering that sources might
> >> have - yes, of course, data written into HDFS or Cloud Storage has
> >> ordering, but only partial - inside some bulk (e.g. a file), and the
> >> ordering is not correctly defined on the boundaries of these bulks
> >> (between files). That is why I'd say that ordering of sources is
> >> relevant only for (partitioned!) streaming sources and generally
> >> always reduces to sequence metadata (e.g. offsets).
> > Offsets within a file, unordered between files seems exactly analogous
> > with offsets within a partition, unordered between partitions, right?
> >
> >> On 5/28/19 11:43 AM, Robert Bradshaw wrote:
> >>> Huge +1 to all Kenn said.
> >>>
> >>> Jan, batch sources can have orderings too, just like Kafka. I think
> >>> it's reasonable (for both batch and streaming) that if a source has
> >>> an ordering that is an important part of the data, it should preserve
> >>> this ordering into the data itself (e.g. as sequence numbers,
> >>> offsets, etc.)
> >>>
> >>> On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles <[email protected]> wrote:
> >>>> I strongly prefer explicit sequence metadata over FIFO
> >>>> requirements, because:
> >>>>
> >>>> - FIFO is complex to specify: for example, Dataflow has "per-stage
> >>>> key-to-key" FIFO today, but it is not guaranteed to remain so (plus
> >>>> "stage" is not a portable concept, nor even guaranteed to remain a
> >>>> Dataflow concept)
> >>>> - complex specifications are by definition poor usability (if
> >>>> necessary, then it is what it is)
> >>>> - it overly restricts the runner and reduces parallelism; for
> >>>> example, any non-stateful ParDo has per-element parallelism, not
> >>>> per-"key" parallelism
> >>>> - another perspective on that: FIFO makes everyone pay, rather than
> >>>> just the transform that requires exact sequencing
> >>>> - previous implementation details like reshuffles become part of
> >>>> the model
> >>>> - I'm not even convinced the use cases involved are addressed by
> >>>> some careful FIFO restrictions; many sinks re-key, and they would
> >>>> all have to become aware of how keying of a sequence of "stages"
> >>>> affects the end-to-end FIFO
> >>>>
> >>>> A noop becoming a non-noop is essentially the mathematical
> >>>> definition of moving from a higher-level to a lower-level
> >>>> abstraction.
> >>>>
> >>>> So this strikes at the core question of what level of abstraction
> >>>> Beam aims to represent. Lower-level means there are fewer possible
> >>>> implementations and it is more tied to the underlying architecture,
> >>>> and anything that is not a near-exact match pays a huge penalty.
> >>>> Higher-level means more implementations are possible with different
> >>>> tradeoffs, though they may all pay a minor penalty.
> >>>>
> >>>> I could be convinced to change my mind, but it needs some extensive
> >>>> design, examples, etc. I think it is probably about the most
> >>>> consequential design decision in the whole Beam model, around the
> >>>> same level as the decision to use ParDo and GBK as the primitives
> >>>> IMO.
> >>>>
> >>>> Kenn
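The "explicit sequence metadata" alternative argued for above can be illustrated with a small sketch (plain Python with hypothetical names, not Beam SDK code): instead of requiring FIFO delivery from the runner, each element carries a per-key sequence number, and only the one transform that needs ordering buffers out-of-order elements and re-emits them in sequence order, regardless of arrival order.

```python
from collections import defaultdict


class Resequencer:
    """Per-key reordering driven by explicit sequence metadata.

    Elements may arrive in any order (no FIFO guarantee from the
    transport); each carries (key, seq, value). For every key we buffer
    out-of-order elements and emit a value only once all lower sequence
    numbers for that key have been seen.
    """

    def __init__(self):
        self._next = defaultdict(int)      # next expected seq per key
        self._pending = defaultdict(dict)  # key -> {seq: value}

    def accept(self, key, seq, value):
        """Buffer one element; return the values now emittable, in order."""
        self._pending[key][seq] = value
        out = []
        while self._next[key] in self._pending[key]:
            out.append(self._pending[key].pop(self._next[key]))
            self._next[key] += 1
        return out


r = Resequencer()
# Elements for key "a" arrive out of order: seq 2, then 0, then 1.
r.accept("a", 2, "c")   # -> [] (still waiting for 0 and 1)
r.accept("a", 0, "a")   # -> ["a"]
r.accept("a", 1, "b")   # -> ["b", "c"]
```

Because the ordering lives in the data rather than in a delivery guarantee, the same logic is reproducible in batch and streaming, and the rest of the pipeline keeps per-element parallelism.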
