On Tue, May 28, 2019 at 12:18 PM Jan Lukavský <je...@seznam.cz> wrote:
>
> As I understood it, Kenn was supporting the idea that sequence metadata
> is preferable over FIFO. I was trying to point out that it should even
> provide the same functionality as FIFO, plus one important additional
> property - reproducibility, and the ability to be persisted and reused
> the same way in batch and streaming.
>
> There is no doubt that sequence metadata can be stored in any storage
> system. But regarding the implicit ordering that sources might have -
> yes, of course, data written into HDFS or Cloud Storage has an
> ordering, but only a partial one: within a bulk (e.g. a file) the
> ordering is defined, but it is not defined across the boundaries of
> these bulks (between files). That is why I'd say that the ordering of
> sources is relevant only for (partitioned!) streaming sources and
> generally always reduces to sequence metadata (e.g. offsets).

Offsets within a file, unordered between files, seems exactly analogous
to offsets within a partition, unordered between partitions, right?
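
To make that concrete, here is a minimal sketch (Java SDK; the broker
address, topic name, and the surrounding Pipeline `p` are placeholders,
and string key/value types are an assumption) of a source's partial
order being carried along as explicit sequence metadata in the data,
rather than relied on as an implicit FIFO guarantee of the runner:

    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.io.kafka.KafkaRecord;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.kafka.common.serialization.StringDeserializer;

    // Preserve the source's partial order (partition, offset) as
    // explicit data on each element.
    PCollection<KV<String, KV<Long, String>>> sequenced =
        p.apply(KafkaIO.<String, String>read()
                .withBootstrapServers("broker:9092") // placeholder
                .withTopic("events")                 // placeholder
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class))
         .apply(ParDo.of(
             new DoFn<KafkaRecord<String, String>,
                      KV<String, KV<Long, String>>>() {
               @ProcessElement
               public void process(ProcessContext c) {
                 KafkaRecord<String, String> r = c.element();
                 // Key by (topic, partition); the offset is the
                 // sequence number, totally ordered within that key
                 // and undefined across keys.
                 String partitionKey =
                     r.getTopic() + "/" + r.getPartition();
                 c.output(KV.of(partitionKey,
                     KV.of(r.getOffset(), r.getKV().getValue())));
               }
             }));

The same (key, offset, payload) triples can be written to files and
read back in batch, which is exactly the reproducibility property
described above.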

> On 5/28/19 11:43 AM, Robert Bradshaw wrote:
> > Huge +1 to all Kenn said.
> >
> > Jan, batch sources can have orderings too, just like Kafka. I think
> > it's reasonable (for both batch and streaming) that if a source has an
> > ordering that is an important part of the data, it should preserve
> > this ordering in the data itself (e.g. as sequence numbers, offsets,
> > etc.).
> >
> > On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles <k...@apache.org> wrote:
> >> I strongly prefer explicit sequence metadata over FIFO requirements, 
> >> because:
> >>
> >>   - FIFO is complex to specify: for example Dataflow has "per stage 
> >> key-to-key" FIFO today, but it is not guaranteed to remain so (plus 
> >> "stage" is not a portable concept, nor even guaranteed to remain a 
> >> Dataflow concept)
> >>   - complex specifications are by definition poor usability (if necessary, 
> >> then it is what it is)
> >>   - it overly restricts the runner and reduces parallelism; for example, 
> >> any non-stateful ParDo has per-element parallelism, not per-"key" 
> >> parallelism
> >>   - another perspective on that: FIFO makes everyone pay rather than just 
> >> the transform that requires exact sequencing
> >>   - previous implementation details like reshuffles become part of the 
> >> model
> >>   - I'm not even convinced the use cases involved are addressed by some 
> >> careful FIFO restrictions; many sinks re-key, and they would all have to 
> >> become aware of how the keying of a sequence of "stages" affects the 
> >> end-to-end FIFO
> >>
> >> A noop becoming a non-noop is essentially the mathematical definition of 
> >> moving from a higher-level to a lower-level abstraction.
> >>
> >> So this strikes at the core question of what level of abstraction Beam 
> >> aims to represent. Lower-level means there are fewer possible 
> >> implementations and it is more tied to the underlying architecture, and 
> >> anything that is not a near-exact match pays a huge penalty. Higher-level means 
> >> there are more implementations possible with different tradeoffs, though 
> >> they may all pay a minor penalty.
> >>
> >> I could be convinced to change my mind, but it needs some extensive 
> >> design, examples, etc. I think it is probably about the most consequential 
> >> design decision in the whole Beam model, around the same level as the 
> >> decision to use ParDo and GBK as the primitives IMO.
> >>
> >> Kenn
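
To illustrate the "FIFO makes everyone pay" point from the other
direction, here is a minimal sketch (Java SDK; `SortBySequence` is a
made-up name, not an existing Beam transform, and the state handling
is deliberately simplistic) of how only the transform that actually
needs exact sequencing could pay for it, as a stateful ParDo that
reorders per key by explicit sequence number:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.beam.sdk.state.BagState;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.ValueState;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    // Per-key reordering by explicit sequence number. Only this
    // stateful ParDo pays the sequencing cost; the rest of the
    // pipeline keeps per-element parallelism.
    class SortBySequence
        extends DoFn<KV<String, KV<Long, String>>, KV<String, String>> {

      @StateId("nextSeq")
      private final StateSpec<ValueState<Long>> nextSeqSpec =
          StateSpecs.value();

      @StateId("buffer")
      private final StateSpec<BagState<KV<Long, String>>> bufferSpec =
          StateSpecs.bag();

      @ProcessElement
      public void process(
          ProcessContext c,
          @StateId("nextSeq") ValueState<Long> nextSeq,
          @StateId("buffer") BagState<KV<Long, String>> buffer) {
        Long stored = nextSeq.read();
        long expected = (stored == null) ? 0L : stored;
        buffer.add(c.element().getValue());

        // Emit the contiguous run starting at `expected`; keep the
        // rest buffered. (A production version would bound the buffer
        // and flush via timers; this is deliberately minimal.)
        List<KV<Long, String>> pending = new ArrayList<>();
        for (KV<Long, String> e : buffer.read()) {
          pending.add(e);
        }
        pending.sort(Comparator.comparing(KV::getKey));
        buffer.clear();
        for (KV<Long, String> e : pending) {
          if (e.getKey() == expected) {
            c.output(KV.of(c.element().getKey(), e.getValue()));
            expected++;
          } else if (e.getKey() > expected) {
            buffer.add(e); // still a gap before this element
          }                // seq < expected: duplicate, drop it
        }
        nextSeq.write(expected);
      }
    }

Everything upstream and downstream keeps full parallelism, and because
the ordering lives in the data rather than in the runner, the same
DoFn produces the same result in batch.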
