On Tue, May 28, 2019 at 12:18 PM Jan Lukavský <je...@seznam.cz> wrote:
>
> As I understood it, Kenn was supporting the idea that sequence metadata
> is preferable over FIFO. I was trying to point out that it should even
> provide the same functionality as FIFO, plus one more important property:
> reproducibility and the ability to be persisted and reused the same way
> in batch and streaming.
>
> There is no doubt that sequence metadata can be stored in any
> storage. But regarding the implicit ordering that sources might have:
> yes, of course, data written into HDFS or Cloud Storage has an ordering,
> but only a partial one - within some bulk (e.g. a file) - and the
> ordering is not defined across the boundaries of these bulks (between
> files). That is why I'd say that the ordering of sources is relevant
> only for (partitioned!) streaming sources and generally always reduces
> to sequence metadata (e.g. offsets).
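[A minimal sketch of Jan's point above, in plain Python rather than the Beam API (the field names are hypothetical): if each element carries explicit sequence metadata, the original order can be reconstructed reproducibly no matter how the elements arrive, which an implicit FIFO delivery guarantee cannot offer once the data has been persisted and re-read.]

```python
def restore_order(elements):
    """Sort elements by their (partition, offset) sequence metadata."""
    return sorted(elements, key=lambda e: (e["partition"], e["offset"]))

# The same data seen in a "streaming" arrival order and a "batch" re-read order:
streaming_arrival = [
    {"partition": 0, "offset": 1, "value": "b"},
    {"partition": 1, "offset": 0, "value": "x"},
    {"partition": 0, "offset": 0, "value": "a"},
]
batch_reread = list(reversed(streaming_arrival))

# Both recover the identical, reproducible order from the metadata alone.
assert restore_order(streaming_arrival) == restore_order(batch_reread)
```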
Offsets within a file, unordered between files, seems exactly analogous
to offsets within a partition, unordered between partitions, right?

> On 5/28/19 11:43 AM, Robert Bradshaw wrote:
> > Huge +1 to all Kenn said.
> >
> > Jan, batch sources can have orderings too, just like Kafka. I think
> > it's reasonable (for both batch and streaming) that if a source has an
> > ordering that is an important part of the data, it should preserve
> > this ordering in the data itself (e.g. as sequence numbers, offsets,
> > etc.).
> >
> > On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles <k...@apache.org> wrote:
> >> I strongly prefer explicit sequence metadata over FIFO requirements,
> >> because:
> >>
> >> - FIFO is complex to specify: for example, Dataflow has "per stage
> >> key-to-key" FIFO today, but it is not guaranteed to remain so (plus
> >> "stage" is not a portable concept, nor even guaranteed to remain a
> >> Dataflow concept)
> >> - complex specifications are by definition poor usability (if necessary,
> >> then it is what it is)
> >> - it overly restricts the runner and reduces parallelism; for example,
> >> any non-stateful ParDo has per-element parallelism, not per-key
> >> parallelism
> >> - another perspective on that: FIFO makes everyone pay, rather than just
> >> the transform that requires exact sequencing
> >> - previous implementation details like reshuffles become part of the
> >> model
> >> - I'm not even convinced the use cases involved are addressed by some
> >> careful FIFO restrictions; many sinks re-key, and they would all have to
> >> become aware of how the keying of a sequence of "stages" affects the
> >> end-to-end FIFO
> >>
> >> A noop becoming a non-noop is essentially the mathematical definition of
> >> moving from a higher-level to a lower-level abstraction.
> >>
> >> So this strikes at the core question of what level of abstraction Beam
> >> aims to represent.
> >> Lower-level means there are fewer possible
> >> implementations and it is more tied to the underlying architecture, and
> >> anything that is not a near-exact match pays a huge penalty.
> >> Higher-level means more implementations are possible with different
> >> tradeoffs, though they may all pay a minor penalty.
> >>
> >> I could be convinced to change my mind, but it needs some extensive
> >> design, examples, etc. I think it is probably about the most
> >> consequential design decision in the whole Beam model, around the same
> >> level as the decision to use ParDo and GBK as the primitives, IMO.
> >>
> >> Kenn
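[A sketch of Kenn's "FIFO makes everyone pay" point, in plain Python with hypothetical names, not the Beam API: with explicit sequence metadata, upstream transforms stay order-oblivious and fully parallel, and only the one sink that needs exact sequencing pays the cost of re-establishing per-key order.]

```python
from collections import defaultdict

def parallel_map(fn, elements):
    """Order-oblivious per-element transform: a runner may reorder freely."""
    return [fn(e) for e in elements]

def ordered_sink(elements):
    """The one transform that needs ordering: sort per key by sequence number."""
    per_key = defaultdict(list)
    for key, seq, value in elements:
        per_key[key].append((seq, value))
    return {k: [v for _, v in sorted(vs)] for k, vs in per_key.items()}

# Elements as (key, sequence_number, value); arrival order is scrambled.
events = [("k1", 2, "c"), ("k2", 1, "y"), ("k1", 1, "b"),
          ("k1", 0, "a"), ("k2", 0, "x")]

# The map stage needs no ordering guarantee at all...
upper = parallel_map(lambda e: (e[0], e[1], e[2].upper()), events)

# ...and the sink alone recovers per-key order from the sequence metadata.
assert ordered_sink(upper) == {"k1": ["A", "B", "C"], "k2": ["X", "Y"]}
```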