Files can grow (depending on the filesystem), and tailing growing files is a valid use case.
On Wed, May 29, 2019 at 3:23 PM Jan Lukavský <[email protected]> wrote:
>
> > Offsets within a file, unordered between files seems exactly
> > analogous with offsets within a partition, unordered between
> > partitions, right?
>
> Not exactly. The key difference is that partitions in streaming
> stores are defined (on purpose, and with key impact on this discussion)
> as an unbounded sequence of appends. Files, on the other hand, are
> always of finite size. This difference makes the semantics of offsets
> in a partitioned stream useful, because they are guaranteed to only
> increase. On batch stores such as files, these offsets would have to
> start from zero after some (finite) time, which makes them useless for
> comparison.
>
> On 5/29/19 2:44 PM, Robert Bradshaw wrote:
> > On Tue, May 28, 2019 at 12:18 PM Jan Lukavský <[email protected]> wrote:
> >> As I understood it, Kenn was supporting the idea that sequence metadata
> >> is preferable over FIFO. I was trying to point out that it should even
> >> provide the same functionality as FIFO, plus one more important thing -
> >> reproducibility and the ability to be persisted and reused the same way
> >> in batch and streaming.
> >>
> >> There is no doubt that sequence metadata can be stored in every
> >> storage. But, regarding some implicit ordering that sources might
> >> have - yes, of course, data written into HDFS or Cloud Storage has
> >> ordering, but only partial - inside some bulk (e.g. a file), and the
> >> ordering is not correctly defined on the boundaries of these bulks
> >> (between files). That is why I'd say that ordering of sources is
> >> relevant only for (partitioned!) streaming sources and generally
> >> always reduces to sequence metadata (e.g. offsets).
> > Offsets within a file, unordered between files seems exactly analogous
> > with offsets within a partition, unordered between partitions, right?
> >
> >> On 5/28/19 11:43 AM, Robert Bradshaw wrote:
> >>> Huge +1 to all Kenn said.
> >>>
> >>> Jan, batch sources can have orderings too, just like Kafka. I think
> >>> it's reasonable (for both batch and streaming) that if a source has
> >>> an ordering that is an important part of the data, it should preserve
> >>> this ordering into the data itself (e.g. as sequence numbers,
> >>> offsets, etc.)
> >>>
> >>> On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles <[email protected]> wrote:
> >>>> I strongly prefer explicit sequence metadata over FIFO
> >>>> requirements, because:
> >>>>
> >>>> - FIFO is complex to specify: for example, Dataflow has "per-stage
> >>>> key-to-key" FIFO today, but it is not guaranteed to remain so (plus
> >>>> "stage" is not a portable concept, nor even guaranteed to remain a
> >>>> Dataflow concept)
> >>>> - complex specifications are by definition poor usability (if
> >>>> necessary, then it is what it is)
> >>>> - it overly restricts the runner and reduces parallelism; for
> >>>> example, any non-stateful ParDo has per-element parallelism, not
> >>>> per-"key" parallelism
> >>>> - another perspective on that: FIFO makes everyone pay, rather than
> >>>> just the transform that requires exact sequencing
> >>>> - previous implementation details like reshuffles become part of
> >>>> the model
> >>>> - I'm not even convinced the use cases involved are addressed by
> >>>> some careful FIFO restrictions; many sinks re-key, and they would
> >>>> all have to become aware of how keying of a sequence of "stages"
> >>>> affects the end-to-end FIFO
> >>>>
> >>>> A noop becoming a non-noop is essentially the mathematical
> >>>> definition of moving from a higher-level to a lower-level
> >>>> abstraction.
> >>>>
> >>>> So this strikes at the core question of what level of abstraction
> >>>> Beam aims to represent. Lower-level means there are fewer possible
> >>>> implementations and it is more tied to the underlying architecture,
> >>>> and anything that is not a near-exact match pays a huge penalty.
> >>>> Higher-level means more implementations are possible with different
> >>>> tradeoffs, though they may all pay a minor penalty.
> >>>>
> >>>> I could be convinced to change my mind, but it needs some extensive
> >>>> design, examples, etc. I think it is probably about the most
> >>>> consequential design decision in the whole Beam model, around the
> >>>> same level as the decision to use ParDo and GBK as the primitives
> >>>> IMO.
> >>>>
> >>>> Kenn
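The "explicit sequence metadata" alternative argued for above can be illustrated with a small sketch (plain Python with hypothetical names, not Beam SDK code): instead of requiring FIFO delivery from the runner, each element carries a per-key sequence number, and only the one transform that needs ordering buffers out-of-order elements and re-emits them in sequence order, regardless of arrival order.

```python
from collections import defaultdict


class Resequencer:
    """Per-key reordering driven by explicit sequence metadata.

    Elements may arrive in any order (no FIFO guarantee from the
    transport); each carries (key, seq, value). For every key we buffer
    out-of-order elements and emit a value only once all lower sequence
    numbers for that key have been seen.
    """

    def __init__(self):
        self._next = defaultdict(int)      # next expected seq per key
        self._pending = defaultdict(dict)  # key -> {seq: value}

    def accept(self, key, seq, value):
        """Buffer one element; return the values now emittable, in order."""
        self._pending[key][seq] = value
        out = []
        while self._next[key] in self._pending[key]:
            out.append(self._pending[key].pop(self._next[key]))
            self._next[key] += 1
        return out


r = Resequencer()
# Elements for key "a" arrive out of order: seq 2, then 0, then 1.
r.accept("a", 2, "c")   # -> [] (still waiting for 0 and 1)
r.accept("a", 0, "a")   # -> ["a"]
r.accept("a", 1, "b")   # -> ["b", "c"]
```

Because the ordering lives in the data rather than in a delivery guarantee, the same logic is reproducible in batch and streaming, and the rest of the pipeline keeps per-element parallelism.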
