> Offsets within a file, unordered between files seem exactly analogous
> to offsets within a partition, unordered between partitions, right?
Not exactly. The key difference is that partitions in streaming stores
are defined (on purpose, and with key impact on this discussion) as
unbounded sequences of appends. Files, on the other hand, are always of
finite size. This difference is what makes the semantics of offsets in a
partitioned stream useful, because they are guaranteed to only increase.
In batch stores such as files, these offsets would have to start from
zero again after some (finite) time, which makes them useless for comparison.
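To make the contrast concrete, here is a minimal sketch (purely
hypothetical, not any particular store's API) of the two append models:

  // A streaming partition is an unbounded append log: its offsets form a
  // single, ever-increasing sequence and never reset.
  final class StreamingPartition {
    private long nextOffset = 0;

    long append(byte[] record) {
      return nextOffset++;  // every append gets a strictly larger offset
    }
  }

  // A file-based store rolls to a new file after a finite size, and the
  // offset restarts at zero in each file, so a bare offset carries no
  // global order - only (file, offset) pairs mean anything, and those are
  // unordered between files.
  final class RollingFileStore {
    private static final long MAX_FILE_SIZE = 1 << 20;  // files are finite
    private long currentFile = 0;
    private long offsetInFile = 0;

    long[] append(byte[] record) {
      if (offsetInFile + record.length > MAX_FILE_SIZE) {
        currentFile++;
        offsetInFile = 0;  // the offset sequence starts over
      }
      long[] position = {currentFile, offsetInFile};
      offsetInFile += record.length;
      return position;
    }
  }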
On 5/29/19 2:44 PM, Robert Bradshaw wrote:
On Tue, May 28, 2019 at 12:18 PM Jan Lukavský <je...@seznam.cz> wrote:
As I understood it, Kenn was supporting the idea that sequence metadata
is preferable to FIFO. I was trying to point out that it should even
provide the same functionality as FIFO, plus one important thing more -
reproducibility and the ability to be persisted and reused the same way
in batch and streaming.
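As a rough sketch of what I mean - ignoring windowing and triggering
details, and assuming elements already carry an explicit sequence number -
a plain DoFn after a GroupByKey can restore per-key order from that
metadata alone, and the same code runs in batch and streaming:

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.KV;

  // Input: per key, the grouped (sequenceNumber, payload) pairs in whatever
  // order the runner delivered them. Output: payloads in sequence order.
  class EmitInSequenceOrder
      extends DoFn<KV<String, Iterable<KV<Long, String>>>, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      List<KV<Long, String>> buffered = new ArrayList<>();
      c.element().getValue().forEach(buffered::add);
      // The explicit sequence number, not arrival order, defines the output.
      buffered.sort(Comparator.comparing(KV::getKey));
      for (KV<Long, String> kv : buffered) {
        c.output(kv.getValue());
      }
    }
  }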
There is no doubt that sequence metadata can be stored in every
storage. But regarding the implicit ordering that sources might have -
yes, of course, data written into HDFS or Cloud Storage has an ordering,
but only a partial one - inside some bulk (e.g. a file) - and the ordering
is not well defined across the boundaries of these bulks (between files).
That is why I'd say that the ordering of sources is relevant only for
(partitioned!) streaming sources and generally always reduces to
sequence metadata (e.g. offsets).
Offsets within a file, unordered between files seem exactly analogous
to offsets within a partition, unordered between partitions, right?
On 5/28/19 11:43 AM, Robert Bradshaw wrote:
Huge +1 to all Kenn said.
Jan, batch sources can have orderings too, just like Kafka. I think
it's reasonable (for both batch and streaming) that if a source has an
ordering that is an important part of the data, it should preserve
this ordering into the data itself (e.g. as sequence numbers, offsets,
etc.).
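For example (a rough sketch with KafkaIO; the broker and topic names are
placeholders), the per-partition offset can be copied into each element so
the ordering travels with the data rather than depending on delivery order:

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.kafka.KafkaIO;
  import org.apache.beam.sdk.io.kafka.KafkaRecord;
  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.TypeDescriptors;
  import org.apache.kafka.common.serialization.LongDeserializer;
  import org.apache.kafka.common.serialization.StringDeserializer;

  public class PreserveOffsets {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create();

      // Keep the KafkaRecord metadata and carry the offset in the element
      // itself. (Strictly, (partition, offset) is the meaningful pair; a
      // single offset is only ordered within one partition.)
      PCollection<KV<Long, String>> withSequenceMetadata =
          p.apply(
                  KafkaIO.<Long, String>read()
                      .withBootstrapServers("broker:9092")  // placeholder
                      .withTopic("events")                  // placeholder
                      .withKeyDeserializer(LongDeserializer.class)
                      .withValueDeserializer(StringDeserializer.class))
              .apply(
                  MapElements.into(
                          TypeDescriptors.kvs(
                              TypeDescriptors.longs(), TypeDescriptors.strings()))
                      .via(
                          (KafkaRecord<Long, String> record) ->
                              KV.of(record.getOffset(), record.getKV().getValue())));

      p.run();
    }
  }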
On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles <k...@apache.org> wrote:
I strongly prefer explicit sequence metadata over FIFO requirements, because:
- FIFO is complex to specify: for example Dataflow has "per stage key-to-key" FIFO
today, but it is not guaranteed to remain so (plus "stage" is not a portable concept, nor
even guaranteed to remain a Dataflow concept)
- complex specifications are by definition poor usability (if necessary,
then it is what it is)
- overly restricts the runner and reduces parallelism; for example, any non-stateful ParDo
has per-element parallelism, not per-"key" parallelism
- another perspective on that: FIFO makes everyone pay rather than just the
transform that requires exact sequencing
- previous implementation details like reshuffles become part of the model
- I'm not even convinced the use cases involved are addressed by some careful FIFO
restrictions; many sinks re-key and they would all have to become aware of how keying of
a sequence of "stages" affects the end-to-end FIFO.
A noop becoming a non-noop is essentially the mathematical definition of moving
from a higher-level to a lower-level abstraction.
So this strikes at the core question of what level of abstraction Beam aims to
represent. Lower-level means there are fewer possible implementations and it is
more tied to the underlying architecture, and anything that is not a near-exact
match pays a huge penalty. Higher-level means there are more implementations possible
with different tradeoffs, though they may all pay a minor penalty.
I could be convinced to change my mind, but it needs some extensive design,
examples, etc. I think it is probably about the most consequential design
decision in the whole Beam model, around the same level as the decision to use
ParDo and GBK as the primitives IMO.
Kenn