Hi John,

> Just concerned about running out of time if we are freezing the format at
> 1.0. Is there any rough guess as to how much time I have to make my case?
We haven't discussed it formally, but we have been doing releases every
~3 months, so I would expect 1.0.0 sometime in January.

On Fri, Oct 18, 2019 at 10:18 AM John Muehlhausen <j...@jgm.org> wrote:

> Perhaps what I should do is study the batch creation process in the
> reference implementation and see whether an alternative approach can be a
> lot more efficient in time while being less efficient in space.
>
> And if so, whether this new approach also requires a differentiation
> between batch length and underlying array/buffer length, even while the
> immutability doctrine of a batch remains unchanged. (I.e., remove
> "mutating record batch headers on the fly" from the discussion as
> potentially unnecessary to make the point.)
>
> Just concerned about running out of time if we are freezing the format at
> 1.0. Is there any rough guess as to how much time I have to make my case?
>
> -John
>
> On Fri, Oct 18, 2019 at 1:55 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> On the specification: I'm -.5 on saying array lengths can be different
>> than the RecordBatch length (especially if both are valid lengths). I
>> can see some wiggle room in the current language [1][2] that might
>> allow for modifying this, so I think we should update it one way or
>> another however this conversation turns out.
>>
>> On the implementation: I wouldn't want to see code that mutates the
>> record batch headers on the fly in the current code base. It is
>> dangerous to break immutability assumptions that standard
>> implementations might be relying on. IMO, for such a specialized
>> use-case it makes sense to have out-of-band communication (perhaps in a
>> private implementation that does modify headers) and then "touch up"
>> the metadata for later analysis, so it conforms to the specification
>> (and standard libraries can be used).
>>
>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs#L49
>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L34
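For concreteness, the "keep all lengths in sync" route Micah suggests can
be sketched in pyarrow: pre-allocate a generously sized batch, then expose
only the filled rows through a zero-copy slice, so the RecordBatch length
and array lengths always agree. The capacity, column names, and fill count
below are illustrative assumptions, not anything from the spec:

    import pyarrow as pa

    CAPACITY = 1024  # rows pre-allocated per pooled batch (illustrative)

    # Build a full-capacity batch once; it stands in for a reusable
    # pooled batch whose buffers were allocated up front.
    ids = pa.array(range(CAPACITY), type=pa.int64())
    vals = pa.array([0.0] * CAPACITY, type=pa.float64())
    backing = pa.RecordBatch.from_arrays([ids, vals], ["id", "val"])

    # Suppose only the first 700 rows have been filled and sealed so far.
    filled = 700
    view = backing.slice(0, filled)  # zero-copy; no buffers are copied

    # The sliced view is internally consistent: batch length and every
    # column length agree, so no header mutation is needed.
    assert view.num_rows == filled
    assert all(len(col) == filled for col in view.columns)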
>>
>> On Thu, Oct 17, 2019 at 7:17 AM John Muehlhausen <j...@jgm.org> wrote:
>>
>>> Micah, thanks very much for your input. A few thoughts in response:
>>>
>>> ``Over the time horizon of desired latency, if you aren't receiving
>>> enough messages to take advantage of columnar analytics, a system
>>> probably has enough time to compact batches after the fact for later
>>> analysis, and conversely, if you are receiving many events you
>>> naturally get reasonable batch sizes without having to do further
>>> work.''
>>>
>>> To see my perspective it is necessary to stop thinking about business
>>> analytics and latencies of minutes or seconds or significant fractions
>>> of seconds. In real-time financial trading systems we do things like
>>> kernel bypass networking, core affinitization and spinning, fake
>>> workloads to keep hot paths cached, and almost complete avoidance of
>>> general purpose allocators (preferring typed pools), all with the goal
>>> of shaving off one more MICROsecond.
>>>
>>> Therefore "probably has enough time" is never in our vocabulary.
>>> Microbatching is not a thing. It is either purely event driven or bust
>>> if you are shooting for a dozen microseconds from packet-in to
>>> packet-out on the wire. This is also the reason that certain
>>> high-performance systems will never use the reference implementations
>>> of Arrow, just as most other financial tech standards are centered on
>>> the *protocol* and there end up being dozens of implementations that
>>> compete on performance.
>>>
>>> Arrow *as a protocol* has the potential to be a beautiful thing. It is
>>> simple and straightforward and brings needed standardization. I would
>>> encourage thinking of the reference implementations almost as
>>> second-class citizens: they should not de facto define the protocol,
>>> but vice versa. (I know this is probably already the philosophy.)
>>>
>>> Of course data eventually pops out of high performance systems and has
>>> a second life in the hands of researchers, etc. This is why I'd like
>>> "vanilla" Arrow to be able to deal with Arrow data as constructed in
>>> the higher performance systems. Could there be special processes to
>>> refactor all this data? Of course, but why? If I'm logging
>>> RecordBatches to disk and some of them happen to have extra array
>>> elements (e.g. because of the inevitably imperfect pre-allocation
>>> projection -- running out of variable length string storage before
>>> running out of rows), why would I refactor terabytes of data (and,
>>> more importantly, introduce a time window where data is not available
>>> while this occurs) when I can just have pyarrow skip the unused rows?
>>> If I want compression/de-duplication I'll do it at the storage media
>>> layer.
>>>
>>> ``the proposal is essentially changing the unit of exchange from
>>> RecordBatches to a segment of a RecordBatch''
>>>
>>> It is the recognition that a RecordBatch may need to be computationally
>>> useful before it is completely received/constructed, and without adding
>>> additional overhead to the reception/construction process. I grab a
>>> canned RecordBatch from a pool of them and start filling it in. If an
>>> earlier RecordBatch falls out of the computation window, I put it back
>>> in the pool. At any moment in time the batch advertises the truth:
>>> RecordBatch.length is the immutable section of data at this moment in
>>> time. That section will not have changed in future moments. It is also
>>> a recognition that some RecordBatches so constructed are not completely
>>> fillable. They are like a moving truck with a little empty space left.
>>> Sure, we could custom-build trucks to exactly fit the cargo, but why?
>>> It takes too much time. Grab a truck (or a RecordBatch) off the lot
>>> (pool) and go... when you've filled it as much as possible, but not
>>> perfectly, grab another one that is exactly the same.
>>>
>>> I think my perception of the situation is clearest if we think about
>>> "frozen" or "sealed" RecordBatches that, during their construction
>>> stage, use a one-size-fits-all set of arrays and variable length
>>> buffer storage. I grab a RecordBatch off the lot that is "for two int
>>> columns and a variable length string column where the average expected
>>> length of a string is 10." I fill it, then I'm done. If my strings
>>> were slightly longer than expected I have extra array elements and
>>> RecordBatch.length is less than array length.
>>>
>>> While it is true that I have other use-cases in mind regarding
>>> simultaneous collection and computation, I'm hoping that the moving
>>> truck analogy by itself demonstrates enough efficiency advantages (as
>>> compared to typical batch construction) to warrant this change. Put
>>> simply, it is Arrow accommodating more space/time tradeoff options
>>> than it currently does.
>>>
>>> When I was baking plasma into a system for the first time, I ran into
>>> having to write a record batch to a mock stream just to know how much
>>> plasma memory to allocate. That is the kind of "build the moving truck
>>> to fit the cargo" that I just can't have!
>>>
>>> -John
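For reference, the mock-stream sizing step John mentions can be sketched
with pyarrow's MockOutputStream, which counts bytes without storing them;
the batch contents here are illustrative:

    import pyarrow as pa

    # A small stand-in batch; a real producer would size from its own data.
    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3], type=pa.int64()),
         pa.array(["a", "bb", "ccc"])],
        ["id", "name"],
    )

    # MockOutputStream discards the bytes but keeps a running count, so
    # the exact serialized size is known before any real allocation.
    mock = pa.MockOutputStream()
    with pa.ipc.new_stream(mock, batch.schema) as writer:
        writer.write_batch(batch)

    nbytes = mock.size()  # bytes to request from, e.g., a plasma store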
That is the kind of "build the moving >>> truck to fit the cargo" that I just can't have! >>> >>> -John >>> >>> On Wed, Oct 16, 2019 at 10:15 PM Micah Kornfield <emkornfi...@gmail.com> >>> wrote: >>> >>>> Hi John and Wes, >>>> >>>> A few thoughts: >>>> One of the issues which we didn't get into in prior discussions, is the >>>> proposal is essentially changing the unit of exchange from >>>> RecordBatches to >>>> a segment of a RecordBatch. >>>> >>>> I think I brought this up earlier in discussions, an interesting idea >>>> that >>>> Trill [1], a columnar streaming engine, illustrates. Over the time >>>> horizon >>>> of desired latency if you aren't receiving enough messages to take >>>> advantage of columnar analytics, a system probably has enough time to >>>> compact batches after the fact for later analysis and conversely if you >>>> are >>>> receiving many events you naturally get reasonable batch sizes without >>>> having to do further work. >>>> >>>> >>>> > I'm objecting to RecordBatch.length being inconsistent with the >>>> > constituent field lengths, that's where the danger lies. If all of the >>>> > lengths are consistent, no code changes are necessary. >>>> >>>> John, is it a viable solution to keep all length in sync for the use >>>> case >>>> you are imagining? >>>> >>>> A solution I like less, but might be viable: formally specify a negative >>>> constant that signifies length should be inherited from RowBatch length >>>> (this could only be used on top level fields). >>>> >>>> I contend that it can only be useful and will never be harmful. What >>>> are >>>> > the counter-examples of concrete harm? >>>> >>>> >>>> I'm not sure there is anything obviously wrong, however changes to >>>> semantics are always dangerous. One blemish on the current proposal >>>> is >>>> one can't determine easily if a mismatch in row-length is a programming >>>> error or intentional. >>>> >>>> [1] >>>> >>>> https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/trill-vldb2015.pdf >>>> >>>> On Wed, Oct 16, 2019 at 4:41 PM John Muehlhausen <j...@jgm.org> wrote: >>>> >>>> > "that's where the danger lies" >>>> > >>>> > What danger? I have no idea what the specific danger is, assuming >>>> that all >>>> > reference implementations have test cases that hedge around this. >>>> > >>>> > I contend that it can only be useful and will never be harmful. What >>>> are >>>> > the counter-examples of concrete harm? >>>> > >>>> >>>