Hi John,

> Just concerned about running out of time if we are freezing the format at
> 1.0. Is there any rough guess as to how much time I have to make my case?
We haven't discussed it formally, but we have been doing releases every
~3 months, so I would expect 1.0.0 sometime in January.

On Fri, Oct 18, 2019 at 10:18 AM John Muehlhausen <j...@jgm.org> wrote:

> Perhaps what I should do is study the batch creation process in the
> reference implementation and see whether an alternative approach can be a
> lot more efficient in time while being less efficient in space.
>
> And if so, whether this new approach also requires a differentiation
> between batch length and underlying array/buffer length, even while the
> immutability doctrine of a batch remains unchanged. (I.e., remove
> "mutating record batch headers on the fly" from the discussion as
> potentially unnecessary to make the point.)
>
> Just concerned about running out of time if we are freezing the format at
> 1.0. Is there any rough guess as to how much time I have to make my case?
>
> -John
>
> On Fri, Oct 18, 2019 at 1:55 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> On the specification: I'm -.5 on saying array lengths can be different
>> than the RecordBatch length (especially if both are valid lengths). I
>> can see some wiggle room in the current language [1][2] that might
>> allow for modifying this, so I think we should update it one way or
>> another however this conversation turns out.
>>
>> On the implementation: I wouldn't want to see code that mutates the
>> record batch headers on the fly in the current code base. It is
>> dangerous to break immutability assumptions that standard
>> implementations might be relying on. IMO, for such a specialized
>> use-case it makes sense to have out-of-band communication (perhaps in a
>> private implementation that does modify headers) and then "touch up"
>> the metadata for later analysis, so it conforms to the specification
>> (and standard libraries can be used).
>>
>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs#L49
>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L34
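For concreteness, the "keep all lengths in sync" route Micah suggests can
be sketched in pyarrow: pre-allocate a generously sized batch, then expose
only the filled rows through a zero-copy slice, so the RecordBatch length
and array lengths always agree. The capacity, column names, and fill count
below are illustrative assumptions, not anything from the spec:

    import pyarrow as pa

    CAPACITY = 1024  # rows pre-allocated per pooled batch (illustrative)

    # Build a full-capacity batch once; it stands in for a reusable
    # pooled batch whose buffers were allocated up front.
    ids = pa.array(range(CAPACITY), type=pa.int64())
    vals = pa.array([0.0] * CAPACITY, type=pa.float64())
    backing = pa.RecordBatch.from_arrays([ids, vals], ["id", "val"])

    # Suppose only the first 700 rows have been filled and sealed so far.
    filled = 700
    view = backing.slice(0, filled)  # zero-copy; no buffers are copied

    # The sliced view is internally consistent: batch length and every
    # column length agree, so no header mutation is needed.
    assert view.num_rows == filled
    assert all(len(col) == filled for col in view.columns)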
>>
>> On Thu, Oct 17, 2019 at 7:17 AM John Muehlhausen <j...@jgm.org> wrote:
>>
>>> Micah, thanks very much for your input. A few thoughts in response:
>>>
>>> ``Over the time horizon of desired latency, if you aren't receiving
>>> enough messages to take advantage of columnar analytics, a system
>>> probably has enough time to compact batches after the fact for later
>>> analysis, and conversely, if you are receiving many events you
>>> naturally get reasonable batch sizes without having to do further
>>> work.''
>>>
>>> To see my perspective it is necessary to stop thinking about business
>>> analytics and latencies of minutes or seconds or significant fractions
>>> of seconds. In real-time financial trading systems we do things like
>>> kernel bypass networking, core affinitization and spinning, fake
>>> workloads to keep hot paths cached, and almost complete avoidance of
>>> general purpose allocators (preferring typed pools), all with the goal
>>> of shaving off one more MICROsecond.
>>>
>>> Therefore "probably has enough time" is never in our vocabulary.
>>> Microbatching is not a thing. It is either purely event driven or bust
>>> if you are shooting for a dozen microseconds from packet-in to
>>> packet-out on the wire. This is also the reason that certain
>>> high-performance systems will never use the reference implementations
>>> of Arrow, just as most other financial tech standards are centered on
>>> the *protocol* and there end up being dozens of implementations that
>>> compete on performance.
>>>
>>> Arrow *as a protocol* has the potential to be a beautiful thing. It is
>>> simple and straightforward and brings needed standardization. I would
>>> encourage thinking of the reference implementations almost as
>>> second-class citizens: they should not de facto define the protocol,
>>> but vice versa. (I know this is probably already the philosophy.)
>>>
>>> Of course data eventually pops out of high performance systems and has
>>> a second life in the hands of researchers, etc. This is why I'd like
>>> "vanilla" Arrow to be able to deal with Arrow data as constructed in
>>> the higher performance systems. Could there be special processes to
>>> refactor all this data? Of course, but why? If I'm logging
>>> RecordBatches to disk and some of them happen to have extra array
>>> elements (e.g. because of the inevitably imperfect pre-allocation
>>> projection -- running out of variable length string storage before
>>> running out of rows), why would I refactor terabytes of data (and,
>>> more importantly, introduce a time window where data is not available
>>> while this occurs) when I can just have pyarrow skip the unused rows?
>>> If I want compression/de-duplication I'll do it at the storage media
>>> layer.
>>>
>>> ``the proposal is essentially changing the unit of exchange from
>>> RecordBatches to a segment of a RecordBatch''
>>>
>>> It is the recognition that a RecordBatch may need to be computationally
>>> useful before it is completely received/constructed, and without adding
>>> additional overhead to the reception/construction process. I grab a
>>> canned RecordBatch from a pool of them and start filling it in. If an
>>> earlier RecordBatch falls out of the computation window, I put it back
>>> in the pool. At any moment in time the batch advertises the truth:
>>> RecordBatch.length is the immutable section of data at this moment in
>>> time. That section will not have changed in future moments. It is also
>>> a recognition that some RecordBatches so constructed are not completely
>>> fillable. They are like a moving truck with a little empty space left.
>>> Sure, we could custom-build trucks to exactly fit the cargo, but why?
>>> It takes too much time. Grab a truck (or a RecordBatch) off the lot
>>> (pool) and go... when you've filled it as much as possible, but not
>>> perfectly, grab another one that is exactly the same.
>>>
>>> I think my perception of the situation is clearest if we think about
>>> "frozen" or "sealed" RecordBatches that, during their construction
>>> stage, use a one-size-fits-all set of arrays and variable length
>>> buffer storage. I grab a RecordBatch off the lot that is "for two int
>>> columns and a variable length string column where the average expected
>>> length of a string is 10." I fill it, then I'm done. If my strings
>>> were slightly longer than expected I have extra array elements and
>>> RecordBatch.length is less than array length.
>>>
>>> While it is true that I have other use-cases in mind regarding
>>> simultaneous collection and computation, I'm hoping that the moving
>>> truck analogy by itself demonstrates enough efficiency advantages (as
>>> compared to typical batch construction) to warrant this change. Put
>>> simply, it is Arrow accommodating more space/time tradeoff options
>>> than it currently does.
>>>
>>> When I was baking plasma into a system for the first time, I ran into
>>> having to write a record batch to a mock stream just to know how much
>>> plasma memory to allocate. That is the kind of "build the moving truck
>>> to fit the cargo" that I just can't have!
>>>
>>> -John
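For reference, the mock-stream sizing step John mentions can be sketched
with pyarrow's MockOutputStream, which counts bytes without storing them;
the batch contents here are illustrative:

    import pyarrow as pa

    # A small stand-in batch; a real producer would size from its own data.
    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3], type=pa.int64()),
         pa.array(["a", "bb", "ccc"])],
        ["id", "name"],
    )

    # MockOutputStream discards the bytes but keeps a running count, so
    # the exact serialized size is known before any real allocation.
    mock = pa.MockOutputStream()
    with pa.ipc.new_stream(mock, batch.schema) as writer:
        writer.write_batch(batch)

    nbytes = mock.size()  # bytes to request from, e.g., a plasma store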
That is the kind of "build the moving >>> truck to fit the cargo" that I just can't have! >>> >>> -John >>> >>> On Wed, Oct 16, 2019 at 10:15 PM Micah Kornfield <emkornfi...@gmail.com> >>> wrote: >>> >>>> Hi John and Wes, >>>> >>>> A few thoughts: >>>> One of the issues which we didn't get into in prior discussions, is the >>>> proposal is essentially changing the unit of exchange from >>>> RecordBatches to >>>> a segment of a RecordBatch. >>>> >>>> I think I brought this up earlier in discussions, an interesting idea >>>> that >>>> Trill [1], a columnar streaming engine, illustrates. Over the time >>>> horizon >>>> of desired latency if you aren't receiving enough messages to take >>>> advantage of columnar analytics, a system probably has enough time to >>>> compact batches after the fact for later analysis and conversely if you >>>> are >>>> receiving many events you naturally get reasonable batch sizes without >>>> having to do further work. >>>> >>>> >>>> > I'm objecting to RecordBatch.length being inconsistent with the >>>> > constituent field lengths, that's where the danger lies. If all of the >>>> > lengths are consistent, no code changes are necessary. >>>> >>>> John, is it a viable solution to keep all length in sync for the use >>>> case >>>> you are imagining? >>>> >>>> A solution I like less, but might be viable: formally specify a negative >>>> constant that signifies length should be inherited from RowBatch length >>>> (this could only be used on top level fields). >>>> >>>> I contend that it can only be useful and will never be harmful. What >>>> are >>>> > the counter-examples of concrete harm? >>>> >>>> >>>> I'm not sure there is anything obviously wrong, however changes to >>>> semantics are always dangerous. One blemish on the current proposal >>>> is >>>> one can't determine easily if a mismatch in row-length is a programming >>>> error or intentional. >>>> >>>> [1] >>>> >>>> https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/trill-vldb2015.pdf >>>> >>>> On Wed, Oct 16, 2019 at 4:41 PM John Muehlhausen <j...@jgm.org> wrote: >>>> >>>> > "that's where the danger lies" >>>> > >>>> > What danger? I have no idea what the specific danger is, assuming >>>> that all >>>> > reference implementations have test cases that hedge around this. >>>> > >>>> > I contend that it can only be useful and will never be harmful. What >>>> are >>>> > the counter-examples of concrete harm? >>>> > >>>> >>>