Re: Flight and sparse data

David Li Thu, 02 Jun 2022 04:48:38 -0700

Hey Matt,

This isn't well supported right now, either in Flight or in Arrow itself.
But you're in luck: someone else has already proposed a "ColumnBag"
structure that addresses this use case (they had very similar needs). See
the ML discussion [1] and proposal PR [2].


However, it has stalled a bit. It needs someone to pick it up and carry it
across the finish line. That would mean cleaning up the proposal and providing
implementations in two languages. I'd be interested in helping but don't really
have the time to do this right now.

[1]: https://lists.apache.org/thread/b0f34f89lq8c8cwhp5clwbl41p2rk2cy
[2]: https://github.com/apache/arrow/pull/11646

-David

On Thu, Jun 2, 2022, at 01:17, Matt Youill wrote:
> Hi,
>
> I have a question regarding Flight and sparse data...
>
> Suppose you have a data set where some records are missing values.
> Consuming those records in batches may mean a different schema for each
> batch.
>
> In the case where a field is known to be missing it isn't possible to
> infer the type. In the case where the fields aren't known in advance it
> isn't possible to include missing fields in the schema at all. For
> example, suppose the following 2 partitions of a notionally single data set are
> read into 2 batches of 3 records each.
>
> A, B
> 1, 2
> 4, 5
> 7, 8
>
> A,  B,  C
> 10, 11, 12
> 13, 14, 15
> 16, 17, 18
>
> Batch 1 may get schema ((A, int), (B, int)) while batch 2 may get ((A,
> int), (B, int), (C, int)) or in the case where we know C *should* exist
> we could set batch 1 schema to ((A, int), (B, int), (C, null or some
> other "undefined" type?)).
>
> This isn't an issue when working with individual batches, but becomes
> problematic when working with data structures that aggregate batches
> (e.g. Table, RecordBatchReader, etc). Most of these data structures seem
> to assume that the schema is that of the first contained record batch -
> which is usually fine or can be worked around.
>
> What I can't figure out, however, is how to deal with FlightDataStream,
> which AFAICT wants a single schema for a stream of record batches, when
> the record batches may have different schemas and it isn't possible to
> have a view of the entire stream of batches to resolve discrepancies
> prior to transmitting the stream. Or is it possible to fix discrepancies
> at the receiving end instead?
>
> Is there a natural way to work with Flight and sparse streaming data?
>
> Thanks, Matt
