Re: [DISCUSS][Format] Starting to do some concrete work on the new "StringView" columnar data type

2022-08-03 Thread Gosh Arzumanyan
Hi team!

2cents (maybe less): if I get the idea right, the StringView data type might
be very handy for cases where users already have string data available in
some other format (e.g. std::unordered_map, flat buffer structures etc.)
from which record batches are created and shipped to the wire. It seems
that, at the very least, some intermediate copies can be skipped.
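
To make the "skipped copies" point concrete, here is a rough Python sketch. It
is only an illustration under my own assumptions: the StringView class and both
helper functions are hypothetical names, and the (buffer, offset, length)
triple is a simplification, not the layout under discussion in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringView:
    """Hypothetical view: points at bytes owned elsewhere, copies nothing."""
    buffer: memoryview  # out-of-line storage still owned by the producer
    offset: int
    length: int


def views_over(existing):
    # Wrap each already-existing byte string in place; no bytes are copied.
    return [StringView(memoryview(s), 0, len(s)) for s in existing]


def to_contiguous(views):
    # The copy that an offsets + data layout needs: every string is
    # appended to one contiguous data buffer.
    offsets = [0]
    data = bytearray()
    for v in views:
        data += v.buffer[v.offset:v.offset + v.length]
        offsets.append(len(data))
    return offsets, bytes(data)
```

views_over() is the cheap path for a producer that already holds the strings;
to_contiguous() is the copy that would still be needed at a boundary where the
receiver wants contiguous data.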

Thanks,
Gosh

On Tue, Aug 2, 2022, 2:49 PM Wes McKinney  wrote:

> On Tue, Aug 2, 2022 at 1:02 AM Antoine Pitrou  wrote:
> >
> >
> > Le 01/08/2022 à 19:13, Wes McKinney a écrit :
> > >
> > > If we start placing restrictions on how the out-of-line string buffers
> > > are managed and externalized, it risks undermining the zero-copy
> > > interoperability benefits that we're trying to achieve with this.
> >
> > But embedded pointers in turn undermine zero-copy for IPC and Flight.
> > And they probably make transferring data between CPU and GPU more
> > difficult and more expensive (unless the embedded pointers happen to
> > fall into a piece of the address space shared between CPU and GPU: which
> > you cannot ensure if, say, you got those pointers from a third party
> > through the C data interface).
> >
> > So the bottom line seems to be that embedded pointers enable zero-copy
> > for specific producers, but undermine existing zero-copy qualities for
> > everyone (and, to speak more broadly, ease of data movement).
>
> If the proposal were for implementations to switch over to using these
> StringViews for all of their string data, then I would agree with you.
> But the proposal is for this memory layout to be available as an
> "opt-in" for applications where it's beneficial — and the hypothesis
> (to be supported with evidence, which requires doing some
> implementation work) is that these benefits outweigh the costs
> (additional serialization in some cross-language scenarios).
>
> Currently, an Arrow receiver of this data must perform an expensive
> deserialization from the StringView representation for it to be
> considered valid Arrow — no matter what is the intended use of the
> data. In a way, we are "deferring" the deserialization until the data
> is written out to IPC / Flight, or received by a transitive consumer
> over the C interface.
>
> Similarly, applications that can achieve performance improvements
> (e.g. query engines) by using the StringViews — I would guess that the
> performance benefits outweigh the downstream serialization costs. For
> example, I believe that the performance gains achieved in the Filter
> (boolean selection) and Take (integer selection) operations alone will
> be greater than the StringView<->String transformations that may need
> to take place at application boundaries (where there is a receiver
> that does not benefit from the StringView representation).
>
> Part of my goal for kicking off the implementation work is to be able
> to quantify and demonstrate both the benefits and the costs, so that
> we can make judgments based on real world data. I'm of the
> "practicality beats purity" mindset on this, otherwise we introduce an
> unavoidable tension that will lead query engine projects to choose not
> to use Arrow as their columnar data representation.
>
> > In addition, the embedded pointers deviate from Arrow's representation
> > philosophy, adding cognitive load for implementors who now have to
> > account for the fact that buffers do not tell "everything about the
> > data" but may refer to memory unknown to them. The discussions about how
> > to support this in Go are a direct consequence of this deviation in
> > philosophy.
> >
> > Overall, my opinion is that this is not a very good strategic choice for
> > the project.
> >
> > Regards
> >
> > Antoine.
>


Re: June 23 virtual conference to highlight work in the Arrow ecosystem

2022-05-14 Thread Gosh Arzumanyan
Great news!

Actually I wonder if it would be also possible to organize some non-virtual
events later in the summer?


On Fri, May 13, 2022, 12:02 PM Andrew Lamb  wrote:

> > If folks would find it interesting, I could do a short talk on a
> use-case for FlightSQL (and Substrait)
>
> I would personally find it very interesting
>
>
> On Fri, May 13, 2022 at 11:46 AM Gavin Ray  wrote:
>
> > Super neat, saw the announcement post on Twitter and signed up the other
> > day!
> >
> > If folks would find it interesting, I could do a short talk on a
> > use-case for FlightSQL (and Substrait)
> > The gist of it is having a central API that allows users/vendors to write
> > "plugins" to register new data sources:
> >
> > [image: image.png]
> >
> > You lose a lot of the benefits of Arrow in the serialization to JSON, but
> > FlightSQL as a specification is a great language-agnostic way to share
> > schema metadata and handle queries.
> > With Substrait you get a spec for expressing data compute operations as
> > well, so you can have things solved on both the "tell me what you have"
> and
> > "give me what you have" fronts.
> >
> > (Have to wait for write operations in Substrait though, for full
> > functionality)
> >
> > On Fri, May 13, 2022 at 9:51 AM Wes McKinney 
> wrote:
> >
> >> hi all,
> >>
> >> My employer (Voltron Data) is organizing a free virtual conference on
> >> June 23 to highlight development work and usage of Apache Arrow — you
> >> can register for this or apply to give a talk here:
> >>
> >> https://thedatathread.com/
> >>
> >> We are especially interested in hearing from users (as opposed to only
> >> project developers/contributors!) about how they are using Arrow in
> >> their downstream applications. If you would be interested in speaking
> >> (talks will be pre-recorded, so you don't need to be available on June
> >> 23), please apply to give a short talk (~15 min) on the website!
> >>
> >> Thanks,
> >> Wes
> >>
> >
>


Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-27 Thread Gosh Arzumanyan
Hi guys,

1. Regarding IPC vs Flight: in fact my initial suggestion was to add this
feature starting from IPC (I moved the initial write-up steps to the bottom
of the doc). Afterwards David suggested focusing on Flight, and that's how
we ended up with the protobufs change in the proposal. That being said, I do
think that where this should be implemented is a good question in its own
right. Maybe it makes sense to have this kind of feature in IPC and
somehow use it in Flight, maybe not.
2. The point about dictionaries deserves a dedicated section in the
proposal. Nate and David brought it up and shared some insights. I'll try
to aggregate them and we can continue the discussion from there.

Cheers,
Gosh

On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind wrote:

> >
> > > > makes it more difficult to bring schema evolution back into the
> > > > IPC Stream format (i.e. it would live only in flight)
> > >
> > > Gosh's proposal extends the flatbuffer structures not the protobufs.
> Can
> > > you help me understand how difficult it would be to bring the
> `schema_id`
> > > approach to the IPC stream format?
> >
> > I thought we were talking solely about the Flight Protobuf definitions -
> > not the Flatbuffers (and the Google doc at least only talks about the
> > Protobufs).
> >
>
> I somehow missed that schema_id is being added to protobuf in the document.
> It feels to me that the schema_id is a property that would ideally only
> apply to the RecordBatch. I better understand Micah's dictionary concerns,
> now, too.
>
> > Side Question: Why isn't the IPC stream format a series of the flight
> > > protobufs? It's a real shame that there is no standard way to
> > > capture/replay a stream with app_metadata. (Obviously ignoring the
> > > annoyances around protobuf wrapping flatbuffers.)
> >
> > The IPC format was defined long before Flight, and Flight's app_metadata
> > was added after Flight's initial definition. Note an IPC message does
> have
> > a provision for key-value metadata, though I think APIs for that are not
> > fully exposed. (See ARROW-6940:
> > https://issues.apache.org/jira/browse/ARROW-6940 and despite my comments
> > there perhaps we need to unify or at least consider how Flight's
> > app_metadata relates to the IPC message custom_metadata. Also perhaps see
> > ARROW-1059.)
> >
>
> KeyValue unfortunately is string to string. In flatbuffer strings are only
> UTF-8 or 7-bit ASCII. The app_metadata on the other hand is opaque bytes.
> The latter is a bit more useful.
>
> --
>


Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-25 Thread Gosh Arzumanyan
Hi Micah,

Sure, let me do it here:

   1. In our case we do expect relatively frequent changes in the schema of
   the batch being sent out. I don't see that pattern changing in the mid term
   for a good reason. However, long term it may be possible to leverage
   separate RPC calls. I left some description in the comments thread here
   <https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?disco=MkkJ9xQ>.
   2. Regarding the union of structs trick: while it is a good workaround
   for most cases as of now, there are also some aspects which deserve
   consideration:
      1. There is a space overhead.
      2. There is additional glue code to make it work.
      3. Considering that this solution is bubbling up frequently in
      similar contexts, it seems users might benefit if it was "natively"
      supported and they could just focus on populating the schemas they
      really need.
   3. Re complexity of "one schema at a time" vs "schema id based": I think
   they are not much different, right? In fact the second one is more of an
   optimization of the first one, which is beneficial to us. Anyway, even with
   the first approach you need to add some schema-change logic. Adding the
   schema identification is not that complex at first glance.
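
To illustrate the space overhead and the glue code from point 2, here is a
small Python sketch that mimics a sparse union of structs using plain lists.
It is a toy model under my own assumptions (the function names and the
"trade"/"quote" schemas are made up), not Arrow's union implementation.

```python
def encode_sparse_union(rows, schemas):
    """rows is a list of (schema_name, record) pairs.

    Mimics a sparse union of structs: every child holds one slot per row,
    and slots that belong to another schema are filled with None.
    """
    names = list(schemas)
    type_ids = []
    children = {name: [] for name in names}
    for schema_name, record in rows:
        type_ids.append(names.index(schema_name))
        for name in names:
            children[name].append(record if name == schema_name else None)
    return type_ids, children


def decode_sparse_union(type_ids, children, schemas):
    # The glue code a receiver needs to recover (schema_name, record) pairs.
    names = list(schemas)
    return [(names[t], children[names[t]][i]) for i, t in enumerate(type_ids)]


def wasted_slots(children):
    # Slots that exist only as padding for rows of another schema.
    return sum(slot is None for child in children.values() for slot in child)
```

With K schemas and N rows this toy encoding allocates K * N slots, of which
only N carry data, so the padding grows with the number of distinct schemas.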

Cheers,
Gosh

On Fri, Jun 25, 2021 at 6:25 PM Micah Kornfield wrote:

> >
> > 1. It seems like renaming stream_id to schema_id and delegating "logical
> > stream" distinction to app_metadata mitigates the "multiplexing" point
> > while at the same time it gives enough flexibility to address both Nate's
> > and our use cases.
>
>
> I don't think this is the case.  It seems that having no additional fields
> added and then sending a new schema when necessary combined with Union of
> Structs would solve most use-cases.  The main downside could be potential
> performance implications if the schema is changing frequently.  Gosh, could
> you address why this wouldn't be sufficient (either here or on the doc).
>
> Thanks,
> -Micah
>
> On Fri, Jun 25, 2021 at 5:30 AM Gosh Arzumanyan  wrote:
>
> > Hi guys,
> >
> > Thanks for sharing your insights/concerns! I also left some comments
> based
> > on the discussion we had. Briefly:
> >
> > 1. It seems like renaming stream_id to schema_id and delegating "logical
> > stream" distinction to app_metadata mitigates the "multiplexing" point
> > while at the same time it gives enough flexibility to address both Nate's
> > and our use cases.
> > 2. To David's point about other transports: in fact currently we are
> using
> > other transports(aside from gRPC) so we don't wanna depend on only gRPC
> > features.
> >
> > Cheers,
> > Gosh
> >
> > On Wed, Jun 23, 2021 at 10:40 PM David Li  wrote:
> >
> > > Thanks for chiming in - I've replied in the doc. Scoping it to just
> > schema
> > > evolution would be preferable, but I'm not sure if Gosh's usecase
> > requires
> > > more flexibility than that or not.
> > >
> > > Again, though, given that 1) gRPC recycles a connection, so repeated
> > calls
> > > aren't necessarily expensive and 2) encoding tricks like
> > union-of-structs,
> > > any solution needs to be weighed against those/we should make sure to
> > > document why they aren't sufficient. (For instance, 1) is hampered by
> the
> > > use of L7 load balancers and/or client-side load balancing policies in
> > gRPC
> > > and assumes statefulness which is undesirable in general. There's also
> > the
> > > eventual desire to have a transport besides gRPC someday.)
> > >
> > > -David
> > >
> > > On Wed, Jun 23, 2021, at 16:24, Nate Bauernfeind wrote:
> > >
> > > Thanks for writing this up! I added a few general comments, but have a
> > > question on the approach because it's not quite what I was expecting.
> > >
> > > I am slightly concerned that the proposal looks more like support for
> > > "multiplexing" IPC streams into a single RPC stream rather than support
> > for
> > > a changing Schema of an otherwise consistently logical stream. gRPC
> > already
> > > does a good job decoupling RPC streams from one another. I feel that
> > > throwing that idea into the IPC stream increases client-library
> > > implementation cost by quite a lot.
> > >
> > > Why is it not good enough to replace the Schema when we see a
> duplicate?
> > > This is undoubtedly less work across all client implementations.
>

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-25 Thread Gosh Arzumanyan
Hi guys,

Thanks for sharing your insights/concerns! I also left some comments based
on the discussion we had. Briefly:

1. It seems like renaming stream_id to schema_id and delegating the "logical
stream" distinction to app_metadata mitigates the "multiplexing" point
while at the same time giving enough flexibility to address both Nate's
and our use cases.
2. To David's point about other transports: in fact we are currently using
other transports (aside from gRPC), so we don't want to depend on gRPC-only
features.
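
As a rough illustration of point 1, here is a Python sketch of what a
schema_id-aware reader could look like, next to a count of how many schema
messages each approach sends. Everything here is hypothetical: the message
tuples, the SchemaIdReader class and the helper are assumptions for the sake
of the example, not part of the proposal doc or of Flight.

```python
class SchemaIdReader:
    """Hypothetical reader for a stream whose batches carry a schema_id."""

    def __init__(self):
        self.schemas = {}  # schema_id -> list of field names

    def feed(self, message):
        kind, schema_id, payload = message
        if kind == "schema":
            # Register (or replace) the schema under its id.
            self.schemas[schema_id] = payload
            return None
        if kind == "batch":
            fields = self.schemas[schema_id]  # KeyError if never announced
            return [dict(zip(fields, row)) for row in payload]
        raise ValueError(f"unknown message kind: {kind!r}")


def schema_messages_needed(schema_sequence, use_schema_id):
    """Count schema messages sent for a sequence of per-batch schema names."""
    seen, count, current = set(), 0, None
    for name in schema_sequence:
        if use_schema_id:
            # Each schema is announced once and then referred to by id.
            if name not in seen:
                seen.add(name)
                count += 1
        elif name != current:
            # "Replace the schema" approach: resend on every switch.
            current = name
            count += 1
    return count
```

For a stream that alternates A, B, A, B, A the id-based variant announces two
schemas, while resending on every change sends five schema messages.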

Cheers,
Gosh

On Wed, Jun 23, 2021 at 10:40 PM David Li  wrote:

> Thanks for chiming in - I've replied in the doc. Scoping it to just schema
> evolution would be preferable, but I'm not sure if Gosh's usecase requires
> more flexibility than that or not.
>
> Again, though, given that 1) gRPC recycles a connection, so repeated calls
> aren't necessarily expensive and 2) encoding tricks like union-of-structs,
> any solution needs to be weighed against those/we should make sure to
> document why they aren't sufficient. (For instance, 1) is hampered by the
> use of L7 load balancers and/or client-side load balancing policies in gRPC
> and assumes statefulness which is undesirable in general. There's also the
> eventual desire to have a transport besides gRPC someday.)
>
> -David
>
> On Wed, Jun 23, 2021, at 16:24, Nate Bauernfeind wrote:
>
> Thanks for writing this up! I added a few general comments, but have a
> question on the approach because it's not quite what I was expecting.
>
> I am slightly concerned that the proposal looks more like support for
> "multiplexing" IPC streams into a single RPC stream rather than support for
> a changing Schema of an otherwise consistently logical stream. gRPC already
> does a good job decoupling RPC streams from one another. I feel that
> throwing that idea into the IPC stream increases client-library
> implementation cost by quite a lot.
>
> Why is it not good enough to replace the Schema when we see a duplicate?
> This is undoubtedly less work across all client implementations.
>
> The benefit I see is that you might have two schemas that you swap between
> frequently then you can indicate with a single integer. If that's what you
> want to support I would rather think of them as `schema_id` instead of
> `stream_id` and not give this impression that multiplexing is a goal. As
> you have proposed, it seems that the "done writing for a stream" needs a
> callback notifying the user receiving the stream that a logical subset of
> the flight is complete. Alternatively, if they aren't independent streams
> (to the end-user), we could tell the Arrow layer that a particular schema
> is no longer needed without also needing to communicate further downstream.
>
> On Wed, Jun 23, 2021 at 1:39 PM David Li  wrote:
>
> > Ah to be clear, the API is indeed inconsistent - DoExchange was added
> some
> > time later (and by its nature returning a FlightDataStream would not have
> > been possible, since it's meant to be able to interleave
> reading/writing).
> > But really, DoGet is indeed the odd one out in the C++ API and it may be
> > worth correcting. You could also perhaps imagine making a
> FlightDataStream
> > implementation that accepts a closure and provides it a fake writer, if
> the
> > API mismatch is hard to work with...
> >
> > That said: this has some benefits, e.g. for a Python service that returns
> > a Table, that means data can be fed into gRPC entirely in C++ without
> > having to bounce into Python for each chunk.
> >
> > Best,
> > David
> >
> > On Wed, Jun 23, 2021, at 15:33, Gosh Arzumanyan wrote:
> > > Hi David,
> > >
> > > Got you. In fact I was looking at this more from the point of view of
> > consistency of the API in terms of "inputs" and thought DoExchange is
> kind
> > of a DoGet+ so might make sense to have the same classes being utilized
> in
> > both places. But again, I might be missing something and I get the point
> > about breaking change.
> > >
> > > Cheers,
> > > Gosh
> > >
> > > On Wed, Jun 23, 2021 at 2:58 PM David Li  wrote:
> > >> It's mostly a quirk of implementation (and just for clarification,
> > they're all nearly identical on the format/protocol level).
> > >>
> > >> DoGet is conceptualized as your application returning a readable
> stream
> > of batches, instead of your application imperatively writing batches to
> the
> > client. (This is different than how Flight is implemented in Java.) You
> > would normally not implement FlightDataStream - you would

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-23 Thread Gosh Arzumanyan
Hi David,

Got you. In fact I was looking at this more from the point of view of API
consistency in terms of "inputs", and thought DoExchange is kind of a
DoGet+, so it might make sense to have the same classes used in both
places. But again, I might be missing something, and I get the point about
the breaking change.
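
To show the mismatch I mean, here is a toy Python sketch of the two handler
styles plus an adapter between them. None of these names are real Flight
classes: CollectingWriter, do_get_pull, do_exchange_push and
stream_from_closure are hypothetical, and the adapter buffers everything
before returning, so it is not truly streaming.

```python
class CollectingWriter:
    """Hypothetical stand-in for a push-style writer handed to a handler."""

    def __init__(self):
        self.batches = []

    def write(self, batch):
        self.batches.append(batch)


def do_get_pull():
    # Pull style: the handler returns a readable stream of batches and the
    # framework drains it.
    yield from ([1, 2], [3])


def do_exchange_push(writer):
    # Push style: the handler is given a writer and writes imperatively.
    writer.write([1, 2])
    writer.write([3])


def stream_from_closure(push_handler):
    # Adapter: run a push-style closure against a collecting writer and
    # expose the result as a readable stream.
    writer = CollectingWriter()
    push_handler(writer)
    return iter(writer.batches)
```

Both styles end up producing the same sequence of batches; the difference is
only in who drives the loop.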

Cheers,
Gosh

On Wed, Jun 23, 2021 at 2:58 PM David Li  wrote:

> It's mostly a quirk of implementation (and just for clarification, they're
> all nearly identical on the format/protocol level).
>
> DoGet is conceptualized as your application returning a readable stream of
> batches, instead of your application imperatively writing batches to the
> client. (This is different than how Flight is implemented in Java.) You
> would normally not implement FlightDataStream - you would return a
> RecordBatchStream.
>
> DoGet could not have FlightMessageWriter as a return type as that wouldn't
> make sense, but it could accept an instance of that as a parameter instead,
> much like DoExchange. That would be a breaking change.
>
> Best,
> David
>
> On Wed, Jun 23, 2021, at 08:47, Gosh Arzumanyan wrote:
>
> Hi David,
>
> Going through the ArrowFlight API: got confused a bit on DoGet and
> DoPut/DoExchange apis: the former one expects FlightDataStream which talks
> in already serialized message terms while the latter to
> accept FlightMessageReader/Writer which expect the user to pass in
> RecordBatches etc. Is there any reason why the DoGet can't have
> FlightMessageWriter as a return type?
>
> Cheers,
> Gosh
>
> On Mon, Jun 21, 2021 at 9:47 PM Gosh Arzumanyan  wrote:
>
> > Thanks David!
> >
> > I also responded/added more suggestions/questions to the doc. I think it
> > makes sense to have two sections: one purely protocol oriented and second
> > API oriented(examples in c++ or in any other language should make the
> idea
> > easier to digest).
> >
> > Thanks for the reference too!
> >
> > Cheers,
> > Gosh
> >
> > On Mon, Jun 21, 2021 at 4:41 PM David Li  wrote:
> >
> >> Thanks! I've left some initial comments/suggestions to expand it in
> terms
> >> of the format definitions and not the C++ APIs.
> >>
> >> I'll also note something like this was proposed a long time ago -
> there's
> >> not very much discussion about it there but for reference:
> >>
> https://lists.apache.org/thread.html/0e5ba78c48cdd0e357f3a4a6d8affd31767c34376b62c001910823af%40%3Cdev.arrow.apache.org%3E
> >> (or see the thread '[Discuss][FlightRPC] Extensions to Flight:
> >> "DoBidirectional"' from 2019-2020). It might be good to address why the
> >> proposed workaround there (union-of-structs) is insufficient for the use
> >> cases here (and in FlightSQL).
> >>
> >> -David
> >>
> >> On Mon, Jun 21, 2021, at 08:22, Gosh Arzumanyan wrote:
> >> > Ah sorry, comments should work now.
> >> >
> >> > Cheers,
> >> > Gosh
> >> >
> >> > On Mon., 21 Jun. 2021, 14:18 David Li wrote:
> >> >
> >> > > Thanks! Will give it a look.
> >> > >
> >> > > Would you mind opening it up for comments?
> >> > >
> >> > > -David
> >> > >
> >> > > On 2021/06/21 11:56:24, Gosh Arzumanyan wrote:
> >> > > > Hi folks,
> >> > > >
> >> > > > Started putting some thoughts together here:
> >> > > >
> >> > >
> >>
> https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?usp=sharing
> >> > > > Any feedback is welcome!
> >> > > >
> >> > > > Cheers,
> >> > > > Gosh
> >> > > >
> >> > >
> >> >
> >>
> >
>
>
>


Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-23 Thread Gosh Arzumanyan
Hi David,

Going through the ArrowFlight API, I got a bit confused by the DoGet and
DoPut/DoExchange APIs: the former expects a FlightDataStream, which talks
in terms of already serialized messages, while the latter accept
FlightMessageReader/Writer, which expect the user to pass in
RecordBatches etc. Is there any reason why DoGet can't have
FlightMessageWriter as a return type?

Cheers,
Gosh

On Mon, Jun 21, 2021 at 9:47 PM Gosh Arzumanyan  wrote:

> Thanks David!
>
> I also responded/added more suggestions/questions to the doc. I think it
> makes sense to have two sections: one purely protocol oriented and second
> API oriented(examples in c++ or in any other language should make the idea
> easier to digest).
>
> Thanks for the reference too!
>
> Cheers,
> Gosh
>
> On Mon, Jun 21, 2021 at 4:41 PM David Li  wrote:
>
>> Thanks! I've left some initial comments/suggestions to expand it in terms
>> of the format definitions and not the C++ APIs.
>>
>> I'll also note something like this was proposed a long time ago - there's
>> not very much discussion about it there but for reference:
>> https://lists.apache.org/thread.html/0e5ba78c48cdd0e357f3a4a6d8affd31767c34376b62c001910823af%40%3Cdev.arrow.apache.org%3E
>> (or see the thread '[Discuss][FlightRPC] Extensions to Flight:
>> "DoBidirectional"' from 2019-2020). It might be good to address why the
>> proposed workaround there (union-of-structs) is insufficient for the use
>> cases here (and in FlightSQL).
>>
>> -David
>>
>> On Mon, Jun 21, 2021, at 08:22, Gosh Arzumanyan wrote:
>> > Ah sorry, comments should work now.
>> >
>> > Cheers,
>> > Gosh
>> >
>> > On Mon., 21 Jun. 2021, 14:18 David Li wrote:
>> >
>> > > Thanks! Will give it a look.
>> > >
>> > > Would you mind opening it up for comments?
>> > >
>> > > -David
>> > >
>> > > On 2021/06/21 11:56:24, Gosh Arzumanyan wrote:
>> > > > Hi folks,
>> > > >
>> > > > Started putting some thoughts together here:
>> > > >
>> > >
>> https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?usp=sharing
>> > > > Any feedback is welcome!
>> > > >
>> > > > Cheers,
>> > > > Gosh
>> > > >
>> > >
>> >
>>
>


Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-21 Thread Gosh Arzumanyan
Thanks David!

I also responded/added more suggestions/questions to the doc. I think it
makes sense to have two sections: one purely protocol oriented and second
API oriented(examples in c++ or in any other language should make the idea
easier to digest).

Thanks for the reference too!

Cheers,
Gosh

On Mon, Jun 21, 2021 at 4:41 PM David Li  wrote:

> Thanks! I've left some initial comments/suggestions to expand it in terms
> of the format definitions and not the C++ APIs.
>
> I'll also note something like this was proposed a long time ago - there's
> not very much discussion about it there but for reference:
> https://lists.apache.org/thread.html/0e5ba78c48cdd0e357f3a4a6d8affd31767c34376b62c001910823af%40%3Cdev.arrow.apache.org%3E
> (or see the thread '[Discuss][FlightRPC] Extensions to Flight:
> "DoBidirectional"' from 2019-2020). It might be good to address why the
> proposed workaround there (union-of-structs) is insufficient for the use
> cases here (and in FlightSQL).
>
> -David
>
> On Mon, Jun 21, 2021, at 08:22, Gosh Arzumanyan wrote:
> > Ah sorry, comments should work now.
> >
> > Cheers,
> > Gosh
> >
> > On Mon., 21 Jun. 2021, 14:18 David Li wrote:
> >
> > > Thanks! Will give it a look.
> > >
> > > Would you mind opening it up for comments?
> > >
> > > -David
> > >
> > > On 2021/06/21 11:56:24, Gosh Arzumanyan wrote:
> > > > Hi folks,
> > > >
> > > > Started putting some thoughts together here:
> > > >
> > >
> https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?usp=sharing
> > > > Any feedback is welcome!
> > > >
> > > > Cheers,
> > > > Gosh
> > > >
> > >
> >
>


Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-21 Thread Gosh Arzumanyan
Ah sorry, comments should work now.

Cheers,
Gosh

On Mon., 21 Jun. 2021, 14:18 David Li,  wrote:

> Thanks! Will give it a look.
>
> Would you mind opening it up for comments?
>
> -David
>
> On 2021/06/21 11:56:24, Gosh Arzumanyan  wrote:
> > Hi folks,
> >
> > Started putting some thoughts together here:
> >
> https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?usp=sharing
> > Any feedback is welcome!
> >
> > Cheers,
> > Gosh
> >
>


Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-21 Thread Gosh Arzumanyan
Hi folks,

Started putting some thoughts together here:
https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?usp=sharing
Any feedback is welcome!

Cheers,
Gosh


Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-18 Thread Gosh Arzumanyan
Hi David,

Thanks for poking me on this. I have been thinking it over but have not got
to crafting a doc. Let me put together a rough proposal this weekend.
Afterwards I'll need your help bringing it to a reviewable state.

Cheers,
Gosh

On Fri., 18 Jun. 2021, 18:11 David Li,  wrote:

> Following up here - Gosh, did you get a chance to put something together?
> Do you need/want help on this? This would also potentially be useful for
> FlightSQL. (See the discussion on GitHub:
> https://github.com/apache/arrow/pull/9368#discussion_r572941765)
>
> Best,
> David
>
> On Fri, Apr 16, 2021, at 10:59, Gosh Arzumanyan wrote:
> > Hi guys!
> >
> > Thanks for the feedback/info.
> > Let me try to put a proposal together. Though I guess I'll need some
> > assistance on crafting it both in terms of the structure of a proposal
> > expected in the Arrow community as well as technical guidance.
> >
> > Will share a doc with some ideas shortly so that we can start to iterate
> > over it.
> >
> > Cheers,
> > Gosh
> >
> > On Tue, Apr 13, 2021 at 6:55 PM Nate Bauernfeind wrote:
> >
> > > > possibly in coordination with the Deephaven/Barrage team, if they're
> also
> > > still interested
> > >
> > > Good opportunity for me to chime in =). I think we still have interest
> in
> > > this feature. On the other thread, it took a little cajoling, but I've
> come
> > > around to agree with the conclusions of taking a RecordBatch and
> splitting
> > > it up (a set of RecordBatches for added rows followed by a set of
> > > RecordBatches for modifications). In this case I think it's best not to
> > > evolve the schema between added row RecordBatches and modified row
> > > RecordBatches (sending empty buffer nodes and field nodes will be
> > > significantly cheaper). However, the schema evolution would be very
> useful
> > > for when the rpc client changes the set of columns that they are
> subscribed
> > > to (which is relatively rare compared to when the subscribed table
> itself
> > > ticks).
> > >
> > > That said, schema evolution is not yet particularly high in our queue.
> > >
> > > On Tue, Apr 13, 2021 at 9:12 AM David Li wrote:
> > >
> > > > Thanks for the details. I'll note a few things, but adding schema
> > > > evolution to Flight is reasonable, if you'd like to put together a
> > > > proposal for discussion (possibly in coordination with the
> > > > Deephaven/Barrage team, if they're also still interested).
> > > >
> > > > >3. Assume that there is a strong reason to query A1,..,AK
> together.
> > > >
> > > > While I don't know the details here, at least with Flight/gRPC, it's
> > > > not necessarily expensive to make several requests to the same
> server,
> > > > as gRPC will consolidate them into the same underlying network
> > > > connection. You could issue one GetFlightInfo request for all streams
> > > > at once, and get back a list of endpoints for each individual
> > > > subquery, which you could then issue separate DoGet requests for.
> > > >
> > > > There's a slight mismatch there in that GetFlightInfo returns a
> > > > FlightInfo, which assumes all endpoints have the same schema. But for
> > > > a specific application, you could ignore that field (nothing in
> Flight
> > > > checks that schema against the actual data).
> > > >
> > > > Of course, if said strong reason is that all the data is really
> > > > retrieved together despite being distinct datasets, then this would
> > > > complicate the server side implementation quite a bit. But it's one
> > > > option.
> > > >
> > > > > A potential way to address this(with the existing tools) could be
> > > having
> > > > a
> > > > > union schema of all fields across all entities(potentially prefixed
> > > with
> > > > > the field name just like in sql joins) and setting the values to NA
> > > which
> > > > > do not belong to an entity.
> > > >
> > > > I had a similar use case in the past, and it was suggested to use
> > > > Arrow's Union type which handles this directly. A Union of Struct
> > > > types essentially lets you have multiple distinct schemas all encoded
> > > > in t

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Gosh Arzumanyan
This might help to get the size of the output buffer upfront:
https://github.com/apache/arrow/blob/1830d1558be8741e7412f6af30582ff457f0f34f/cpp/src/arrow/io/memory.h#L96

Though with "standard" allocators there is a risk of running into
KiPageFaults when going for buffers over 1 MB. This might be especially
painful in a multithreaded environment.

A custom output stream with a configurable buffering parameter might help to
overcome that problem without dealing too much with the allocators.
Curious to hear the community's thoughts on this.
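
For what it's worth, the "get the size upfront" idea can be sketched as a
two-pass write. This is a toy Python model with hypothetical names
(CountingSink, FixedBufferSink, serialize_preallocated) and plain byte chunks
standing in for serialized messages; it does not use the Arrow API.

```python
class CountingSink:
    """Discards the bytes and only records how many were written."""

    def __init__(self):
        self.size = 0

    def write(self, data):
        self.size += len(data)
        return len(data)


class FixedBufferSink:
    """Writes into a buffer allocated once, so it never has to grow."""

    def __init__(self, size):
        self.buffer = bytearray(size)
        self.position = 0

    def write(self, data):
        end = self.position + len(data)
        if end > len(self.buffer):
            raise ValueError("preallocated buffer too small")
        self.buffer[self.position:end] = data
        self.position = end
        return len(data)


def serialize(chunks, sink):
    for chunk in chunks:
        sink.write(chunk)


def serialize_preallocated(chunks):
    # Pass 1: measure the exact output size without storing anything.
    counter = CountingSink()
    serialize(chunks, counter)
    # Pass 2: write into a buffer of exactly that size.
    target = FixedBufferSink(counter.size)
    serialize(chunks, target)
    return bytes(target.buffer)
```

The trade-off is that the data is walked twice, in exchange for a single
allocation and no relocations while writing.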

Cheers,
Gosh

On Fri., 11 Jun. 2021, 00:45 Wes McKinney,  wrote:

> From this, it seems like seeding the RecordBatchStreamWriter's output
> stream with a much larger preallocated buffer would improve
> performance (depends on the allocator used of course).
>
> On Thu, Jun 10, 2021 at 5:40 PM Weston Pace  wrote:
> >
> > Just for some reference times from my system I created a quick test to
> > dump a ~1.7GB table to buffer(s).
> >
> > Going to many buffers (just collecting the buffers): ~11,000ns
> > Going to one preallocated buffer: ~160,000,000ns
> > Going to one dynamically allocated buffer (using a grow factor of 2x):
> > ~2,000,000,000ns
> >
> > On Thu, Jun 10, 2021 at 11:46 AM Wes McKinney 
> wrote:
> > >
> > > To be clear, we would like to help make this faster. I don't recall
> > > much effort being invested in optimizing this code path in the last
> > > couple of years, so there may be some low hanging fruit to improve the
> > > performance. Changing the in-memory data layout (the chunking) is one
> > > of the most likely things to help.
> > >
> > > On Thu, Jun 10, 2021 at 2:14 PM Gosh Arzumanyan 
> wrote:
> > > >
> > > > Hi Jayjeet,
> > > >
> > > > I wonder if you really need to serialize the whole table into a
> single
> > > > buffer as you will end up with twice the memory while you could be
> sending
> > > > chunks as they are generated by the  RecordBatchStreamWriter. Also
> is the
> > > > buffer resized beforehand? I'd suspect there might be relocations
> happening
> > > > under the hood.
> > > >
> > > >
> > > > Cheers,
> > > > Gosh
> > > >
> > > > On Thu., 10 Jun. 2021, 21:01 Wes McKinney, 
> wrote:
> > > >
> > > > > hi Jayjeet — have you run prof to see where those 1000ms are being
> > > > > spent? How many arrays (the sum of the number of chunks across all
> > > > > columns) in total are there? I would guess that the problem is all
> the
> > > > > little Buffer memcopies. I don't think that the C Interface is
> going
> > > > > to help you.
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Thu, Jun 10, 2021 at 1:48 PM Jayjeet Chakraborty
> > > > >  wrote:
> > > > > >
> > > > > > Hello Arrow Community,
> > > > > >
> > > > > > I am a student working on a project where I need to serialize an
> > > > > in-memory Arrow Table of size around 700MB to a uint8_t* buffer. I
> am
> > > > > currently using the arrow::ipc::RecordBatchStreamWriter API to
> serialize
> > > > > > the table to an arrow::Buffer, but it is taking nearly 1000ms to
> serialize
> > > > > the whole table, and that is harming the performance of my
> > > > > performance-critical application. I basically want to get hold of
> the
> > > > > underlying memory of the table as bytes and send it over the
> network. How
> > > > > do you suggest I tackle this problem? I was thinking of using the
> C Data
> > > > > interface for this, so that I convert my arrow::Table to
> ArrowArray and
> > > > > ArrowSchema and serialize the structs to send them over the
> network, but
> > > > > seems like serializing structs is another complex problem on its
> own.  It
> > > > > will be great to have some suggestions on this. Thanks a lot.
> > > > > >
> > > > > > Best,
> > > > > > Jayjeet
> > > > > >
> > > > >
>


Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Gosh Arzumanyan
Hi Jayjeet,

I wonder if you really need to serialize the whole table into a single
buffer, as you will end up using twice the memory, while you could be sending
chunks as they are generated by the RecordBatchStreamWriter. Also, is the
buffer sized beforehand? I'd suspect there might be reallocations happening
under the hood.
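
To illustrate the concern about the buffer growing under the hood, here is a
small simulation in plain Python. It is only a sketch: simulate_growth is a
hypothetical helper, and it assumes the buffer grows by a constant factor and
copies its existing contents each time it has to grow. It does not measure
what Arrow actually does.

```python
def simulate_growth(chunk_sizes, initial_capacity, grow_factor=2):
    """Count reallocations and bytes copied while appending chunks.

    Assumes that whenever the buffer is too small, its capacity is multiplied
    by grow_factor until the data fits, and that the bytes already written
    are copied into the new allocation.
    """
    capacity = max(1, initial_capacity)
    size = 0
    reallocations = 0
    bytes_moved = 0
    for n in chunk_sizes:
        needed = size + n
        if needed > capacity:
            while capacity < needed:
                capacity *= grow_factor
            reallocations += 1
            bytes_moved += size  # existing contents move to the new block
        size = needed
    return reallocations, bytes_moved


chunks = [1024] * 1024  # 1 MiB written in 1 KiB pieces

grown = simulate_growth(chunks, initial_capacity=1024)
reserved = simulate_growth(chunks, initial_capacity=1024 * 1024)
```

With these inputs the growing buffer is reallocated 10 times and copies
almost as many bytes again as the payload itself, while the buffer reserved
up front is never reallocated.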


Cheers,
Gosh

On Thu., 10 Jun. 2021, 21:01 Wes McKinney wrote:

> hi Jayjeet — have you run prof to see where those 1000ms are being
> spent? How many arrays (the sum of the number of chunks across all
> columns) in total are there? I would guess that the problem is all the
> little Buffer memcopies. I don't think that the C Interface is going
> to help you.
>
> - Wes
>
> On Thu, Jun 10, 2021 at 1:48 PM Jayjeet Chakraborty
>  wrote:
> >
> > Hello Arrow Community,
> >
> > I am a student working on a project where I need to serialize an
> in-memory Arrow Table of size around 700MB to a uint8_t* buffer. I am
> currently using the arrow::ipc::RecordBatchStreamWriter API to serialize
> the table to an arrow::Buffer, but it is taking nearly 1000ms to serialize
> the whole table, and that is harming the performance of my
> performance-critical application. I basically want to get hold of the
> underlying memory of the table as bytes and send it over the network. How
> do you suggest I tackle this problem? I was thinking of using the C Data
> interface for this, so that I convert my arrow::Table to ArrowArray and
> ArrowSchema and serialize the structs to send them over the network, but
> seems like serializing structs is another complex problem on its own.  It
> will be great to have some suggestions on this. Thanks a lot.
> >
> > Best,
> > Jayjeet
> >
>


Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-04-16 Thread Gosh Arzumanyan
Hi guys!

Thanks for the feedback/info.
Let me try to put a proposal together. Though I guess I'll need some
assistance on crafting it both in terms of the structure of a proposal
expected in the Arrow community as well as technical guidance.

Will share a doc with some ideas shortly so that we can start to iterate
over it.

Cheers,
Gosh

On Tue, Apr 13, 2021 at 6:55 PM Nate Bauernfeind <
natebauernfe...@deephaven.io> wrote:

> > possibly in coordination with the Deephaven/Barrage team, if they're also
> still interested
>
> Good opportunity for me to chime in =). I think we still have interest in
> this feature. On the other thread, it took a little cajoling, but I've come
> around to agree with the conclusions of taking a RecordBatch and splitting
> it up (a set of RecordBatches for added rows followed by a set of
> RecordBatches for modifications). In this case I think it's best not to
> evolve the schema between added row RecordBatches and modified row
> RecordBatches (sending empty buffer nodes and field nodes will be
> significantly cheaper). However, the schema evolution would be very useful
> for when the rpc client changes the set of columns that they are subscribed
> to (which is relatively rare compared to when the subscribed table itself
> ticks).
>
> That said, schema evolution is not yet particularly high in our queue.
>
> On Tue, Apr 13, 2021 at 9:12 AM David Li  wrote:
>
> > Thanks for the details. I'll note a few things, but adding schema
> > evolution to Flight is reasonable, if you'd like to put together a
> > proposal for discussion (possibly in coordination with the
> > Deephaven/Barrage team, if they're also still interested).
> >
> > >3. Assume that there is a strong reason to query A1,..,AK together.
> >
> > While I don't know the details here, at least with Flight/gRPC, it's
> > not necessarily expensive to make several requests to the same server,
> > as gRPC will consolidate them into the same underlying network
> > connection. You could issue one GetFlightInfo request for all streams
> > at once, and get back a list of endpoints for each individual
> > subquery, which you could then issue separate DoGet requests for.
> >
> > There's a slight mismatch there in that GetFlightInfo returns a
> > FlightInfo, which assumes all endpoints have the same schema. But for
> > a specific application, you could ignore that field (nothing in Flight
> > checks that schema against the actual data).
> >
> > Of course, if said strong reason is that all the data is really
> > retrieved together despite being distinct datasets, then this would
> > complicate the server side implementation quite a bit. But it's one
> > option.
> >
> > > A potential way to address this(with the existing tools) could be
> having
> > a
> > > union schema of all fields across all entities(potentially prefixed
> with
> > > the field name just like in sql joins) and setting the values to NA
> which
> > > do not belong to an entity.
> >
> > I had a similar use case in the past, and it was suggested to use
> > Arrow's Union type which handles this directly. A Union of Struct
> > types essentially lets you have multiple distinct schemas all encoded
> > in the same overall table, with explicit information about which
> > schema is currently in use. But as you point out this isn't helpful if
> > you don't know all the schemas up front.
> >
> > Best,
> > David
> >
> > On 2021/04/13 11:21:20, Gosh Arzumanyan  wrote:
> > > Hi David,
> > >
> > > Thanks for sharing the link!
> > >
> > > Here is how a potential use case might look like:
> > >
> > >1. Assume that we have a service S which accepts expressions in some
> > >language X.
> > >2. Assume that a typical query to this service requests entities
> A_1,
> > >A_2,..,A_K. Each of those entities generates a stream of record
> > batches.
> > >Record batches for a single A_I share the same schema, yet there is
> no
> > >guarantee that schemas are equal across all streams.
> > >3. Assume that there is a strong reason to query A1,..,AK together.
> > >4. Service generates record batches(concurrently), tags those(e.g.
> > with
> > >schema level metadata) and sends them over.
> > >
> > > A potential way to address this(with the existing tools) could be
> having
> > a
> > > union schema of all fields across all entities(potentially prefixed
> with
> > > the field name just like in sql joins) and setting t

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-04-13 Thread Gosh Arzumanyan
Hi David,

Thanks for sharing the link!

Here is what a potential use case might look like:

   1. Assume that we have a service S which accepts expressions in some
   language X.
   2. Assume that a typical query to this service requests entities A_1,
   A_2,..,A_K. Each of those entities generates a stream of record batches.
   Record batches for a single A_I share the same schema, yet there is no
   guarantee that schemas are equal across all streams.
   3. Assume that there is a strong reason to query A_1,..,A_K together.
   4. Service generates record batches (concurrently), tags those (e.g. with
   schema level metadata) and sends them over.
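
The receiving side of step 4 could be sketched roughly as follows, in plain
Python rather than Flight. Everything here is an assumption for illustration:
demultiplex is a hypothetical helper, each message is modelled as a tuple of
(entity tag, schema, rows), and the schema is a plain tuple of field names.

```python
from collections import defaultdict


def demultiplex(tagged_batches):
    """Split one interleaved stream back into per-entity streams.

    tagged_batches is an iterable of (entity_tag, schema, rows) in arrival
    order. Schemas may differ between entities but must stay the same
    within a single entity.
    """
    schemas = {}
    streams = defaultdict(list)
    for tag, schema, rows in tagged_batches:
        expected = schemas.setdefault(tag, schema)
        if expected != schema:
            raise ValueError(f"schema changed within stream {tag!r}")
        streams[tag].append(rows)
    return dict(streams)


interleaved = [
    ("A_1", ("ts", "price"), [(1, 9.5)]),
    ("A_2", ("ts", "volume"), [(1, 100)]),
    ("A_1", ("ts", "price"), [(2, 9.7)]),
]
per_entity = demultiplex(interleaved)
```

The tag is what lets the client tell the batches apart; the open question is
how to carry batches with different schemas over a single stream at all.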

A potential way to address this (with the existing tools) could be having a
union schema of all fields across all entities (potentially prefixed with
the field name just like in SQL joins) and setting to NA the values that
do not belong to a given entity. However, this solution might not work in cases
where we are not able to construct the unified schema before opening the
stream (e.g. in case of changes in the schema for a specific entity upon
realtime input feeding or an unpredictable generator expression).
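
A rough sketch of the union-schema workaround, in plain Python with dicts
standing in for Arrow schemas and rows. The helpers build_union_schema and
widen_row are hypothetical, and the sketch assumes the prefix is the entity
name and that NA is represented by None.

```python
def build_union_schema(entity_schemas):
    """Flatten {entity: [field, ...]} into one list of prefixed column names."""
    columns = []
    for entity, fields in entity_schemas.items():
        for field in fields:
            columns.append(f"{entity}.{field}")
    return columns


def widen_row(entity, row, union_columns):
    """Place one entity's row into the union schema.

    Columns that belong to other entities are filled with None (NA).
    """
    prefixed = {f"{entity}.{name}": value for name, value in row.items()}
    return [prefixed.get(column) for column in union_columns]


schemas = {"A_1": ["ts", "price"], "A_2": ["ts", "volume"]}
columns = build_union_schema(schemas)
wide = widen_row("A_2", {"ts": 5, "volume": 10}, columns)
```

Note that build_union_schema needs every entity's schema before the first row
is sent, which is exactly the limitation described above.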

Cheers,
Gosh


On Mon., 12 Apr. 2021, 13:45 David Li wrote:

> Hi Gosh,
>
> There was indeed a discussion where schema evolution was proposed as a
> solution for another use case:
>
> https://lists.apache.org/thread.html/re800c63f0eb08022c8cd5e1b2236fd69a2e85afdc34daf6b75e3b7b3%40%3Cdev.arrow.apache.org%3E
>
> I am curious though, what is your use case here?
>
> Best,
> David
>
> On 2021/04/12 10:49:00, Gosh Arzumanyan  wrote:
> > Hi guys, hope you are well!
> >
> > Judging from the Flight API
> > <
> https://github.com/apache/arrow/blob/5b08205f7e864ed29f53ed3d836845fed62d5d4a/cpp/src/arrow/flight/types.h#L461
> >
> > and
> > from the documentation/examples out there, it seems like data schema is
> > supposed to be fixed per stream in ArrowFlight(which is also aligned with
> > corresponding IPC stream writers/readers).
> > Wondering if the community has evaluated the necessity/possibility of
> > supporting schema changes within a single stream(I do recall seeing a
> > discussion on this somewhere but can't find it)?
> >
> > Cheers,
> > Gosh
> >
>


[INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-04-12 Thread Gosh Arzumanyan
Hi guys, hope you are well!

Judging from the Flight API
<https://github.com/apache/arrow/blob/5b08205f7e864ed29f53ed3d836845fed62d5d4a/cpp/src/arrow/flight/types.h#L461>
and from the documentation/examples out there, it seems like the data schema is
supposed to be fixed per stream in ArrowFlight (which is also aligned with
corresponding IPC stream writers/readers).
Wondering if the community has evaluated the necessity/possibility of
supporting schema changes within a single stream (I do recall seeing a
discussion on this somewhere but can't find it)?

Cheers,
Gosh