Re: Arrow Flight usage with graph databases

Valentyn Kahamlyk Mon, 08 Aug 2022 13:52:20 -0700

Hello all,

The Gremlin Arrow Flight proof of concept demo will take place in the
TinkerPop Discord channel
https://discord.gg/renSpn8K?event=1006205553749545070 on Friday, Aug
12, 10:30am PST/1:30pm ET.


Feel free to join us if you are interested! Join the channel at
https://discord.gg/ndMpKZcBEE (Discord registration/login may be
required).


On Fri, Aug 5, 2022 at 3:01 PM Valentyn Kahamlyk <valent...@bitquilltech.com>
wrote:

> Hello all,
>
> I'm hosting a short demo in Discord of a proof of concept that uses
> Arrow Flight with Gremlin, with string queries and GraphSON for
> serialization. Any questions and comments are welcome. The next step will
> be to create the full designs based on the proof of concept.
>
> The planned date is Aug 12; I will follow up with the exact time later.
>
> On Thu, Jul 28, 2022 at 1:26 AM Lee, David <david....@blackrock.com.invalid>
> wrote:
>
>>
>> I believe the graphql spec supports both pagination and cursors for
>> interacting with web apps which could be used to construct record batches.
>>
>> > On Jul 27, 2022, at 5:45 PM, Matthew Topol <m...@voltrondata.com.invalid>
>> wrote:
>> >
>> > Yea, the drawback you'll find there is that you can't effectively stream
>> > record batches as they are available with that setup as you wait for
>> > all of the results before converting to an Arrow table.
>> >
>> > The result is higher memory usage necessary for larger result sets and
>> > your time to the first byte is bottlenecked by the whole request instead
>> > of getting the first record batch immediately.
>> >
>> > If your requests are small on average and/or are very quick to come back
>> > then these aren't necessarily issues for your use case, lol.
>> >
>> > --Matt
>> >
>> >> On Wed, Jul 27, 2022, 8:32 PM Lee, David <david....@blackrock.com
>> .invalid>
>> >> wrote:
>> >>
>> >> Correct, more or less. It is Arrow Flight native end to end.
>> >>
>> >> The GraphQL query is a string (saved as a Flight Ticket) that is sent
>> >> from a client using Arrow Flight RPC.
>> >> The GraphQL query is executed on the GraphQL Flight server, which
>> >> produces Python record objects (JSON-structured records).
>> >> Those Python record objects are then converted into an Arrow-formatted
>> >> Table using pa.Table.from_pylist().
>> >> The Arrow Table is then sent back to the client to complete the
>> >> original Flight RPC request.
>> >>
>> >> -----Original Message-----
>> >> From: Matthew Topol <m...@voltrondata.com.INVALID>
>> >> Sent: Wednesday, July 27, 2022 5:10 PM
>> >> To: dev@arrow.apache.org
>> >> Subject: Re: Arrow Flight usage with graph databases
>> >>
>> >> So this is slightly different from what I was doing and spoke about. As
>> >> far as I can tell from your links, you are evaluating the GraphQL using
>> >> that GraphQL server and then converting the JSON response into Arrow
>> >> format (correct me if I'm wrong, please).
>> >>
>> >> What I did was to hook into a GraphQL parser and make my own evaluator,
>> >> which was Arrow-native the whole way through, using the GraphQL request
>> >> to define the resulting Arrow schema based on the shape of the requested
>> >> data. I had a planner and executor, with the executor using the plan to
>> >> set up a pipeline to stream the record batches through.
>> >>
>> >> Just something to think about :)
>> >>
>> >> --Matt
>> >>
>> >> On Wed, Jul 27, 2022, 7:19 PM Lee, David <david....@blackrock.com
>> .invalid>
>> >> wrote:
>> >>
>> >>> I'm working on something similar for Ariadne which is a python graphql
>> >>> server package.
>> >>>
>> >>>
>> >>>
>> >>> https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_arrow_flight_server.py
>> >>>
>> >>>
>> >>> https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_asgi_arrow_client.py
>> >>>
>> >>> I'm basically calling pa.Table.from_pylist which infers the schema
>> >>> from the first json record, but that record could be incomplete so
>> >>> passing a schema is preferable.
>> >>>
>> >>> arrow_data = pa.Table.from_pylist([result])
>> >>>
>> >>> Basically I need to look at the graphql query and then go into the
>> >>> graphql SDL (Schema Definition Language) and generate an equivalent
>> >>> Arrow schema based on the subset of data points requested.
>> >>>
>> >>> -----Original Message-----
>> >>> From: Gavin Ray <ray.gavi...@gmail.com>
>> >>> Sent: Wednesday, July 20, 2022 11:15 AM
>> >>> To: dev@arrow.apache.org
>> >>> Subject: Re: Arrow Flight usage with graph databases
>> >>>
>> >>>>
>> >>>> We considered the option to analyze data to build a schema on the
>> >>>> fly, however it will be quite an expensive operation which will not
>> >>>> allow us to get performance benefits from using Arrow Flight.
>> >>>
>> >>>
>> >>> I'm not sure if you'll be able to avoid generating a schema on the
>> >>> fly, if it's anything like SQL or GraphQL queries since each query
>> >>> would have a unique shape based on the user's selection.
>> >>>
>> >>> Have you benchmarked this, out of curiosity?
>> >>> (It's not an uncommon use case from what I've seen.)
>> >>>
>> >>> For example, Matt Topol does this to dynamically generate response
>> >>> schemas in his implementation of GraphQL-via-Flight and he says the
>> >>> overhead is negligible.
>> >>>
>> >>> On Tue, Jul 19, 2022 at 11:52 PM Valentyn Kahamlyk <
>> >>> valent...@bitquilltech.com.invalid> wrote:
>> >>>
>> >>>> Hi David,
>> >>>>
>> >>>> We are planning to use Flight for the prototype. We are also
>> >>>> planning to use Flight SQL as a reference, however we wanted to
>> >>>> explore ideas whether Arrow Flight Graph can be implemented on top
>> >>>> of Arrow Flight (similar to Arrow Flight SQL).
>> >>>>
>> >>>> Graph databases generally do not expose or enforce schema, which
>> >>>> indeed makes it challenging. While we do have ideas on building
>> >>>> extensions for graph databases to add schema, and we do see some
>> >>>> other ideas related to this, we will not be able to rely on this as
>> >>>> part of the initial prototype.
>> >>>> We considered the option to analyze data to build a schema on the
>> >>>> fly, however it will be quite an expensive operation which will not
>> >>>> allow us to get performance benefits from using Arrow Flight.
>> >>>>
>> >>>>> What type/size metadata are you referring to?
>> >>>> Metadata usually includes information about data type, size and
>> >>>> type-specific properties. Some complex types are made up of 10 or
>> >>>> more parts. Each Vertex or Edge of graph can have its own distinct
>> >>>> set of properties, but the total number of types is several dozen
>> >>>> and this can serve as a basis for constructing a schema. The total
>> >>>> size of metadata can be quite big, as we wanted to support cases
>> >>>> where the graph database can be very large (e.g. hundreds of GBs,
>> >>>> with vertices and edges possibly containing different properties).
>> >>>> More information about the serialization format we are using right
>> >>>> now can be found at
>> >>>> https://tinkerpop.apache.org/docs/3.5.4/dev/io/#graphbinary
>> >>>>
>> >>>>> So effectively, the internal format is being carried in a
>> >>>>> string/binary column?
>> >>>> Yes, I am considering this option for the first stage of
>> >>>> implementation.
>> >>>>
>> >>>> David, thank you again for your reply, and please let me know your
>> >>>> thoughts or whether you might have any suggestions around adopting
>> >>>> Arrow Flight for schema-less databases.
>> >>>>
>> >>>> Regards, Valentyn.
>> >>>>
>> >>>> On Mon, Jul 18, 2022 at 5:23 PM David Li <lidav...@apache.org>
>> wrote:
>> >>>>
>> >>>>> Hi Valentyn,
>> >>>>>
>> >>>>> Just to make sure, is this Flight or Flight SQL? I ask since
>> >>>>> Flight itself does not have a notion of transactions in the first
>> >>>>> place. I'm also curious what the intended target client application
>> >>>>> is.
>> >>>>>
>> >>>>> Not being familiar with graph databases myself, I'll try to give
>> >>>>> some comments…
>> >>>>>
>> >>>>> Lack of a schema does make things hard. There were some prior
>> >>>>> discussions about schema evolution during a (Flight) data stream,
>> >>>>> which would let you add/remove fields as the query progresses. And
>> >>>>> unions would let you accommodate inconsistent types. But if the
>> >>>>> changes are frequent, you'd negate many of the benefits of
>> >>>>> Arrow/Flight. And both of these could make client-side usage
>> >>>>> inconvenient.
>> >>>>>
>> >>>>> What type/size metadata are you referring to? Presumably, this
>> >>>>> would instead end up in the schema, once using Arrow?
>> >>>>>
>> >>>>> Is there any possibility to (say) unify (chunks of) the result to
>> >>>>> a consistent schema at least? Or possibly, encoding (some)
>> >>>>> properties as a Map<String, Union<...>> instead of as columns.
>> >>>>> (This negates the benefits of columnar data, of course, if you are
>> >>>>> interested in a particular property, but if you know those
>> >>>>> properties up front, the server could pull those out into
>> >>>>> (consistently typed) columns.)
>> >>>>>
>> >>>>>> We are currently working on a prototype in which we are trying
>> >>>>>> to use Arrow Flight as a transport for transmitting requests and
>> >>>>>> data to Gremlin Server. Serialization is still based on an internal
>> >>>>>> format due to schema creation complexity.
>> >>>>>
>> >>>>> So effectively, the internal format is being carried in a
>> >>>>> string/binary column?
>> >>>>>
>> >>>>> On Mon, Jul 18, 2022, at 19:55, Valentyn Kahamlyk wrote:
>> >>>>>> Hi All,
>> >>>>>>
>> >>>>>> I'm investigating the possibility of using Arrow Flight with
>> >>>>>> graph databases, and exploring how to enable an Arrow Flight
>> >>>>>> endpoint in Apache TinkerPop Gremlin Server.
>> >>>>>>
>> >>>>>> Graph databases currently use several incompatible protocols, which
>> >>>>>> makes the technology difficult to use and to spread.
>> >>>>>> Common features of graph databases are:
>> >>>>>> 1. Lack of a schema. Each vertex of the graph can have its own set
>> >>>>>> of properties, including properties with the same name but
>> >>>>>> different types. Metadata such as type and size are also passed
>> >>>>>> with each value, which increases the amount of data transferred.
>> >>>>>> Some data types are not supported by all languages.
>> >>>>>> 2. Internal representation of data is different for all
>> >>>>>> implementations. For data exchange we used a set of formats like
>> >>>>>> customized JSON and custom binary, but we would like to get a
>> >>>>>> performance gain from using Arrow Flight.
>> >>>>>> 3. The difference in concepts like transactions, sessions, etc.
>> >>>>>> Conceptually this may differ from the implementation in SQL.
>> >>>>>> Gremlin Server does not natively support transactions, so we use
>> >>>>>> the Neo4J plugin.
>> >>>>>>
>> >>>>>> We are currently working on a prototype in which we are trying
>> >>>>>> to use Arrow Flight as a transport for transmitting requests and
>> >>>>>> data to Gremlin Server. Serialization is still based on an internal
>> >>>>>> format due to schema creation complexity.
>> >>>>>>
>> >>>>>> Ideas are welcome.
>> >>>>>>
>> >>>>>> Regards, Valentyn
>> >>>>>
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>
>>
>
