Re: Arrow Flight usage with graph databases

Lee, David Thu, 28 Jul 2022 01:25:36 -0700

I believe GraphQL supports cursor-based pagination for interacting
with web apps, which could be used to construct record batches.
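
A minimal sketch of that idea, using only the Python standard library.
fetch_page() and ROWS are hypothetical stand-ins for a cursor-paginated
GraphQL endpoint, not part of Ariadne or pyarrow. Each page is yielded as
soon as it arrives, so each one could in principle be turned into one Arrow
record batch without waiting for the full result:

```python
from typing import Iterator, Optional

# Hypothetical stand-in for the data behind a cursor-paginated GraphQL endpoint.
ROWS = [{"id": i, "name": f"vertex-{i}"} for i in range(7)]


def fetch_page(cursor: Optional[int], page_size: int) -> dict:
    """Return one page of rows plus the cursor for the next page (None at the end)."""
    start = 0 if cursor is None else cursor
    rows = ROWS[start:start + page_size]
    end = start + len(rows)
    return {"rows": rows, "next_cursor": end if end < len(ROWS) else None}


def iter_batches(page_size: int) -> Iterator[list]:
    """Yield each page as soon as it arrives instead of collecting all pages first."""
    cursor = None
    while True:
        page = fetch_page(cursor, page_size)
        if page["rows"]:
            # Each page could become one Arrow record batch at this point.
            yield page["rows"]
        cursor = page["next_cursor"]
        if cursor is None:
            break


batch_sizes = [len(batch) for batch in iter_batches(page_size=3)]
print(batch_sizes)
```

The page size then acts as the record batch size, which is the knob that
trades time to first batch against per-batch overhead.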


> On Jul 27, 2022, at 5:45 PM, Matthew Topol <m...@voltrondata.com.invalid> 
> wrote:
> 
> Yeah, the drawback you'll find there is that with that setup you can't
> effectively stream record batches as they become available, because you
> wait for all of the results before converting to an Arrow table.
> 
> The result is higher memory usage for larger result sets, and your time
> to first byte is bottlenecked by the whole request instead of getting the
> first record batch immediately.
> 
> If your requests are small on average and/or are very quick to come back
> then these aren't necessarily issues for your use case, lol.
> 
> --Matt
> 
>> On Wed, Jul 27, 2022, 8:32 PM Lee, David <david....@blackrock.com.invalid>
>> wrote:
>> 
>> Correct, more or less. It is Arrow Flight native end to end.
>> 
>> The GraphQL query is a string (saved as a Flight Ticket) that is sent from
>> a client using Arrow Flight RPC.
>> The GraphQL query is executed on the GraphQL Flight server, which produces
>> Python record objects (JSON-structured records).
>> Those Python record objects are then converted into an Arrow-formatted
>> table using pa.Table.from_pylist().
>> The Arrow table is then sent back to the client to complete the original
>> Flight RPC request.
>> 
>> -----Original Message-----
>> From: Matthew Topol <m...@voltrondata.com.INVALID>
>> Sent: Wednesday, July 27, 2022 5:10 PM
>> To: dev@arrow.apache.org
>> Subject: Re: Arrow Flight usage with graph databases
>> 
>> So this is slightly different from what I was doing and spoke about. As far
>> as I can tell from your links, you are evaluating the GraphQL using that
>> GraphQL server and then converting the JSON response into Arrow format
>> (correct me if I'm wrong please).
>> 
>> What I did was to hook into a GraphQL parser and make my own evaluator,
>> which was Arrow-native the whole way through, using the GraphQL request to
>> define the resulting Arrow schema based on the shape of the requested data.
>> I had a planner and executor, with the executor using the plan to set up a
>> pipeline to stream the record batches through.
>> 
>> Just something to think about :)
>> 
>> --Matt
>> 
>> On Wed, Jul 27, 2022, 7:19 PM Lee, David <david....@blackrock.com.invalid>
>> wrote:
>> 
>>> I'm working on something similar for Ariadne, which is a Python GraphQL
>>> server package.
>>> 
>>> 
>>> https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_arrow_flight_server.py
>>> 
>>> https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_asgi_arrow_client.py
>>> 
>>> I'm basically calling pa.Table.from_pylist, which infers the schema
>>> from the first JSON record, but that record could be incomplete, so
>>> passing a schema is preferable.
>>> 
>>> arrow_data = pa.Table.from_pylist([result])
>>> 
>>> Basically, I need to look at the GraphQL query and then go into the
>>> GraphQL SDL (Schema Definition Language) and generate an equivalent
>>> Arrow schema based on the subset of data points requested.
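
A rough sketch of that mapping step, assuming the SDL type has already been
parsed into a plain dict. SDL_FIELDS, GRAPHQL_TO_ARROW and arrow_fields_for()
are hypothetical names for illustration, and the scalar-to-Arrow mapping is
one reasonable choice, not something taken from the linked code:

```python
# Hypothetical, hand-written view of one SDL type: field name -> GraphQL scalar.
SDL_FIELDS = {
    "id": "ID",
    "name": "String",
    "age": "Int",
    "score": "Float",
    "active": "Boolean",
}

# Assumed mapping from GraphQL built-in scalars to Arrow type names.
GRAPHQL_TO_ARROW = {
    "ID": "string",
    "String": "string",
    "Int": "int32",      # GraphQL Int is a signed 32-bit integer
    "Float": "float64",  # GraphQL Float is double precision
    "Boolean": "bool",
}


def arrow_fields_for(selection):
    """Return (name, arrow_type_name) pairs for only the requested fields, in request order."""
    fields = []
    for name in selection:
        if name not in SDL_FIELDS:
            raise KeyError(f"field {name!r} is not defined in the SDL type")
        fields.append((name, GRAPHQL_TO_ARROW[SDL_FIELDS[name]]))
    return fields


requested = ["name", "age"]  # e.g. the fields selected by `{ person { name age } }`
schema_fields = arrow_fields_for(requested)
print(schema_fields)
```

The resulting pairs describe the schema that would be passed alongside the
records, so a missing key in the first record no longer drops a column.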
>>> 
>>> -----Original Message-----
>>> From: Gavin Ray <ray.gavi...@gmail.com>
>>> Sent: Wednesday, July 20, 2022 11:15 AM
>>> To: dev@arrow.apache.org
>>> Subject: Re: Arrow Flight usage with graph databases
>>> 
>>>> 
>>>> We considered the option to analyze data to build a schema on the
>>>> fly, however it will be quite an expensive operation which will not
>>>> allow us to get performance benefits from using Arrow Flight.
>>> 
>>> 
>>> I'm not sure you'll be able to avoid generating a schema on the fly
>>> if it's anything like SQL or GraphQL queries, since each query
>>> would have a unique shape based on the user's selection.
>>> 
>>> Have you benchmarked this, out of curiosity?
>>> (It's not an uncommon use case from what I've seen.)
>>> 
>>> For example, Matt Topol does this to dynamically generate response
>>> schemas in his implementation of GraphQL-via-Flight and he says the
>>> overhead is negligible.
>>> 
>>> On Tue, Jul 19, 2022 at 11:52 PM Valentyn Kahamlyk <
>>> valent...@bitquilltech.com.invalid> wrote:
>>> 
>>>> Hi David,
>>>> 
>>>> We are planning to use Flight for the prototype. We are also
>>>> planning to use Flight SQL as a reference, however we wanted to
>>>> explore ideas whether Arrow Flight Graph can be implemented on top
>>>> of Arrow Flight (similar to Arrow Flight SQL).
>>>> 
>>>> Graph databases generally do not expose or enforce schema, which
>>>> indeed makes it challenging. While we do have ideas on building
>>>> extensions for graph databases to add schema, and we do see some
>>>> other ideas related to this, we will not be able to rely on this as
>>>> part of the initial prototype.
>>>> We considered the option to analyze data to build a schema on the
>>>> fly, however it will be quite an expensive operation which will not
>>>> allow us to get performance benefits from using Arrow Flight.
>>>> 
>>>>> What type/size metadata are you referring to?
>>>> Metadata usually includes information about data type, size and
>>>> type-specific properties. Some complex types are made up of 10 or
>>>> more parts. Each Vertex or Edge of graph can have its own distinct
>>>> set of properties, but the total number of types is several dozen
>>>> and this can serve as a basis for constructing a schema. The total
>>>> size of metadata can be quite big, as we wanted to support cases
>>>> where the graph database can be very large (e.g. hundreds of GBs,
>>>> with vertices and edges possibly containing different properties).
>>>> More information about the serialization format we are using right
>>>> now can be found at
>>>> https://tinkerpop.apache.org/docs/3.5.4/dev/io/#graphbinary .
>>>> 
>>>>> So effectively, the internal format is being carried in a
>>>>> string/binary column?
>>>> Yes, I am considering this option for the first stage of
>>>> implementation.
>>>> 
>>>> David, thank you again for your reply, and please let me know your
>>>> thoughts or whether you might have any suggestions around adopting
>>>> Arrow Flight for schema-less databases.
>>>> 
>>>> Regards, Valentyn.
>>>> 
>>>> On Mon, Jul 18, 2022 at 5:23 PM David Li <lidav...@apache.org> wrote:
>>>> 
>>>>> Hi Valentyn,
>>>>> 
>>>>> Just to make sure, is this Flight or Flight SQL? I ask since
>>>>> Flight itself does not have a notion of transactions in the first
>>>>> place. I'm also curious what the intended target client
>>>>> application is.
>>>>> 
>>>>> Not being familiar with graph databases myself, I'll try to give
>>>>> some comments…
>>>>> 
>>>>> Lack of a schema does make things hard. There were some prior
>>>>> discussions about schema evolution during a (Flight) data stream,
>>>>> which would let you add/remove fields as the query progresses. And
>>>>> unions would let you accommodate inconsistent types. But if the
>>>>> changes are frequent, you'd negate many of the benefits of
>>>>> Arrow/Flight. And both of these could make client-side usage
>>>>> inconvenient.
>>>>> 
>>>>> What type/size metadata are you referring to? Presumably, this
>>>>> would instead end up in the schema, once using Arrow?
>>>>> 
>>>>> Is there any possibility to (say) unify (chunks of) the result to
>>>>> a consistent schema at least? Or possibly, encoding (some)
>>>>> properties as a Map<String, Union<...>> instead of as columns.
>>>>> (This negates the benefits of columnar data, of course, if you are
>>>>> interested in a particular property, but if you know those
>>>>> properties up front, the server could pull those out into
>>>>> (consistently typed) columns.)
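
To make the suggestion above concrete, here is a small standard-library
sketch of splitting schema-less vertex properties into fixed columns plus a
leftover map whose values carry a type tag (standing in for the Union).
KNOWN_COLUMNS and split_properties() are hypothetical names, and the
(type name, value) tuples only imitate a union's type id; this is not how
Arrow stores unions internally:

```python
KNOWN_COLUMNS = ("name", "age")  # properties assumed to be known up front


def split_properties(vertices, known=KNOWN_COLUMNS):
    """Split each vertex's properties into fixed columns plus a leftover map."""
    columns = {name: [] for name in known}
    extras = []
    for props in vertices:
        for name in known:
            # None marks a vertex that lacks this property.
            columns[name].append(props.get(name))
        # Tag every leftover value with its type name, like a union's type id.
        extras.append(
            {k: (type(v).__name__, v) for k, v in props.items() if k not in known}
        )
    return columns, extras


vertices = [
    {"name": "alice", "age": 30, "weight": 0.5},
    {"name": "bob", "tags": "admin"},
]
columns, extras = split_properties(vertices)
print(columns)
print(extras)
```

The known properties stay columnar and consistently typed, while anything
unexpected still travels with the row instead of forcing a schema change.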
>>>>> 
>>>>>> We are currently working on a prototype in which we are trying
>>>>>> to use Arrow Flight as a transport for transmitting requests and
>>>>>> data to Gremlin Server. Serialization is still based on an
>>>>>> internal format due to schema creation complexity.
>>>>> 
>>>>> So effectively, the internal format is being carried in a
>>>>> string/binary column?
>>>>> 
>>>>> On Mon, Jul 18, 2022, at 19:55, Valentyn Kahamlyk wrote:
>>>>>> Hi All,
>>>>>> 
>>>>>> I'm investigating the possibility of using Arrow Flight with
>>>>>> graph databases, and exploring how to enable an Arrow Flight
>>>>>> endpoint in the Apache TinkerPop Gremlin Server.
>>>>>> 
>>>>>> Graph databases currently use several incompatible protocols,
>>>>>> which makes it difficult to use and spread the technology.
>>>>>> Common features of graph databases are:
>>>>>> 1. Lack of a schema. Each vertex of the graph can have its own
>>>>>> set of properties, including properties with the same name but
>>>>>> different types. Metadata such as type and size are also passed
>>>>>> with each value, which increases the amount of data transferred.
>>>>>> Some data types are not supported by all languages.
>>>>>> 2. Internal representation of data is different for all
>>>>>> implementations. For data exchange we used a set of formats like
>>>>>> customized JSON and custom binary, but we would like to get a
>>>>>> performance gain from using Arrow Flight.
>>>>>> 3. The difference in concepts like transactions, sessions, etc.
>>>>>> Conceptually this may differ from the implementation in SQL.
>>>>>> Gremlin Server does not natively support transactions, so we use
>>>>>> the Neo4j plugin.
>>>>>> 
>>>>>> We are currently working on a prototype in which we are trying
>>>>>> to use Arrow Flight as a transport for transmitting requests and
>>>>>> data to Gremlin Server. Serialization is still based on an
>>>>>> internal format due to schema creation complexity.
>>>>>> 
>>>>>> Ideas are welcome.
>>>>>> 
>>>>>> Regards, Valentyn
>>>>> 
>>>> 
>>> 
>>
