Hello all,

The Gremlin Arrow Flight proof of concept demo will take place in the TinkerPop Discord channel https://discord.gg/renSpn8K?event=1006205553749545070 on Friday, Aug 12, 10:30am PST/1:30pm ET.
Feel free to join us if you are interested! Join the channel at https://discord.gg/ndMpKZcBEE; Discord registration/login may be required.

On Fri, Aug 5, 2022 at 3:01 PM Valentyn Kahamlyk <valent...@bitquilltech.com> wrote:

> Hello all,
>
> I'm hosting a short proof-of-concept demo in Discord, using Arrow Flight with Gremlin, with string queries and GraphSON for serialization. Any questions and comments are welcome. The next step will be to create the full designs based on the proof of concept.
>
> The planned date is Aug 12; I will follow up with the exact time later.
>
> On Thu, Jul 28, 2022 at 1:26 AM Lee, David <david....@blackrock.com.invalid> wrote:
>
>> I believe the graphql spec supports both pagination and cursors for interacting with web apps which could be used to construct record batches.
>>
>> > On Jul 27, 2022, at 5:45 PM, Matthew Topol <m...@voltrondata.com.invalid> wrote:
>> >
>> > External Email: Use caution with links and attachments
>> >
>> > Yea, the drawback you'll find there is that you can't effectively stream record batches as they become available with that setup, as you wait for all of the results before converting to an Arrow table.
>> >
>> > The result is higher memory usage for larger result sets, and your time to the first byte is bottlenecked by the whole request instead of getting the first record batch immediately.
>> >
>> > If your requests are small on average and/or are very quick to come back then these aren't necessarily issues for your use case, lol.
>> >
>> > --Matt
>> >
>> >> On Wed, Jul 27, 2022, 8:32 PM Lee, David <david....@blackrock.com.invalid> wrote:
>> >>
>> >> Correct more or less.. It is Arrow Flight Native end to end.
>> >>
>> >> The GraphQL query is a string (saved as a Flight Ticket) that is sent from a client using Arrow Flight RPC.
>> >> The GraphQL query is executed on the GraphQL flight server that produces python record objects (JSON structured records).
>> >> Those Python record objects are then converted into an Arrow Formatted Table using pa.Table.from_pylist().
>> >> The Arrow Table is then sent back to the client to complete the original Flight RPC request.
>> >>
>> >> -----Original Message-----
>> >> From: Matthew Topol <m...@voltrondata.com.INVALID>
>> >> Sent: Wednesday, July 27, 2022 5:10 PM
>> >> To: dev@arrow.apache.org
>> >> Subject: Re: Arrow Flight usage with graph databases
>> >>
>> >> So this is slightly different than what I was doing and spoke about. As far as I can tell from your links, you are evaluating the graphql using that graphql server and then converting the JSON response into arrow format (correct me if I'm wrong please).
>> >>
>> >> What I did was to hook into a graphql parser and make my own evaluator which was arrow-native the whole way through, using the GraphQL request to define the resulting Arrow schema based on the shape of the requested data. I had a planner and executor, with the executor using the plan to set up a pipeline to stream the record batches through.
>> >>
>> >> Just something to think about :)
>> >>
>> >> --Matt
>> >>
>> >> On Wed, Jul 27, 2022, 7:19 PM Lee, David <david....@blackrock.com.invalid> wrote:
>> >>
>> >>> I'm working on something similar for Ariadne which is a python graphql server package.
>> >>>
>> >>> https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_arrow_flight_server.py
>> >>>
>> >>> https://github.com/davlee1972/ariadne_arrow/blob/arrow_flight/benchmark/test_asgi_arrow_client.py
>> >>>
>> >>> I'm basically calling pa.Table.from_pylist which infers the schema from the first json record, but that record could be incomplete so passing a schema is preferable.
>> >>>
>> >>> arrow_data = pa.Table.from_pylist([result])
>> >>>
>> >>> Basically I need to look at the graphql query and then go into the graphql SDL (Schema Definition Language) and generate an equivalent Arrow schema based on the subset of data points requested.
>> >>>
>> >>> -----Original Message-----
>> >>> From: Gavin Ray <ray.gavi...@gmail.com>
>> >>> Sent: Wednesday, July 20, 2022 11:15 AM
>> >>> To: dev@arrow.apache.org
>> >>> Subject: Re: Arrow Flight usage with graph databases
>> >>>
>> >>>> We considered the option to analyze data to build a schema on the fly, however it will be quite an expensive operation which will not allow us to get performance benefits from using Arrow Flight.
>> >>>
>> >>> I'm not sure if you'll be able to avoid generating a schema on the fly, if it's anything like SQL or GraphQL queries since each query would have a unique shape based on the user's selection.
>> >>>
>> >>> Have you benchmarked this out of curiosity?
>> >>> (It's not an uncommon usecase from what I've seen)
>> >>>
>> >>> For example, Matt Topol does this to dynamically generate response schemas in his implementation of GraphQL-via-Flight and he says the overhead is negligible.
>> >>>
>> >>> On Tue, Jul 19, 2022 at 11:52 PM Valentyn Kahamlyk <valent...@bitquilltech.com.invalid> wrote:
>> >>>
>> >>>> Hi David,
>> >>>>
>> >>>> We are planning to use Flight for the prototype. We are also planning to use Flight SQL as a reference, however we wanted to explore ideas whether Arrow Flight Graph can be implemented on top of Arrow Flight (similar to Arrow Flight SQL).
>> >>>>
>> >>>> Graph databases generally do not expose or enforce schema, which indeed makes it challenging. While we do have ideas on building extensions for graph databases to add schema, and we do see some other ideas related to this, we will not be able to rely on this as part of the initial prototype.
>> >>>> We considered the option to analyze data to build a schema on the fly, however it will be quite an expensive operation which will not allow us to get performance benefits from using Arrow Flight.
>> >>>>
>> >>>>> What type/size metadata are you referring to?
>> >>>> Metadata usually includes information about data type, size and type-specific properties. Some complex types are made up of 10 or more parts. Each Vertex or Edge of graph can have its own distinct set of properties, but the total number of types is several dozen and this can serve as a basis for constructing a schema. The total size of metadata can be quite big, as we wanted to support cases where the graph database can be very large (e.g. hundreds of GBs, with vertices and edges possibly containing different properties).
>> >>>> More information about the serialization format we are using right now can be found at
>> >>>> https://tinkerpop.apache.org/docs/3.5.4/dev/io/#graphbinary
>> >>>>
>> >>>>> So effectively, the internal format is being carried in a string/binary column?
>> >>>> Yes, I am considering this option for the first stage of implementation.
>> >>>>
>> >>>> David, thank you again for your reply, and please let me know your thoughts or whether you might have any suggestions around adopting Arrow Flight for schema-less databases.
>> >>>>
>> >>>> Regards, Valentyn.
>> >>>>
>> >>>> On Mon, Jul 18, 2022 at 5:23 PM David Li <lidav...@apache.org> wrote:
>> >>>>
>> >>>>> Hi Valentyn,
>> >>>>>
>> >>>>> Just to make sure, is this Flight or Flight SQL? I ask since Flight itself does not have a notion of transactions in the first place. I'm also curious what the intended target client application is.
>> >>>>>
>> >>>>> Not being familiar with graph databases myself, I'll try to give some comments…
>> >>>>>
>> >>>>> Lack of a schema does make things hard. There were some prior discussions about schema evolution during a (Flight) data stream, which would let you add/remove fields as the query progresses. And unions would let you accommodate inconsistent types. But if the changes are frequent, you'd negate many of the benefits of Arrow/Flight. And both of these could make client-side usage inconvenient.
>> >>>>>
>> >>>>> What type/size metadata are you referring to? Presumably, this would instead end up in the schema, once using Arrow?
>> >>>>>
>> >>>>> Is there any possibility to (say) unify (chunks of) the result to a consistent schema at least?
>> >>>>> Or possibly, encoding (some) properties as a Map<String, Union<...>> instead of as columns. (This negates the benefits of columnar data, of course, if you are interested in a particular property, but if you know those properties up front, the server could pull those out into (consistently typed) columns.)
>> >>>>>
>> >>>>>> We are currently working on a prototype in which we are trying to use Arrow Flight as a transport for transmitting requests and data to Gremlin Server. Serialization is still based on an internal format due to schema creation complexity.
>> >>>>>
>> >>>>> So effectively, the internal format is being carried in a string/binary column?
>> >>>>>
>> >>>>> On Mon, Jul 18, 2022, at 19:55, Valentyn Kahamlyk wrote:
>> >>>>>> Hi All,
>> >>>>>>
>> >>>>>> I'm investigating the possibility of using Arrow Flight with graph databases, and exploring how to enable an Arrow Flight endpoint in Apache TinkerPop Gremlin Server.
>> >>>>>>
>> >>>>>> Graph databases currently use several incompatible protocols, which makes the technology difficult to use and to spread.
>> >>>>>> Common features of graph databases are:
>> >>>>>> 1. Lack of a schema. Each vertex of the graph can have its own set of properties, including properties with the same name but different types. Metadata such as type and size are also passed with each value, which increases the amount of data transferred. Some data types are not supported by all languages.
>> >>>>>> 2. Internal representation of data is different for all implementations. For data exchange we used a set of formats like customized JSON and custom binary, but we would like to get a performance gain from using Arrow Flight.
>> >>>>>> 3. The difference in concepts like transactions, sessions, etc. Conceptually this may differ from the implementation in SQL. Gremlin Server does not natively support transactions, so we use the Neo4J plugin.
>> >>>>>>
>> >>>>>> We are currently working on a prototype in which we are trying to use Arrow Flight as a transport for transmitting requests and data to Gremlin Server. Serialization is still based on an internal format due to schema creation complexity.
>> >>>>>>
>> >>>>>> Ideas are welcome.
>> >>>>>>
>> >>>>>> Regards, Valentyn
>> >>>
>> >>> This message may contain information that is confidential or privileged. If you are not the intended recipient, please advise the sender immediately and delete this message. See http://www.blackrock.com/corporate/compliance/email-disclaimers for further information. Please refer to http://www.blackrock.com/corporate/compliance/privacy-policy for more information about BlackRock’s Privacy Policy.
>> >>>
>> >>> For a list of BlackRock's office addresses worldwide, see http://www.blackrock.com/corporate/about-us/contacts-locations.
>> >>>
>> >>> © 2022 BlackRock, Inc. All rights reserved.
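
One suggestion in the thread was to pull properties that are known up front into consistently typed columns and leave everything else in a per-row map. A minimal sketch of that split follows, in plain Python with no Arrow calls. The names `KNOWN_COLUMNS` and `split_properties` and the sample vertices are hypothetical, made up for illustration; nothing here is taken from the Gremlin prototype or from the Arrow API.

```python
from typing import Any

# Hypothetical property names and types, assumed to be known up front.
KNOWN_COLUMNS = {"name": str, "age": int}


def split_properties(vertices: list[dict[str, Any]]):
    """Split vertex property dicts into typed columns plus per-row leftover maps."""
    columns = {key: [] for key in KNOWN_COLUMNS}
    extras = []
    for props in vertices:
        leftover = dict(props)  # copy so the input is not mutated
        for key, expected in KNOWN_COLUMNS.items():
            value = leftover.get(key)
            if isinstance(value, expected):
                # Value matches the declared type: move it into the typed column.
                columns[key].append(leftover.pop(key))
            else:
                # Missing or differently typed: null in the column,
                # and the original value (if any) stays in the map.
                columns[key].append(None)
        extras.append(leftover)
    return columns, extras


# Made-up sample data: the second vertex has a string "age" and an extra property.
vertices = [
    {"name": "marko", "age": 29},
    {"name": "vadas", "age": "27", "lang": "java"},
    {"age": 32},
]
columns, extras = split_properties(vertices)
print(columns)
print(extras)
```

In an Arrow setting the typed lists would presumably become ordinary columns and the leftover maps a map-typed or serialized column, but that mapping is an assumption and is not shown here.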