Hi David,

We are planning to use Flight for the prototype. We are also planning to
use Flight SQL as a reference; however, we wanted to explore whether an
Arrow Flight Graph layer could be implemented on top of Arrow Flight
(similar to Arrow Flight SQL).

Graph databases generally do not expose or enforce a schema, which indeed
makes this challenging. While we do have ideas for building extensions that
add a schema to graph databases, and we have seen some other ideas along
these lines, we will not be able to rely on them in the initial prototype.
We considered analyzing the data to build a schema on the fly; however, that
would be quite an expensive operation and would cancel out the performance
benefits of using Arrow Flight.
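
A minimal sketch of what such on-the-fly analysis involves, assuming
properties arrive as plain dictionaries. The function name and the sample
vertices are hypothetical and are not TinkerPop code. Every element has to be
visited before the first batch can be sent, and a property seen with more than
one type would still need a union or a fallback encoding:

```python
from collections import defaultdict


def infer_property_types(elements):
    """Scan every element once and collect the value types seen per property name."""
    seen = defaultdict(set)
    for props in elements:
        for name, value in props.items():
            seen[name].add(type(value).__name__)
    return dict(seen)


# Hypothetical vertices: "age" appears with two different types.
vertices = [
    {"name": "marko", "age": 29},
    {"name": "vadas", "age": "27"},
    {"name": "lop", "lang": "java"},
]

types = infer_property_types(vertices)
# Properties that cannot be mapped to a single consistently typed column.
conflicts = {name for name, kinds in types.items() if len(kinds) > 1}
```

The full pass over the data is what makes this costly for large graphs.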

>What type/size metadata are you referring to?
Metadata usually includes the data type, the size, and type-specific
properties. Some complex types are made up of 10 or more parts. Each vertex
or edge of a graph can have its own distinct set of properties, but the
total number of types is only several dozen, and this can serve as a basis
for constructing a schema. The total size of the metadata can be quite
large, as we want to support very large graph databases (e.g. hundreds of
GBs, with vertices and edges possibly containing different properties).
More information about the serialization format we are using right now can
be found at https://tinkerpop.apache.org/docs/3.5.4/dev/io/#graphbinary.

>So effectively, the internal format is being carried in a string/binary
>column?
Yes, I am considering this option for the first stage of implementation.

David, thank you again for your reply, and please let me know your thoughts
or whether you might have any suggestions around adopting Arrow Flight for
schema-less databases.

Regards, Valentyn.

On Mon, Jul 18, 2022 at 5:23 PM David Li <lidav...@apache.org> wrote:

> Hi Valentyn,
>
> Just to make sure, is this Flight or Flight SQL? I ask since Flight itself
> does not have a notion of transactions in the first place. I'm also curious
> what the intended target client application is.
>
> Not being familiar with graph databases myself, I'll try to give some
> comments…
>
> Lack of a schema does make things hard. There were some prior discussions
> about schema evolution during a (Flight) data stream, which would let you
> add/remove fields as the query progresses. And unions would let you
> accommodate inconsistent types. But if the changes are frequent, you'd
> negate many of the benefits of Arrow/Flight. And both of these could make
> client-side usage inconvenient.
>
> What type/size metadata are you referring to? Presumably, this would
> instead end up in the schema, once using Arrow?
>
> Is there any possibility to (say) unify (chunks of) the result to a
> consistent schema at least? Or possibly, encoding (some) properties as a
> Map<String, Union<...>> instead of as columns. (This negates the benefits
> of columnar data, of course, if you are interested in a particular
> property, but if you know those properties up front, the server could pull
> those out into (consistently typed) columns.)
>
> > We are currently working on a prototype in which we are trying to use
> > Arrow Flight as a transport for transmitting requests and data to Gremlin
> > Server. Serialization is still based on an internal format due to schema
> > creation complexity.
>
> So effectively, the internal format is being carried in a string/binary
> column?
>
> On Mon, Jul 18, 2022, at 19:55, Valentyn Kahamlyk wrote:
> > Hi All,
> >
> > I'm investigating the possibility of using Arrow Flight with graph
> > databases, and exploring how to enable an Arrow Flight endpoint in Apache
> > TinkerPop Gremlin Server.
> >
> > Graph databases currently use several incompatible protocols, which makes
> > the technology difficult to use and to spread.
> > Common features of graph databases are:
> > 1. Lack of a schema. Each vertex of the graph can have its own set of
> > properties, including properties with the same name but different types.
> > Metadata such as type and size are also passed with each value, which
> > increases the amount of data transferred. Some data types are not supported
> > by all languages.
> > 2. The internal representation of data differs between implementations.
> > For data exchange we used a set of formats such as customized JSON and
> > custom binary, but we would like to get a performance gain from using
> > Arrow Flight.
> > 3. Differences in concepts like transactions, sessions, etc.
> > Conceptually these may differ from their implementation in SQL.
> > Gremlin Server does not natively support transactions, so we use the
> > Neo4j plugin.
> >
> > We are currently working on a prototype in which we are trying to use
> > Arrow Flight as a transport for transmitting requests and data to Gremlin
> > Server. Serialization is still based on an internal format due to schema
> > creation complexity.
> >
> > Ideas are welcome.
> >
> > Regards, Valentyn
>
