Hey Brendan,

As Jacques promised, here are a few pointers for your work on Flight:

Our early-release Flight connector [1] fully supports single Flight streams and partially supports parallel streams. I also have a Spark DataSourceV2 client which may be of interest to you [2].

Both projects make use of the 'doAction' part of the Flight API spec [3] to negotiate parallel vs. single stream, among other things. However, this is done in an ad hoc manner, and finding a way to standardise it for the exchange of metadata, catalog info, connection parameters, etc. is, for me, an important next step towards a Flight-based protocol that is equivalent to ODBC/JDBC. I would be happy to discuss further if you have any thoughts on the topic.

Best,
Ryan

[1] https://github.com/dremio-hub/dremio-flight-connector
[2] https://github.com/rymurr/flight-spark-source
[3] https://github.com/apache/arrow/blob/master/format/Flight.proto

On Thu, May 21, 2020 at 3:08 PM Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Brendan,
>
> welcome to the community. In addition to the folks at Dremio, I wanted to
> make you aware of the Python ODBC client library
> https://github.com/blue-yonder/turbodbc which provides a high-performance
> ODBC<->Arrow adapter. It is especially popular with MS SQL Server users as
> the fastest known way to retrieve query results as DataFrames in Python
> from SQL Server, considerably faster than pandas.read_sql or using pyodbc
> directly.
>
> While it is the fastest known option, I can tell that a lot of CPU time is
> still spent in the ODBC driver "transforming" results so that they match
> the ODBC interface. At least here, one could possibly get much better
> performance when retrieving large columnar results from SQL Server by
> going through Arrow Flight as an interface instead of being constrained to
> the less efficient ODBC for this use case. Currently there is a
> performance difference of 50x between reading the data from a Parquet file
> and reading the same data from a table in SQL Server (simple SELECT, no
> filtering or so). As the client CPU is at 100% for nearly the full
> retrieval time, using a more efficient protocol for data transfer could
> roughly translate into a 10x speedup.
>
> Best,
> Uwe
>
> On Wed, May 20, 2020, at 12:16 AM, Brendan Niebruegge wrote:
> > Hi everyone,
> >
> > I wanted to informally introduce myself. My name is Brendan Niebruegge,
> > I'm a Software Engineer in our SQL Server extensibility team here at
> > Microsoft. I am leading an effort to explore how we could integrate
> > Arrow Flight with SQL Server. We think this could be a very interesting
> > integration that would benefit both SQL Server and the Arrow community.
> > We are very early in our thinking, so I thought it best to reach out
> > here and see if you had any thoughts or suggestions for me. What would
> > be the best way to socialize my thoughts to date? I am keen to learn
> > and deepen my knowledge of Arrow as well, so please let me know how I
> > can be of help to the community.
> >
> > Please feel free to reach out anytime (email: brn...@microsoft.com)
> >
> > Thanks,
> > Brendan Niebruegge
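
For anyone following the thread, the ad hoc 'doAction' negotiation Ryan describes can be pictured roughly as below. This is only a sketch under stated assumptions, not the code of either linked project: the action name `example.negotiate-streams`, the JSON body layout, and all function names are hypothetical. The only parts taken from Flight.proto are the shapes of the messages: an Action carries a type string plus opaque bytes, and a Result carries opaque bytes.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Mirrors the Action message in Flight.proto: a type string plus opaque bytes."""
    type: str
    body: bytes


@dataclass(frozen=True)
class Result:
    """Mirrors the Result message in Flight.proto: opaque bytes only."""
    body: bytes


# Hypothetical action name, chosen for illustration only.
NEGOTIATE_STREAMS = "example.negotiate-streams"


def make_negotiation_action(want_parallel: bool) -> Action:
    """Client side: encode the requested stream mode as a JSON body."""
    payload = {"parallel": want_parallel}
    return Action(NEGOTIATE_STREAMS, json.dumps(payload).encode("utf-8"))


def handle_action(action: Action, server_supports_parallel: bool) -> Result:
    """Server side: grant parallel streams only if both sides support them."""
    if action.type != NEGOTIATE_STREAMS:
        raise ValueError(f"unknown action type: {action.type}")
    request = json.loads(action.body.decode("utf-8"))
    granted = bool(request.get("parallel")) and server_supports_parallel
    return Result(json.dumps({"parallel": granted}).encode("utf-8"))


def negotiated_parallel(result: Result) -> bool:
    """Client side: decode the server's answer."""
    return bool(json.loads(result.body.decode("utf-8"))["parallel"])
```

Because the body is just bytes, every client and server pair has to agree on the action names and encoding out of band, which is the standardisation gap Ryan points to.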