Hi,

I would like to know whether the community here would be interested in a project I started a little while back (a rough but mostly functional prototype is at https://github.com/tokoko/SparkFlightSql). There are two components in the repo that I would like your feedback on.

SparkFlightSql - an Arrow Flight SQL server with an Apache Spark backend. It is intended to be an alternative to Spark ThriftServer, with one important distinction: ThriftServer streams query results back to the client through a single server, whereas SparkFlightSql would be able to stream the data back to the client from multiple nodes in parallel by leveraging the Arrow Flight SQL architecture. Arrow Flight SQL will also eventually have open source ODBC and JDBC drivers, which would enable SparkFlightSql to cover all the use cases of Spark ThriftServer.

SparkFlightManager - a lower-level utility that SparkFlightSql will be built upon. The goal for SparkFlightManager is to make it easier to develop any kind of distributed Arrow Flight server with the server-side code written in Spark (see an example of a simple Parquet-reader Flight server - https://github.com/tokoko/SparkFlightSql/blob/main/src/main/scala/com/tokoko/spark/flight/example/SparkParquetFlightProducer.scala). The idea is for the developer to provide the server-side code, which could be an ETL pipeline, ML model inference, or anything else that outputs a Spark DataFrame, and SparkFlightManager will wrap it into a distributed data microservice with an Arrow Flight API. The client could be any application with no dependency on Spark, only on Arrow Flight, or it could be another Spark application that consumes the data in parallel using Arrow Flight.

If interested, you can find more details about how the current prototype is implemented in the repository README.

Thanks,
-- Tornike
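
P.S. In case it helps picture the parallel read path, here is a rough client-side sketch in Python. It is not taken from the repository. It assumes a plain Arrow Flight client (pyarrow.flight), and the function names (read_endpoint, fetch_parallel, main) are hypothetical, as are the uri and command arguments. A real Flight SQL client would wrap the query in the protobuf command message the spec defines rather than send raw bytes.

```python
from concurrent.futures import ThreadPoolExecutor


def read_endpoint(endpoint, connect, fallback_client):
    """Fetch one endpoint's stream from the node that serves it."""
    # An empty location list means "reuse the service that issued the FlightInfo".
    if endpoint.locations:
        client = connect(endpoint.locations[0])
    else:
        client = fallback_client
    return client.do_get(endpoint.ticket).read_all()


def fetch_parallel(info, connect, fallback_client, max_workers=8):
    """Read every endpoint listed in a FlightInfo concurrently.

    Results come back in the same order as info.endpoints.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(
                lambda ep: read_endpoint(ep, connect, fallback_client),
                info.endpoints,
            )
        )


def main(uri, command):
    """Hypothetical entry point: uri and command are placeholders."""
    # pyarrow is imported here so the helpers above stay dependency-free.
    import pyarrow as pa
    import pyarrow.flight as flight

    client = flight.connect(uri)
    descriptor = flight.FlightDescriptor.for_command(command)
    info = client.get_flight_info(descriptor)
    # Each endpoint may point at a different node; pull them all at once.
    tables = fetch_parallel(info, flight.connect, client)
    return pa.concat_tables(tables)
```

The point of the sketch is only that the client asks one server for a FlightInfo and then pulls each endpoint's ticket from whichever node the endpoint names, so no single server has to relay the whole result.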