Spark and Arrow Flight

Tornike Gurgenidze Tue, 26 Jul 2022 16:30:44 -0700

Hi,

I would like to know if the community here would be interested in a project
I started a little while back (rough, but mostly functional prototype here
- https://github.com/tokoko/SparkFlightSql). The repo has two components
that I would like your feedback on.

SparkFlightSql - an Arrow Flight SQL server with an Apache Spark backend.
It is intended as an alternative to Spark ThriftServer, with one important
distinction: ThriftServer streams query results back to the client through
a single server, whereas SparkFlightSql would be able to stream the data
back from multiple nodes in parallel by leveraging the Arrow Flight SQL
architecture. Arrow Flight SQL will also eventually have open source ODBC
and JDBC drivers, which would enable SparkFlightSql to cover all the use
cases of Spark ThriftServer.
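
To make the parallel-streaming point concrete, here is a minimal
client-side sketch. It is not code from the repository. It only illustrates
the general Arrow Flight pattern in which a client receives a list of
endpoints and reads each one separately. The names fetch_all_endpoints and
fetch_one are hypothetical, and the endpoints are stand-in strings so that
the example runs with the Python standard library alone.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

E = TypeVar("E")  # an endpoint descriptor (hypothetical stand-in type)
R = TypeVar("R")  # whatever reading one endpoint returns


def fetch_all_endpoints(
    endpoints: Iterable[E],
    fetch_one: Callable[[E], R],
    max_workers: int = 8,
) -> List[R]:
    """Read every endpoint concurrently; results keep endpoint order."""
    endpoint_list = list(endpoints)
    if not endpoint_list:
        return []
    workers = min(max_workers, len(endpoint_list))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of the inputs,
        # even though the reads themselves overlap in time.
        return list(pool.map(fetch_one, endpoint_list))


# Stand-in for a real Flight read. With pyarrow installed, fetch_one could
# plausibly be (assuming every endpoint lists at least one location):
#   lambda ep: pyarrow.flight.connect(ep.locations[0]).do_get(ep.ticket).read_all()
def fake_read(endpoint: str) -> str:
    return f"rows from {endpoint}"


if __name__ == "__main__":
    nodes = ["executor-1", "executor-2", "executor-3"]  # hypothetical names
    for chunk in fetch_all_endpoints(nodes, fake_read):
        print(chunk)
```

In this picture a ThriftServer-style setup corresponds to a single endpoint,
while a distributed Flight server can hand back one endpoint per node that
holds part of the result.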

SparkFlightManager - a lower-level utility that SparkFlightSql will be
built upon. The goal for SparkFlightManager would be to enable easier
development of any kind of distributed Arrow Flight servers, with
server-side code written in Spark. (see an example for a simple
Parquet-reader Flight server -
https://github.com/tokoko/SparkFlightSql/blob/main/src/main/scala/com/tokoko/spark/flight/example/SparkParquetFlightProducer.scala)
The idea is for the developer to provide the server-side code (an ETL
pipeline, ML model inference, or anything else that outputs a Spark
DataFrame), and SparkFlightManager will wrap it into a distributed data
microservice with an Arrow Flight API. The client could be any application
that depends only on Arrow Flight and not on Spark, or it could be another
Spark application that consumes the data in parallel using Arrow Flight.

If interested, you can see more details about how the current prototype is
implemented in the repository README.

Thanks,
--
Tornike
