Hi Wes,

I would personally be very interested in this project and see it as a huge extension of Arrow's capabilities.
I actually experimented with integrating Arrow into a main-memory database (HyPer [0]), though I might have had a slightly different focus. The approach I took was to compile the export/import operators directly into the query plan, which allowed read/write performance close to RAM throughput. Think of the COPY command, but with Arrow as the input/output format. Obviously, this way you don't have the flexibility of a typical ODBC/JDBC connector and have to rely on some form of shared memory (Plasma was great for this). Alternatively, you can export to a file, but that is not Arrow's main use case. On the positive side, with the import and export overhead being negligible, you can construct arbitrarily complex workflows using external tools (my implementation was very far from that).

Which brings me to the question: what kind of performance are we targeting? I have no recent numbers, but I was under the impression that going through ODBC can cost quite a lot. Depending on the database, that might not be a problem, but very fast analytical engines are becoming ever more common, and with Gandiva making its way into Arrow we might want to think about providing more advanced ways of integration.

Concretely, I could imagine a custom Array/RecordBatch builder API targeted directly at the data source providers who want to integrate Arrow. For example, I've found that using the "safe" methods (the ones returning Status) can actually cost comparatively much in terms of performance, and this can be avoided by pre-allocating buffers (some builders have "unsafe" methods, but not all); see the first sketch below. It would also be nice to have a clear API for building the output "in place" in cases where the output size can be determined from the number of entries in the RecordBatch (primitives, fixed-size binaries, everything dictionary-encoded, etc.), so we can avoid a copy when using shared memory; see the second sketch. These might not be use cases general enough to justify extending the standard builders, but I could see some kind of specialized part of Arrow providing these capabilities.
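To make the "safe" vs. pre-allocated point concrete, here is roughly the pattern I have in mind, sketched against the standard Int64Builder (a minimal sketch; error handling elided, and exact signatures may differ between Arrow versions):

#include <cstdint>
#include <memory>
#include <vector>

#include <arrow/api.h>

arrow::Status BuildColumn(const std::vector<int64_t>& values,
                          std::shared_ptr<arrow::Array>* out) {
  arrow::Int64Builder builder;

  // "Safe" path: every call returns a Status and may reallocate:
  //   for (int64_t v : values) ARROW_RETURN_NOT_OK(builder.Append(v));

  // Pre-allocated path: reserve capacity once, then append without
  // per-element capacity checks or Status construction.
  ARROW_RETURN_NOT_OK(builder.Reserve(static_cast<int64_t>(values.size())));
  for (int64_t v : values) {
    builder.UnsafeAppend(v);
  }
  return builder.Finish(out);
}

The per-element Status handling and capacity checks are exactly what the second path avoids paying for.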
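And for the "in place" case, a minimal sketch of what a data source can already approximate by wrapping memory it owns. Here shm_ptr is just a placeholder for a pointer into a shared-memory segment, not an existing Arrow API, and validity bitmaps are ignored for brevity:

#include <cstdint>
#include <memory>

#include <arrow/api.h>

// Construct a primitive array over memory the engine already owns
// (e.g. a shared-memory segment), avoiding a copy into builder buffers.
std::shared_ptr<arrow::Array> WrapInt64Region(uint8_t* shm_ptr,
                                              int64_t num_values) {
  // The producer writes its output directly into the target region...
  auto* values = reinterpret_cast<int64_t*>(shm_ptr);
  for (int64_t i = 0; i < num_values; ++i) {
    values[i] = i;  // stand-in for the engine's own output loop
  }

  // ...and Arrow merely wraps it: a non-owning buffer plus array metadata.
  auto buffer = std::make_shared<arrow::MutableBuffer>(
      shm_ptr, num_values * static_cast<int64_t>(sizeof(int64_t)));
  return std::make_shared<arrow::Int64Array>(num_values, buffer);
}

A dedicated API could make this pattern first-class for all fixed-width layouts instead of leaving it to each provider.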
I'm leaving on vacation in a couple of hours, but I would like to follow the progress in the coming weeks. Would it be reasonable to create a wiki page for this project?

Cheers,
Dimitri

[0]: https://hyper-db.de/

On Tue, Aug 21, 2018 at 4:38 PM Wes McKinney <wesmck...@gmail.com> wrote:
> hi folks,
>
> I have long desired since the project's inception to develop higher
> performance database clients that natively return Arrow columnar
> format. This is a natural analogue to building Arrow-native interfaces
> to storage formats like Parquet and ORC. If we can't get fast access
> to data, many other parts of the project become less useful.
>
> Example databases I have actively used include:
>
> - PostgreSQL (see ARROW-1106)
> - HiveServer2: for Apache Hive and Apache Impala (see ARROW-3050)
>
> There's good reason to build this software in Apache Arrow:
>
> * Define reusable Arrow-oriented abstractions for putting and getting
>   result sets from databases
> * Define reusable APIs in the bindings (Python, Ruby, R, etc.)
> * Benefit from a common build toolchain so that packaging in e.g.
>   Python is much simpler
> * Fewer release / packaging cycles to manage (I don't have the
>   capacity to manage any more release and packaging cycles than I am
>   already involved with)
>
> The only example of an Arrow-native DB client so far is the Turbodbc
> project (https://github.com/blue-yonder/turbodbc). I actually think
> that it would be beneficial to have native ODBC interop in Apache
> Arrow (we recently added JDBC), but it's fine with me if the
> Turbodbc community wishes to remain a third-party project under its
> own governance and release cycle over the long term.
>
> While I was still at Cloudera I helped develop a small C++ and Python
> library (Apache 2.0) for interacting with HiveServer2, but it has
> become abandonware. I have taken the liberty of forking this code and
> modifying it to build as an optional component of the Arrow C++
> codebase:
>
> https://github.com/apache/arrow/pull/2444
>
> I would like to merge this PR and proceed with creating more database
> interfaces within the project, and defining common abstractions to
> help users access data faster and be more productive.
>
> Thanks,
> Wes