Hi Wes,

I would personally be very interested in this project and see it as a
huge extension of Arrow's capabilities.

I actually experimented with integrating Arrow into a main-memory
database (HyPer [0]), though with a slightly different focus. The
approach I took was to compile the export/import operators directly
into the query plan, which allowed read/write performance close to RAM
throughput. Think of a COPY command, but with Arrow as the input/output
format. Obviously, this way you don't have the flexibility of a typical
ODBC/JDBC connector and have to rely on some form of shared memory
(Plasma was great for this). Alternatively, you can export to a file,
but that is not Arrow's main use case. On the positive side, with
import and export overhead being negligible, you can construct
arbitrarily complex workflows using external tools (my implementation
was very far from that).
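
As an illustration of that handoff, here is a minimal sketch of the
producer side, roughly following the Plasma C++ tutorial of that era
(the socket path, object ID, and buffer size are placeholders):

    // Create an object in the Plasma store, fill it, and seal it so
    // another process can map it. Error handling reduced to CHECKs.
    #include <cstring>

    #include <arrow/buffer.h>
    #include <arrow/util/logging.h>
    #include <plasma/client.h>

    int main() {
      plasma::PlasmaClient client;
      ARROW_CHECK_OK(
          client.Connect("/tmp/plasma", "", PLASMA_DEFAULT_RELEASE_DELAY));

      // 20-byte object ID agreed upon with the consumer (placeholder).
      plasma::ObjectID id =
          plasma::ObjectID::from_binary("00000000000000000000");

      int64_t size = 4096;  // would be the serialized batch size
      std::shared_ptr<arrow::Buffer> data;
      ARROW_CHECK_OK(client.Create(id, size, nullptr, 0, &data));
      std::memset(data->mutable_data(), 0, size);  // stand-in for output
      ARROW_CHECK_OK(client.Seal(id));
      return 0;
    }

The consumer then calls Get() on the same object ID and can read the
data zero-copy.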

Which brings me to the question: what kind of performance are we
targeting? I have no recent numbers, but I was under the impression
that going through ODBC can cost quite a lot. Depending on the database
this might not be a problem, but very fast analytical engines are
becoming ever more common, and with Gandiva making its way into Arrow
we might want to think about providing more advanced ways of
integration.

Concretely, I could imagine a custom Array/RecordBatch builder API
targeted directly at data source providers who want to integrate Arrow.
For example, I've found that using the "safe" methods (the ones
returning Status) can actually cost comparatively much in terms of
performance, and that this can be avoided by pre-allocating buffers
(some builders have "unsafe" methods, but not all). Also, it would be
nice to have a clear API for building the output "in place" in cases
where the output size can be determined from the number of entries in
the RecordBatch (primitives, fixed-size binaries, everything
dictionary-encoded, etc.), so we can avoid a copy when using shared
memory. These use cases might not be general enough to justify
extending the standard builders, but I could see some kind of
specialized part of Arrow providing these capabilities.
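
To make the first point concrete, a minimal sketch of the pattern I
have in mind, assuming a primitive column (function and variable names
are mine):

    #include <cstdint>
    #include <vector>

    #include <arrow/api.h>

    // Reserve() performs one Status-checked allocation up front;
    // UnsafeAppend() then skips the per-value capacity check entirely.
    arrow::Status BuildColumn(const std::vector<int64_t>& values,
                              std::shared_ptr<arrow::Array>* out) {
      arrow::Int64Builder builder;
      ARROW_RETURN_NOT_OK(
          builder.Reserve(static_cast<int64_t>(values.size())));
      for (int64_t v : values) {
        builder.UnsafeAppend(v);  // no capacity check, no Status
      }
      return builder.Finish(out);
    }

And for the "in place" case, one could wrap memory Arrow does not own
instead of copying it into builder-owned buffers, e.g. for a
non-nullable int64 column sitting in a shared memory segment ('shm_ptr'
and 'length' are assumptions here):

    // Zero-copy view over pre-existing memory; no builder, no copy.
    auto data = std::make_shared<arrow::Buffer>(
        reinterpret_cast<const uint8_t*>(shm_ptr),
        length * sizeof(int64_t));
    auto array = std::make_shared<arrow::Int64Array>(length, data);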

I'm leaving on vacation in a couple of hours, but I would like to
follow the progress in the coming weeks. Would it be reasonable to
create a wiki page for this project?

Cheers,
Dimitri.

[0]: https://hyper-db.de/

On Tue, Aug 21, 2018 at 4:38 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi folks,
>
> I have long desired since the project's inception to develop higher
> performance database clients that natively return Arrow columnar
> format. This is a natural analogue to building Arrow-native interfaces
> to storage formats like Parquet and ORC. If we can't get fast access
> to data, many other parts of the project become less useful.
>
> Example databases I have actively used include:
>
> - PostgreSQL (see ARROW-1106)
> - HiveServer2: for Apache Hive and Apache Impala (see ARROW-3050)
>
> There's good reason to build this software in Apache Arrow:
>
> * Define reusable Arrow-oriented abstractions for putting and getting
> result sets from databases
> * Define reusable APIs in the bindings (Python, Ruby, R, etc.)
> * Benefit from a common build toolchain so that packaging in e.g.
> Python is much simpler
> * Fewer release / packaging cycles to manage (I don't have the
> capacity to manage any more release and packaging cycles than I am
> already involved with)
>
> The only example of an Arrow-native DB client so far is the Turbodbc
> project (https://github.com/blue-yonder/turbodbc). I actually think
> that it would be beneficial to have native ODBC interop in Apache
> Arrow (we recently added JDBC), but it's fine with me if the
> Turbodbc community wishes to remain a third-party project under its
> own governance and release cycle over the long term.
>
> While I was still at Cloudera I helped develop a small C++ and Python
> library (Apache 2.0) for interacting with HiveServer2, but it has
> become abandonware. I have taken the liberty of forking this code and
> modifying it to build as an optional component of the Arrow C++
> codebase:
>
> https://github.com/apache/arrow/pull/2444
>
> I would like to merge this PR and proceed with creating more database
> interfaces within the project, and defining common abstractions to
> help users access data faster and be more productive.
>
> Thanks,
> Wes
>
