hi Dimitri,

On Tue, Aug 21, 2018 at 1:04 PM, Dimitri Vorona
<alen...@googlemail.com.invalid> wrote:
> Hi Wes,
>
> I would personally be very interested in this project and see it as a huge
> extension of Arrow's capabilities.
>
> I actually experimented with integrating Arrow into a main-memory database
> (HyPer [0]), though I might have had a slightly different focus. The
> approach I took was to compile the export/import operators directly into
> the query plan, which allowed read/write performance similar to RAM
> throughput. Think of a COPY command, but with Arrow as the input/output
> format. Obviously this way you don't have the flexibility of a typical
> ODBC/JDBC connector and have to rely on some form of shared memory (Plasma
> was great for this). Alternatively, you can export to a file, but that is
> not Arrow's main use case. On the positive side, with import and export
> overhead being negligible, you can construct arbitrarily complex workflows
> using external tools (my implementation was very far from that).
>

Very cool. My understanding is that HyPer is not open source (nor likely
to become so after the Tableau acquisition), but if you have any research
or results to share, I would be interested to learn more.

> Which brings me to the question: what kind of performance are we
> targeting? I have no recent numbers, but I was under the impression that
> going through ODBC can cost quite a lot. Depending on the database it
> might not be a problem, but very fast analytical engines are becoming ever
> more common, and with Gandiva making its way into Arrow we might want to
> think about providing more advanced forms of integration.

ODBC is quite expensive indeed, and the quality of ODBC drivers varies
a great deal. Worse still, many tech companies outsource the development
of their ODBC drivers to Simba Technologies, and the resulting source
code is not open (presumably because it contains licensed proprietary
Simba code).

In practice, pulling data through the PostgreSQL, SQLite3, MySQL,
Microsoft SQL Server, or HiveServer2 protocols is a bottleneck in many
smaller-scale workloads. So if you can double or triple the effective
throughput from these databases, that makes a big difference, even if
we're only talking 20-100 MB/s or so. When you consider that many
databases are either outright forks of PostgreSQL or reuse one of
those protocols instead of inventing a new one, that covers a large
percentage of the databases in production use out there. The rest can
go through ODBC.

>
> Concretely, I could imagine a custom Array/RecordBatch builder API targeted
> directly at data source providers who want to integrate Arrow. For example,
> I've found that using the "safe" methods (the ones returning Status) can be
> comparatively expensive in terms of performance, and that this cost can be
> avoided by pre-allocating buffers (some builders have "unsafe" methods, but
> not all). Also, it would be nice to have a clear API for building the
> output "in place" in cases where the output size can be determined from the
> number of entries in the RecordBatch (primitives, fixed-size binaries,
> everything dictionary-encoded, etc.), so we can avoid a copy when using
> shared memory. These might not be use cases that are general enough to
> justify extending the standard builders, but I could see some kind of
> specialized part of Arrow which provides these capabilities.
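
For reference, the pattern you describe already works reasonably well for
the primitive builders: reserve capacity once, then append without per-value
Status checks. A minimal sketch (Int64Builder as an example; the function
name and the `values` buffer are just illustrative stand-ins for whatever
the driver has already decoded off the wire):

    #include <cstdint>
    #include <memory>

    #include <arrow/api.h>

    // Build an int64 column with a single up-front allocation and no
    // per-value Status checks on the hot path.
    arrow::Status BuildInt64Column(const int64_t* values, int64_t num_rows,
                                   std::shared_ptr<arrow::Array>* out) {
      arrow::Int64Builder builder;
      // Reserve once so the unchecked appends below cannot run out of space.
      ARROW_RETURN_NOT_OK(builder.Reserve(num_rows));
      for (int64_t i = 0; i < num_rows; ++i) {
        builder.UnsafeAppend(values[i]);  // void return, no Status per append
      }
      return builder.Finish(out);
    }

Extending that to building fixed-width output truly in place over shared
memory, without the final copy, is where I agree a more specialized API
would be needed.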

I see two ways forward for databases:

* Offer a native Arrow RPC protocol that servers can implement. This
is the "Arrow Flight" effort that we've been discussing; Jacques has
developed a prototype that's up in a PR now for ARROW-249.

* Maintain native protocol interfaces and optimize them as much as
possible. For example, many Python programmers are using either
psycopg2 or asyncpg; it should be possible to go much faster when
downloading large result sets (see the sketch below).
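
In both cases the reusable client-side piece could be fairly small. Purely
as an illustration (these class and method names are hypothetical, not an
existing Arrow API), here is the kind of interface a driver could implement
so that result sets come back as Arrow record batches regardless of the wire
protocol underneath:

    #include <memory>
    #include <string>

    #include <arrow/api.h>

    // Hypothetical interfaces, for discussion only; not part of Arrow today.
    class ArrowResultSet {
     public:
      virtual ~ArrowResultSet() = default;
      // Schema of the result, known once the query has been described.
      virtual std::shared_ptr<arrow::Schema> schema() const = 0;
      // Pull the next chunk of rows; sets *batch to nullptr at end of stream.
      virtual arrow::Status Next(std::shared_ptr<arrow::RecordBatch>* batch) = 0;
    };

    class ArrowDatabaseClient {
     public:
      virtual ~ArrowDatabaseClient() = default;
      virtual arrow::Status Execute(const std::string& query,
                                    std::unique_ptr<ArrowResultSet>* result) = 0;
    };

A Flight-based client and an optimized libpq- or HiveServer2-based client
could both sit behind something like this; that is the sort of common
abstraction I'd like us to define in the project.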

>
> I'm leaving on vacation in a couple of hours, but I would like to follow
> the progress in the coming weeks. Would it be reasonable to create a wiki
> page for this project?

Yes, I will create a section on
https://cwiki.apache.org/confluence/display/ARROW to track projects
related to Arrow<->database protocols. Glad to hear that you're
interested.

- Wes

>
> Cheers,
> Dimitri.
>
> [0]: https://hyper-db.de/
>
> On Tue, Aug 21, 2018 at 4:38 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi folks,
>>
>> Since the project's inception, I have wanted to develop higher
>> performance database clients that natively return the Arrow columnar
>> format. This is a natural analogue to building Arrow-native interfaces
>> to storage formats like Parquet and ORC. If we can't get fast access
>> to data, many other parts of the project become less useful.
>>
>> Example databases I have actively used include:
>>
>> - PostgreSQL (see ARROW-1106)
>> - HiveServer2: for Apache Hive and Apache Impala (see ARROW-3050)
>>
>> There's good reason to build this software in Apache Arrow:
>>
>> * Define reusable Arrow-oriented abstractions for putting and getting
>> result sets from databases
>> * Define reusable APIs in the bindings (Python, Ruby, R, etc.)
>> * Benefit from a common build toolchain so that packaging in e.g.
>> Python is much simpler
>> * Fewer release / packaging cycles to manage (I don't have the
>> capacity to manage any more release and packaging cycles than I am
>> already involved with)
>>
>> The only example of an Arrow-native DB client so far is the Turbodbc
>> project (https://github.com/blue-yonder/turbodbc). I actually think
>> that it would be beneficial to have native ODBC interop in Apache
>> Arrow (we recently added JDBC support), but it's fine with me if the
>> Turbodbc community wishes to remain a third-party project long term,
>> under its own governance and release cycle.
>>
>> While I was still at Cloudera I helped develop a small C++ and Python
>> library (Apache 2.0) for interacting with HiveServer2, but it has
>> become abandonware. I have taken the liberty of forking this code and
>> modifying it to build as an optional component of the Arrow C++
>> codebase:
>>
>> https://github.com/apache/arrow/pull/2444
>>
>> I would like to merge this PR and proceed with creating more database
>> interfaces within the project, and defining common abstractions to
>> help users access data faster and be more productive.
>>
>> Thanks,
>> Wes
>>
