Hi Ted,

From item 3, it sounds like you are focusing on using Drill to front a DB system, rather than proposing to use Drill to update files in a distributed file system (DFS).


It turns out that, for the DFS case, the former Hortonworks put quite a bit of work into developing viable insert/update semantics in Hive with the Hive ACID support [1], [2]. This was a huge amount of work done in conjunction with various partners, and it is now on its third version as Hive learns the semantics and how to get ACID to perform well under load. Adding ACID support to Drill would be a "non-trivial" exercise (unless Drill could actually borrow Hive's code, but even that might not be simple).


Drill is far simpler than Hive because Drill has long exploited the fact that 
data is read-only. Once data can change, we must revisit various aspects to 
account for that fact. Since change can occur concurrently with queries (and 
other changes), some kind of concurrency control is needed. Hive has worked out 
a way to ensure that only completed transactions are included in a query by 
using delta files. Hive delta files can include inserts, updates and deletes.
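The visibility rule can be sketched loosely as follows. This is an illustrative Python sketch of snapshot-style planning, not Hive's actual mechanism; the type and function names are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeltaFile:
    path: str
    txn_id: int  # id of the transaction that wrote this delta

def visible_deltas(delta_files, committed_txns):
    """Snapshot-style planning: include only delta files whose writing
    transaction had already committed when the query was planned."""
    return [f for f in delta_files if f.txn_id in committed_txns]
```

Anything written by a still-open transaction is simply left out of the scan, so concurrent writers never expose half-finished changes to readers.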

If insert is all that is needed, then there may be simpler solutions: just 
track which files are newly added. If the underlying file system is atomic, 
then even this can be simplified down to just noticing that a file exists when planning a query. If a file is visible before it is complete, then some mechanism is needed to detect in-progress files. Of course, Drill must
already handle this case for files created outside of Drill, so it may "just 
work" for the DFS case.
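The usual trick for getting that atomicity is the stage-then-rename pattern: write to a temporary file, then rename it into place. A minimal Python sketch, assuming a POSIX file system where rename is atomic (the staging suffix is a made-up convention):

```python
import os
import tempfile

def atomic_write(target_path: str, data: bytes) -> None:
    """Write data to a staging file, then rename it into place.

    On POSIX file systems os.rename() is atomic, so a planner scanning
    the directory sees either the complete file or no file at all --
    never a partially written one.
    """
    dir_name = os.path.dirname(target_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".inprogress")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes are durable before publishing
        os.rename(tmp_path, target_path)  # atomic publish
    except BaseException:
        os.unlink(tmp_path)  # never leave a half-written staging file behind
        raise
```

A writer outside Drill that follows this convention never exposes an in-progress file, which is why the existing read path may "just work."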


And, if the goal is simply to push insert into a DB, then the DB itself can 
handle transactions and concurrency. Generally, most DBs manage transactions as part of a session. To ensure Drill does a consistent insert, Drill would need to push the insert through a single client (single minor fragment). A
distributed insert (using multiple minor fragments each inserting a subset of 
rows) would require two-phase commit, or would have to forgo consistency. (The 
CAP problem.) Further, Drill would have to handle insert failures (deadlock 
detection, duplicate keys, etc.) reported by the target DB and return that 
error to the Drill client (hopefully in a form other than a long Java stack 
trace...)
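The single-client case can be sketched in a few lines, using Python's sqlite3 as a stand-in for the target DB (the table, column names, and error translation are hypothetical):

```python
import sqlite3

def consistent_insert(conn: sqlite3.Connection, rows) -> None:
    """Insert all rows through one connection in one transaction.

    A single session either commits every row or none of them; a
    duplicate key (or any other integrity error) rolls the whole
    batch back and surfaces a clean error instead of a partial insert.
    """
    try:
        with conn:  # one transaction: commit on success, rollback on error
            conn.executemany("INSERT INTO t (id, val) VALUES (?, ?)", rows)
    except sqlite3.IntegrityError as e:
        raise ValueError(f"insert failed, batch rolled back: {e}") from e
```

This is exactly what a distributed insert cannot get for free: once each minor fragment holds its own connection, there is no single transaction to commit or roll back, which is where two-phase commit (or giving up consistency) comes in.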

All this said, I suspect you have in mind a specific use case that is far 
simpler than the general case. Can you explain a bit more about what you have in mind?

Thanks,
- Paul

[1] 
https://hortonworks.com/tutorial/using-hive-acid-transactions-to-insert-update-and-delete-data/
[2] 
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive_3_internals.html

On Monday, May 27, 2019, 1:15:36 PM PDT, Ted Dunning <ted.dunn...@gmail.com> wrote:
I would like to start a discussion about how to add insert capabilities to Drill.

It seems that the basic outline is:

1) making sure Calcite will parse it (almost certain)
2) defining an upsert operator in the logical plan
3) push rules into Drill from the DB driver to allow Drill to push down the upsert into the DB

Are these generally correct?

Can anybody point me to analogous operations?