And I should point out that Drill already has the problem of data that
changes. It just ignores the problem. If somebody appends to a CSV or
JSON file, some changes might get picked up, some might be seen
mid-change (possibly causing a data syntax error), and if DB rows are
inserted, Drill will give strange results.

The Drill policy is and always has been "that's tough".

I am proposing to extend that policy by letting Drill join the party of
tools that do updates.

In particular, I want to send row updates or row inserts to MapR DB.

I have watched the Hive/ORC transactional insert train wreck for some time.
I think that the only viable lessons from it are that 1) doing transactions
on top of a non-database is hard and 2) having non-database people do it
makes it even harder.

My own feeling is that, until more serious work is done on this, the right
solution is to get some simple capabilities in place. For instance, if we
have insert-only semantics, tracking insertion transactions using a job_id
(or window_id) field works a treat, especially if you hide the probe for
pending or aborted inserts behind a view. This actually works, works well,
and is incredibly simple. The only thing wrong is that I have to bring in a
separate tool like Spark or Python to do the insertions. With sleazyInsert,
I could do it all with Drill plus a tiny bit of scripting glue.
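The job_id pattern described above can be sketched as follows (a toy
illustration using Python's sqlite3 in place of MapR DB or Drill; all table
and column names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Raw table: every inserted row carries the job_id of the loading job.
cur.execute("CREATE TABLE events_raw (job_id TEXT, payload TEXT)")
# Jobs table: a job's rows become visible only once it is 'committed'.
cur.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT)")

# The view hides rows belonging to pending or aborted jobs.
cur.execute("""
    CREATE VIEW events AS
    SELECT r.payload
    FROM events_raw r
    JOIN jobs j ON r.job_id = j.job_id
    WHERE j.status = 'committed'
""")

# Job 1 has completed; job 2 is still in flight.
cur.execute("INSERT INTO jobs VALUES ('job-1', 'committed')")
cur.execute("INSERT INTO jobs VALUES ('job-2', 'pending')")
cur.executemany("INSERT INTO events_raw VALUES (?, ?)",
                [("job-1", "a"), ("job-1", "b"), ("job-2", "c")])

# Readers querying the view only ever see rows from committed jobs.
rows = [p for (p,) in cur.execute("SELECT payload FROM events ORDER BY payload")]
print(rows)  # -> ['a', 'b']
```

The point is that readers never need locks: an uncommitted or aborted job's
rows simply never show up through the view.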





On Mon, May 27, 2019 at 5:27 PM Ted Dunning <ted.dunn...@gmail.com> wrote:

>
> I have in mind the ability to push rows to an underlying DB without any
> transactional support.
>
>
>
> On Mon, May 27, 2019 at 2:16 PM Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
>> Hi Ted,
>>
>> From item 3, it sounds like you are focusing on using Drill to front a DB
>> system, rather than proposing to use Drill to update files in a distributed
>> file system (DFS).
>>
>>
>> Turns out that, for the DFS case, the former HortonWorks put considerable
>> effort into working out viable insert/update semantics in Hive with Hive ACID
>> support. [1], [2] This was a huge amount of work done in conjunction with
>> various partners, and is on its third version as Hive learns the semantics
>> and how to get ACID to perform well under load. Adding ACID support to
>> Drill would be a "non-trivial" exercise (unless Drill could actually borrow
>> Hive's code, but even that might not be simple.)
>>
>>
>> Drill is far simpler than Hive because Drill has long exploited the fact
>> that data is read-only. Once data can change, we must revisit various
>> aspects to account for that fact. Since change can occur concurrently with
>> queries (and other changes), some kind of concurrency control is needed.
>> Hive has worked out a way to ensure that only completed transactions are
>> included in a query by using delta files. Hive delta files can include
>> inserts, updates and deletes.
>>
>> If insert is all that is needed, then there may be simpler solutions:
>> just track which files are newly added. If the underlying file system is
>> atomic, then even this can be simplified down to just noticing that a file
>> exists when planning a query. If a file is visible before it is complete,
>> then some mechanism is needed to detect in-progress files. Of
>> course, Drill must already handle this case for files created outside of
>> Drill, so it may "just work" for the DFS case.
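The atomic-visibility idea can be made concrete with the usual
write-to-temp-then-rename trick (a plain-Python sketch with an invented file
name; Drill itself is not involved):

```python
import os
import tempfile

def atomic_write(path: str, data: str) -> None:
    """Write data to path so readers never observe a partial file."""
    # Stage the data in a temp file in the same directory as the target...
    dir_name = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # ...then rename it into place. On POSIX file systems rename is atomic,
    # so a query planner scanning the directory sees the whole file or nothing.
    os.replace(tmp, path)

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "part-00001.csv")
    atomic_write(target, "id,val\n1,x\n")
    with open(target) as f:
        content = f.read()
print(content)
```

With this convention, "a file exists" really does mean "the file is
complete," which is what makes the plan-time file listing safe.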
>>
>>
>> And, if the goal is simply to push insert into a DB, then the DB itself
>> can handle transactions and concurrency. Most DBs manage transactions as
>> part of a session. To ensure Drill does a consistent insert, Drill would
>> need to push the update through a single client (a single minor
>> fragment). A distributed insert (using multiple minor fragments each
>> inserting a subset of rows) would require two-phase commit, or would have
>> to forgo consistency. (The CAP problem.) Further, Drill would have to
>> handle insert failures (deadlock detection, duplicate keys, etc.) reported
>> by the target DB and return that error to the Drill client (hopefully in a
>> form other than a long Java stack trace...)
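The single-session approach can be illustrated with a toy example (sqlite3
standing in for the target DB; table and column names are made up). The
batch either fully commits or fully rolls back, and a DB error surfaces as a
readable message rather than a stack trace:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")

def insert_batch(conn, rows):
    """Push all rows through one session as one transaction."""
    try:
        with conn:  # sqlite3: commit on success, rollback on exception
            conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
        return None
    except sqlite3.IntegrityError as e:
        # Return the DB's error as a plain message for the client.
        return f"insert failed: {e}"

err1 = insert_batch(conn, [(1, "a"), (2, "b")])   # succeeds
err2 = insert_batch(conn, [(2, "dup")])           # duplicate key, rolled back
count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(err1, err2, count)
```

The failed batch leaves no partial rows behind, which is exactly the
consistency guarantee a single client session buys you without two-phase
commit.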
>>
>> All this said, I suspect you have in mind a specific use case that is far
>> simpler than the general case. Can you explain more a bit what you have in
>> mind?
>>
>> Thanks,
>> - Paul
>>
>> [1]
>> https://hortonworks.com/tutorial/using-hive-acid-transactions-to-insert-update-and-delete-data/
>> [2]
>> https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive_3_internals.html
>>
>>
>>
>>
>>
>>     On Monday, May 27, 2019, 1:15:36 PM PDT, Ted Dunning <
>> ted.dunn...@gmail.com> wrote:
>>
>>  I would like to start a discussion about how to add insert capabilities
>> to Drill.
>>
>> It seems that the basic outline is:
>>
>> 1) making sure Calcite will parse it (almost certain)
>> 2) defining an upsert operator in the logical plan
>> 3) push rules into Drill from the DB driver to allow Drill to push down
>> the upsert into the DB
>>
>> Are these generally correct?
>>
>> Can anybody point me to analogous operations?
>>
>
>
