Hi Ted,

From item 3, it sounds like you are focusing on using Drill to front a DB system, rather than proposing to use Drill to update files in a distributed file system (DFS).


It turns out that, for the DFS case, the former Hortonworks put quite a bit of work into developing viable insert/update semantics in Hive with the Hive ACID support [1], [2]. This was a huge amount of work done in conjunction with various partners, and it is now on its third version as Hive learns the semantics and how to get ACID to perform well under load. Adding ACID support to Drill would be a "non-trivial" exercise (unless Drill could actually borrow Hive's code, but even that might not be simple).


Drill is far simpler than Hive because Drill has long exploited the fact that 
data is read-only. Once data can change, we must revisit various aspects to 
account for that fact. Since change can occur concurrently with queries (and 
other changes), some kind of concurrency control is needed. Hive has worked out 
a way to ensure that only completed transactions are included in a query by 
using delta files. Hive delta files can include inserts, updates and deletes.
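The visibility rule can be sketched loosely as follows. This is an illustrative Python sketch of snapshot-style planning, not Hive's actual mechanism; the type and function names are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeltaFile:
    path: str
    txn_id: int  # id of the transaction that wrote this delta

def visible_deltas(delta_files, committed_txns):
    """Snapshot-style planning: include only delta files whose writing
    transaction had already committed when the query was planned."""
    return [f for f in delta_files if f.txn_id in committed_txns]
```

Anything written by a still-open transaction is simply left out of the scan, so concurrent writers never expose half-finished changes to readers.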

If insert is all that is needed, then there may be simpler solutions: just 
track which files are newly added. If the underlying file system is atomic, 
then even this can be simplified down to just noticing that a file exists when planning a query. If a file is visible before it is complete, then some mechanism is needed to detect in-progress files. Of course, Drill must
already handle this case for files created outside of Drill, so it may "just 
work" for the DFS case.
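The usual trick for getting that atomicity is the stage-then-rename pattern: write to a temporary file, then rename it into place. A minimal Python sketch, assuming a POSIX file system where rename is atomic (the staging suffix is a made-up convention):

```python
import os
import tempfile

def atomic_write(target_path: str, data: bytes) -> None:
    """Write data to a staging file, then rename it into place.

    On POSIX file systems os.rename() is atomic, so a planner scanning
    the directory sees either the complete file or no file at all --
    never a partially written one.
    """
    dir_name = os.path.dirname(target_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".inprogress")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes are durable before publishing
        os.rename(tmp_path, target_path)  # atomic publish
    except BaseException:
        os.unlink(tmp_path)  # never leave a half-written staging file behind
        raise
```

A writer outside Drill that follows this convention never exposes an in-progress file, which is why the existing read path may "just work."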


And, if the goal is simply to push insert into a DB, then the DB itself can 
handle transactions and concurrency. Generally, most DBs manage transactions as part of a session. To ensure Drill does a consistent insert, Drill would need to push the insert through a single client (single minor fragment). A
distributed insert (using multiple minor fragments each inserting a subset of 
rows) would require two-phase commit, or would have to forgo consistency. (The 
CAP problem.) Further, Drill would have to handle insert failures (deadlock 
detection, duplicate keys, etc.) reported by the target DB and return that 
error to the Drill client (hopefully in a form other than a long Java stack 
trace...)
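The single-client case can be sketched in a few lines, using Python's sqlite3 as a stand-in for the target DB (the table, column names, and error translation are hypothetical):

```python
import sqlite3

def consistent_insert(conn: sqlite3.Connection, rows) -> None:
    """Insert all rows through one connection in one transaction.

    A single session either commits every row or none of them; a
    duplicate key (or any other integrity error) rolls the whole
    batch back and surfaces a clean error instead of a partial insert.
    """
    try:
        with conn:  # one transaction: commit on success, rollback on error
            conn.executemany("INSERT INTO t (id, val) VALUES (?, ?)", rows)
    except sqlite3.IntegrityError as e:
        raise ValueError(f"insert failed, batch rolled back: {e}") from e
```

This is exactly what a distributed insert cannot get for free: once each minor fragment holds its own connection, there is no single transaction to commit or roll back, which is where two-phase commit (or giving up consistency) comes in.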

All this said, I suspect you have in mind a specific use case that is far 
simpler than the general case. Can you explain a bit more about what you have in mind?

Thanks,
- Paul

[1] 
https://hortonworks.com/tutorial/using-hive-acid-transactions-to-insert-update-and-delete-data/
[2] 
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive_3_internals.html

On Monday, May 27, 2019, 1:15:36 PM PDT, Ted Dunning <ted.dunn...@gmail.com> wrote:
I would like to start a discussion about how to add insert capabilities to Drill.

It seems that the basic outline is:

1) making sure Calcite will parse it (almost certain)
2) defining an upsert operator in the logical plan
3) push rules into Drill from the DB driver to allow Drill to push down the upsert into the DB

Are these generally correct?

Can anybody point me to analogous operations?