Hi Ted,

From item 3, it sounds like you are focusing on using Drill to front a DB system, rather than proposing to use Drill to update files in a distributed file system (DFS).
Turns out that, for the DFS case, the former HortonWorks put quite a bit of work into viable insert/update semantics in Hive with the Hive ACID support [1], [2]. This was a huge effort done in conjunction with various partners, and is on its third version as Hive learns the semantics and how to make ACID perform well under load. Adding ACID support to Drill would be a "non-trivial" exercise (unless Drill could actually borrow Hive's code, but even that might not be simple).

Drill is far simpler than Hive because Drill has long exploited the fact that data is read-only. Once data can change, we must revisit various aspects to account for that fact. Since changes can occur concurrently with queries (and with other changes), some kind of concurrency control is needed. Hive has worked out a way to ensure that only completed transactions are included in a query by using delta files. Hive delta files can include inserts, updates, and deletes.

If insert is all that is needed, then there may be simpler solutions: just track which files are newly added. If the underlying file system is atomic, then even this can be simplified down to just noticing that a file exists when planning a query. If the file is visible before it is complete, then some form of mechanism is needed to detect in-progress files. Of course, Drill must already handle this case for files created outside of Drill, so it may "just work" for the DFS case.

And, if the goal is simply to push inserts into a DB, then the DB itself can handle transactions and concurrency. Most DBs manage transactions as part of a session. To ensure Drill does a consistent insert, Drill would need to push the update through a single client (a single minor fragment). A distributed insert (using multiple minor fragments each inserting a subset of rows) would require two-phase commit, or would have to forgo consistency. (The CAP problem.)
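To make the atomic-file-system point concrete: the usual trick is to write the new file under a temporary name, then rename it into place, so a query planner listing the directory sees either no file or a complete file. Here is a minimal sketch of that pattern, using java.nio.file as a stand-in for a DFS client API; the class, method names, and the ".inprogress" suffix are illustrative, not Drill's actual implementation.

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of "write, then atomically publish" for insert-only tables.
// Paths and naming conventions here are assumptions for illustration.
public class AtomicInsertSketch {

    // Write the new data file under a hidden temporary name, then rename
    // it into the table directory. Readers never see a partial file.
    static Path publish(Path tableDir, String fileName, byte[] data)
            throws IOException {
        Path tmp = tableDir.resolve("." + fileName + ".inprogress");
        Files.write(tmp, data);                 // slow part happens here
        Path target = tableDir.resolve(fileName);
        // ATOMIC_MOVE makes the file appear all-at-once,
        // where the underlying file system supports it.
        return Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    // A planner listing the table directory would skip in-progress files.
    static boolean isVisible(Path p) {
        String name = p.getFileName().toString();
        return !name.startsWith(".") && !name.endsWith(".inprogress");
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("table");
        Path f = publish(dir, "part-00000.json", "{\"a\":1}".getBytes());
        System.out.println(isVisible(f));   // the published file is visible
    }
}
```

If the DFS lacks an atomic rename, the suffix check alone still lets the planner detect and skip in-progress files, which is the weaker fallback mentioned above.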
Further, Drill would have to handle insert failures (deadlock detection, duplicate keys, etc.) reported by the target DB and return that error to the Drill client (hopefully in a form other than a long Java stack trace...).

All this said, I suspect you have in mind a specific use case that is far simpler than the general case. Can you explain a bit more what you have in mind?

Thanks,
- Paul

[1] https://hortonworks.com/tutorial/using-hive-acid-transactions-to-insert-update-and-delete-data/
[2] https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive_3_internals.html

On Monday, May 27, 2019, 1:15:36 PM PDT, Ted Dunning <ted.dunn...@gmail.com> wrote:

I would like to start a discussion about how to add insert capabilities to Drill. It seems that the basic outline is:

1) making sure Calcite will parse it (almost certain)
2) defining an upsert operator in the logical plan
3) pushing rules into Drill from the DB driver to allow Drill to push the upsert down into the DB

Are these generally correct? Can anybody point me to analogous operations?
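The two-phase-commit requirement Paul raises for distributed inserts can be illustrated with a toy coordinator: the insert commits only if every participating fragment votes yes in the prepare phase, and a single failed vote rolls back all of them. This is a minimal sketch; `Participant`, `FakeFragment`, and the method names are invented for illustration and bear no relation to Drill's minor-fragment APIs.

```java
import java.util.List;

// Toy two-phase commit, showing why a distributed insert (one minor
// fragment per subset of rows) needs coordination to stay consistent.
public class TwoPhaseCommitSketch {

    interface Participant {
        boolean prepare();   // phase 1: stage the rows, vote commit/abort
        void commit();       // phase 2a: all voted yes
        void rollback();     // phase 2b: someone voted no
    }

    // Returns true only if every fragment prepared successfully.
    static boolean runTransaction(List<Participant> fragments) {
        for (Participant p : fragments) {
            if (!p.prepare()) {
                // One failed vote aborts the whole insert.
                fragments.forEach(Participant::rollback);
                return false;
            }
        }
        fragments.forEach(Participant::commit);
        return true;
    }

    // A fake fragment that records its final state, for demonstration.
    static class FakeFragment implements Participant {
        final boolean canPrepare;
        String state = "idle";
        FakeFragment(boolean canPrepare) { this.canPrepare = canPrepare; }
        public boolean prepare()  { state = canPrepare ? "prepared" : "failed"; return canPrepare; }
        public void commit()      { state = "committed"; }
        public void rollback()    { state = "rolledback"; }
    }

    public static void main(String[] args) {
        FakeFragment a = new FakeFragment(true);
        FakeFragment b = new FakeFragment(false);
        System.out.println(runTransaction(List.of(a, b)));  // false: b aborted the insert
        System.out.println(a.state);                        // rolledback
    }
}
```

The alternative Paul mentions, funneling the whole insert through a single minor fragment, avoids this coordination entirely at the cost of parallelism, since the target DB's own session-level transaction then does all the work.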