Hi Dobes,

Updating data in this way does seem to be a challenge. Hive added ACID features a 
while back, but they are fiendishly complex: every record is given an ID. A 
delete or replace creates an edit entry in another file that uses the same ID. 
Hive then joins the main file with the update file to apply the updates, and 
anti-joins against the deletes file to remove deletions. It seems to work, but it 
also seems fragile and expensive.
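
Roughly, the merge Hive performs looks like this toy Python sketch (record IDs, 
field names, and values are made up for illustration):

    # Toy model of Hive-style ACID merging: every base record has a row
    # ID; updates live in a delta file keyed by that ID, deletes in a
    # third file of IDs.
    base = {1: {"student": "ann", "score": 85},
            2: {"student": "bob", "score": 72}}
    updates = {2: {"student": "bob", "score": 79}}   # edit entries
    deletes = {1}                                    # deleted row IDs

    # "Join" base against updates, then "anti-join" against deletes.
    merged = {rid: updates.get(rid, rec)
              for rid, rec in base.items()
              if rid not in deletes}
    # merged == {2: {"student": "bob", "score": 79}}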

For how long are your files open to revision? Days? Weeks? The whole school 
term? Can deletes arrive after the school term ends? I'm wondering how dynamic 
the data is. Is there a "frothy" present, but then a stable history? Or is the 
whole history frothy?


If revisions are short-term (correcting errors, make-up tests, etc.), you are 
trying to create a streaming database as described in the book Streaming 
Systems [1]. If you read that book online, it has lots of nice animations 
showing the various issues, and how late-arriving values can replace earlier 
values. Might spark some ideas.
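
The core trick there is "last write wins" per key. A toy sketch in Python 
(field names are invented):

    # A late-arriving revision replaces the earlier value for the same
    # (student, test) key if its revision timestamp is newer.
    scores = {}

    def ingest(record):
        key = (record["student_id"], record["test_id"])
        current = scores.get(key)
        if current is None or record["revised_at"] > current["revised_at"]:
            scores[key] = record

    ingest({"student_id": 7, "test_id": "quiz-3",
            "score": 68, "revised_at": "2020-03-02"})
    ingest({"student_id": 7, "test_id": "quiz-3",       # make-up test,
            "score": 81, "revised_at": "2020-03-09"})   # arrives late
    # scores now holds the 81 for (7, "quiz-3").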


Depending on how fancy you want to get, you can create a fake directory 
structure in S3 and let Drill's usual partition pruning reduce the number of 
files scanned for each query.
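
For example, with year/month directories, Drill exposes each directory level 
as dir0, dir1, and so on, and prunes directories that can't match the filter. 
A sketch using Drill's REST API (bucket, workspace, and column names are made 
up):

    # Assumed S3 layout, one directory per year and month:
    #   s3://grades-bucket/scores/2020/03/part-000.parquet
    #   s3://grades-bucket/scores/2020/04/part-000.parquet
    import requests

    sql = """
      SELECT student_id, score
      FROM s3.`scores`
      WHERE dir0 = '2020' AND dir1 = '03'  -- prunes all other months
    """
    resp = requests.post("http://localhost:8047/query.json",
                         json={"queryType": "SQL", "query": sql})
    print(resp.json()["rows"][:5])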

Or, you can create an index database that holds the complete list of your 
files, along with metadata (maybe the school, class, date, assignment, 
whatever). You can write custom code that first looks up the filter conditions 
in your index DB to get a list of files, then passes that list to Drill in 
place of Drill's partition pruning. This is not an "out of the box" task; you'd 
basically be writing a plugin that replaces Drill's normal directory and 
partition push-down code.
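
A minimal sketch of the lookup side, with SQLite standing in for the index DB 
(the table layout, and gluing the file list together with UNION ALL, are my 
assumptions, not built-in Drill features):

    # Find which Parquet files cover the filter, then build one query
    # over exactly those files instead of scanning a directory.
    import sqlite3

    idx = sqlite3.connect("file_index.db")
    files = [row[0] for row in idx.execute(
        "SELECT s3_path FROM parquet_files WHERE school = ? AND term = ?",
        ("lincoln-high", "2020-spring"))]

    sql = " UNION ALL ".join(
        f"SELECT * FROM s3.`{path}`" for path in files)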

With the index DB, replacing a Parquet file is as easy as replacing the entry 
at a particular coordinate in your index. Doing this also avoids race 
conditions: you can replace the index entry, wait a few hours to ensure all 
in-flight queries using the old file complete, then delete the obsolete Parquet 
file.
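
In sketch form (using the hypothetical index table above; the delete call is 
boto3; the grace period is whatever bounds your longest query):

    import time

    def replace_file(idx, coord, new_path, old_path, s3):
        # 1. Point the index at the new file; new queries pick it up.
        idx.execute("UPDATE parquet_files SET s3_path = ? WHERE coord = ?",
                    (new_path, coord))
        idx.commit()
        # 2. Let in-flight queries against the old file finish.
        time.sleep(4 * 3600)
        # 3. Only now is the old file safe to delete.
        s3.delete_object(Bucket="grades-bucket", Key=old_path)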

If queries are local (for a single teacher, student, or school), the number of 
files can be small, given good data localization (all of the data from one 
district in one file, say). So, Drill might normally scan a handful (dozens, a 
few hundred) of files. If, however, you are computing nation-wide trends 
across all data, then all files might be involved. In the localized case, 
having a good index would help you keep the number of files per query small.

Hive, by the way, helps with this because Hive maps table columns to 
directories. You could have partitions for, say, school district, teacher, 
class, student, and test, which would directly map to terms in your actual 
queries. With Drill, you have to do the mapping yourself, which is a hassle. 
With a roll-your-own index, you can reproduce the Hive behavior. (Yes, we 
should implement the Hive pattern in Drill - perhaps an outcome of the Drill 
metastore.)
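
With a roll-your-own index, "reproducing the Hive behavior" mostly means 
generating Hive-style key=value paths yourself, something like (names 
invented):

    # Hive's convention encodes partition columns into the path itself.
    def partition_path(district, teacher, klass, test_date):
        return (f"school_district={district}/teacher={teacher}/"
                f"class={klass}/test_date={test_date}/data.parquet")

    partition_path("042", "smith", "algebra-1", "2020-03-13")
    # -> "school_district=042/teacher=smith/class=algebra-1/
    #     test_date=2020-03-13/data.parquet"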


Are you planning to combine this with Mongo for the most recent data? If so, 
then your index mechanism can also identify when to use Mongo, and when data 
has migrated to S3.
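
That routing can be a thin lookup in the same index (table and column names 
hypothetical):

    # Recent, mutable slices live in Mongo; migrated, stable ones in S3.
    def backend_for(idx, school, term):
        row = idx.execute(
            "SELECT store FROM placements WHERE school = ? AND term = ?",
            (school, term)).fetchone()
        return row[0] if row else "mongo"   # not migrated yet -> Mongo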

Just out of curiosity, if your data changes frequently, have you considered a 
distributed DB such as Cassandra or the like? A Google search suggested there 
are about 60 million students in the US. Let's assume you are wildly 
successful, and serve all of them. There are 250 school days per year (IIRC). 
Let's say each student takes 5 tests per day. (My kids would revolt, but 
still.) That is 75 billion data points per school year. At, say, 100 bytes per 
record, that is about 8 TB of data. Seems like something a small cluster could 
handle, with the number of nodes probably driven by query rate.
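
The back-of-envelope math, for the record:

    students = 60_000_000
    school_days = 250
    tests_per_day = 5
    bytes_per_record = 100

    records = students * school_days * tests_per_day   # 75 billion
    terabytes = records * bytes_per_record / 1e12      # = 7.5 TB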

Maybe break the problem into pieces? Ignoring the update issue, could you get 
good query performance from S3 with some partitioning strategy? Could you get 
good performance from Mongo (or Cassandra, or whatever)? If a static solution 
won't perform well, a dynamically updated one probably won't be any better.


Anyone else who's built this kind of system and can offer suggestions?

Thanks,
- Paul


[1] https://learning.oreilly.com/library/view/streaming-systems/9781491983867/


On Friday, March 13, 2020, 9:35:13 PM PDT, Dobes Vandermeer <[email protected]> wrote:

Hi,

I've been thinking about how I might be able to get a good level of performance 
from Drill while still having data that updates, and while storing the data in 
S3. Maybe this is a pipe dream, but here are some thoughts and questions.

What I would like to be able to do is to update, replace, and re-balance the 
Parquet files in S3, but I don't want to calculate and specify the whole list 
of files that are "current" in each query.

I was thinking perhaps I could use a view, so when I replace a file I can add a 
new file, update the view to include it, and then delete the old file.
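
Something like this, maybe, through Drill's REST API (workspace and file names 
are made up, but CREATE OR REPLACE VIEW is real Drill syntax):

    # Re-point a Drill view at a replacement Parquet file; the swap is
    # atomic from the query side, and the old file is deleted afterwards.
    import requests

    requests.post("http://localhost:8047/query.json", json={
        "queryType": "SQL",
        "query": """
          CREATE OR REPLACE VIEW dfs.tmp.current_grades AS
            SELECT * FROM s3.`grades/part-001.parquet`
            UNION ALL
            SELECT * FROM s3.`grades/part-002-v2.parquet`
        """,
    })
    # ...then delete grades/part-002-v1.parquet once old queries drain.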

But I'm worried that having a view with thousands of files could perform poorly.

Building on that idea, it occurred to me that perhaps I could have a hierarchy 
of views - views of views. For example, a view for each day, rolled into a 
view for each month, rolled into a view for each year, rolled into a top-level 
view. This could be useful if Drill could somehow prune views, but I haven't 
seen any mention of that in the docs.

It seems like Apache Iceberg is designed to help with this, but it doesn't 
support S3 currently, and I'm not sure if it will (or can) anytime soon.

Does anyone have any thoughts or experience to share in this area?

Maybe what I am really looking for is some other database entirely - some kind 
of scalable database that supports updates and scales horizontally. Maybe 
Drill just isn't like that right now.
