This is a good description of the problem, but any data solution depends on a lot of details, and there will always be many ways to solve it. Here is a smattering of advice.
You have quite a bit of filtering. How many rows in total does a typical query return? If you have millions / billions of rows but you are only ever querying 10-100 rows (pre-aggregation) at a time, then a traditional row-major RDBMS will probably work pretty well. Indices on the columns you are filtering will quickly identify the rows that need to be loaded, and all the caching is going to be built in. On the other hand, if you need the interoperability of Arrow, or you are going to be querying large result sets, then using column-based storage backed by Arrow sounds like a good idea.

You will probably find you want to partition your data by some of the columns you are querying. For example, you could partition by SeriesID and Timestamp, by all three columns, or by just one. The smaller the partition, the more precisely your queries can select what data to load. However, the smaller the partition, the more overhead you are going to have (more files, less effective I/O / prefetch, etc.), so the appropriate granularity is going to depend on your data and your queries.

Depending on your partitioning, you may also struggle with updates. If you get a lot of updates that divide into many small batches once partitioned, then you will end up with lots of tiny files (not a good thing). So then you will probably need some kind of periodic compaction that batches files back together.

You'll want some kind of tool to do the filtering & compute work, both pushdown filtering (using the file metadata & directory names to select which data to load) and post-filtering (removing rows that you don't want but couldn't identify through metadata alone). Some of the implementations have this built in and others don't; I don't know Go well enough to say for certain where it falls. For the ones that don't, there are query engines out there you can use, and you can communicate with them via Flight, shared IPC files, the C data interface, or any number of other ways (you may want to check what Go supports).
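To make the pushdown-filtering idea concrete, here is a minimal Go sketch of pruning partitions by directory name alone, before any file is opened. The `series_id=<id>/date=<YYYY-MM-DD>` layout is a hypothetical Hive-style convention, not something Arrow mandates, and `prunePartitions` is an illustrative helper, not a library API:

```go
package main

import (
	"fmt"
	"strings"
)

// prunePartitions keeps only the partition directories whose encoded
// SeriesID and date satisfy the query's predicate. Directory names use
// a hypothetical Hive-style layout: "series_id=<id>/date=<YYYY-MM-DD>".
func prunePartitions(dirs []string, seriesIDs map[string]bool, fromDate, toDate string) []string {
	var keep []string
	for _, dir := range dirs {
		var sid, date string
		for _, part := range strings.Split(dir, "/") {
			if strings.HasPrefix(part, "series_id=") {
				sid = strings.TrimPrefix(part, "series_id=")
			}
			if strings.HasPrefix(part, "date=") {
				date = strings.TrimPrefix(part, "date=")
			}
		}
		// Lexicographic comparison is correct because the dates are ISO-8601.
		if seriesIDs[sid] && date >= fromDate && date <= toDate {
			keep = append(keep, dir)
		}
	}
	return keep
}

func main() {
	dirs := []string{
		"series_id=a/date=2021-11-30",
		"series_id=a/date=2021-12-01",
		"series_id=b/date=2021-12-01",
	}
	// Query: series "a" only, December 2021 only.
	hits := prunePartitions(dirs, map[string]bool{"a": true}, "2021-12-01", "2021-12-31")
	fmt.Println(hits)
}
```

The point of the sketch is that the more of your predicate you can encode into the directory structure, the less data you have to load at all; anything the path cannot express still has to be handled by post-filtering after the rows are in memory.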
So I think my general advice is that what you are describing is probably a great fit for Arrow, but it's going to be a fair amount of work. There are lots of solutions out there that build on Arrow and will do some parts of this work for you: for example, DataFusion, DuckDB, Iceberg, Nessie, etc. I don't know that this mailing list will be able to provide comprehensive advice on the entire ecosystem of tools out there.

On Fri, Dec 3, 2021 at 2:47 AM Frederic Branczyk <[email protected]> wrote:
>
> Hello,
>
> First of all thank you so much for your work on Arrow, it looks like a very
> promising piece of technology.
>
> I'm very new to Arrow, and I'm trying to understand whether Arrow is a good
> fit for our use case (and if so, if you could maybe give us some pointers as
> to which data structures might make sense). We happen to use Go, but I would
> think that for the extent of my questions it should be language agnostic.
>
> We have a workload that works with data whose table looks pretty much like
>
> +----------+----------+-----------+-------+
> | SeriesID | EntityID | Timestamp | Value |
> +----------+----------+-----------+-------+
>
> Data is written by participants of the system by SeriesID, with a random,
> unpredictable EntityID, and many values at the same time.
>
> Queries to this data typically filter by a set of SeriesIDs and a set
> of EntityIDs, as well as a certain time-frame, and the remaining datasets are
> added up and aggregated by EntityID, so that the result is basically a map
> of EntityID to Value.
>
> Maybe this influences the answer: since we are dealing with a lot of data,
> our hope was that we could store the data in object storage and essentially
> memory map it, with multiple layers of caches from object storage to main
> memory.
>
> At first glance, Arrow looks like a great fit, but I'd love to hear your
> thoughts, as well as whether a particular strategy or data structures come
> to mind for a workload like this.
>
> Best regards,
> Frederic
