On Sunday, January 15, 2023 at 05:26:50 UTC+1, Dino wrote:

> Hello, I have built a PoC service in Python Flask for my work, and - now that the point is made - I need to make it a little more performant (to be honest, chances are that someone else will pick up from where I left off and implement the same service from scratch in a different language (GoLang? .Net? Java?), but I am digressing).
>
> Anyway, my Flask service initializes by loading a big "table" of 100k rows and 40 columns or so (memory footprint: on the order of 300 MB) and then accepts queries through a REST endpoint. Columns are strings, enums, and numbers. Once initialized, the table is read-only. The endpoint will parse the query and match it against column values (equality, inequality, greater than, etc.). Finally, it will return a (JSON) list of all rows that satisfy all conditions in the query.
>
> As you can imagine, this is not very performant in its current form, but performance was not the point of the PoC - at least initially.
>
> Before I deliver the PoC to a more experienced software architect who will look at my code, though, I wouldn't mind looking a bit less lame and doing something about performance in my own code first, possibly by bringing the average time for queries down from where it is now (on the order of 1 to 4 seconds per query on my laptop) to 1 or 2 milliseconds on average.
>
> To be honest, I was already able to bring the time down to a handful of microseconds thanks to a rudimentary cache that associates the "signature" of a query with its result and serves that the next time the same query is received, but this may not be good enough: 1) queries might be many and very different from one another each time, AND 2) I am not sure the server will have a ton of RAM if/when this thing - or whatever is derived from it - is placed into production.
>
> How can I make my queries generally more performant, ideally also in the case of a new query?
>
> Here's what I have been considering:
>
> 1. Making my cache more "modular", i.e. caching the results of certain (wide) queries. When a complex query comes in, I may be able to restrict my search to a subset of the rows (as determined by a previously cached partial query). This should keep the memory footprint under control.
>
> 2. Loading my data into a numpy.array and using numpy.array operations to slice and dice my data.
>
> 3. Loading my data into sqlite3 and using SELECT statements to query my table. I have never used sqlite, plus there's some extra complexity as comparing certain columns requires custom logic, but I wonder if this architecture would also work well when dealing with a 300 MB database.
>
> 4. Other ideas?
>
> Hopefully I made sense. Thank you for your attention.
>
> Dino
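On the sqlite3 idea (option 3): here is a minimal, in-memory sketch, with made-up column names standing in for your real schema, just to show the general shape. For the columns that need custom comparison logic, sqlite3 also lets you register a plain Python function with Connection.create_function() and call it from SQL.

import sqlite3

# Tiny made-up stand-in for the real 100k-row table
rows = [
    ("alpha", "active", 12.5),
    ("beta", "inactive", 3.0),
    ("gamma", "active", 7.25),
]

conn = sqlite3.connect(":memory:")   # or a file path, to keep the 300 MB out of RAM
conn.execute("CREATE TABLE data (name TEXT, status TEXT, price REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)
conn.execute("CREATE INDEX idx_status_price ON data (status, price)")
conn.commit()

# Equality + inequality conditions, parameterized; sqlite uses the index where it can
cur = conn.execute(
    "SELECT name, status, price FROM data WHERE status = ? AND price > ?",
    ("active", 5.0),
)
print(cur.fetchall())   # [('alpha', 'active', 12.5), ('gamma', 'active', 7.25)]

With proper indexes on the most frequently filtered columns, this kind of query should be well under a millisecond even on 100k rows, and sqlite keeps working fine with a file-backed database of a few hundred MB.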
As a comparison, with numpy: given the following lines,

import numpy as np
a = np.random.randn(400, 100_000)
ia = np.argsort(a[0, :])
a_elem = a[56, ia[0]]

I have just picked an element at random from a numeric table of 400 x 100,000 elements. To find it with numpy:

%timeit isel = (a == a_elem)
35.5 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

and

%timeit a[isel]
9.18 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Since the data are not ordered, it is searching element by element, but at C level. Of course it depends on a lot of things...
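For the row-filtering use case itself, the same boolean-mask idea can combine several per-column conditions in one pass. A rough sketch with made-up columns (not your real schema), assuming each column is held in its own array:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Made-up stand-ins for three of the ~40 columns
price = rng.random(n) * 100.0
qty = rng.integers(0, 50, size=n)
status = rng.choice(np.array(["active", "inactive"]), size=n)

# One boolean mask per condition, combined with &; the whole thing runs at C speed
mask = (status == "active") & (price > 40.0) & (qty <= 10)
matching_rows = np.nonzero(mask)[0]   # indices of rows satisfying all conditions
print(len(matching_rows))

If one column is queried very often, sorting the table by it once at load time and using np.searchsorted turns that part of the lookup into a binary search instead of a full scan.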