Thanks for doing this, Thom! It's been quite a while since I did something similar.
A few quick notes on import:

- We expect writes to be relatively slow (indeed, we expect everything to be relatively slow, because we haven't done much performance tuning yet), but that importer code is running one SQLite write transaction and one Mentat transact per place. Transactions are costly — fsyncs alone will add up. You should get a big speedup on the importer by sharing a single `begin_transaction` call across all places; there's a sketch of the shape after these notes. However, don't expect to be able to tweak this code into running in milliseconds: a truly fast Places importer (indeed, this is true for most SQL systems) will be fast because it uses knowledge of the structure of the input data, rather than forcing the transactor to do expensive work to figure it out — *e.g.*, it will probably be faster to import all the places, then use their IDs to import all the visits, allocating new IDs for each visit in advance, because we know that visits never merge. The transactor does lots of work to 'knit' incoming data into the existing graph, and the importer knows that it doesn't need to. There are a bunch of smaller optimizations you could make, too (*e.g.*, looking up attributes by keyword *once*, rather than tens of thousands of times), but I won't get into those here.

- The database size is within expected parameters. A few multiples of Places isn't a concern for the leaders I've talked to — we're deliberately trading disk for flexibility here. But you've used `:index true` on most attributes, which trades disk space for reverse-lookup performance; see the schema snippet below. There are a few issues tracked to make size/speed tradeoffs easier to twiddle, like this one to allow an automatic equivalent of url_hash: <https://github.com/mozilla/mentat/issues/69>. There are two reasons why the disk size is larger than Places'. The first is that Mentat's layout is more normalized than Places' schema; imagine a Places where each visit type, date, and URL was tracked in a separate table. The second is that we're storing a log of all the data. Right after an import into an empty database the log is basically identical to the live data, just with different indices. If you want to see what a log-free mode <https://github.com/mozilla/mentat/issues/545> would look like, you can `DELETE FROM transactions` and vacuum; you should see about a 30% drop in file size.

- It might be worth vacuuming the database after the import regardless. Mentat won't do that for you, and the fragmentation might affect query times. (The SQL for both of these is below.)
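To make the first note concrete, here's a rough sketch of the batched shape. It's illustrative, not tested: `PlaceRow` and `edn_for_place` stand in for whatever the importer already uses, and the exact signatures and error types should be checked against the current API.

```rust
use mentat::Store;

// Sketch only: share one `begin_transaction` across the whole import
// instead of opening a write transaction per place.
fn import_places(store: &mut Store, places: &[PlaceRow]) -> mentat::errors::Result<()> {
    let mut in_progress = store.begin_transaction()?; // one SQLite write txn for everything
    for place in places {
        // `edn_for_place` is a hypothetical helper that builds the
        // transact body for a single place.
        in_progress.transact(&edn_for_place(place))?;
    }
    in_progress.commit()?; // one commit (and one fsync) instead of one per place
    Ok(())
}
```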
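On the indexing point: the knob is per-attribute in the schema. A minimal example, assuming the Datomic-style schema vocabulary Mentat uses (the attribute and value type here are illustrative):

```edn
;; :db/index true builds a value→entity (AVET) index for this attribute:
;; faster reverse lookups, at the cost of disk space and write time.
;; Reserve it for attributes you actually look up by value.
[{:db/ident       :place/url_hash
  :db/valueType   :db.type/long
  :db/cardinality :db.cardinality/one
  :db/index       true}]
```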
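And the two storage experiments from the last two notes, at the SQLite level. Run the delete on a scratch copy only; it throws away your history of history:

```sql
-- Preview a log-free mode (issue 545) on a COPY of the database:
DELETE FROM transactions;

-- Rebuild and defragment the file. Also worth running on its own
-- after a large import, since Mentat won't do it for you.
VACUUM;
```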
On the topic of the queries you ran:

- Your 10-most-recent-visits query is actually more precise than you think it is (it doesn't mean the same thing as the SQL!), which is why it's slower. You intend "fetch the most recent 10 visits", but what it says is "fetch me the entities which have a date, a place, and a URL, excluding those that are missing any, and give me the distinct URLs of the 10 most recent". That requires joins, and the joins are why your query is slow — they happen *inside* the limit, because they need to exclude partial data. The performance-equivalent way to do this is to fetch just the visit entity: `[:find ?visit :where [?visit :visit/date ?date] :order (desc ?date) :limit 10]`. If you also need the other fields, use pull, which I'm in the process of building now <https://github.com/mozilla/mentat/pull/638>, or grab them out of the cache or off disk in a second query. The Mentat 2019® way of doing this would be to define a materialized view so you don't need to do any work *at all* at a performance-sensitive moment. We're some ways away from that.

- For queries with times in the milliseconds, `.query_prepared` will give fairer timings. Mentat does strictly more work than SQLite — being a wrapper, it naturally does everything SQLite does, plus planning beforehand and result processing afterwards — so preparing a query matters a little more. (A sketch of the Rust equivalent follows these notes.)

- You should consider caching appropriate attributes. Adding a reverse-lookup cache for :place/url_hash might help your Facebook query: `.cache :place/url_hash reverse` in the CLI. (Also sketched below.)

More broadly: Places is a very relational system, and a very 'square' one at that. For example, every visit has all three fields; there are never any NULLs. That 'squareness' is why we originally proposed addressing Places' workload by deriving the tabular representation from the log, preserving the ability to write those square queries. That could be done either inside Mentat <https://github.com/mozilla/mentat/issues/33> or as a materialized view in SQL; there's a hand-wavy sketch of the latter below.
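A sketch of the prepared-query point in Rust. I'm going from memory of the query API here, so treat the names and signatures (`q_prepare`, `run`) as assumptions to verify:

```rust
// Hypothetical sketch: parse and plan the query once, then time only
// the repeated runs. The query itself is illustrative.
fn timed_runs(store: &mut mentat::Store) -> mentat::errors::Result<()> {
    let mut prepared = store.q_prepare(
        "[:find ?url :where [?v :visit/place ?p] [?p :place/url ?url]]",
        None, // no external inputs
    )?;
    for _ in 0..100 {
        let _results = prepared.run(None)?; // pays for SQLite and row processing, not planning
    }
    Ok(())
}
```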
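The programmatic twin of that CLI command would look something like the following; `kw!` and `CacheDirection` are my recollection of the names, so check them:

```rust
// Sketch: register a reverse (value-to-entity) cache for :place/url_hash,
// so lookups by hash are answered from memory.
store.cache(&kw!(:place/url_hash), CacheDirection::Reverse)?;
```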
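To make the materialized-view idea concrete, a hand-wavy SQL sketch. It assumes the (e, a, v) shape of Mentat's datoms table, uses made-up attribute entids, and a real implementation would maintain a table from the transaction log rather than defining a plain view:

```sql
-- Hypothetical: a square, Places-shaped view derived from datoms.
-- 100, 101, and 102 stand in for the entids of :visit/date,
-- :visit/place, and :place/url; a real version would look them up
-- from the schema once.
CREATE VIEW square_visits AS
SELECT vd.e AS visit_id,
       vd.v AS visit_date,
       vp.v AS place_id,
       pu.v AS url
FROM datoms vd
JOIN datoms vp ON vp.e = vd.e AND vp.a = 101  -- :visit/place
JOIN datoms pu ON pu.e = vp.v AND pu.a = 102  -- :place/url
WHERE vd.a = 100;                             -- :visit/date
```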
I'm actually quite pleasantly surprised that you've been able to dump in a non-trivial amount of data, not vacuum the DB, write "SQL-ey" queries, and get non-prepared, non-cache-using times that are within rock-chucking distance of fast. I think we can get significant improvements just by using what already exists (*e.g.*, caching attributes and not writing SQL-style queries), and there's already work tracked and underway that will go even further. The final leap would be to derive the Places tables as we had originally planned. Naturally that would get the same query performance as Places itself (or better, because there'd be no writer contention!). We haven't prioritized that work because leadership didn't consider it much of a risk.

Keep researching! I'm keen to see this kind of thing continue.

Thanks,
-R

On Tue, Apr 17, 2018 at 1:18 AM, Thom Chiovoloni <[email protected]> wrote:

> In my 1:1 with Mark I talked about how I was a little concerned that
> (AFAIK) we hadn't tried to run Mentat at a scale equivalent to a user's
> Places database (I had also mentioned this a few times to various others).
> Eventually, if Mentat is to actually be the way forward for synced storage
> in Firefox, this seems necessary, and it would be good to find problems
> sooner rather than later.
>
> I wrote some hacky code to stuff a representative set of the
> `places.sqlite` data (the relevant-looking fields in your visit and place
> info) into the Mentat DB this evening. You can find the code at
> https://github.com/thomcc/mentat-places-test; if you have a Rust dev
> environment set up, it should be usable. The start of the README explains
> how to use it.
>
> I wrote a lot more detail in the README, but the TL;DR is that my findings
> were mixed. The performance is not very good, but it also wasn't so bad as
> to be unusable on a good machine. That said, I accidentally destroyed my
> places.sqlite while writing this code, so I only had a subset of my
> history to test on.
>
> Caveat 1: The code is hacky and doesn't handle errors robustly. It may or
> may not work for you. Ask me on IRC, file a bug, or reply if it doesn't.
> That said, its connection to Places is read-only, so your
> history/bookmarks shouldn't be at any risk.
>
> Caveat 2: I did this mostly out of curiosity, and I likely don't really
> know how to use Mentat properly, so these results should be taken with a
> grain of salt. Someone who knew what they were doing would probably get
> better numbers, but IDK how much better. I'm skeptical it would compare
> favorably to the current system, but it also might not need to.
>
> Anyway, thanks,
> Thom Chiovoloni

