On Tue, Apr 17, 2018 at 9:50 AM, Nicholas Alexander <nalexan...@mozilla.com>
wrote:

>
> Thanks for investigating this!  I went through a very similar process to
> understand write throughput in Datomish (to a first approximation).
> (Datomish was the prototype Clojure/ClojureScript implementation of the
> ideas we're pushing forward in Mentat.)  For the curious, that importer is
> archived at
> https://github.com/mozilla/mentat/blob/9a9dfb502acf5e4cdb1059d4aac831d7603063c8/src/common/datomish/places/importer.cljc
>
> I used these tests to motivate the approach we're taking in the transactor
> in Mentat, where we upsert in a series of rounds, each round being a bulk
> SQLite operation.  When we stopped working on the Datomish implementation,
> ~90% of our execution time was spent by SQLite, which I took as
> sufficiently little Datomish overhead to move on.  It might be that Mentat's
> own overhead is more significant.
>
> Of course, as rnewman points out, there are much faster ways to bulk
> import into Mentat for production uses, but I will take a look at your
> experiment and see if we can use it as a first perf test for Mentat.  In
> particular, repeated import is a real stress test for the transactor: every
> single datom we try to transact will be present in the store, so the
> question is how fast can we recognize that and move on?
>
> Thanks again for starting this!
> Nick
>
>
Interesting; I guess it's not surprising that something like this already
existed.

I think it's still true that most of the time is spent in SQLite for
Mentat. If I run the importer inside `Instruments` (the profiler) on my
Mac, that's certainly what it looks like. I'd expect that's partly due to
how I'm using the API, though, since the importer may be doing multiple
transactions per insert. (I had trouble figuring out how to do it in a
single transaction given the APIs available, and I didn't want to just
build one big EDN string and send it in because of the parsing overhead,
although maybe that would have been cheaper than what I ended up with.)
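
For reference, the shape I was looking for is roughly the following
(untested; I'm assuming a `begin_transaction`-style API along the lines
rnewman describes below, so the exact names may be off):

    // Untested sketch: one outer Mentat transaction wrapping many smaller
    // transacts, instead of one write transaction per place/visit.
    // `Store::open`, `begin_transaction`, `transact`, and `commit` are my
    // guesses at the API surface and may not match the real names.
    let mut store = mentat::Store::open("mentat_places.db")
        .expect("couldn't open store");
    let mut in_progress = store.begin_transaction()
        .expect("couldn't begin transaction");
    for chunk in edn_chunks {
        // `edn_chunks` stands in for however the importer batches its
        // assertions; each chunk is an EDN string like
        // "[{:place/url \"https://www.mozilla.org/\"} ...]".
        in_progress.transact(chunk).expect("transact failed");
    }
    in_progress.commit().expect("commit failed");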


On Tue, Apr 17, 2018 at 8:00 AM, Richard Newman <rnew...@mozilla.com> wrote:

> Thanks for doing this, Thom! It's quite a while since I did something
> similar.
>
> A few quick notes on import:
>
>    - We expect writes to be relatively slow (indeed, we expect everything
>    to be relatively slow, because we haven't done much performance tuning
>    yet), but that importer code is running one SQLite write transaction and
>    one Mentat transact per place. Transactions are costly — fsyncs alone will
>    add up. You should get a big speedup on the importer by sharing a single `
>    begin_transaction` call across all places.
>
>    However, don't expect to be able to tweak this code into running in
>    milliseconds: a truly fast Places importer (indeed, this is true
>    for most SQL systems) will be fast because it uses knowledge of the
>    structure of the input data, rather than forcing the transactor to do
>    expensive work to figure it out — *e.g.*, it will probably be faster
>    to import all the places, then use their IDs to import all the visits,
>    allocating new IDs for each visit in advance, because we know that visits
>    never merge. The transactor does lots of work to 'knit' incoming data into
>    the existing graph, and the importer knows that it doesn't need to.
>
>    There are a bunch of smaller optimizations you could make, too (*e.g.*,
>    looking up attributes by keyword *once*, rather than tens of thousands
>    of times), but I won't get into those here.
>
>    - The database size is within expected parameters. A few multiples of
>    Places isn't a concern for the leaders I've talked to — we're deliberately
>    trading for flexibility here. But you've used :index true on most
>    attributes, which trades disk space for reverse lookup performance. There
>    are a few issues tracked to make size/speed tradeoffs easier to twiddle,
>    like this one to allow an automatic equivalent of url_hash
>    <https://github.com/mozilla/mentat/issues/69>.
>
>    There are two reasons why the disk size is larger than Places. The
>    first is that Mentat's layout is more normalized than Places' schema;
>    imagine a Places where each visit type, date, and URL was tracked in a
>    separate table. The second is that we're storing a log of all the data.
>    Right after an import into an empty database the log is basically identical
>    to the live data, just with different indices. If you want to see what a
>    log-free mode would look like, you can `DELETE FROM transactions`
>    <https://github.com/mozilla/mentat/issues/545> and vacuum; you should
>    see about a 30% drop in file size.
>
>    - It might be worth vacuuming the database after the import. Mentat
>    won't do that for you, and the fragmentation might affect query times.
>
>
> On the topic of the queries you ran:
>
>    - Your 10-most-recent visits query is actually more precise than you
>    think it is (it doesn't mean the same thing as the SQL!), which is why it's
>    slower. You intend "fetch the most recent 10 visits", but it's actually
>    "fetch me the entities which have a date, a place, and a URL, excluding
>    those that are missing any, and give me the distinct URLs of the 10 most
>    recent". That requires joins, and the joins are why your query is slow —
>    they happen *inside* the limit, because they need to exclude partial
>    data.
>
>    The performance-equivalent way to do this is to just fetch the visit
>    entity: *[:find ?visit :where [?visit :visit/date ?date] :order (desc
>    ?date) :limit 10]*. If you also need the other fields, use pull, which I'm
>    in the process of building now
>    <https://github.com/mozilla/mentat/pull/638>, or grab them out of the
>    cache or off disk in a second query.
>
>    The Mentat 2019® way of doing this would be to define a materialized
>    view for this so you don't need to do any work *at all* at a
>    performance-sensitive moment. We're some ways away from that.
>
>    - For queries with times in the milliseconds, .query_prepared will
>    give fairer timings. Mentat does strictly more work than SQLite — being a
>    wrapper it naturally does everything SQLite does, plus planning beforehand
>    and result processing afterwards — so preparing a query is a little more
>    important.
>
>    - You should consider caching appropriate attributes. Adding a
>    reverse-lookup cache for :place/url_hash might help your Facebook
>    query: `.cache :place/url_hash reverse` in the CLI.
>
>
> More broadly: Places is a very relational system, and a very 'square' one
> at that. For example, every visit has all three fields; there are never any
> NULLs. That 'squareness' is why we originally proposed addressing Places'
> workload by deriving the tabular representation from the log, preserving
> the ability to write those square queries. (That could either be done inside
> Mentat <https://github.com/mozilla/mentat/issues/33> or as a materialized
> view in SQL.)
>
> I'm actually quite pleasantly surprised that you've been able to dump in a
> non-trivial amount of data, not vacuum the DB, and write "SQL-ey" queries
> and get non-prepared and non-cache-using times that are within
> rock-chucking distance of fast.
>
> I think we can get significant improvements just by using what already
> exists (*e.g.*, caching attributes and not writing SQL-style queries),
> and there's already work tracked and underway that will go even further.
> The final leap would be to derive the Places tables as we had originally
> planned. Naturally that would get the same query performance as Places
> itself (or better, because there'd be no writer contention!). We haven't
> prioritized doing that work because leadership didn't consider it much of a
> risk.
>
> Keep researching! I'm keen to see this kind of thing continue.
>
> Thanks,
>
> -R
>
>
Yeah, it's probably worth noting that my goal here wasn't to test import
speed. I was surprised at how slow the approach I took was, but I didn't
expect it to be fast to begin with. I'll try tweaking it to use a single
transaction if I can figure out the right set of APIs (as mentioned, I'm
not sure how). I did consider trying to use the Places IDs to insert the
visits and the places more efficiently, but I couldn't think of a way to
do it. Is there a way using the public API as-is? (At this point I'm
mostly just curious, since as mentioned I'm not that concerned about
import performance.)
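
The closest thing I can see is to transact the places first, read the
tempid-to-entid mapping off the transaction report, and then assert the
visits against those entids, all inside one transaction. Something like
the following, assuming string tempids and a tempids map on the report
(which I think is how it works, but I haven't verified):

    // Untested sketch. The attribute names are just illustrative, and the
    // API details (string tempids, `report.tempids`) are my best guess.
    let mut store = mentat::Store::open("mentat_places.db")
        .expect("couldn't open store");
    let mut ip = store.begin_transaction().expect("couldn't begin");

    // Phase 1: transact the places, each keyed by a tempid we choose.
    let report = ip.transact(r#"[
        {:db/id "place-1" :place/url "https://www.mozilla.org/"}
    ]"#).expect("places transact failed");

    // Phase 2: resolve each place's entid and assert its visits against
    // it, so the transactor never has to work out the linkage itself.
    let place = report.tempids.get("place-1").expect("tempid resolved");
    ip.transact(format!(r#"[
        {{:visit/place {} :visit/date #inst "2018-04-17T16:00:00Z"}}
    ]"#, place)).expect("visits transact failed");

    ip.commit().expect("commit failed");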

FWIW, the fields that had :index true on them are the same ones that have
indexes in Places, AFAICT (based on nsPlacesIndexes.h
<https://searchfox.org/mozilla-central/source/toolkit/components/places/nsPlacesIndexes.h>,
although it's possible I imagined some were there when they weren't).
Vacuuming reduced the test database from 61MB to 56MB, so not a huge
difference. It makes sense that file size isn't a big priority, but it
seems like it might be on mobile (as rfk noted in Slack).
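
For concreteness, the indexed attributes in my schema are declared roughly
like this (the attribute shown is illustrative, and I'm assuming the
Datomic-style :db/index keyword is the right lever here; the exact
vocabulary may differ):

    // Illustrative attribute definition. Dropping :db/index would trade
    // reverse-lookup speed for disk space, per the note above.
    store.transact(r#"[
        {:db/ident       :place/url_hash
         :db/valueType   :db.type/long
         :db/cardinality :db.cardinality/one
         :db/index       true}
    ]"#).expect("schema transact failed");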

Even on desktop it does get quite large. I just tried this on my personal
machine's Places DB, which is about two years old and has around 120k
places and 250k visits, and it went from 58MB (Places) to 336MB (Mentat).
That seems really quite large to me, especially if we're planning on
storing a lot of user-generated signals from e.g. Activity Stream in
addition to everything we currently store. 300MB for a two-year record of
history is enough that I could imagine users getting annoyed, although
maybe there are ways to reduce the size, and we could always consider
expiring data (Places does, after all).
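
I haven't tried the log-trimming experiment yet; if I'm reading the
suggestion right, against the raw file (with nothing else holding the
store open) it would just be something like:

    // Hedged sketch of the `DELETE FROM transactions` + vacuum experiment
    // suggested above, run directly against the SQLite file with rusqlite.
    // Only safe while Mentat doesn't have the store open; the table name
    // is taken from the suggestion and linked issue above.
    use rusqlite::Connection;

    fn trim_log(path: &str) -> rusqlite::Result<()> {
        let conn = Connection::open(path)?;
        conn.execute_batch("DELETE FROM transactions; VACUUM;")?;
        Ok(())
    }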

Regarding the queries, I'll play around with them more. I will note that
adding a cache on the url_hash, as you suggested, brings that query down
to 4ms, the same speed as Places. It's also good to hear that something
like url_hash will eventually be handled automatically.
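
Concretely, for the recent-visits case I think the shape I'll try is the
entity-only query plus a second lookup for the fields, roughly like this
(untested; I'm assuming a q_once(query, inputs)-style call, which may not
be the exact name):

    // Untested sketch of the two-step approach suggested above: apply the
    // limit over the visit entities first, then fetch their fields.
    let mut store = mentat::Store::open("mentat_places.db")
        .expect("couldn't open store");

    // Step 1: the 10 most recent visit entities, with no joins inside the
    // limit.
    let recent = store.q_once(r#"[:find ?visit ?date
                                  :where [?visit :visit/date ?date]
                                  :order (desc ?date)
                                  :limit 10]"#, None)
        .expect("query failed");

    // Step 2: for each ?visit, fetch the URL (and anything else) in a
    // bound follow-up query, or from the attribute cache / pull once that
    // lands.
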
_______________________________________________
Sync-dev mailing list
Sync-dev@mozilla.org
https://mail.mozilla.org/listinfo/sync-dev
