[tor-dev] [GSoC 2013] Status report - Searchable metrics archive

2013-08-23 Thread Kostas Jakeliunas
Hello!

An update on my Searchable Tor metrics archive project. (As is very
evident, I'm very open to naming suggestions. :)

To the best of my understanding, I solved the database bottlenecks - or at
least I am, as of now, satisfied with the current output from my
benchmarking utility. Things may change, but I am confident (and have data
to back the claim) that the whole thing runs swell, at least on Amazon
m2.2xlarge instances.

For fun and profit, a part of the database - namely, what is currently
used by the Onionoo-like API, which for now holds status entries (only) in
the range [2010-01-01 00:00:00, 2013-05-31 23:00:00] - is now available
online (not on EC2, though); I will now write a separate email so that
everyone can inspect it.

I should now move on with implementing / extending the Onionoo API - in
particular, working on date range queries, and refining/rewriting the
'list status entries' API point (see below). I need to carefully plan some
things, and always keep an updated API document. (I also need to update
and publish a separate, more detailed specification document.)

More concrete report points:

   - re-examined my benchmarking approach, and wrote a rather simple but
   effective set of benchmarking tools (more like a simple script) [1] that
   can hopefully be used outside this project as well; at the very least,
   together with the profiling and query_info tools, it is powerful (but
   also simple) enough to be used to test all kinds of bottlenecks in ORMs
   and elsewhere.

   - used this tool to generate benchmark reports on EC2 and on the (less
   powerful) dev server, and with different schema settings (usually rather
   minor schema changes that do not require re-importing all the data)

   - came up with a three-table schema that makes our queries run quickly:
   we first do a search (using whatever criteria we get, e.g. nickname,
   fingerprint, address, running - if any) on a table which has a column of
   unique fingerprints; extract the relevant fingerprints; JOIN with the
   main status entry table, which is much larger; and get the final
   results. Benchmarked using this schema; a sketch follows after this list.

   Details: if we are only extracting a list of the latest status entries
   (with DISTINCT ON (fingerprint)), we can apply LIMIT and OFFSET already
   on the fingerprint table, before the JOIN. This helps us quite a bit. On
   the other hand, nickname searches etc. are also efficient. As of now, I
   have re-enabled nickname+address+fingerprint substring search (not from
   the middle (LIKE '%substring%'), but from the beginning of the string
   (LIKE 'substring%'), which is still nice), and all is well. Updated the
   higher-level ORM to reflect this new table [2] (I've yet to change some
   column names, though - but that's cosmetics.)

   - found a way to generate the SQL queries I need using the higher-level
   SQLAlchemy SQL API, composing various SQLAlchemy-provided primitives and
   always inspecting the resulting query statements (see the sketch after
   this list). This is good because everything becomes more modular: it is
   much easier to shape the query depending on the query parameters
   received, etc. (while still keeping the whole thing in sane order.)

   - hence (re)wrote a part of the Onionoo-like API that uses the new
   schema and the SQLAlchemy primitives. Extended the API a bit. [3]

   - wrote a very hacky API point for getting a list of status entries for
   a given fingerprint. I simply wanted a way (for myself and others) to
   query this kind of relation easily and externally. It now works as part
   of the API. This part will probably need some discussion.

   - wrote a (kind of a stub) document explaining the current Onionoo-like
   API, what can be queried, what can be returned, what kinds of parameters
   work. [4] Will extend this later on.
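
To make the last few points concrete - the three-table schema and
composing queries from SQLAlchemy primitives - here is a rough sketch. The
table layout and names are simplified and hypothetical (see models.py [2]
for the real thing), using the SQLAlchemy 0.8-era core API:

    from sqlalchemy import (MetaData, Table, Column, String, DateTime,
                            select)

    metadata = MetaData()

    # small table: one row per unique fingerprint (simplified layout)
    fingerprint_t = Table('fingerprint', metadata,
                          Column('fingerprint', String(40), primary_key=True),
                          Column('nickname', String(19)))

    # big table: one row per (fingerprint, consensus) pair
    statusentry_t = Table('statusentry', metadata,
                          Column('fingerprint', String(40)),
                          Column('validafter', DateTime))

    def build_query(nickname=None, limit=50, offset=0):
        # filter + LIMIT/OFFSET on the small fingerprint table first...
        fp = select([fingerprint_t.c.fingerprint])
        if nickname:
            # prefix search only: LIKE 'substring%', not '%substring%'
            fp = fp.where(fingerprint_t.c.nickname.like(nickname + '%'))
        fp = fp.limit(limit).offset(offset).alias('fp')
        # ...then JOIN the much larger statusentry table
        return select([statusentry_t]).select_from(
            statusentry_t.join(
                fp, statusentry_t.c.fingerprint == fp.c.fingerprint)
        ).order_by(statusentry_t.c.validafter.desc())

    # inspect the resulting SQL statement, as mentioned above:
    print(build_query(nickname='moria'))

(Pushing LIMIT/OFFSET into the fingerprint sub-select, as above, matches
the 'latest status entries' case from the details note; for other queries
the LIMIT would go on the outer select.)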


While writing the doc and rewriting part of the API, I stumbled upon a few
things that make it clear that I've taken some shortcuts that may hurt
later on. I will be happy to elaborate on them later / separately. I need
to carefully plan a few things, and then try rewriting the Onionoo API yet
again, this time including more parameters and returned fields.

TL;DR yay, a working database backend!

I might soon give *one* more update detailing things I may have forgotten
about re: this report - I don't want to make a habit of delaying reports
(which I have consistently done so far), so I'm reporting what I have now.

[1]: https://github.com/wfn/torsearch/blob/master/torsearch/benchmark.py
[2]: https://github.com/wfn/torsearch/blob/master/torsearch/models.py
[3]: https://github.com/wfn/torsearch/blob/master/torsearch/onionoo_api.py
[4]: https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md

--

Kostas (wfn on OFTC)

0x0e5dce45 @ pgp.mit.edu
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] [GSoC 2013] Status report - Searchable metrics archive

2013-08-15 Thread Kostas Jakeliunas
On Wed, Aug 14, 2013 at 1:33 PM, Karsten Loesing kars...@torproject.org wrote:


 Looks like pg_trgm is contained in postgresql-contrib-9.1, so it's more
 likely that we can run something requiring this extension on a
 torproject.org machine.  Still, requiring extensions should be the last
 resort if no other solution can be found.  Leaving out searches for
 nickname substrings is a valid solution for now.


Got it.

  Do you have a list of searches you're planning to support?
 
 
  These are the ones that should *really* be supported:
 
 - ?search=nickname
 - ?search=fingerprint
 - ?lookup=fingerprint
 - ?search=address [done some limited testing, currently not focusing
 on
 this]

 The lookup parameter is basically the same as search=fingerprint with
 the additional requirement that fingerprint must be 40 characters long.
  So, this is the current search parameter.

 I agree, these would be good to support.

 You might also add another parameter ?address=address for ExoneraTor.
 That should, in theory, be just a subset of the search parameter.


Oh yes, makes a lot of sense, OK.

By the way: I considered having the last consensus (all the data for at
least the /summary document, or /details as well) stored in memory (this
is possible), probably as a hashtable where key = fingerprint and value =
all the fields we'd need to return. That way, when the backend is queried
without any search criteria, it would be possible to avoid hitting the
database (which is always nice) and just dump the last consensus; a rough
sketch follows below. (There's also caching, of course, which we could
discuss at a (probably quite a bit) later point.)
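
A rough sketch of that hashtable idea (hypothetical field set, plain
Python):

    # keep the latest consensus in memory, keyed by fingerprint,
    # so that no-criteria queries can skip the database entirely
    latest_consensus = {}

    def cache_latest(entries):
        # entries: status entry rows of the newest consensus (hypothetical)
        latest_consensus.clear()
        for e in entries:
            latest_consensus[e.fingerprint] = {
                'nickname': e.nickname,
                'address': e.address,
                'valid_after': e.validafter,
            }

    def summary_without_criteria():
        # dump the cached consensus instead of hitting the database
        return list(latest_consensus.values())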


 - ?running=boolean

 This one is tricky.  So far, Onionoo looks only at the very latest
 consensus or bridge status to decide if a relay or bridge is running or
 not.

 But now you're adding archives to Onionoo, so that people can search for
 a certain consensus or certain bridge status in the past, or they can
 search for a time interval of consensuses or bridge statuses.  How do
 you define that a relay or bridge is 'running', or more importantly,
 included as 'not running'?


Agree, this is not clear. (And whatever ends up being done, this should be
well documented and clearly articulated (of course.))

For me at least, 'running' means whether a given relay/bridge is running
*right now*, i.e. whether it is present in the very last consensus.
(Here's where that in-memory hashtable (with fingerprints as keys) might
help: no need to run a separate query / do an inner join / whatnot; it
would depend on whether there's a LIMIT involved, though, etc.)

I'm not sure which one is more useful (intuitively, to me the running
*right now* sense is the more useful one). Do you mean that it might make
sense to have a field (or have 'running' be it) indicating whether a given
relay/bridge was present in the last consensus of the specified date
range? If this is what you meant, then the 'return all that are/were not
running' clause would indeed be kind of peculiar - semantically, it
wouldn't be very obvious what it's doing.

Maybe it'd be simpler to first answer, what would be the most useful case?

 How do you define that a relay or bridge [should be] included as 'not
running'?

Could you rephrase, maybe? Do you mean that it might be difficult to
construct sane queries to check for this condition? Or that the situation
where

   - a from..to date range is specified
   - ?running=false is specified

would be rather confusing ('exclude those nodes which are running *right
now*', where 'now' possibly has nothing to do with the date range)?

 - ?flag=flag [every kind of clause which further narrows down the
 query
 is not bad; the current db model supports all the flags that Stem
 does, and
 each flag has its own column]

 I'd say leave this one out until there's an actual use case.


Ok, I won't focus on these now; just wanted to say that these should be
possible without much ado/problems.


 - ?first_seen_days=range
 - ?last_seen_days=range
 
  As per the plan, the db should be able to return a list of status
 entries /
  validafter ranges (which can be used in {first,last}_seen_days) given
 some
  fingerprint.

 Oh, I think there's a misunderstanding of these two fields.  These
 fields are only there to search for relays or bridges that have first
 appeared or were last seen on a given day.

 You'll need two new parameters, say, from=datetime and to=datetime (or
 start=datetime and end=datetime) to define a valid-after range for your
 search.


Ah! I wasn't paying attention here. :) Ok, all good.

Thanks as always!
Regards
Kostas.
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] [GSoC 2013] Status report - Searchable metrics archive

2013-08-14 Thread Karsten Loesing
On 8/13/13 3:17 PM, Kostas Jakeliunas wrote:
 On Tue, Aug 13, 2013 at 2:15 PM, Karsten Loesing 
 kars...@torproject.org wrote:

 I suggest putting pg_prewarm on the future work list.  I sense there's a
 lot of unused potential in stock PostgreSQL.  Tweaking the database at
 this point has the words 'premature optimization' written on it in big
 letters for me.
 
 
 Also, to be very clear here, a tool that requires custom tweaks to
 PostgreSQL has minimal chances of running on torproject.org machines in
 the future.  The current plan is that we'll have a dedicated database
 machine operated by our sysadmins that not even the service operator
 will have shell access to.
 
 
 Oh, understood then, OK, no extensions (at least) for now.
 
 Apropos: as of my current (limited) understanding, it might be difficult to
 support, for example, nickname sub-string searches without a (supported,
 official) extension. One such extension is pg_trgm [1], which is in the
 contrib/ directory in 9.1, and is just one make install away. But for now,
 I'll assume this is not possible / we should avoid this.

Looks like pg_trgm is contained in postgresql-contrib-9.1, so it's more
likely that we can run something requiring this extension on a
torproject.org machine.  Still, requiring extensions should be the last
resort if no other solution can be found.  Leaving out searches for
nickname substrings is a valid solution for now.

 So, why do you join descriptors and network statuses in the search
 process?  At the Munich dev meeting I suggested joining the tables
 already in the import process.  What do you think about that idea?
 
 
 Yes, I had made a half-hearted attempt to normalize the two tables some
 time ago, for a small amount of descriptors and status entries; I'll be
 trying out this scheme in full (will need to re-import a major part of the
 data (which I didn't do then) to be able to see if it scales well) after I
 try something else.

Okay.

 (Namely, using a third table of unique fingerprints
 (the statusentry table currently holds ~170K unique fingerprints vs. ~67M
 rows in total) and (non-unique) nicknames for truly quick fingerprint
 lookup and nickname search; I did experiment with this as well, but I
 worked with a small subset of overall data in that case, too; and I think I
 can do a better job now.)
 
 It had seemed to me that the bottleneck was in having to sort too large a
 number of rows, but now I understand (if only just a bit) more of the
 'explain analyze' output and can see that the 'Nested Loop' procedure,
 which is what does the join in the query discussed, is expensive and is
 part of the bottleneck, so to speak. So I'll look into that after properly
 benchmarking stuff with the third table.

Sounds like another fine thing to test, I agree.

 (By the way, for future reference,
 we do have to test out different ideas on a substantial subset of the
 overall data, as the scaling behaviour is not, so to speak, linear.) :)

Good point.  Yes, we should keep this in mind for the future.

 https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql

 We use the following indexes while executing that query:

  * lower(nickname) on descriptor

  * (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on
 statusentry

 Using only the first 12 characters sounds like a fine approach to speed
 up things.  But why 12?  Why not 10 or 14?  This is probably something
 you should annotate as parameter to find a good value for later in the
 process.  (I'm not saying that 12 is a bad number.  It's perfectly fine
 for now, but it might not be the best number.)

 
 Yes, this is as unscientific as it gets. As of now, we're using a raw SQL
 query, but I'll be encapsulating queries properly soon (so we can easily
 attach different WHERE clauses, etc.), at which point I'll make it into a
 parameter. I did do some tests, but nothing extensive; I just made sure
 the indexes can fit into memory whole, which was the main constraint.
 Will do some tests.
 
 
 Also, would it keep indexes smaller if you took something else than
 base16 encoding for fingerprints?  What about base64?  Or is there a
 binary type in PostgreSQL that works fine for indexes?

 
 Re: the latter: no binary type for B-Trees (the default index type in
 pgsql) as far as I can see. But it's a good idea / approach, so I'll look
 into it, thanks! On the whole, though, as long as all the indexes occupy
 only a subset of pgsql's internal buffers, there shouldn't be a problem /
 that's not the problem, afaik. But if we're aiming for a well-researched
 ORM/database design, I should look into it.
 
 
 Do you have a list of searches you're planning to support?
 
 
 These are the ones that should *really* be supported:
 
- ?search=nickname
- ?search=fingerprint
- ?lookup=fingerprint
- ?search=address [done some limited testing, currently not focusing on
this]

The lookup parameter is basically the same as search=fingerprint with
the additional requirement 

Re: [tor-dev] [GSoC 2013] Status report - Searchable metrics archive

2013-08-13 Thread Karsten Loesing
On 8/12/13 2:58 PM, Kostas Jakeliunas wrote:
 Karsten,
 
 this won't be a very short email, but I honestly swear I did revise it a
 couple of times. :)

Okay.  I'm putting in feedback below where I think it makes most sense.

 This is not urgent by any measure, so whenever you find
 time to reply will be fine. Ctrl+F for 'Observe:' below for some precise
 data in support of my plan re: using the pg_prewarm extension.
 
 On Mon, Aug 12, 2013 at 2:16 PM, Karsten Loesing 
 kars...@torproject.org wrote:
 
 On 8/10/13 9:28 PM, Kostas Jakeliunas wrote:
   * I don't think we can avoid using certain postgresql extensions (if
 only one) - which means that deploying will always take more than
 apt-get && pip install, but I believe it is needed;

 Can you give an example of a query that won't be executed efficiently
 without this extension and just fine with it?  Maybe we can tweak that
 query somehow so it works fine on a vanilla PostgreSQL.  Happy to give
 that some thoughts.

 I'd really want to avoid using stuff that is not in Debian.  Or rather,
 if we really need to add non-standard extensions, we need more than
 thinking and believing that it's unavoidable. :)
 
 
 First off, the general idea. I know this might not sound convincing (see
 below re: this), but any query that uses an index will take significantly
 longer to execute if it needs to load parts of the index from disk. More
 precisely, query-time deviation and max(query_time) correlate inversely
 with the percentage of the index in question that is in memory. The
 larger the index, the more difficult it is to 'prep' it into cache, and
 the more unpredictable query execution time gets.
 
 Take a look at the query used to join descriptors and network statuses
 given some nickname (could be any other criterion, e.g. fingerprint or IP
 address):

So, why do you join descriptors and network statuses in the search
process?  At the Munich dev meeting I suggested joining the tables
already in the import process.  What do you think about that idea?

 https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
 
 We use the following indexes while executing that query:
 
  * lower(nickname) on descriptor
 
  * (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry

Using only the first 12 characters sounds like a fine approach to speed
up things.  But why 12?  Why not 10 or 14?  This is probably something
you should annotate as parameter to find a good value for later in the
process.  (I'm not saying that 12 is a bad number.  It's perfectly fine
for now, but it might not be the best number.)

Also, would it keep indexes smaller if you took something else than
base16 encoding for fingerprints?  What about base64?  Or is there a
binary type in PostgreSQL that works fine for indexes?

 (this one is used to efficiently join descriptor table with statusentry:
 (fingerprint, descriptor) pair is completely unique in the descriptor
 table, and it is fairly unique in the statusentry table (whereas a
 particular fingerprint usually has lots and lots of rows in statusentry));
 this index uses only substrings because otherwise, it will hog memory on my
 remote development machine (not EC2), leaving not much for other indexes;
 this composite substring index still takes ~2.5GB for status entries (only)
 in the range between [2010-01; 2013-05] as of now
 
  * validafter on statusentry (the latter *must* stay in memory, as we use
 it elsewhere as well; for example, when not given a particular search
 criterion, we want to return a list of status entries (with distinct
 fingerprints) sorted by consensus validafter in descending order)
 
 We also want to keep a fingerprint index on the descriptor table because we
 want to be able to search / look up by fingerprint.

Do you have a list of searches you're planning to support?

 I'm thinking of a way to demonstrate the efficiency of having the whole
 index in memory. For now, let me summarize what I have observed,
 intersected with what is relevant now: running the aforementioned query
 on some nickname that we haven't queried for since the last restart of
 postgresql, it might take, on average, about 1.5 to 3 seconds to execute
 on EC2, and considerably longer on my development db if it is a truly
 popular nickname (otherwise, more or less the same amount of time);
 sometimes a bit longer - up to ~4s (ideally it should be rather uniform
 since the indexes are *balanced* trees, but.. and autovacuum is enabled.)

 Running that same query later on (after we've run other queries after that
 first one), it will take <= 160ms to execute and return results (this is a
 conservative number; usually it's much faster (see below)). Running EXPLAIN
 (ANALYZE, BUFFERS) shows that what happened was that there was no [disk]
 'read' next to index operations - only 'buffer hit'. This means that there
 was no need to read from disk during all the sorting - only when we knew
 which rows to return did we need to actually read them from disk. (There
 are 

Re: [tor-dev] [GSoC 2013] Status report - Searchable metrics archive

2013-08-13 Thread Kostas Jakeliunas
On Tue, Aug 13, 2013 at 2:15 PM, Karsten Loesing kars...@torproject.org wrote:

 I suggest putting pg_prewarm on the future work list.  I sense there's a
 lot of unused potential in stock PostgreSQL.  Tweaking the database at
 this point has the words 'premature optimization' written on it in big
 letters for me.


 Also, to be very clear here, a tool that requires custom tweaks to
 PostgreSQL has minimal chances of running on torproject.org machines in
 the future.  The current plan is that we'll have a dedicated database
 machine operated by our sysadmins that not even the service operator
 will have shell access to.


Oh, understood then, OK, no extensions (at least) for now.

Apropos: as of my current (limited) understanding, it might be difficult to
support, for example, nickname sub-string searches without a (supported,
official) extension. One such extension is pg_trgm [1], which is in the
contrib/ directory in 9.1, and is just one make install away. But for now,
I'll assume this is not possible / we should avoid this.

So, why do you join descriptors and network statuses in the search
 process?  At the Munich dev meeting I suggested joining the tables
 already in the import process.  What do you think about that idea?


Yes, I had made a half-hearted attempt to normalize the two tables some
time ago, for a small amount of descriptors and status entries; I'll be
trying out this scheme in full (will need to re-import a major part of the
data (which I didn't do then) to be able to see if it scales well) after I
try something else. (Namely, using a third table of unique fingerprints
(the statusentry table currently holds ~170K unique fingerprints vs. ~67M
rows in total) and (non-unique) nicknames for truly quick fingerprint
lookup and nickname search; I did experiment with this as well, but I
worked with a small subset of overall data in that case, too; and I think I
can do a better job now.)

It had seemed to me that the bottleneck was in having to sort too large a
number of rows, but now I understand (if only just a bit) more of the
'explain analyze' output and can see that the 'Nested Loop' procedure,
which is what does the join in the query discussed, is expensive and is
part of the bottleneck, so to speak. So I'll look into that after properly
benchmarking stuff with the third table. (By the way, for future reference,
we do have to test out different ideas on a substantial subset of the
overall data, as the scaling behaviour is not, so to speak, linear.) :)


  https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
 
  We use the following indexes while executing that query:
 
   * lower(nickname) on descriptor
 
   * (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on
 statusentry

 Using only the first 12 characters sounds like a fine approach to speed
 up things.  But why 12?  Why not 10 or 14?  This is probably something
 you should annotate as parameter to find a good value for later in the
 process.  (I'm not saying that 12 is a bad number.  It's perfectly fine
 for now, but it might not be the best number.)


Yes, this is as unscientific as it gets. As of now, we're using a raw SQL
query, but I'll be encapsulating queries properly soon (so we can easily
attach different WHERE clauses, etc.), at which point I'll make it into a
parameter. I did do some tests, but nothing extensive; I just made sure the
indexes can fit into memory whole, which was the main constraint. Will do
some tests.


 Also, would it keep indexes smaller if you took something else than
 base16 encoding for fingerprints?  What about base64?  Or is there a
 binary type in PostgreSQL that works fine for indexes?


Re: the latter: no binary type for B-Trees (the default index type in
pgsql) as far as I can see. But it's a good idea / approach, so I'll look
into it, thanks! On the whole, though, as long as all the indexes occupy
only a subset of pgsql's internal buffers, there shouldn't be a problem /
that's not the problem, afaik. But if we're aiming for a well-researched
ORM/database design, I should look into it.
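
Just to illustrate the sizes involved (an illustration, not project code):
a fingerprint is 20 raw bytes, so each index key shrinks from 40
characters in base16 to 28 in base64:

    import base64, binascii, os

    fp = os.urandom(20)               # a relay fingerprint is 20 raw bytes
    print(len(binascii.hexlify(fp)))  # 40 - base16, as currently stored
    print(len(base64.b64encode(fp)))  # 28 - base64, including '=' padding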


 Do you have a list of searches you're planning to support?


These are the ones that should *really* be supported:

   - ?search=nickname
   - ?search=fingerprint
   - ?lookup=fingerprint
   - ?search=address [done some limited testing, currently not focusing on
   this]
   - ?running=boolean
   - ?flag=flag [every kind of clause which further narrows down the query
   is not bad; the current db model supports all the flags that Stem does, and
   each flag has its own column]
   - ?first_seen_days=range
   - ?last_seen_days=range

As per the plan, the db should be able to return a list of status entries /
validafter ranges (which can be used in {first,last}_seen_days) given some
fingerprint.
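
For concreteness, querying these would look something like this (a sketch
against a hypothetical local dev instance; /summary as in Onionoo):

    import requests

    BASE = 'http://localhost:5000'  # hypothetical dev instance

    # nickname (prefix) search
    r = requests.get(BASE + '/summary', params={'search': 'moria'})

    # exact lookup: the fingerprint must be all 40 hex characters
    r = requests.get(BASE + '/summary', params={'lookup': 'A' * 40})
    print(r.json())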

Thanks for your feedback and reply!

Kostas.


[1]: http://www.postgresql.org/docs/9.1/static/pgtrgm.html
___
tor-dev mailing list
tor-dev@lists.torproject.org

Re: [tor-dev] [GSoC 2013] Status report - Searchable metrics archive

2013-08-12 Thread Karsten Loesing
On 8/10/13 9:28 PM, Kostas Jakeliunas wrote:
   * I don't think we can avoid using certain postgresql extensions (if only
 one) - which means that deploying will always take more than apt-get &&
 pip install, but I believe it is needed;

Can you give an example of a query that won't be executed efficiently
without this extension and just fine with it?  Maybe we can tweak that
query somehow so it works fine on a vanilla PostgreSQL.  Happy to give
that some thoughts.

I'd really want to avoid using stuff that is not in Debian.  Or rather,
if we really need to add non-standard extensions, we need more than
thinking and believing that it's unavoidable. :)

 Overall, I'm spending a bit too much time on a specific problem, but at
 least I have a more intimate lower-level knowledge of PostgreSQL, which
 turns out to be very relevant to this project. I hope to be able to soon
 move to extending Onionoo support and providing a clean API for getting
 lists of consensuses in which a particular relay was present. And maybe
 start with the frontend. :)

Sounds like a plan, except for the frontend part.  Scratch that, I'd
say, and focus on the API.  Making it shiny can come once everything
works as expected.

Best,
Karsten

___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] [GSoC 2013] Status report - Searchable metrics archive

2013-08-12 Thread Kostas Jakeliunas
Karsten,

this won't be a very short email, but I honestly swear I did revise it a
couple of times. :) This is not urgent by any measure, so whenever you find
time to reply will be fine. Ctrl+F for 'Observe:' below for some precise
data in support of my plan re: using the pg_prewarm extension.

On Mon, Aug 12, 2013 at 2:16 PM, Karsten Loesing kars...@torproject.org wrote:

 On 8/10/13 9:28 PM, Kostas Jakeliunas wrote:
    * I don't think we can avoid using certain postgresql extensions (if
  only one) - which means that deploying will always take more than
  apt-get && pip install, but I believe it is needed;

 Can you give an example of a query that won't be executed efficiently
 without this extension and just fine with it?  Maybe we can tweak that
 query somehow so it works fine on a vanilla PostgreSQL.  Happy to give
 that some thoughts.

 I'd really want to avoid using stuff that is not in Debian.  Or rather,
 if we really need to add non-standard extensions, we need more than
 thinking and believing that it's unavoidable. :)


First off, the general idea. I know this might not sound convincing (see
below re: this), but any query that uses an index will take significantly
longer to execute if it needs to load parts of the index from disk. More
precisely, query-time deviation and max(query_time) correlate inversely
with the percentage of the index in question that is in memory. The larger
the index, the more difficult it is to 'prep' it into cache, and the more
unpredictable query execution time gets.

Take a look at the query used to join descriptors and network statuses
given some nickname (could be any other criterion, e.g. fingerprint or IP
address):

https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql

We use the following indexes while executing that query:

 * lower(nickname) on descriptor

 * (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry
(this one is used to efficiently join descriptor table with statusentry:
(fingerprint, descriptor) pair is completely unique in the descriptor
table, and it is fairly unique in the statusentry table (whereas a
particular fingerprint usually has lots and lots of rows in statusentry));
this index uses only substrings because otherwise, it will hog memory on my
remote development machine (not EC2), leaving not much for other indexes;
this composite substring index still takes ~2.5GB for status entries (only)
in the range between [2010-01; 2013-05] as of now

 * validafter on statusentry (the latter *must* stay in memory, as we use
it elsewhere as well; for example, when not given a particular search
criterion, we want to return a list of status entries (with distinct
fingerprints) sorted by consensus validafter in descending order)

We also want to keep a fingerprint index on the descriptor table because we
want to be able to search / look up by fingerprint.
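
Spelled out as DDL, the above is roughly the following (a sketch - the
index names are made up and the DSN is hypothetical; the index expressions
are the ones described above):

    from sqlalchemy import create_engine, text

    engine = create_engine('postgresql://localhost/tordir')  # hypothetical

    ddl = [
        # case-insensitive nickname search on the descriptor table
        "CREATE INDEX descriptor_nickname_lower_idx"
        " ON descriptor (lower(nickname))",
        # composite substring index driving the statusentry join
        "CREATE INDEX statusentry_fp_digest_substr_idx ON statusentry"
        " (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12))",
        # validafter ordering / range scans on statusentry
        "CREATE INDEX statusentry_validafter_idx"
        " ON statusentry (validafter)",
        # fingerprint search / lookup on the descriptor table
        "CREATE INDEX descriptor_fingerprint_idx"
        " ON descriptor (fingerprint)",
    ]
    with engine.begin() as conn:
        for stmt in ddl:
            conn.execute(text(stmt))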

I'm thinking of a way to demonstrate the efficiency of having the whole
index in memory. For now, let me summarize what I have observed,
intersected with what is relevant now: running the aforementioned query on
some nickname that we haven't queried for since the last restart of
postgresql, it might take, on average, about 1.5 to 3 seconds to execute
on EC2, and considerably longer on my development db if it is a truly
popular nickname (otherwise, more or less the same amount of time);
sometimes a bit longer - up to ~4s (ideally it should be rather uniform
since the indexes are *balanced* trees, but.. and autovacuum is enabled.)

Running that same query later on (after we've run other queries after that
first one), it will take <= 160ms to execute and return results (this is a
conservative number; usually it's much faster (see below)). Running EXPLAIN
(ANALYZE, BUFFERS) shows that what happened was that there was no [disk]
'read' next to index operations - only 'buffer hit'. This means that there
was no need to read from disk during all the sorting - only when we knew
which rows to return did we need to actually read them from disk. (There
are some nuances, but at least this will be true for PostgreSQL >= 9.2 [1],
which I haven't tried yet - there might be some pleasant surprises re:
query time. Last I checked, Debian 9.0 repository contains postgresql
9.1.9.)
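
The check itself looks like this (a sketch; made-up fingerprint prefix,
engine and index as in the DDL sketch above) - 'shared hit' in the Buffers
lines means the page was already cached, 'read' means it came from disk:

    from sqlalchemy import create_engine, text

    engine = create_engine('postgresql://localhost/tordir')  # hypothetical

    plan = engine.execute(text(
        "EXPLAIN (ANALYZE, BUFFERS) "
        "SELECT validafter FROM statusentry "
        "WHERE substr(fingerprint, 0, 12) = :fp "
        "ORDER BY validafter DESC"), fp='9695DFC35FF')
    for line, in plan:
        print(line)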

Observe:

1a. Run that query looking for 'moria2' for the first time since postgresql
restart - relay is an old one, only one distinct fingerprint, relatively
few status entries: http://sprunge.us/cEGh

1b. Run that same query later on: http://sprunge.us/jiPg (notice: no reads,
only hits; notice query time)

2a. Run query on 'gabelmoo' (a ton of status entries) for the first time
(development machine, query time is rather insane indeed):
http://sprunge.us/fQEK

2b. Run that same query on 'gabelmoo' later on: http://sprunge.us/fDDV

PostgreSQL is rather clever: it will keep the more often used parts of
indexes in cache. What pg_prewarm simply does is:

 * load all (or critical for us) indexes to memory (and load them whole),
which is possible 
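
To make the prewarming step concrete: once the extension is built and
installed, warming the relevant indexes would look roughly like this (a
sketch - pg_prewarm() as in the patch / later contrib module, index names
as in the DDL sketch above):

    from sqlalchemy import create_engine, text

    engine = create_engine('postgresql://localhost/tordir')  # hypothetical

    engine.execute(text("CREATE EXTENSION IF NOT EXISTS pg_prewarm"))
    # load the hot indexes into the buffer cache, whole
    for idx in ('statusentry_validafter_idx',
                'statusentry_fp_digest_substr_idx',
                'descriptor_nickname_lower_idx'):
        engine.execute(text("SELECT pg_prewarm('%s')" % idx))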

[tor-dev] [GSoC 2013] Status report - Searchable metrics archive

2013-08-10 Thread Kostas Jakeliunas
Hello,

another busy benchmarking + profiling period for database querying, but
this time more rigorous and awesome.

  * wrote a generic query analyzer which logs query statements and their
EXPLAIN ANALYZE output, and spots and reports particular queries that
yield inefficient query plans;
  * wrote a very simple but rather exhaustive profiler (using python's
cProfile) which logs query times, function calls, etc.; the output is used
to see which parts of e.g. the backend are slow during API calls, and can
easily be used to construct a general query 'profile' for a particular
database, etc. [1] (a minimal sketch of the approach follows after this
list);
  * benchmarked lots of different queries using these tools, recorded query
times, was able to observe deviations/discrepancies;
  * uploaded the whole database and benchmarked briefly on an amazon EC2
m2.2xlarge instance;
  * concluded that, provided there is enough memory to cache *and hold* the
indexes in cache, query times are good;
  * in particular, tested the following query scheme extensively: [2] (see
comments there as well if curious); concluded that it runs well;
  * opted for testing raw SQL queries (from within Flask/python) - so far,
translating them into ORM queries (while being careful) resulted in
degraded performance; if we have to end up using raw SQL, I will create a
way to encapsulate them nicely;
  * made sure data importing is not slowed and remains a quick-enough
procedure;
  * researched PostgreSQL stuff, especially its two-layer caching; I now
have an understanding of the way pgsql caches things in memory, how
statistics on index usage are gathered and used for maintaining
buffer_cache, etc.
The searchable metrics archive would work best when all of its indexes are
kept in memory.
  * to this end, looked into buffer cache hibernation [3], etc.; I think
pg_prewarm [4, 5] would serve our purpose well. (Apparently many
business/etc. solutions do find cache prewarming relevant - pity it's not
supported in stock PostgreSQL.)
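
The profiler boils down to something like this (a minimal sketch of the
cProfile approach, not the actual code in [1]):

    import cProfile

    def profile_call(func, *args, **kwargs):
        # run func under cProfile and print the hottest calls first
        pr = cProfile.Profile()
        result = pr.runcall(func, *args, **kwargs)
        pr.print_stats(sort='cumulative')
        return result

    # e.g. wrap a backend API handler (hypothetical) to see where a
    # request spends its time:
    # profile_call(api.summary, search='moria1')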

The latter means that
  * I don't think we can avoid using certain postgresql extensions (if only
one) - which means that deploying will always take more than apt-get &&
pip install, but I believe it is needed;
 * next on my agenda is testing pg_prewarm on EC2 and, hopefully, putting
our beloved database bottleneck problem to rest.

I planned to expose the EC2 instance for public tor-dev inquiry (and ended
up delaying the status report yet again), but I'll have to do this
separately. It is possible, however. Sorry for the delayed report.

##

More generally,

I'm happy with my queer queries [2] now;
the two constraints/goals of

  * being able to run Onionoo-like queries on the whole descriptor / status
entry database
  * being able to get a list of status entries for a particular relay

will hopefully be put to rest very soon. The former is done, provided I
have no trouble setting up a database index precaching system (which will
ensure that all queries of the same syntax/scheme run quickly enough.)

Overall, I'm spending a bit too much time on a specific problem, but at
least I have a more intimate lower-level knowledge of PostgreSQL, which
turns out to be very relevant to this project. I hope to be able to soon
move to extending Onionoo support and providing a clean API for getting
lists of consensuses in which a particular relay was present. And maybe
start with the frontend. :)

Kostas.

[1]:
https://github.com/wfn/torsearch/commit/8e6f16a07c40f7806e98e9c71c1ce0f8e3849911
[2]: https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
[3]:
http://postgresql.1045698.n5.nabble.com/patch-for-new-feature-Buffer-Cache-Hibernation-td4370109.html
[4]:
http://www.postgresql.org/message-id/ca+tgmobrrrxco+t6gcqrw_djw+uf9zedwf9bejnu+rb5teb...@mail.gmail.com
[5]: http://raghavt.blogspot.com/2012/04/caching-in-postgresql.html
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev