Heikki Linnakangas wrote:
Jan Urbański wrote:
through it. The only tiny ugliness is that there's one function used for qsort() and another for bsearch(), because I'm sorting an array of texts (from pg_statistic) and I'm binary searching for a lexeme (non-NULL terminated string with length).

It would be nice to clean that up a bit. I think you could convert the lexeme to a TextFreq, or make the TextFreq.element a "text *" instead of Datum (ie., detoast it with PG_DETOAST_DATUM while you build the array for qsort).

Hm, making it a text * won't help, I think, because the problem is:
- pg_statistic holds an array of text datums and an array of float datums, ordered by the latter
- a tsquery is a tree of WordEntries (lexemes), i.e. non-NULL-terminated strings

The approach was to:
(1) create an array of (text, float) pairs by zipping the two pg_statistic arrays into one
(2) sort that array on the text values
(3) every time the frequency of a WordEntry needs to be determined, look it up with a binary search in the sorted array

So for (2) I needed a function that compares text with text (I reused bttext_pattern_cmp for that), and for (3) I needed a function that compares text with WordEntries. I didn't want to convert WordEntries into "text *" values, because that would require a palloc(). Hmm, maybe I could build an array of (lexeme, float) pairs in (1) instead; turning a "text *" into a lexeme is very straightforward. Then I could use the same comparison function in (2) and (3) - cleaner.

However, maybe I should just skip all that qsort() nonsense and use a hashtable? Another bold thought: keep the pg_statistic arrays for tsvectors ordered by the text datums rather than by the frequencies. It would be a bit awkward, because people expect the most common frequencies array to be sorted, not the most common values/elements one, but it would help a lot and simplify the code quite a bit. It would add one extra qsort() in ts_typanalyze(), but would allow tssel() to do only bsearch()es.

My mediocre gprof skills gave me:
[... nonsense ...]

I'd like to see a little bit more testing of that. I can't read gprof myself, so the above doesn't give me much confidence. I use oprofile, which I find is much simpler to use. I think the worst case scenario is with statistics_target set to maximum, with a simplest possible query and simplest possible tsquery.

One kernel recompile later...
I got oprofile running, here's the setup:
$ pgbench -n -f tssel-bench.sql -c 10 -t 1000 postgres
And here's tssel-bench.sql:
explain select * from manuals where tsvector @@ to_tsquery('foo');

The "manuals" table was rather small (but that's irrelevant, I think) and statistics_target for the "tsvector" column was set to 100.

Obviously 'foo' isn't a most common element in my test data, so the bsearch()es always miss. The results are:
samples  %        symbol name
101507   13.4461  internal_text_pattern_compare
91398    12.1070  bttext_pattern_cmp
82753    10.9618  pg_detoast_datum_packed
66434     8.8001  pg_qsort
54005     7.1537  DirectFunctionCall2
48925     6.4808  pglz_decompress
44931     5.9518  compare_two_textfreqs
40178     5.3222  AllocSetAlloc
26763     3.5451  AllocSetCheck
20839     2.7604  AtEOXact_CatCache
16057     2.1270  AllocSetFree
13772     1.8243  swapfunc
10001     1.3248  .plt
7859      1.0410  text_to_cstring
7556      1.0009  datumCopy

So, maybe qsort()ing every time you plan a query is not that cheap after all. I still think hashing would be overkill. How do people feel about storing the statistics sorted on values rather than on frequencies?

Cheers,
Jan

PS:
I used a simple
$ opreport --symbols /path/to/postgres
Are there any magic switches I need to add?

J

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
