Heikki Linnakangas wrote:
Jan Urbański wrote:
through it. The only tiny ugliness is that there's one function used for qsort() and another for bsearch(), because I'm sorting an array of texts (from pg_statistic) and I'm binary searching for a lexeme (non-NULL terminated string with length).

It would be nice to clean that up a bit. I think you could convert the lexeme to a TextFreq, or make the TextFreq.element a "text *" instead of Datum (ie., detoast it with PG_DETOAST_DATUM while you build the array for qsort).

Hm, making it a text * won't help, I think, because the problem is:
- pg_statistic holds an array of text datums and an array of float datums, ordered by the latter
- a tsquery is a tree of WordEntries (lexemes), i.e. non-NULL-terminated strings

The approach was to:
(1) create an array of (text, float) pairs by zipping the two pg_statistic arrays into one
(2) sort that array on the text values
(3) every time the frequency of a WordEntry needs to be determined, look it up with a binary search in the sorted array

So for (2) I needed a function that compares text with text (I reused bttext_pattern_cmp for that), and for (3) I needed a function that compares text with WordEntries. I didn't want to convert WordEntries into "text *" values, because that would require a palloc(). Hmm, maybe I could build an array of (lexeme, float) pairs in (1) instead; turning a "text *" into a lexeme is very straightforward. Then I could use the same comparison function in (2) and (3) - cleaner.

However, maybe I should just skip all that qsort() nonsense and use a hashtable? Another bold thought: keep the pg_statistic arrays for tsvectors ordered by the text datums rather than by the frequencies. It would be a bit awkward, because people expect the most common frequencies array to be sorted, not the most common values/elements one, but it would help a lot and simplify the code quite a bit. It would add one extra qsort() in ts_typanalyze(), but would allow tssel() to do only bsearch()es.

My mediocre gprof skills gave me:
[... nonsense ...]

I'd like to see a little bit more testing of that. I can't read gprof myself, so the above doesn't give me much confidence. I use oprofile, which I find is much simpler to use. I think the worst case scenario is with statistics_target set to maximum, with a simplest possible query and simplest possible tsquery.

One kernel recompile later...
I got oprofile running, here's the setup:
$ pgbench -n -f tssel-bench.sql -c 10 -t 1000 postgres
And here's tssel-bench.sql:
explain select * from manuals where tsvector @@ to_tsquery('foo');

The "manuals" table was rather small (but that's irrelevant, I think) and statistics_target for the "tsvector" column was set to 100.

Obviously 'foo' isn't a most common element in my test data, so the bsearch()es always miss. The results are:
samples  %        symbol name
101507   13.4461  internal_text_pattern_compare
91398    12.1070  bttext_pattern_cmp
82753    10.9618  pg_detoast_datum_packed
66434     8.8001  pg_qsort
54005     7.1537  DirectFunctionCall2
48925     6.4808  pglz_decompress
44931     5.9518  compare_two_textfreqs
40178     5.3222  AllocSetAlloc
26763     3.5451  AllocSetCheck
20839     2.7604  AtEOXact_CatCache
16057     2.1270  AllocSetFree
13772     1.8243  swapfunc
10001     1.3248  .plt
7859      1.0410  text_to_cstring
7556      1.0009  datumCopy

So, maybe qsort()ing every time you plan a query is not that cheap after all. I still think hashing would be overkill. How do people feel about storing the statistics sorted on values rather than on frequencies?

Cheers,
Jan

PS:
I used a simple
$ opreport --symbols /path/to/postgres
Are there any magic switches I need to add?

J

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
