Re: [HACKERS] Very bad FTS performance with the Polish config
ts_headline calls ts_lexize equivalent to break the text. Off course there is algorithm to process the tokens and generate the headline. I would be really surprised if the algorithm to generate the headline is somehow dependent on language (as it only processes the tokens). So Oleg is right when he says ts_lexize is something to be checked. ts_lexize performs well for all dictionaries, nothing to see here. It's ts_headline that's causing the problem. But that's just IMHO. I will try to replicate what you are trying to do but in the meantime can you run the same ts_headline under psql multiple times and paste the result. All the results I pasted had the medium run time out of multiple calls. These were certainly not some extreme corner cases. cheers, Wojciech Knapik -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Very bad FTS performance with the Polish config
Tom Lane wrote: *SNIP* So about 55% of the time is going into affix pattern matching. I wonder whether that couldn't be made faster. A lot of the cycles are spent on coping with variable-length characters --- perhaps the ispell code should convert to wchar representation before doing this? Thanks a lot for looking into this. I hope this will lead to some improvements one day. Unfortunately my C skills were pretty basic years ago and I haven't used the language since, so I can't be of much help.. cheers, Wojciech Knapik -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Very bad FTS performance with the Polish config
Tom Lane wrote: I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0 (2.6.21), then switched both installations to 8.3.8 (both packages compiled from source, but provided by the distro - port/emerge). The Polish dictionaries and config were created according to this article (it's in Polish, but the code is self-explanatory): http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/ I tried to duplicate this test, but got no further than here: u8=# CREATE TEXT SEARCH DICTIONARY polish_ispell ( TEMPLATE = ispell, DictFile = polish, AffFile = polish, StopWords = polish ); ERROR: syntax error CONTEXT: line 174 of configuration file /home/tgl/testversion/share/postgresql/tsearch_data/polish.affix: L E C -C,GĹEM #zalec (15a) u8=# Seems there's something about the current version of the dictionary that we don't like. I used sjp-ispell-pl-20091117-src.tar.bz2 ... Here are the files I used (polish.affix, polish.dict already generated): http://wolniartysci.pl/pl.tar.gz These should work fine. I'd be grateful if you could test and see if you get similar results. cheers, Wojciech Knapik PS. Weird, I get the emails without a reply-to set for the list.. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Very bad FTS performance with the Polish config
your polish_english, polish configurations uses ispell language and slow, while english configuration doesn't contains ispell. So, what's your complains ? Try add ispell dictionary to english configuration and see timings. Oh, so this is not anomalous ? These are the expected speeds for an ispell dictionary ? I didn't realize that. Sorry for the bother then. It just seemed way too slow to be practical. cheers, Wojciech Knapik -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Very bad FTS performance with the Polish config
Oleg Bartunov wrote: your polish_english, polish configurations uses ispell language and slow, while english configuration doesn't contains ispell. So, what's your complains ? Try add ispell dictionary to english configuration and see timings. Oh, so this is not anomalous ? These are the expected speeds for an ispell dictionary ? I didn't realize that. Sorry for the bother then. It just seemed way too slow to be practical. You can see real timings using ts_lexize() function for different dictionaries (try several time to avoid cold-start problem) instead of ts_headline(), which involves other factors. On my test machine I see no real difference between very simple dictionary and french ispell, snowball dictionaries: ts_lexize seems to be just as fast for simple, polish_ispell and english_stem with the 'voila' argument. polish_ispell is in fact *faster* for the lorem ipsum text repeated a couple times (10 ?). Which suggests that the issue is with ts_headline iteself. I see no big difference in ts_headline as well: dev-oleg=# select ts_headline('english','I can do voila', 'voila'::tsquery); ts_headline --- I can do bvoila/b (1 row) Time: 0.265 ms Yes, for 4-word texts the results are similar. Try that with a longer text and the difference becomes more and more significant. For the lorem ipsum text, 'polish' is about 4 times slower, than 'english'. For 5 repetitions of the text, it's 6 times, for 10 repetitions - 7.5 times... This is 8.4.1 version of PostgreSQL. An that was 8.3.8/OSX. cheers, Wojciech Knapik -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Very bad FTS performance with the Polish config
Oleg Bartunov wrote: Yes, for 4-word texts the results are similar. Try that with a longer text and the difference becomes more and more significant. For the lorem ipsum text, 'polish' is about 4 times slower, than 'english'. For 5 repetitions of the text, it's 6 times, for 10 repetitions - 7.5 times... Again, I see nothing unclear here, since dictionaries (as specified in configuration) apply to ALL words in document. The more words in document, the more overhead. You're missing the point. I'm not surprised that the function takes more time for larger input texts - that's obvious. The thing is, the computation times rise more steeply when the Polish config is used. Steeply enough, that the difference between the Polish and English configs becomes enormous in practical cases. Now this may be expected behaviour, but since I don't know if it is, I posted to the mailing lists to find out. If you're saying this is ok and there's nothing to fix here, then there's nothing more to discuss and we may consider the thread closed. If not, ts_headline deserves a closer look. cheers, Wojciech Knapik -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Very bad FTS performance with the Polish config
Hello This has been discussed in #postgresql and posted to -performance a couple days ago, but no solution has been found. The discussion can be found here: http://archives.postgresql.org/pgsql-performance/2009-11/msg00162.php I just finished implementing a search engine for my site and found ts_headline extremely slow when used with a Polish tsearch configuration, while fast with English. All of it boils down to a simple testcase, but first some background. I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0 (2.6.21), then switched both installations to 8.3.8 (both packages compiled from source, but provided by the distro - port/emerge). The Polish dictionaries and config were created according to this article (it's in Polish, but the code is self-explanatory): http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/ Now for the testcase: text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.' # explain analyze select ts_headline('polish', text, plainto_tsquery('polish', 'foobar')); QUERY PLAN Result (cost=0.00..0.01 rows=1 width=0) (actual time=6.407..6.470 rows=1 loops=1) Total runtime: 6.524 ms (2 rows) # explain analyze select ts_headline('english', text, plainto_tsquery('english', 'foobar')); QUERY PLAN Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.861..0.895 rows=1 loops=1) Total runtime: 0.935 ms (2 rows) # explain analyze select ts_headline('simple', text, plainto_tsquery('simple', 'foobar')); QUERY PLAN Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.627..0.660 rows=1 loops=1) Total runtime: 0.697 ms (2 rows) # As you can see, the results differ by an order of magnitude between Polish and English. While in this simple testcase it's a non-issue, in the real world this translates into enormous overhead. One of the queries I ran testing my site's search function took 1870ms. When I took that query and changed all ts_headline(foo) calls to just foo, the time dropped below 100ms. That's the difference between something completely unacceptable and something quite useful. I can post various details about the hardware, software and specific queries, but the testcases speak for themselves. I'm sure you can easily reproduce my results. I'm putting my code into production tomorrow, since I can't wait anymore. Hints would be very much appreciated! cheers, Wojciech Knapik PS. This issue is not related to the loading time of dictionaries, or calls to ts_headline for results that won't be displayed. A few other details can be found here http://pastie.textmate.org/private/hqnqfnsfsknjyjlffzmog along with snippets of my conversations in #postgresql that lead to this testcase. Big thanks to RhodiumToad for helping me with fts for the last couple days ;] -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Very bad FTS performance with the Polish config
Hello This has been discussed in #postgresql and posted to -performance a couple days ago, but no solution has been found. The discussion can be found here: http://archives.postgresql.org/pgsql-performance/2009-11/msg00162.php I just finished implementing a search engine for my site and found ts_headline extremely slow when used with a Polish tsearch configuration, while fast with English. All of it boils down to a simple testcase, but first some background. I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0 (2.6.21), then switched both installations to 8.3.8 (both packages compiled from source, but provided by the distro - port/emerge). The Polish dictionaries and config were created according to this article (it's in Polish, but the code is self-explanatory): http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/ Now for the testcase: text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.' # explain analyze select ts_headline('polish', text, plainto_tsquery('polish', 'foobar')); QUERY PLAN Result (cost=0.00..0.01 rows=1 width=0) (actual time=6.407..6.470 rows=1 loops=1) Total runtime: 6.524 ms (2 rows) # explain analyze select ts_headline('english', text, plainto_tsquery('english', 'foobar')); QUERY PLAN Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.861..0.895 rows=1 loops=1) Total runtime: 0.935 ms (2 rows) # explain analyze select ts_headline('simple', text, plainto_tsquery('simple', 'foobar')); QUERY PLAN Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.627..0.660 rows=1 loops=1) Total runtime: 0.697 ms (2 rows) # As you can see, the results differ by an order of magnitude between Polish and English. While in this simple testcase it's a non-issue, in the real world this translates into enormous overhead. One of the queries I ran testing my site's search function took 1870ms. When I took that query and changed all ts_headline(foo) calls to just foo, the time dropped below 100ms. That's the difference between something completely unacceptable and something quite useful. I can post various details about the hardware, software and specific queries, but the testcases speak for themselves. I'm sure you can easily reproduce my results. I'm putting my code into production tomorrow, since I can't wait anymore. Hints would be very much appreciated! cheers, Wojciech Knapik PS. This issue is not related to the loading time of dictionaries, or calls to ts_headline for results that won't be displayed. A few other details can be found here http://pastie.textmate.org/private/hqnqfnsfsknjyjlffzmog along with snippets of my conversations in #postgresql that lead to this testcase. Big thanks to RhodiumToad for helping me with fts for the last couple days ;] -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Very bad FTS performance with the Polish config
Euler Taveira de Oliveira wrote: PS. This issue is not related to the loading time of dictionaries, or calls to ts_headline for results that won't be displayed. So what? Could you post the profiling of that query? Polish: http://pastie.textmate.org/private/8lhmnbvde43lfjoxc52r1q English: http://pastie.textmate.org/private/4iaipottrmjmfxfykz94mw cheers, Wojciech Knapik PS. Sorry for the double post. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers