Re: [PERFORM] big data - slow select (speech search)
Hi Robert,

thank you for your help. I tried to cluster the table on
hyps_wordid_index and the query execution time dropped from 4.43 to
0.19 seconds, which is not that far from Lucene's performance of 0.10
seconds. Thanks a lot!

Miso Fapso

On 6 July 2010 02:25, Robert Haas wrote:
> On Thu, Jul 1, 2010 at 6:34 PM, Michal Fapso wrote:
>> It took about 4.5 seconds. If I rerun it, it takes
>> less than 2 milliseconds, but that is because of the cache. I need to
>> optimize the first run.
>>
>> laptop ASUS, CPU dual core T2300 1.66GHz, 1.5G RAM
>>
>> EXPLAIN ANALYZE SELECT h1.docid
>> FROM hyps AS h1
>> WHERE h1.wordid=65658;
>>
>> Bitmap Heap Scan on hyps h1 (cost=10.97..677.09 rows=171 width=4)
>> (actual time=62.106..4416.864 rows=343 loops=1)
>>   Recheck Cond: (wordid = 65658)
>>   -> Bitmap Index Scan on hyps_wordid_index (cost=0.00..10.92
>>      rows=171 width=0) (actual time=42.969..42.969 rows=343 loops=1)
>>        Index Cond: (wordid = 65658)
>> Total runtime: 4432.015 ms
>>
>> If I run the same query in Lucene search engine, it takes 0.105
>> seconds on the same data, which is quite a huge difference.
>
> So PostgreSQL is reading 343 rows from disk in 4432 ms, or about 12
> ms/row. I'm not an expert on seek times, but that might not really be
> that unreasonable, considering that those rows may be scattered all
> over the index and thus it may be basically random I/O. Have you
> tried clustering hyps on hyps_wordid_index? If you had a more
> sophisticated disk subsystem you could try increasing
> effective_io_concurrency, but that's not going to help with only one
> spindle.
>
> If you run the same query in Lucene and it takes only 0.105 s, then
> Lucene is obviously doing a lot less I/O. I doubt that any amount of
> tuning of your existing schema is going to produce that kind of result
> on PostgreSQL. Using the full-text search stuff, or a gin index of
> some kind, might get you closer, but it's hard to beat a
> special-purpose engine that implements exactly the right algorithm for
> your use case.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise Postgres Company
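For the archives, the clustering step discussed above amounts to
something like the following (a sketch only; it assumes the index is
recreated as btree first, since CLUSTER cannot use a hash index):

  DROP INDEX hyps_wordid_index;   -- drop the hash version
  CREATE INDEX hyps_wordid_index ON hyps USING btree (wordid);
  CLUSTER hyps USING hyps_wordid_index;   -- PostgreSQL 8.3+ syntax
  ANALYZE hyps;                           -- refresh planner statistics

CLUSTER rewrites the table in index order, so all rows with the same
wordid become physically adjacent, and the first-run bitmap heap scan
does a few sequential reads instead of ~343 random ones.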
Re: [PERFORM] big data - slow select (speech search)
I forgot to mention one thing. If you want to generate the data using
the perl script, do this:

  perl create_synthetic_data.pl > synthetic_data.sqlcopy

and then, after you create the 'hyps' table, use the COPY command with
the generated file:

  COPY hyps FROM '/the/full/path/synthetic_data.sqlcopy';

Best regards,
Miso Fapso

On 2 July 2010 00:34, Michal Fapso wrote:
> Hi,
>
> I have quite a simple query but a lot of data, and the SELECT query is
> too slow. I will be really grateful for any advice on this.
>
> --
> The background info:
>
> I work on a speech search engine which differs from text search in
> having more words (hypotheses) at the same position, where each
> hypothesis has some weight (probability) of occurrence.
>
> When a word 'hello' appears in a document 'lecture_1', there is a row
> in the table hyps (see below) which contains an array of all positions
> of the word 'hello' in the document 'lecture_1', and for each position
> it contains a weight as well.
>
> I need the positions to be able to search for phrases. However, here I
> simplified the query as much as I could without a significant
> reduction in speed.
>
> I know there is the tsearch extension, which could be more appropriate
> for this, but I didn't try that yet. The size of my data will be the
> same, which seems to be the issue in my case. But maybe I am wrong and
> with tsearch it will be much faster. What do you think?
>
> --
> Preconditions:
>
> First I cleared the disk cache:
> sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
>
> Then I started the PostgreSQL daemon and connected to my database with
> the psql client. The first thing I did then was to execute the SELECT
> query described below. It took about 4.5 seconds. If I rerun it, it
> takes less than 2 milliseconds, but that is because of the cache. I
> need to optimize the first run.
>
> --
> Hardware:
>
> laptop ASUS, CPU dual core T2300 1.66GHz, 1.5G RAM
>
> --
> Version:
>
> PostgreSQL 8.4.4 on i686-pc-linux-gnu, compiled by GCC gcc (Ubuntu
> 4.4.1-4ubuntu9) 4.4.1, 32-bit
>
> compiled from sources, only the --prefix=... argument given to ./configure
>
> --
> Schema:
>
> CREATE TABLE hyps (
>   docid        INT,
>   wordid       INT,
>   positions    INT[],
>   weights      REAL[],
>   length       INT,
>   total_weight REAL
> );
> COPY hyps FROM
> '/home/miso/exp/speech_search/postgresql/sqlcopy/all_weights_clustered.sqlcopy';
> CREATE INDEX hyps_wordid_index ON hyps USING hash (wordid);
> CREATE INDEX hyps_docid_index ON hyps USING hash (docid);
>
> shared_buffers = 300MB   ...this is the only thing I changed in the config
>
> I tried it also with btree indexes instead of hash, and surprisingly
> the SELECT query was a bit faster. I would have expected hash to be
> faster.
>
> The index on the 'docid' column is there because I need to be able to
> search also in a particular document or in a set of documents.
>
> --
> Table info:
>
> - rows = 5490156
> - average length of positions vectors = 19.5
> - total number of items in positions vectors = 107444304
> - positions and weights in one row have the same number of items, but
>   the number may differ from row to row
> - table data are loaded only once (using COPY) and are not modified anymore
> - there are 369 distinct docid values and 161460 distinct wordid values
> - VACUUM was executed after the COPY of the data
>
> --
> Query:
>
> EXPLAIN ANALYZE SELECT h1.docid
> FROM hyps AS h1
> WHERE h1.wordid=65658;
>
> Bitmap Heap Scan on hyps h1 (cost=10.97..677.09 rows=171 width=4)
> (actual time=62.106..4416.864 rows=343 loops=1)
>   Recheck Cond: (wordid = 65658)
>   -> Bitmap Index Scan on hyps_wordid_index (cost=0.00..10.92
>      rows=171 width=0) (actual time=42.969..42.969 rows=343 loops=1)
>        Index Cond: (wordid = 65658)
> Total runtime: 4432.015 ms
>
> The result has 343 rows, and there are 9294 items in the positions
> vectors in total.
>
> --
> Comparison with Lucene:
>
> If I run the same query in the Lucene search engine, it takes 0.105
> seconds on the same data, which is quite a huge difference.
>
> --
> Synthetic data set:
>
> If you want to try it yourself, here is a script which generates the
> data for the COPY command.
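One more note on the COPY step above: COPY ... FROM expects the file
to be readable by the server process. If the generated file sits on a
client machine instead, psql's client-side \copy meta-command does the
same job (a sketch; run it from within psql, connected to the database
that holds the hyps table):

  \copy hyps FROM 'synthetic_data.sqlcopy'

The path is then resolved relative to where psql is running, not the
server's data directory.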
[PERFORM] big data - slow select (speech search)
Hi,

I have quite a simple query but a lot of data, and the SELECT query is
too slow. I will be really grateful for any advice on this.

--
The background info:

I work on a speech search engine which differs from text search in
having more words (hypotheses) at the same position, where each
hypothesis has some weight (probability) of occurrence.

When a word 'hello' appears in a document 'lecture_1', there is a row
in the table hyps (see below) which contains an array of all positions
of the word 'hello' in the document 'lecture_1', and for each position
it contains a weight as well.

I need the positions to be able to search for phrases. However, here I
simplified the query as much as I could without a significant
reduction in speed.

I know there is the tsearch extension, which could be more appropriate
for this, but I didn't try that yet. The size of my data will be the
same, which seems to be the issue in my case. But maybe I am wrong and
with tsearch it will be much faster. What do you think? (A rough
sketch of what that route might look like is appended at the end of
this mail.)

--
Preconditions:

First I cleared the disk cache:

  sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Then I started the PostgreSQL daemon and connected to my database with
the psql client. The first thing I did then was to execute the SELECT
query described below. It took about 4.5 seconds. If I rerun it, it
takes less than 2 milliseconds, but that is because of the cache. I
need to optimize the first run.

--
Hardware:

laptop ASUS, CPU dual core T2300 1.66GHz, 1.5G RAM

--
Version:

PostgreSQL 8.4.4 on i686-pc-linux-gnu, compiled by GCC gcc (Ubuntu
4.4.1-4ubuntu9) 4.4.1, 32-bit

compiled from sources, only the --prefix=... argument given to ./configure

--
Schema:

  CREATE TABLE hyps (
    docid        INT,
    wordid       INT,
    positions    INT[],
    weights      REAL[],
    length       INT,
    total_weight REAL
  );
  COPY hyps FROM
  '/home/miso/exp/speech_search/postgresql/sqlcopy/all_weights_clustered.sqlcopy';
  CREATE INDEX hyps_wordid_index ON hyps USING hash (wordid);
  CREATE INDEX hyps_docid_index ON hyps USING hash (docid);

shared_buffers = 300MB   ...this is the only thing I changed in the config

I tried it also with btree indexes instead of hash, and surprisingly
the SELECT query was a bit faster. I would have expected hash to be
faster.

The index on the 'docid' column is there because I need to be able to
search also in a particular document or in a set of documents.

--
Table info:

- rows = 5490156
- average length of positions vectors = 19.5
- total number of items in positions vectors = 107444304
- positions and weights in one row have the same number of items, but
  the number may differ from row to row
- table data are loaded only once (using COPY) and are not modified anymore
- there are 369 distinct docid values and 161460 distinct wordid values
- VACUUM was executed after the COPY of the data

--
Query:

  EXPLAIN ANALYZE SELECT h1.docid
  FROM hyps AS h1
  WHERE h1.wordid=65658;

  Bitmap Heap Scan on hyps h1 (cost=10.97..677.09 rows=171 width=4)
  (actual time=62.106..4416.864 rows=343 loops=1)
    Recheck Cond: (wordid = 65658)
    -> Bitmap Index Scan on hyps_wordid_index (cost=0.00..10.92
       rows=171 width=0) (actual time=42.969..42.969 rows=343 loops=1)
         Index Cond: (wordid = 65658)
  Total runtime: 4432.015 ms

The result has 343 rows, and there are 9294 items in the positions
vectors in total.

--
Comparison with Lucene:

If I run the same query in the Lucene search engine, it takes 0.105
seconds on the same data, which is quite a huge difference.

--
Synthetic data set:

If you want to try it yourself, here is a script which generates the
data for the COPY command.
I don't know whether it is possible to send attachments here, so I put
the script inline. Just save it as create_synthetic_data.pl and run it
with 'perl create_synthetic_data.pl'. With this synthetic data the
SELECT query times are around 2.5 seconds. You can try the SELECT
query with 'wordid' equal to 1, 2, 3, ...

#!/usr/bin/perl
# Create synthetic data for PostgreSQL COPY.

$rows  = 5490156;
$docs  = 369;
$words = 161460;

$docid  = 0;
$wordid = 0;
for ($row = 0; $row < $rows; $row++) {
    my $sep          = "";
    my $positions    = "";
    my $weights      = "";
    my $total_weight = 0;
    my $items        = int(rand(39)) + 1;
    if ($row % int($rows / $docs) == 0) {
        $docid++;
        $wordid = 0;
    }
    $wordid++;
    for ($i = 0; $i < $items; $i++) {
        $position   = int(rand(2));
        $weight     = rand(1);
        $positions .= $sep . $position;
        # NOTE: the script was truncated at this point in the original
        # message; the remaining lines are a reconstruction based on the
        # hyps table schema (tab-separated COPY rows, arrays in {braces}).
        $weights      .= $sep . $weight;
        $total_weight += $weight;
        $sep           = ",";
    }
    print "$docid\t$wordid\t{$positions}\t{$weights}\t$items\t$total_weight\n";
}
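As for the tsearch idea mentioned above, here is a rough, hypothetical
sketch of what the built-in full-text search route could look like on
8.4. It assumes a separate 'docs' table holding the raw transcript text
of each document (my current schema does not store that), and it would
lose the per-position probability weights:

  CREATE TABLE docs (
    docid INT,
    body  TEXT
  );
  -- precompute and index the lexemes
  ALTER TABLE docs ADD COLUMN tsv tsvector;
  UPDATE docs SET tsv = to_tsvector('english', body);
  CREATE INDEX docs_tsv_index ON docs USING gin (tsv);

  -- equivalent of the SELECT above, by word rather than wordid:
  SELECT docid FROM docs WHERE tsv @@ to_tsquery('english', 'hello');

A GIN index keeps one posting list per lexeme, much like Lucene's
inverted index, so matching rows are found without scattering reads
across the heap.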