Re: [PERFORM] big data - slow select (speech search)

2010-07-07 Thread Michal Fapso
Hi Robert,

thank you for your help. I tried clustering the table on
hyps_wordid_index, and the query execution time dropped from 4.43 to
0.19 seconds, which is not far from Lucene's performance of 0.10
seconds.
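
For reference, the clustering step looks roughly like this (a sketch;
CLUSTER cannot use a hash index, so this assumes hyps_wordid_index was
recreated as a btree first):

    DROP INDEX hyps_wordid_index;
    CREATE INDEX hyps_wordid_index ON hyps USING btree (wordid);
    CLUSTER hyps USING hyps_wordid_index;
    ANALYZE hyps;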

Thanks a lot!
Miso Fapso

On 6 July 2010 02:25, Robert Haas wrote:
> On Thu, Jul 1, 2010 at 6:34 PM, Michal Fapso wrote:
>> It took about 4.5 seconds. If I rerun it, it takes
>> less than 2 milliseconds, but that is because of the cache. I need to
>> optimize the first run.
>>
>> ASUS laptop, dual-core CPU T2300 1.66GHz, 1.5 GB RAM
>>
>> EXPLAIN ANALYZE SELECT h1.docid
>> FROM hyps AS h1
>> WHERE h1.wordid=65658;
>>
>>  Bitmap Heap Scan on hyps h1  (cost=10.97..677.09 rows=171 width=4) (actual time=62.106..4416.864 rows=343 loops=1)
>>    Recheck Cond: (wordid = 65658)
>>    ->  Bitmap Index Scan on hyps_wordid_index  (cost=0.00..10.92 rows=171 width=0) (actual time=42.969..42.969 rows=343 loops=1)
>>          Index Cond: (wordid = 65658)
>>  Total runtime: 4432.015 ms
>>
>> If I run the same query in the Lucene search engine on the same data,
>> it takes 0.105 seconds, which is a huge difference.
>
> So PostgreSQL is reading 343 rows from disk in 4432 ms, or about 13
> ms/row.  I'm not an expert on seek times, but that might not really be
> that unreasonable, considering that those rows may be scattered all
> over the table and thus fetching them is basically random I/O.  Have
> you tried clustering hyps on hyps_wordid_index?  If you had a more
> sophisticated disk subsystem you could try increasing
> effective_io_concurrency, but that's not going to help with only one
> spindle.
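>
> For reference, that setting can be tried per session (a sketch; it is
> new in 8.4, defaults to 1, and a sensible value roughly tracks the
> number of spindles):
>
>     SET effective_io_concurrency = 4;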
>
> If you run the same query in Lucene and it takes only 0.105 s, then
> Lucene is obviously doing a lot less I/O.  I doubt that any amount of
> tuning of your existing schema is going to produce that kind of result
> on PostgreSQL.  Using the full-text search stuff, or a gin index of
> some kind, might get you closer, but it's hard to beat a
> special-purpose engine that implements exactly the right algorithm for
> your use case.
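>
> A minimal sketch of the gin idea using the contrib/intarray opclass
> (this indexes the positions arrays for containment queries such as
> positions @> '{3,17}', not the wordid lookup itself):
>
>     CREATE INDEX hyps_positions_gin ON hyps
>         USING gin (positions gin__int_ops);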
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise Postgres Company
>



Re: [PERFORM] big data - slow select (speech search)

2010-07-01 Thread Michal Fapso
I forgot to mention one thing. If you want to generate the data using
the Perl script, do this:

   perl create_synthetic_data.pl > synthetic_data.sqlcopy

and then after you create the 'hyps' table, use the COPY command with
the generated file:

   COPY hyps FROM '/the/full/path/synthetic_data.sqlcopy';

Best regards,
Miso Fapso


On 2 July 2010 00:34, Michal Fapso wrote:
> I have quite a simple query but a lot of data, and the SELECT is too
> slow. [...]

[PERFORM] big data - slow select (speech search)

2010-07-01 Thread Michal Fapso
Hi,

I have quite a simple query but a lot of data, and the SELECT is too
slow. I would be really grateful for any advice on this.

--
The background info:

I work on a speech search engine, which differs from text search in
that there can be several words (hypotheses) at the same position, and
each hypothesis carries a weight (probability) of occurrence.

When the word 'hello' appears in a document 'lecture_1', there is a
row in the hyps table (see below) containing an array of all positions
of 'hello' in 'lecture_1', and a weight for each of those positions.
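
For illustration, such a row might look like this (made-up values):

    docid | wordid | positions | weights         | length | total_weight
   -------+--------+-----------+-----------------+--------+--------------
        1 |   4242 | {3,17,40} | {0.9,0.35,0.62} |      3 |         1.87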

I need the positions to be able to search for phrases. However, for
this post I simplified the query as much as I could without
significantly changing its runtime.

I know there is the tsearch machinery, which could be more appropriate
for this, but I haven't tried it yet. The size of my data would stay
the same, and size seems to be the issue in my case. But maybe I am
wrong and tsearch would be much faster. What do you think?
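
For reference, the tsearch route would look roughly like this on a
hypothetical documents(docid, body) table (a sketch, not tried here):

    CREATE INDEX documents_fts_idx ON documents
        USING gin (to_tsvector('english', body));

    SELECT docid FROM documents
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'hello');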

--
Preconditions:

First I cleared the disk cache:
  sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Then I started the PostgreSQL daemon and connected to my database with
the psql client. The first thing I did was execute the SELECT query
described below. It took about 4.5 seconds. If I rerun it, it takes
less than 2 milliseconds, but that is because of the cache. I need to
optimize the first run.

--
Hardware:

ASUS laptop, dual-core CPU T2300 1.66GHz, 1.5 GB RAM

--
Version:

PostgreSQL 8.4.4 on i686-pc-linux-gnu, compiled by GCC gcc (Ubuntu
4.4.1-4ubuntu9) 4.4.1, 32-bit

compiled from sources, only --prefix=... argument given to ./configure

--
Schema:

CREATE TABLE hyps (
    docid INT,
    wordid INT,
    positions INT[],
    weights REAL[],
    length INT,
    total_weight REAL
);
COPY hyps FROM '/home/miso/exp/speech_search/postgresql/sqlcopy/all_weights_clustered.sqlcopy';
CREATE INDEX hyps_wordid_index ON hyps USING hash (wordid);
CREATE INDEX hyps_docid_index ON hyps USING hash (docid);

shared_buffers = 300MB (this is the only setting I changed in the config file)

I also tried btree indexes instead of hash, and surprisingly the
SELECT query was a bit faster; I would have expected hash to be
faster.
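
The btree variant would presumably be created like this (a sketch):

    CREATE INDEX hyps_wordid_btree ON hyps USING btree (wordid);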

The index on the 'docid' column is there because I also need to be
able to search within a particular document or in a set of documents.

--
Table info:

- rows = 5490156
- average length of the positions vectors = 19.5
- total number of items in the positions vectors = 107444304
- positions and weights in one row have the same number of items, but
the number may differ from row to row
- table data are loaded only once (using COPY) and never modified afterwards
- there are 369 distinct docid values and 161460 distinct wordid values
- VACUUM was executed after the COPY of the data
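
These figures can be reproduced with something like this (a sketch;
the DISTINCT counts are slow on a table this size):

    SELECT count(*),
           count(DISTINCT docid),
           count(DISTINCT wordid),
           avg(array_length(positions, 1))
    FROM hyps;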

--
Query:

EXPLAIN ANALYZE SELECT h1.docid
FROM hyps AS h1
WHERE h1.wordid=65658;

 Bitmap Heap Scan on hyps h1  (cost=10.97..677.09 rows=171 width=4) (actual time=62.106..4416.864 rows=343 loops=1)
   Recheck Cond: (wordid = 65658)
   ->  Bitmap Index Scan on hyps_wordid_index  (cost=0.00..10.92 rows=171 width=0) (actual time=42.969..42.969 rows=343 loops=1)
         Index Cond: (wordid = 65658)
 Total runtime: 4432.015 ms

The result has 343 rows, with 9294 items in their positions vectors in total.

--
Comparison with Lucene:

If I run the same query in the Lucene search engine on the same data,
it takes 0.105 seconds, which is a huge difference.

--
Synthetic data set:

If you want to try it yourself, here is a script which generates
synthetic data for the COPY command. I don't know whether it is
possible to send attachments here, so I put the script inline. Just
save it as create_synthetic_data.pl and run it with 'perl
create_synthetic_data.pl'. With this synthetic data the SELECT query
times are around 2.5 seconds. You can try the SELECT query with
'wordid' equal to 1, 2, 3, and so on.


#!/usr/bin/perl
# Create synthetic data for PostgreSQL COPY (tab-separated columns,
# arrays in the {a,b,c} literal form).

$rows  = 5490156;
$docs  = 369;
$words = 161460;
$docid  = 0;
$wordid = 0;

for ($row=0; $row<$rows; $row++) {

    my $sep          = "";
    my $positions    = "";
    my $weights      = "";
    my $total_weight = 0;
    my $items        = int(rand(39))+1;

    # start a new document every $rows/$docs rows
    if ($row % int($rows/$docs) == 0) {
        $docid++;
        $wordid = 0;
    }
    $wordid++;

    # build the positions and weights arrays
    for ($i=0; $i<$items; $i++) {
        $position      = int(rand(2));
        $weight        = rand(1);
        $positions    .= $sep.$position;
        $weights      .= $sep.$weight;
        $total_weight += $weight;
        $sep           = ",";
    }

    # one COPY row: docid, wordid, positions[], weights[], length, total_weight
    print "$docid\t$wordid\t{$positions}\t{$weights}\t$items\t$total_weight\n";
}