Anders,
headline is a slow, since it should read and process an original document,
so you should avoid simple search query, which calculates ts_headline()
for each document in the result set. Use subselect instead !
In example below, 2nd query is way faster the 1rd !
select id,headline(body,q),rank(ti,q) as rank
from apod, to_tsquery('stars') q
where ti @@ q order by rank desc limit 10;
and
select id,headline(body,q),rank
from ( select id,body,q, rank(ti,q) as rank from apod, to_tsquery('stars') q
where ti @@ q order by rank desc limit 10) as foo;
Oleg
On Sun, 6 Jan 2008, Anders ?stergaard Jensen wrote:
Hi there people,
I'm trying to build a text search engine for relatively large PDF documents
with a web frontend. I use PostgreSQL 8.3 beta3/4 with tsearch2 for full-text
indexing. Everything performs quite well, until I start indexing 800 pages of
text into the database (with the utf8 danish locale and dictionary) -- then
ts_headline performs unreasonably slow. I use a GIN index on the table
(called document_revisions which is linked to a master document table called
'documents' -- the system is made for version tracking of multiple documents
into the same index) and an update trigger for maintaining the tsvector row
on the table.
Omitting ts_headline yields the following EXPLAIN result:
<snip>
metabase=# explain analyze select * from f_search_revision('website');
NOTICE: f_search_revisions_arr: loop: (1) doc_id = (1000)
CONTEXT: SQL statement "SELECT f_search_revision_arr( $1 ) AS res"
PL/pgSQL function "f_search_revision_int" line 13 at SQL statement
SQL function "f_search_revision" statement 1
NOTICE: hest: ({1000})
CONTEXT: SQL function "f_search_revision" statement 1
NOTICE: total ranking for (1) eq (0.0865452) for id = (1000)
CONTEXT: SQL function "f_search_revision" statement 1
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Function Scan on f_search_revision (cost=0.00..260.00 rows=1000 width=408)
(actual time=2413.384..2413.384 rows=1 loops=1)
Total runtime: 2413.465 ms
(2 rows)
</snip>
The function f_search_revision invokes another function that fetches all
matching document_revisions without calling ts_headline (method
f_search_revision_arr), followed by a post-processing of the matching rows
for fetching the ts_ranking values and ts_headline. The following snippet
shows the query plan for the function call with the same query:
<snip>
metabase=# explain analyze select * from f_search_revision_arr('website');
NOTICE: f_search_revisions_arr: loop: (1) doc_id = (1000)
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Function Scan on f_search_revision_arr (cost=0.00..0.26 rows=1 width=32)
(actual time=3.417..3.418 rows=1 loops=1)
Total runtime: 3.442 ms
(2 rows)
</snip>
Here is the code for the functions that I am using:
CREATE OR REPLACE FUNCTION f_search_revision_arr(q TEXT) RETURNS INTEGER[] AS
$$
declare
iter record; results integer[];
i integer;
begin
i := 1;
for iter in SELECT * FROM documents doc, plainto_tsquery(q)
AS tsq
WHERE (doc.search_idx @@ tsq)
OR doc.id IN (SELECT document_id FROM
document_revisions dr, plainto_tsquery(q) AS tsq2
WHERE dr.search_idx @@ tsq2) loop
results[i] = iter.id;
raise notice 'f_search_revisions_arr: loop: (%)
doc_id = (%)', i, results[i];
i := i + 1;
end loop;
return results;
end;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION f_search_revision_int(q TEXT) RETURNS
t_search_result[] AS $$
declare
doc_ids integer[];
doc_rankings float[];
i integer;
iter record;
pointer_doc record;
pointer_rev record;
pointer_type record;
results t_search_result[];
begin
i := 1;
SELECT INTO iter f_search_revision_arr(q) AS res;
doc_ids := iter.res;
raise notice 'hest: (%)', doc_ids;
-- Ensure that something was found:
if (doc_ids IS NULL) then
return results;
end if;
for i in 1..array_upper(doc_ids, 1) loop
-- raise notice 'upper = (%)', doc_ids[i];
SELECT INTO pointer_doc ts_rank(doc.search_idx, tsq)
AS rank_doc,
ts_headline(doc.name, tsq) AS headline_name,
ts_headline(doc.description, tsq) AS
headline_description,
ts_headline(doc.keywords, tsq) AS
headline_keywords,
doc.name,
doc.description,
doc.keywords,
doc.document_type_id
FROM documents doc, plainto_tsquery(q) AS tsq
WHERE doc.id = doc_ids[i];
SELECT INTO pointer_rev ts_rank(rev.search_idx, tsq)
AS rank_rev,
ts_headline(rev.content, tsq) AS
headline_content,
rev.id AS revision_id
FROM document_revisions rev,
plainto_tsquery(q) AS tsq
WHERE rev.document_id = doc_ids[i];
-- Select the document type:
SELECT INTO pointer_type mime_type, type_name,
image_url, extension, internal_type
FROM document_types WHERE id =
pointer_doc.document_type_id;
doc_rankings[i] = pointer_doc.rank_doc +
pointer_rev.rank_rev; -- SUM!
raise notice 'total ranking for (%) eq (%) for id =
(%)', i, doc_rankings[i], doc_ids[i];
results[i] = ROW(pointer_rev.revision_id,
doc_ids[i],
pointer_doc.name,
pointer_doc.description, pointer_doc.keywords,
pointer_doc.document_type_id,
pointer_type.mime_type, pointer_type.type_name, pointer_type.image_url,
pointer_type.extension, pointer_type.internal_type,
doc_rankings[i],
pointer_doc.headline_name, pointer_doc.headline_description,
pointer_doc.headline_keywords, pointer_rev.headline_content,
0);
end loop;
return results;
end;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION f_search_revision(q TEXT) RETURNS SETOF
t_search_result AS $$
SELECT * FROM f_unnest(f_search_revision_int($1));
$$ LANGUAGE sql;
------
How come that ts_headline yields so bad performance? Is there any way of
tuning ts_headline? I need the highlighting functionality for presenting the
search results in the web frontend properly. Alternately, are there any
alternatives to the ts_headline function?
Sincerely,
Anders
---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match