Help! Need to do full-text query on multiple tables

Info Cascade Wed, 13 Jan 2010 17:51:39 -0800

Hi,

Need some help,here!  I posted this a while ago and didn't get any response 
that was helpful.
I still need to resolve it, and someone who understands the problem surely can 
point me in
the right direction.  Do I need a custom manager for a sql UNION query, and if 
so, what is
the best example for me to emulate?


There are two tables we want to full-text search in simultaneously, Article and
Tag. (Table definitions at end of message.)
They have a many-to-many relationship, via the article_tags table.
We want to end up with a QuerySet of Articles that match a query term on
article.title, as well as those that match the same term on tag.name.
(I was thinking this would be a quick way to get full-text search of the
two tables, using Postgres.  However, if it is going to be too much
work, it might be better to put the effort into implementing some other
full-text engine, like Sphinx!)

So, my question is two-fold:
1. How do I get acceptable SQL performance for the query, and
2. How do I create the query in Django

Ideally, I want to be able to add .filter() and .exclude() and
.order_by() to the QuerySet.  These SQL examples below are the basic
idea. Currently, the code looks something like this (I simplified it a
bit, but this should give an idea):
> >
> >             art_list = Article.objects.filter(status__status='PUBLISHED')
> >             art_list = art_list.extra(
> >                         where=["title_tsv @@ plainto_tsquery(%s)"],
> >                         params=[term])
> >              if channel:
> >                     art_list = art_list.filter(channel=channel)
> >              art_list = art_list.order_by(*order_by)
Somehow, I want to include the fulltext search on the tag.name field,
which would look something like:
                        where=['name_tsv @@ plainto_tsquery(%s)'],
                        params=[term])

(Actually, I'm not sure how to do this in Django, in any case, because
of having to use .extra() to do the full-text search on the tag table.
Can I add them to the .extra() WHERE clause?)
So, that's question #2.

I can formulate the query directly in SQL, without using Django.
The SQL example queries below give the correct results.
However, the first one with joins takes over 7 seconds.
The ones with UNION are quite fast.
I also played with doing a subquery, and that is fast, but causes other
issues.
If I could get the first (slow) version to not be so slow, then that would
probably solve my problem, thus that is question #1.  Why so slow?

I think what I need to do might be to create a custom manager for
Article that would execute the UNION query and return a QuerySet with
all the matching Articles.  I'm not sure quite how to do that,
especially if I expect it to work with the .filter() and .exclude()
and .order_by() methods.


-- Full-text search on article.title and tag.name, with join on status
-- 7475ms!
SELECT DISTINCT article.title
FROM article JOIN article_tags ON article.id = article_tags.article_id
JOIN tag ON article_tags.tag_id = tag.id
JOIN article_status ON article_status.id = article.status_id
WHERE article_status.status = 'PUBLISHED'
    AND (tag.name @@ plainto_tsquery('french restaurants')
        OR title_tsv @@ plainto_tsquery('french restaurants'))
ORDER BY article.title;

-- For comparison, using standard string comparison instead of full-text
-- is substantially faster, and happens to give correct results,
-- although clearly this is not what we want to do.
-- 1569ms
SELECT DISTINCT article.title
FROM article JOIN article_tags ON article.id = article_tags.article_id
JOIN tag ON article_tags.tag_id = tag.id
WHERE (name ilike '%french restaurant%'
        OR title ilike '%french%restaurant%')
ORDER BY article.title;

-- Individual full-text searches are quite fast
-- 12ms
SELECT DISTINCT article.title
FROM article
WHERE (title_tsv @@ plainto_tsquery('french restaurants'))
ORDER BY article.title;

-- Individual full-text searches are quite fast
--10ms
SELECT tag.name
FROM tag
WHERE (name_tsv @@ plainto_tsquery('french restaurants'))
ORDER BY tag.name;

-- Alternative full-text search on article.title and tag.name with UNION
-- with ranking.  This returns correct results with acceptable
-- performance, but would be more complicated on the Django side.
-- 640ms
SELECT DISTINCT article.title, ts_rank_cd(title_tsv, q1) AS rank
FROM plainto_tsquery ('french restaurants') AS q1, article
JOIN article_status ON article.status_id = article_status.id
WHERE article_status.status = 'PUBLISHED' AND
title_tsv @@ q1
UNION
SELECT DISTINCT article.title, ts_rank_cd(name_tsv, q2) AS rank
FROM plainto_tsquery('french restaurants') AS q2, article
JOIN article_tags ON article.id = article_tags.article_id
JOIN article_status ON article.status_id = article_status.id
JOIN tag ON article_tags.tag_id = tag.id
WHERE article_status.status = 'PUBLISHED' AND
tag.name @@ q2
ORDER BY rank DESC, title;


So -- if anyone can offer any advice on this, I would very much
appreciate it!!!

Best,
Liam

> > -- Table: article
> > 
> > -- DROP TABLE article;
> > 
> > CREATE TABLE article
> > (
> >   id serial NOT NULL,
> >   title character varying(1024) NOT NULL,
> >   sections character varying(3072) NOT NULL,
> >   channel_name character varying(120) NOT NULL,
> >   date_create timestamp with time zone,
> >   date_modify timestamp with time zone,
> >   point geometry NOT NULL,
> >   poi_id integer,
> >   license_id integer,
> >   rank integer NOT NULL,
> >   title_tsv tsvector,
> >   date_human_modify timestamp with time zone,
> >   channel_id integer,
> >   channel_article_id integer,
> >   channel_info_1 character varying(1024),
> >   featured boolean,
> >   status_id integer,
> >   publisher_id integer,
> >   CONSTRAINT article_pkey PRIMARY KEY (id),
> >   CONSTRAINT article_poi_id_fkey FOREIGN KEY (poi_id)
> >       REFERENCES poi (id) MATCH SIMPLE
> >       ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
> >   CONSTRAINT channel_id_refs_id_3540205250801bc FOREIGN KEY (channel_id)
> >       REFERENCES channel (id) MATCH SIMPLE
> >       ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
> >   CONSTRAINT enforce_dims_point CHECK (ndims(point) = 2),
> >   CONSTRAINT enforce_geotype_point CHECK (geometrytype(point) = 
> > 'POINT'::text OR point IS NULL),
> >   CONSTRAINT enforce_srid_point CHECK (srid(point) = 4326)
> > )
> > WITH (
> >   OIDS=FALSE
> > );
> > ALTER TABLE article OWNER TO postgres;
> > 
> > -- Index: article_channel_article_id
> > 
> > -- DROP INDEX article_channel_article_id;
> > 
> > CREATE INDEX article_channel_article_id
> >   ON article
> >   USING btree
> >   (channel_article_id);
> > 
> > -- Index: article_channel_id
> > 
> > -- DROP INDEX article_channel_id;
> > 
> > CREATE INDEX article_channel_id
> >   ON article
> >   USING btree
> >   (channel_id);
> > 
> > -- Index: article_point_id
> > 
> > -- DROP INDEX article_point_id;
> > 
> > CREATE INDEX article_point_id
> >   ON article
> >   USING gist
> >   (point);
> > 
> > -- Index: article_title
> > 
> > -- DROP INDEX article_title;
> > 
> > CREATE INDEX article_title
> >   ON article
> >   USING btree
> >   (title);
> > 
> > -- Index: article_tsv_idx
> > 
> > -- DROP INDEX article_tsv_idx;
> > 
> > CREATE INDEX article_tsv_idx
> >   ON article
> >   USING gin
> >   (title_tsv);
> > 
> > 
> > -- Trigger: tsvectorupdate on article
> > 
> > -- DROP TRIGGER tsvectorupdate ON article;
> > 
> > CREATE TRIGGER tsvectorupdate
> >   BEFORE INSERT OR UPDATE
> >   ON article
> >   FOR EACH ROW
> >   EXECUTE PROCEDURE tsvector_update_trigger('title_tsv', 
> > 'pg_catalog.english', 'title');
> > 

> > -- Table: tag
> > 
> > -- DROP TABLE tag;
> > 
> > CREATE TABLE tag
> > (
> >   id serial NOT NULL,
> >   "name" character varying(100) NOT NULL,
> >   is_feature boolean NOT NULL DEFAULT false,
> >   name_tsv tsvector,
> >   parent_id integer,
> >   display_name character varying(100),
> >   CONSTRAINT tag_pkey PRIMARY KEY (id),
> >   CONSTRAINT tag_name_key UNIQUE (name)
> > )
> > WITH (
> >   OIDS=FALSE
> > );
> > ALTER TABLE tag OWNER TO postgres;
> > 
> > -- Index: tag_parent_id
> > 
> > -- DROP INDEX tag_parent_id;
> > 
> > CREATE INDEX tag_parent_id
> >   ON tag
> >   USING btree
> >   (parent_id);
> > 
> > -- Index: tag_tsv
> > 
> > -- DROP INDEX tag_tsv;
> > 
> > CREATE INDEX tag_tsv
> >   ON tag
> >   USING gin
> >   (name_tsv);
> > 
> > 
> > -- Trigger: tsvectorupdate on tag
> > 
> > -- DROP TRIGGER tsvectorupdate ON tag;
> > 
> > CREATE TRIGGER tsvectorupdate
> >   BEFORE INSERT OR UPDATE
> >   ON tag
> >   FOR EACH ROW
> >   EXECUTE PROCEDURE tsvector_update_trigger('name_tsv', 
> > 'pg_catalog.english', 'name');
> >

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-us...@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.

Help! Need to do full-text query on multiple tables

Reply via email to