#30637: Django is unable to combine SearchVectorField and SearchVector
-------------------------------------+-------------------------------------
     Reporter:  Dani Hodovic         |                    Owner:  (none)
         Type:                       |                   Status:  new
  Cleanup/optimization               |
    Component:  contrib.postgres     |                  Version:  2.2
     Severity:  Normal               |               Resolution:
     Keywords:  db, postgres, full-  |             Triage Stage:
  text, search                       |  Unreviewed
    Has patch:  0                    |      Needs documentation:  0
  Needs tests:  0                    |  Patch needs improvement:  0
Easy pickings:  0                    |                    UI/UX:  0
-------------------------------------+-------------------------------------
Description changed by Dani Hodovic:

Old description:

> When using `django.contrib.postgres` to perform full text search it's not
> possible to combine SearchVectors and SearchVectorfields. Doing so
> impacts the performance of the query.
>
> Suppose we have a model with a small "role" field and a much larger
> "text" field. The text field is large enough that it warrants indexing in
> a separate column as a tsvector with a Gin index to ensure our queries
> are fast.
>
> {{{
> #!div style="font-size: 80%"
> Code highlighting:
>   {{{#!python
>   class JobPosting(models.Model):
>       role = models.CharField(max_length=170, null=True)
>       text = models.TextField(max_length=8000, default="")
>       # Large field optimized for full text search
>       text_search = SearchVectorField(null=True)
>
>       class Meta:
>           indexes = [GinIndex(fields=["text_search"])]
>   }}}
> }}}
>
> If we need to perform a search on all columns we need to combine them
> into a common tsvector. The problem is that Django casts the large and
> search optimized text_search field to text and then back into a tsvector.
> This results in Postgres not using the existing Gin index and having to
> cast between types which makes for a very slow query.
>
> {{{
> #!div style="font-size: 80%"
> Code highlighting:
>   {{{#!python
>   JobPosting.objects.annotate(full_text=(SearchVector("role") +
> SearchVector("text_search"))).filter(full_text=SearchQuery("foo"))
>   }}}
> }}}
>

> {{{
> #!div style="font-size: 80%"
> Code highlighting:
>   {{{#!sql
>   EXPLAIN ANALYZE SELECT "jobs_jobposting"."id",
>          "jobs_jobposting"."role",
>          "jobs_jobposting"."text",
>          (to_tsvector(COALESCE("jobs_jobposting"."role", '')) ||
> to_tsvector(COALESCE(("jobs_jobposting"."text_search")::text, ''))) AS
> "full_text"
>     FROM "jobs_jobposting"
>    WHERE (to_tsvector(COALESCE("jobs_jobposting"."role", '')) ||
> to_tsvector(COALESCE(("jobs_jobposting"."text_search")::text, ''))) @@
> (plainto_tsquery('foo')) = true
>    LIMIT 21;
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>    Limit  (cost=0.00..35.54 rows=1 width=809) (actual time=40.085..40.085
> rows=0 loops=1)
>      ->  Seq Scan on jobs_jobposting  (cost=0.00..35.54 rows=1 width=809)
> (actual time=40.082..40.082 rows=0 loops=1)
>            Filter: ((to_tsvector((COALESCE(role, ''::character
> varying))::text) || to_tsvector(COALESCE((text_search)::text, ''::text)))
> @@ plainto_tsquery('foo'::text))
>            Rows Removed by Filter: 42
>    Planning Time: 3.140 ms
>    Execution Time: 40.273 ms
>   }}}
> }}}
>

> If you compare this to using the text_search field directly we can see
> that the query is much faster, presumably due to the use of index and
> lack of casting to `::text`
>
> {{{
> #!div style="font-size: 80%"
> Code highlighting:
>   {{{#!python
>   JobPosting.objects.filter(text_search=SearchQuery("foo"))
>   }}}
> }}}
>
> {{{
> #!div style="font-size: 80%"
> Code highlighting:
>   {{{#!sql
>   EXPLAIN ANALYZE SELECT "jobs_jobposting"."id",
>          "jobs_jobposting"."text",
>          "jobs_jobposting"."role"
>     FROM "jobs_jobposting"
>    WHERE "jobs_jobposting"."text_search" @@ (plainto_tsquery('foo')) =
> true
>    LIMIT 21;
> ------------------------------------------------------------------------------------------------------------------
>    Limit  (cost=0.00..15.24 rows=1 width=371) (actual time=1.165..1.166
> rows=0 loops=1)
>      ->  Seq Scan on jobs_jobposting  (cost=0.00..15.24 rows=1 width=371)
> (actual time=1.163..1.163 rows=0 loops=1)
>            Filter: (text_search @@ plainto_tsquery('foo'::text))
>            Rows Removed by Filter: 42
>    Planning Time: 0.699 ms
>    Execution Time: 1.209 ms
>   }}}
> }}}
>

> Compare the execution times: from **40.3ms to 1.2ms**.
>
> You could technically concatenate the role and the text field into one
> SearchVectorField, but then you would be unable to search rank different
> fields differently. Perhaps we would like to rank the information in the
> role column as A, but the text column as C.
>
> I have tried to use F expressions to concatenate the columns, but then
> Django complains.
>
> {{{
> #!div style="font-size: 80%"
> Code highlighting:
>   {{{#!python
>   JobPosting.objects.annotate(full_text=(SearchVector("role") +
> F("text_search"))).filter(full_text=SearchQuery("foo"))
>   TypeError: SearchVector can only be combined with other SearchVectors
>   }}}
> }}}

New description:

 When using `django.contrib.postgres` to perform full text search it's not
 possible to combine SearchVectors and SearchVectorfields. Doing so impacts
 the performance of the query.

 Suppose we have a model with a small "role" field and a much larger "text"
 field. The text field is large enough that it warrants indexing in a
 separate column as a tsvector with a Gin index to ensure our queries are
 fast.

 {{{
 #!div style="font-size: 80%"
 Code highlighting:
   {{{#!python
   class JobPosting(models.Model):
       role = models.CharField(max_length=170, null=True)
       text = models.TextField(max_length=8000, default="")
       # Large field optimized for full text search
       text_search = SearchVectorField(null=True)

       class Meta:
           indexes = [GinIndex(fields=["text_search"])]
   }}}
 }}}

 If we need to perform a search on all columns we need to combine them into
 a common tsvector. The problem is that Django casts the large and search
 optimized text_search field to text and then back into a tsvector. This
 results in Postgres not using the existing Gin index and having to cast
 between types which makes for a very slow query.

 {{{
 #!div style="font-size: 80%"
 Code highlighting:
   {{{#!python
   JobPosting.objects.annotate(full_text=(SearchVector("role") +
 SearchVector("text_search"))).filter(full_text=SearchQuery("foo"))
   }}}
 }}}


 {{{
 #!div style="font-size: 80%"
 Code highlighting:
   {{{#!sql
   EXPLAIN ANALYZE SELECT "jobs_jobposting"."id",
          "jobs_jobposting"."role",
          "jobs_jobposting"."text",
          (to_tsvector(COALESCE("jobs_jobposting"."role", '')) ||
 to_tsvector(COALESCE(("jobs_jobposting"."text_search")::text, ''))) AS
 "full_text"
     FROM "jobs_jobposting"
    WHERE (to_tsvector(COALESCE("jobs_jobposting"."role", '')) ||
 to_tsvector(COALESCE(("jobs_jobposting"."text_search")::text, ''))) @@
 (plainto_tsquery('foo')) = true
    LIMIT 21;
 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Limit  (cost=0.00..35.54 rows=1 width=809) (actual time=40.085..40.085
 rows=0 loops=1)
      ->  Seq Scan on jobs_jobposting  (cost=0.00..35.54 rows=1 width=809)
 (actual time=40.082..40.082 rows=0 loops=1)
            Filter: ((to_tsvector((COALESCE(role, ''::character
 varying))::text) || to_tsvector(COALESCE((text_search)::text, ''::text)))
 @@ plainto_tsquery('foo'::text))
            Rows Removed by Filter: 42
    Planning Time: 3.140 ms
    Execution Time: 40.273 ms
   }}}
 }}}


 If you compare this to using the text_search field directly we can see
 that the query is much faster, presumably due to the use of index and lack
 of casting to `::text`

 {{{
 #!div style="font-size: 80%"
 Code highlighting:
   {{{#!python
   JobPosting.objects.filter(text_search=SearchQuery("foo"))
   }}}
 }}}

 {{{
 #!div style="font-size: 80%"
 Code highlighting:
   {{{#!sql
   EXPLAIN ANALYZE SELECT "jobs_jobposting"."id",
          "jobs_jobposting"."text",
          "jobs_jobposting"."role"
     FROM "jobs_jobposting"
    WHERE "jobs_jobposting"."text_search" @@ (plainto_tsquery('foo')) =
 true
    LIMIT 21;
 
------------------------------------------------------------------------------------------------------------------
    Limit  (cost=0.00..15.24 rows=1 width=371) (actual time=1.165..1.166
 rows=0 loops=1)
      ->  Seq Scan on jobs_jobposting  (cost=0.00..15.24 rows=1 width=371)
 (actual time=1.163..1.163 rows=0 loops=1)
            Filter: (text_search @@ plainto_tsquery('foo'::text))
            Rows Removed by Filter: 42
    Planning Time: 0.699 ms
    Execution Time: 1.209 ms
   }}}
 }}}


 Compare the execution times: from **40.3ms to 1.2ms**.

 You could technically concatenate the role and the text field into one
 SearchVectorField, but then you would be unable to search rank different
 fields differently. Perhaps we would like to rank the information in the
 role column as A, but the text column as C.

 I have tried to use F expressions to concatenate the columns, but then
 Django complains.

 {{{
 #!div style="font-size: 80%"
 Code highlighting:
   {{{#!python
   JobPosting.objects.annotate(full_text=(SearchVector("role") +
 F("text_search"))).filter(full_text=SearchQuery("foo"))
   TypeError: SearchVector can only be combined with other SearchVectors
   }}}
 }}}

 I have also tried casting the "text_search" field into a SearchVector, but
 that also fails.
 {{{
 #!div style="font-size: 80%"
 Code highlighting:
   {{{#!python
   JobPosting.objects.annotate(full_text=(SearchVector("role") +
 Cast("text_search", SearchVector()))).filter(full_text=SearchQuery("foo"))
   TypeError: SearchVector can only be combined with other SearchVectors
   }}}
 }}}

--

-- 
Ticket URL: <https://code.djangoproject.com/ticket/30637#comment:1>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

-- 
You received this message because you are subscribed to the Google Groups 
"Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-updates+unsubscr...@googlegroups.com.
To post to this group, send email to django-updates@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-updates/069.1e96f043cd029e8651b4fb98f9b15c52%40djangoproject.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to