[GENERAL] Matching on keyword or phrases within a field that is delimited with an "or" operator "|"

2012-03-12 Thread Jim Ostler
I have a table that is around 20 GB, so I need to optimize as best as possible 
the matching with another table on keywords across multiple fields. I have 
around 10 fields that have keywords or phrases delimited with the "or" 
operator  "|". So it would be in the form of  "a | b  |  and jack  | cd" . 
There are around 20 keywords or phrases per field, and these keywords could be 
any word. 

Because of the size of the database suing a "like" match would take too long. I 
am not sure if tsvector would work, or if there is a way to indicate how you 
want it parsed? 

If I could index these fields somehow that would be best, but I don't want to 
do the traditional full text indexing as I only want to match whatever is 
between the " | " whether it is one word or more.

The original use of this was as it appears, to have the field "a |  b |  c" be 
read "a or b or c" etc. If there is a way to match using this type of logic 
with an index that would be great. 

I hope this is clear enough. Thanks for any help as I am fairly new at this so 
any direction would be helpful.

--Jim

Re: [GENERAL] Matching on keyword or phrases within a field that is delimited with an "or" operator "|"

2012-03-12 Thread David Johnston

>> From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Jim Ostler
>> Sent: Monday, March 12, 2012 6:57 PM
>> To: pgsql-general@postgresql.org
>> Subject: [GENERAL] Matching on keyword or phrases within a field that is
delimited with an "or" operator "|"
>>
>> I have a table that is around 20 GB, so I need to optimize as best as
possible the matching with another table on keywords across multiple fields.
I have around 10 fields that have keywords or phrases delimited with the
"or" 
>> operator  "|". So it would be in the form of  "a | b  |  and jack  | cd"
. There are around 20 keywords or phrases per field, and these keywords
could be any word. 
>>
>> Because of the size of the database suing a "like" match would take too
long. I am not sure if tsvector would work, or if there is a way to indicate
how you want it parsed? 
>>
>> If I could index these fields somehow that would be best, but I don't
want to do the traditional full text indexing as I only want
to match whatever is between the " | " whether it is one word or more.
>>
>> The original use of this was as it appears, to have the field "a  |  b 
|  c" be read "a or b or c" etc. If there is a way to match using this type
of logic with an index that would be great. 
>>
>> I hope this is clear enough. Thanks for any help as I am fairly new at
this so any direction would be helpful.
>>
>> --Jim

=

Start with this:

SELECT 'a' = ANY(regexp_split_to_array('a|b|c', '\|'));  -- In this query
the "ANY" is providing the OR capability; use "ALL" for AND

and adapt as needed.

Regular Expressions are friendly creatures - everybody should have at least
one.

Given the lack of an example, but functioning, query that currently does
what you want it is hard to provide suggestions on improvements.  Whether
the above even is useful for you I do not know due to the lack of details.

David J.






-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Matching on keyword or phrases within a field that is delimited with an "or" operator "|"

2012-03-12 Thread Martin Gregorie
On Mon, 2012-03-12 at 15:57 -0700, Jim Ostler wrote:
> I have a table that is around 20 GB, so I need to optimize as best as
> possible the matching with another table on keywords across multiple
> fields. I have around 10 fields that have keywords or phrases
> delimited with the "or" operator  "|". So it would be in the form of
>  "a | b  |  and jack  | cd" . There are around 20 keywords or phrases
> per field, and these keywords could be any word. 
> 
How static is the list of keywords?
Is there any significance in their order? (I'm guessing there isn't).
How many rows are there in the table, IOW how big is each row?

I wonder if putting the keywords in a separate table with a many:many
relationship with your big table would help. This would retain your
ability to add or remove keywords without affecting the schema. The
selection rules used to combine keywords would also run fast. 

I'm getting good performance from a mail archive where messages are
selected using:
- a 1:M relationship with a subject table
- a M:M relationship with an address table
- range comparison with the date sent (non-indexed field in the message)
- and ILIKE(%phrase%) comparison with the message text

The query is assembled using WHERE clauses for each of the four
selection possibilities listed above before being prepared and issued. A
where clause is only included if the user has specified a value for it.
There is an AND relationship between the clauses. Currently there are
around 130,000 messages in the database. Speed: it took 0.6 seconds to
find 209 messages to or from an address in the last 5 years and
containing the keyword 'f1a'. The same search without the date range
specified ran in 0.86 seconds. It took 16.9 seconds to find the 12331
messages containing 'f1a' when no address or data range were specified.
This is on a dual core, 3.2 GHz Athlon with 4GB of RAM using PG 9.1. 

The other possibility is to use a star schema with each keyword being a
dimension - IOW a traditional data warehouse set-up, but of course this
does require a static list of keywords to be defined.


Martin




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Matching on keyword or phrases within a field that is delimited with an "or" operator "|"

2012-03-13 Thread Jim Ostler
David,

Thanks for the tip on the Regular Expression, as well as the advice to use an 
example statement.

So, I played with the expression you gave me and that works well. The question 
I now have is if I am trying to select all data for any row where that 
condition is true, is it possible to index the column I am selecting on to help 
with this expression? Given the table I am dealing with is over 20 GB, speed is 
a concern.

For example, If I wanted to select all rows from table_1 where ‘a’ existed in 
column ‘diagnosis’ and where two potential fields of diagnosis could be ‘a|b|c’ 
or  ‘d|b|f’  I could use the statement

SELECT  * FROM table_1
Where 'a' = ANY(regexp_split_to_array('diagnosis’, '\|')); 

Is there any way to index the column ‘diagnosis’ to speed this up?  I have 
tried to find information on using a Regular Expression in an index, but that 
doesn’t seem to work. 

One option I am playing with but stuck on is creating a new field with the 
array created by the Regular Expression. Will this work, or will I have to 
create a column for each element of the array? If I were to do this could I use 
a multiple column index? It would then be an index of 20 columns The postgres 
documentation states it is usually not efficient to have more than 3 columns, 
so I am not sure if that is a good idea or not. In general what I have read 
about multicolumn indexes leads me to believe that is not a good idea.

Also, given there is not a standard set of keywords in the column 'diagnosis' I 
don't think a GIN index would work.

Any ideas on other options I could try?
 
--Jim 




 From: David Johnston 
To: 'Jim Ostler' ; pgsql-general@postgresql.org 
Sent: Monday, March 12, 2012 5:27 PM
Subject: RE: [GENERAL] Matching on keyword or phrases within a field that is 
delimited with an "or" operator "|"
 

>> From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Jim Ostler
>> Sent: Monday, March 12, 2012 6:57 PM
>> To: pgsql-general@postgresql.org
>> Subject: [GENERAL] Matching on keyword or phrases within a field that is
delimited with an "or" operator "|"
>>
>> I have a table that is around 20 GB, so I need to optimize as best as
possible the matching with another table on keywords across multiple fields.
I have around 10 fields that have keywords or phrases delimited with the
"or" 
>> operator  "|". So it would be in the form of  "a | b  |  and jack  | cd"
. There are around 20 keywords or phrases per field, and these keywords
could be any word. 
>>
>> Because of the size of the database suing a "like" match would take too
long. I am not sure if tsvector would work, or if there is a way to indicate
how you want it parsed? 
>>
>> If I could index these fields somehow that would be best, but I don't
want to do the traditional full text indexing as I only want
to match whatever is between the " | " whether it is one word or more.
>>
>> The original use of this was as it appears, to have the field "a  |  b 
|  c" be read "a or b or c" etc. If there is a way to match using this type
of logic with an index that would be great. 
>>
>> I hope this is clear enough. Thanks for any help as I am fairly new at
this so any direction would be helpful.
>>
>> --Jim

=

Start with this:

SELECT 'a' = ANY(regexp_split_to_array('a|b|c', '\|'));  -- In this query
the "ANY" is providing the OR capability; use "ALL" for AND

and adapt as needed.

Regular Expressions are friendly creatures - everybody should have at least
one.

Given the lack of an example, but functioning, query that currently does
what you want it is hard to provide suggestions on improvements.  Whether
the above even is useful for you I do not know due to the lack of details.

David J.