Re: [GENERAL] Grouping By Similarity (using pg_trgm)?

2015-05-22 Thread Oleg Bartunov
Have you seen http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf ?

On Thu, May 14, 2015 at 9:58 PM, Cory Tucker  wrote:

> [pg version 9.3 or 9.4]
>
> Suppose I have a simple table:
>
> create table data (
>   my_value  TEXT NOT NULL
> );
> CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops);
>
>
> Now I would like to essentially do group by to get a count of all the
> values that are sufficiently similar.  I can do it using something like a
> CROSS JOIN to join the table on itself, but then I still am getting all the
> rows with duplicate counts.
>
> Is there a way to do a group by query and only return a single "my_value"
> column and a count of the number of times other values are similar while
> also not returning the included similar values in the output, too?
>
>


Re: [GENERAL] Grouping By Similarity (using pg_trgm)?

2015-05-14 Thread David G. Johnston
On Thu, May 14, 2015 at 1:09 PM, Cory Tucker  wrote:

> That produces pretty much the same results as the CROSS JOIN I was using
> before.  Because each "my_value" in the table are different, if I group on
> just their value then I will always have the full result set and a bunch of
> essentially duplicated results.
>
> Any other ideas/options?
>
>
​how do you want to solve the problem:

A is similar to B
B is similar to C
A IS NOT similar to C

​?

I'm not sure that PostgreSQL is the best tool to solve clustering
problems...though my experience with them is minimal.

David J.


Re: [GENERAL] Grouping By Similarity (using pg_trgm)?

2015-05-14 Thread Cory Tucker
That produces pretty much the same results as the CROSS JOIN I was using
before.  Because each "my_value" in the table are different, if I group on
just their value then I will always have the full result set and a bunch of
essentially duplicated results.

Any other ideas/options?

On Thu, May 14, 2015 at 12:08 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

>
> On Thu, May 14, 2015 at 11:58 AM, Cory Tucker 
> wrote:
>
>> [pg version 9.3 or 9.4]
>>
>> Suppose I have a simple table:
>>
>> create table data (
>>   my_value  TEXT NOT NULL
>> );
>> CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops);
>>
>>
>> Now I would like to essentially do group by to get a count of all the
>> values that are sufficiently similar.  I can do it using something like a
>> CROSS JOIN to join the table on itself, but then I still am getting all the
>> rows with duplicate counts.
>>
>> Is there a way to do a group by query and only return a single "my_value"
>> column and a count of the number of times other values are similar while
>> also not returning the included similar values in the output, too?
>>
>>
> ​Concept below - not bothering to lookup the functions/operators for
> pg_trgm:
>
> SELECT my_value_src, count(*)
> FROM (SELECT my_value AS my_value_src FROM data) src
> JOIN (SELECT my_value AS my_value_compareto FROM data) comparedto
> ON ( func(my_value_src, my_value_compareto) < # )
> GROUP BY my_value_src
>
> ​David J.
>
>


Re: [GENERAL] Grouping By Similarity (using pg_trgm)?

2015-05-14 Thread David G. Johnston
On Thu, May 14, 2015 at 11:58 AM, Cory Tucker  wrote:

> [pg version 9.3 or 9.4]
>
> Suppose I have a simple table:
>
> create table data (
>   my_value  TEXT NOT NULL
> );
> CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops);
>
>
> Now I would like to essentially do group by to get a count of all the
> values that are sufficiently similar.  I can do it using something like a
> CROSS JOIN to join the table on itself, but then I still am getting all the
> rows with duplicate counts.
>
> Is there a way to do a group by query and only return a single "my_value"
> column and a count of the number of times other values are similar while
> also not returning the included similar values in the output, too?
>
>
​Concept below - not bothering to lookup the functions/operators for
pg_trgm:

SELECT my_value_src, count(*)
FROM (SELECT my_value AS my_value_src FROM data) src
JOIN (SELECT my_value AS my_value_compareto FROM data) comparedto
ON ( func(my_value_src, my_value_compareto) < # )
GROUP BY my_value_src

​David J.