Re: [GENERAL] Merge rows based on Levenshtein distance

David G Johnston Tue, 02 Dec 2014 22:37:19 -0800

On Tuesday, December 2, 2014, mongoose [via PostgreSQL] <
[email protected]> wrote:


> David,
>
> Thank you for your prompt reply. I believe your answer helped a lot but it
> seems I was not clear enough on my description. Basically I want a counter
> (id) to show if two or more names are similar (i.e. levenshtein distance
> less than 3) So in the previous example:
>
> From this table:
>
> Name, City
> "Booob", "NYC"
> "Alex", "Washington"
> "Alexj2", "Washington"
> "Bob", "NYC"
> "Aleex1", "Washington"
>
> to get this table:
>
> id, Name, City
> 1,"Alex", "Washington"
> 1,"Aleex1", "Washington"
> 1,"Alexj2", "Washington"
> 2,"Bob", "NYC"
> 2,"Booob", "NYC"
>
> So basically the id is a counter that starts from "1" and increments only
> when there is a different name. Please notice that the table has its names
> in a completely random order.
>
>
Write and combine a few subqueries that use window functions (namely lag
and row_number) to identify groups, label them, and assign rows to each
group (using a between condition on a join)

Pondering some (not tested) if you identify the boundary records in a
subquery you can assign them a value of 1 while all others take on null.
In the outer query you should be able to assign groups by simply
applying the sum function over the entire result such that at each boundary
value the presence of the 1 will increment the sum while the null rows will
use the sum value from the prior row.

David J.




--
View this message in context: 
http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829041.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: [GENERAL] Merge rows based on Levenshtein distance

Reply via email to