Greetings!


My original problem is to de-duplicate a list of around 0.3 million company names.

Since a company name can be (mis)spelt in numerous ways, exact matching
obviously won't work.


To make the searches faster I am using tsearch. For each company name I want to
search for other companies whose names are similar to it.


Since including all the vector elements of a given company in the query reduces
the chance of a match, I am thinking of excluding the high-frequency words
from the query.

Hence I need to find the high-frequency elements, such as 'consulting', 'limited', 'Private' and
'Industries', that occur commonly in company names.
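
To make the idea concrete, here is a rough sketch in Python of the filtering I have in mind. It is only an illustration, not what I am running: the function name build_query, the COMMON set and the choice of " & " as the joining operator are all assumptions made for the example, and the lexemes are the stemmed forms as they appear in co_name_vec.

```python
def build_query(lexemes, common, operator=" & "):
    """Join the lexemes that are NOT in `common` into a query string."""
    rare = [lex for lex in lexemes if lex not in common]
    # If every lexeme is a common one, fall back to the full list so that
    # a name made only of frequent words still produces a query.
    kept = rare if rare else list(lexemes)
    return operator.join(kept)


# Hypothetical set of high-frequency lexemes (assumed, not measured).
COMMON = {"ltd", "pvt", "limit", "privat", "industri", "consult", "compani"}

# Lexemes of 'Gulbrandsen Chemicals Pvt. Ltd.' from the sample data.
print(build_query(["ltd", "pvt", "chemic", "gulbrandsen"], COMMON))
```

With this set, only 'chemic' and 'gulbrandsen' would be left in the query for that company.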


In my table I have populated the co_name_vec field as strip(to_tsvector(co_name)).
Can anyone help me analyze co_name_vec for the high-frequency words?
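
The kind of analysis I mean could be sketched outside the database roughly as below, assuming the co_name_vec values have been exported as text in the form shown in the sample data. The function name lexeme_frequencies and the regular expression are my own assumptions for the example; the pattern does not handle lexemes that contain a quote character.

```python
import re
from collections import Counter

# Matches each quoted lexeme in the text form of a stripped tsvector,
# e.g. "'ltd' 'pvt' 'chemic'". Assumes lexemes contain no quote characters.
LEXEME = re.compile(r"'([^']*)'")


def lexeme_frequencies(vectors):
    """Count in how many company names each lexeme occurs."""
    counts = Counter()
    for vec in vectors:
        # set() so that a lexeme is counted once per company name.
        counts.update(set(LEXEME.findall(vec)))
    return counts


# Three co_name_vec values taken from the sample data.
sample = [
    "'ltd' 'pvt' 'chemic' 'gulbrandsen'",
    "'ltd' 'pvt' 'rgv' 'consult' 'telecom'",
    "'trade' 'consult' 'partner' 'european'",
]

freq = lexeme_frequencies(sample)
for lexeme, n in freq.most_common(5):
    print(lexeme, n)
```

The lexemes at the top of such a list would be the candidates to leave out of the query.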


I would also like to know of any alternative or better solutions to this problem.


Regards,
Mallah.



SAMPLE DATA.

+-----------------------------------------------------+----------------------------------------------------------+
| co_name                                             | co_name_vec                                              |
+-----------------------------------------------------+----------------------------------------------------------+
| European Trade Partner & Consulting                 | 'trade' 'consult' 'partner' 'european'                   |
| Gulbrandsen Chemicals Pvt. Ltd.                     | 'ltd' 'pvt' 'chemic' 'gulbrandsen'                       |
| Govt. of Karnataka, Vision Group on Biotechnology   | 'govt' 'group' 'vision' 'karnataka' 'biotechnolog'       |
| Digital Globalsoft Ltd. (A Hewlett Packard Company) | 'ltd' 'digit' 'compani' 'hewlett' 'packard' 'globalsoft' |
| Shanon Construction Material Industries             | 'materi' 'shanon' 'industri' 'construct'                 |
| singpore india trade rsources company               | 'india' 'trade' 'rsourc' 'compani' 'singpor'             |
| RGV TELECOM CONSULTANTS PVT. LTD.                   | 'ltd' 'pvt' 'rgv' 'consult' 'telecom'                    |
| avid information search and documents (p) ltd.      | 'p' 'ltd' 'avid' 'inform' 'search' 'document'            |
| Tavant Technologies India (P) Ltd.                  | 'p' 'ltd' 'india' 'tavant' 'technolog'                   |
| Maschinen Fabrik (India) Pvt. Ltd                   | 'ltd' 'pvt' 'india' 'fabrik' 'maschinen'                 |
| Manishri Refractories and Ceramics Pvt. Ltd.        | 'ltd' 'pvt' 'ceram' 'manishri' 'refractori'              |
| xavier export import management institute           | 'manag' 'export' 'import' 'xavier' 'institut'            |
| Best InformationTechnology ltd.                     | 'ltd' 'best' 'informationtechnolog'                      |
| FutureCalls Technology Private Limited              | 'limit' 'privat' 'futurecal' 'technolog'                 |
| mak controls and systems pvt ltd                    | 'ltd' 'mak' 'pvt' 'system' 'control'                     |
| NATIONAL RESEARCH CENTRE FOR CASHEW                 | 'centr' 'cashew' 'nation' 'research'                     |
| The Madras Aluminium Company Ltd.                   | 'ltd' 'madra' 'compani' 'aluminium'                      |
| Shriram Institute for Industrial Research           | 'shriram' 'industri' 'institut' 'research'               |
| All India Carpet Trade Fair Committee               | 'fair' 'india' 'trade' 'carpet' 'committe'               |
| Tuff Security & Allied Services                     | 'alli' 'tuff' 'secur' 'servic'                           |
+-----------------------------------------------------+----------------------------------------------------------+
(20 rows)


