This is not strictly related to opencog, but it might come in useful if you 
want to use it as part of an NLP / NLU pipeline where you need to 
spell-check a given text and link it to a knowledge base.

So the idea is that you have a text where there might be spelling mistakes. 
The easiest option would be to use an existing spell checker like hunspell 
/ aspell / ispell. The problem with that approach is that any time you add 
items to the knowledge base you need to update the spell checker 
dictionary. My idea is to rely instead on a single source-of-truth database 
that I can drive from Python or Scheme.

It seems the most widely used library for fuzzy string matching in Python 
is fuzzywuzzy. I tried to use it and here are a few results with timings. 
As far as I understand, fuzzywuzzy does not compile, preprocess or index 
the "choices" before guessing a match, which leads to very long run times, 
as the runs below show.


$ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 
10 resaerch

('öres', 90)
('erc', 90)
('e', 90)
('rch', 90)
('c', 90)
('c̄', 90)
('sae', 90)
('sé', 90)
('öre', 90)
('re', 90)

26.097001791000366

In the above query the e and a are swapped, and fuzzywuzzy fails to find 
anything even remotely similar. Note that the last line of each run is the 
run time in seconds.

$ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 
10 reserch

('research', 93)
('c̄', 90)
('öre', 90)
('rc', 90)
('ré', 90)
('ser', 90)
('rese', 90)
('re', 90)
('ch', 90)
('öres', 90)

26.26053023338318

$ python fw.py data/conceptnet-assertions-5.7.0.english-words-to-concept.tsv 
10 research

('research', 100)
('researchy', 94)
('ré', 90)
('sear', 90)
('rê', 90)
('öres', 90)
('ar', 90)
('nonresearcher', 90)
('c@', 90)
('unresearched', 90)

26.261364459991455

As you can see, the run time is already very long, and it will only grow 
as the knowledge base accumulates more words.

To help with that task I created a hash, in the spirit of simhash, that 
preserves similarity in the prefix of the hash so that it is easy to query 
in an Ordered Key-Value Store (OKVS). Here are the same queries using that 
algorithm:

$ python fuzz.py query 10 resaerch
* most similar according to bbk fuzzbuzz
** research      -2
0.011413335800170898


$ python fuzz.py query 10 reserch
* most similar according to bbk fuzzbuzz
** research      -1
** resch      -2
** resercher      -2
0.011811494827270508


$ python fuzz.py query 10 research
* most similar according to bbk fuzzbuzz
** research      0
** researches      -2
** researchee      -2
** researcher      -2
0.012357711791992188

I tried similar queries over Wikidata labels; it gives good results in 
under 250 ms.

As you can see it is much, much faster and the results seem more relevant. 
The algorithm can be found at: https://stackoverflow.com/a/58791875/140837
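
For intuition, here is a toy sketch in the same family (this is NOT the 
exact algorithm from the link, just an illustration): hash character 
trigrams into a fixed-width fingerprint, simhash-style, so that similar 
strings end up at a small Hamming distance. The linked answer goes further 
and concentrates the similarity in the prefix of the hash, which is what 
makes an OKVS prefix scan practical:

    # Toy illustration only: a simhash over character trigrams, compared
    # by Hamming distance.  The real algorithm linked above additionally
    # arranges the bits so that similar strings share a hash *prefix*.
    import hashlib


    def features(word, n=3):
        word = '$' + word + '$'
        return [word[i:i + n] for i in range(len(word) - n + 1)]


    def simhash(word, bits=64):
        counts = [0] * bits
        for feature in features(word):
            digest = hashlib.blake2b(feature.encode(), digest_size=8).digest()
            h = int.from_bytes(digest, 'big')
            for i in range(bits):
                counts[i] += 1 if (h >> i) & 1 else -1
        out = 0
        for i, count in enumerate(counts):
            if count > 0:
                out |= 1 << i
        return out


    def hamming(a, b):
        return bin(a ^ b).count('1')


    # the misspelling should usually land much closer than an unrelated string
    print(hamming(simhash('research'), simhash('resaerch')))
    print(hamming(simhash('research'), simhash('öres')))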

I would be glad if someone could try that algorithm in their system.

Similarly, I would be glad for pointers on how to evaluate it 
(precision / recall?) against a gold standard.
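
Something like the following is what I have in mind, assuming a gold 
standard of (misspelling, expected correction) pairs and a hypothetical 
correct(word, k) function that returns the top-k candidate corrections:

    # Hypothetical evaluation sketch over a gold standard of
    # (misspelling, expected_correction) pairs.
    def evaluate(gold, correct, k=10):
        hits = 0
        returned = 0
        for misspelling, expected in gold:
            candidates = correct(misspelling, k)
            returned += len(candidates)
            if expected in candidates:
                hits += 1
        # recall: fraction of gold pairs whose correction appears in the top-k
        recall = hits / len(gold)
        # precision: fraction of returned candidates that were the expected one
        precision = hits / returned if returned else 0.0
        return precision, recall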

This is one step toward the goal of re-implementing link-grammar using 
only a SAT solver and an OKVS.
