Hi

Has there been any work that tries to integrate kernel methods [1] with Solr? I 
am interested in using kernel methods to solve synonym, hyponym, and polysemy 
(disambiguation) problems, which Solr's vector space model ("bag of words") does 
not capture. 

For example, imagine we have only 3 terms in our corpus: "puma", "cougar" and 
"feline". The 3 terms have obvious interdependencies ("puma" disambiguates to 
"cougar"; cougars and pumas are instances of felines, i.e. hyponyms). Now 
imagine 2 docs, d1 and d2, that have the following TF-IDF vectors. 

                 puma, cougar, feline
d1       =   [  2,        0,         0]
d2       =   [  0,        1,         0]

i.e. d1 has no mention of the terms "cougar" or "feline", and conversely d2 has 
no mention of the terms "puma" or "feline". Hence, under the vector space 
approach, d1 and d2 are not related at all (each interpretation of the terms 
gets its own orthogonal dimension), which is not what we want to conclude. 

What I need is to include a kernel matrix (as data), such as the following, 
that captures these relationships:

                       puma, cougar, feline
puma    =   [  1,        1,         0.4]
cougar  =   [  1,        1,         0.4]
feline  =   [  0.4,     0.4,         1]

then recompute each document's TF-IDF vector as the product of (1) the original 
vector and (2) the kernel matrix, resulting in

                 puma, cougar, feline
d1       =   [  2,        2,         0.8]
d2       =   [  1,        1,         0.4]

(Note that the new vectors are much less sparse, and d1 and d2 are now highly 
similar.) 
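For concreteness, here is a minimal sketch (in Python/NumPy, not Solr code) of the application-layer computation I described above; the vectors and kernel matrix are the toy values from the example:

```python
import numpy as np

# Toy TF-IDF vectors over the vocabulary (puma, cougar, feline)
d1 = np.array([2.0, 0.0, 0.0])
d2 = np.array([0.0, 1.0, 0.0])

# Kernel (term-similarity) matrix K from the example above
K = np.array([
    [1.0, 1.0, 0.4],   # puma
    [1.0, 1.0, 0.4],   # cougar
    [0.4, 0.4, 1.0],   # feline
])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Plain bag-of-words: d1 and d2 share no terms, so similarity is 0
print(cosine(d1, d2))   # 0.0

# Smoothed vectors d * K, as in the example
d1k = d1 @ K            # [2.0, 2.0, 0.8]
d2k = d2 @ K            # [1.0, 1.0, 0.4]

# After smoothing, d2k is a scalar multiple of d1k, so similarity is 1
print(cosine(d1k, d2k))
```

This is exactly the inefficiency I mean: done naively, every document vector is densified by a vocabulary-sized matrix multiply, which is what I'd like to avoid at query time.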

I can solve this problem (inefficiently) at the application layer, but I was 
wondering whether there have been any attempts within the community to solve 
similar problems efficiently, without paying a hefty response-time price.

Thank you, 

Peyman

[1] http://en.wikipedia.org/wiki/Kernel_methods
