Hi

Has there been any work that tries to integrate kernel methods [1] with Solr? I am interested in using kernel methods to solve synonym, hyponym and polysemy (disambiguation) problems which Solr's vector space model ("bag of words") does not capture.
For example, imagine we have only 3 words in our corpus: "puma", "cougar" and "feline". The 3 words have obvious interdependencies (puma disambiguates to cougar; cougar and puma are instances of felines, i.e. hyponyms). Now imagine 2 docs, d1 and d2, with the following TF-IDF vectors:

         puma  cougar  feline
  d1 = [   2,    0,      0   ]
  d2 = [   0,    1,      0   ]

i.e. d1 has no mention of the terms cougar or feline, and conversely d2 has no mention of the terms puma or feline. Hence, under the vector-space approach, d1 and d2 are not related at all (and each interpretation of the terms has a unique vector), which is not the conclusion we want. What I need is to include a kernel matrix (as data), such as the following, that captures these relationships:

             puma  cougar  feline
  puma   = [  1,     1,     0.4 ]
  cougar = [  1,     1,     0.4 ]
  feline = [ 0.4,   0.4,     1  ]

and then recompute each TF-IDF vector as the product of (1) the original vector and (2) the kernel matrix, resulting in:

         puma  cougar  feline
  d1 = [   2,    2,     0.8  ]
  d2 = [   1,    1,     0.4  ]

(note that the new vectors are much less sparse). I can solve this problem (inefficiently) at the application layer, but I was wondering whether there have been any attempts within the community to solve similar problems efficiently, without paying a hefty response-time price?

thank you

Peyman

[1] http://en.wikipedia.org/wiki/Kernel_methods
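For what it's worth, here is a minimal sketch of the transformation I have in mind, done at the application layer with NumPy (the kernel matrix values are the hypothetical ones from the example above, not derived from any real corpus):

```python
import numpy as np

# Vocabulary order: ["puma", "cougar", "feline"]
# Original TF-IDF document vectors from the example.
d1 = np.array([2.0, 0.0, 0.0])
d2 = np.array([0.0, 1.0, 0.0])

# Hypothetical kernel (term-similarity) matrix, rows/columns
# ordered puma, cougar, feline.
K = np.array([
    [1.0, 1.0, 0.4],
    [1.0, 1.0, 0.4],
    [0.4, 0.4, 1.0],
])

# Re-map each document vector through the kernel matrix.
d1_k = d1 @ K   # [2.0, 2.0, 0.8]
d2_k = d2 @ K   # [1.0, 1.0, 0.4]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Before the transform the docs are orthogonal (similarity 0);
# afterwards d2_k is just d1_k scaled by 0.5, so similarity is 1.
print(cosine(d1, d2))      # 0.0
print(cosine(d1_k, d2_k))  # 1.0
```

This is exactly the inefficient application-layer approach I mentioned: every document vector gets densified by the kernel product, which is what I would like to avoid having to do outside the search engine.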