Re: Dictionary lookup possibilities
Hi,

Thanks for the suggestions! It looks like MemoryIndex is worth a detailed look, so that's what I'll start on.

Thanks again, bye,

Jaco.

2009/4/17 Steven A Rowe sar...@syr.edu

> Hi Jaco,
>
> On 4/9/2009 at 2:58 PM, Jaco wrote:
> > I'm struggling with some ideas, maybe somebody can help me with past experiences or tips. I have loaded a dictionary into a Solr index, using stemming and some stopwords in the analysis part of the schema.
> >
> > Each record holds a term from the dictionary, which can consist of multiple words. For some data analysis work, I want to send pieces of text (sentences actually) to Solr to retrieve all possible dictionary terms that could occur. Ideally, I want to construct a query that only returns those Solr records for which all individual words in that record are matched.
> >
> > For instance, my dictionary holds the following terms:
> > 1 - a b c d
> > 2 - c d e
> > 3 - a b
> > 4 - a e f g h
> >
> > If I put the sentence [a b c d f g h] in as a query, I want to receive dictionary items 1 (matching all words a b c d) and 3 (matching words a b) as matches.
> >
> > I have been puzzling about how to do this. The only way I found so far was to construct an OR query with all words of the sentence in it. In this case, that would result in all dictionary items being returned. This would then require some code to go over the search results and analyse each of them (e.g. by using the highlight function) to kick out 'false' matches, but I am looking for a more efficient way.
> >
> > Is there a way to do this with Solr functionality, or do I need to start looking into the Lucene API ..?
>
> Your problem could be modeled as a set of standing queries, where your dictionary entries are the *queries* (with all words required, maybe using a PhraseQuery or a SpanNearQuery), and the sentence is the document.
>
> Solr may not be usable in this context (extremely high volume queries), depending on your throughput requirements, but Lucene's MemoryIndex was designed for this kind of thing:
>
> http://lucene.apache.org/java/2_4_1/api/contrib-memory/org/apache/lucene/index/memory/MemoryIndex.html
>
> Steve
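
The standing-queries model Steve describes can be sketched in plain Java, without Lucene, to show the inversion: dictionary entries are stored as phrase queries and each sentence is the single document they run against. This is only an illustration under assumptions: the class and method names are hypothetical, matching is exact and contiguous (roughly a phrase query with no slop), tokenization is whitespace-only, and none of this is the MemoryIndex API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: dictionary entries act as standing phrase queries,
// and each incoming sentence is the one "document" they are run against.
class StandingPhraseQueries {
    // Entry id -> entry tokens, kept in insertion order.
    private final Map<Integer, List<String>> queries = new LinkedHashMap<>();

    void add(int id, String entry) {
        queries.put(id, tokenize(entry));
    }

    // Whitespace tokenizer (assumption). A real setup would apply the same
    // stemming and stopword analysis to both the entries and the sentence.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Returns the ids of entries whose tokens occur contiguously, in order,
    // somewhere in the sentence.
    List<Integer> match(String sentence) {
        List<String> doc = tokenize(sentence);
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> q : queries.entrySet()) {
            List<String> phrase = q.getValue();
            if (!phrase.isEmpty() && Collections.indexOfSubList(doc, phrase) >= 0) {
                hits.add(q.getKey());
            }
        }
        return hits;
    }
}
```

With the four example entries loaded, matching the sentence "a b c d f g h" yields entries 1 and 3. The loop visits every entry per sentence, so this sketch says nothing about throughput; that is the part MemoryIndex is meant to help with.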
Re: Dictionary lookup possibilities
: For instance, my dictionary holds the following terms:
: 1 - a b c d
: 2 - c d e
: 3 - a b
: 4 - a e f g h
:
: If I put the sentence [a b c d f g h] in as a query, I want to receive
: dictionary items 1 (matching all words a b c d) and 3 (matching words a b)
: as matches

this is a pretty hard problem in general ... in my mind i call it the "longest matching sub-phrase" problem, but i have no idea if it has a real name.

the only solution i know of using Lucene is to construct a phrase query for each of the sub phrases, giving a bigger query boost to the longer phrases ... but it might be possible to design a custom query impl for solving this problem.

(i've never had an important enough use case to dedicate a significant amount of time to figuring it out)

-Hoss
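
A rough sketch of the sub-phrase approach Hoss describes: enumerate every contiguous sub-phrase of the sentence, longest first, and emit each as a quoted phrase clause boosted by its word count using the query parser's `"..."^N` syntax. The class name, the `maxWords` cap, and the choice of boost equal to phrase length are assumptions made for illustration; tokens are not escaped, so this only suits plain words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds one boosted phrase clause per contiguous
// sub-phrase of a sentence, longest sub-phrases first.
class SubPhraseQueryBuilder {

    // maxWords caps the sub-phrase length; a sentence of n words has
    // n * (n + 1) / 2 sub-phrases when uncapped.
    static List<String> clauses(String sentence, int maxWords) {
        List<String> out = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int len = Math.min(maxWords, words.length); len >= 1; len--) {
            for (int start = 0; start + len <= words.length; start++) {
                String phrase = String.join(" ", Arrays.copyOfRange(words, start, start + len));
                // Boost equals the number of words (assumption, not a tuned value).
                out.add("\"" + phrase + "\"^" + len);
            }
        }
        return out;
    }

    // Joins the clauses into a single OR query string.
    static String query(String sentence, int maxWords) {
        return String.join(" OR ", clauses(sentence, maxWords));
    }
}
```

Boosting only ranks the results: on a tokenized field the clause for "a b" still matches the longer entry "a b c d", so entries that are not fully covered by the sentence would still have to be filtered out afterwards.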
Re: Dictionary lookup possibilities
On Fri, Apr 17, 2009 at 3:37 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

> this is a pretty hard problem in general ... in my mind i call it the "longest matching sub-phrase" problem, but i have no idea if it has a real name.
>
> the only solution i know of using Lucene is to construct a phrase query for each of the sub phrases, giving a bigger query boost to the longer phrases ... but it might be possible to design a custom query impl for solving this problem.

There was an issue opened for something similar, but there is no patch yet.

https://issues.apache.org/jira/browse/SOLR-633

--
Regards,
Shalin Shekhar Mangar.
Dictionary lookup possibilities
Hello,

I'm struggling with some ideas, maybe somebody can help me with past experiences or tips. I have loaded a dictionary into a Solr index, using stemming and some stopwords in the analysis part of the schema.

Each record holds a term from the dictionary, which can consist of multiple words. For some data analysis work, I want to send pieces of text (sentences actually) to Solr to retrieve all possible dictionary terms that could occur. Ideally, I want to construct a query that only returns those Solr records for which all individual words in that record are matched.

For instance, my dictionary holds the following terms:

1 - a b c d
2 - c d e
3 - a b
4 - a e f g h

If I put the sentence [a b c d f g h] in as a query, I want to receive dictionary items 1 (matching all words a b c d) and 3 (matching words a b) as matches.

I have been puzzling about how to do this. The only way I found so far was to construct an OR query with all words of the sentence in it. In this case, that would result in all dictionary items being returned. This would then require some code to go over the search results and analyse each of them (e.g. by using the highlight function) to kick out 'false' matches, but I am looking for a more efficient way.

Is there a way to do this with Solr functionality, or do I need to start looking into the Lucene API ..?

Any help would be much appreciated as usual!

Thanks, bye,

Jaco.
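
For reference, the client-side clean-up step described above (run the OR query, then discard 'false' matches) can be sketched as a set-containment check: a candidate entry is kept only if every one of its words occurs in the sentence. The class and method names are hypothetical, tokenization is whitespace-only (stemming and stopwords are ignored), and word order and adjacency are deliberately not checked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical post-filter over the candidates returned by an OR query.
class AllWordsFilter {

    // Keeps only the candidate entries whose words all occur in the sentence.
    static List<String> keepFullyMatched(List<String> candidates, String sentence) {
        Set<String> sentenceWords = new HashSet<>(Arrays.asList(sentence.trim().split("\\s+")));
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            List<String> tokens = Arrays.asList(candidate.trim().split("\\s+"));
            if (sentenceWords.containsAll(tokens)) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

Applied to the four example entries and the sentence "a b c d f g h", this keeps "a b c d" and "a b". It still requires fetching every candidate from the index first, which is the inefficiency the question is trying to avoid.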