Re: Dictionary lookup possibilities

2009-04-18 Thread Jaco
Hi,

Thanks for the suggestions! It looks like the MemoryIndex is worth having a
detailed look at, so that's what I'll start on.

Thanks again, bye,

Jaco.


2009/4/17 Steven A Rowe sar...@syr.edu

 Hi Jaco,

 On 4/9/2009 at 2:58 PM, Jaco wrote:
  I'm struggling with some ideas, maybe somebody can help me with past
  experiences or tips. I have loaded a dictionary into a Solr index,
  using stemming and some stopwords in the analysis part of the schema.
  Each record holds a term from the dictionary, which can consist of
  multiple words. For some data analysis work, I want to send pieces
  of text (sentences actually) to Solr to retrieve all possible
  dictionary terms that could occur. Ideally, I want to construct a
  query that only returns those Solr records for which all individual
  words in that record are matched.
 
  For instance, my dictionary holds the following terms:
  1 - a b c d
  2 - c d e
  3 - a b
  4 - a e f g h
 
  If I put the sentence [a b c d f g h] in as a query, I want to receive
  dictionary items 1 (matching all words a b c d) and 3 (matching words a
  b) as matches.
 
  I have been puzzling about how to do this. The only way I found so far
  was to construct an OR query with all words of the sentence in it. In
  this case, that would result in all dictionary items being returned.
  This would then require some code to go over the search results and
  analyse each of them (e.g. by using the highlight function) to kick
  out 'false' matches, but I am looking for a more efficient way.
 
  Is there a way to do this with Solr functionality, or do I need to
  start looking into the Lucene API ..?

 Your problem could be modeled as a set of standing queries, where your
 dictionary entries are the *queries* (with all words required, maybe using a
 PhraseQuery or a SpanNearQuery), and the sentence is the document.

 Solr may not be usable in this context (extremely high volume queries),
 depending on your throughput requirements, but Lucene's MemoryIndex was
 designed for this kind of thing:

 
 http://lucene.apache.org/java/2_4_1/api/contrib-memory/org/apache/lucene/index/memory/MemoryIndex.html
 

 Steve
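
A minimal sketch of the standing-query idea above, in plain Python rather than Lucene: each dictionary entry is treated as a phrase query and the sentence as the single document it is run against. This is only a stand-in to show the inversion of roles; it does not use MemoryIndex, and the names contains_phrase and matching_entries are made up for illustration. It assumes whitespace tokenization and exact, in-order, contiguous matching (roughly what a PhraseQuery with no slop would require), with no stemming or stopword removal.

```python
def contains_phrase(doc_words, phrase_words):
    """Return True if phrase_words occurs as a contiguous run in doc_words."""
    n, m = len(doc_words), len(phrase_words)
    return m > 0 and any(
        doc_words[i:i + m] == phrase_words for i in range(n - m + 1)
    )


def matching_entries(dictionary, sentence):
    """Run every dictionary entry as a 'query' against one sentence."""
    doc_words = sentence.split()
    return [
        entry_id
        for entry_id, term in sorted(dictionary.items())
        if contains_phrase(doc_words, term.split())
    ]


# The example dictionary from the original question.
DICTIONARY = {1: "a b c d", 2: "c d e", 3: "a b", 4: "a e f g h"}

if __name__ == "__main__":
    print(matching_entries(DICTIONARY, "a b c d f g h"))
```

With a real MemoryIndex the loop would stay the same in spirit: index the sentence once, then test each prebuilt dictionary query against it.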




Re: Dictionary lookup possibilities

2009-04-16 Thread Chris Hostetter

: For instance, my dictionary holds the following terms:
: 1 - a b c d
: 2 - c d e
: 3 - a b
: 4 - a e f g h
: 
: If I put the sentence [a b c d f g h] in as a query, I want to receive
: dictionary items 1 (matching all words a b c d) and 3 (matching words a b)
: as matches.

this is a pretty hard problem in general ... in my mind i call it the 
longest matching sub-phrase problem, but i have no idea if it has a real 
name.

the only solution i know of using Lucene is to construct a phrase query 
for each of the sub phrases, giving a bigger query boost to the longer 
phrases ... but it might be possible to design a custom query impl for 
solving this problem.

(i've never had an important enough use case to dedicate a significant 
amount of time to figuring it out)
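
a rough sketch of the sub-phrase approach, in Python: it enumerates every contiguous sub-phrase of the sentence and builds a query string in Lucene query-parser syntax, boosting each phrase by its word count. the boost-equals-length choice and the function names are assumptions made for illustration. note that this only ranks longer matches higher; a phrase such as "c d" would still match a record like "c d e", so it does not by itself exclude records with unmatched words.

```python
def sub_phrases(words, min_len=1):
    """Yield every contiguous sub-phrase of words, longest first."""
    n = len(words)
    for length in range(n, min_len - 1, -1):
        for start in range(n - length + 1):
            yield words[start:start + length]


def boosted_phrase_query(sentence, min_len=2):
    """Build an OR of quoted phrases, each boosted by its length in words."""
    words = sentence.split()
    clauses = [
        '"%s"^%d' % (" ".join(phrase), len(phrase))
        for phrase in sub_phrases(words, min_len)
    ]
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(boosted_phrase_query("a b c d f g h"))
```

the number of sub-phrases grows quadratically with sentence length (28 for a 7-word sentence with min_len=1), which is one reason this gets expensive for long input.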





-Hoss



Re: Dictionary lookup possibilities

2009-04-16 Thread Shalin Shekhar Mangar
On Fri, Apr 17, 2009 at 3:37 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:


  this is a pretty hard problem in general ... in my mind i call it the
 longest matching sub-phrase problem, but i have no idea if it has a real
 name.

 the only solution i know of using Lucene is to construct a phrase query
 for each of the sub phrases, giving a bigger query boost to the longer
 phrases ... but it might be possible to design a custom query impl for
 solving this problem.


There was an issue opened for something similar but there is no patch yet.

https://issues.apache.org/jira/browse/SOLR-633

-- 
Regards,
Shalin Shekhar Mangar.


Dictionary lookup possibilities

2009-04-09 Thread Jaco
Hello,

I'm struggling with some ideas, maybe somebody can help me with past
experiences or tips. I have loaded a dictionary into a Solr index, using
stemming and some stopwords in the analysis part of the schema. Each record
holds a term from the dictionary, which can consist of multiple words. For
some data analysis work, I want to send pieces of text (sentences actually)
to Solr to retrieve all possible dictionary terms that could occur. Ideally,
I want to construct a query that only returns those Solr records for which
all individual words in that record are matched.

For instance, my dictionary holds the following terms:
1 - a b c d
2 - c d e
3 - a b
4 - a e f g h

If I put the sentence [a b c d f g h] in as a query, I want to receive
dictionary items 1 (matching all words a b c d) and 3 (matching words a b)
as matches.

I have been puzzling about how to do this. The only way I found so far was
to construct an OR query with all words of the sentence in it. In this case,
that would result in all dictionary items being returned. This would then
require some code to go over the search results and analyse each of them
(e.g. by using the highlight function) to kick out 'false' matches, but I am
looking for a more efficient way.

Is there a way to do this with Solr functionality, or do I need to start
looking into the Lucene API ..?
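
To make the two-step workaround described above concrete, here is a small Python sketch that works on plain word sets instead of a Solr index. The function names are hypothetical, and it assumes whitespace tokenization with no stemming or stopwords. The first step mimics the OR query (any shared word makes a record a candidate); the second is the post-filter that keeps only records whose words all occur in the sentence.

```python
def or_candidates(dictionary, sentence_words):
    """Mimic the OR query: keep records sharing at least one word."""
    return {
        entry_id: term
        for entry_id, term in dictionary.items()
        if sentence_words & set(term.split())
    }


def keep_full_matches(candidates, sentence_words):
    """Post-filter: keep records whose every word occurs in the sentence."""
    return sorted(
        entry_id
        for entry_id, term in candidates.items()
        if set(term.split()) <= sentence_words
    )


DICTIONARY = {1: "a b c d", 2: "c d e", 3: "a b", 4: "a e f g h"}

if __name__ == "__main__":
    words = set("a b c d f g h".split())
    candidates = or_candidates(DICTIONARY, words)
    print(sorted(candidates))                   # every record is a candidate
    print(keep_full_matches(candidates, words))  # only the full matches remain
```

This ignores word order and adjacency; it only checks that each word of a record appears somewhere in the sentence.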

Any help would be much appreciated as usual!

Thanks, bye,

Jaco.