[ 
https://issues.apache.org/jira/browse/LUCENE-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006254#comment-13006254
 ] 

Steven Rowe commented on LUCENE-2749:
-------------------------------------

Hi Elmar,

I haven't had a chance to do more than an hour or two of work on this, and that 
was a while back, so please feel free to run with it.

You should know, though, that Robert Muir and Yonik Seeley (both Lucene/Solr 
developers) expressed skepticism (on #lucene IRC) about whether this filter 
belongs in Lucene itself, because in their experience, collocations are used by 
non-search software, and they believe that Lucene should remain focused 
exclusively on search.  

Robert Muir also thinks that components that support Boolean search (i.e., not 
ranked search) should go elsewhere.  

I personally disagree with these restrictions in general, and I think that a 
co-occurrence filter could directly support search.  See this 
solr-u...@lucene.apache.org mailing list discussion for an example I gave (and 
one of the reasons I made this issue): 
http://www.lucidimagination.com/search/document/f69f877e0fa05d17/how_do_i_this_in_solr#d9d5932e7074d356
 . In this thread, I described a way to solve the original poster's problem 
using a co-occurrence filter exactly like the one proposed here.

I mention all this to caution you that work you put in here may never be 
committed to Lucene itself.

The mailing list thread I mentioned above describes the main limitation a 
filter like this will have: combinatorial explosion of generated terms.  I 
haven't figured out how to manage this, but it occurs to me that the 
two-term-collocation case is less problematic in this regard than the 
generalized case (whole-field window, all possible combinations).  I had a 
vague implementation conception of incrementing a fixed-width integer to 
iterate over the combinations, using the integer's bits to include/exclude 
input terms in the output "termset" tokens.  Using a 32-bit integer to track 
combinations would limit the length of an input token stream to 32 tokens, but 
in the generalized case of all combinations, I'm pretty sure the limiting 
factor would be the number of generated terms rather than the number of bits 
available: 32 input tokens already yield 2^32 - 1 (over four billion) 
non-empty combinations.  I guess the question is how to handle cases that produce 
fewer terms than all combinations of terms from an input token stream, e.g. the 
two-term-collocation case, without imposing the restrictions necessary in the 
generalized case.
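
The bit-counter idea above can be sketched roughly as follows. This is only an 
illustration, not code from this issue or from Lucene: the class name 
TermsetSketch, the method termsets, and the "_" separator are all hypothetical, 
input terms are assumed to be distinct, and the sketch caps the input at 30 
terms so that the counter stays a positive signed Java int.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of bitmask-driven termset generation; not part of Lucene. */
final class TermsetSketch {

    /**
     * Returns every termset over the given (assumed distinct) terms whose size
     * lies in [minSize, maxSize]. Bit i of the counter decides whether
     * terms.get(i) is included in the current termset.
     */
    static List<String> termsets(List<String> terms, int minSize, int maxSize) {
        int n = terms.size();
        if (n > 30) { // sketch-only limit: keeps (1 << n) positive in a signed int
            throw new IllegalArgumentException("too many input terms: " + n);
        }
        List<String> out = new ArrayList<>();
        for (int mask = 1; mask < (1 << n); mask++) {
            int size = Integer.bitCount(mask); // number of included terms
            if (size < minSize || size > maxSize) {
                continue; // combination visited but not emitted
            }
            TreeSet<String> members = new TreeSet<>(); // lexical ordering
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    members.add(terms.get(i));
                }
            }
            out.add(String.join("_", members));
        }
        return out;
    }
}
```

Calling TermsetSketch.termsets(List.of("c", "a", "b"), 2, 2) emits only the 
three two-term sets, but the loop still visits all 2^n - 1 masks to find them. 
That is the mismatch described above: a restricted case such as two-term 
collocations needs only n(n-1)/2 outputs and would probably be better served by 
a direct nested loop than by the general counter.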

Here are a couple of recent information retrieval papers using "termset" to 
mean "indexed token containing multiple input terms":

"TSS: Efficient Term Set Search in Large Peer-to-Peer Textual Collections"
http://www.cs.ust.hk/~liu/TSS-TC.pdf

"Termset-based Indexing and Query Processing in P2P Search"
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5384831

(Sorry, I couldn't find a free public location for the second paper.)

> Co-occurrence filter
> --------------------
>
>                 Key: LUCENE-2749
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2749
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 4.0
>
>
> The co-occurrence filter to be developed here will output sets of tokens that 
> co-occur within a given window onto a token stream.  
> These token sets can be ordered either lexically (to allow order-independent 
> matching/counting) or positionally (e.g. sliding windows of positionally 
> ordered co-occurring terms that include all terms in the window are called 
> n-grams or shingles). 
> The parameters to this filter will be: 
> * window size: this can be a fixed sequence length, sentence/paragraph 
> context (these will require sentence/paragraph segmentation, which is not in 
> Lucene yet), or over the entire token stream (full field width)
> * minimum number of co-occurring terms: >= 2
> * maximum number of co-occurring terms: <= window size
> * token set ordering (lexical or positional)
> One use case for co-occurring token sets is as candidates for collocations.
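
As a rough illustration of the simplest configuration in the description above 
(fixed-length window, exactly two co-occurring terms, lexical ordering), here 
is a standalone sketch. It does not use Lucene's analysis API, and the names 
CooccurrencePairsSketch and pairs, along with the "_" separator, are 
hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of two-term co-occurrence within a fixed window; not part of Lucene. */
final class CooccurrencePairsSketch {

    /**
     * Returns the distinct two-term sets whose members appear together within
     * some run of `window` consecutive tokens. Each pair is written in lexical
     * order so that "a b" and "b a" produce the same output token.
     */
    static Set<String> pairs(List<String> tokens, int window) {
        Set<String> out = new LinkedHashSet<>(); // keeps first-seen order, drops repeats
        for (int i = 0; i < tokens.size(); i++) {
            int end = Math.min(tokens.size(), i + window);
            for (int j = i + 1; j < end; j++) {
                String a = tokens.get(i);
                String b = tokens.get(j);
                if (a.equals(b)) {
                    continue; // a term does not co-occur with itself
                }
                out.add(a.compareTo(b) <= 0 ? a + "_" + b : b + "_" + a);
            }
        }
        return out;
    }
}
```

With a window of 3 over the tokens quick, brown, fox, jumps this yields five 
pairs; widening the window to the full stream yields all six. The output grows 
at most quadratically with the window size, which is why the two-term case is 
easier to bound than the all-combinations case.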
