Re: Search within a sentence (revisited)

Mark Miller Thu, 21 Jul 2011 13:25:41 -0700

Yeah, it's off trunk - I'll submit a 3X patch in a bit - just have to change 
that to an IndexReader I believe.


- Mark

On Jul 21, 2011, at 4:01 PM, Peter Keegan wrote:

> Does this patch require the trunk version? I'm using 3.2 and
> 'AtomicReaderContext' isn't there.
> 
> Peter
> 
> On Thu, Jul 21, 2011 at 3:07 PM, Mark Miller <[email protected]> wrote:
> 
>> Hey Peter,
>> 
>> Getting sucked back into Spans...
>> 
>> That test should pass now - I uploaded a new patch to
>> https://issues.apache.org/jira/browse/LUCENE-777
>> 
>> Further tests may be needed though.
>> 
>> - Mark
>> 
>> 
>> On Jul 21, 2011, at 9:28 AM, Peter Keegan wrote:
>> 
>>> Hi Mark,
>>> 
>>> Here is a unit test using a version of 'SpanWithinQuery' modified for 3.2
>>> ('getTerms' removed) . The last test fails (search for "1" and "3").
>>> 
>>> package org.apache.lucene.search.spans;
>>> 
>>> import java.io.Reader;
>>> 
>>> import org.apache.lucene.analysis.Analyzer;
>>> import org.apache.lucene.analysis.TokenStream;
>>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>>> import
>>> org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>> import org.apache.lucene.document.Document;
>>> import org.apache.lucene.document.Field;
>>> import org.apache.lucene.index.IndexReader;
>>> import org.apache.lucene.index.RandomIndexWriter;
>>> import org.apache.lucene.index.Term;
>>> import org.apache.lucene.store.Directory;
>>> import org.apache.lucene.search.IndexSearcher;
>>> import org.apache.lucene.search.PhraseQuery;
>>> import org.apache.lucene.search.ScoreDoc;
>>> import org.apache.lucene.search.TermQuery;
>>> import org.apache.lucene.search.spans.SpanNearQuery;
>>> import org.apache.lucene.search.spans.SpanQuery;
>>> import org.apache.lucene.search.spans.SpanTermQuery;
>>> import org.apache.lucene.util.LuceneTestCase;
>>> 
>>> public class TestSentence extends LuceneTestCase {
>>> public static final String field = "field";
>>> public static final String START = "^";
>>> public static final String END = "$";
>>> public void testSetPosition() throws Exception {
>>> Analyzer analyzer = new Analyzer() {
>>> @Override
>>> public TokenStream tokenStream(String fieldName, Reader reader) {
>>> return new TokenStream() {
>>> private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END,
>>> "9"};
>>> private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
>>> private int i = 0;
>>> 
>>> PositionIncrementAttribute posIncrAtt =
>>> addAttribute(PositionIncrementAttribute.class);
>>> CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>>> OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>>> 
>>> @Override
>>> public boolean incrementToken() {
>>> assertEquals(TOKENS.length, INCREMENTS.length);
>>> if (i == TOKENS.length)
>>> return false;
>>> clearAttributes();
>>> termAtt.append(TOKENS[i]);
>>> offsetAtt.setOffset(i,i);
>>> posIncrAtt.setPositionIncrement(INCREMENTS[i]);
>>> i++;
>>> return true;
>>> }
>>> };
>>> }
>>> };
>>> Directory store = newDirectory();
>>> RandomIndexWriter writer = new RandomIndexWriter(random, store,
>> analyzer);
>>> Document d = new Document();
>>> d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
>>> writer.addDocument(d);
>>> IndexReader reader = writer.getReader();
>>> writer.close();
>>> IndexSearcher searcher = newSearcher(reader);
>>> 
>>> SpanTermQuery startSentence = makeSpanTermQuery(START);
>>> SpanTermQuery endSentence = makeSpanTermQuery(END);
>>> SpanQuery[] clauses = new SpanQuery[2];
>>> clauses[0] = makeSpanTermQuery("1");
>>> clauses[1] = makeSpanTermQuery("2");
>>> SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE,
>>> false); // SpanAndQuery equivalent
>>> SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>> System.out.println("query: "+query);
>>> ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
>>> assertEquals(hits.length, 1);
>>> 
>>> clauses[1] = makeSpanTermQuery("4");
>>> allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); //
>>> SpanAndQuery equivalent
>>> query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>> System.out.println("query: "+query);
>>> hits = searcher.search(query, null, 1000).scoreDocs;
>>> assertEquals(hits.length, 0);
>>> 
>>> PhraseQuery pq = new PhraseQuery();
>>> pq.add(new Term(field, "3"));
>>> pq.add(new Term(field, "4"));
>>> hits = searcher.search(pq, null, 1000).scoreDocs;
>>> assertEquals(hits.length, 1);
>>> 
>>> clauses[1] = makeSpanTermQuery("3");
>>> allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); //
>>> SpanAndQuery equivalent
>>> query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>> System.out.println("query: "+query);
>>> hits = searcher.search(query, null, 1000).scoreDocs;
>>> assertEquals(hits.length, 1);
>>> 
>>> 
>>> }
>>> 
>>> public SpanTermQuery makeSpanTermQuery(String text) {
>>> return new SpanTermQuery(new Term(field, text));
>>> }
>>> public TermQuery makeTermQuery(String text) {
>>> return new TermQuery(new Term(field, text));
>>> }
>>> }
>>> 
>>> Peter
>>> 
>>> On Wed, Jul 20, 2011 at 9:22 PM, Mark Miller <[email protected]>
>> wrote:
>>> 
>>>> 
>>>> On Jul 20, 2011, at 7:44 PM, Mark Miller wrote:
>>>> 
>>>>> 
>>>>> On Jul 20, 2011, at 11:27 AM, Peter Keegan wrote:
>>>>> 
>>>>>> Mark Miller's 'SpanWithinQuery' patch
>>>>>> seems to have the same issue.
>>>>> 
>>>>> If I remember right (It's been more the a couple years), I did index
>> the
>>>> sentence markers at the same position as the last word in the sentence.
>> And
>>>> I think the limitation that I ate was that the word could belong to both
>>>> it's true sentence, and the one after it.
>>>>> 
>>>>> - Mark Miller
>>>>> lucidimagination.com
>>>> 
>>>> Perhaps you could index the sentence marker at both the last word of the
>>>> sentence as well as the first word of the next sentence if there is one.
>>>> This would seem to solve the above limitation as well?
>>>> 
>>>> - Mark Miller
>>>> lucidimagination.com
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>> 
>>>> 
>> 
>> - Mark Miller
>> lucidimagination.com
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>> 
>> 

- Mark Miller
lucidimagination.com









---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Search within a sentence (revisited)

Reply via email to