Re: Fwd: SpanNearQuery: how to get the "intra-span" matching positions?

Claudio Corsi Mon, 09 Jun 2008 01:37:32 -0700

Hi!
I've implemented a non efficient (but working) solution to the "intra-span"
matching problem. With these modifications (see the attachment) I have a way
to pick all the matching positions *inside* the current NearSpan using the
new method matchingSpans() (to call after each next()).

There are three files:

1) NearSpans.java (just an interface declaring the matchingSpans method, I
need it for my framework, but it is not mandatory)
2) NearSpansUnordered.java
3) NearSpansOrdered.java

(the package declaration is relative to my project, please ignore it!! ;)

The files 2) and 3) are copies of the one I found in Lucene
2.3.1.NearSpansUnordered just provide the implementation of
matchinSpans() without
any other modifications to the code: it just cycles over the list of
SpanCell kept inside the current instance and filter out the elements whose
doc() is not equals to the current doc() of the Spans, and whose
start()/end() mathcing positions are not "compatible" with the ones of the
current Spans state.

The case of NearSpansOrdered is a little bit more complicated. I had to
maintain track of the subSpans states in the method
shrinkToAfterShortestMatch(). So, I've introduced the subSpansCopy ArrayList
and the inner class SpansCopy that just copies the doc()/start()/end()
values of the passed span. Then I've used this list in the implementation of
matchingSpans() in a way similar to NearSpansUnorderd.

These copies and the matchingSpans() implementations are not very efficient.
I think that this problem can be solved in a better way. But for my
collection and my application it works fine and fast.

Hope that these files will help someone else!

Cheers.

On Fri, Jun 6, 2008 at 6:34 PM, Paul Elschot <[EMAIL PROTECTED]> wrote:

> See below.
>
> Op Friday 06 June 2008 16:23:15 schreef Claudio Corsi:
> > Hi,
> > I'm trying to extend the NearSpansOrdered and NearSpansUnordered
> > classes of the Lucene core in order to create a way to access to the
> > inner positions of the current span (in a next() loop). Suppose that
> > the current near span starts at position N and ends at position N+k,
> > I would discover the starting/ending positions of all the inner
> > clauses that generate such span.
> >
> > I'm working on the NearSpansOrdered class right now. I guess that
> > this modification could be trivial to do, but it requires to me time
> > to understand the code. Any hints about that?
> >
> > Actually (as a very inefficient way to proceed) I've added this
> > method to call *after each next()*, but it doesn't work as aspected:
> >
> > public Spans[] matchingSpans() {
> >
> >       ArrayList<Spans> list = new ArrayList<Spans>();
> >       if (subSpans.length == 0) return null;
> >       for(int pos = 0; pos < subSpans.length; pos++) {
> >           if (subSpans[pos].doc() != matchDoc) continue;
> >           if (subSpans[pos].start() >= matchStart &&
> > subSpans[pos].end() <= matchEnd)
> >           list.add(subSpans[pos]);
> >       }
> >       return list.toArray(new Spans[0]);
> > }
> >
> > As you can see, I'm just looping over the subSpans array, filtering
> > the ones having doc() == matchDoc and which span starts/end inside
> > the current near span (matchStart and matchEnd are the boundaries
> > returned by start() and ends() of NearSpansOrdered). This technique
> > doesn't work. Maybe the problem is that the subSpans are not in the
> > rigth state afte the next() call?
>
> Correct. The reason is that a match must be minimal length,
> and for that at least the matching subspans at the lowest
> position needs to be advanced beyond its matching position.
> This is the same for both the ordered and unordered case.
>
> So, to implement the matchingSpans() method, it will be necessary
> to copy the subspans when they are at the matching position.
> This will probably involve some fruitless copying for incomplete
> matches that never become a real match.
>
> There is also a difference beyond ordered/unordered.
> In the ordered case, no overlaps between the matching subspans
> are allowed, and in the unordered case overlaps are allowed.
>
> Regards,
> Paul Elschot
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

-- 
Claudio Corsi

package it.unipi.di.dante.lucene;

import org.apache.lucene.search.spans.Spans;

public interface NearSpans {
	
	public Spans[] matchingSpans();

}

package it.unipi.di.dante.lucene;

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.Spans;

/** A Spans that is formed from the ordered subspans of a SpanNearQuery
 * where the subspans do not overlap and have a maximum slop between them.
 * <p>
 * The formed spans only contains minimum slop matches.<br>
 * The matching slop is computed from the distance(s) between
 * the non overlapping matching Spans.<br>
 * Successive matches are always formed from the successive Spans
 * of the SpanNearQuery.
 * <p>
 * The formed spans may contain overlaps when the slop is at least 1.
 * For example, when querying using
 * <pre>t1 t2 t3</pre>
 * with slop at least 1, the fragment:
 * <pre>t1 t2 t1 t3 t2 t3</pre>
 * matches twice:
 * <pre>t1 t2 .. t3      </pre>
 * <pre>      t1 .. t2 t3</pre>
 */
public class NearSpansOrdered implements Spans, NearSpans {
  private final int allowedSlop;
  private boolean firstTime = true;
  private boolean more = false;

  /** The spans in the same order as the SpanNearQuery */
  private final Spans[] subSpans;
  
  private final List<Spans> subSpansCopy = new ArrayList<Spans>();
  
  class SpansCopy implements Spans {
	  private int doc, start, end;
	  public SpansCopy(Spans s) {
		  doc = s.doc(); start = s.start(); end = s.end();
	  }
	  public int doc() { return doc; }
	  public int start() { return start; }
	  public int end() { return end; }
	  
	public boolean next() throws IOException { return false; }
	public boolean skipTo(int target) throws IOException { return false; }
	public String toString() {return "doc:" + doc + ", start:" + start + ", end:" + end; }
  };

  /** Indicates that all subSpans have same doc() */
  private boolean inSameDoc = false;

  private int matchDoc = -1;
  private int matchStart = -1;
  private int matchEnd = -1;

  private final Spans[] subSpansByDoc;
  private final Comparator spanDocComparator = new Comparator() {
    public int compare(Object o1, Object o2) {
      return ((Spans)o1).doc() - ((Spans)o2).doc();
    }
  };
  
  private SpanNearQuery query;
  
  public NearSpansOrdered(SpanNearQuery spanNearQuery, IndexReader reader) throws IOException {
    if (spanNearQuery.getClauses().length < 2) {
      throw new IllegalArgumentException("Less than 2 clauses: " + spanNearQuery);
    }
    allowedSlop = spanNearQuery.getSlop();
    SpanQuery[] clauses = spanNearQuery.getClauses();
    subSpans = new Spans[clauses.length];
    subSpansByDoc = new Spans[clauses.length];
    for (int i = 0; i < clauses.length; i++) {
      subSpans[i] = clauses[i].getSpans(reader);
      subSpansByDoc[i] = subSpans[i]; // used in toSameDoc()
    }
    query = spanNearQuery; // kept for toString() only.
  }

  // inherit javadocs
  public int doc() { return matchDoc; }

  // inherit javadocs
  public int start() { return matchStart; }

  // inherit javadocs
  public int end() { return matchEnd; }

  // inherit javadocs
  public boolean next() throws IOException {

	if (firstTime) {
		firstTime = false;
		for (int i = 0; i < subSpans.length; i++) {
			if (!subSpans[i].next()) {
				more = false;
				return false;
			}
		}
		more = true;
	}
	
	return advanceAfterOrdered();
  }
  
  public void saveSubSpans(Spans spansToCopy) {
	  subSpansCopy.add(new SpansCopy(spansToCopy));
  }

  // inherit javadocs
  public boolean skipTo(int target) throws IOException {
    if (firstTime) {
      firstTime = false;
      for (int i = 0; i < subSpans.length; i++) {
        if (! subSpans[i].skipTo(target)) {
          more = false;
          return false;
        }
      }
      more = true;
    } else if (more && (subSpans[0].doc() < target)) {
      if (subSpans[0].skipTo(target)) {
        inSameDoc = false;
      } else {
        more = false;
        return false;
      }
    }
    return advanceAfterOrdered();
  }
  
  /** Advances the subSpans to just after an ordered match with a minimum slop
   * that is smaller than the slop allowed by the SpanNearQuery.
   * @return true iff there is such a match.
   */
  private boolean advanceAfterOrdered() throws IOException {
    while (more && (inSameDoc || toSameDoc())) {
      if (stretchToOrder() && shrinkToAfterShortestMatch()) {
        return true;
      }
    }
    return false; // no more matches
  }


  /** Advance the subSpans to the same document */
  private boolean toSameDoc() throws IOException {
    Arrays.sort(subSpansByDoc, spanDocComparator);
    int firstIndex = 0;
    int maxDoc = subSpansByDoc[subSpansByDoc.length - 1].doc();
    while (subSpansByDoc[firstIndex].doc() != maxDoc) {
      if (! subSpansByDoc[firstIndex].skipTo(maxDoc)) {
        more = false;
        inSameDoc = false;
        return false;
      }
      maxDoc = subSpansByDoc[firstIndex].doc();
      if (++firstIndex == subSpansByDoc.length) {
        firstIndex = 0;
      }
    }
    for (int i = 0; i < subSpansByDoc.length; i++) {
      assert (subSpansByDoc[i].doc() == maxDoc)
             : " NearSpansOrdered.toSameDoc() spans " + subSpansByDoc[0]
                                 + "\n at doc " + subSpansByDoc[i].doc()
                                 + ", but should be at " + maxDoc;
    }
    inSameDoc = true;
    return true;
  }
  
  /** Check whether two Spans in the same document are ordered.
   * @param spans1 
   * @param spans2 
   * @return true iff spans1 starts before spans2
   *              or the spans start at the same position,
   *              and spans1 ends before spans2.
   */
  static final boolean docSpansOrdered(Spans spans1, Spans spans2) {
    assert spans1.doc() == spans2.doc() : "doc1 " + spans1.doc() + " != doc2 " + spans2.doc();
    int start1 = spans1.start();
    int start2 = spans2.start();
    /* Do not call docSpansOrdered(int,int,int,int) to avoid invoking .end() : */
    return (start1 == start2) ? (spans1.end() < spans2.end()) : (start1 < start2);
  }

  /** Like [EMAIL PROTECTED] #docSpansOrdered(Spans,Spans)}, but use the spans
   * starts and ends as parameters.
   */
  private static final boolean docSpansOrdered(int start1, int end1, int start2, int end2) {
    return (start1 == start2) ? (end1 < end2) : (start1 < start2);
  }

  /** Order the subSpans within the same document by advancing all later spans
   * after the previous one.
   */
  private boolean stretchToOrder() throws IOException {
    matchDoc = subSpans[0].doc();
    for (int i = 1; inSameDoc && (i < subSpans.length); i++) {
      while (! docSpansOrdered(subSpans[i-1], subSpans[i])) {
        if (! subSpans[i].next()) {
          inSameDoc = false;
          more = false;
          break;
        } else if (matchDoc != subSpans[i].doc()) {
          inSameDoc = false;
          break;
        }
      }
    }
    return inSameDoc;
  }

  /** The subSpans are ordered in the same doc, so there is a possible match.
   * Compute the slop while making the match as short as possible by advancing
   * all subSpans except the last one in reverse order.
   */
  private boolean shrinkToAfterShortestMatch() throws IOException {
    matchStart = subSpans[subSpans.length - 1].start();
    matchEnd = subSpans[subSpans.length - 1].end();
    int matchSlop = 0;
    int lastStart = matchStart;
    int lastEnd = matchEnd;
    
    subSpansCopy.clear();
    saveSubSpans(subSpans[subSpans.length - 1]);
    
    for (int i = subSpans.length - 2; i >= 0; i--) {
      Spans prevSpans = subSpans[i];
      int prevStart = prevSpans.start();
      int prevEnd = prevSpans.end();
      
      while (true) { // Advance prevSpans until after (lastStart, lastEnd)
    	  
    	saveSubSpans(prevSpans);
    	  
        if (!prevSpans.next()) {
        	inSameDoc = false;
			more = false;
			break; // Check remaining subSpans for final match.
		} 
        else 
        	if (matchDoc != prevSpans.doc()) {
				inSameDoc = false; // The last subSpans is not advanced
				break; // Check remaining subSpans for last match in this document.
        	} 
        	else {
				int ppStart = prevSpans.start();
				int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()

				if (!docSpansOrdered(ppStart, ppEnd, lastStart, lastEnd)) {
					break; // Check remaining subSpans.
				} 
				else { // prevSpans still before (lastStart, lastEnd)
					prevStart = ppStart;
					prevEnd = ppEnd;
				}
			}
      }
      
      assert prevStart <= matchStart;
      if (matchStart > prevEnd) { // Only non overlapping spans add to slop.
        matchSlop += (matchStart - prevEnd);
      }
      /* Do not break on (matchSlop > allowedSlop) here to make sure
       * that subSpans[0] is advanced after the match, if any.
       */
      matchStart = prevStart;
      lastStart = prevStart;
      lastEnd = prevEnd;
    }
    return matchSlop <= allowedSlop; // ordered and allowed slop
  }

  public String toString() {
    return getClass().getName() + "("+query.toString()+")@"+
      (firstTime?"START":(more?(doc()+":"+start()+"-"+end()):"END"));
  }
  
  public Spans[] matchingSpans() {
	  
	  ArrayList<Spans> list = new ArrayList<Spans>();

	  if (subSpansCopy.size() == 0) return null;
	  
	  for(int k = 0; k < subSpansCopy.size(); k++) {
		  
		  if (subSpansCopy.get(k).doc() != matchDoc) continue;
		  
		  if (subSpansCopy.get(k).start() >= matchStart && subSpansCopy.get(k).end() <= matchEnd)
			  list.add(subSpansCopy.get(k));
	  }
	  
	  return list.toArray(new Spans[0]);
  }
}

package it.unipi.di.dante.lucene;

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;

import java.util.List;
import java.util.ArrayList;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.util.PriorityQueue;

public class NearSpansUnordered implements Spans, NearSpans {
  private SpanNearQuery query;

  private List ordered = new ArrayList();         // spans in query order
  private int slop;                               // from query

  private SpansCell first;                        // linked list of spans
  private SpansCell last;                         // sorted by doc only

  private int totalLength;                        // sum of current lengths

  private CellQueue queue;                        // sorted queue of spans
  private SpansCell max;                          // max element in queue

  private boolean more = true;                    // true iff not done
  private boolean firstTime = true;               // true before first next()

  private class CellQueue extends PriorityQueue {
    public CellQueue(int size) {
      initialize(size);
    }
    
    protected final boolean lessThan(Object o1, Object o2) {
      SpansCell spans1 = (SpansCell)o1;
      SpansCell spans2 = (SpansCell)o2;
      if (spans1.doc() == spans2.doc()) {
        return NearSpansOrdered.docSpansOrdered(spans1, spans2);
      } else {
        return spans1.doc() < spans2.doc();
      }
    }
  }


  /** Wraps a Spans, and can be used to form a linked list. */
  private class SpansCell implements Spans {
    private Spans spans;
    private SpansCell next;
    private int length = -1;
    private int index;

    public SpansCell(Spans spans, int index) {
      this.spans = spans;
      this.index = index;
    }

    public boolean next() throws IOException {
      return adjust(spans.next());
    }

    public boolean skipTo(int target) throws IOException {
      return adjust(spans.skipTo(target));
    }
    
    private boolean adjust(boolean condition) {
      if (length != -1) {
        totalLength -= length;  // subtract old length
      }
      if (condition) {
        length = end() - start(); 
        totalLength += length; // add new length

        if (max == null || doc() > max.doc()
            || (doc() == max.doc()) && (end() > max.end())) {
          max = this;
        }
      }
      more = condition;
      return condition;
    }

    public int doc() { return spans.doc(); }
    public int start() { return spans.start(); }
    public int end() { return spans.end(); }

    public String toString() { return spans.toString() + "#" + index; }
  }


  public NearSpansUnordered(SpanNearQuery query, IndexReader reader)
    throws IOException {
    this.query = query;
    this.slop = query.getSlop();

    SpanQuery[] clauses = query.getClauses();
    queue = new CellQueue(clauses.length);
    for (int i = 0; i < clauses.length; i++) {
      SpansCell cell =
        new SpansCell(clauses[i].getSpans(reader), i);
      ordered.add(cell);
    }
  }

  public boolean next() throws IOException {
    if (firstTime) {
      initList(true);
      listToQueue(); // initialize queue
      firstTime = false;
    } else if (more) {
      if (min().next()) { // trigger further scanning
        queue.adjustTop(); // maintain queue
      } else {
        more = false;
      }
    }

    while (more) {

      boolean queueStale = false;

      if (min().doc() != max.doc()) {             // maintain list
        queueToList();
        queueStale = true;
      }

      // skip to doc w/ all clauses

      while (more && first.doc() < last.doc()) {
        more = first.skipTo(last.doc());          // skip first up to last
        firstToLast();                            // and move it to the end
        queueStale = true;
      }

      if (!more) return false;

      // found doc w/ all clauses

      if (queueStale) {                           // maintain the queue
        listToQueue();
        queueStale = false;
      }

      if (atMatch()) {
        return true;
      }
      
      more = min().next();
      if (more) {
        queue.adjustTop();                      // maintain queue
      }
    }
    return false;                                 // no more matches
  }

  public boolean skipTo(int target) throws IOException {
    if (firstTime) {                              // initialize
      initList(false);
      for (SpansCell cell = first; more && cell!=null; cell=cell.next) {
        more = cell.skipTo(target);               // skip all
      }
      if (more) {
        listToQueue();
      }
      firstTime = false;
    } else {                                      // normal case
      while (more && min().doc() < target) {      // skip as needed
        if (min().skipTo(target)) {
          queue.adjustTop();
        } else {
          more = false;
        }
      }
    }
    return more && (atMatch() ||  next());
  }

  private SpansCell min() { return (SpansCell)queue.top(); }

  public int doc() { return min().doc(); }
  public int start() { return min().start(); }
  public int end() { return max.end(); }


  public String toString() {
    return getClass().getName() + "("+query.toString()+")@"+
      (firstTime?"START":(more?(doc()+":"+start()+"-"+end()):"END"));
  }

  private void initList(boolean next) throws IOException {
    for (int i = 0; more && i < ordered.size(); i++) {
      SpansCell cell = (SpansCell)ordered.get(i);
      if (next)
        more = cell.next();                       // move to first entry
      if (more) {
        addToList(cell);                          // add to list
      }
    }
  }

  private void addToList(SpansCell cell) {
    if (last != null) {			  // add next to end of list
      last.next = cell;
    } else
      first = cell;
    last = cell;
    cell.next = null;
  }

  private void firstToLast() {
    last.next = first;			  // move first to end of list
    last = first;
    first = first.next;
    last.next = null;
  }

  private void queueToList() {
    last = first = null;
    while (queue.top() != null) {
      addToList((SpansCell)queue.pop());
    }
  }
  
  private void listToQueue() {
    queue.clear(); // rebuild queue
    for (SpansCell cell = first; cell != null; cell = cell.next) {
      queue.put(cell);                      // add to queue from list
    }
  }

  private boolean atMatch() {
    return (min().doc() == max.doc())
        && ((max.end() - min().start() - totalLength) <= slop);
  }
  
  public Spans[] matchingSpans() {
	  
	  ArrayList<Spans> list = new ArrayList<Spans>();

	  if (first == null) return null;
	  
	  for (SpansCell cell = first; cell != null; cell = cell.next) {
	      
	      if (cell.doc() != doc()) continue;
	      
	      if (cell.start() >= start() && cell.end() <= end())
			  list.add(cell);
	  }
	  
	  return list.toArray(new Spans[0]);
  }
}

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Fwd: SpanNearQuery: how to get the "intra-span" matching positions?

Reply via email to