Thank you. Yes, the Map<Integer, OffsetAttribute> charOffsets maps each token
position (Integer) to the character offsets for that token...so I think I'm
good?
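
To make the position-keyed lookup concrete, here is a minimal sketch in plain Java. It assumes a hypothetical CharRange class standing in for OffsetAttribute and hand-fills the map with the offsets a collector would be expected to report for "quick" (position 1) and "fox" (position 3); none of these names come from Lucene.

```java
import java.util.HashMap;
import java.util.Map;

public class SpanCharOffsets {

    /** Hypothetical stand-in for OffsetAttribute: character range [start, end). */
    static final class CharRange {
        final int start;
        final int end;

        CharRange(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Resolves a span given by its first and last token positions (inclusive)
     * to a character range, using a map keyed by token position.
     * Returns null if either boundary token was never collected.
     */
    static CharRange resolve(Map<Integer, CharRange> byPosition,
                             int startPos, int endPosInclusive) {
        CharRange first = byPosition.get(startPos);
        CharRange last = byPosition.get(endPosInclusive);
        if (first == null || last == null) {
            return null;
        }
        return new CharRange(first.start, last.end);
    }

    public static void main(String[] args) {
        String s = "the quick brown fox jumped over the lazy dog";

        // Assumed collector output: only the matching terms are present,
        // so the keys are sparse token positions, not list indexes.
        Map<Integer, CharRange> byPosition = new HashMap<>();
        byPosition.put(1, new CharRange(4, 9));   // "quick"
        byPosition.put(3, new CharRange(16, 19)); // "fox"

        // Span positions as stored in the snippet below:
        // startPosition() = 1, endPosition() - 1 = 3.
        CharRange span = resolve(byPosition, 1, 3);
        System.out.println(s.substring(span.start, span.end)); // prints "quick brown fox"
    }
}
```

Because the keys are sparse positions, a Map (rather than a List indexed by call order) is what keeps the lookup correct.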
As part of LUCENE-5317, I have a DocTokenOffsetsVisitor interface and
SpanCrawler that runs that visitor against an IndexSearcher...
The visitor sees a Document and a list of character offsets for hits in that
document.
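
For readers who have not looked at the patch, the visitor idea can be sketched roughly as follows. This is not the LUCENE-5317 API: HitOffsetsVisitor, Hit, and SnippetCollector are hypothetical names, and a plain String stands in for the Lucene Document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetsVisitorSketch {

    /** Hypothetical character range [start, end) of one hit in a document. */
    static final class Hit {
        final int start;
        final int end;

        Hit(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical visitor, called once per matching document. */
    interface HitOffsetsVisitor {
        void visit(String docText, List<Hit> hits);
    }

    /** Example visitor that collects the matched substrings. */
    static final class SnippetCollector implements HitOffsetsVisitor {
        final List<String> snippets = new ArrayList<>();

        @Override
        public void visit(String docText, List<Hit> hits) {
            for (Hit h : hits) {
                snippets.add(docText.substring(h.start, h.end));
            }
        }
    }

    public static void main(String[] args) {
        SnippetCollector collector = new SnippetCollector();
        collector.visit("the quick brown fox jumped over the lazy dog",
                Arrays.asList(new Hit(4, 19)));
        System.out.println(collector.snippets); // prints [quick brown fox]
    }
}
```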
Once I update that to include the more modern code below (optionally for those
storing offsets)...is there any interest in integrating LUCENE-5317 or
components of it so that others don't have to reinvent the wheel below?
Cheers,
Tim
-----Original Message-----
From: Alan Woodward [mailto:[email protected]]
Sent: Tuesday, November 03, 2015 4:25 AM
To: [email protected]
Subject: Re: extracting charoffsets from SpanWeight's getSpans() in 5.3.1?
The second parameter passed to SpanCollector.collectLeaf() is the position,
rather than an index of any kind, which I think is going to mess things up for
you. But other than that, you've got the right idea. :-)
Alan Woodward
www.flax.co.uk
On 3 Nov 2015, at 00:26, Allison, Timothy B. wrote:
> All,
>
> I'm trying to find all spans in a given String via stored offsets in Lucene
> 5.3.1. I wanted to use the Highlighter with a NullFragmenter, but that is
> highlighting only the matching terms, not the full Spans (related to
> LUCENE-6796?).
>
> My current code iterates through the spans, stores the span positions in one
> list and gathers the character offsets via a SpanCollector in a Map<Integer,
> OffsetAttribute>. Is there a simpler way?
>
> Something like this:
>
> String s = "the quick brown fox jumped over the lazy dog";
> String field = "f";
> Analyzer analyzer = new StandardAnalyzer();
>
> SpanQuery spanQuery = new SpanNearQuery(
> new SpanQuery[] {
> new SpanTermQuery(new Term(field, "fox")),
> new SpanTermQuery(new Term(field, "quick"))
> },
> 3,
> false
> );
>
>
> MemoryIndex index = new MemoryIndex(true);
>
>
> index.addField(field, s, analyzer);
> index.freeze();
>
> IndexSearcher searcher = index.createSearcher();
> IndexReader reader = searcher.getIndexReader();
> spanQuery = (SpanQuery) spanQuery.rewrite(reader);
> SpanWeight weight = (SpanWeight) searcher.createWeight(spanQuery, false);
> Spans spans = weight.getSpans(reader.leaves().get(0),
> SpanWeight.Postings.OFFSETS);
>
> if (spans == null) {
> //do something with full string
> return;
> }
>
> OffsetSpanCollector offsetSpanCollector = new OffsetSpanCollector();
> List<OffsetAttribute> spanPositions = new ArrayList<>();
> while (spans.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
> while (spans.nextStartPosition() != Spans.NO_MORE_POSITIONS) {
> OffsetAttributeImpl offsetAttribute = new OffsetAttributeImpl();
> offsetAttribute.setOffset(spans.startPosition(),
> spans.endPosition()-1);
> spanPositions.add(offsetAttribute);
> spans.collect(offsetSpanCollector);
> }
> }
> Map<Integer, OffsetAttribute> charOffsets = offsetSpanCollector.getOffsets();
> //now iterate through the list of spanPositions and grab the character
> //offsets for the start and end tokens of each span from the charOffsets ...
>
>
>
>
> private class OffsetSpanCollector implements SpanCollector {
> Map<Integer, OffsetAttribute> charOffsets = new HashMap<>();
>
> @Override
> public void collectLeaf(PostingsEnum postingsEnum, int i, Term term) throws IOException {
>
> OffsetAttributeImpl offsetAttribute = new OffsetAttributeImpl();
> offsetAttribute.setOffset(postingsEnum.startOffset(),
> postingsEnum.endOffset());
>
> charOffsets.put(i, offsetAttribute);
> }
>
> @Override
> public void reset() {
>
> //don't think I need to do anything with this?
> }
>
> public Map<Integer, OffsetAttribute> getOffsets() {
> return charOffsets;
> }
> }
>
>