I think you have hit on all the best solutions.

The Jira issues you mentioned do hold out some promising solutions here, but they are a long way off: they require significant re-plumbing, and I'm not sure they are getting much attention at the moment. You should vote for those issues, I think. In the meantime, your payload solution is probably the best in terms of efficiency; you can find code that does that kind of thing in LUCENE-3318 if you poke around a bit. However, it might be simplest to just use the existing highlighters for this sort of thing and not worry about spans.

-Mike

On 1/19/2012 9:46 PM, Nishad Prakash wrote:

I'm going to cry. Is there really no way to retrieve offsets by position, rather than by term?


On 1/13/2012 6:33 PM, Nishad Prakash wrote:
I'm having a set of issues in trying to use Lucene that are all
connected to the difficulty of retrieving offsets.  I need some advice
on how best to proceed, or a pointer if this has been answered
somewhere.

My app requires that I display all portions of the documents where the
search term or terms are found.  Because of this, I always use
SpanQuery.getSpans(), since knowing only which documents matched
isn't enough.  However, this still leaves me with a lot of unresolved
problems.

- I cannot find any standard way to map the returned span positions to
offsets.  For single term queries, I can get at offsets by writing a
custom TermVectorMapper.  For more complex queries, I have to (I
think) use rewrite(), extract the target terms, then load their term
vectors and go through them to find the positions that match what's in
the span, and pull up the corresponding offsets.  This
is...surprising.  We took considerable pains during indexing to
maintain the offset information through several layers of analysis
filters, but now we can't get to it while searching without
considerably more pain.  Am I missing something obvious?

- More generally, I would like to be able to iterate over positions in
a document, collecting offset information for those positions as I go.
Is there any way to do this?  I didn't find such an iterator, but I
may not know where to look.  Everything I did find was tied to
iterating over positions for specific terms, which is not relevant
here.
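
One way to picture the term-vector workaround described above is a small per-document lookup from position to offsets. The sketch below is plain Java and not a Lucene API: the class PositionOffsetMap and its methods are hypothetical names, the parallel arrays stand in for whatever positions and offsets are read from each term's vector, and the span is assumed to be half-open (start inclusive, end exclusive).

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical helper (not part of Lucene): position -> {startOffset, endOffset} for one document. */
final class PositionOffsetMap {
    // TreeMap keeps positions sorted so a span can be answered with a range view.
    private final TreeMap<Integer, int[]> byPosition = new TreeMap<>();

    /** Adds one term's parallel position/offset arrays, e.g. as read from its term vector. */
    void addTerm(int[] positions, int[] startOffsets, int[] endOffsets) {
        for (int i = 0; i < positions.length; i++) {
            int[] offsets = {startOffsets[i], endOffsets[i]};
            // If several terms share a position (synonyms, say), keep the widest range.
            byPosition.merge(positions[i], offsets, (a, b) ->
                    new int[] {Math.min(a[0], b[0]), Math.max(a[1], b[1])});
        }
    }

    /**
     * Resolves the position span [spanStart, spanEnd) to character offsets.
     * Returns {startOffset, endOffset}, or null if no known position lies in the span.
     */
    int[] resolve(int spanStart, int spanEnd) {
        if (spanStart >= spanEnd) {
            return null;
        }
        NavigableMap<Integer, int[]> inSpan =
                byPosition.subMap(spanStart, true, spanEnd, false);
        if (inSpan.isEmpty()) {
            return null;
        }
        int start = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        for (int[] offsets : inSpan.values()) {
            start = Math.min(start, offsets[0]);
            end = Math.max(end, offsets[1]);
        }
        return new int[] {start, end};
    }
}
```

With only the query's terms loaded, the map is sparse, which is enough to resolve a span's outer offsets as long as the span's first and last positions belong to query terms.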

Right now, I can think of these options:
1) Get at offsets via term vectors; try to make that as fast as
possible by "short-circuiting" how much of the term vector we load.
2) Maintain external per-document position->offset maps outside
Lucene.
3) Maybe store offsets as payload?
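
For option 3, the offsets would have to be packed into bytes at index time and unpacked again at search time. What follows is a minimal sketch of that encoding step only, assuming two 4-byte big-endian ints per token. OffsetPayloadCodec is a hypothetical helper, not a Lucene class, and attaching the bytes to each token (for example from a custom token filter) is left out.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper (not part of Lucene): packs a token's offsets into a payload-sized byte array. */
final class OffsetPayloadCodec {
    /** Two 4-byte ints: start offset, then end offset. */
    static final int PAYLOAD_LENGTH = 8;

    private OffsetPayloadCodec() {
    }

    /** Encodes the offsets as 8 bytes; ByteBuffer writes big-endian by default. */
    static byte[] encode(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException("invalid offsets: " + startOffset + ", " + endOffset);
        }
        return ByteBuffer.allocate(PAYLOAD_LENGTH)
                .putInt(startOffset)
                .putInt(endOffset)
                .array();
    }

    /** Decodes bytes produced by encode() back into {startOffset, endOffset}. */
    static int[] decode(byte[] payload) {
        if (payload == null || payload.length != PAYLOAD_LENGTH) {
            throw new IllegalArgumentException("expected " + PAYLOAD_LENGTH + " payload bytes");
        }
        ByteBuffer buffer = ByteBuffer.wrap(payload);
        return new int[] {buffer.getInt(), buffer.getInt()};
    }
}
```

The fixed 8-byte layout trades index size for simplicity; a variable-length encoding would be smaller but needs more care when decoding.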

But is there already a (non-term-vector based) way of getting at
offsets that I don't know about?  My ideal solution would be an
iterable position->offset map for each document; failing that, an
enhancement to getSpans() that returns offset information along with
position.

It seems like LUCENE-2878 and LUCENE-3318 are concerned with at least
some of these issues, but the comments are a bit inside-baseball for
me at this stage.  So I would greatly appreciate any advice on this
issue.

nishad

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


