On Oct 18, 2007, at 11:53 AM, Binkley, Peter wrote:
I think the requirements I mentioned in a comment
(https://issues.apache.org/jira/browse/SOLR-380#action_12535296)
justify
abandoning the one-page-per-document approach. The increment-gap
approach would break the cross-page searching, and would involve about
as much work as the stored map, since the gap would have to vary
depending on the number of terms on each page, wouldn't it? (if there
are 100 terms on page one, the gap has to be 900 to get page two to
start at 1000 - or can you specify the absolute position you want
for a
term?).
Yeah, one Solr document per page is not sufficient for this purpose.
As for the position increment gap and querying across page boundaries: I
still think having all the text in a single field is necessary, but with
the pages separated in such a way that a query can control whether or
not it spans them. This could be accomplished trivially with a position
increment gap. Because positions are assigned consecutively within each
field instance and the gap is only inserted between instances, the gap
you choose depends only on the largest slop factor you need for phrase
queries, not on the number of tokens per page. For "quick fox"~10, for
example, the default gap of 100 would prevent that query from matching
across page boundaries. I haven't thought this through thoroughly, so
more thinking is needed here.
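To make that concrete, here's a rough sketch against the Lucene 2.x
Analyzer API (the class and field names are just illustrative, not
anything that exists in Solr today): each page is added as a separate
instance of the same field, and the analyzer reports a gap between
instances, so positions stay consecutive within a page while pages get
pushed apart.

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Sketch: delegate tokenization to StandardAnalyzer, but insert a
    // gap of 100 positions between successive instances of the same
    // field (i.e. between pages).
    public class PageGapAnalyzer extends Analyzer {
        private final Analyzer delegate = new StandardAnalyzer();

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return delegate.tokenStream(fieldName, reader);
        }

        public int getPositionIncrementGap(String fieldName) {
            // Any phrase query with slop under 100 cannot match across
            // pages; a larger slop (or a span query) can be used when
            // cross-page matches are wanted.
            return 100;
        }
    }

At index time you'd add one field instance per page, e.g.
doc.add(new Field("page_text", pageText, Field.Store.NO,
Field.Index.TOKENIZED)) in a loop over the pages (again, "page_text" is
just an example name). In Solr the same effect comes from the
positionIncrementGap attribute on the field type in schema.xml.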
I think the problem of indexing books (or any text with arbitrary
subdivisions) is common enough that a generic approach like this would
be useful to more people than just me, and justifies some enhancements
within Solr to make the solution easy to reuse; but maybe when we've
figured out the best approach it will become clear how much of it is
worth packing into Solr.
Most definitely, this would be a VERY useful addition to Solr. I know
of several folks who are working with XTF (which uses a custom
version of Lucene and other interesting data structures) to achieve
this capability, but blending that sort of thing into Solr would make
life a lot better for those projects.
(and just to clarify roles: Tricia's the one who'll actually be coding
this, if it's feasible; I'm just helping to think out requirements and
approaches based on a project in hand.)
There is more to consider here. Lucene now supports "payloads":
arbitrary bytes stored with each term position that can be leveraged
by custom queries. I've not yet tinkered with them myself, but my
understanding is that they would be useful for (and were in fact
designed, in part, for) representing structured documents. It would
behoove us to investigate how payloads might be leveraged for your
needs here, such that a single field could represent an entire
document, with payloads carrying the hierarchical structure (page,
chapter, and so on). This will require that specialized Analyzer and
Query subclasses be created to take advantage of payloads. The Lucene
community itself is just now starting to exploit this new feature, so
there isn't a lot out there on it yet, but I think it holds great
promise for these purposes.
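To sketch what that might look like with the payload API as it stands
in Lucene 2.2 (the filter and the one-byte encoding are hypothetical,
just to show the shape of it), a TokenFilter could stamp every token
with the page number it came from:

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Payload;

    // Sketch: attach the page number as a one-byte payload to every
    // token. A custom Query/Scorer could read the payload at match
    // time to recover the page (or chapter, section, etc.) of a hit.
    public class PagePayloadFilter extends TokenFilter {
        private final byte page;

        public PagePayloadFilter(TokenStream input, int page) {
            super(input);
            this.page = (byte) page; // real code would need a richer encoding
        }

        public Token next() throws IOException {
            Token token = input.next();
            if (token != null) {
                token.setPayload(new Payload(new byte[] { page }));
            }
            return token;
        }
    }

On the query side the payload is read back through
TermPositions.getPayload(), which is where the specialized Query
subclasses would come in.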
Erik