On Oct 18, 2007, at 11:53 AM, Binkley, Peter wrote:
I think the requirements I mentioned in a comment
(https://issues.apache.org/jira/browse/SOLR-380#action_12535296) justify
abandoning the one-page-per-document approach. The increment-gap
approach would break the cross-page searching, and would involve about
as much work as the stored map, since the gap would have to vary
depending on the number of terms on each page, wouldn't it? (if there
are 100 terms on page one, the gap has to be 900 to get page two to
start at 1000 - or can you specify the absolute position you want for a
term?).

Yeah, one Solr document per page is not sufficient for this purpose.

As for the position increment gap and querying across page boundaries: I still think having all the text in a single field is necessary, but pages should be separated in such a way that a query can control whether or not it spans them. This could be accomplished trivially with a position increment gap. The gap only depends on the slop factor you want to allow for phrase queries, not on the number of tokens per page, because positions accumulate across the values of a multivalued field and the gap is simply added after the last token of each page. With "quick fox"~10, for example, a gap of 100, say, would prevent that query from matching across a page boundary. I haven't thought this through thoroughly, though, so more thinking is needed here.
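To make that a bit more concrete, here's a thumbnail of what I have in mind, sketched against Lucene's Java API (the class name and the gap value are just illustrative, and the exact signatures will depend on which Lucene version you're on; in Solr this is simply the positionIncrementGap attribute on the field type). Index each page as a separate value of one multivalued field, have the analyzer report a gap bigger than any slop you plan to allow, and sloppy phrase queries can no longer bridge pages:

import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

public class PageGapSketch {

  /** Analyzer that inserts a large position gap between successive values
   *  of the same field, i.e. between the pages of one book. */
  static final Analyzer PAGE_GAP_ANALYZER = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      return new TokenStreamComponents(new StandardTokenizer());
    }

    @Override
    public int getPositionIncrementGap(String fieldName) {
      return 100; // only needs to exceed any slop we intend to allow
    }
  };

  /** One Lucene document per book, one field value per page. */
  static Document bookDocument(List<String> pages) {
    Document book = new Document();
    for (String pageText : pages) {
      book.add(new TextField("text", pageText, Field.Store.NO));
    }
    // A query like "quick fox"~10 stays within a page, because a slop of 10
    // can never bridge the 100-position gap inserted between page values.
    return book;
  }
}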

I think the problem of indexing books (or any text with arbitrary
subdivisions) is common enough that a generic approach like this would
be useful to more people than just me, and justifies some enhancements
within Solr to make the solution easy to reuse; but maybe when we've
figured out the best approach it will become clear how much of it is
worth packing into Solr.

Most definitely this would be a VERY useful addition to Solr. I know of several folks who are working with XTF (which uses a custom version of Lucene and other interesting data structures) to achieve this capability, but blending that sort of thing into Solr would make life a lot better for these projects.

(and just to clarify roles: Tricia's the one who'll actually be coding
this, if it's feasible; I'm just helping to think out requirements and
approaches based on a project in hand.)

There is more to consider here. Lucene now supports "payloads": additional metadata on terms that can be leveraged with custom queries. I've not yet tinkered with them myself, but my understanding is that they would be useful for (and were in fact designed, in part, for) representing structured documents. It would behoove us to investigate how payloads might be leveraged for your needs here, such that a single field could represent an entire document, with payloads representing the hierarchical structure. This will require that specialized Analyzer and Query subclasses be created to take advantage of payloads. The Lucene community itself is just now starting to exploit this new feature, so there isn't a lot out there on it yet, but I think it holds great promise for these purposes.
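Since I haven't written payload code myself yet, take this as nothing more than a sketch of the flavor of thing I mean (the PagePayloadFilter name, the one-byte page encoding, and the particular calls are assumptions, written against a current Lucene API rather than a specific release): a TokenFilter stamps every token with the page it came from, so a single "text" field can carry the whole book while each term still knows its page.

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

/** Illustrative filter that tags every token with the page it came from. */
public final class PagePayloadFilter extends TokenFilter {

  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final BytesRef pagePayload;

  public PagePayloadFilter(TokenStream input, int pageNumber) {
    super(input);
    // toy encoding: one byte per page number; a real filter would use a wider encoding
    this.pagePayload = new BytesRef(new byte[] { (byte) pageNumber });
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    payloadAtt.setPayload(pagePayload);
    return true;
  }
}

On the query side you'd pair this with a span-style, payload-aware query that reads those bytes back at match time; that's exactly the part of the payload machinery that's still settling down in Lucene.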

        Erik
