On Oct 18, 2007, at 11:53 AM, Binkley, Peter wrote:
I think the requirements I mentioned in a comment
(https://issues.apache.org/jira/browse/SOLR-380#action_12535296)
justify
abandoning the one-page-per-document approach. The increment-gap
approach would break the cross-page searching, and would involve about
as much work as the stored map, since the gap would have to vary
depending on the number of terms on each page, wouldn't it? (if there
are 100 terms on page one, the gap has to be 900 to get page two to
start at 1000 - or can you specify the absolute position you want
for a
term?).
Yeah, one Solr document per page is not sufficient for this purpose.
As for the position increment gap and querying across page boundaries: I
still think having all the text in a single field is necessary, but with
the pages separated in such a way that a query can control whether or
not it spans them. This could be accomplished trivially with a position
increment gap. Because positions are assigned consecutively within each
field instance and the gap is only inserted between instances, the gap
you choose depends only on the largest slop factor you need for phrase
queries, not on the number of tokens per page. For "quick fox"~10, for
example, the default gap of 100 would prevent that query from matching
across page boundaries. I haven't thought this through thoroughly, so
more thinking is needed here.
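To make that concrete, here's a rough sketch against the Lucene 2.x
Analyzer API (the class and field names are just illustrative, not
anything that exists in Solr today): each page is added as a separate
instance of the same field, and the analyzer reports a gap between
instances, so positions stay consecutive within a page while pages get
pushed apart.

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Sketch: delegate tokenization to StandardAnalyzer, but insert a
    // gap of 100 positions between successive instances of the same
    // field (i.e. between pages).
    public class PageGapAnalyzer extends Analyzer {
        private final Analyzer delegate = new StandardAnalyzer();

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return delegate.tokenStream(fieldName, reader);
        }

        public int getPositionIncrementGap(String fieldName) {
            // Any phrase query with slop under 100 cannot match across
            // pages; a larger slop (or a span query) can be used when
            // cross-page matches are wanted.
            return 100;
        }
    }

At index time you'd add one field instance per page, e.g.
doc.add(new Field("page_text", pageText, Field.Store.NO,
Field.Index.TOKENIZED)) in a loop over the pages (again, "page_text" is
just an example name). In Solr the same effect comes from the
positionIncrementGap attribute on the field type in schema.xml.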
I think the problem of indexing books (or any text with arbitrary
subdivisions) is common enough that a generic approach like this would
be useful to more people than just me, and justifies some enhancements
within Solr to make the solution easy to reuse; but maybe when we've
figured out the best approach it will become clear how much of it is
worth packing into Solr.
Most definitely, this would be a VERY useful addition to Solr. I know
of several folks who are working with XTF (which uses a custom
version of Lucene and other interesting data structures) to achieve
this capability, but blending that sort of thing into Solr would make
life a lot better for those projects.
(and just to clarify roles: Tricia's the one who'll actually be coding
this, if it's feasible; I'm just helping to think out requirements and
approaches based on a project in hand.)
There is more to consider here. Lucene now supports "payloads":
arbitrary bytes stored with each term position that can be leveraged
by custom queries. I've not yet tinkered with them myself, but my
understanding is that they would be useful for (and were in fact
designed, in part, for) representing structured documents. It would
behoove us to investigate how payloads might be leveraged for your
needs here, such that a single field could represent an entire
document, with payloads carrying the hierarchical structure (page,
chapter, and so on). This will require that specialized Analyzer and
Query subclasses be created to take advantage of payloads. The Lucene
community itself is just now starting to exploit this new feature, so
there isn't a lot out there on it yet, but I think it holds great
promise for these purposes.
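To sketch what that might look like with the payload API as it stands
in Lucene 2.2 (the filter and the one-byte encoding are hypothetical,
just to show the shape of it), a TokenFilter could stamp every token
with the page number it came from:

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Payload;

    // Sketch: attach the page number as a one-byte payload to every
    // token. A custom Query/Scorer could read the payload at match
    // time to recover the page (or chapter, section, etc.) of a hit.
    public class PagePayloadFilter extends TokenFilter {
        private final byte page;

        public PagePayloadFilter(TokenStream input, int page) {
            super(input);
            this.page = (byte) page; // real code would need a richer encoding
        }

        public Token next() throws IOException {
            Token token = input.next();
            if (token != null) {
                token.setPayload(new Payload(new byte[] { page }));
            }
            return token;
        }
    }

On the query side the payload is read back through
TermPositions.getPayload(), which is where the specialized Query
subclasses would come in.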
Erik