DISCLAIMER: This is from a Lucene-centric viewpoint. That said, this may be useful.

From your line number, page number, etc. perspective, it is possible to index
special guaranteed-to-not-match tokens, then use the TermDocs/TermEnum
data, along with SpanQueries, to figure this out at search time. For
instance, coincident with the last term in each line, index the token "$$$$$".
Coincident with the last token of every paragraph, index the token "#####".
If you get the offsets of the matching terms, you can quite quickly count the
number of line and paragraph tokens using TermDocs/TermEnum and correlate hits
to lines and paragraphs. The trick is to index your special tokens with a
position increment of 0 (see SynonymAnalyzer in Lucene in Action for more on
this).
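
To make the counting idea concrete, here is a rough sketch in plain Java. It
does not call the Lucene API at all; it only models a token stream with
positions. The class MarkerTokens, its Token type and the method names are
made up for illustration, and it assumes every line contains at least one term
and that the marker strings never occur in the real text.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of marker tokens at position increment 0 (hypothetical, not Lucene code). */
final class MarkerTokens {
    static final String LINE_END = "$$$$$";
    static final String PARA_END = "#####";

    /** A token plus the absolute position it would occupy in the index. */
    static final class Token {
        final String text;
        final int position;
        Token(String text, int position) { this.text = text; this.position = position; }
    }

    /**
     * Splits each line on whitespace. After the last term of a line a LINE_END
     * marker is emitted at the same position (increment 0); after the last line
     * of a paragraph a PARA_END marker is emitted the same way.
     */
    static List<Token> tokenize(List<List<String>> paragraphs) {
        List<Token> out = new ArrayList<>();
        int pos = -1;
        for (List<String> para : paragraphs) {
            for (String line : para) {
                for (String term : line.trim().split("\\s+")) {
                    if (!term.isEmpty()) out.add(new Token(term, ++pos));
                }
                out.add(new Token(LINE_END, pos)); // shares the last term's position
            }
            out.add(new Token(PARA_END, pos));
        }
        return out;
    }

    /** Zero-based line (or paragraph) index of a hit: markers strictly before it. */
    static int countBefore(List<Token> tokens, String marker, int hitPosition) {
        int n = 0;
        for (Token t : tokens) {
            if (t.text.equals(marker) && t.position < hitPosition) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<Token> tokens = tokenize(List.of(
                List.of("the quick brown", "fox jumps"),
                List.of("over the lazy dog")));
        int hit = 7; // position of "lazy"
        System.out.println("line index: " + countBefore(tokens, LINE_END, hit));
        System.out.println("paragraph index: " + countBefore(tokens, PARA_END, hit));
    }
}
```

In a real index the count would come from the positions of the marker terms in
the matching document rather than from a list scan, but the arithmetic is the
same: because a marker sits on the same position as the last term of its line,
a hit on that last term still counts as belonging to that line.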

Another possibility is to add a special field to each document with the
offsets of each end-of-sentence and end-of-paragraph (stored, not indexed).
Again, given the offsets, you can read in this field and figure out what
paragraph your hits are in.
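
As a sketch of the lookup side of that second idea, again in plain Java with
no Lucene calls: assume (my assumption, nothing Lucene prescribes) that the
stored field holds the end offsets as a whitespace-separated string such as
"120 300 450", strictly increasing, each one being the exclusive end of a
paragraph. OffsetLookup and its methods are hypothetical names.

```java
import java.util.Arrays;

/** Maps a hit's character offset to a paragraph index (hypothetical helper). */
final class OffsetLookup {

    /** Parses the assumed stored-field format: whitespace-separated integers. */
    static int[] parse(String stored) {
        String s = stored.trim();
        if (s.isEmpty()) return new int[0];
        return Arrays.stream(s.split("\\s+")).mapToInt(Integer::parseInt).toArray();
    }

    /**
     * Returns the zero-based index of the unit containing hitOffset, or -1 if
     * the offset lies at or past the last end offset. endOffsets must be
     * strictly increasing; each value is the exclusive end of one unit.
     */
    static int unitIndex(int[] endOffsets, int hitOffset) {
        int i = Arrays.binarySearch(endOffsets, hitOffset);
        // Found: the hit starts exactly where unit i ends, so it is in unit i + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        int idx = (i >= 0) ? i + 1 : -i - 1;
        return idx < endOffsets.length ? idx : -1;
    }

    public static void main(String[] args) {
        int[] paragraphEnds = parse("120 300 450");
        System.out.println(unitIndex(paragraphEnds, 125));
    }
}
```

The same method works for sentence offsets if those are kept in a second
stored field. The binary search keeps each lookup cheap, but the stored field
still has to be loaded per matching document, which is part of why I hedge
about volume below.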

How suitable either of these is depends on a lot of characteristics of your
particular problem space. I'm not sure either of them is suitable for very
high-volume applications.

Also, I'm approaching this from an in-the-guts-of-Lucene perspective, so
don't even *think* of asking me how to really make this work in SOLR <G>.


On Nov 11, 2007 12:44 AM, David Neubert <[EMAIL PROTECTED]> wrote:

> Ryan (and others who need something to put them to sleep :) )
> Wow -- the light-bulb finally went off -- the Analyzer admin page is very
> cool -- I just was not at all thinking the SOLR/Lucene way.
> I need to rethink my whole approach now that I understand (from reviewing
> the schema.xml closer and playing with the Analyzer) how compatible index
> and query policies can be applied automatically on a field by field basis by
> SOLR at both index and query time.
> I still may have a stumper here, but I need to give it some thought, and
> may return again with another question:
> The problem is that my text is book text (fairly large) that looks very
> much like one would expect:
> <book>
> <chapter>
> <para><sen>...</sen><sen>....</sen></para>
> <para><sen>...</sen><sen>....</sen></para>
> <para><sen>...</sen><sen>...</sen></para>
> </chapter>
> </book>
> The search results need to return exact sentences or paragraphs with their
> exact page:line numbers (which is available in the embedded markup in the
> text).
> There were previous responses by others, suggesting I look into payloads,
> but I did not fully understand that -- I may have to re-read those e-mails
> now that I am getting a clearer picture of SOLR/Lucene.
> However, the reason I resorted to indexing each paragraph as a single
> document, and then redundantly indexing each sentence as a single document,
> is because I was planning on pre-parsing the text myself (outside of SOLR)
> -- and feeding separate <doc> elements to the <add> because in that way I
> could produce the page:line reference in the pre-parsing (again outside of
> SOLR) and feed it in as an explicit field in the <doc> elements of the <add>
> requests.  Therefore at query time, I will have the exact page:line
> corresponding to the start of the paragraph or sentence.
> But I am beginning to suspect, I was planning to do a lot of work that
> SOLR can do for me.
> I will continue to study this and respond when I am a bit clearer, but the
> closer I could get to just submitting the books a chapter at a time -- and
> letting SOLR do the work, the better (cause I have all the books in well
> formed xml at chapter levels).  However, I don't see yet how I could get
> par/sen granular search result hits, along with their exact page:line
> coordinates unless I approach it by explicitly indexing the pars and sens as
> single documents, not chapters hits, and also return the entire text of the
> sen or par, and highlight the keywords within (for the search result hit).
>  Once a search result hit is selected, it would then act as expected and
> position into the chapter, at the selected reference, highlight again the
> key words, but this time in the context of an entire chapter (the whole
> document to the user's mind).
> Even with my new understanding you (and others) have given me, which I can
> use to certainly improve my approach -- it still seems to me that because
> multi-valued fields concatenate text -- even if you use the
> positionIncrementGap feature to prohibit unwanted phrase matches, how do you
> produce a well defined search result hit, bounded by the exact sen or par,
> unless you index them as single documents?
> Should I still read up on the payload discussion?
> Dave
> ----- Original Message ----
> From: Ryan McKinley <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Saturday, November 10, 2007 5:00:43 PM
> Subject: Re: Redundant indexing * 4 only solution (for par/sen and case
> sensitivity)
> David Neubert wrote:
> > Ryan,
> >
> > Thanks for your response.  I infer from your response that you can
> > have a different analyzer for each field
> yes!  each field can have its own indexing strategy.
> > I believe that the Analyzer approach you suggested requires the use
> > of the same Analyzer at query time that was used during indexing.
> it does not require the *same* Analyzer - it just requires one that
> generates compatible tokens.  That is, you may want the indexing to
> split the input into sentences, but the query time analyzer keeps the
> input as a single token.
> check the example schema.xml file -- the 'text' field type applies
> synonyms at index time, but does not at query time.
> re searching across multiple fields, don't worry, lucene handles this
> well.  You may want to do that explicitly or with the dismax handler.
> I'd suggest you play around with indexing some data.  check the
> analysis.jsp in the admin section.  It is a great tool to help figure
> out what analyzers do at index vs query time.
> ryan
