DISCLAIMER: This is from a Lucene-centric viewpoint. That said, this may be useful....
For your line number, page number, etc. perspective, it is possible to index special guaranteed-to-not-match tokens, then use the TermDocs/TermEnum data, along with SpanQueries, to figure this out at search time. For instance, coincident with the last term in each line, index the token "$$$$$". Coincident with the last token of every paragraph, index the token "#####". If you get the offsets of the matching terms, you can quite quickly count the line and paragraph tokens using TermDocs/TermEnum and correlate hits to lines and paragraphs. The trick is to index your special tokens with a position increment of 0 (see SynonymAnalyzer in Lucene in Action for more on this).

Another possibility is to add to each document a special field containing the offsets of each end-of-sentence and end-of-paragraph (stored, not indexed). Again, given the offsets, you can read in this field and figure out what line/paragraph your hits are in.

How suitable either of these is depends on a lot of characteristics of your particular problem space. I'm not sure either of them is suitable for very high-volume applications. Also, I'm approaching this from an in-the-guts-of-Lucene perspective, so don't even *think* of asking me how to really make this work in SOLR <G>.

Best,
Erick

On Nov 11, 2007 12:44 AM, David Neubert <[EMAIL PROTECTED]> wrote:
> Ryan (and others who need something to put them to sleep :) )
>
> Wow -- the light bulb finally went off -- the Analyzer admin page is very
> cool -- I just was not at all thinking the SOLR/Lucene way.
>
> I need to rethink my whole approach now that I understand (from reviewing
> the schema.xml closer and playing with the Analyzer) how compatible index
> and query policies can be applied automatically on a field-by-field basis by
> SOLR at both index and query time.
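[Editor's note: Erick's marker-counting step can be sketched in plain Java. This is a hedged illustration, not Lucene API code: in practice the marker positions would come from TermPositions/SpanQuery results, and the class and method names here are invented. Because each "$$$$$" marker is indexed with a position increment of 0, it shares the position of the last real term on its line, so counting markers before a hit position yields the line number.]

```java
import java.util.Arrays;

public class HitLocator {
    // lineMarkerPositions: sorted token positions at which the "$$$$$"
    // end-of-line marker was indexed (one marker per line, sharing the
    // position of the line's last term).
    static int lineOfHit(int[] lineMarkerPositions, int hitPosition) {
        // Count markers strictly before the hit; the hit sits on the line
        // after those, so add 1 for a 1-based line number. A hit that shares
        // a position with a marker is the last term of that marker's line.
        int idx = Arrays.binarySearch(lineMarkerPositions, hitPosition);
        if (idx < 0) idx = -idx - 1;   // not found: idx = insertion point
        return idx + 1;
    }

    public static void main(String[] args) {
        int[] markers = {4, 9, 15};    // lines end at token positions 4, 9, 15
        System.out.println(lineOfHit(markers, 2));   // line 1
        System.out.println(lineOfHit(markers, 7));   // line 2
        System.out.println(lineOfHit(markers, 15));  // line 3 (shares marker pos)
    }
}
```

The same counting works for "#####" paragraph markers; only the marker term changes.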
>
> I still may have a stumper here, but I need to give it some thought, and
> may return again with another question:
>
> The problem is that my text is book text (fairly large) that looks very
> much like one would expect:
>
> <book>
>   <chapter>
>     <para><sen>...</sen><sen>...</sen></para>
>     <para><sen>...</sen><sen>...</sen></para>
>     <para><sen>...</sen><sen>...</sen></para>
>   </chapter>
> </book>
>
> The search results need to return exact sentences or paragraphs with their
> exact page:line numbers (which are available in the embedded markup in the
> text).
>
> There were previous responses by others suggesting I look into payloads,
> but I did not fully understand that -- I may have to re-read those e-mails
> now that I am getting a clearer picture of SOLR/Lucene.
>
> However, the reason I resorted to indexing each paragraph as a single
> document, and then redundantly indexing each sentence as a single document,
> is because I was planning on pre-parsing the text myself (outside of SOLR)
> and feeding separate <doc> elements to the <add>, because that way I could
> produce the page:line reference in the pre-parsing (again outside of SOLR)
> and feed it in as an explicit field in the <doc> elements of the <add>
> requests. Therefore at query time, I will have the exact page:line
> corresponding to the start of the paragraph or sentence.
>
> But I am beginning to suspect I was planning to do a lot of work that
> SOLR can do for me.
>
> I will continue to study this and respond when I am a bit clearer, but the
> closer I can get to just submitting the books a chapter at a time and
> letting SOLR do the work, the better (because I have all the books in
> well-formed XML at chapter levels).
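[Editor's note: the pre-parsing step David describes, computing a page:line reference for each sentence before feeding it to Solr as its own <doc>, might look roughly like the sketch below. Everything here is an assumption for illustration: the lines-per-page figure, the class and method names, and the idea of locating line breaks by newline offsets in the chapter text.]

```java
public class PageLineRef {
    static final int LINES_PER_PAGE = 44;   // assumed constant; real books vary

    // Given the character offset where a sentence starts, the offsets of all
    // newline characters in the chapter text, and the chapter's first page
    // number, return a "page:line" reference for that sentence.
    static String pageLine(int sentenceStart, int[] newlineOffsets, int firstPage) {
        int line = 0;   // count newlines that occur before the sentence starts
        while (line < newlineOffsets.length && newlineOffsets[line] < sentenceStart) {
            line++;
        }
        int page = firstPage + line / LINES_PER_PAGE;
        int lineOnPage = line % LINES_PER_PAGE + 1;   // 1-based line number
        return page + ":" + lineOnPage;
    }

    public static void main(String[] args) {
        int[] newlines = {80, 161, 240};   // toy chapter with four short lines
        System.out.println(pageLine(0, newlines, 12));    // prints 12:1
        System.out.println(pageLine(200, newlines, 12));  // prints 12:3
    }
}
```

The resulting string would simply be stored as an extra field on each sentence-level or paragraph-level <doc> in the <add> request.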
> However, I don't see yet how I could get par/sen-granular search result
> hits, along with their exact page:line coordinates, unless I approach it by
> explicitly indexing the pars and sens as single documents, not chapters,
> and also return the entire text of the sen or par, and highlight the
> keywords within (for the search result hit). Once a search result hit is
> selected, it would then act as expected and position into the chapter at
> the selected reference, highlighting the keywords again, but this time in
> the context of an entire chapter (the whole document to the user's mind).
>
> Even with my new understanding you (and others) have given me, which I can
> certainly use to improve my approach -- it still seems to me that because
> multi-valued fields concatenate text -- even if you use the
> positionIncrementGap feature to prohibit unwanted phrase matches, how do
> you produce a well-defined search result hit, bounded by the exact sen or
> par, unless you index them as single documents?
>
> Should I still read up on the payload discussion?
>
> Dave
>
> ----- Original Message ----
> From: Ryan McKinley <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Saturday, November 10, 2007 5:00:43 PM
> Subject: Re: Redundant indexing * 4 only solution (for par/sen and case
> sensitivity)
>
> David Neubert wrote:
> > Ryan,
> >
> > Thanks for your response. I infer from your response that you can
> > have a different analyzer for each field
>
> yes! each field can have its own indexing strategy.
>
> > I believe that the Analyzer approach you suggested requires the use
> > of the same Analyzer at query time that was used during indexing.
>
> it does not require the *same* Analyzer -- it just requires one that
> generates compatible tokens. That is, you may want the indexing to
> split the input into sentences, but the query-time analyzer keeps the
> input as a single token.
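[Editor's note: the positionIncrementGap behavior David asks about can be illustrated with plain arithmetic, no Lucene required. This sketch mimics (under assumed names) how Lucene assigns token positions across the values of a multi-valued field: the gap is added before the first token of each subsequent value, so a phrase query, whose terms must be adjacent, cannot match across a value boundary.]

```java
import java.util.ArrayList;
import java.util.List;

public class PositionGapDemo {
    // Assign token positions to whitespace-separated tokens of a multi-valued
    // field, adding `gap` to the increment of each value's first token (after
    // the first value), as positionIncrementGap does.
    static int[] positions(String[] values, int gap) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;   // position before the first token
        for (int v = 0; v < values.length; v++) {
            boolean first = true;
            for (String token : values[v].split("\\s+")) {
                int increment = 1 + (first && v > 0 ? gap : 0);
                pos += increment;
                out.add(pos);
                first = false;
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        String[] sens = {"the quick fox", "jumps high"};
        // With gap 100, "fox" and "jumps" end up 101 positions apart, so the
        // phrase query "fox jumps" cannot match across the sentence boundary.
        System.out.println(java.util.Arrays.toString(positions(sens, 100)));
        // prints [0, 1, 2, 103, 104]
    }
}
```

Note this answers only the phrase-match half of David's question; it does not, by itself, bound a highlighted hit to one sentence, which is why he is weighing sentence-level documents.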
>
> check the example schema.xml file -- the 'text' field type applies
> synonyms at index time, but does not at query time.
>
> re searching across multiple fields, don't worry, lucene handles this
> well. You may want to do that explicitly or with the dismax handler.
>
> I'd suggest you play around with indexing some data. check the
> analysis.jsp in the admin section. It is a great tool to help figure
> out what analyzers do at index vs query time.
>
> ryan
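[Editor's note: in schema.xml terms, the index-time-only synonyms Ryan mentions look roughly like the fragment below. This is a hedged sketch rather than a copy of the shipped example; the factory class names are standard Solr analysis factories of that era, but the exact filter chain is an assumption.]

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- no SynonymFilterFactory here: synonyms apply only at index time,
         yet the two analyzers still produce compatible tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The analysis.jsp page Ryan points to shows, for any sample text, the token stream each of these two chains produces, which makes mismatches between the index and query sides easy to spot.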