DISCLAIMER: This is from a Lucene-centric viewpoint. That said, this may be useful....
For your line number, page number, etc. perspective, it is possible to index special guaranteed-to-not-match tokens, then use the TermDocs/TermEnum data, along with SpanQueries, to figure this out at search time. For instance, coincident with the last term in each line, index the token "$$$$$". Coincident with the last token of every paragraph, index the token "#####". If you get the offsets of the matching terms, you can quite quickly count the line and paragraph tokens using TermDocs/TermEnum and correlate hits to lines and paragraphs. The trick is to index your special tokens with a position increment of 0 (see SynonymAnalyzer in Lucene in Action for more on this).

Another possibility is to add to each document a special field containing the offsets of each end-of-sentence and end-of-paragraph (stored, not indexed). Again, given the offsets, you can read in this field and figure out what line/paragraph your hits are in.

How suitable either of these is depends on a lot of characteristics of your particular problem space. I'm not sure either of them is suitable for very high-volume applications. Also, I'm approaching this from an in-the-guts-of-Lucene perspective, so don't even *think* of asking me how to really make this work in SOLR <G>.

Best,
Erick

On Nov 11, 2007 12:44 AM, David Neubert <[EMAIL PROTECTED]> wrote:
> Ryan (and others who need something to put them to sleep :) )
>
> Wow -- the light bulb finally went off -- the Analyzer admin page is very
> cool -- I just was not at all thinking the SOLR/Lucene way.
>
> I need to rethink my whole approach now that I understand (from reviewing
> the schema.xml closer and playing with the Analyzer) how compatible index
> and query policies can be applied automatically on a field-by-field basis by
> SOLR at both index and query time.
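[Editor's note: Erick's marker-counting step can be sketched in plain Java. This is a hedged illustration, not Lucene API code: in practice the marker positions would come from TermPositions/SpanQuery results, and the class and method names here are invented. Because each "$$$$$" marker is indexed with a position increment of 0, it shares the position of the last real term on its line, so counting markers before a hit position yields the line number.]

```java
import java.util.Arrays;

public class HitLocator {
    // lineMarkerPositions: sorted token positions at which the "$$$$$"
    // end-of-line marker was indexed (one marker per line, sharing the
    // position of the line's last term).
    static int lineOfHit(int[] lineMarkerPositions, int hitPosition) {
        // Count markers strictly before the hit; the hit sits on the line
        // after those, so add 1 for a 1-based line number. A hit that shares
        // a position with a marker is the last term of that marker's line.
        int idx = Arrays.binarySearch(lineMarkerPositions, hitPosition);
        if (idx < 0) idx = -idx - 1;   // not found: idx = insertion point
        return idx + 1;
    }

    public static void main(String[] args) {
        int[] markers = {4, 9, 15};    // lines end at token positions 4, 9, 15
        System.out.println(lineOfHit(markers, 2));   // line 1
        System.out.println(lineOfHit(markers, 7));   // line 2
        System.out.println(lineOfHit(markers, 15));  // line 3 (shares marker pos)
    }
}
```

The same counting works for "#####" paragraph markers; only the marker term changes.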
>
> I still may have a stumper here, but I need to give it some thought, and
> may return again with another question:
>
> The problem is that my text is book text (fairly large) that looks very
> much like one would expect:
>
> <book>
>   <chapter>
>     <para><sen>...</sen><sen>...</sen></para>
>     <para><sen>...</sen><sen>...</sen></para>
>     <para><sen>...</sen><sen>...</sen></para>
>   </chapter>
> </book>
>
> The search results need to return exact sentences or paragraphs with their
> exact page:line numbers (which are available in the embedded markup in the
> text).
>
> There were previous responses by others suggesting I look into payloads,
> but I did not fully understand that -- I may have to re-read those e-mails
> now that I am getting a clearer picture of SOLR/Lucene.
>
> However, the reason I resorted to indexing each paragraph as a single
> document, and then redundantly indexing each sentence as a single document,
> is because I was planning on pre-parsing the text myself (outside of SOLR)
> and feeding separate <doc> elements to the <add>, because that way I could
> produce the page:line reference in the pre-parsing (again outside of SOLR)
> and feed it in as an explicit field in the <doc> elements of the <add>
> requests. Therefore at query time, I will have the exact page:line
> corresponding to the start of the paragraph or sentence.
>
> But I am beginning to suspect I was planning to do a lot of work that
> SOLR can do for me.
>
> I will continue to study this and respond when I am a bit clearer, but the
> closer I can get to just submitting the books a chapter at a time and
> letting SOLR do the work, the better (because I have all the books in
> well-formed XML at chapter levels).
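[Editor's note: the pre-parsing step David describes, computing a page:line reference for each sentence before feeding it to Solr as its own <doc>, might look roughly like the sketch below. Everything here is an assumption for illustration: the lines-per-page figure, the class and method names, and the idea of locating line breaks by newline offsets in the chapter text.]

```java
public class PageLineRef {
    static final int LINES_PER_PAGE = 44;   // assumed constant; real books vary

    // Given the character offset where a sentence starts, the offsets of all
    // newline characters in the chapter text, and the chapter's first page
    // number, return a "page:line" reference for that sentence.
    static String pageLine(int sentenceStart, int[] newlineOffsets, int firstPage) {
        int line = 0;   // count newlines that occur before the sentence starts
        while (line < newlineOffsets.length && newlineOffsets[line] < sentenceStart) {
            line++;
        }
        int page = firstPage + line / LINES_PER_PAGE;
        int lineOnPage = line % LINES_PER_PAGE + 1;   // 1-based line number
        return page + ":" + lineOnPage;
    }

    public static void main(String[] args) {
        int[] newlines = {80, 161, 240};   // toy chapter with four short lines
        System.out.println(pageLine(0, newlines, 12));    // prints 12:1
        System.out.println(pageLine(200, newlines, 12));  // prints 12:3
    }
}
```

The resulting string would simply be stored as an extra field on each sentence-level or paragraph-level <doc> in the <add> request.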
> However, I don't see yet how I could get par/sen-granular search result
> hits, along with their exact page:line coordinates, unless I approach it by
> explicitly indexing the pars and sens as single documents, not chapters,
> and also return the entire text of the sen or par, and highlight the
> keywords within (for the search result hit). Once a search result hit is
> selected, it would then act as expected and position into the chapter at
> the selected reference, highlighting the keywords again, but this time in
> the context of an entire chapter (the whole document to the user's mind).
>
> Even with my new understanding you (and others) have given me, which I can
> certainly use to improve my approach -- it still seems to me that because
> multi-valued fields concatenate text -- even if you use the
> positionIncrementGap feature to prohibit unwanted phrase matches, how do
> you produce a well-defined search result hit, bounded by the exact sen or
> par, unless you index them as single documents?
>
> Should I still read up on the payload discussion?
>
> Dave
>
> ----- Original Message ----
> From: Ryan McKinley <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Saturday, November 10, 2007 5:00:43 PM
> Subject: Re: Redundant indexing * 4 only solution (for par/sen and case
> sensitivity)
>
> David Neubert wrote:
> > Ryan,
> >
> > Thanks for your response. I infer from your response that you can
> > have a different analyzer for each field
>
> yes! each field can have its own indexing strategy.
>
> > I believe that the Analyzer approach you suggested requires the use
> > of the same Analyzer at query time that was used during indexing.
>
> it does not require the *same* Analyzer -- it just requires one that
> generates compatible tokens. That is, you may want the indexing to
> split the input into sentences, but the query-time analyzer keeps the
> input as a single token.
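[Editor's note: the positionIncrementGap behavior David asks about can be illustrated with plain arithmetic, no Lucene required. This sketch mimics (under assumed names) how Lucene assigns token positions across the values of a multi-valued field: the gap is added before the first token of each subsequent value, so a phrase query, whose terms must be adjacent, cannot match across a value boundary.]

```java
import java.util.ArrayList;
import java.util.List;

public class PositionGapDemo {
    // Assign token positions to whitespace-separated tokens of a multi-valued
    // field, adding `gap` to the increment of each value's first token (after
    // the first value), as positionIncrementGap does.
    static int[] positions(String[] values, int gap) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;   // position before the first token
        for (int v = 0; v < values.length; v++) {
            boolean first = true;
            for (String token : values[v].split("\\s+")) {
                int increment = 1 + (first && v > 0 ? gap : 0);
                pos += increment;
                out.add(pos);
                first = false;
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        String[] sens = {"the quick fox", "jumps high"};
        // With gap 100, "fox" and "jumps" end up 101 positions apart, so the
        // phrase query "fox jumps" cannot match across the sentence boundary.
        System.out.println(java.util.Arrays.toString(positions(sens, 100)));
        // prints [0, 1, 2, 103, 104]
    }
}
```

Note this answers only the phrase-match half of David's question; it does not, by itself, bound a highlighted hit to one sentence, which is why he is weighing sentence-level documents.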
>
> check the example schema.xml file -- the 'text' field type applies
> synonyms at index time, but does not at query time.
>
> re searching across multiple fields, don't worry, lucene handles this
> well. You may want to do that explicitly or with the dismax handler.
>
> I'd suggest you play around with indexing some data. check the
> analysis.jsp in the admin section. It is a great tool to help figure
> out what analyzers do at index vs query time.
>
> ryan
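[Editor's note: in schema.xml terms, the index-time-only synonyms Ryan mentions look roughly like the fragment below. This is a hedged sketch rather than a copy of the shipped example; the factory class names are standard Solr analysis factories of that era, but the exact filter chain is an assumption.]

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- no SynonymFilterFactory here: synonyms apply only at index time,
         yet the two analyzers still produce compatible tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The analysis.jsp page Ryan points to shows, for any sample text, the token stream each of these two chains produces, which makes mismatches between the index and query sides easy to spot.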