Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
DISCLAIMER: This is from a Lucene-centric viewpoint. That said, this may be useful.

For your line-number, page-number, etc. perspective, it is possible to index special guaranteed-to-not-match tokens and then use the TermDocs/TermEnum data, along with SpanQueries, to figure this out at search time. For instance, coincident with the last term in each line, index the token $. Coincident with the last token of every paragraph, index the token #. If you get the positions of the matching terms, you can quite quickly count the number of line and paragraph tokens using TermDocs/TermEnum and correlate hits to lines and paragraphs. The trick is to index your special tokens with a position increment of 0 (see SynonymAnalyzer in Lucene in Action for more on this).

Another possibility is to add a special field to each document with the offsets of each end-of-sentence and end-of-paragraph (stored, not indexed). Again, given the offsets, you can read in this field and figure out what line/paragraph your hits are in.

How suitable either of these is depends on a lot of characteristics of your particular problem space. I'm not sure either of them is suitable for very high volume applications. Also, I'm approaching this from an in-the-guts-of-Lucene perspective, so don't even *think* of asking me how to really make this work in SOLR <G>.

Best
Erick

On Nov 11, 2007 12:44 AM, David Neubert [EMAIL PROTECTED] wrote:
[...]
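Erick's sentinel-token scheme can be sketched without Lucene itself: once the positions at which the $ (end-of-line) and # (end-of-paragraph) tokens were indexed are known (the kind of data TermDocs/TermEnum give you), a hit's line and paragraph numbers are just counts of sentinel positions preceding the hit position. The class name and the hard-coded positions below are illustrative stand-ins, not output of any real index:

```java
import java.util.List;

public class SentinelCounter {
    // Count how many sentinel positions fall strictly before the hit position.
    // Sentinels are indexed with a position increment of 0, so each one shares
    // the position of the last real token on its line/paragraph.
    static int countUpTo(List<Integer> sentinelPositions, int hitPosition) {
        int count = 0;
        for (int p : sentinelPositions) {
            if (p < hitPosition) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Positions where "$" (end-of-line) was indexed (hypothetical data).
        List<Integer> lineEnds = List.of(4, 9, 15, 22);
        // Positions where "#" (end-of-paragraph) was indexed.
        List<Integer> paraEnds = List.of(9, 22);

        int hit = 11; // position of a matching term, e.g. from a SpanQuery
        // The hit is on line (sentinels before it) + 1, same for paragraphs.
        System.out.println("line " + (countUpTo(lineEnds, hit) + 1));      // line 3
        System.out.println("paragraph " + (countUpTo(paraEnds, hit) + 1)); // paragraph 2
    }
}
```

The same counting works for sentences if a third sentinel token is indexed at each sentence boundary.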
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Erik,

Probably because of my newness to SOLR/Lucene, I see now what you/Yonik meant by a case field, but I am not clear about your wording "per-book setting attached at index time" -- would you mind elaborating on that, so I am clear?

Dave

----- Original Message ----
From: Erik Hatcher [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Sunday, November 11, 2007 5:21:45 AM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

[...]
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Erik - thanks, I am considering this approach, versus explicit redundant indexing -- and am also considering Lucene -- problem is, I am one week into both technologies (though I have years in the search space) -- wish I could go to Hong Kong -- any discounts available anywhere :)

Dave

----- Original Message ----
From: Erick Erickson [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, November 12, 2007 2:11:14 PM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Best
Erick

[...]
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
On Nov 12, 2007 2:20 PM, David Neubert [EMAIL PROTECTED] wrote:

Erik - thanks, I am considering this approach, versus explicit redundant indexing -- and am also considering Lucene -

There's not a well-defined solution in either, IMO.

- problem is, I am one week into both technologies (though have years in the search space) -- wish I could go to Hong Kong -- any discounts available anywhere :)

Unfortunately the OS Summit has been canceled.

-Yonik
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
: - problem is, I am one week into both technologies (though have years in the search space) -- wish I could
: go to Hong Kong -- any discounts available anywhere :)
:
: Unfortunately the OS Summit has been canceled.

Or rescheduled to 2008 ... depending on whether you are a half-empty / half-full kind of person. And let's not forget Atlanta ... starting today and all... http://us.apachecon.com/us2007/

-Hoss
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Solr query syntax is documented here: http://wiki.apache.org/solr/SolrQuerySyntax

What Yonik is referring to is creating your own case field with the per-book setting attached at index time.

	Erik

On Nov 11, 2007, at 12:55 AM, David Neubert wrote:

Yonik (or anyone else) Do you know where on-line documentation on the +case: syntax is located? I can't seem to find it.

Dave

[...]
Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Hi all,

Using SOLR, I believe I have to index the same content 4 times (not desirable) into 2 indexes -- and I don't know how you can practically do multiple indexes in SOLR (if indeed there is no better solution than 4 indexing runs into two indexes).

My need is case-sensitive and case-insensitive searches over well-formed XML content (books), performing exact searches at the paragraph and sentence levels -- no errors over approximate boundaries -- the source content has exact par/sen tags.

I have already proven a pretty nice solution for par/sen indexing twice into the same index in SOLR. I have added a tags field, and put correlative XML tags (comma delimited) into this field (one of which is either a para or sen flag) which flags the document (partial) as a paragraph or sentence. Thus all paragraphs of the book are indexed as single documents (with their sentences combined and concatenated), and then all sentences in the book are indexed again as single documents. Both go into the same SOLR index. I just add an AND tags:para or AND tags:sen to my search and everything works fine. The obvious downside to this approach is the 2X indexing, but it does execute quite nicely on a single index using SOLR. This obviously doesn't scale nicely, but will do for quite a while probably. I thought I could live with that.

But then I moved on to case-sensitive and case-insensitive searches, and my research so far is pointing to one index for each case. So now I have:

(1) 4X in content indexing
(2) 2X in actual SOLR/Lucene indices
(3) I don't know how to practically do multiple indices using SOLR

If there is a better way of attacking this problem, I would appreciate recommendations!!!

Also, I don't know how to do multiple indices in SOLR -- I have heard it might be available in 1.3.0? If this is my only recourse, please advise me where really good documentation is available on building 1.3.0. I am not admin savvy, but I did succeed in getting SOLR up myself and navigating through it with the help of this forum. But I hear that building 1.3.0 (as opposed to downloading and installing it, like 1.2.0) is a whole different experience and much more complex.

Thanks

Dave

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
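The tags-field scheme described in this message turns par/sen selection into an ordinary filter clause. Assuming the body text is indexed in a field called content (the field and search term here are illustrative, not from the original schema), the queries look like:

```
content:faith AND tags:sen     (match sentence documents only)
content:faith AND tags:para    (match paragraph documents only)
```

Because paragraph documents and sentence documents live in the same index, only the tags clause distinguishes them.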
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Ryan,

Thanks for your response. I infer from your response that you can have a different analyzer for each field -- I guess I should have figured that out -- but because I had not thought of that, I concluded that I needed multiple indices (sorry, I am still very new to Solr/Lucene).

Does such an approach make querying difficult under the following condition? The app that I am replacing (and trying to enhance) has the ability to search multiple books at once with sen/par and case-sensitivity settings individually selectable per book (e.g. default search modes per book). So with a single query request (just the query word(s)), you can search one book by par with case, another by sen w/o case, etc. -- all settable as user defaults. I need to try to figure out how to match that in Solr/Lucene -- I believe that the Analyzer approach you suggested requires the use of the same Analyzer at query time that was used during indexing. So if I am hitting multiple fields (in the same search request) that invoke different Analyzers -- am I at a dead end, and have to resort to consecutive multiple queries instead (and sort/merge results afterwards)? Or am I just overcomplicating this?

Dave

----- Original Message ----
From: Ryan McKinley [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Saturday, November 10, 2007 2:18:00 PM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

So now I have:
(1) 4X in content indexing
(2) 2X in actual SOLR/Lucene indices
(3) I don't know how to practically do multiple indices using SOLR

If there is a better way of attacking this problem, I would appreciate recommendations!!!

I don't quite follow your current approach, but it sounds like you just need some copyFields to index the same content with multiple analyzers.
for example, say you have fields:

  <field name="content" type="string" indexed="true" stored="true"/>
  <field name="content_sentence" type="sentence" indexed="true" stored="false"/>
  <field name="content_paragraph" type="paragraph" indexed="true" stored="false"/>
  <field name="content_text" type="text" indexed="true" stored="false"/>

and copy fields:

  <copyField source="content" dest="content_sentence"/>
  <copyField source="content" dest="content_paragraph"/>
  <copyField source="content" dest="content_text"/>

The 4X indexing cost? If you *need* to index the content 4 different ways, you don't have any way around that - do you? But is it really a big deal? How often does it need to index? How big is the data?

I'm not quite following your need for multiple solr indices, but in 1.3 it is possible.

ryan
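The same copyField pattern covers the case-sensitivity half of the problem discussed elsewhere in this thread: define one field type whose analyzer lowercases and one whose analyzer does not, and copy the raw content into both. The type and field names below are illustrative, not from any shipped schema:

```xml
<field name="content_cs" type="text_exact" indexed="true" stored="false"/>
<field name="content_ci" type="text_lower" indexed="true" stored="false"/>

<copyField source="content" dest="content_cs"/>
<copyField source="content" dest="content_ci"/>
```

A case-sensitive search then targets content_cs and a case-insensitive one content_ci, all within a single index -- no second index required.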
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
On Nov 10, 2007 4:24 PM, David Neubert [EMAIL PROTECTED] wrote:

So if I am hitting multiple fields (in the same search request) that invoke different Analyzers -- am I at a dead end, and have to resort to consecutive multiple queries instead

Solr handles that for you automatically.

The app that I am replacing (and trying to enhance) has the ability to search multiple books at once with sen/par and case sensitivity settings individually selectable per book

You could easily select case sensitivity or not *per query* across all books. You should step back and see what the requirements actually are (i.e. the reasons why one needs to be able to select case sensitive/insensitive on a book level... it doesn't make sense to me at first blush). It could be done on a per-book level in solr with a more complex query structure though...

(+case:sensitive +(normal relevancy query on the case sensitive fields goes here))
OR
(+case:insensitive +(normal relevancy query on the case insensitive fields goes here))

-Yonik
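Filled in with concrete field names, Yonik's structure might look like the following. Here case is a per-document field holding each book's indexing-time setting, and content_cs/content_ci are hypothetical case-sensitive and case-insensitive copies of the text (none of these names come from the original messages):

```
(+case:sensitive +(content_cs:Ark content_cs:Covenant))
OR
(+case:insensitive +(content_ci:ark content_ci:covenant))
```

Each book's documents match only one of the two top-level clauses, so every book is effectively searched with its own case setting in a single query.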
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
David Neubert wrote:

Ryan, Thanks for your response. I infer from your response that you can have a different analyzer for each field

yes! each field can have its own indexing strategy.

I believe that the Analyzer approach you suggested requires the use of the same Analyzer at query time that was used during indexing.

it does not require the *same* Analyzer - it just requires one that generates compatible tokens. That is, you may want the indexing to split the input into sentences, but the query-time analyzer keeps the input as a single token. Check the example schema.xml file -- the 'text' field type applies synonyms at index time, but does not at query time.

re searching across multiple fields, don't worry, lucene handles this well. You may want to do that explicitly or with the dismax handler.

I'd suggest you play around with indexing some data. Check analysis.jsp in the admin section. It is a great tool to help figure out what analyzers do at index vs query time.

ryan
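Ryan's point about compatible (not identical) analyzers is exactly what schema.xml's separate per-phase analyzer blocks express. A sketch in the spirit of the example schema -- applying synonyms at index time only -- looks like this (the exact filter chain in the shipped example varies by version, so treat this as illustrative):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- synonyms expanded while indexing only -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The two chains differ, but both emit lowercased whitespace tokens, so index-time and query-time tokens remain compatible -- the requirement Ryan describes.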
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Ryan (and others who need something to put them to sleep :) )

Wow -- the light-bulb finally went off -- the Analyzer admin page is very cool -- I just was not at all thinking the SOLR/Lucene way. I need to rethink my whole approach now that I understand (from reviewing the schema.xml closer and playing with the Analyzer) how compatible index and query policies can be applied automatically on a field-by-field basis by SOLR at both index and query time.

I still may have a stumper here, but I need to give it some thought, and may return again with another question. The problem is that my text is book text (fairly large) that looks very much like one would expect:

  <book>
    <chapter>
      <para><sen>...</sen><sen>...</sen></para>
      <para><sen>...</sen><sen>...</sen></para>
      ...
    </chapter>
  </book>

The search results need to return exact sentences or paragraphs with their exact page:line numbers (which are available in the embedded markup in the text). There were previous responses by others suggesting I look into payloads, but I did not fully understand that -- I may have to re-read those e-mails now that I am getting a clearer picture of SOLR/Lucene.

However, the reason I resorted to indexing each paragraph as a single document, and then redundantly indexing each sentence as a single document, is that I was planning on pre-parsing the text myself (outside of SOLR) and feeding separate doc elements to the add request, because that way I could produce the page:line reference in the pre-parsing (again outside of SOLR) and feed it in as an explicit field in the doc elements of the add requests. Therefore at query time, I will have the exact page:line corresponding to the start of the paragraph or sentence. But I am beginning to suspect I was planning to do a lot of work that SOLR can do for me.

I will continue to study this and respond when I am a bit clearer, but the closer I could get to just submitting the books a chapter at a time -- and letting SOLR do the work -- the better (because I have all the books in well-formed XML at chapter level). However, I don't see yet how I could get par/sen-granular search result hits, along with their exact page:line coordinates, unless I approach it by explicitly indexing the pars and sens as single documents (not chapters), and also return the entire text of the sen or par and highlight the keywords within (for the search result hit). Once a search result hit is selected, it would then act as expected and position into the chapter, at the selected reference, highlighting the key words again, but this time in the context of an entire chapter (the whole document, to the user's mind).

Even with my new understanding you (and others) have given me, which I can certainly use to improve my approach -- it still seems to me that because multi-valued fields concatenate text -- even if you use the positionIncrementGap feature to prohibit unwanted phrase matches -- how do you produce a well-defined search result hit, bounded by the exact sen or par, unless you index them as single documents? Should I still read up on the payload discussion?

Dave

----- Original Message ----
From: Ryan McKinley [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Saturday, November 10, 2007 5:00:43 PM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

[...]
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Yonik (or anyone else),

Do you know where on-line documentation on the +case: syntax is located? I can't seem to find it.

Dave

----- Original Message ----
From: Yonik Seeley [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Saturday, November 10, 2007 4:56:40 PM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

[...]