Re: Different ranking results
They look the same to me too. What does q.getClass().getName() say in each case? q.toString()? searcher.explain(q, n)? What version of lucene?

--
Ian.

On Wed, Jul 21, 2010 at 10:25 PM, Philippe wrote:
> Hi,
>
> I just performed two queries which, in my opinion, should lead to the same
> document rankings. However, the rankings differ between these two queries.
> For better understanding I prepared minimal examples for both queries. In
> my understanding both queries perform the same task, namely searching for
> "lucene" in two different fields.
>
> Maybe someone can explain my misunderstanding?
>
> String[] fields = {"TITLE", "BOOK"};
> MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29,
>     fields, new StandardAnalyzer(Version.LUCENE_29));
>
> 1.) Query q = parser.parse("lucene");
>
> 2.) Query q = parser.parse("TITLE:lucene OR BOOK:lucene");
>
> Regards,
> philippe

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: on-the-fly "filters" from docID lists
It sounds like you should implement a custom Filter?

Its getDocIdSet would consult your foreign key-value store and iterate through the allowed docIDs, per segment.

Mike

On Wed, Jul 21, 2010 at 8:37 AM, Martin J wrote:
> Hello, we are trying to implement a query type for Lucene (with the eventual
> target being Solr) where the query string passed in needs to be "filtered"
> through a large list of document IDs per user. We can't store the user ID
> information in the lucene index per document, so we were planning to pull the
> list of documents owned by user X from a key-value store at query time and
> then build some sort of filter in memory before doing the Lucene/Solr query.
> For example:
>
> content:"cars" user_id:X567
>
> would first pull the list of docIDs that user_id:X567 has "access" to from a
> key-value store, and then we'd query the main index with content:"cars" but
> only allow the docIDs that came back to be part of the response. The list of
> docIDs can near the hundreds of thousands.
>
> What should I be looking at to implement such a feature?
>
> Thank you
> Martin
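Mike's suggestion can be sketched without the Lucene classes themselves. The following plain-Java sketch (class and method names are made up for illustration; a real implementation would extend Lucene's Filter and return a DocIdSet per segment) shows the core idea: turn the allowed-docID list fetched from the key-value store into a bit set that the searcher can intersect with query hits.

```java
import java.util.*;

// Illustrative sketch of the idea behind a custom Filter's getDocIdSet:
// translate the allowed-docID list from the external key-value store into
// a bit set, one bit per document in the segment.
public class AllowedDocsSketch {
    // Stand-in for the external key-value store lookup for one user.
    static Set<Integer> allowedDocIds(String userId) {
        return new HashSet<>(Arrays.asList(3, 7, 42));
    }

    // Build the per-segment doc-id set; maxDoc is the segment's doc count.
    static BitSet buildDocIdSet(String userId, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int docId : allowedDocIds(userId)) {
            if (docId < maxDoc) bits.set(docId);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = buildDocIdSet("X567", 100);
        System.out.println(bits.get(7));        // doc 7 is allowed
        System.out.println(bits.get(8));        // doc 8 is filtered out
        System.out.println(bits.cardinality()); // number of allowed docs
    }
}
```

In real Lucene code, the iteration order of the bit set is what the searcher consumes, so building it per segment (as Mike says) keeps the translation cheap.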
Re: Different ranking results
Hi Ian,

I'm using version 2.9.3 of lucene.

q.getClass() and q.toString() are exactly equal:
org.apache.lucene.search.BooleanQuery
TITLE:672 BOOK:672

However, the results for searcher.explain(q, n) differ significantly. It seems to me that Query q = parser.parse("672"); searches only on the BOOK field, whereas Query q = parser.parse("TITLE:672 BOOK:672"); searches on both fields. Do you have an explanation for this behaviour? I have only observed this problem for this field...

I appended the result of both explain strings below:

Query q = parser.parse("672");
9.987344E10 = (MATCH) sum of:
  9.987344E10 = (MATCH) weight(BOOK:672 in 7078), product of:
    0.6583703 = queryWeight(BOOK:672), product of:
      6.085349 = idf(docFreq=109, maxDocs=17780)
      0.10818941 = queryNorm
    1.51697965E11 = (MATCH) fieldWeight(BOOK:672 in 7078), product of:
      3.3166249 = tf(termFreq(BOOK:672)=11)
      6.085349 = idf(docFreq=109, maxDocs=17780)
      7.5161928E9 = fieldNorm(field=BOOK, doc=7078)

Query q = parser.parse("TITLE:672 BOOK:672");
9.5225594E10 = (MATCH) sum of:
  9.5225594E10 = (MATCH) weight(BOOK:672 in 4979), product of:
    0.6583703 = queryWeight(BOOK:672), product of:
      6.085349 = idf(docFreq=109, maxDocs=17780)
      0.10818941 = queryNorm
    1.44638345E11 = (MATCH) fieldWeight(BOOK:672 in 4979), product of:
      3.1622777 = tf(termFreq(BOOK:672)=10)
      6.085349 = idf(docFreq=109, maxDocs=17780)
      7.5161928E9 = fieldNorm(field=BOOK, doc=4979)
  52.366344 = (MATCH) weight(TITLE:672 in 4979), product of:
    0.7526941 = queryWeight(TITLE:672), product of:
      6.957188 = idf(docFreq=45, maxDocs=17780)
      0.10818941 = queryNorm
    69.571884 = (MATCH) fieldWeight(TITLE:672 in 4979), product of:
      1.0 = tf(termFreq(TITLE:672)=1)
      6.957188 = idf(docFreq=45, maxDocs=17780)
      10.0 = fieldNorm(field=TITLE, doc=4979)

Cheers,
P.
Re: Different ranking results
No, I don't have an explanation. Perhaps a minimal self-contained program or test case would help.

--
Ian.
Holding and changing index wide information
Hi,

When using incremental updating via Solr, we want to know which update is in the current index. Each update has a number. How can we store/change/retrieve this number with the index? We want to store it in the index so it replicates to any slaves as well.

So basically, can I store/change/retrieve a number index-wide in Lucene/Solr?

Jan
Re: Different ranking results
Well, that's difficult at the moment, as I can only reproduce this error for a few cases. But I will try to generate such an example.

Cheers,
Philippe
Re: Holding and changing index wide information
Just add/update a dedicated document in the index: k=updatenumber, v=whatever. Retrieve it with a search for k:updatenumber, update it with iw.updateDocument(whatever).

--
Ian.
Re: Holding and changing index wide information
Hi Jan,

I think you require a version number for each commit or update. Say you added 10 docs, then it is update 1; then you modified or added some more, then it is update 2...

If so, my advice would be to have fields named field-type, version-number and version-date-time as part of the index. You could set these like any other field, and retrieve the record by filtering on the field-type value.

Regards
Aditya
www.findbestopensource.com
Question to the writer of MultiPassIndexSplitter
Hi, I heard work is being done on re-writing MultiPassIndexSplitter so it will be single-pass and work quicker. I was wondering if this is already done, or when it is due? Thanks
Inserting data from multiple databases in same index
Hi,

We are creating an index containing data from two databases. What we are trying to achieve is to make our search locate and return information no matter where the data came from. (BTW, we are using Compass, if it matters any.)

My problem is that I am not sure how to create such an index. Do I index in two passes, one for each database, while adding the content of the second SELECT to the first one? Or a different approach?

I'm pretty sure this is (pretty much) a FAQ but I didn't find what I was looking for.

Thanks,
L
Re: Inserting data from multiple databases in same index
You can either:
1) create one index for each database, and merge the results during search
2) create the 2 indexes individually and merge them
3) merge records during the SQL select

The 1) approach should be easy to scale linearly as your database grows. You can even distribute the indexes onto several boxes and achieve sharded search.

--
Chris Lu
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding!
Re: Scoring exact matches higher in a stemmed field
> Ideally, that would be through a class or a function I can override or extend

How is that different than extending QP?

About the "song of songs" example -- the result you describe is already what will happen. A document which contains just the word 'song' will score lower than a document containing "song of songs". Also, what I'd do in such a case is search for the phrase (in addition to the rest), 'cause documents containing the word "songs" 100 times will score higher than the single document that will contain "song of songs" once ...

If you just want a query "abc def" to rank higher if a document contains the exact words, then I'd go w/ the QP extension approach, or do other sophistication like searching for 'abc' '\"abc\"' etc. or something like that. There are many tricks you can do on your end, w/o overriding much in Lucene. Still, IMO extending QP is the easiest and gives you the control you need.

Shai

On Mon, Jul 19, 2010 at 9:24 PM, Itamar Syn-Hershko wrote:
> On 19/7/2010 5:50 PM, Shai Erera wrote:
>> If your analyzer outputs b and b$ in the same position, then the below query
>> will already be what the QP outputs today. If you want to incorporate
>> boosting, I can suggest that you extend QP, override newTermQuery for
>> example, and if the term is a stemmed term, then set the query's boost
>> (Query.setBoost) accordingly. Would that work for you?
>
> I want to avoid overriding the QP, and do this as a pluggable extension.
> What other options do I have other than what you've suggested?
>
> Ideally, that would be through a class or a function I can override or
> extend, so each term hit while searching will be examined. By checking its
> type and text (for a suffix), that interface could double its weight (or
> boost). The similarity functions I mentioned could have provided this
> ability (see below). How can this be done without them?
>
>> You'll need to check whether you want to boost terms inside phrases, or
>> entire phrases, and then override more methods from QP. But that approach
>> will get you the native product of the engine, I think.
>
> Just to make sure we are on the same page here, here's an example (assuming
> the default tf/idf implementation in Lucene).
>
> I want to make sure anyone searching for "song of songs" will find texts
> discussing the biblical book, and have them ranked the highest, instead of
> having short texts containing one word "song" score higher.
>
> So what I do is have my stemming analyzer save the string "song of songs"
> like this, where each parenthesis represents a token position: (song song$)
> (song songs$).
>
> The part I'm missing is how to score terms with suffixes higher. The best
> approach seems to be looking at the term read by IndexReader and boosting
> this finding somehow. The assumption is that if IndexReader has read the
> term songs$ it has been looked for, and therefore this is the exact word
> that has been queried for.
>
> Which is the best Lucene part to hijack for this mission?
>
>> Alternatively, you can set a payload on the stemmed terms and incorporate
>> that into Similarity, but that's more costly.
>
> I had mentioned Payloads - this will get me exactly what I want but, as you
> say, they are quite costly when used for almost every term in the index. If
> I could replace the suffix with Payloads I would have done this (byte vs.
> byte), but I'm using the suffix for one other thing.
>
>> I don't follow what's been deprecated on Sim that you cannot use anymore?
>> All I see are 3 deprecated static methods which are related to norms ...
>
> In 2.3.2 there were these functions:
>
>   public float idf(Term term, Searcher searcher)
>   public float idf(Collection terms, Searcher searcher)
>
> These have been deprecated somewhere between that version and 2.9.2, and it
> seems like I could have used those for what I'm trying to do.
>
> Thanks,
>
> Itamar.
RE: on-the-fly "filters" from docID lists
Hi Mike and Martin,

We have a similar use-case. Is there a scalability/performance issue with getDocIdSet having to iterate through hundreds of thousands of docIDs?

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
Re: Scoring exact matches higher in a stemmed field
On 22/7/2010 9:20 PM, Shai Erera wrote:
> How is that different than extending QP?

Mainly because the problem I'm having isn't there, and doing it from there doesn't feel right, and definitely not like solving the issue. I want to explore what other options there are before doing anything; I started this thread because I hit a dead end after seeing Similarity can no longer be of help.

> About the "song of songs" example -- the result you describe is already
> what will happen. A document which contains just the word 'song' will
> score lower than a document containing "song of songs".

Incorrect, and I have a sample app to show that (this is how I thought of this example in the first place). Since while indexing the 2 words will be saved into the index as 1:(song song$) 2:(song songs$), short documents with one word "song" will score higher than longer documents with "song of songs". This is a product of Lucene's default tf/idf implementation, which cares about a field's length, and at this stage I want to avoid replacing it (with BM25 for example).

> Also, what I'd do in such a case is search for the phrase (in addition to
> the rest), 'cause documents containing the word "songs" 100 times will
> score higher than the single document that will contain "song of songs"
> once ...

In one of my applications I am providing an "as typed" capability, which does exactly what you are suggesting (looking for the $-ed terms only), but I want my original analyzer (the one that also looks for non-$-ed terms) to do better scoring. Without this the implementation is somewhat broken...

> If you just want a query "abc def" to rank higher if a document contains
> the exact words, then I'd go w/ the QP extension approach ... Still, IMO
> extending QP is the easiest and gives you the control you need.

I am overriding stuff in Lucene either way. I also don't want an exact match of a phrase to rank higher; I want an original term (saved as-is with a $ marker) to score higher than a stemmed / lemmatized one (without the marker). Sorry if the thread's title is misleading. I'd have used payloads if they weren't costly.

So my question is: where do I have control over boosting (or scoring), and also have access to the term's text?

Thanks,

Itamar.
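The weighting rule under discussion can be illustrated outside Lucene. This plain-Java sketch is only a statement of the intent (the class name, method, and the 2x boost factor are assumptions, not Lucene API): a hit on a '$'-marked exact term weighs more than a hit on its stem.

```java
// Sketch of the intended scoring rule: the analyzer emits both a stemmed
// token ("song") and an exact-form token with a '$' marker ("songs$") at
// the same position; at scoring time, a hit on the marked exact form is
// weighted higher than a hit on the bare stem.
public class ExactMatchBoost {
    static final float EXACT_BOOST = 2.0f; // assumed boost factor

    static float termWeight(String indexedTerm, float baseWeight) {
        return indexedTerm.endsWith("$") ? baseWeight * EXACT_BOOST : baseWeight;
    }

    public static void main(String[] args) {
        System.out.println(termWeight("song", 1.0f));   // stemmed hit
        System.out.println(termWeight("songs$", 1.0f)); // boosted exact hit
    }
}
```

Wherever the hook ends up living (QP's newTermQuery, a Similarity subclass, or payloads), the test it applies is this one-line suffix check.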
Using lucene for substring matching
Hi,

I'm about to write an application that does very simple text analysis, namely dictionary-based entity extraction. The alternative is to do in-memory matching with indexOf:

String text; // could be any size, but normally "newspaper length"
List<String> matches;
for (String wordOrPhrase : dictionary) {
    if (text.indexOf(wordOrPhrase) >= 0) {
        matches.add(wordOrPhrase);
    }
}

I am concerned the above code will be quite CPU intensive; it will also be case sensitive and not leave any room for fuzzy matching.

I thought this task could also be solved by indexing every bit of text that is to be analyzed, and then executing a query per dictionary entry (pseudo-code):

lucene.index(text);
List<String> matches;
for (String wordOrPhrase : dictionary) {
    if (lucene.search(wordOrPhrase, text_id) gives a hit) {
        matches.add(wordOrPhrase);
    }
}

I have not used lucene very much, so I don't know if it is a good idea or not to use lucene for this task at all. Could anyone please share their thoughts on this?

Thanks,
Geir
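For reference, a runnable, case-insensitive version of the in-memory loop might look like the following (class and method names are illustrative; note that String.substring takes indices, so the occurrence test uses contains/indexOf instead):

```java
import java.util.*;

public class DictionaryMatcher {
    // Case-insensitive containment check of each dictionary entry against
    // the text; contains() replaces the substring() call from the sketch.
    static List<String> findMatches(String text, List<String> dictionary) {
        String lower = text.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String entry : dictionary) {
            if (lower.contains(entry.toLowerCase(Locale.ROOT))) {
                matches.add(entry);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("Lucene", "Solr", "Nutch");
        System.out.println(findMatches("Apache lucene and SOLR are search tools", dict));
        // prints [Lucene, Solr]
    }
}
```

This stays O(dictionary size x text length) per document, which is the CPU cost the question worries about; an index inverts that trade-off by paying the scan cost once at indexing time.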
Re: on-the-fly "filters" from docID lists
Well, Lucene can apply such a filter rather quickly; but your custom code first has to build it... so it's really a question of whether your custom code can build up / iterate the filter scalably.

Mike
How to get word importance in lucene index
Hi all!

Hmmm, I need to get how important a word is in the entire document collection that is indexed in the lucene index. I need to extract some "representable words" - let's say concepts that are common and can represent the whole collection, or collection "keywords". I did the full-text indexing, and the only field I am using is the text contents, because the titles of the documents are mostly not representative (numbers, codes etc.).

So, if I calculate tf-idf, it gives me the importance of a single term with respect to a single document. But if that word is repeated across documents, how can I calculate its total importance within the index?

All help appreciated!! Thank you!!!
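One simple way to approximate collection-level importance is to sum each term's per-document tf-idf over all documents. This sketch works on plain token lists rather than a real Lucene index, and the class/method names and the log(N/df) idf convention are assumptions for illustration:

```java
import java.util.*;

// Rank terms for a whole collection by summing tf * idf over documents.
// Terms appearing in every document get idf = log(N/N) = 0, so ubiquitous
// words drop out and distinctive-but-common terms float to the top.
public class CollectionKeywords {
    static Map<String, Double> aggregateTfIdf(List<List<String>> docs) {
        // document frequency: number of docs containing each term
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) df.merge(term, 1, Integer::sum);
        }
        int n = docs.size();
        // sum tf-idf contributions per term across all docs
        Map<String, Double> scores = new HashMap<>();
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) tf.merge(term, 1, Integer::sum);
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
        }
        return scores;
    }
}
```

With a real index the same sums can be computed from the term dictionary (docFreq) and term vectors or TermDocs, without re-tokenizing the collection.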
Databases
Hi,

Normally, when I am building my index directory for indexed documents, I keep my indexed files simply in a directory called 'filesToIndex'. In this case, I do not use any standard database management system such as MySQL or any other.

1) Will it be possible to use MySQL or any other database for the purpose of managing indexed documents in Lucene?

2) Is it necessary to follow such a methodology with Lucene?

3) If we do not use such a database management system, will there be any disadvantages with a large number of indexed files?

Appreciate any reply from you.
Thanks,
Manjula.
Re: Databases
LuSql is a tool specifically oriented to extracting from JDBC-accessible databases and indexing the contents. You can find it here:
http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
User manual: http://cuvier.cisti.nrc.ca/~gnewton/lusql/v0.9/lusqlManual.pdf.html

A new version is coming out in the next month, but the existing one should be fine for what you have described. If you have any questions, just let me know.

Note that if you are interested in using Solr for your application, the data import handler (DIH) is a very flexible way of doing what you are describing, in a Solr context:
http://wiki.apache.org/solr/DataImportHandler

Thanks,
-Glen Newton
LuSql author
http://zzzoot.blogspot.com/
Reverse Lucene queries
Hi all, I have an interesting problem: instead of going from a query to a document collection, is it possible to come up with the best-fit query for a given document collection (result set)? "Best fit" being a query which maximizes the hit scores of the resulting document collection. How should I approach this? All suggestions appreciated. Thanks Shashi
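Lucene's contrib MoreLikeThis is the closest built-in tool for this: it extracts the highest-weighted terms from example documents and assembles a query from them. Below is a minimal, dependency-free sketch of the core idea, ranking terms by raw frequency across the collection. The class and method names are made up for illustration, and the real MoreLikeThis additionally weights terms by IDF and filters stop words:

```java
import java.util.*;

// Sketch: derive a "best fit" query by picking the most frequent terms
// across a small document collection. This approximates what Lucene's
// contrib MoreLikeThis does (minus IDF weighting and stop-word removal).
public class BestFitQuery {

    // Return the top-n most frequent terms, most frequent first.
    public static List<String> topTerms(List<String> docs, int n) {
        Map<String, Integer> freq = new HashMap<String, Integer>();
        for (String doc : docs) {
            for (String term : doc.toLowerCase().split("\\W+")) {
                if (term.length() == 0) continue;
                Integer c = freq.get(term);
                freq.put(term, c == null ? 1 : c + 1);
            }
        }
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<Map.Entry<String, Integer>>(freq.entrySet());
        // Sort by descending frequency.
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a,
                               Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue();
            }
        });
        List<String> top = new ArrayList<String>();
        for (int i = 0; i < n && i < entries.size(); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "lucene is a search library",
            "lucene indexes documents",
            "search with lucene");
        // "lucene" appears in all three docs, so it ranks first.
        System.out.println(topTerms(docs, 2)); // [lucene, search]
    }
}
```

The resulting terms could then be OR-ed into a BooleanQuery and scored against the collection to check how well they recover it.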
Re: on-the-fly "filters" from docID lists
Re scalability of filter construction - the database is likely to hold stable primary keys, not Lucene doc IDs, which are unstable in the face of updates. You therefore need a quick way of converting the stable database keys read from the db into current Lucene doc IDs to create the filter. That could involve a lot of disk seeks unless you cache a pk->docID lookup in RAM. You should use CachingWrapperFilter too, to cache the computed user permissions from one search to the next. This can get messy. If the access permissions are centred around roles/groups it is normally faster to tag docs with these group names and query them with the list of roles the user holds. If individual user-doc-level perms are required you could also consider dynamically looking up perms for just the top n results being shown, at the risk of needing to repeat the query with a larger n if insufficient matches pass the lookup. Cheers Mark On 23 Jul 2010, at 01:55, Michael McCandless wrote: > Well, Lucene can apply such a filter rather quickly; but, your custom > code first has to build it... so it's really a question of whether > your custom code can build up / iterate the filter scalably. > > Mike > > On Thu, Jul 22, 2010 at 4:37 PM, Burton-West, Tom wrote: >> Hi Mike and Martin, >> >> We have a similar use-case. Is there a scalability/performance issue with >> the getDocIdSet having to iterate through hundreds of thousands of docIDs? >> >> Tom Burton-West >> http://www.hathitrust.org/blogs/large-scale-search >> >> -Original Message- >> From: Michael McCandless [mailto:luc...@mikemccandless.com] >> Sent: Thursday, July 22, 2010 5:20 AM >> To: java-user@lucene.apache.org >> Subject: Re: on-the-fly "filters" from docID lists >> >> It sounds like you should implement a custom Filter? >> >> Its getDocIdSet would consult your foreign key-value store and iterate >> through the allowed docIDs, per segment.
>> >> Mike >> >> On Wed, Jul 21, 2010 at 8:37 AM, Martin J wrote: >>> Hello, we are trying to implement a query type for Lucene (with eventual >>> target being Solr) where the query string passed in needs to be "filtered" >>> through a large list of document IDs per user. We can't store the user ID >>> information in the lucene index per document so we were planning to pull the >>> list of documents owned by user X from a key-value store at query time and >>> then build some sort of filter in memory before doing the Lucene/Solr query. >>> For example: >>> >>> content:"cars" user_id:X567 >>> >>> would first pull the list of docIDs that user_id:X567 has "access" to from a >>> keyvalue store and then we'd query the main index with content:"cars" but >>> only allow the docIDs that came back to be part of the response. The list of >>> docIDs can near the hundreds of thousands. >>> >>> What should I be looking at to implement such a feature? >>> >>> Thank you >>> Martin
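To make the pk->docID caching Mark describes concrete, here is a minimal pure-Java sketch. The pkToDocId map stands in for an in-RAM cache you would rebuild whenever the index reader is reopened (e.g. by scanning a stored primary-key field once); the class name and surrounding plumbing are hypothetical, and a real implementation would hand the bits back as a DocIdSet from a custom Filter's getDocIdSet, per segment:

```java
import java.util.*;

// Sketch of the pk -> docID translation step: the key-value store returns
// stable primary keys, but a Lucene Filter needs current doc IDs. Keeping
// the mapping in RAM avoids a disk seek per key. Class and field names
// here are illustrative, not a real Lucene API.
public class AllowedDocsFilter {
    private final Map<String, Integer> pkToDocId;

    // pkToDocId: cache built once per index (re)open.
    public AllowedDocsFilter(Map<String, Integer> pkToDocId) {
        this.pkToDocId = pkToDocId;
    }

    // Translate the user's allowed primary keys into a doc-ID bit set --
    // the shape a Filter.getDocIdSet implementation would return.
    public BitSet allowedDocs(Collection<String> allowedPks) {
        BitSet bits = new BitSet();
        for (String pk : allowedPks) {
            Integer docId = pkToDocId.get(pk); // in-RAM lookup, no disk seek
            if (docId != null) {               // pk may have been deleted
                bits.set(docId);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        Map<String, Integer> cache = new HashMap<String, Integer>();
        cache.put("pk-1", 0);
        cache.put("pk-2", 5);
        cache.put("pk-3", 9);
        AllowedDocsFilter f = new AllowedDocsFilter(cache);
        BitSet bits = f.allowedDocs(Arrays.asList("pk-1", "pk-3", "pk-missing"));
        System.out.println(bits); // {0, 9}
    }
}
```

Wrapping such a filter in CachingWrapperFilter then amortizes the per-user bit-set construction across repeated searches, as suggested above.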