Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
- leave the current implementation, raising an exception;
- handle the exception and limit the boolean query to the first 1024 (or whatever the limit is) terms;
- select, among the possible terms, only the 1024 (or whatever the limit is) most meaningful ones, leaving out all the others.

I like this idea, and I would finalize it like this: I'd also create a default rule so that people who are happy with the default behavior don't have to handle exceptions: keep and search for only the longest 1024 fragments. It will drop short expansions like a, an, at, and, add, etc., but it will automatically keep 1024 longer variations like alpha, alfa, advanced, automatical, etc. So it will automatically lower the search overhead and will still search fine without throwing exceptions. (For people who prefer the widest search range and do not care about the huge overhead, we could leave a boolean switch for keeping not the longest but the shortest fragments.)
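The "keep only the longest 1024 fragments" rule proposed above can be sketched in a few lines. Python is used here purely for illustration (Lucene itself is Java, and the 1024 default corresponds to BooleanQuery's maximum clause count):

```python
def keep_longest_terms(expanded_terms, limit=1024):
    """Keep only the `limit` longest expansions of a wildcard/prefix query,
    dropping the shortest ones first (the rule proposed above)."""
    return sorted(expanded_terms, key=len, reverse=True)[:limit]
```

The boolean switch mentioned at the end would simply flip `reverse=False` to keep the shortest fragments instead.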
Anyone implemented custom hit ranking?
Hi! I have problems with short text ranking. I've read about the same ranking problems in the list archives, but found only hints and thoughts (adjust DefaultSimilarity, Similarity, etc...), not complete solutions with source code. Has anyone implemented a good solution for this problem? (Example: my search application returns about 10-20 pages of 1-2 word hits for hello, and then it starts to list the longer texts.) I've implemented a very simple solution: I boost documents shorter than 300 chars with 1/300*doclength at index time. Now it works a lot better. In fact, I can't see any problems now. Anyway, I think this is not the solution; this is a patch or workaround. So, I'd be interested in some kind of well designed, complete solution for this problem. Regards, Sanyi
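The index-time workaround described above (down-boosting documents shorter than 300 characters by 1/300*doclength) amounts to a simple boost function; a sketch, with the 300-char threshold as a parameter:

```python
def short_doc_boost(doc_length_chars, threshold=300):
    """Index-time boost described above: documents shorter than `threshold`
    characters get boost doc_length/threshold (< 1), damping very short texts;
    longer documents are left at the neutral boost of 1.0."""
    if doc_length_chars < threshold:
        return doc_length_chars / threshold
    return 1.0
```

In Lucene this value would be applied via the document (or field) boost at indexing time.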
How to efficiently get # of search results, per attribute
I'd like to implement a search across several types of entities, let's say classes, professors, and departments. I want the user to be able to enter a simple, single query and not have to specify what they're looking for. Then I want the search results to be something like this: Search results for: philosophy boyer Found: 121 classes - 5 professors - 2 departments search results here... I know I could iterate through every hit returned and count them up myself, but that seems inefficient if there are lots of results. Is there some other way to get this kind of information from the search result set? My other ideas are: doing a separate search for each result type, or storing different types in different indexes. Any suggestions? Thanks for your help! -Chris
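The iterate-and-count approach mentioned above is just a tally over a stored type field; a minimal sketch (the dict-per-hit representation is an assumption for illustration):

```python
def count_by_type(hits):
    """Post-search tally: count hits per entity type.
    Each hit is assumed to carry a stored 'type' field."""
    counts = {}
    for hit in hits:
        t = hit["type"]
        counts[t] = counts.get(t, 0) + 1
    return counts
```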
Re: How to efficiently get # of search results, per attribute
It depends on how many results they're looking through; here are two scenarios I see: 1] If you don't have that many records, you can fetch all the results and then do a post-parsing step to determine the totals. 2] If you have a lot of entries in each category and you're worried about fetching thousands of records every time, you can have separate indices per category and search them in parallel (not Lucene Parallel Search), fetching up to 100 hits for each one (for efficiency) while still getting the total from each search to display. Either way, you can boost speed using RAMDirectory if you need more speed from the search, but whichever approach you choose, I would recommend that you sit down and do some number crunching to figure out which way to go. Hope this helps, Nader Henein
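Scenario 2 above (one index per category, capped hit fetch, full totals) can be sketched as follows; the per-category search interface here is hypothetical, standing in for a real per-index Lucene search:

```python
def search_all_categories(searchers, query, per_category_limit=100):
    """Run the query against each category's index and collect totals.

    `searchers` maps category name -> a search function returning
    (total_hit_count, top_hits) for a query (hypothetical interface).
    Only up to `per_category_limit` hits are fetched per category,
    but the full total is kept for display."""
    results = {}
    for category, search in searchers.items():
        total, hits = search(query, per_category_limit)
        results[category] = {"total": total, "hits": hits}
    return results
```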
Re: lucene Scorers
On Friday 12 November 2004 22:56, Chuck Williams wrote: I had a similar need and wrote MaxDisjunctionQuery and MaxDisjunctionScorer. Unfortunately these are not available as a patch, but I've included the original message below that has the code (modulo line breaks added by the plain-text email format). This code is functional -- I use it in my app. It is optimized for its stated use, which involves a small number of clauses. You'd want to improve the incremental sorting (e.g., using the bucket technique of BooleanQuery) if you need it for large numbers of clauses. If you're interested, you can also have a look here for yet another DisjunctionScorer: http://issues.apache.org/bugzilla/show_bug.cgi?id=31785 It has the advantage that it implements skipTo(), so that it can be used as a subscorer of ConjunctionScorer, i.e. it can be faster in situations like this: aa AND (bb OR cc), where bb and cc are treated by the DisjunctionScorer. When aa is a filter, this can also be used to implement a filtering query. Re. Paul's suggested steps below, I did not integrate this with the query parser, as I didn't need that functionality (since I'm generating the multi-field expansions, for which max is a much better scoring choice than sum). Chuck Included message: -Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Monday, October 11, 2004 9:55 PM To: [EMAIL PROTECTED] Subject: Contribution: better multi-field searching The files included below (MaxDisjunctionQuery.java and MaxDisjunctionScorer.java) provide a new mechanism for searching across multiple fields. The maximum indeed works well, also when the fields differ a lot in length. Regards, Paul
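The scoring idea behind MaxDisjunctionQuery -- take the maximum of the per-field scores rather than their sum -- can be sketched numerically. The optional tie-breaker term below is an assumption for illustration (it lets documents matching in several fields edge out single-field matches), not necessarily part of Chuck's code:

```python
def max_disjunction_score(field_scores, tie_breaker=0.0):
    """Document score = max over per-field scores, plus tie_breaker times
    the sum of the remaining (non-maximum) field scores."""
    if not field_scores:
        return 0.0
    m = max(field_scores)
    return m + tie_breaker * (sum(field_scores) - m)
```

With `tie_breaker=0.0` this is the pure max that Chuck and Paul discuss; with `tie_breaker=1.0` it degenerates back to BooleanQuery's sum.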
about Stemming
Hi, I have used the Lucene demos and I want to know how stemming can be added to my applications. -- Miguel Angel Angeles R. Asesoria en Conectividad y Servidores Telf. 97451277
Re: about Stemming
Miguel Angel schrieb: Hi, I have used the Lucene demos and I want to know how stemming can be added to my applications. Have a look at the Lucene sandbox. Under contributions there are stemmers for many different languages.
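The sandbox stemmers are Java analyzer components (token filters applied at both index and query time). The core idea of suffix-stripping can be shown in a toy form; this simplistic rule list is purely illustrative and is not the Porter algorithm the sandbox stemmers implement:

```python
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 chars of stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word
```

The important point is that the same stemmer must run on both documents and queries, so that "searching" in a query matches "searched" in a document via the shared stem.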
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
On Saturday 13 November 2004 09:16, Sanyi wrote: Keep and search for only the longest 1024 fragments, so it'll throw a,an,at,and,add,etc.., but it'll automatically keep 1024 variations like alpha,alfa,advanced,automatical,etc.. Wouldn't it be counterintuitive to only use the longest matches for truncations? To get only longer matches, one can also use queries with multiple ? characters, each matching exactly one character. I think it would be better to encourage the users to use longer, and maybe also more, prefixes. This gives more precise results and is more efficient to execute. Regards, Paul Elschot
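Paul's point about ? matching exactly one character can be illustrated with glob-style matching; Python's fnmatch happens to follow the same ?/* convention as Lucene's wildcard queries:

```python
from fnmatch import fnmatchcase

def matches(term, pattern):
    """'?' matches exactly one character; '*' matches any run (including empty)."""
    return fnmatchcase(term, pattern)
```

So a query like a???* only expands to terms of at least four characters, which achieves the "drop the shortest fragments" effect without any new limit-handling rule.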
RE: How to efficiently get # of search results, per attribute
My Lucene application includes multi-faceted navigation that does a more complex version of the below. I've got 5 different taxonomies into which every indexed item is classified. The largest of the taxonomies has over 15,000 entries while the other 4 are much smaller. For every search query, I determine the best small set of nodes from each taxonomy to present to the user as drill-down options, and provide the counts regarding how many results fall under each of these nodes. At present I only have about 25,000 indexed objects and usually no more than 1,000 results from the initial query. To determine the drill-down options and counts, I scan up to 1,000 results, computing the counts for all nodes into which these results classify. Then for each taxonomy I pick the best drill-down options available (an orthogonal set with a reasonable branching factor that covers all results) and present them with their counts. If there are more than 1,000 results, I extrapolate the computed counts to estimate the actual counts on the entire set of results. This is all done with a single index and a single search. The total time required for performing this computation for the one large taxonomy is under 10ms, running in full debug mode in my IDE. The query response time overall is subjectively instantaneous at the UI (Google-speed or better). So, unless some dimension of the problem is much bigger than mine, I doubt performance will be an issue.
Chuck
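Chuck's scan-and-extrapolate scheme can be sketched as follows; the list-of-node-ids-per-result representation is an assumption for illustration:

```python
def drilldown_counts(result_node_ids, total_hits, sample_limit=1000):
    """Tally taxonomy-node counts over a sample of results, then extrapolate.

    result_node_ids: per-result lists of taxonomy node ids (the scanned sample).
    total_hits: total number of results for the query.
    If there are more hits than the sample covers, counts are scaled up
    proportionally to estimate counts over the full result set."""
    counts = {}
    sample = result_node_ids[:sample_limit]
    for nodes in sample:
        for n in nodes:
            counts[n] = counts.get(n, 0) + 1
    if sample and total_hits > len(sample):
        scale = total_hits / len(sample)
        counts = {n: round(c * scale) for n, c in counts.items()}
    return counts
```

The extrapolation step is what keeps the per-query cost bounded: only the first `sample_limit` results are ever scanned, regardless of the total hit count.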
Re: How to efficiently get # of search results, per attribute
Nader and Chuck, Thanks for the responses; they're both helpful. My index sizes will begin on the order of 200,000 classes and 20,000 instructors (and far fewer departments), and grow over time to maybe a few million classes. Compared to some of the numbers I've seen on this mailing list, my dataset is fairly small. I think I'll not worry about performance for now, unless it becomes an issue. -Chris
RE: Anyone implemented custom hit ranking?
I've done some customization of scoring/ranking and plan to do more. A good place to start is with your own Similarity, extending Lucene's DefaultSimilarity. Like you, I found the default length normalization to not work well with my dataset. I separately weight each indexed field according to a static relative importance (implemented as a query boost factor that is automatically applied) and then disable length normalization altogether by redefining lengthNorm() to always return 1.0f. I also had problems with tf and idf normalization, especially with idf dominating the ranking determination. To address that, my Similarity increases the base of the log for each, and adds a final square root to the idf computation, since Lucene squares the idf in the score computations. Have you tried the explain() mechanism? It is a great way to see precisely how your results are being scored (but be warned there is a final normalization in Hits that explain() does not show -- this final normalization does not affect the ranking order, but it does affect the final scores). Chuck
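Chuck's two adjustments -- lengthNorm pinned to 1.0 and a damped idf -- can be sketched numerically. The exact damped formula below is an assumption chosen to match his description (larger log base plus a final square root), not his actual code:

```python
import math

def length_norm(num_terms):
    """Disable length normalization entirely: every field length scores the same."""
    return 1.0

def damped_idf(doc_freq, num_docs, base=10.0):
    """A larger log base flattens the idf curve; the final square root damps it
    further, compensating for Lucene effectively squaring idf in the score."""
    return math.sqrt(math.log(num_docs / (doc_freq + 1), base) + 1.0)
```

Rarer terms still get a higher idf than common ones, but the spread between them is compressed, so idf no longer dominates the ranking.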
Mozilla Desktop Search
http://www.peerfear.org/rss/permalink/2004/11/13/MozillaDesktopSearch/ The Mozilla Foundation may be considering a desktop search implementation http://computerworld.com/developmenttopics/websitemgmt/story/0,10801,97396,00.html?f=x10 : Having launched the much-awaited Version 1.0 of the Firefox browser yesterday (see story), The Mozilla Foundation is busy planning enhancements to the open-source product, including the possibility of integrating it with a variety of desktop search tools. The Mozilla Foundation also wants to place Firefox in PCs through reseller deals with PC hardware vendors and continue to sharpen the product's pop-up ad-blocking technology. I'm not sure this is a good idea. Maybe it is, though. The technology just isn't there for cross-platform search. I'd have to suggest using Lucene, compiled natively with GCJ into XPCOM components, but I'm not sure if GCJ is up to the job here. If this approach is possible, then I'd be very excited. One advantage to this approach is that an HTTP server wouldn't be necessary, since you're already within the browser. Good for everyone involved. No bloated Tomcat causing problems, and blazingly fast access within the browser. Also, since TCP isn't involved, you could fail gracefully when the search service isn't running; you could just start it. -- Kevin A. Burton, Location - San Francisco, CA AIM/YIM - sfburtonator, Web - http://peerfear.org/