scoring algorithm
Hi, what is the purpose of tf_q * idf_t / norm_q in Lucene's scoring algorithm: score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t ) I don't understand why the score has to be higher when the frequency of a term in the query is higher. What is normalized by norm_q? Thanks, Chris - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
taxonomy with lucene
Hi, Has anyone tried building taxonomies in Lucene? Any idea what is the likely approach to be taken? Thanks
Re: Lucene demo ideas?
On Wed, Sep 17, 2003 at 08:00:42AM -0400, Erik Hatcher wrote: I'm about to start some refactorings on the web application demo that ships with Lucene to show off its features and be usable more easily and cleanly out of the box - i.e. just drop into Tomcat's webapps directory and go. Does anyone have any suggestions on what they'd like to see in the demo app? One odd thought (may be out of scope) is to put together a google-flavored query language, since most users are going to be unfamiliar with the default Lucene query language. Lucene doesn't really match google, but something google-flavored might be better at showing off Lucene's features in the demo. -- Steven J. Owens [EMAIL PROTECTED] I'm going to make broad, sweeping generalizations and strong, declarative statements, because otherwise I'll be here all night and this document will be four times longer and much less fun to read. Take it all with a grain of salt. - Me at http://darksleep.com
Re: taxonomy with lucene
Has anyone tried building taxonomies in Lucene? Any idea what is the likely approach to be taken?

I'm storing data with a hierarchical classification in a Lucene index, if that is what you mean. The approach is very simple. Every document has a field for a unique identifier, a field for the identifier of its immediate parent, and a field for those of all ancestors. This allows you to write queries such as name:human ancestor:2759 to find organisms that have human in their name and are Eukaryotes (but not, say, viruses). This approach also allows you to efficiently display search results in a tree, even for very large result sets, as long as the hierarchy doesn't get too flat. One drawback of this approach is that doing incremental updates is not possible, or at least very complicated (due to the duplicated information in the ancestor field), and you must be careful about the order in which you add documents to the index (parent before child). Let me know if you are interested in any further details. -- Eric Jain
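The ancestor-field idea above can be sketched without any Lucene dependency: given each record's immediate parent, the value of the ancestor field is just the collected chain of parent identifiers. The class, method, and identifier values below are invented for illustration; the original post doesn't show its actual schema.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of how the "ancestor" field described above could be filled in:
// walk the parent chain and collect every ancestor id. All names illustrative.
public class AncestorField {
    public static List<String> ancestors(String id, Map<String, String> parentOf) {
        List<String> result = new ArrayList<>();
        String parent = parentOf.get(id);
        while (parent != null) {
            result.add(parent);           // each ancestor becomes a token in the field
            parent = parentOf.get(parent);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("9606", "9605");     // human -> Homo
        parentOf.put("9605", "2759");     // Homo -> Eukaryota
        // A query like "name:human ancestor:2759" can match because 2759
        // would be stored in the ancestor field of document 9606.
        System.out.println(String.join(" ", ancestors("9606", parentOf)));
    }
}
```

Note the drawback mentioned in the post: if a node's parent changes, every descendant's ancestor field must be recomputed, which is why incremental updates are painful.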
Proposition :adding minMergeDoc to IndexWriter
Hui, Concerning another point of your request list: I proposed a patch this weekend on the lucene-dev list, and I totally forgot that this feature was requested on the user list. This new feature should help you set the number of Documents to be merged in memory independently of the mergeFactor. Any comments would be appreciated. Best regards, Julien Nioche http://www.lingway.com

-- Beginning of original message --- From: fp235-5 [EMAIL PROTECTED] To: lucene-dev [EMAIL PROTECTED] Cc: Date: Sat, 20 Sep 2003 16:06:06 +0200 Subject: [PATCH] IndexWriter : controlling the number of Docs merged

Hello, Someone made a suggestion yesterday about adding a variable to IndexWriter in order to control the number of Documents merged in RAMDirectory independently of the mergeFactor. (I'm sorry, I don't remember who exactly, and the mail arrived at my office.) I'm proposing a tiny modification of IndexWriter to add this functionality. A variable minMergeDocs specifies the number of Documents to be merged in memory before starting a new Segment. The mergeFactor still controls the number of Segments created in the Directory, and thus it's possible to avoid the file number limitation problem. The diff file is attached. As noticed by Dmitry and Erik, there are no true JUnit tests. I'd be OK to write a JUnit test for this feature. The problem is that the SegmentInfos field is private in IndexWriter and can't be used to check the number and size of the Segments. I ran a test using the infoStream variable of IndexWriter - everything seems to be OK. Any comments / suggestions are welcome. Regards, Julien

- Original Message - From: hui [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Monday, September 22, 2003 3:40 PM Subject: Re: per-field Analyzer (was Re: some requests) Good work, Erik.
Hui - Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Saturday, September 20, 2003 4:13 AM Subject: per-field Analyzer (was Re: some requests)

On Friday, September 19, 2003, at 07:45 PM, Erik Hatcher wrote: On Friday, September 19, 2003, at 11:15 AM, hui wrote:

1. Move the Analyzer down to field level from document level so some fields could have a special analyzer applied. Other fields still use the default analyzer from the document level. For example, I do not need to index the numbers in the content field. It helps me reduce the index size a lot when I have some Excel files. But I always need the created_date to be indexed even though it is a number field. I know there are some workarounds posted in the group, but I think it would be a good feature to have.

The workaround is to write a custom analyzer and have it do the desired thing per-field.

Hmmm, just thinking out loud here without knowing if this is possible, but could a generic wrapper Analyzer be written that allows other analyzers to be used under the covers based on a field name/analyzer mapping? If so, that would be quite cool and save folks from having to write custom analyzers as much to handle this pretty typical use-case. I'll look into this more in the very near future personally, but feel free to have a look at this yourself and see what you can come up with. What about something like this?
public class PerFieldWrapperAnalyzer extends Analyzer {
    private Analyzer defaultAnalyzer;
    private Map analyzerMap = new HashMap();

    public PerFieldWrapperAnalyzer(Analyzer defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    public void addAnalyzer(String fieldName, Analyzer analyzer) {
        analyzerMap.put(fieldName, analyzer);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        Analyzer analyzer = (Analyzer) analyzerMap.get(fieldName);
        if (analyzer == null) {
            analyzer = defaultAnalyzer;
        }
        return analyzer.tokenStream(fieldName, reader);
    }
}

This would allow you to construct a single analyzer out of others, on a per-field basis, including a default one for any fields that do not have a special one. Whether the constructor should take the map or the addAnalyzer method is implemented is debatable, but I prefer the addAnalyzer way. Maybe addAnalyzer could return 'this' so you could chain: new PerFieldWrapperAnalyzer(new StandardAnalyzer()).addAnalyzer(field1, new WhitespaceAnalyzer()).addAnalyzer(.). And I'm more inclined to call this thing PerFieldAnalyzerWrapper instead. Any naming suggestions? This simple little class would seem to be the answer to a very commonly asked question. Thoughts? Should this be made part of the core? Erik
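Erik's wrapper is, at its core, a map-with-default dispatch. The same pattern can be exercised in isolation with plain strings standing in for Analyzer instances, so it runs without Lucene; the class and method names below are invented for this sketch, not Lucene API:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone model of PerFieldWrapperAnalyzer's dispatch: look up a
// per-field handler, falling back to a default. Strings stand in for
// Analyzer instances purely for illustration.
public class PerFieldDispatch {
    private final String defaultHandler;
    private final Map<String, String> handlers = new HashMap<>();

    public PerFieldDispatch(String defaultHandler) {
        this.defaultHandler = defaultHandler;
    }

    // Returning 'this' enables the chained style suggested in the post.
    public PerFieldDispatch addHandler(String field, String handler) {
        handlers.put(field, handler);
        return this;
    }

    public String handlerFor(String field) {
        return handlers.getOrDefault(field, defaultHandler);
    }

    public static void main(String[] args) {
        PerFieldDispatch d = new PerFieldDispatch("standard")
                .addHandler("created_date", "whitespace");
        System.out.println(d.handlerFor("contents"));     // no override: default
        System.out.println(d.handlerFor("created_date")); // per-field override
    }
}
```

The chaining question in the post comes down to exactly this return-this choice.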
Limitation in size of Query class.
Hi, all. In the Query class and its subclasses, are there any limitations in size? Thanks in advance. -- Cecilio Cano Calonge
RE: Confusion over wildcard search logic
BTW, this is with Lucene 1.2. Thanks!
Re: Confusion over wildcard search logic
Ah, this is a fun one: lots of fiddly issues with how queries work and how QueryParser works. I'll take a stab at some of these inline below.

On Monday, September 22, 2003, at 08:26 PM, Dan Quaroni wrote:

I have a simple command line interface for testing.

Interesting interface. Looks like something that, if made generic enough, would be handy to have at least in the sandbox.

I'm getting some odd results, though, with certain logic of wildcard searches.

Not all your queries are truly WildcardQuery's, though. Look at the class it constructed to get a better idea of what is happening.

It seems like depending on what order I put the fields of the query in alters the results drastically when I AND them together.

Not quite the right explanation of what is happening. More below.

*** This one makes sense
Query name:amb* State california name:amb* [EMAIL PROTECTED] amb* 2819 total matching documents

Right. QueryParser does a little optimization here, and anything with a simple trailing * turns into a PrefixQuery, meaning all name fields that begin with amb.

*** This is the REALLY confusing one. We know there's a company named AMB Property Corporation. Why do I get NO hits?
Query name:amb prop* State california name:amb prop* [EMAIL PROTECTED] amb prop 0 total matching documents

Notice you're now in PhraseQuery land. Wildcards don't work like you seem to expect here. What is really happening here is a query for documents that have amb and prop terms side by side in that order. The asterisk got axed by the analyzer. If you said name:amb name:prop* you'd get some hits, I believe, as it would turn into a boolean query with a term and wildcard queries either OR'd or AND'd together. PhraseQuery does not support wildcards. A custom subclass of QueryParser could do some interesting things here and expand wildcard-like terms like this in a phrase into PhrasePrefixQuery, but that is probably overkill here (although maybe not). Look at the test case for PhrasePrefixQuery for some hints.

Ok, so I get some results with this (I know the * isn't necessary at the end of property, but bear with me for the next example where it goes all screwy):
Query name:amb property* State california name:amb property* [EMAIL PROTECTED] amb name:amb property*:property* 56 total matching documents

Your default field for QueryParser is property*? Odd field name, or is the output fishy? I'm a bit confused by the property*: there. I'm assuming you're outputting the Query.toString here. See above for a different way to phrase the query.

*** south san francisco is an exact match to the city. Why does this find 0 results??!
Query name:amb property* AND city:south san francisco State california name:amb property* AND city:south san francisco [EMAIL PROTECTED] amb +name:amb property* AND city:south san francisco:property* +city:south name:amb property* AND city:south san francisco:san name:amb property* AND city:south san francisco:francisco 0 total matching documents

With all the AND's going on, this makes sense, because san and francisco end up as separate term queries. You'd have to say city:south san francisco to turn it into a PhraseQuery.

Do this and suddenly I get matches:
Query name:amb propert* and city:south san fran* State california name:amb propert* and city:south san fran* [EMAIL PROTECTED] amb name:amb propert* and city:south san fran*:propert* city:south san fran 56 total matching documents

You're getting hits on the wildcard match at least, and probably on name field amb as well. Again, phrase queries don't support wildcards like you've done here with south san fran*, so you're not matching anything with that.

* And look, this gets matches too:
Query name:amb propert* and city:south san* State california name:amb propert* and city:south san* [EMAIL PROTECTED] amb propert city:south san 10732 total matching documents

My guess here is you're getting hits on south san as a phrase query. Are there that many in that area?

* Yet do this and we're back to 0 results:
Query name:amb propert* and city:south san fran* State california name:amb propert* and city:south san fran* [EMAIL PROTECTED] amb propert city:south san fran 0 total matching documents

You're getting zero hits from amb propert* since * is getting stripped by the analyzer and there is no amb propert phrase match, and with the AND (which should be all uppercase, right?) definitely not getting hits.

** Now flip the query around and it works:
Query city:south san fran* and name:amb propert* State california city:south san fran* and name:amb propert* [EMAIL PROTECTED] city:south san fran amb city:south san fran* and name:amb propert*:propert* 56 total matching documents

You didn't quite flip it around; you took off some quotes.
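Erik's point that a phrase like amb prop only matches literal adjacent terms, in order, can be sketched in plain Java. This models the observable behavior of a phrase match, not Lucene's actual PhraseQuery implementation, and the document contents are invented:

```java
import java.util.Arrays;
import java.util.List;

// Model of what a phrase query demands: the terms must occur side by side,
// in order, with no wildcard expansion. Document terms are illustrative.
public class PhraseMatchSketch {
    static boolean phraseMatch(List<String> docTerms, List<String> phrase) {
        for (int i = 0; i + phrase.size() <= docTerms.size(); i++) {
            if (docTerms.subList(i, i + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("amb", "property", "corporation");
        // "amb prop" as a phrase: there is no literal term "prop", so no
        // match -- the analyzer stripped the '*', and phrases don't expand prefixes.
        System.out.println(phraseMatch(doc, Arrays.asList("amb", "prop")));
        // "amb property" matches: both terms present, adjacent, in order.
        System.out.println(phraseMatch(doc, Arrays.asList("amb", "property")));
    }
}
```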
Re: Proposition :adding minMergeDoc to IndexWriter
It is great, Julien. Thanks. Next time I will post the requests to the developer group. Regards, Hui

- Original Message - From: Julien Nioche [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Tuesday, September 23, 2003 5:38 AM Subject: Proposition :adding minMergeDoc to IndexWriter
Re: Confusion over wildcard search logic
Erik's analysis is comprehensive and useful. I think this example reflects a common (and understandable) oversight: that wildcards do *not* work within a phrase. Got caught on that many times myself. Also, there may be confusion about the format field:(term1 term2), in that the examples provided don't seem to make use of parentheses. Finally, as I recall, there were some bugs with certain wildcard patterns in 1.2. Regards, Terry

- Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Monday, September 22, 2003 10:33 PM Subject: Re: Confusion over wildcard search logic
RE: Confusion over wildcard search logic
Yeah, thanks a lot for your help! I'm using the release version of Lucene, version 1.2.

not all your queries are truly WildcardQuery's though. look at the class it constructed to get a better idea of what is happening.

Yeah, I printed the queries out to see what was going on and noticed that. I used wildcard in the subject as a sort of general description of what I was after rather than its technical meaning to Lucene, but I was curious about why I was getting the query types I was getting.

your default field for QueryParser is property*? Odd field name, or is the output fishy? I'm a bit confused by the property*: there. I'm assuming you're outputting the Query.toString here.

That's just the Query.toString, yeah. I didn't set any default fields to property*, so whatever it's spitting out comes from the QueryParser.

Query name:amb propert* and city:south san fran*

you're getting hits on the wildcard match at least, and probably on name field amb as well. again, phrase queries don't support wildcards like you've done here with south san fran* so you're not matching anything with that.

Ok... What's the correct procedure for doing a multi-word wildcard where I want it to begin with south san fran but not get anything else that contains south or san? Just AND together south, san, and fran? Although this might produce good results, my understanding was that booleans retrieve all matches and store them in memory, then resolve the booleans. If I use the term san to search California, I'm going to need a lot of memory to store all of the temporary results...! Or is that only true when doing booleans on different fields? If so, I think we have our solution! :)

I'm still confused by the output of propert*: here - are you using the CVS version of Lucene?

The honest to goodness 1.2 release. :)

I hope my above analysis helps. I may not be perfectly right on everything, but should be relatively close at identifying the issues.
Fixing it is more up to how you want to deal with it. Perhaps a custom QueryParser is more what you're after.

Well, I think the real key piece of information was to drop the quotes to avoid the phrase queries. Thanks!
RE: Confusion over wildcard search logic
Your email prompted me to re-read the query parser documentation. There are only two examples using parentheses, which seem to be the answer to my questions. They are: (jakarta OR apache) AND website And title:(+return +pink panther) These leave a lot unanswered, though. I mean, for example, what would happen if the query were: title:(+return +pink panther) or title:(return*pink panther) I.e. are the + or booleans required between each word inside the parentheses? I guess the answer is that I need to just play with it and find out, but as others have mentioned, the documentation is lacking in some respects and I'd say this is one of them... Maybe I'll submit some answers when I figure them out. :)
Re: Confusion over wildcard search logic
Better yet, submit some JUnit test cases that show how this stuff works, if the ones in Lucene's codebase aren't comprehensive enough. This is an excellent way to play with an API, get a good understanding of it, and document it at the same time. Erik

On Tuesday, September 23, 2003, at 10:25 AM, Otis Gospodnetic wrote: Hello, I guess the answer is that I need to just play with it and find out, but as others have mentioned, the documentation is lacking in some respects and I'd say this is one of them... Maybe I'll submit some answers when I figure them out. :) Thank you, always appreciated. Otis
Multiple Index search using SearchBean.
Hi folks, I have been using Lucene for a while. Our application needs to sort the result set by last modified date. I was really happy to see SearchBean and HitsIterator. My question is: can I use SearchBean to search using multiple indices? I skimmed through the source code but could not find any method to search using multiple indices or an array of Directories. Can anyone help me? Thanks in advance. ~Vela
Re: Confusion over wildcard search logic
On Tuesday, September 23, 2003, at 10:09 AM, Dan Quaroni wrote:

Yeah, thanks a lot for your help! I'm using the release version of Lucene version 1.2.

Perhaps give the latest codebase a try too, just to see if any fixes (particularly in that WildcardQuery.toString) are there.

Ok... What's the correct procedure for doing a multi-word wildcard where I want it to begin with south san fran but not get anything else that contains south or san? Just AND together south, san, and fran?

As far as I know there isn't a way to do this with QueryParser currently. The real way to do this with the existing API is to use PhrasePrefixQuery and do some manual setup before using it (like you'll see in the current test case and Javadocs for it) by enumerating all the terms that start with fran and passing those to a PhrasePrefixQuery (isn't this class misnamed? What does it have to do with prefix?) along with south and san.

Although this might produce good results, my understanding was that booleans retrieve all matches and store them in memory, then resolve the booleans. If I use the term san to search California, I'm going to need a lot of memory to store all of the temporary results...!

+south +san +fran* ought to do the trick. I wouldn't worry about memory too much until you've seen it to be a problem. I think you'll be fine (but don't currently have the understanding or data to back that up). Erik
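The manual setup Erik describes, enumerating the index terms that match fran* and then requiring one of them as the last word of the phrase, can be modeled with plain collections. Everything below is an illustration of the expand-then-match idea, not the PhrasePrefixQuery API, and the term lists are invented:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Model of the idea behind PhrasePrefixQuery: expand "fran*" against the
// index's term dictionary, then accept any three-word phrase of the form
// "south san <expansion>". All data here is illustrative.
public class PhrasePrefixSketch {
    public static void main(String[] args) {
        List<String> indexTerms = Arrays.asList("francisco", "frankfort", "fresno");

        // Step 1: enumerate terms matching the prefix "fran".
        List<String> expansions = indexTerms.stream()
                .filter(t -> t.startsWith("fran"))
                .collect(Collectors.toList());

        // Step 2: match "south san <expansion>" as exact, in-order phrases.
        List<String> cities = Arrays.asList(
                "south san francisco", "san francisco", "south pasadena");
        for (String city : cities) {
            String[] words = city.split(" ");
            boolean hit = words.length == 3
                    && words[0].equals("south")
                    && words[1].equals("san")
                    && expansions.contains(words[2]);
            System.out.println(city + " -> " + hit);
        }
    }
}
```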
RE: Confusion over wildcard search logic
Perhaps give the latest codebase a try too, just to see if any fixes (particularly in that WildcardQuery.toString) are there.

It's our intention to put this into a production environment soon, so we were waiting on 1.3 to go final before attempting to use it.

i wouldn't worry about memory too much until you've seen it to be a problem. i think you'll be fine (but don't currently have the understanding or data to back that up).

The reason I split up the indexes by state was that I was running out of memory (and searches were very slow) with the whole world of companies in one index with all kinds of boolean joining. With it split out, it seems to do pretty well. Well, after some extremely brief experimentation (Maybe I shoulda done it before writing the email, huh?) I discovered this:

** This worked pretty well and got me some good results; the company that I was looking for came back second (which is pretty good given how general I made the query):
Query name:(amb proper*) State california name:(amb proper*) [EMAIL PROTECTED] amb proper* 31988 total matching documents

* This one matched a ton of documents; however, the company I was looking for came up first in the list, though with a pretty abysmal score of 0.23769014:
Query name:(amb prop*) and city:(south san fran*) State california name:(amb prop*) and city:(south san fran*) [EMAIL PROTECTED] (amb prop*) (city:south city:san city:fran*) 721977 total matching documents

The previous query took 1552 millis. I was able to reduce that to 285 millis just by adding the +'s you suggested:
Query name:(amb prop*) and city:(+south +san +fran*) State california name:(amb prop*) and city:(+south +san +fran*) [EMAIL PROTECTED] (amb prop*) (+city:south +city:san +city:fran*) 45011 total matching documents

Incidentally, I say everything that I do with great awe at the power of Lucene and respect for those who have made it possible.
Please don't take anything I say as a gripe - I'm just learning how things work, and that's a necessary step to take for any new software package of this type. You just have to learn the ins and outs and little quirks to be able to take full advantage of it. Thanks!
Is the lucene index serializable?
Can I send a small Lucene index by SOAP/TCP/HTTP/RMI? Is there a way to serialize a Lucene index? I want to send it from the indexer server to the search server, and then do a merge operation on the search server with the previous index file. Thanks.
Re: Is the lucene index serializable?
Can I send a small Lucene index by SOAP/TCP/HTTP/RMI? Is there a way to serialize a Lucene index? I want to send it from the indexer server to the search server, and then do a merge operation on the search server with the previous index file.

Well, what about a very old-fashioned way instead? Something like tar.gz.ftp? Not very glamorous, but workable... Cheers, PA.
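The old-fashioned route PA suggests amounts to archiving the index directory's files and shipping the bytes over whatever transport you like, then unpacking on the search server before merging. A minimal sketch with java.util.zip; the directory layout and file names here are invented for the example:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Sketch: pack every regular file of an index directory into one zip that
// can be sent over HTTP/FTP/etc. and unpacked remotely. Names illustrative.
public class PackIndex {
    public static void pack(Path indexDir, Path zipFile) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile));
             DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) continue;
                zos.putNextEntry(new ZipEntry(file.getFileName().toString()));
                Files.copy(file, zos);   // stream file bytes into the archive
                zos.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("index");
        Files.write(dir.resolve("segments"), "demo".getBytes());
        Path zip = Files.createTempFile("index", ".zip"); // outside dir on purpose
        pack(dir, zip);
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            System.out.println(zf.getEntry("segments") != null);
        }
    }
}
```

One caveat: whichever transport you choose, the index must not be written to mid-copy, or the shipped files can be inconsistent with each other.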
Design question
I, like a lot of other people, am new to Lucene. Practical examples are pretty scarce. I have the following site: http://www.tasteofwhatcom.com It's built on JBoss 3.0.7/Tomcat 4.1.24, Apache 2.0.47/mod_jk 1.2.4, MySQL 3.23.57 and RedHat 9.0. I want to add search capabilities to the site to allow users to search for entries. All the menu items are in various MySQL tables. In addition, some information is in static HTML and JSP pages. Some links are to PDF docs on the site. How do you create a Lucene index on 'menu items' in the database as well as on the static pages, both HTML and JSP, and any PDF docs that are linked? Are there any examples? Thanks, Jack
Re: Design question
I, like a lot of other people, am new to Lucene. Practical examples are pretty scarce.

If you don't mind learning by example, take a look at the Powered by Lucene page. A fair number of those projects are open source. http://jakarta.apache.org/lucene/docs/powered.html PA.
Re: scoring algorithm
On Tuesday 23 September 2003 00:12, Chris Hennen wrote:

Hi, what is the purpose of tf_q * idf_t / norm_q in Lucene's scoring algorithm: score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t ) I don't understand why the score has to be higher when the frequency of a term in the query is higher. What is normalized by norm_q?

To give the user the possibility to assign a higher weight to a term in a query (by using a term weight or by repeating the term). The norm_q compensates the total score for the query weights, leaving the scores of two different queries somewhat comparable. Kind regards, Ype
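Ype's point about norm_q can be made concrete with a tiny numeric sketch of the formula from the question. All numbers are invented, this is one simplified reading of that formula for a single-term query (it ignores coordination and field/document boosts), and it takes norm_q to be the square root of the sum of squared query-term weights:

```java
// Numeric sketch of score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)
// for a one-term query, showing how norm_q compensates for query-term weights.
// All numbers are invented for illustration.
public class ScoreSketch {
    static double score(double tfQ, double idf, double normQ, double tfD, double normD) {
        return (tfQ * idf / normQ) * (tfD * idf / normD);
    }

    public static void main(String[] args) {
        double idf = 2.0, tfD = 3.0, normD = 1.0;

        // Query "foo": tf_q = 1, so norm_q = sqrt((1 * idf)^2) = idf.
        double normQ1 = Math.sqrt(1 * idf * idf);
        // Query "foo foo": tf_q = 2, so norm_q = sqrt((2 * idf)^2) grows with it.
        double normQ2 = Math.sqrt(2 * 2 * idf * idf);

        // With only one distinct term, the normalization cancels the higher
        // tf_q exactly, keeping the two queries' scores comparable.
        System.out.println(score(1, idf, normQ1, tfD, normD));
        System.out.println(score(2, idf, normQ2, tfD, normD));
    }
}
```

In a multi-term query the repetition still shifts relative weight toward the repeated term; norm_q only keeps the overall scale of scores comparable across queries, which is exactly Ype's answer.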