Re: QueryParser Rules article (Erik Hatcher)
On Wednesday, November 12, 2003, at 11:52 PM, Tomcat Programmer wrote:

> I thought Erik's article was great. There was one unanswered
> brainbender I had which I was hoping was in there, but... Maybe you
> can add this topic to the next one, Erik?

Well, I'm not sure another article on QueryParser is warranted (yet), but I'll offer a response here.

> When using the QueryParser class, the parse method will throw a
> TokenMgrError when there is a syntax error even as simple as a
> missing quote at the end of a phrase query. According to the javadoc,
> you should never see this class derived from Error being thrown (oops?)

You must be using the instance parse method, rather than the static one. The static one does this:

    try {
      QueryParser parser = new QueryParser(field, analyzer);
      return parser.parse(query);
    } catch (TokenMgrError tme) {
      throw new ParseException(tme.getMessage());
    }

But the instance parse method is declared to throw a TokenMgrError. Why is that? I'd be happy to put that same try/catch in the instance parse method, although I want to double check (CC'ing lucene-dev on this one). Any reason not to remove the TokenMgrError exception from the instance parse method?

> Has anyone discovered a good practice for trapping syntax problems
> and then returning an informative message to the user on how to fix
> their query? I would be interested in code samples as well if you
> have any :)

There is the JavaScript piece in the sandbox that could help with pre-parsing expressions for validity. Otherwise, simply displaying acceptable examples of expressions is what I'd do.

Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
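For trapping syntax problems before they ever reach QueryParser, one cheap option is a pre-parse sanity check on the raw query string. The helper below is a hypothetical sketch (not part of Lucene); it catches only the unbalanced-quote and unbalanced-parenthesis cases mentioned above and returns a message you could show the user:

```java
// Hypothetical pre-parse check for the most common user syntax errors:
// unbalanced double quotes and unbalanced parentheses. Run this before
// handing the string to QueryParser, and show a friendly message
// instead of letting a TokenMgrError/ParseException surface.
public class QuerySyntaxCheck {

    /** Returns null if the query looks OK, otherwise a user-readable hint. */
    public static String validate(String query) {
        int quotes = 0;
        int parens = 0;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '"') quotes++;
            else if (c == '(') parens++;
            else if (c == ')') parens--;
            if (parens < 0) return "Unmatched ')' in query.";
        }
        if (quotes % 2 != 0) return "Unclosed quote: phrases need an opening and a closing quote.";
        if (parens != 0) return "Unmatched '(' in query.";
        return null;
    }

    public static void main(String[] args) {
        if (validate("title:\"open source") == null) throw new AssertionError();
        if (validate("(foo AND bar") == null) throw new AssertionError();
        if (validate("foo AND (bar OR \"a b\")") != null) throw new AssertionError();
        System.out.println("ok");
    }
}
```

This does not replace real parsing; it only front-loads the two errors users hit most, so the eventual ParseException becomes rare.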
QueryParser Rules article (Erik Hatcher)
I thought Erik's article was great. There was one unanswered brainbender I had which I was hoping was in there, but... Maybe you can add this topic to the next one, Erik? Here is my issue:

When using the QueryParser class, the parse method will throw a TokenMgrError when there is a syntax error even as simple as a missing quote at the end of a phrase query. According to the javadoc, you should never see this class derived from Error being thrown (oops?)

I did some searching on the archive for this list, and turned up some old articles from 2001 in which Brian Goetz was asking Paul Friedman for an example of a query like that, so he could fix it. I saw that Paul posted a sample, but I never saw a response back from Brian. Looking in the CHANGES.txt file all the way back to 1.0, there is no mention of any modification regarding exceptions or errors.

Has anyone discovered a good practice for trapping syntax problems and then returning an informative message to the user on how to fix their query? I would be interested in code samples as well if you have any :)

Thanks a lot!
-Tom
RE: Reopen IndexWriter after delete?
I agree it's a bit of a strange design. It seems that there should be one class that handles all modifications of the index. Usually you'd only have one instance of this, so you wouldn't need to open and close it all the time. (I'm basically writing one of these classes myself to simplify my code. I'm sure other people have written a similar class.) There should be another class that is responsible for searching. You may have multiple instances of this so you can have multiple headends searching the index.

The IndexWriter and IndexReader almost do this separation. It seems that if the IndexWriter had the delete functionality, instead of the IndexReader, things would be a lot simpler (from a synchronization standpoint). Maybe Otis, Erik or Doug could suggest why this may or may not be a good idea.

-Reece

-----Original Message-----
From: Dror Matalon
Sent: Wednesday, November 12, 2003 12:06 PM
To: Lucene Users List
Subject: Re: Reopen IndexWriter after delete?

Which begs the question: why do you need to use an IndexReader rather than an IndexWriter to delete an item?

On Tue, Nov 11, 2003 at 02:46:37PM -0800, Otis Gospodnetic wrote:
> > 1). If I delete a term using an IndexReader, can I use an existing
> > IndexWriter to write to the index? Or do I need to close and reopen
> > the IndexWriter?
>
> No. You should close IndexWriter first, then open IndexReader, then
> call delete, then close IndexReader, and then open a new IndexWriter.
>
> > 2). Is it safe to call IndexReader.delete(term) while an IndexWriter
> > is writing? Or should I be synchronizing these two tasks so only one
> > occurs at a time?
>
> No, it is not safe. You should close the IndexWriter, then delete the
> document and close IndexReader, and then get a new IndexWriter and
> continue writing.
>
> Incidentally, I just wrote a section about concurrency issues and
> about locking in Lucene for the upcoming Lucene book.
>
> Otis
RE: Can use Lucene be used for this
Hello,

This has probably been put forth on the list before, but how about the following approach for leftmost wildcard searches, at least for single-term searches? Reverse the character order of all words after they're stemmed and before they're added to a special reverse-character-order index. Any time a wildcard was found at the beginning of the search term, the special index would be engaged. Then a search for "*bar" would be converted to a search for "rab*" on the RCO index, the search would find "raboof", and this result would then be unreversed upon display to yield "foobar".

Rene's special index could be several times larger in entry count, depending on the average length of the contained terms. A reverse-character-order index is the same size as its regular counterpart.

Cheers,
John

-----Original Message-----
From: Hackl, Rene
Sent: Wednesday, November 12, 2003 6:34 AM
To: 'Lucene Users List'
Subject: Re: Can use Lucene be used for this

>> col2 like %aa%

> Lucene doesn't handle queries where the start of the term is not known
> very efficiently.

Is it really able to handle them at all? I thought "*foo"-type queries were not supported. That's because I build two indexes for the purpose of simultaneous left and right truncation: one "normal" index and another special one, which takes tokens and breaks them down. For instance, "foobar" would also be indexed as "oobar" and "obar". For a query "*oba*" the left wildcard would cause the special index to be searched for "oba*"; non-left-truncated queries would search the normal index. The special index is created with maxFieldLength = 10.

build-time, specialIndex vs. normalIndex: +60%
index size, specialIndex vs. normalIndex: +240%
index size, specialIndex vs. originalDocSize: +60%

Query execution is still very fast on a 3 GB specialIndex. I guess the usability depends on how large your document collection is and what kind of search functionality you need.

The drawbacks of this approach are that proximity and phrase searches on the special index are busted. Would it make sense to prevent creating the .prx file to reduce index size when not offering that kind of search anyway? Is it possible at all?

Best regards,
René
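The reversal trick John describes needs only a string reversal at index time and a matching rewrite at query time. The sketch below shows just that logic, with the Lucene indexing plumbing omitted; the class and method names are made up for illustration:

```java
// Sketch of the reverse-character-order (RCO) trick for leading
// wildcards. At index time each term is additionally stored reversed
// (in a separate field or index); at query time a leading-wildcard
// term like "*bar" is rewritten to the trailing-wildcard term "rab*"
// and run against that reversed index instead.
public class ReverseWildcard {

    public static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    /** Rewrites "*bar" -> "rab*" for use against the reversed index. */
    public static String rewriteLeadingWildcard(String term) {
        if (!term.startsWith("*")) {
            throw new IllegalArgumentException("no leading wildcard: " + term);
        }
        return reverse(term.substring(1)) + "*";
    }

    public static void main(String[] args) {
        // "foobar" is indexed as "raboof"; the query "*bar" becomes
        // "rab*", which prefix-matches "raboof".
        String indexed = reverse("foobar");
        String rewritten = rewriteLeadingWildcard("*bar");
        String prefix = rewritten.substring(0, rewritten.length() - 1);
        if (!indexed.startsWith(prefix)) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Matches found this way are reversed once more before display, turning "raboof" back into "foobar".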
Poor Performance when searching for 500+ terms
I know this is rare, but I am building an application that submits searches having 500+ search terms. A general example would be:

field1:w1 OR field1:w2 OR ... OR field1:w500

For 1 million documents, the performance is OK if field1 in each document has fewer than 50 terms; I can get results in under 1 second. But if field1 has more than 400 terms on average in each document, the performance degrades to around 6 seconds. Is there any way to improve this?

My second question is that my query often comes with an AND condition with another search word, for example:

field2:w AND (field1:w1 OR field1:w2 OR ... OR field1:w500)

field2:w will only return fewer than 1000 records out of 1 million. So I thought I could use a Filter object, i.e. search on field2:w first, thus limiting the 500-term OR search to only the 1000 results for field2:w, somewhat like a join in a database. But I checked the code and see that IndexSearcher always performs the 500 disk searches before calling the filter object? Any suggestions on this?

Also, does Lucene cache results in memory? I see the performance tends to get better after a few runs, especially on searches on fields having small numbers of terms. If so, can I manipulate the cache size somehow to accommodate fields with large numbers of terms?

Many thanks.
Re: Reopen IndexWriter after delete?
Which begs the question: why do you need to use an IndexReader rather than an IndexWriter to delete an item?

On Tue, Nov 11, 2003 at 02:46:37PM -0800, Otis Gospodnetic wrote:
> > 1). If I delete a term using an IndexReader, can I use an existing
> > IndexWriter to write to the index? Or do I need to close and reopen
> > the IndexWriter?
>
> No. You should close IndexWriter first, then open IndexReader, then
> call delete, then close IndexReader, and then open a new IndexWriter.
>
> > 2). Is it safe to call IndexReader.delete(term) while an IndexWriter
> > is writing? Or should I be synchronizing these two tasks so only one
> > occurs at a time?
>
> No, it is not safe. You should close the IndexWriter, then delete the
> document and close IndexReader, and then get a new IndexWriter and
> continue writing.
>
> Incidentally, I just wrote a section about concurrency issues and
> about locking in Lucene for the upcoming Lucene book.
>
> Otis

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com
Latent Semantic Indexing
Does Lucene implement Latent Semantic Indexing? Examples?

Ralf
Vector Space Model in Lucene?
Hi,

does Lucene implement a Vector Space Model? If yes, does anybody have an example of how to use it?

Cheers,
Ralf
Connection Pooling
Hi! Does anyone have code for a connection pool? I am using JDK 1.3.1.

Thank you!
Re: Index pdf files with your content in lucene.
Hello,

Zipping the files did not work. I can send the files by personal e-mail if anybody wants them. And if somebody can post them on a web site, very cool; I can't post them to a web site myself.

Ernesto.
Re: Boost in Query Parser
On Wednesday, November 12, 2003, at 10:53 AM, MOYSE Gilles (Cetelem) wrote:

> Hello. I've made a Filter which recognizes special words and returns
> them in a "boosted form", in a QueryParser sense. For instance, when
> the filter receives "special_word", it returns "special_word^3", so
> as to boost it. The problem is that the QueryParser understands the
> boost syntax when the string is given as an argument to the "parse"
> function, but does not get it when it is generated by a filter in the
> Analyzer. So, when my filter transforms "special_word" to
> "special_word^3", the QueryParser does not create a Query object with
> "special_word" as the value to look for and a boost of 3, but with
> "special_word^3" to search for and a boost of 1. Of course, it does
> not match anything. Does anyone know a solution to that problem? Do I
> have to write my own QueryParser from the beginning, or do I just
> have to correct two or three lines of the original QueryParser to
> make it work the way I'd like it to work?

One idea is to pre-process the string before handing it to QueryParser and do a string replacement with the boosting (^3) added appropriately.

Writing your own QueryParser is certainly a possibility. There is nothing really to "correct" with the original QueryParser in this regard, as it is working by design, and there really is no way to feed expressions from the analysis back into the parsing - that doesn't really seem like a good idea to me.

You can probably get away with subclassing QueryParser and overriding getFieldQuery to do what you want with the String passed in, calling setBoost (rather than trying to inject "^3").

Erik
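Erik's first suggestion - pre-processing the string before it reaches QueryParser - can be as simple as a token-by-token string replacement. The sketch below is a hypothetical example; the special-word list and the boost value 3 are assumptions taken from the message above:

```java
import java.util.Map;

// Sketch of boosting special words by string replacement *before* the
// query reaches QueryParser, instead of trying to emit "word^3" from an
// analyzer token filter (where the parser treats the caret as literal
// text). The word list and boost value here are made-up examples.
public class BoostPreprocessor {

    private static final Map<String, Integer> SPECIAL =
            Map.of("special_word", 3);

    public static String preprocess(String query) {
        StringBuilder out = new StringBuilder();
        for (String token : query.split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            Integer boost = SPECIAL.get(token.toLowerCase());
            // Append "^3" only to recognized special words.
            out.append(boost != null ? token + "^" + boost : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String q = preprocess("foo special_word bar");
        if (!q.equals("foo special_word^3 bar")) throw new AssertionError();
        System.out.println("ok");
    }
}
```

The rewritten string is then handed to QueryParser.parse() as usual, so the caret is interpreted as boost syntax rather than as part of the term.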
Re: Wildcard search and HOST tokens
On Wednesday, November 12, 2003, at 10:43 AM, Pascal Nadal wrote:

> the HostFilter I wrote (which re-tokenizes HOST tokens) works
> wonderfully.

I wonder if this has been fixed since Lucene 1.2. Could you try the latest 1.3RC build available and see if it works without your HostFilter?

Erik
Boost in Query Parser
Hello. I've made a Filter which recognizes special words and returns them in a "boosted form", in a QueryParser sense. For instance, when the filter receives "special_word", it returns "special_word^3", so as to boost it.

The problem is that the QueryParser understands the boost syntax when the string is given as an argument to the "parse" function, but does not get it when it is generated by a filter in the Analyzer. So, when my filter transforms "special_word" to "special_word^3", the QueryParser does not create a Query object with "special_word" as the value to look for and a boost of 3, but with "special_word^3" to search for and a boost of 1. Of course, it does not match anything.

Does anyone know a solution to that problem? Do I have to write my own QueryParser from the beginning, or do I just have to correct two or three lines of the original QueryParser to make it work the way I'd like it to work?

Thanks a lot.

Gilles Moyse.

-----Original Message-----
From: Erik Hatcher
Sent: Wednesday, November 12, 2003 3:16 PM
To: Lucene Users List
Subject: Re: Can use Lucene be used for this

On Wednesday, November 12, 2003, at 07:34 AM, Hackl, Rene wrote:
> >> col2 like %aa%
> >
> > Lucene doesn't handle queries where the start of the term is not
> > known very efficiently.
>
> Is it really able to handle them at all? I thought "*foo"-type
> queries were not supported.

They are not supported by the QueryParser, but an API-created WildcardQuery supports it. I certainly do not recommend using prefix-style wildcard queries though, knowing what happens under the covers.

Erik
Re: Re: Wildcard search and HOST tokens
When I do a query.toString(), it prints exactly my query. Example: title:FE.MENU* gives

title:FE.MENU* FE.MENU*

when I search in the default field and the field 'title'. The HostFilter I wrote (which re-tokenizes HOST tokens) works wonderfully.

PS: thanks Erik

-----Original Message-----
From: Erik Hatcher
Sent: Wednesday, November 12, 2003 12:43 PM
To: Lucene Users List
Subject: Re: Wildcard search and HOST tokens

On Wednesday, November 12, 2003, at 05:55 AM, Pascal Nadal wrote:
> My lucene indexes contain fields with values like this:
> www.xxx.yyy.zzz, which are treated as HOST tokens. My problem is the
> following: search results never contain documents with such fields
> when doing a wildcard query or a fuzzy query. Only searches on full
> field values work.
>
> example queries: www* www.* www.xxx* www?xxx?yyy www.yyy.y~ or just yyy
>
> I'm using Lucene 1.2 and the StandardAnalyzer. It seems that the '.'
> is the problem. Is it a bug?

What does query.toString("") return? This generally has a lot of clues on what happened in QueryParser.

Erik
Re: Can use Lucene be used for this
On Wednesday, November 12, 2003, at 07:34 AM, Hackl, Rene wrote:

> >> col2 like %aa%
> >
> > Lucene doesn't handle queries where the start of the term is not
> > known very efficiently.
>
> Is it really able to handle them at all? I thought "*foo"-type
> queries were not supported.

They are not supported by the QueryParser, but an API-created WildcardQuery supports it. I certainly do not recommend using prefix-style wildcard queries though, knowing what happens under the covers.

Erik
Re: Overview to Lucene
Hi Ralf,

On Nov 12, 2003, at 14:06, [EMAIL PROTECTED] wrote:

> Does anybody know good articles which demonstrate parts of that or
> give a good start into Lucene?

Otis Gospodnetic's articles are a good starting point:

"Introduction to Text Indexing with Apache Jakarta Lucene"
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html

"Advanced Text Indexing with Lucene"
http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html

Cheers,

PA.
Overview to Lucene
Hello group,

can somebody give me an overview of Lucene? What high-level components does it include? In particular, I want to answer the following questions regarding available functionality:

1) Does Lucene provide a Vector Space IR Model (with TF/IDF and Cosine Similarity)?
2) Does Lucene provide any collaborative filtering algorithms like correlation / user ranking etc.?
3) Does Lucene provide a Probabilistic Model?
4) Does Lucene provide anything for indexing XML documents and using XML document structure for that? Or does it just work on flat text files?

Does anybody know good articles which demonstrate parts of that or give a good start into Lucene?

Thanks,
Ralf
Re: Can use Lucene be used for this
>> col2 like %aa%

> Lucene doesn't handle queries where the start of the term is not known
> very efficiently.

Is it really able to handle them at all? I thought "*foo"-type queries were not supported.

That's because I build two indexes for the purpose of simultaneous left and right truncation: one "normal" index and another special one, which takes tokens and breaks them down. For instance, "foobar" would also be indexed as "oobar" and "obar". For a query "*oba*" the left wildcard would cause the special index to be searched for "oba*"; non-left-truncated queries would search the normal index. The special index is created with maxFieldLength = 10.

build-time, specialIndex vs. normalIndex: +60%
index size, specialIndex vs. normalIndex: +240%
index size, specialIndex vs. originalDocSize: +60%

Query execution is still very fast on a 3 GB specialIndex. I guess the usability depends on how large your document collection is and what kind of search functionality you need.

The drawbacks of this approach are that proximity and phrase searches on the special index are busted. Would it make sense to prevent creating the .prx file to reduce index size when not offering that kind of search anyway? Is it possible at all?

Best regards,
René
Re: Reopen IndexWriter after delete?
Correct. write.lock is used for that.

Otis

--- Morus Walter <[EMAIL PROTECTED]> wrote:
> Otis Gospodnetic writes:
> >
> > No, it is not safe. You should close the IndexWriter, then delete
> > the document and close IndexReader, and then get a new IndexWriter
> > and continue writing.
>
> IIRC lucene takes care that you do so. Locking prevents you from
> having an open IndexWriter and modifying the index with an
> IndexReader (and vice versa).
>
> Morus
Re: Wildcard search and HOST tokens
On Wednesday, November 12, 2003, at 05:55 AM, Pascal Nadal wrote:

> My lucene indexes contain fields with values like this:
> www.xxx.yyy.zzz, which are treated as HOST tokens. My problem is the
> following: search results never contain documents with such fields
> when doing a wildcard query or a fuzzy query. Only searches on full
> field values work.
>
> example queries: www* www.* www.xxx* www?xxx?yyy www.yyy.y~ or just yyy
>
> I'm using Lucene 1.2 and the StandardAnalyzer. It seems that the '.'
> is the problem. Is it a bug?

What does query.toString("") return? This generally has a lot of clues on what happened in QueryParser.

Erik
Wildcard search and HOST tokens
My lucene indexes contain fields with values like this: www.xxx.yyy.zzz, which are treated as HOST tokens. My problem is the following: search results never contain documents with such fields when doing a wildcard query or a fuzzy query. Only searches on full field values work.

example queries: www* www.* www.xxx* www?xxx?yyy www.yyy.y~ or just yyy

I'm using Lucene 1.2 and the StandardAnalyzer. It seems that the '.' is the problem. Is it a bug?

I wrote a HostFilter class which re-tokenizes HOST tokens, and it seems to work fine (full field values or wildcard queries).
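For reference, the splitting such a HostFilter might perform can be sketched without the Lucene TokenFilter plumbing: keep the whole host token and also emit each dot-separated label, so wildcard queries like www* have something to match. The class below is a made-up illustration of that logic only:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the token splitting a HostFilter-style filter might do:
// a HOST token like "www.xxx.yyy.zzz" is kept whole and also broken
// into its dot-separated parts. In a real filter each part would be
// emitted as an additional Token at the same position.
public class HostSplit {

    public static List<String> expand(String host) {
        List<String> tokens = new ArrayList<>();
        tokens.add(host);                                 // keep the full host
        tokens.addAll(Arrays.asList(host.split("\\."))); // plus each label
        return tokens;
    }

    public static void main(String[] args) {
        List<String> t = expand("www.xxx.yyy");
        if (!t.equals(Arrays.asList("www.xxx.yyy", "www", "xxx", "yyy")))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```

With both the full host and its labels indexed, a prefix query on "www" matches even though StandardAnalyzer kept the dotted value as one HOST token.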
Re: Can use Lucene be used for this
> I need to retrieve the value with simple queries on the data like:
> col1 like %ab&,

What does the ampersand mean?

> col2 like %aa%

Lucene doesn't handle queries where the start of the term is not known very efficiently.

> and col3 sounds like ;

No experience with this, but you could probably use the Soundex encoder from http://jakarta.apache.org/commons/codec/ for transforming words before indexing them (and before searching for them).

--
Eric Jain
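To make the "sounds like" idea concrete, here is a simplified American Soundex encoder (h, w and y are treated like vowels, which the full algorithm handles slightly differently). It is for illustration only; the tested Soundex class in the commons-codec package linked above is the safer choice in practice:

```java
// Simplified American Soundex, for illustration only. Encode each word
// to its code at index time (e.g. into an extra field) and encode the
// query term the same way at search time, so "Robert" and "Rupert"
// both land on "R163".
public class SimpleSoundex {

    private static int code(char c) {
        switch (Character.toLowerCase(c)) {
            case 'b': case 'f': case 'p': case 'v': return 1;
            case 'c': case 'g': case 'j': case 'k':
            case 'q': case 's': case 'x': case 'z': return 2;
            case 'd': case 't': return 3;
            case 'l': return 4;
            case 'm': case 'n': return 5;
            case 'r': return 6;
            default: return 0; // vowels and h, w, y are not coded
        }
    }

    public static String encode(String word) {
        StringBuilder out = new StringBuilder();
        out.append(Character.toUpperCase(word.charAt(0)));
        int prev = code(word.charAt(0));
        for (int i = 1; i < word.length() && out.length() < 4; i++) {
            int c = code(word.charAt(i));
            if (c != 0 && c != prev) out.append(c);
            prev = c; // an uncoded letter (code 0) resets the duplicate check
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    public static void main(String[] args) {
        if (!encode("Robert").equals("R163")) throw new AssertionError();
        if (!encode("Jackson").equals("J250")) throw new AssertionError();
        System.out.println("ok");
    }
}
```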
Re: Document Clustering
> I was basically thinking of using lucene to generate document
> vectors, and writing my custom similarity algorithms for measuring
> distance.
>
> I could then run this data through k-means or SOM algorithms for
> calculating clusters.

First of all, I think it would already be great if there was some functionality for simply storing document vectors during the indexing process, so you could later on use IndexSearcher.docTerms(int i) to retrieve a BitSet or an array of floats weighted so that frequent terms have lower values. One difficulty I see here is that terms don't seem to have any unique identifiers; I guess you'd have to manage those yourself...

--
Eric Jain
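Given such stored document vectors, the custom similarity measure could be plain cosine similarity over term-weight arrays. The sketch below assumes you have already mapped each term to an array index yourself, as noted above:

```java
// Cosine similarity between two document vectors of term weights
// (e.g. TF-IDF values), the usual distance measure for k-means or SOM
// clustering over documents. The arrays are indexed by a term id that
// the caller must manage.
public class Cosine {

    public static double similarity(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        if (na == 0 || nb == 0) return 0; // empty vector: define as 0
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] d1 = {1, 0, 2};
        double[] d2 = {2, 0, 4};   // same direction -> similarity 1
        double[] d3 = {0, 3, 0};   // orthogonal -> similarity 0
        if (Math.abs(similarity(d1, d2) - 1.0) > 1e-9) throw new AssertionError();
        if (Math.abs(similarity(d1, d3)) > 1e-9) throw new AssertionError();
        System.out.println("ok");
    }
}
```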
Re: Reopen IndexWriter after delete?
Otis Gospodnetic writes:
>
> No, it is not safe. You should close the IndexWriter, then delete the
> document and close IndexReader, and then get a new IndexWriter and
> continue writing.

IIRC lucene takes care that you do so. Locking prevents you from having an open IndexWriter and modifying the index with an IndexReader (and vice versa).

Morus