Lucene Book
Hi, I am new to Lucene. Can anyone guide me to where I can download a free Lucene book? Thanks. Regards, E. Faisal Important Email Information :- The information in this email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. If you are not the intended addressee please contact the sender and dispose of this e-mail immediately. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene Book
On Sep 7, 2004, at 3:00 AM, [EMAIL PROTECTED] wrote: I am new to Lucene. Can anyone guide me from where i can download free Lucene book. Free?! http://www.manning.com/hatcher2 is the book Otis and I have spent the last year laboring on. It has been a long, hard effort that is about to come to fruition. Lucene in Action is in copy/tech editing right now and will be pushed into production very shortly. As always with Manning, some chapters will be available for free download once the book has been typeset (probably even before physical copies are available). We have not yet decided which chapters we'll make available for free. I hope a few folks buy it - it would be a shame for my kids to go without food ;) Erik
Re: Lucene Book
Jeez, Erik! Where's your sense of public spirit ;-) Terry PS: Glad to hear you're (finally!) nearing publication. - Original Message - From: Erik Hatcher To: Lucene Users List Sent: Tuesday, September 07, 2004 6:43 AM Subject: Re: Lucene Book [...]
Re: Lucene Book
Hello Ebrahim, Like Erik said, the book about Lucene is coming soon. Although it won't be free, Erik, I, and a few other people have already shared some of our knowledge in several articles about Lucene. There is a page on the Lucene Wiki that links to all known Lucene articles. I suggest you take a look at those while we finish Lucene in Action. Otis --- [EMAIL PROTECTED] wrote: Hi I am new to Lucene. Can anyone guide me from where i can download free Lucene book. [...]
Use of + and - in queries
I don't understand the difference between using + and - in queries compared to using AND and NOT. Even the Query Syntax document seems a bit confused. In the section on the NOT operator it says: To search for documents that contain "jakarta apache" but not "jakarta lucene" use the query: "jakarta apache" NOT "jakarta lucene". Then in the section on the - operator you read this: To search for documents that contain "jakarta apache" but not "jakarta lucene" use the query: "jakarta apache" -"jakarta lucene". So what's the difference? -- Bill Tschumy Otherwise -- Austin, TX http://www.otherwise.com
RE: Spam:too many open files
I sent an email to this list a few weeks ago about how to fix a corrupt index. I basically edited the segments file with a hex editor, removing the entry for the missing file and decrementing the total file count stored near the beginning of the segments file. -Original Message- From: Patrick Kates [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 01, 2004 1:30 PM To: [EMAIL PROTECTED] Subject: Spam:too many open files I am having two problems with my client's lucene indexes. One, we are getting a FileNotFound exception (too many open files). This would seem to indicate that I need to increase the number of open files on our SuSE 9.0 Pro box. I have our sys admin working on this problem for me. Two, because of this error and subsequent restarting of the box, we seem to have lost an index segment or two. My client's tape backups do not contain the segments we know about. I am concerned about the missing index segments as they seem to be preventing any further update of the index. Does anyone have any suggestions as to how to fix this besides a full re-index of the problem indexes? I was wondering if maybe a merge of the index might solve the problem? I could move our nightly merge of the index files to sooner, but I am afraid that the merge might make matters worse. Any ideas or helpful speculation would be greatly appreciated. Patrick
RE: Spam:too many open files
A note to developers: the code checked into Lucene CVS ~Aug 15th, post 1.4.1, was causing frequent index corruptions. When I reverted back to version 1.4 I am no longer getting the corruptions. I was unable to trace the problem to anything specific, but was using the newer code to take advantage of the sort fixes. -Original Message- From: Patrick Kates [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 01, 2004 1:30 PM To: [EMAIL PROTECTED] Subject: Spam:too many open files [...]
Re: telling one version of the index from another?
Bill Janssen wrote: Hi. Hey, Bill. It's been a long time! I've got a Lucene application that's been in use for about two years. Some users are using Lucene 1.2, some 1.3, and some are moving to 1.4. The indices seem to behave differently under each version. I'd like to add code to my application that checks the current user's index version against the version of Lucene that they are using, and automatically re-indexes their files if necessary. However, I can't figure out how to tell the version, from the index files. Prior to 1.4, there were no format numbers in the index. These are being added, file-by-file, as we change file formats. As you've discovered, there is currently no public API to obtain the format number of an index. Also, the formats of different files are revved at different times, so there may not be a single format number for the entire index. (Perhaps we should remedy this, by, e.g., always revving the segments version whenever any file changes format.) The documentation on the file formats, at http://jakarta.apache.org/lucene/docs/fileformats.html, directs me to the segments file. However, when I look at a version 1.3 segments file, it seems to bear little relationship to the format described in fileformats.html. Have a look at the version of fileformats.html that shipped with 1.3. You can find this by browsing CVS, looking for the 1.3-final tag. But let me do it for you: http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/docs/fileformats.html?rev=1.15 According to CVS tags, that describes both the 1.3 and 1.2 index file formats. But the part of fileformats.html dealing with the segments file contains no compatibility notes, so I assume it hasn't changed since 1.3. I wrote the bit about compatibility notes when I first documented file formats, and then promptly forgot about it. So, until someone contributes them, there are no compatibility notes. Sorry. Even if it had, what's the idea of using -1 as the format number for 1.4? 
The idea is to promptly break 1.3 and 1.2 code which tries to read the index. Those versions of Lucene don't check format numbers (because there were none). Positive values would give unpredictable errors. A negative value causes an immediate failure. So, anyone know a way to tell the difference between the various versions of the index files? Crufty hacks welcome :-). The first four bytes of the segments file will mostly do the trick. If it is zero or positive, then the index is a 1.2 or 1.3 index. If it is -2, then it's a 1.4-final or later index. There was a change in formats between 1.2 and 1.3, with no format number change. This was in 1.3 RC1 (note #12 in CHANGES.txt). The semantics of each byte in norm files (.f[0-9]) changed. In 1.2 each byte represented 0.0-255.0 on a linear scale. In 1.3 and later they're eight-bit floats (three-bit mantissa, five-bit exponent, no sign bit). The net result is that if you use a 1.2 index with 1.3 or later then the correct documents will be returned, but scores and rankings will be wacky. With the exception of this last bit, 1.4 should be able to correctly handle indexes from earlier releases. Please report if this is not the case. Cheers, Doug
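For anyone wanting to automate Doug's check, it can be sketched in plain Java. This is an illustrative sketch, not Lucene API: the class and method names are invented, and the -1 case is inferred from Bill's question (it appears to have been used during 1.4 development) rather than stated by Doug.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Illustrative sketch (not Lucene API): classify an index by the first
// four bytes of its "segments" file, per Doug's heuristic above.
class SegmentsSniffer {

    /** Interpret the leading int of the segments file. */
    static String classify(int firstInt) {
        if (firstInt >= 0) {
            return "1.2 or 1.3";        // pre-1.4 indexes start with zero or a positive value
        } else if (firstInt == -1) {
            return "1.4 pre-release";   // inferred: used during 1.4 development
        } else if (firstInt == -2) {
            return "1.4-final or later";
        }
        return "unknown (newer format?)";
    }

    /** Lucene writes the int big-endian, which is what DataInputStream reads. */
    static String classify(File segmentsFile) throws IOException {
        DataInputStream in = new DataInputStream(new FileInputStream(segmentsFile));
        try {
            return classify(in.readInt());
        } finally {
            in.close();
        }
    }
}
```

Pointing `classify(new File("index/segments"))` at an index directory would then report which generation wrote it, under the assumptions above.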
Re: Possible to remove duplicate documents in sort API?
Kevin A. Burton wrote: My problem is that I have two machines... one for searching, one for indexing. The searcher has an existing index. The indexer finds an UPDATED document, adds it to a new index, and pushes that new index over to the searcher. The searcher then reloads, and when someone performs a search BOTH documents could show up (including the stale document). I can't do a delete() on the searcher because the indexer doesn't have the entire index that the searcher has. I can think of a couple of ways to fix this. If the indexer box kept copies of the indexes that it has already sent to the searcher, then it could mark updated documents as deleted in these old indexes. Then you could, with the new index, also distribute new .del files for the old indexes. Alternately, you could, on the searcher box, before you open the new index, open an IndexReader on all of the existing indexes and mark all new documents as deleted in the old indexes. This shouldn't take more than a few seconds. IndexReader.delete() just sets a bit in a bit vector that is written to file by IndexReader.close(), so it's quite fast. Doug
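Doug's point about delete() being cheap comes down to the .del file being a plain bit vector. A toy model of the mechanism (these are not Lucene's actual classes; the names are invented for illustration):

```java
import java.util.BitSet;

// Toy model (not Lucene's real classes) of why IndexReader.delete() is
// fast: deletion just flips a bit in a vector; nothing in the index data
// files is rewritten. The vector is flushed to a .del file on close().
class DeletionVector {
    private final BitSet deleted;
    private final int maxDoc;

    DeletionVector(int maxDoc) {
        this.maxDoc = maxDoc;
        this.deleted = new BitSet(maxDoc);
    }

    /** O(1): set one bit; no segment data is touched. */
    void delete(int docNum) {
        deleted.set(docNum);
    }

    boolean isDeleted(int docNum) {
        return deleted.get(docNum);
    }

    /** Documents still visible to searches. */
    int numLiveDocs() {
        return maxDoc - deleted.cardinality();
    }
}
```

This is why marking a few thousand stale documents deleted before opening the new index "shouldn't take more than a few seconds": the work per document is a single bit set, plus one small file write at the end.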
Re: Why doesn't Document use a HashSet instead of a LinkedList (DocumentFieldList)
Kevin A. Burton wrote: It looks like Document.java uses its own implementation of a linked list. Why not use a HashMap to enable O(1) lookup... right now field lookup is O(N), which is certainly no fun. Was this benchmarked? Perhaps there's the assumption that since documents often have few fields, the object overhead and hashcode overhead would have been less this way. I have never benchmarked this but would be surprised if it makes a measurable difference in any real application. A linked list is used because it naturally supports multiple entries with the same key. A home-grown linked list was used because, when Lucene was first written, java.util.LinkedList did not exist. Please feel free to benchmark this against a HashMap of LinkedList of Field. This would be slower to construct, which may offset any increased access speed. Doug
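Doug's point can be made concrete with a toy version of the two strategies (class and method names are invented, not Lucene code). Note how the map variant pays a construction cost per document and needs a list per key to preserve duplicate field names, which the plain list gets for free:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy comparison (not Lucene code) of field lookup strategies for a
// document with a handful of fields.
class FieldLookup {
    static class Field {
        final String name;
        final String value;
        Field(String name, String value) { this.name = name; this.value = value; }
    }

    /** O(N) scan, as Lucene's Document does; N is tiny in practice. */
    static String scan(List<Field> fields, String name) {
        for (Field f : fields)
            if (f.name.equals(name)) return f.value;
        return null;
    }

    /** O(1) lookup, but the map must be built first -- extra cost per document. */
    static Map<String, List<String>> buildMap(List<Field> fields) {
        Map<String, List<String>> map = new HashMap<String, List<String>>();
        for (Field f : fields) {
            List<String> vals = map.get(f.name);
            if (vals == null) {
                vals = new ArrayList<String>();
                map.put(f.name, vals);
            }
            vals.add(f.value);   // preserves multiple entries with the same key
        }
        return map;
    }
}
```

With three or four fields per document, the scan touches three or four entries; the map costs an allocation and a hash per field just to build, which is the trade-off Doug describes.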
Re: Spam:too many open files
On Tuesday 07 September 2004 17:41, [EMAIL PROTECTED] wrote: A note to developers, the code checked into lucene CVS ~Aug 15th, post 1.4.1, was causing frequent index corruptions. When I reverted back to version 1.4 I no longer am getting the corruptions. Here are some changes from around that day: http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentReader.java http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java Could you check which of those might have caused the problem? I guess there's not much the developers can do without the problem being reproducible. regards Daniel -- http://www.danielnaber.de
getting most common terms for a smaller set of documents
Dear Lucene Users: What is the best way to get the most common terms for a subset of the total documents in your index? I know how to get the most common terms for a field for the entire index, but what is the most efficient way to do this for a subset of documents? Here is the code I am using to get the top numberOfTerms common terms for the field fieldName:

public TermInfo[] mostCommonTerms(String fieldName, int numberOfTerms) {
    // make sure min will get a positive number
    if (numberOfTerms < 1) {
        numberOfTerms = Integer.MAX_VALUE;
    }
    numberOfTerms = Math.min(numberOfTerms, 50);
    //String[] commonTerms = new String[numberOfTerms];
    try {
        IndexReader reader = IndexReader.open(indexPath);
        TermInfoQueue tiq = new TermInfoQueue(numberOfTerms);
        TermEnum terms = reader.terms();
        int minFreq = 0;
        while (terms.next()) {
            if (fieldName.equalsIgnoreCase(terms.term().field())) {
                if (terms.docFreq() > minFreq) {
                    tiq.put(new TermInfo(terms.term(), terms.docFreq()));
                    if (tiq.size() >= numberOfTerms) {               // if tiq overfull
                        tiq.pop();                                   // remove lowest in tiq
                        minFreq = ((TermInfo) tiq.top()).docFreq;    // reset minFreq
                    }
                }
            }
        }
        TermInfo[] res = new TermInfo[tiq.size()];
        for (int i = 0; i < res.length; i++) {
            res[res.length - i - 1] = (TermInfo) tiq.pop();
        }
        reader.close();
        return res;
    } catch (IOException ioe) {
        logger.error("IOException: " + ioe.getMessage());
    }
    return null;
}
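The TermInfoQueue pattern above (keep the top N entries seen so far, evicting the smallest when the queue overflows) can be sketched generically with java.util.PriorityQueue. This is an illustration of the pattern, not the TermInfoQueue from Lucene's demo code; names here are invented:

```java
import java.util.PriorityQueue;

// Generic sketch of the bounded min-heap pattern used in the post above:
// keep the n highest frequencies, popping the current minimum when full.
class TopN {
    static int[] topFrequencies(int[] freqs, int n) {
        // Default PriorityQueue ordering is a min-heap, so peek()/poll()
        // give the smallest retained frequency.
        PriorityQueue<Integer> heap = new PriorityQueue<Integer>();
        for (int f : freqs) {
            heap.add(f);
            if (heap.size() > n)
                heap.poll();          // evict the current minimum
        }
        // Drain smallest-first, filling the result from the back, so the
        // returned array is in descending order of frequency.
        int[] res = new int[heap.size()];
        for (int i = res.length - 1; i >= 0; i--)
            res[i] = heap.poll();
        return res;
    }
}
```

The key property for efficiency is the same one the post relies on: every term costs at most one O(log N) heap operation, regardless of how many terms the enumeration visits.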
Re: telling one version of the index from another?
Thanks, Doug, much as I'd figured from looking at the code. Here's a follow-up question: Is there any programmatic way to tell which version of the Lucene code a program is using? A version number or string would be great (perhaps an idea for the next release), but a list of classes in one version but not in the previous one would do for the moment. (Perhaps we should remedy this, by, e.g., always revving the segments version whenever any file changes format.) I think you mean the segments format, right? And I highly recommend doing so. Bill
lucene locks index, tomcat has to stop and restart
Hi all, I met with a problem with the Lucene demo: each time I create a Lucene index, I have to first stop Tomcat, and restart Tomcat after the index is created. The reason is that the index is locked when using the IndexReader.open(index) method in the JSP file. So I tried to modify the JSP code by adding close(), but it shows an error saying close() is not a static method. I checked the source code of the Lucene IndexReader methods, and found that the close() method is final, not static. I tried to change it to static, but that resulted in many errors. So, has anybody met a similar problem? Do you have any solutions? Thank you very very much! Ivy.
Moving from a single server to a cluster
My application currently uses Lucene with an index living on the filesystem, and it works fine. I'm moving to a clustered environment soon and need to figure out how to keep my indexes together. Since the index is on the filesystem, each machine in the cluster will end up with a different index. I looked into JDBC Directory, but it's not tested under Oracle and doesn't seem like a very mature project. What are other people doing to solve this problem? -- Ben Sinclair [EMAIL PROTECTED]
lucene index parser problem
Hi, I have a problem when creating a Lucene index for many HTML files: it shows "aborted, expected tagname tagend" for those HTML files which contain JavaScript. It seems it cannot parse the tags \. Does anyone have a solution? Thank you very very much...!!! Ivy.
lucene locks index, tomcat has to stop and restart
Hi, I met with such a problem with lucene demo: Each time when I create lucene index, I have to first stop tomcat, and restart tomcat after the index is created. The reason is: the index is locked when using IndexReader.open(index) method in the jsp file. So, I tried to modify the jsp codes by adding close(), but it shows error which said close() is not a static method. I checked the source codes of lucene IndexReader methods, and found that the close() method is final not static. I tried to change it to static, but resulted in many errors. So, does anybody meet the similar problem as me? Do you have any solutions? Thank you very very much.!! Ivy.
Re: lucene index parser problem
Why oh why did you send this to the tomcat lists? Don't cross post! Especially when the question doesn't even apply to one of the lists. Patrick On Tue, 7 Sep 2004 16:35:35 -0400, hui liu [EMAIL PROTECTED] wrote: Hi, I have such a problem when creating lucene index for many html files: [...]
Re: lucene locks index, tomcat has to stop and restart
This isn't a Tomcat-specific problem; it sounds like a problem with how the reader is being used. Somewhere in the JSP an IndexReader variable was probably assigned, with a line something like: IndexReader ir = IndexReader.open(somepath); To close the reader, and thus solve the problem, somewhere later you need: ir.close(); with the needed try/catch in place. Again, please refrain from cross-posting... just because it happened on Tomcat doesn't make it a Tomcat problem. This is clearly a Lucene usage problem. Patrick On Tue, 7 Sep 2004 16:37:42 -0400, hui liu [EMAIL PROTECTED] wrote: Hi, I met with such a problem with lucene demo: [...]
Re: lucene locks index, tomcat has to stop and restart
Ah, I see your problem. From the Lucene Javadocs on IndexSearcher.close(): Note that the underlying IndexReader is not closed, if IndexSearcher was constructed with IndexSearcher(IndexReader r). If the IndexReader was supplied implicitly by specifying a directory, then the IndexReader gets closed. Since you are explicitly passing in an IndexReader, the IndexSearcher is not closing it. But since you created the IndexReader without retaining a reference to it, you cannot close it, so you will always have an open index. You have a couple of options. Either retain a reference to the IndexReader: IndexReader ir = IndexReader.open(indexName); IndexSearcher searcher = new IndexSearcher(ir); or, if indexName is just the path to the index, use: IndexSearcher searcher = new IndexSearcher(indexName); since that will manage the IndexReader for you. Hope that helps. Patrick CCing list for archives. On Tue, 7 Sep 2004 18:38:43 -0400, hui liu [EMAIL PROTECTED] wrote: First of all, thanks for your reply:-) But actually, I've already tried this and here is my code: searcher = new IndexSearcher(IndexReader.open(indexName)); and at some later place I wrote: IndexReader.close(); Both of them are within try and catch, and then I got such an error in IE by tomcat: non-static method close() cannot be referenced from a static context. I read the source code of IndexReader and found that the method close() is final not static. so I tried to change it to static, but got even more errors. I am wondering how do you use lucene? Has anyone met with the same thing? Thanks a lot. Ivy. [...]
Re: Use of + and - in queries
Hi Bill, No difference; it's just that Lucene's query syntax recognizes both 'NOT' and '-' and uses them the same way: to exclude certain documents from search results. Otis --- Bill Tschumy [EMAIL PROTECTED] wrote: I don't understand the difference in using + and - in queries compared to using AND and NOT. [...]
Re: Moving from a single server to a cluster
I've used scp and rsync successfully in the past. Lucene now includes a remote searcher (RMI stuff), so you may want to consider a single index, too. Otis --- Ben Sinclair [EMAIL PROTECTED] wrote: My application currently uses Lucene with an index living on the filesystem, and it works fine. I'm moving to a clustered environment soon and need to figure out how to keep my indexes together. [...]
RE: too many open files
I suspect it has to do with this change: --- jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java 2004/08/08 13:03:59 1.12 +++ jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java 2004/08/11 17:37:52 1.13 I wouldn't know where to start to reproduce the problem, as it was happening just once a day or so on an index that was being both queried and added to in real time, to the tune of 100,000 docs a day / 50 queries a day. The corruption was always the same thing: the segments file listed an entry for a file that was not there. -Will -Original Message- From: Daniel Naber [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 07, 2004 1:54 PM To: Lucene Users List Subject: Re: Spam:too many open files [...]
Re: Spam:too many open files
Hi Wallen, Actually, the files Daniel listed were modified on 8/11 and then again on 8/15. Between 8/11 and 8/15, I believe there could have been any number of problems, including corrupt indexes and poor multithreaded performance. However, I think after 8/15 the files should be in good working order. If you are not sure whether you saw problems with the pre-8/15 or post-8/15 version of the code, is it possible for you to try the latest CVS and see if the problem still exists? If it does, it will of course require urgent attention. Thanks very much! Dmitry. Daniel Naber wrote: Here are some changes from around that day: [...]
MultiFieldQueryParser seems broken... Fix attached.
Hi! I'm using Lucene for an application which has lots of fields per document, in which the users can specify in their config files which fields they wish to be included by default in a search. I'd been happily using MultiFieldQueryParser to do the searches, but the darn users started demanding more Google-like searches; that is, they want the search terms to be implicitly AND-ed instead of implicitly OR-ed. No problem, thinks I, I'll just set the operator. Only to find this has no effect on MultiFieldQueryParser. Once I looked at the code, I found that MultiFieldQueryParser combines the clauses at the wrong level -- it combines them at the outermost level instead of the innermost level. This means that if you have two fields, author and title, and the search string "cutting lucene", you'll get the final query (title:cutting title:lucene) (author:cutting author:lucene) If the search operator is OR, this isn't a problem. But if it is AND, you have two problems. The first is that MultiFieldQueryParser seems to ignore the operator entirely. But even if it didn't, the second problem is that the query formed would be +(title:cutting title:lucene) +(author:cutting author:lucene) That is, if the word Lucene was in both the author field and the title field, the match would fit. This clearly isn't what the searcher intended. You can re-write MultiFieldQueryParser, as I've done in the example code which I append here. This little program allows you to run either my parser (-DSearchTest.QueryParser=new) or the old parser (-DSearchTest.QueryParser=old). It allows you to use either OR (-DSearchTest.QueryDefaultOperator=or) or AND (-DSearchTest.QueryDefaultOperator=and) as the operator. And it allows you to pick your favorite set of default search fields (-DSearchTest.QueryDefaultFields=author:title:body, for example). It takes one argument, a query string, and outputs the re-written query after running it through the query parser.
So to evaluate the above query: % java -classpath /import/lucene/lucene-1.4.1.jar:. \ -DSearchTest.QueryDefaultFields=title:author \ -DSearchTest.QueryDefaultOperator=AND \ -DSearchTest.QueryParser=old \ SearchTest cutting lucene query is (title:cutting title:lucene) (author:cutting author:lucene) % The class NewMultiFieldQueryParser does the combination at the inner level, using an override of addClause, instead of the outer level. Note that it can't cover all cases (notably PhrasePrefixQuery, because that class has no access methods which allow one to introspect over it, and SpanQueries, because I don't understand them well enough :-). I post it here in advance of filing a formal bug report for early feedback. But it will show up in a bug report in the near future. Running the above query with the new parser gives: % java -classpath /import/lucene/lucene-1.4.1.jar:. \ -DSearchTest.QueryDefaultFields=title:author \ -DSearchTest.QueryDefaultOperator=AND \ -DSearchTest.QueryParser=new \ SearchTest cutting lucene query is +(title:cutting author:cutting) +(title:lucene author:lucene) % which I claim is what the user is expecting. In addition, the new class uses an API more similar to QueryParser, so that the user has less to learn when using it. The code in it could probably just be folded into QueryParser, in fact. 
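The difference between the two combination levels can be made concrete with a toy generator that produces both query forms from the same fields and terms. This is not Lucene code; it just builds the query strings shown above:

```java
// Toy generator (not Lucene code) showing the two ways a multi-field
// parser could combine fields and terms, per the discussion above.
class CombineDemo {

    /** Outer-level combination: one sub-query per field, OR-ing all terms. */
    static String perField(String[] fields, String[] terms) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append('(');
            for (int j = 0; j < terms.length; j++) {
                if (j > 0) sb.append(' ');
                sb.append(fields[i]).append(':').append(terms[j]);
            }
            sb.append(')');
        }
        return sb.toString();
    }

    /** Inner-level combination: one required (+) sub-query per term, OR-ing all fields. */
    static String perTerm(String[] fields, String[] terms) {
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < terms.length; j++) {
            if (j > 0) sb.append(' ');
            sb.append("+(");
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) sb.append(' ');
                sb.append(fields[i]).append(':').append(terms[j]);
            }
            sb.append(')');
        }
        return sb.toString();
    }
}
```

With fields {title, author} and terms {cutting, lucene}, perField yields the old parser's output and perTerm the proposed one, making the AND semantics difference easy to see: the per-term form requires every term to appear in some field, rather than requiring every field to contain some term.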
Bill

The code for SearchTest:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Searcher;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.WildcardQuery;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.RangeQuery;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.queryParser.FastCharStream;
    import org.apache.lucene.queryParser.TokenMgrError;
    import org.apache.lucene.queryParser.ParseException;

    import java.io.File;
    import java.io.StringReader;
    import java.util.Date;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    class SearchTest {

        static class NewMultiFieldQueryParser extends QueryParser {

            // Placeholder default field; never used for actual parsing.
            static private final String DEFAULT_FIELD = "%%";

            private String[] fields = null;

            public NewMultiFieldQueryParser(String[] f, Analyzer a) {
                super(DEFAULT_FIELD, a);
                fields = f;
            }
RE: Spam:too many open files
I will deploy and test through the end of the week and report back Friday if the problem persists. Thank you!

-----Original Message-----
From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, September 07, 2004 8:40 PM
To: Lucene Users List
Subject: Re: Spam:too many open files

Hi Wallen,

Actually, the files Daniel listed were modified on 8/11 and then again on 8/15. Between 8/11 and 8/15, I believe there could have been any number of problems, including corrupt indexes and poor multithreaded performance. However, I think that after 8/15 the files should be in good working order. If you are not sure whether you saw problems with the pre-8/15 or post-8/15 version of the code, is it possible for you to try the latest CVS and see if the problem still exists? If it does, it will of course require urgent attention.

Thanks very much!
Dmitry.

Daniel Naber wrote:

> On Tuesday 07 September 2004 17:41, [EMAIL PROTECTED] wrote:
>> A note to developers: the code checked into Lucene CVS ~Aug 15th, post 1.4.1, was causing frequent index corruptions. When I reverted back to version 1.4, I no longer got the corruptions.
>
> Here are some changes from around that day:
>
> http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java
> http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentReader.java
> http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java
>
> Could you check which of those might have caused the problem? I guess there's not much the developers can do without the problem being reproducible.
>
> regards
> Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Use of explain() vs search()
Hi all,

I was wondering if anyone could tell me what the expected behaviour is for calling explain() without calling search() first on a particular query. Would it effectively do a search, so that I can then examine the Explanation to check whether the document matches?

I'm currently looking at some existing code to this effect:

    Explanation exp = searcher.explain(myQuery, docId);
    // where docId was _not_ returned by a search on myQuery
    if (exp.getValue() > 0.0f) {
        // assuming the document for docId matched the query
    }

Is the assumption wrong? I ask because the result of this code is inconsistent with

    Hits h = searcher.search(myQuery); // no hits are returned

Thanks in advance,
Minh
Re: Use of explain() vs search()
Hi all,

Sorry, I should clarify my last point. The search() returns no hits, but the explain() using the apparently invalid docId returns a value greater than 0. For what it's worth, it's performing a PhraseQuery.

Thanks in advance,
Minh
pdf in Chinese
Hi all,

I use PDFBox to parse PDF files into Lucene documents. When I parse Chinese PDF files, PDFBox does not always succeed. Does anyone have some advice?