Re: Numeric Range Restrictions: Queries vs Filters
Of course, not only did I manage to forget to include the attachment, but when I sent a reply with the code, mail.apache.org rejected it because it was a ZIP file. So let's see how mail.apache.org feels about 6 separate text files.

: Date: Mon, 22 Nov 2004 18:25:24 -0800 (PST)
: Subject: Numeric Range Restrictions: Queries vs Filters
Numeric Range Restrictions: Queries vs Filters
(NOTE: numbers in [] indicate Footnotes)

I'm rather new to Lucene (and this list), so if I'm grossly misunderstanding things, forgive me.

One of my main needs as I investigate search technologies is to restrict results based on ranges of numeric values. Looking over the archives of this list, it seems that lots of people have run into problems dealing with this. In particular, whenever someone asks a question about "Numeric Ranges", the question seems to always involve one (or more) of the following:

(a) Lexical sorting puts 11 in the range "1 TO 5"
(b) Dates (or Dates and Times)
(c) BooleanQuery$TooManyClauses Exceptions
(d) Should I use a filter?

(a) is a solved problem as long as you use a formatter like LongField.java[1]

(b) is really nothing more than a special case of dealing with generic numeric values. While there are certainly special-purpose solutions that sometimes apply to dealing with Date ranges, any good solution for dealing with raw numeric ranges can be applied to Dates (and Times).

(c) is a situation that seems to come up a lot because of the way RangeQuery works. The rewrite method walks all of the Terms in the index starting with "lowerTerm" and builds up a BooleanQuery containing a separate TermQuery for every Term found, until it reaches the upperTerm. This causes a range search of "0001 TO 1000" to generate a BooleanQuery with N clauses, where N is the quantity of unique values in the field which are lexically greater than 0001 and lexically less than 1000. Depending on the nature of your data, this might be 0 BooleanClauses or it might be 1000 BooleanClauses; but the list is built before the search is ever even executed.

At first this may seem really strange -- I know I was certainly confused -- but there is a very good reason for it: ultimately, RangeQuery still provides you with a meaningful score for each document, based on the frequency (and quantity) of terms that document has in the range [2]. In order to do that, it has to expand itself. But what if you don't care whether your Range restriction impacts the Score? [3]

Which brings us to...

(d) Filtering. Filters in general make a lot of sense to me. They are a way to specify (at query time) that only a certain subset of the index should be considered for results. The Filter class has a very straightforward API that seems very easy to subclass to get the behavior I want. The Query API, on the other hand ... I freely admit that I can't make heads or tails of it. I don't even know where I would begin to try and write a new subclass of Query if I wanted to.

I would think that most people who want to do a "numeric range restriction" on their data probably don't care about the scoring benefits of RangeQuery. Looking at the code base, the way DateFilter works seems like it provides an ideal solution to any sort of Range restriction (not just Dates) that *should* be more efficient than using RangeQuery when dealing with an unbounded value set. (Both approaches need to iterate over all of the terms in the specified field using TermEnum, but RangeQuery has to build up a set of TermQuery clauses, one for each matching term, and then each of those queries has to help score the documents -- DateFilter, on the other hand, only has to maintain a single BitSet of documents that it finds as it iterates.)

But I was surprised then to see the following quote from Erik Hatcher in the archives:

"In fact, DateFilter by itself is practically of no use, I think." [4]

...Erik goes on to suggest that, given "a set of canned date ranges", it doesn't really matter if you use a RangeQuery or a DateFilter -- as long as you cache them for reuse (with something like CachingWrapperFilter or QueryFilter). I'm hoping that he might elaborate on that comment?

As a test, I wrote a "RangeFilter" which borrows heavily from DateFilter, both to convince myself it could work and to do a comparison between it and RangeQuery. [5] Based on my limited tests, using a Filter to restrict to a Range is a lot faster than using RangeQuery -- independent of caching.

The attachment contains my RangeFilter, a unit test that demonstrates it, and a benchmarking unit test that does a side-by-side comparison with RangeQuery [6]. If developers feel that this class is useful, then by all means roll it into the code base. (90% of it is cut/pasted from DateFilter/RangeQuery anyway.)

Comments? ... Questions? ... Answers?

Footnotes:

[1] It seems to me this class is extremely useful; does anyone know if there's a particular reason it hasn't been added to the main Lucene codebase? http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg04790.html

[2] Take a look at RangeQueryScoreDemo.java in the attachment, which produces output something like this...

Range Search for: 'apple' TO 'dog'
0.40924072 ... bed dog emu
0.38014847 ... DOG
0.2825246 ... cat
0.17657787 ... apple emu
0.12
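[The attached RangeFilter itself isn't preserved in the archive, but based on the description above (and on DateFilter, which it borrows from), the core of such a filter might look like the sketch below. This is a minimal illustration against the Lucene 1.4-era Filter API, where a filter returns a BitSet over document numbers; the class name is made up.]

    import java.io.IOException;
    import java.util.BitSet;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.search.Filter;

    // Walks the terms of one field from 'lower' to 'upper' (inclusive) and
    // sets a bit for every document containing any of them. No scoring and
    // no BooleanQuery expansion, so TooManyClauses cannot occur.
    public class SimpleRangeFilter extends Filter {
        private final String field;
        private final String lower;
        private final String upper;

        public SimpleRangeFilter(String field, String lower, String upper) {
            this.field = field;
            this.lower = lower;
            this.upper = upper;
        }

        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());
            TermEnum enumerator = reader.terms(new Term(field, lower));
            TermDocs termDocs = reader.termDocs();
            try {
                do {
                    Term term = enumerator.term();
                    // Stop past this field's terms or past the upper bound.
                    if (term == null || !term.field().equals(field)
                            || term.text().compareTo(upper) > 0) {
                        break;
                    }
                    termDocs.seek(term);
                    while (termDocs.next()) {
                        bits.set(termDocs.doc());
                    }
                } while (enumerator.next());
            } finally {
                enumerator.close();
                termDocs.close();
            }
            return bits;
        }
    }

As with DateFilter, the bounds only behave numerically if the field values are encoded so that lexicographic order matches numeric order (point (a) above).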
multi-dimensional scaling
Is it possible to combine Lucene and multi-dimensional scaling in some way?
JDBCDirectory to prevent optimize()?
It seems that when compared to other datastores, Lucene starts to fall down. For example, Lucene doesn't perform online index optimizations, so if you add 10 documents you have to run optimize() again, and this isn't exactly a fast operation. I'm wondering about the potential for a generic JDBCDirectory for keeping the Lucene index within a database. It sounds somewhat unconventional but would allow you to perform live addDirectory updates without running optimize() again. Has anyone looked at this? How practical would it be?

Kevin
Re: Too many open files issue
A useful resource for increasing the number of file handles on various operating systems is the Volano Report: http://www.volano.com/report/

> I had requested help on an issue we have been facing with the "Too many
> open files" Exception garbling the search indexes and crashing the
> search on the web site.
Re: Too many open files issue
I'm sorry, I wasn't involved in the original conversation, but maybe I can jump in with some info that will help. The number of open files depends on the merge factor, the number of segments, and the number of indexed fields in your index. It also depends on whether you are using "compound files" or not (this is a flag on the IndexWriter). With the compound files flag on, segments have a fixed number of files, regardless of how many fields you use. Without the flag, each field is a separate file.

Let's say you have 10 segments (per your merge factor) that are being merged into a new segment (via an optimize call, or just because you have reached the merge factor). This means there are 11 segments open at the same time. If you have 20 indexed fields and are not using compound files, that's 20 * 11 = 220 files. There are a few other files open as well, plus whatever other files and sockets your JVM process is holding open at that time. This would include incoming connections, for example, if this is running inside a web server. If you are running in an application server, this could include connections and files opened by other applications in that same app server. So the numbers run up quite a bit.

By the way, it is usual to have the file descriptor limit set at 9000 or so for Unix machines running production web applications. By the way 2: on Solaris, you will need to modify a value in /etc/system to get up to this level. Not sure about Linux or other flavors.

Another suggestion - you may want to look into a tool called "lsof". It is a utility that shows the file handles open by a particular process. It could be that some other part of your process (or of the application server, VM, etc.) is not closing files. This tool will help you see what files are open, and you can validate that all of them really need to be open.

Best of luck. Dmitry.

Neelam Bhatnagar wrote: Hi, I had requested help on an issue we have been facing with the "Too many open files" Exception garbling the search indexes and crashing the search on the web site. As a suggestion, you had asked us to look at the articles on O'Reilly Network which had specific context around this exact problem. One of the suggestions was to increase the limit on the number of file descriptors on the file system. We tried it by first lowering the limit to 200 from 256 in order to reproduce the exception. The exception did get reproduced, but even after increasing the limit to 500 the exception kept coming, until after several rounds of trying to rebuild the index we finally got it working at the default file descriptor limit of 256. This makes us wonder if your first suggestion of optimizing indexes is a prerequisite to trying this option. Another piece of relevant information is that we have the default merge factor of 10. Kindly give us pointers to what it is that we are doing wrong, or should we be trying something completely different. Thanks and regards Neelam Bhatnagar
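[For reference, turning on the compound file format is a one-line call on the writer. A minimal sketch against the Lucene 1.4-era API; the index path is made up.]

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class CompoundFileSetup {
        public static void main(String[] args) throws Exception {
            // 'true' creates the index if it doesn't exist yet.
            IndexWriter writer = new IndexWriter("/path/to/index",
                    new StandardAnalyzer(), true);
            // Pack each segment into a single compound (.cfs) file instead
            // of one file per indexed field, sharply reducing open handles.
            writer.setUseCompoundFile(true);
            writer.close();
        }
    }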
Re: auto-generate uid?
Not exactly sure what you're trying to do. You can easily generate a number when you index each Document and insert it in a uid field (which is, BTW, what I do), and if you base it on a timestamp plus some characteristic of the document (which is also what I do), it should always be unique. As you add more documents, they will each get their own unique id. When you delete documents and optimize, these ids won't be affected.

However, in your subsequent clarification, you indicated you already had a unique id and want to find the maximum value. So why did you say you want one auto-generated?

Terry

- Original Message -
From: aurora
To: [EMAIL PROTECTED]
Sent: Monday, November 22, 2004 4:39 PM
Subject: Re: auto-generate uid?

Just to clarify. I have a Field 'uid' whose value is a unique integer. I use it as a key to the document stored externally. I don't mean Lucene's internal document number. I was wondering if there is a method to query the highest value of a field, perhaps something like: IndexReader.maxTerm('uid')

> What would the purpose of an auto-generated UID be?
>
> But no, Lucene does not generate UIDs for you. Documents are numbered
> internally by their insertion order. This number changes, however, when
> documents are deleted in the middle and the index is optimized.
>
> Erik
>
> On Nov 22, 2004, at 1:50 PM, aurora wrote:
>
>> Is there a way to auto-generate a uid in Lucene? Even just a way to
>> query the highest uid and let the application add one to it will do.
>>
>> Thanks.
Re: auto-generate uid?
Just to clarify. I have a Field 'uid' whose value is a unique integer. I use it as a key to the document stored externally. I don't mean Lucene's internal document number. I was wondering if there is a method to query the highest value of a field, perhaps something like: IndexReader.maxTerm('uid')

What you could do is write your own IndexWriter class by extending the original one found in org.apache.lucene.index.IndexWriter. Then you have direct access to Lucene's segment counter, which could provide you a unique id for each document in the index. Those ids would stay sticky even if you modify the index after the initial creation process. Is that the hint you need to start?

regards Bernhard

What would the purpose of an auto-generated UID be? But no, Lucene does not generate UIDs for you. Documents are numbered internally by their insertion order. This number changes, however, when documents are deleted in the middle and the index is optimized. Erik

On Nov 22, 2004, at 1:50 PM, aurora wrote: Is there a way to auto-generate a uid in Lucene? Even just a way to query the highest uid and let the application add one to it will do. Thanks.
Re: auto-generate uid?
On Nov 22, 2004, at 4:39 PM, aurora wrote: Just to clarify. I have a Field 'uid' whose value is a unique integer. I use it as a key to the document stored externally. I don't mean Lucene's internal document number. I was wondering if there is a method to query the highest value of a field, perhaps something like: IndexReader.maxTerm('uid')

There isn't quite that type of API, though you can skip to a known term and enumerate from there:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermEnum.html#skipTo(org.apache.lucene.index.Term)

IndexReader gives you a TermEnum from either the terms() method or the terms(Term) method.

Erik

What would the purpose of an auto-generated UID be? But no, Lucene does not generate UIDs for you. Documents are numbered internally by their insertion order. This number changes, however, when documents are deleted in the middle and the index is optimized. Erik

On Nov 22, 2004, at 1:50 PM, aurora wrote: Is there a way to auto-generate a uid in Lucene? Even just a way to query the highest uid and let the application add one to it will do. Thanks.
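[As a concrete illustration of that suggestion, something like the sketch below walks the 'uid' terms and returns the last one. Note that this yields the lexicographically greatest value, which equals the numeric maximum only if the uids were indexed zero-padded to a fixed width; the class and method names are made up.]

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class MaxUid {
        // Returns the last term in the 'uid' field, or null if none exist.
        public static String maxUid(IndexReader reader) throws IOException {
            // Position the enumerator at the first 'uid' term.
            TermEnum terms = reader.terms(new Term("uid", ""));
            String max = null;
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !"uid".equals(t.field())) {
                        break;  // ran past the end of the uid terms
                    }
                    max = t.text();
                } while (terms.next());
            } finally {
                terms.close();
            }
            return max;
        }
    }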
Re: auto-generate uid?
Just to clarify. I have a Field 'uid' whose value is a unique integer. I use it as a key to the document stored externally. I don't mean Lucene's internal document number. I was wondering if there is a method to query the highest value of a field, perhaps something like: IndexReader.maxTerm('uid')

What would the purpose of an auto-generated UID be? But no, Lucene does not generate UIDs for you. Documents are numbered internally by their insertion order. This number changes, however, when documents are deleted in the middle and the index is optimized. Erik

On Nov 22, 2004, at 1:50 PM, aurora wrote: Is there a way to auto-generate a uid in Lucene? Even just a way to query the highest uid and let the application add one to it will do. Thanks.
Re: downloading Lucene 1.4.2
On Nov 22, 2004, at 3:27 PM, Hoss wrote: Does anyone know how to go about getting http://www.apache.org/dist/ updated?

Yes, I do. It just takes a bit of time, as I've not automated that step. It requires digitally signing several files and several other currently manual steps while I'm looking at the instructions. I apologize for not doing this yet myself. Other committers can do this too if they are so inclined.

Erik
Re: downloading Lucene 1.4.2
Click the "here" link on the Lucene home page. We Lucene committers have been very very lame and have not published the binary distribution appropriately for the mirrors to pick up. One of these days we'll correct this, but for now you can click the link from the announcement on the home page. Erik On Nov 22, 2004, at 3:05 PM, Sullivan, Sean C - MWT wrote: According to the Lucene homepage, Lucene 1.4.2 was released on October 1, 2004 However, the "dist" on www.apache.org does not have a copy of Lucene 1.4.2 http://www.apache.org/dist/jakarta/lucene/binaries/ Where can I download Lucene 1.4.2? -Sean - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Index in RAM - is it realy worthy?
Otis Gospodnetic wrote: For the Lucene book I wrote some test cases that compare FSDirectory and RAMDirectory. What I found was that with certain settings FSDirectory was almost as fast as RAMDirectory. Personally, I would push FSDirectory and hope that the OS and the Filesystem do their share of work and caching for me before looking for ways to optimize my code.

On another note, doing an index merge in memory is probably faster if you just use a RAMDirectory and perform addIndexes to it. This would almost certainly be faster than optimizing on disk, but I haven't benchmarked it.

Kevin
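[A sketch of that idea against the Lucene 1.4-era API; the source index paths are made up. addIndexes() merges (and optimizes) the given indexes into the target directory, which here lives entirely in memory.]

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamMerge {
        public static void main(String[] args) throws Exception {
            // Build the merged index in memory.
            Directory ram = new RAMDirectory();
            IndexWriter writer = new IndexWriter(ram, new StandardAnalyzer(), true);
            Directory[] sources = {
                FSDirectory.getDirectory("/path/to/indexA", false),
                FSDirectory.getDirectory("/path/to/indexB", false)
            };
            writer.addIndexes(sources);  // merges and optimizes into 'ram'
            writer.close();
        }
    }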
Re: Index in RAM - is it realy worthy?
Otis Gospodnetic wrote: For the Lucene book I wrote some test cases that compare FSDirectory and RAMDirectory. What I found was that with certain settings FSDirectory was almost as fast as RAMDirectory. Personally, I would push FSDirectory and hope that the OS and the Filesystem do their share of work and caching for me before looking for ways to optimize my code.

Yes... I performed the same benchmark, and in my situation RAMDirectory for searches was about 2% slower. I'm willing to bet that it has to do with the fact that it's backed by a Hashtable and not a HashMap (which isn't synchronized). Also, adding a constructor that takes the expected number of terms could make loading a RAMDirectory faster, since you could prevent rehashing. If you're on a modern machine your filesystem cache will end up buffering your disk anyway, which I'm sure was happening in my situation.

Kevin
Re: Index in RAM - is it realy worthy?
In my test, I have 12900 documents. Each document is small: a few discrete fields (Keyword type) and 1 Text field containing only 1 sentence. With both mergeFactor and maxMergeDocs set to 1000: using RAMDirectory, the indexing job took about 9.2 seconds; not using RAMDirectory, the indexing job took about 122 seconds. I am not calling optimize. This is on Windows XP running Java 1.5. Is there something very wrong or different in my setup to cause such a big difference?

Thanks -John

On Mon, 22 Nov 2004 09:23:40 -0800 (PST), Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> For the Lucene book I wrote some test cases that compare FSDirectory
> and RAMDirectory. What I found was that with certain settings
> FSDirectory was almost as fast as RAMDirectory. Personally, I would
> push FSDirectory and hope that the OS and the Filesystem do their share
> of work and caching for me before looking for ways to optimize my code.
>
> Otis
>
> --- [EMAIL PROTECTED] wrote:
> > I did the following test:
> > I created the RAM folder on my Red Hat box and copied c. 1Gb of
> > indexes there.
> > I expected the queries to run much quicker.
> > In reality it was even sometimes slower (sic!)
> >
> > Lucene has its own RAM disk functionality. If I implement it, would
> > it bring any benefits?
> >
> > Thanks in advance
> > J.
Re: downloading Lucene 1.4.2
In the same Lucene News section where the announcement about 1.4.2 is listed, there is a link that says "Binary and source distributions are available here." ... http://cvs.apache.org/dist/jakarta/lucene/v1.4.2/

I got really confused yesterday: I already had the binary version, and while looking for the source I found the link you listed. Does anyone know how to go about getting http://www.apache.org/dist/ updated?

: According to the Lucene homepage, Lucene 1.4.2 was released
: on October 1, 2004
:
: However, the "dist" on www.apache.org does not have a copy of
: Lucene 1.4.2
:
: http://www.apache.org/dist/jakarta/lucene/binaries/
:
: Where can I download Lucene 1.4.2?

--
"Oh, you're a tricky one." -- Trisha Weir

Chris M Hostetter -- [EMAIL PROTECTED]
downloading Lucene 1.4.2
According to the Lucene homepage, Lucene 1.4.2 was released on October 1, 2004. However, the "dist" on www.apache.org does not have a copy of Lucene 1.4.2: http://www.apache.org/dist/jakarta/lucene/binaries/ Where can I download Lucene 1.4.2? -Sean
RE: Too many open files issue
If you are on Linux, the number of file handles for a session is much lower than that for the whole machine. "ulimit -n" will tell you. There are instructions on the web for changing this setting; it involves editing /etc/security/limits.conf and setting the values for "nofile". (bulkadm is my user)

bulkadm soft nofile 8192
bulkadm hard nofile 65536

Also, if you use the compound file format you will have many fewer files.

-Original Message-
From: Neelam Bhatnagar [mailto:[EMAIL PROTECTED]
Sent: Monday, November 22, 2004 10:02 AM
To: Otis Gospodnetic
Cc: [EMAIL PROTECTED]
Subject: Too many open files issue

Hi, I had requested help on an issue we have been facing with the "Too many open files" Exception garbling the search indexes and crashing the search on the web site. As a suggestion, you had asked us to look at the articles on O'Reilly Network which had specific context around this exact problem. One of the suggestions was to increase the limit on the number of file descriptors on the file system. We tried it by first lowering the limit to 200 from 256 in order to reproduce the exception. The exception did get reproduced, but even after increasing the limit to 500 the exception kept coming, until after several rounds of trying to rebuild the index we finally got it working at the default file descriptor limit of 256. This makes us wonder if your first suggestion of optimizing indexes is a prerequisite to trying this option. Another piece of relevant information is that we have the default merge factor of 10. Kindly give us pointers to what it is that we are doing wrong, or should we be trying something completely different. Thanks and regards Neelam Bhatnagar
Re: auto-generate uid?
What would the purpose of an auto-generated UID be? But no, Lucene does not generate UIDs for you. Documents are numbered internally by their insertion order. This number changes, however, when documents are deleted in the middle and the index is optimized.

Erik

On Nov 22, 2004, at 1:50 PM, aurora wrote: Is there a way to auto-generate a uid in Lucene? Even just a way to query the highest uid and let the application add one to it will do. Thanks.
indexing benchmark
Hi folks: Is there an indexing benchmark somewhere? I see a search benchmark on the Lucene home site. Thanks -John
auto-generate uid?
Is there a way to auto-generate a uid in Lucene? Even just a way to query the highest uid, so the application can add one to it, would do. Thanks.
RE: Need help with filtering
It sounds like you need to pad your numbers with leading zeroes, i.e. use the same type of encoding as is required by RangeQuery. If you query with 05 instead of 5, do you get what you expect? If all your document ids are a fixed length, then string comparison will be isomorphic to integer comparison.

Chuck

> -Original Message-
> From: Edwin Tang [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 22, 2004 10:34 AM
> To: Lucene Users List
> Subject: Re: Need help with filtering
>
> Hello again,
>
> I've modified DateFilter to filter out document IDs as suggested. All
> seems to be running well until I tried a specific test case. All my
> documents have IDs in the 400,000 range. If I set my lower limit to 5,
> nothing comes back. After examining the code, I found the issue to be at
> the following line:
> TermEnum enumerator = reader.terms(new Term(field, start));
>
> Is there a way to retrieve a set of documents with IDs using an Integer
> comparison versus a String comparison? If I set "start" to 0, I get
> everything, but that's not very efficient.
>
> Thanks in advance,
> Ed
>
> --- Paul Elschot <[EMAIL PROTECTED]> wrote:
> > On Wednesday 17 November 2004 01:20, Edwin Tang wrote:
> > > I have been using DateFilter to limit my search results to a certain
> > > date range. I am now asked to replace this filter with one where my
> > > search results have document IDs greater than a given document ID.
> > > This document ID is assigned during indexing and is a Keyword field.
> > >
> > > I've browsed around the FAQs and archives and see that I can either
> > > use QueryFilter or BooleanQuery. I've tried both approaches to limit
> > > the document ID range, but am getting the BooleanQuery.TooManyClauses
> > > exception in both cases. I've also tried bumping the max number of
> > > clauses via setMaxClauseCount(), but that number has gotten pretty big.
> > >
> > > Is there another approach to this? ...
> >
> > Recoding DateFilter to a DocumentIdFilter should be straightforward.
> >
> > The trick is to use only one document enumerator at a time for all
> > terms. Document enumerators take buffer space, and that is the
> > reason why BooleanQuery has an exception for too many clauses.
> >
> > Regards,
> > Paul
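[A minimal sketch of the padding idea: encode ids to a fixed width at index time, and again when building query or filter bounds, so that string order agrees with numeric order. The width of ten digits is arbitrary; it just has to exceed the largest id you will ever store.]

    import java.text.DecimalFormat;

    public class IdPad {
        // Note: DecimalFormat isn't thread-safe; use one instance per thread.
        private static final DecimalFormat FORMAT = new DecimalFormat("0000000000");

        // 5 -> "0000000005", 400123 -> "0000400123"
        public static String encode(long id) {
            return FORMAT.format(id);
        }

        public static void main(String[] args) {
            // Lexical order now matches numeric order:
            System.out.println(encode(5).compareTo(encode(400123)) < 0);  // true
        }
    }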
Re: Need help with filtering
Hello again,

I've modified DateFilter to filter out document IDs as suggested. All seems to be running well until I tried a specific test case. All my documents have IDs in the 400,000 range. If I set my lower limit to 5, nothing comes back. After examining the code, I found the issue to be at the following line:

TermEnum enumerator = reader.terms(new Term(field, start));

Is there a way to retrieve a set of documents with IDs using an Integer comparison versus a String comparison? If I set "start" to 0, I get everything, but that's not very efficient.

Thanks in advance, Ed

--- Paul Elschot <[EMAIL PROTECTED]> wrote:
> On Wednesday 17 November 2004 01:20, Edwin Tang wrote:
> > I have been using DateFilter to limit my search results to a certain date
> > range. I am now asked to replace this filter with one where my search
> > results have document IDs greater than a given document ID. This document
> > ID is assigned during indexing and is a Keyword field.
> >
> > I've browsed around the FAQs and archives and see that I can either use
> > QueryFilter or BooleanQuery. I've tried both approaches to limit the
> > document ID range, but am getting the BooleanQuery.TooManyClauses
> > exception in both cases. I've also tried bumping the max number of
> > clauses via setMaxClauseCount(), but that number has gotten pretty big.
> >
> > Is there another approach to this? ...
>
> Recoding DateFilter to a DocumentIdFilter should be straightforward.
>
> The trick is to use only one document enumerator at a time for all
> terms. Document enumerators take buffer space, and that is the
> reason why BooleanQuery has an exception for too many clauses.
>
> Regards,
> Paul
Re: Index in RAM - is it realy worthy?
For the Lucene book I wrote some test cases that compare FSDirectory and RAMDirectory. What I found was that with certain settings FSDirectory was almost as fast as RAMDirectory. Personally, I would push FSDirectory and hope that the OS and the Filesystem do their share of work and caching for me before looking for ways to optimize my code.

Otis

--- [EMAIL PROTECTED] wrote:
> I did the following test:
> I created the RAM folder on my Red Hat box and copied c. 1Gb of
> indexes there.
> I expected the queries to run much quicker.
> In reality it was even sometimes slower (sic!)
>
> Lucene has its own RAM disk functionality. If I implement it, would
> it bring any benefits?
>
> Thanks in advance
> J.
Index in RAM - is it realy worthy?
I did the following test: I created the RAM folder on my Red Hat box and copied c. 1Gb of indexes there. I expected the queries to run much quicker. In reality it was even sometimes slower (sic!)

Lucene has its own RAM disk functionality. If I implement it, would it bring any benefits?

Thanks in advance
J.
RE: Question about multi-searching [re-post]
If you are going to compare scores across multiple indices, I'd suggest considering one of the patches here: http://issues.apache.org/bugzilla/show_bug.cgi?id=31841

Chuck

> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 22, 2004 6:30 AM
> To: Lucene Users List
> Subject: Re: Question about multi-searching [re-post]
>
> On Nov 22, 2004, at 9:18 AM, Cocula Remi wrote:
> >> (First of all: what is the plural of index in English; indexes or
> >> indices?)
>
> We used "indexes" in Lucene in Action. It's a bit ambiguous in English,
> but indexes sounds less formal and is acceptable.
>
> >> For that, I parse a new query using QueryParser or
> >> MultiFieldQueryParser.
> >> Then I search my indexes using the MultiSearcher class.
> >>
> >> Ok, but the problem comes when different analyzers are used for each
> >> index.
> >> QueryParser requires an analyzer to parse the query, but a query
> >> parsed with one analyzer is not suitable for searching an index
> >> that uses another analyzer.
> >>
> >> Does anyone know a trick to cope with this problem?
>
> Nothing built into Lucene solves this problem specifically. You'll
> have to come up with your own MultiSearcher-like facility that can
> apply different queries to different indexes and merge the results back
> together. This will be awkward when it comes to scoring though, since
> each index is using a different query.
>
> >> Eventually I could run a different query on each index to obtain
> >> several Hits objects.
> >> Then I could write some collector that collects Hits in the order of
> >> highest scores.
> >> I wonder if this could work and if it would be as efficient as the
> >> MultiSearcher. In this situation does it make sense to compare the
> >> scores of two different Hits objects?
>
> No, it won't make good sense to compare the scores between the queries,
> but I suspect your queries are pretty close to one another if all that
> varies is the analyzer. It still will be an awkward comparison though,
> but maybe good enough for your needs?
>
> Erik
Re: Limo 0.5
On Mon, 2004-11-22 at 02:27, Chandrashekhar wrote:
> Hi,
>
> With Limo 0.5, can I find out if a certain word from some Document is
> indexed or not?

This feature doesn't exist as such. You could search for it, and if results come up, then the word is in the documents it returns. I'll add enumerating the terms in an index to my list of things to add.

Regards,
Luke Francl
Re: Using multiple analysers within a query
On Nov 22, 2004, at 9:17 AM, Morus Walter wrote: Erik Hatcher writes: If your query isn't entered by users, you shouldn't use query parser in most cases anyway. I'd go even further and say in all cases. If you use Lucene as a search server you have to provide the query somehow. E.g. we have a PHP application that sends queries to a Lucene search servlet. In this case it's justifiable to serialize the query into query parser syntax on the client side and have query parser read the query again on the server side.

Ah, good point! I hadn't considered this scenario.

Erik
Too many open files issue
Hi, I had requested help on an issue we have been facing with the "Too many open files" Exception garbling the search indexes and crashing the search on the web site. As a suggestion, you had asked us to look at the articles on O'Reilly Network which had specific context around this exact problem. One of the suggestions was to increase the limit on the number of file descriptors on the file system. We tried it by first lowering the limit to 200 from 256 in order to reproduce the exception. The exception did get reproduced, but even after increasing the limit to 500 the exception kept coming, until after several rounds of trying to rebuild the index we finally got it working at the default file descriptor limit of 256. This makes us wonder if your first suggestion of optimizing indexes is a prerequisite to trying this option. Another piece of relevant information is that we have the default merge factor of 10. Kindly give us pointers to what it is that we are doing wrong, or should we be trying something completely different. Thanks and regards Neelam Bhatnagar
Re: Question about multi-searching [re-post]
On Nov 22, 2004, at 9:18 AM, Cocula Remi wrote: (First of all: what is the plural of index in English; indexes or indices?)

We used "indexes" in Lucene in Action. It's a bit ambiguous in English, but indexes sounds less formal and is acceptable.

For that, I parse a new query using QueryParser or MultiFieldQueryParser. Then I search my indexes using the MultiSearcher class. Ok, but the problem comes when different analyzers are used for each index. QueryParser requires an analyzer to parse the query, but a query parsed with one analyzer is not suitable for searching an index that uses another analyzer. Does anyone know a trick to cope with this problem?

Nothing built into Lucene solves this problem specifically. You'll have to come up with your own MultiSearcher-like facility that can apply different queries to different indexes and merge the results back together. This will be awkward when it comes to scoring though, since each index is using a different query.

Eventually I could run a different query on each index to obtain several Hits objects. Then I could write some collector that collects Hits in the order of highest scores. I wonder if this could work and if it would be as efficient as the MultiSearcher. In this situation does it make sense to compare the scores of two different Hits objects?

No, it won't make good sense to compare the scores between the queries, but I suspect your queries are pretty close to one another if all that varies is the analyzer. It still will be an awkward comparison though, but maybe good enough for your needs?

Erik
Re: Optimized??
As I understand it, optimization is when you merge several segments into one, allowing for faster queries. The FAQs and API have further details. http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q24

Luke

- Original Message -
From: "Miguel Angel" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, November 20, 2004 5:19 PM
Subject: Optimized??

What does an "optimized" index mean in Lucene?

--
Miguel Angel Angeles R.
Asesoria en Conectividad y Servidores
Telf. 97451277
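[For completeness, triggering an optimize is a single call on the writer. A sketch against the Lucene 1.4-era API with a made-up index path.]

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class OptimizeIndex {
        public static void main(String[] args) throws Exception {
            // 'false' = open an existing index rather than create a new one.
            IndexWriter writer = new IndexWriter("/path/to/index",
                    new StandardAnalyzer(), false);
            writer.optimize();  // merge all segments down to one
            writer.close();
        }
    }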
Re: How much time indexing doc ??
PDFs can definitely slow things down, depending on their size. If there are a few larger PDF documents, that time is definitely possible.

Luke

- Original Message -
From: "Miguel Angel" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, November 20, 2004 11:25 AM
Subject: How much time indexing doc ??

> Hi, I have 1000 docs (Word, PDF and HTML); those documents indexed
> in 5 min. Is this correct? Or do I have a problem with my Analyzer?
> I used StandardAnalyzer.
> --
> Miguel Angel Angeles R.
> Asesoria en Conectividad y Servidores
> Telf. 97451277
Question about multi-searching [re-post]
> Hi,
>
> (First of all: what is the plural of index in English; indexes or indices?)
>
> I want to search into several indexes (indices?).
> For that, I parse a new query using QueryParser or MultiFieldQueryParser.
> Then I search my indexes using the MultiSearcher class.
>
> Ok, but the problem comes when different analyzers are used for each index.
> QueryParser requires an analyzer to parse the query, but a query parsed
> with one analyzer is not suitable for searching an index that uses another
> analyzer.
>
> Does anyone know a trick to cope with this problem?
>
> Eventually I could run a different query on each index to obtain several
> Hits objects.
> Then I could write some collector that collects Hits in the order of
> highest scores.
> I wonder if this could work and if it would be as efficient as the
> MultiSearcher. In this situation does it make sense to compare the scores
> of two different Hits objects?
Re: Using multiple analysers within a query
Erik Hatcher writes:
> > If your query isn't entered by users, you shouldn't use query parser in
> > most cases anyway.
>
> I'd go even further and say in all cases.

If you use Lucene as a search server you have to provide the query somehow. E.g. we have a PHP application that sends queries to a Lucene search servlet. In this case it's justifiable to serialize the query into query parser syntax on the client side and have query parser read the query again on the server side. I don't recall any problems with the approach, since we clean up the user input before constructing the query.

Morus
Re: Using multiple analysers within a query
On Nov 22, 2004, at 2:56 AM, Morus Walter wrote: Kauler, Leto S writes: Would anyone have any suggestions on how this could be done? I was thinking maybe the QueryParser would have to be changed/extended to accept a separator other than colon ":", something like "=" for example, to indicate this clause is not to be tokenised. I suggested that in a recent discussion, and Erik Hatcher objected that it isn't a good idea to require that users know which field to query in which way. I guess he is right.

QueryParser is a one-size-fits(?)-all sort of beast. It has plenty of negatives, no question.

If your query isn't entered by users, you shouldn't use query parser in most cases anyway.

I'd go even further and say in all cases.

Or perhaps this can all be done using a single analyser? Look at PerFieldAnalyzerWrapper. You will probably have to write a keyword analyzer (unless you can use whitespace analyzer in your case).

We should probably add a KeywordAnalyzer to Lucene's core at some point.

Erik
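[A sketch of that combination, assuming the Lucene 1.4-era analysis API. Since there is no stock KeywordAnalyzer at this point, a minimal one is written inline; the field name "path" is carried over from the original example, and the class names are made up.]

    import java.io.IOException;
    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Emits the entire field value as a single token.
    class SimpleKeywordAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, final Reader reader) {
            return new TokenStream() {
                private boolean done = false;

                public Token next() throws IOException {
                    if (done) {
                        return null;
                    }
                    done = true;
                    // Read the whole field value into one token.
                    StringBuffer text = new StringBuffer();
                    char[] buffer = new char[1024];
                    int length;
                    while ((length = reader.read(buffer)) != -1) {
                        text.append(buffer, 0, length);
                    }
                    return new Token(text.toString(), 0, text.length());
                }

                public void close() throws IOException {
                    reader.close();
                }
            };
        }
    }

    public class AnalyzerSetup {
        public static Analyzer build() {
            // StandardAnalyzer everywhere, except 'path' stays untokenized.
            PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            wrapper.addAnalyzer("path", new SimpleKeywordAnalyzer());
            return wrapper;
        }
    }

The same wrapper would have to be used at both index and query time, so the 'path' field is written and searched as one exact term.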
Re: disadvantages
On Nov 22, 2004, at 12:36 AM, Luke Francl wrote: Well that really depends on how big your index is and what they search for, now doesn't it? ;) Everything is relative.

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Sun 11/21/2004 2:52 PM
To: Lucene Users List
Subject: Re: disadvantages

On Nov 21, 2004, at 12:00 PM, Miguel Angel wrote: What are the disadvantages of Lucene?

The users of your system won't have time to get coffee when running searches.

Erik
Re: Using multiple analysers within a query
Kauler, Leto S writes:
> Would anyone have any suggestions on how this could be done? I was
> thinking maybe the QueryParser would have to be changed/extended to
> accept a separator other than colon ":", something like "=" for example,
> to indicate this clause is not to be tokenised.

I suggested that in a recent discussion, and Erik Hatcher objected that it isn't a good idea to require that users know which field to query in which way. I guess he is right. If your query isn't entered by users, you shouldn't use query parser in most cases anyway.

> Or perhaps this can all be done using a single analyser?

Look at PerFieldAnalyzerWrapper. You will probably have to write a keyword analyzer (unless you can use whitespace analyzer in your case).

HTH
Morus
Re: Using multiple analysers within a query
On Monday 22 November 2004 05:02, Kauler, Leto S wrote:
> Hi Lucene list,
>
> We have the need for analysed and 'not analysed/not tokenised' clauses
> within one query. Imagine an unparsed query like:
>
> +title:"Hello World" +path:Resources\Live\1
>
> In the above example we would want the first clause to use
> StandardAnalyser and the second to use an analyser which returns the
> term as a single token. So a parsed result might look like:
>
> +(title:hello title:world) +path:Resources\Live\1
>
> Would anyone have any suggestions on how this could be done? I was
> thinking maybe the QueryParser would have to be changed/extended to
> accept a separator other than colon ":", something like "=" for example
> to indicate this clause is not to be tokenised. Or perhaps this can all
> be done using a single analyser?

Overriding QueryParser.getFieldQuery() might work for you. It is given the field and the query text, so an analyzer can be chosen depending on the field. In case you don't use the latest CVS head, it may be worthwhile to have a look. Some of the getFieldQuery methods have been deprecated, but I don't know since when.

Regards,
Paul.
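[A sketch of that override against the Lucene 1.4-era QueryParser, using the non-deprecated two-argument getFieldQuery. The "path" field name is carried over from the original example; every other field goes through the normal analysis chain.]

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class PerFieldQueryParser extends QueryParser {
        public PerFieldQueryParser(String defaultField, Analyzer analyzer) {
            super(defaultField, analyzer);
        }

        protected Query getFieldQuery(String field, String queryText)
                throws ParseException {
            if ("path".equals(field)) {
                // Bypass analysis: one untokenized term for the whole clause.
                return new TermQuery(new Term(field, queryText));
            }
            return super.getFieldQuery(field, queryText);
        }
    }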
Limo 0.5
Hi, With Limo 0.5, can I find out if a certain word from some Document is indexed or not? With Regards, Chandrashekhar V Deshmukh