date window
Hi All, I've been thinking about this problem for some time now. I'm trying to figure out a way to store date windows in Lucene so that I can easily filter as follows. A particular document can have several date windows. Given a specific date, only return those documents where that date falls within at least one of those windows. From what I can see, the only way I can think of doing it is to create a special field format and create a custom filter. The filter isn't that useful for caching though, because every query will have a new date (essentially NOW()). Also, note that there are multiple windows here for a single document, so we can't just search between min start and max end. Ideas from those more familiar with Lucene would be greatly appreciated. James - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
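A sketch of the "special field format plus custom filter" idea described above (not something from the thread itself): it assumes each window is indexed as one untokenized term of the form "yyyyMMddHHmm_yyyyMMddHHmm" in a hypothetical "window" field, and the filter walks that field's term dictionary once per query. The caching concern still applies, because the target date changes on every query.

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.Filter;

public class DateWindowFilter extends Filter {

  private final String now;   // the target date, e.g. "200705171200"

  public DateWindowFilter(String now) { this.now = now; }

  public BitSet bits(IndexReader reader) throws IOException {
    BitSet result = new BitSet(reader.maxDoc());
    TermEnum terms = reader.terms(new Term("window", ""));
    TermDocs docs = reader.termDocs();
    try {
      do {
        Term t = terms.term();
        if (t == null || !"window".equals(t.field())) break;
        String[] range = t.text().split("_");
        // lexicographic comparison works because the encoding is fixed width
        if (range[0].compareTo(now) <= 0 && now.compareTo(range[1]) <= 0) {
          docs.seek(terms);
          while (docs.next()) result.set(docs.doc());   // every doc carrying this window matches
        }
      } while (terms.next());
    } finally {
      terms.close();
      docs.close();
    }
    return result;
  }
}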
Re: Field.Store.Compress - does it improve performance of document reads?
On Thursday 17 May 2007 08:10, Andreas Guther wrote: > I am currently exploring how to solve performance problems I encounter with > Lucene document reads. > > We have amongst other fields one field (default) storing all searchable > fields. This field can become of considerable size since we are indexing > documents and store the content for display within results. > > I noticed that the read can be very expensive. I wonder now if it would > make sense to add this field as Field.Store.Compress to the index. Can > someone tell me if this would speed up the document read or if this is > something only interesting for saving space. I have not tried the compression yet, but in my experience a good way to reduce the costs of document reads from a disk is by reading them in document number order whenever possible. In this way one saves on disk head seeks. Compression should actually help reduce the costs of disk head seeks even more. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
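A small sketch of the read-in-document-number-order suggestion, assuming the interesting doc ids have already been collected (for example from a HitCollector): sort them and fetch the stored documents in ascending order so the disk head moves mostly forward.

import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;

public class OrderedRead {

  public static Document[] readInDocOrder(IndexSearcher searcher, int[] docIds)
      throws IOException {
    int[] sorted = (int[]) docIds.clone();
    Arrays.sort(sorted);                      // ascending document number
    Document[] docs = new Document[sorted.length];
    for (int i = 0; i < sorted.length; i++) {
      docs[i] = searcher.doc(sorted[i]);      // reads move through the stored-fields file in one direction
    }
    return docs;
  }
}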
Group the search results by a given field
Hi All, I was wondering - is it possible to search and group the results by a given field? For example, I have an index with several million records. Most of them are different Features of the same ID. I'd love to be able to do groupby=ID or something like that in the results, and provide the ID as a clickable link to see all the Features of that ID. I have used the HitCollector class to accomplish this goal. In the collect method I have used the following algorithm: Collect() { if Searcher.Doc(doc_id).get(ID) does not exist as a key in the hash table, then add Searcher.Doc(doc_id).get(ID) as a new key with value = 1, else increment the value for HashKey(Searcher.Doc(doc_id).get(ID)) by 1 }. But it depends on the hit count: as soon as I get more hits it takes more time and my search performance degrades. How can it be done with the best performance? Any ideas? Sawan
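One common way to speed this kind of grouping up (a sketch, assuming the ID field is indexed as a single token per document) is to avoid calling Searcher.doc() inside collect() altogether and read the ID from the FieldCache instead, which is a single array lookup per hit:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;

public class GroupCountCollector extends HitCollector {

  private final String[] ids;                 // doc id -> ID value, loaded once per reader
  private final Map counts = new HashMap();   // ID -> Integer count

  public GroupCountCollector(IndexReader reader) throws IOException {
    this.ids = FieldCache.DEFAULT.getStrings(reader, "ID");
  }

  public void collect(int doc, float score) {
    String id = ids[doc];
    Integer c = (Integer) counts.get(id);
    counts.put(id, c == null ? new Integer(1) : new Integer(c.intValue() + 1));
  }

  public Map getCounts() { return counts; }
}

The first search after opening an IndexReader pays the cost of populating the cache; after that the per-hit work is just the array lookup and a hash update.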
about to get
Hi luceners: I want to get the TermFreqVector, but I must get the docNum first: titleVector = reader.getTermFreqVector(docNum, "title"); However, I can't get the docNum from a Lucene Document. How can I get the docNum using a Document object, like getTermFreqVector(doc, "title")? xiaojun tong 010-64489518-613 [EMAIL PROTECTED] www.feedsky.com
Re: Field.Store.Compress - does it improve performance of document reads?
I haven't tried compression either. I know there was some talk a while ago about deprecating, but that hasn't happened. The current implementation yields the highest level of compression. You might find better results by compressing in your application and storing as a binary field, thus giving you more control over CPU used. This is our current recommendation for dealing w/ compression. If you are not actually displaying that field, you should look into the FieldSelector API (via IndexReader). It allows you to lazily load fields or skip them all together and can yield a pretty significant savings when it comes to loading documents. FieldSelector is available in 2.1. -Grant On May 17, 2007, at 4:01 AM, Paul Elschot wrote: On Thursday 17 May 2007 08:10, Andreas Guther wrote: I am currently exploring how to solve performance problems I encounter with Lucene document reads. We have amongst other fields one field (default) storing all searchable fields. This field can become of considerable size since we are indexing documents and store the content for display within results. I noticed that the read can be very expensive. I wonder now if it would make sense to add this field as Field.Store.Compress to the index. Can someone tell me if this would speed up the document read or if this is something only interesting for saving space. I have not tried the compression yet, but in my experience a good way to reduce the costs of document reads from a disk is by reading them in document number order whenever possible. In this way one saves on the disk head seeks. Compression should actually help reducing the costs of disk head seeks even more. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Grant Ingersoll Center for Natural Language Processing http://www.cnlp.org/tech/lucene.asp Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ LuceneFAQ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
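A sketch of the "compress in your application and store as a binary field" suggestion, using a hypothetical helper and assuming UTF-8 text; the point is that the Deflater level is now under your control, unlike Field.Store.COMPRESS, which always uses the highest level:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import org.apache.lucene.document.Field;

public class AppSideCompression {

  public static Field compressedField(String name, String text) throws IOException {
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);         // trade compression ratio for CPU
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DeflaterOutputStream out = new DeflaterOutputStream(bytes, deflater);
    out.write(text.getBytes("UTF-8"));
    out.close();
    return new Field(name, bytes.toByteArray(), Field.Store.YES);  // stored binary field
  }
}

At read time the application inflates the bytes itself (for example with java.util.zip.InflaterInputStream), so it also decides when that cost is paid.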
Re: about to get
You can get it from a Hits object (see the id() method) or you can iterate over the docs from 0 to maxDoc -1 (skipping deleted docs) I have some code at http://www.cnlp.org/apachecon2005/ that shows various usages for Term Vector. The Lucene in Action book has some good examples as well. -Grant On May 17, 2007, at 6:10 AM, 童小军 wrote: Hi lucener: I am want get the TermFreqVector 。but I must get docNum first. titleVector = reader.getTermFreqVector(docNum, "title"); but I can’t get Docnum by lucene Document. how can I get the docNum use Document object? Like this getTermFreqVector(doc,”title”); xiaojun tong 010-64489518-613 [EMAIL PROTECTED] www.feedsky.com -- Grant Ingersoll Center for Natural Language Processing http://www.cnlp.org/tech/lucene.asp Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ LuceneFAQ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
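A minimal sketch of the Hits.id() route (the query and field name are placeholders):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TermVectorLookup {

  public static void printTitleVectors(IndexReader reader, Query query) throws Exception {
    IndexSearcher searcher = new IndexSearcher(reader);
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
      int docNum = hits.id(i);              // the document number the reader needs
      TermFreqVector vector = reader.getTermFreqVector(docNum, "title");
      System.out.println(docNum + " -> " + vector);
    }
    searcher.close();
  }
}

This only works if "title" was indexed with term vectors enabled (Field.TermVector.YES or similar) at indexing time.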
How to ignore scoring for a Query?
Hi, I have two different use cases for my queries. For the first, performance is not too critical and I want to sort the results by relevance (score). The second, however, is performance critical, but the score for each result is not interesting. I guess that if it were possible to disable scoring for the query, I could improve performance (note that omitNorms on a Field is not an option, due to the first use case). Is there a straightforward way to disable scoring for a query (it's a BooleanQuery, btw, with some clauses, which can be any other query)? Thanks, Ben - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
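As a later reply in this digest notes, there is no switch that simply turns scoring off. One workaround sometimes suggested for the "I only need the matching documents" case is to run the BooleanQuery through a filter, either pulling the matching doc ids from the BitSet directly or wrapping it in a ConstantScoreQuery so every hit gets the same score. A hedged sketch using the 2.1-era QueryFilter and ConstantScoreQuery classes; the inner query is still executed, so any speedup should be measured rather than assumed:

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.QueryFilter;

public class UnscoredMatches {

  // one bit per matching document, no ranking at all
  public static BitSet matchingDocs(IndexReader reader, BooleanQuery query) throws IOException {
    return new QueryFilter(query).bits(reader);
  }

  // usable wherever a Query is expected; all hits share the same constant score
  public static ConstantScoreQuery constantScore(BooleanQuery query) {
    return new ConstantScoreQuery(new QueryFilter(query));
  }
}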
Re: Group the search results by a given field
There has been significant discussion on this topic (way more than I can remember clearly) on the mail thread, but as I remember it's been referred to as "facet" or "faceted". I think you would get a lot of info searching for these terms at... http://www.gossamer-threads.com/lists/lucene/java-user/ Best Erick On 5/17/07, Sawan Sharma <[EMAIL PROTECTED]> wrote: Hi All, I was wondering - is it possible to search and group the results by a given field? For example, I have an index with several million records. Most of them are different Features of the same ID. I'd love to be able to do.. groupby=ID or something like that in the results, and provide the ID as a clickable link to see all the Features of that ID. I have used HitCollector class to accomplish this goal. In Collect method I have used following algo... Collect() { if Searcher.Doc(doc_id).get(ID) is not exist in HashKey then Add Searcher.Doc(doc_id).get(ID) as new HashKey in hash table and assign value = 1 else increment HashKey( Searcher.Doc(doc_id).get(ID)) value with 1 } But, it depends on HitCount. As soon as I get more hits it takes more time and my search performance is degrade. How it can be done with best performance..? Any ideas? Sawan
Re: Field.Store.Compress - does it improve performance of document reads?
Some time ago I posted the results in my peculiar app of using FieldSelector, and it gave dramatic improvements in my case (a factor of about 10). I suspect much of that was peculiar to my index design, so your mileage may vary. See a thread titled... *Lucene 2.1, using FieldSelector speeds up my app by a factor of 10+* Best Erick On 5/17/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: I haven't tried compression either. I know there was some talk a while ago about deprecating, but that hasn't happened. The current implementation yields the highest level of compression. You might find better results by compressing in your application and storing as a binary field, thus giving you more control over CPU used. This is our current recommendation for dealing w/ compression. If you are not actually displaying that field, you should look into the FieldSelector API (via IndexReader). It allows you to lazily load fields or skip them all together and can yield a pretty significant savings when it comes to loading documents. FieldSelector is available in 2.1. -Grant On May 17, 2007, at 4:01 AM, Paul Elschot wrote: > On Thursday 17 May 2007 08:10, Andreas Guther wrote: >> I am currently exploring how to solve performance problems I >> encounter with >> Lucene document reads. >> >> We have amongst other fields one field (default) storing all >> searchable >> fields. This field can become of considerable size since we are >> indexing >> documents and store the content for display within results. >> >> I noticed that the read can be very expensive. I wonder now if it >> would >> make sense to add this field as Field.Store.Compress to the >> index. Can >> someone tell me if this would speed up the document read or if >> this is >> something only interesting for saving space. > > I have not tried the compression yet, but in my experience a good way > to reduce the costs of document reads from a disk is by reading them > in document number order whenever possible. In this way one saves > on the disk head seeks. > Compression should actually help reducing the costs of disk head seeks > even more. > > Regards, > Paul Elschot > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > -- Grant Ingersoll Center for Natural Language Processing http://www.cnlp.org/tech/lucene.asp Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ LuceneFAQ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
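For reference, a bare-bones sketch of the 2.1 FieldSelector API this thread is talking about; the field names are made up. NO_LOAD skips a field entirely, while LAZY_LOAD (not shown) defers reading a field until its value is actually asked for:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.index.IndexReader;

public class TitleOnlyLoader {

  private static final FieldSelector TITLE_ONLY = new FieldSelector() {
    public FieldSelectorResult accept(String fieldName) {
      return "title".equals(fieldName)
          ? FieldSelectorResult.LOAD        // read this field now
          : FieldSelectorResult.NO_LOAD;    // never touch the others
    }
  };

  public static String title(IndexReader reader, int docId) throws IOException {
    Document doc = reader.document(docId, TITLE_ONLY);
    return doc.get("title");
  }
}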
Re: Field.Store.Compress - does it improve performance of document reads?
I am actually using the FieldSelector and unless I did something wrong it did not provide me any load performance improvements which was surprising to me and disappointing at the same time. The only difference I could see was when I returned for all fields a NO_LOAD which from my understanding is the same as skipping over the document. Right now I am looking into fragmentation problems of my huge index files. I am de-fragmenting the hard drive to see if this brings any read performance improvements. I am also wondering if the FieldCache as discussed in http://www.gossamer-threads.com/lists/lucene/general/28252 would help improve the situation. Andreas On 5/17/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: I haven't tried compression either. I know there was some talk a while ago about deprecating, but that hasn't happened. The current implementation yields the highest level of compression. You might find better results by compressing in your application and storing as a binary field, thus giving you more control over CPU used. This is our current recommendation for dealing w/ compression. If you are not actually displaying that field, you should look into the FieldSelector API (via IndexReader). It allows you to lazily load fields or skip them all together and can yield a pretty significant savings when it comes to loading documents. FieldSelector is available in 2.1. -Grant On May 17, 2007, at 4:01 AM, Paul Elschot wrote: > On Thursday 17 May 2007 08:10, Andreas Guther wrote: >> I am currently exploring how to solve performance problems I >> encounter with >> Lucene document reads. >> >> We have amongst other fields one field (default) storing all >> searchable >> fields. This field can become of considerable size since we are >> indexing >> documents and store the content for display within results. >> >> I noticed that the read can be very expensive. I wonder now if it >> would >> make sense to add this field as Field.Store.Compress to the >> index. Can >> someone tell me if this would speed up the document read or if >> this is >> something only interesting for saving space. > > I have not tried the compression yet, but in my experience a good way > to reduce the costs of document reads from a disk is by reading them > in document number order whenever possible. In this way one saves > on the disk head seeks. > Compression should actually help reducing the costs of disk head seeks > even more. > > Regards, > Paul Elschot > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > -- Grant Ingersoll Center for Natural Language Processing http://www.cnlp.org/tech/lucene.asp Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ LuceneFAQ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Indexing Open Office documents
Anyone know how to add an OpenOffice document to a Lucene index? Is there a parser for OpenOffice? Thanks in advance, jim s. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Field.Store.Compress - does it improve performance of document reads?
h. Now that I re-read your first mail, something else suggests itself. You stated: "We have amongst other fields one field (default) storing all searchable fields". Do you need to store this field at all? You can search fields that are indexed but NOT stored. I've used something of the same technique where I index lots of different fields in the same search field so my queries aren't as complex, but return various stored fields to the user for display purposes. Often these latter fields are stored but NOT indexed. It might also be useful if you'd post some of your relevant code snippets, perhaps some innocent line is messing you up... Are you, perhaps, calling get() in a HitCollector? Or iterating through many documents with a Hits object? Or. Best Erick On 5/17/07, Andreas Guther <[EMAIL PROTECTED]> wrote: I am actually using the FieldSelector and unless I did something wrong it did not provide me any load performance improvements which was surprising to me and disappointing at the same time. The only difference I could see was when I returned for all fields a NO_LOAD which from my understanding is the same as skipping over the document. Right now I am looking into fragmentation problems of my huge index files. I am de-fragmenting the hard drive to see if this brings any read performance improvements. I am also wondering if the FieldCache as discussed in http://www.gossamer-threads.com/lists/lucene/general/28252 would help improve the situation. Andreas On 5/17/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > I haven't tried compression either. I know there was some talk a > while ago about deprecating, but that hasn't happened. The current > implementation yields the highest level of compression. You might > find better results by compressing in your application and storing as > a binary field, thus giving you more control over CPU used. This is > our current recommendation for dealing w/ compression. > > If you are not actually displaying that field, you should look into > the FieldSelector API (via IndexReader). It allows you to lazily > load fields or skip them all together and can yield a pretty > significant savings when it comes to loading documents. > FieldSelector is available in 2.1. > > -Grant > > On May 17, 2007, at 4:01 AM, Paul Elschot wrote: > > > On Thursday 17 May 2007 08:10, Andreas Guther wrote: > >> I am currently exploring how to solve performance problems I > >> encounter with > >> Lucene document reads. > >> > >> We have amongst other fields one field (default) storing all > >> searchable > >> fields. This field can become of considerable size since we are > >> indexing > >> documents and store the content for display within results. > >> > >> I noticed that the read can be very expensive. I wonder now if it > >> would > >> make sense to add this field as Field.Store.Compress to the > >> index. Can > >> someone tell me if this would speed up the document read or if > >> this is > >> something only interesting for saving space. > > > > I have not tried the compression yet, but in my experience a good way > > to reduce the costs of document reads from a disk is by reading them > > in document number order whenever possible. In this way one saves > > on the disk head seeks. > > Compression should actually help reducing the costs of disk head seeks > > even more. 
> > > > Regards, > > Paul Elschot > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > -- > Grant Ingersoll > Center for Natural Language Processing > http://www.cnlp.org/tech/lucene.asp > > Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ > LuceneFAQ > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
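A short sketch of the field split Erick describes above, with made-up field names: the big catch-all field is indexed but not stored, while the small display-only fields are stored but not indexed.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldSplit {

  public static Document build(String allSearchableText, String title, String url) {
    Document doc = new Document();
    // searched, but never loaded at display time
    doc.add(new Field("default", allSearchableText, Field.Store.NO, Field.Index.TOKENIZED));
    // loaded for display, but never searched
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));
    doc.add(new Field("url", url, Field.Store.YES, Field.Index.NO));
    return doc;
  }
}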
Re: Indexing Open Office documents
There is a parser for OpenOffice in Nutch. It is a plugin called parse-oo. You can find more information in the Nutch mailing lists. On 5/17/07, jim shirreffs <[EMAIL PROTECTED]> wrote: Anyone know how to add an OpenOffice document to a Lucene index? Is there a parser for OpenOffice? Thanks in advance, jim s. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
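Outside of Nutch's parse-oo plugin, the underlying idea can be sketched directly: an OpenDocument file is a ZIP archive whose text lives in content.xml, so you can read that entry and strip the markup before handing the text to Lucene. This is a crude illustration under those assumptions (a real ODF/XML parser is the better route), not a description of how parse-oo works.

import java.io.InputStreamReader;
import java.io.Reader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class OdfTextExtractor {

  public static String extractText(String odfPath) throws Exception {
    ZipFile zip = new ZipFile(odfPath);
    ZipEntry entry = zip.getEntry("content.xml");          // the document body
    Reader in = new InputStreamReader(zip.getInputStream(entry), "UTF-8");
    StringBuffer xml = new StringBuffer();
    char[] buf = new char[4096];
    for (int n; (n = in.read(buf)) != -1; ) {
      xml.append(buf, 0, n);
    }
    in.close();
    zip.close();
    return xml.toString().replaceAll("<[^>]*>", " ");       // drop tags, keep the text
  }
}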
Re: date window
: A particular document can have several date windows. : Given a specific date, only return those documents where that date : falls within at least one of those windows. : Also, note that there are multiple windows here for a single : document, we can't just search between min start and max end. This can theoretically be done using a custom variant of PhraseQuery and "parallel fields", where you have a range_start field and a range_end field and the positions of the dates in each field "line up" so that you can tell which start corresponds with which end. I mentioned this idea (which actually came from an offhand comment Doug made about PhraseQuery at last year's ApacheCon US) in this Solr thread when someone asked a similar question... http://www.nabble.com/One-item%2C-multiple-fields%2C-and-range-queries-tf2969183.html#a8404600 ...I have never tried implementing this in practice. -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
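The indexing side of the "parallel fields" idea might look like the sketch below (the field names follow the mail, everything else is an assumption): each window adds one value to range_start and one to range_end, so the token positions in the two fields stay aligned. The custom PhraseQuery-style scorer that pairs a start with its end at query time is the part nobody here has implemented.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ParallelRangeFields {

  // windows[i][0] = start, windows[i][1] = end, already encoded as fixed-width date strings
  public static Document build(String id, String[][] windows) {
    Document doc = new Document();
    doc.add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
    for (int i = 0; i < windows.length; i++) {
      // adding the fields pairwise keeps position i in range_start matching position i in range_end
      doc.add(new Field("range_start", windows[i][0], Field.Store.NO, Field.Index.UN_TOKENIZED));
      doc.add(new Field("range_end", windows[i][1], Field.Store.NO, Field.Index.UN_TOKENIZED));
    }
    return doc;
  }
}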
Re: Field.Store.Compress - does it improve performance of document reads?
On 17-May-07, at 6:43 AM, Andreas Guther wrote: I am actually using the FieldSelector and unless I did something wrong it did not provide me any load performance improvements which was surprising to me and disappointing at the same time. The only difference I could see was when I returned for all fields a NO_LOAD which from my understanding is the same as skipping over the document. Note that storing the field as binary or compressed will increase the speed gains from lazy loading. If the stored field is just text, then Lucene has to scan the characters instead of .seek()ing to a byte position. -Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Is it possible to use a custom similarity class to cause extra terms in a field to lower score?
If I have two items in an index: Terminator 2 Terminator 2: Judgment Day When I score them against the query +title:(Terminator 2), they come up with the same score (which makes sense, it just isn't quite what I want). Would there be some method or combination of methods in Similarity that I could easily override to allow me to penalize the second item because it had "unused terms"? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Is it possible to use a custom similarity class to cause extra terms in a field to lower score?
: Terminator 2 : Terminator 2: Judgment Day : : And I score them against the query +title:(Terminator 2) : Would there be some method or combination of methods in Similarity : that I could easily override to allow me to penalize the second item : because it had "unused terms"? That's what the DefaultSimilarity does: it uses the (length)norm information stored when the documents are indexed to know which one is a better match (because it matches on a shorter field). If you aren't seeing that behavior then perhaps you turned on omitNorms for that field, or perhaps the byte encoding is making the distinction between your various field lengths too small -- overriding the lengthNorm function and reindexing might help. -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
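If the one-byte norm encoding does turn out to be the problem, the lengthNorm override could look something like this sketch (the exact curve and field name are assumptions, and the index has to be rebuilt for new norms to take effect):

import org.apache.lucene.search.DefaultSimilarity;

public class ShortTitleSimilarity extends DefaultSimilarity {

  public float lengthNorm(String fieldName, int numTokens) {
    if ("title".equals(fieldName)) {
      // penalize extra terms more steeply than the default 1/sqrt(numTokens)
      return 1.0f / numTokens;
    }
    return super.lengthNorm(fieldName, numTokens);
  }
}

It needs to be set on both sides: writer.setSimilarity(new ShortTitleSimilarity()) before reindexing and searcher.setSimilarity(new ShortTitleSimilarity()) at query time.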
Re: Is it possible to use a custom similarity class to cause extra terms in a field to lower score?
Oops. I do indeed have omitNorms turned on. I will re-read the documentation on it and look at turning it off. Sorry for the bother. :/ On 5/17/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: : Terminator 2 : Terminator 2: Judgment Day : : And I score them against the query +title:(Terminator 2) : Would there be some method or combination of methods in Similarity : that I could easily override to allow me to penalize the second item : because it had "unused terms"? That's what the DefaultSimilarity does: it uses the (length)norm information stored when the documents are indexed to know which one is a better match (because it matches on a shorter field). If you aren't seeing that behavior then perhaps you turned on omitNorms for that field, or perhaps the byte encoding is making the distinction between your various field lengths too small -- overriding the lengthNorm function and reindexing might help. -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: How to ignore scoring for a Query?
Scoring cannot be turned off, currently. I once thought it was possible to skip scoring with the patch in the LUCENE-584 JIRA issue, but I was wrong. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Benjamin Pasero <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, May 17, 2007 8:49:15 AM Subject: How to ignore scoring for a Query? Hi, I have two different use-cases for my queries. For the first, performance is not too critical and I want to sort the results by relevance (score). The second however, is performance critical, but the score for each result is not interesting. I guess, if it was possible to disable scoring for the query, I could improve performance (note that omitNorms on a Field is not an option, due to the first use case). Is there a straightforward way to disable scoring for a query (its a BooleanQuery btw with some clauses, which can be any other query). Thanks, Ben - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Field.Store.Compress - does it improve performance of document reads?
- Original Message From: Paul Elschot <[EMAIL PROTECTED]> On Thursday 17 May 2007 08:10, Andreas Guther wrote: > I am currently exploring how to solve performance problems I encounter with > Lucene document reads. > > We have amongst other fields one field (default) storing all searchable > fields. This field can become of considerable size since we are indexing > documents and store the content for display within results. > > I noticed that the read can be very expensive. I wonder now if it would > make sense to add this field as Field.Store.Compress to the index. Can > someone tell me if this would speed up the document read or if this is > something only interesting for saving space. I have not tried the compression yet, but in my experience a good way to reduce the costs of document reads from a disk is by reading them in document number order whenever possible. In this way one saves on the disk head seeks. Compression should actually help reducing the costs of disk head seeks even more. OG: Does this really help in a multi-user environment where there are multiple parallel queries hitting the index and reading data from all over the index and the disk? They will all share the same disk head, so the head will still have to jump around to service all these requests, even if each request is being careful to read documents in docId order, no? Otis - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Memory leak (JVM 1.6 only)
Hi Steve, You said the OOM happens only when you are indexing. You don't need LuceneIndexAccess for that, so get rid of that to avoid one suspect that is not part of Lucene core. What is your maxBufferedDocs set to? And since you are using JVM 1.6, check out jmap, jconsole & friends, they'll provide insight into where your OOM is coming from. I see your app is a webapp. How do you know it's Lucene and its indexing that are the source of OOM and not something else, such as a bug in Tomcat? Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Stephen Gray <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, May 15, 2007 2:31:05 AM Subject: Memory leak (JVM 1.6 only) Hi everyone, I have an application that indexes/searches xml documents using Lucene. I'm having a problem with what looks like a memory leak, which occurs when indexing a large number of documents, but only when the application is running under JVM 1.6. Under JVM 1.5 there is no problem. What happens is that the memory allocated consistently rises during indexing until the JVM crashes with an OutOfMemory exception. I'm using Lucene 2.1, and am using Maik Schreiber's LuceneIndexAccess API, which hands out references to cached IndexWriter/Reader/Searchers to objects that need to use them, and handles closing and re-opening IndexSearchers after documents are added to the index. The application is running under Tomcat 6. I'm a bit out of my depth determining the source of the leak - I've tried using Netbeans profiler, which shows a large number of HashMap instances that survive a long time, but these are created by many different classes so it's difficult to pinpoint one source. Has anyone found similar problems with Lucene indexing operations running under JVM 1.6? Does anyone have any suggestions re how to deal with this? Any help much appreciated. Thanks, Steve - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: How can I limit the number of hits in my query?
Thank you, Erick, this is very useful! Have you ever taken a look at Google Suggest[1]? It's very fast, and the results are impressive. I think your suggestion will go a long way to fixing my problem, but there's probably still quite a gap between this approach and the kind of results that Google Suggest provides. I wonder how it could be possible to do the same with Lucene... Anyway, thanks a lot for the help! [1] http://www.google.com/webhp?complete=1&hl=en On Tue, 2007-05-15 at 12:46 -0400, Erick Erickson wrote: > OK, I'm going to go on the assumption that all you're interested in > is auto completion, so don't try to generalized this to queries.. > > Don't use queries, PrefixQuery or otherwise. Use one of the > TermEnums, probably WildcardTermEnum. What that will do is > allow you to find all the terms, in lexical order, that match > your fragment without using queries at all. This has several > advantages > > 1> it's fast. It doesn't require you to do anything but march down > some index list. > 2> it doesn't expand the terms prior to processing. No such thing > as "TooManyClauses". Perhaps OutOfMemory if you try to > return 100,000,000 terms > 3> you can stop whenever you've accumulated "enough" terms, > where "enough" is up to you. > > NOTE: there's also a RegexTermEnum in the > contrib section (last I knew, it may be in core in 2.1) that > allows arbitrary regex enumerations, but it's significantly > slower than WildcardTermEnum, which is hardly surprising > since it has to do more work... > > It's reasonable to ask how much use auto-completion is when the > poor user has, say, 10,000 terms to choose from, so I think it's > entirely reasonable to get the first, say, 100 terms and quit. You > should be able to do something like this quite easily with the Enums. > > I think your original solution of using queries would not be > satisfactory for the user anyway, *assuming* that the user is > as impatient as I am and wants some auto-complete options > RIGHT NOW , even if you solved the TooManyClauses > issue. > > Along the same lines, another question is whether you should > try to auto-complete when the user has typed less than, say, > 3 characters, but that's your design decision. > > Really, try the WildcardTermEnum. It's pretty neat. > > Hope this helps > Erick > > On 5/14/07, David Leangen <[EMAIL PROTECTED]> wrote: > > Thank you very much for this. Some more questions inline... > > > > > > - How can I limit the number of hits? I don't know > in > >advance what the data will be, so it's not > feasible for > >me to use RangeQuery. > > > > > > You can use a TopDocs or a HitCollector object which allows > you > > to process each object as it's hit. But I doubt you need to > do this. > > > No. I expect you're using a wildcard, and wildcard handling > is > > complicated. > > > Ok, you're right. It's not the limiting of the results that's > the > problem, it's the way the search is expanded. > > Since this is an autocomplete, when the user types, for > example "a" or a > Japanese character "あ", I am using PrefixFilter for this, so > I guess > the search turns into "a*" and "あ*" respectively. > > In the archive, the related posts I read either refer to a > DateRange > (where it is possible to search first by year, then month... > etc.), or > they suggest to increase the max count. > > Neither of these solutions work in my case... 
It's not a date, > and I > have no idea of the results in advance and it would not be > practical or > elegant to speculate on the results (for example first try > aa*~ab* and > see what that gives, etc.). > > I can get access to the "weight" values of the terms (a data > field > determined by their frequency of use), so I'll try something > related to > that. For people with more experience, would that be a good > path to > take? > > Otherwise, would a reasonable solution be to override or > re-implement > PrefixFilter? > > > Thank you so much! > David > > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
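For reference, a sketch of the term-enumeration approach Erick describes, using a plain prefix walk over the term dictionary (WildcardTermEnum works the same way for more general patterns). The field name and the cut-off of 100 suggestions are arbitrary choices:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class AutoComplete {

  public static List suggest(IndexReader reader, String field, String prefix) throws IOException {
    List suggestions = new ArrayList();
    TermEnum terms = reader.terms(new Term(field, prefix));   // positioned at the first term >= prefix
    try {
      do {
        Term t = terms.term();
        if (t == null || !t.field().equals(field) || !t.text().startsWith(prefix)) break;
        suggestions.add(t.text());
      } while (suggestions.size() < 100 && terms.next());     // stop once "enough" terms are collected
    } finally {
      terms.close();
    }
    return suggestions;
  }
}

No query is built and no terms are expanded, so there is no TooManyClauses to worry about and the suggestions come back almost immediately.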
Re: Field.Store.Compress - does it improve performance of document reads?
I found a similar recommendation about the disc access and reading in order in the following message and implemented this in my code: http://www.gossamer-threads.com/lists/lucene/general/28268#28268 Since I am dealing with multiple index directories I sorted the document references by index number and then by doc id. This actually improved the read access and reduced the read time to 50% and less. I think this is a very interesting performance improvement tip that should have its place in the FAQ. Thanks for the input. Andreas On 5/17/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: - Original Message From: Paul Elschot <[EMAIL PROTECTED]> On Thursday 17 May 2007 08:10, Andreas Guther wrote: > I am currently exploring how to solve performance problems I encounter with > Lucene document reads. > > We have amongst other fields one field (default) storing all searchable > fields. This field can become of considerable size since we are indexing > documents and store the content for display within results. > > I noticed that the read can be very expensive. I wonder now if it would > make sense to add this field as Field.Store.Compress to the index. Can > someone tell me if this would speed up the document read or if this is > something only interesting for saving space. I have not tried the compression yet, but in my experience a good way to reduce the costs of document reads from a disk is by reading them in document number order whenever possible. In this way one saves on the disk head seeks. Compression should actually help reducing the costs of disk head seeks even more. OG: Does this really help in a multi-user environment where there are multiple parallel queries hitting the index and reading data from all over the index and the disk? They will all share the same disk head, so the head will still have to jump around to service all these requests, even if each request is being careful to read documents in docId order, no? Otis - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Memory leak (JVM 1.6 only)
Hi Otis, Thanks very much for your reply. I've removed the LuceneIndexAccessor code, and still have the same problem, so that at least rules out LuceneIndexAccessor as the source. maxBufferedDocs is just set to the default, which I believe is 10. I've tried jconsole, + jmap/jhat for looking at the contents of the heap. One interesting thing is that although the memory allocated as reported by the processes tab of Windows Task Manager goes up and up, and the JVM eventually crashes with an OutOfMemory error, the total size of heap + non-heap as reported by jconsole is constant and much lower than the Windows-reported allocated memory. I've also tried Netbeans profiler, which suggests that the variables in the heap that are continually surviving garbage collection do not all originate from one class. I can't definitely rule out Tomcat. Clearly something is interacting with a change in JVM 1.6 and causing the problem. The fact that it only occurs during indexing not searching suggested that it might be related to the indexing code rather than Tomcat. It's much more likely that it's my code than Lucene, but I can't see anything in my code though I'm definitely no expert on memory leaks. All the variables created during indexing except IndexReader and Searcher instances are local to my addDocument function so should be garbage collected after each document is added. I did wonder if it might be related to SnowballAnalyzer as quite a few long lived variables in the heap were created by this - but then the heap is not increasing. Regards, Steve Otis Gospodnetic wrote: Hi Steve, You said the OOM happens only when you are indexing. You don't need LuceneIndexAccess for that, so get rid of that to avoid one suspect that is not part of Lucene core. What is your maxBufferedDocs set to? And since you are using JVM 1.6, check out jmap, jconsole & friends, they'll provide insight into where your OOM is coming from. I see your app is a webapp. How do you know it's Lucene and its indexing that are the source of OOM and not something else, such as a bug in Tomcat? Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Stephen Gray <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, May 15, 2007 2:31:05 AM Subject: Memory leak (JVM 1.6 only) Hi everyone, I have an application that indexes/searches xml documents using Lucene. I'm having a problem with what looks like a memory leak, which occurs when indexing a large number of documents, but only when the application is running under JVM 1.6. Under JVM 1.5 there is no problem. What happens is that the memory allocated consistently rises during indexing until the JVM crashes with an OutOfMemory exception. I'm using Lucene 2.1, and am using Maik Schreiber's LuceneIndexAccess API, which hands out references to cached IndexWriter/Reader/Searchers to objects that need to use them, and handles closing and re-opening IndexSearchers after documents are added to the index. The application is running under Tomcat 6. I'm a bit out of my depth determining the source of the leak - I've tried using Netbeans profiler, which shows a large number of HashMap instances that survive a long time, but these are created by many different classes so it's difficult to pinpoint one source. Has anyone found similar problems with Lucene indexing operations running under JVM 1.6? Does anyone have any suggestions re how to deal with this? Any help much appreciated. 
Thanks, Steve - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Stephen Gray Archive IT Officer Australian Social Science Data Archive 18 Balmain Crescent (Building #66) The Australian National University Canberra ACT 0200 Phone +61 2 6125 2185 Fax +61 2 6125 0627 Web http://assda.anu.edu.au/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: snowball (english) and filenames
On 16-May-07, at 11:00 PM, Doron Cohen wrote: If you enter a.b.c.d.e.f.g.h to that demo you'll see that the demo simply breaks the input text on '.' - that has nothing to do with filenames. That is not what I am seeing from my testing: a.b.c.d.e.f.g.h is not broken apart the way the snowball demo indicates it should be. At http://snowball.tartarus.org/demo.php "a.b.c.d.e.f.g.h" shows: a -> a b -> b c -> c d -> d e -> e f -> f g -> g h -> h For my Lucene testing, I indexed one text file with one "a.b.c.d.e.f.g.h" string in it and opened the index up using Luke. It only indexed the string a.b.c.d.e.f.g.h (and didn't parse the string based on the periods). As a real-world example, Logon.dll is being converted to "Logon.dl" rather than "Logon" and "dll" as indicated by the snowball demo. Also: Demo: some-msp.msp somemsp -> somemsp msp -> msp Lucene: some-msp.msp some msp.msp - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Memory leak (JVM 1.6 only)
Stephen Gray <[EMAIL PROTECTED]> wrote on 17/05/2007 22:40:01: > One interesting thing is that although the memory allocated as > reported by the processes tab of Windows Task Manager goes up and up, > and the JVM eventually crashes with an OutOfMemory error, the total size > of heap + non-heap as reported by jconsole is constant and much lower > than the Windows-reported allocated memory. I've also tried Netbeans > profiler, which suggests that the variables in the heap that are > continually surviving garbage collection do not all originate from one > class. Smells like a native memory leak? Can jconsole/jmap/jhat monitor native mem? I once spent some time on what finally turned out to be a GZipOutputStream native memory usage/leak. Moving from Java 1.5 to 1.6 could expose such a problem... - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Memory leak (JVM 1.6 only)
Thanks. If the extra memory allocated is native memory I don't think jconsole includes it in "non-heap" as it doesn't show this as increasing, and jmap/jhat just dump/analyse the heap. Do you know of an application that can report native memory usage? Thanks, Steve Doron Cohen wrote: Stephen Gray <[EMAIL PROTECTED]> wrote on 17/05/2007 22:40:01: One interesting thing is that although the memory allocated as reported by the processes tab of Windows Task Manager goes up and up, and the JVM eventually crashes with an OutOfMemory error, the total size of heap + non-heap as reported by jconsole is constant and much lower than the Windows-reported allocated memory. I've also tried Netbeans profiler, which suggests that the variables in the heap that are continually surviving garbage collection do not all originate from one class. Smells like native memory leak? Can jconsole/jmap/jhat monitor native mem? I once spent some time on what finally was a GZipOutputStream native mem usage/leak. Moving from Java 1.5 to 1.6 could expose such problem... - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Stephen Gray Archive IT Officer Australian Social Science Data Archive 18 Balmain Crescent (Building #66) The Australian National University Canberra ACT 0200 Phone +61 2 6125 2185 Fax +61 2 6125 0627 Web http://assda.anu.edu.au/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: snowball (english) and filenames
> a.b.c.d.e.f.g.h is not broken apart the way the snowball demo > indicates it should be. I am not sure about the "should" here - the way I see it, this is just how the demo works: Snowball stemmers operate on words, so the demo first breaks the input text into words and only then applies stemming. > For my Lucene testing, I indexed one text file with one > "a.b.c.d.e.f.g.h" string in it and opened the index up using Luke. > It only indexed the string a.b.c.d.e.f.g.h (and didn't parse the > string based on the periods). In Lucene the way text is "broken" into words is up to the application - and depends on the analyzer being used. WhitespaceAnalyzer would break on whitespace. StandardAnalyzer would do more sophisticated work. Analyzers are extensible, so you could modify their behavior. The wiki page "AnalysisParalysis" has some relevant info. Using Lucene's SimpleAnalyzer, btw, would break "a.b.c" into "a b c", which seems to be what you are looking for? HTH, Doron - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
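A small way to check this before picking an analyzer is to print the tokens each one produces (a throwaway sketch; Token.termText() is the 2.1-era accessor):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ShowTokens {

  public static void print(Analyzer analyzer, String text) throws Exception {
    TokenStream stream = analyzer.tokenStream("f", new StringReader(text));
    for (Token token = stream.next(); token != null; token = stream.next()) {
      System.out.print("[" + token.termText() + "] ");
    }
    System.out.println(" <- " + analyzer.getClass().getName());
  }

  public static void main(String[] args) throws Exception {
    String text = "Logon.dll a.b.c.d";
    print(new WhitespaceAnalyzer(), text);   // keeps "Logon.dll" as a single token
    print(new SimpleAnalyzer(), text);       // splits on non-letters: logon dll a b c d
    print(new StandardAnalyzer(), text);     // applies its own rules for host-like tokens
  }
}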