Re: Index keeps growing, then shrinks on restart
On Tue, Nov 11, 2014 at 4:26 AM, Ian Lea wrote:

> Telling us the version of lucene and the OS you're running on is
> always a good idea.

Oops, yes. Lucene 4.10.0, Linux.

> A guess here is that you aren't closing index readers, so the JVM will
> be holding on to deleted files until it exits.

That's probably it. I found a code path where I had apparently assumed the reader's `close()` would be called by GC/finalize.

Rob
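To illustrate the point about closing readers: Java gives no promise that finalizers run promptly, or at all, so anything that holds files open needs an explicit `close()`. The sketch below is not Lucene code. `TrackedReader` is a hypothetical stand-in for an index reader, and it only shows that try-with-resources closes deterministically while a forgotten reader stays open. Lucene's `IndexReader` implements `Closeable`, so the same try-with-resources pattern should apply to a real reader.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an index reader; NOT a Lucene class.
final class TrackedReader implements AutoCloseable {
    // Number of readers opened but not yet closed.
    static final AtomicInteger OPEN = new AtomicInteger();

    TrackedReader() {
        OPEN.incrementAndGet();
    }

    // Dummy method so the reader has something to do.
    int numDocs() {
        return 42;
    }

    @Override
    public void close() {
        OPEN.decrementAndGet();
    }
}

final class CloseSketch {
    // The reader is closed when the try block exits, even on an exception.
    static int searchOnce() {
        try (TrackedReader reader = new TrackedReader()) {
            return reader.numDocs();
        }
    }

    // The reader is never closed; nothing guarantees the GC will do it.
    static int leakySearch() {
        TrackedReader reader = new TrackedReader();
        return reader.numDocs();
    }

    public static void main(String[] args) {
        searchOnce();
        System.out.println("open after try-with-resources: " + TrackedReader.OPEN.get());
        leakySearch();
        System.out.println("open after leaky path: " + TrackedReader.OPEN.get());
    }
}
```

In a long-running process the leaky path is the one that keeps deleted index files pinned on disk until the JVM exits.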
Index keeps growing, then shrinks on restart
Hi,

I have an index that's about 700 MB, and it grows over days until it causes problems with disk space, at about 5 GB. If the JVM process ends, the index shrinks back to about 700 MB. I'm calling IndexWriter.commit() all the time. What else do you call to get it to compact its use of space?

thank you,
Rob
phrase query, stop words, and highlighting?
Hi,

I just noticed that a search like "rooms to go" is failing to highlight. (I use FastVectorHighlighter.) I know it's caused by the stop word ("to"). Is there a recommended way to fix this? I may just re-index without stop words and see if that causes any problems.

thanks,
Rob
Re: indexing all suffixes to support leading wildcard?
Doh. Never mind, I see it. I was searching with the same analyzer that I used to index. Usually that's right, but in this case, no.

Rob

On Fri, Aug 29, 2014 at 10:59 AM, Rob Nikander wrote:

> Thanks. That got the search working. Do you know if there's a trick for
> using FastVectorHighlighter with ngrams? I followed that doc's advice to
> use NGramTokenizer, and right now if the search matches "1234" it will only
> highlight "123".
>
> Rob
>
> On Fri, Aug 29, 2014 at 12:18 AM, Jack Krupansky wrote:
>
>> Use the ngram token filter, and the a query of 512 would match by itself:
>> http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Erick Erickson
>> Sent: Thursday, August 28, 2014 11:52 PM
>> To: java-user
>> Subject: Re: indexing all suffixes to support leading wildcard?
>>
>> The "usual" approach is to index to a second field but backwards.
>> See ReverseStringFilter... Then all your leading wildcards
>> are really trailing wildcards in the reversed field.
>>
>> Best,
>> Erick
>>
>> On Thu, Aug 28, 2014 at 10:38 AM, Rob Nikander wrote:
>>
>>> Hi,
>>>
>>> I've got some short fields (phone num, email) that I'd like to search
>>> using good old string matching. (The full query is a boolean "or" that
>>> also uses real text fields.) I see the warnings about wildcard queries
>>> that start with *, and I'm wondering... do you think it would be a good
>>> idea to index all the suffixes? Eg, a phone num 5551234, would become 7
>>> values for the "phoneNum" field: 4, 34, 234, etc. So "512*" would be a
>>> hit.
>>>
>>> And maybe do something with the boosts so it doesn't overvalue the match
>>> when it hits multiple values. ?
>>>
>>> Rob
>>
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: indexing all suffixes to support leading wildcard?
Thanks. That got the search working. Do you know if there's a trick for using FastVectorHighlighter with ngrams? I followed that doc's advice to use NGramTokenizer, and right now if the search matches "1234" it will only highlight "123".

Rob

On Fri, Aug 29, 2014 at 12:18 AM, Jack Krupansky wrote:

> Use the ngram token filter, and the a query of 512 would match by itself:
> http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
>
> -- Jack Krupansky
>
> -----Original Message----- From: Erick Erickson
> Sent: Thursday, August 28, 2014 11:52 PM
> To: java-user
> Subject: Re: indexing all suffixes to support leading wildcard?
>
> The "usual" approach is to index to a second field but backwards.
> See ReverseStringFilter... Then all your leading wildcards
> are really trailing wildcards in the reversed field.
>
> Best,
> Erick
>
> On Thu, Aug 28, 2014 at 10:38 AM, Rob Nikander wrote:
>
>> Hi,
>>
>> I've got some short fields (phone num, email) that I'd like to search
>> using good old string matching. (The full query is a boolean "or" that
>> also uses real text fields.) I see the warnings about wildcard queries
>> that start with *, and I'm wondering... do you think it would be a good
>> idea to index all the suffixes? Eg, a phone num 5551234, would become 7
>> values for the "phoneNum" field: 4, 34, 234, etc. So "512*" would be a
>> hit.
>>
>> And maybe do something with the boosts so it doesn't overvalue the match
>> when it hits multiple values. ?
>>
>> Rob
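Erick's reversed-field idea from the quoted thread can be modelled in plain Java. This is only a sketch of the string logic, not Lucene code: `ReversedFieldSketch` and its methods are hypothetical names, and a real index would use `ReverseStringFilter` on a second field instead of reversing strings at query time. It shows why a leading wildcard such as `*1234` turns into a cheap prefix match (`4321*`) once both sides are reversed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the "second field, but backwards" approach.
final class ReversedFieldSketch {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Finds values that end with `suffix` by prefix-matching reversed values.
    static List<String> endsWith(List<String> values, String suffix) {
        String reversedPrefix = reverse(suffix);
        List<String> hits = new ArrayList<>();
        for (String value : values) {
            // This is what the reversed field would hold for the value.
            String reversedValue = reverse(value);
            if (reversedValue.startsWith(reversedPrefix)) {
                hits.add(value);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> phones = Arrays.asList("5551234", "5559999");
        System.out.println(endsWith(phones, "1234"));
    }
}
```

Note that this covers only leading wildcards. A match in the middle of the value, like the "512" example, still needs suffixes or ngrams.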
Re: Can I update one field in doc?
I used the "Luke" tool to look at my documents. It shows that the positions and offsets in the term vectors get wiped out, in all fields. I'm using Lucene 4.8. I guess I'll just rebuild the entire doc.

Rob

On Thu, Aug 28, 2014 at 5:33 PM, Rob Nikander wrote:

> I tried something like this, to loop through all docs in my index and
> patch a field. But it appears to wipe out some parts of the stored values
> in the document. For example, highlighting stopped working.
>
> [ scala code ]
>     val q = new MatchAllDocsQuery()
>     val topDocs = searcher.search(q, 100)
>     val field = new StringField(FieldNames.phone, "", Field.Store.YES)
>
>     for (sdoc <- topDocs.scoreDocs) {
>       val doc = searcher.doc(sdoc.doc)
>       val id = doc.get(FieldNames.id)
>       var phone = doc.get(FieldNames.phone)
>       phone = phone + " changed"
>       doc.removeField(FieldNames.phone)
>       field.setStringValue(phone)
>       doc.add(field)
>       writer.updateDocument(new Term(FieldNames.id, id), doc)
>     }
>
> Should it work? The documents have many fields and it takes 35 minutes to
> rebuild the index from scratch. I'd like to be able to run smaller "patch"
> tasks like this.
>
> Rob
Can I update one field in doc?
I tried something like this, to loop through all docs in my index and patch a field. But it appears to wipe out some parts of the stored values in the document. For example, highlighting stopped working.

[ scala code ]
    import org.apache.lucene.document.{Field, StringField}
    import org.apache.lucene.index.Term
    import org.apache.lucene.search.MatchAllDocsQuery

    // searcher, writer and FieldNames are defined elsewhere
    val q = new MatchAllDocsQuery()
    val topDocs = searcher.search(q, 100)  // top 100 hits only
    // one Field instance, reused for every document
    val field = new StringField(FieldNames.phone, "", Field.Store.YES)

    for (sdoc <- topDocs.scoreDocs) {
      val doc = searcher.doc(sdoc.doc)  // loads stored fields only
      val id = doc.get(FieldNames.id)
      var phone = doc.get(FieldNames.phone)
      phone = phone + " changed"
      doc.removeField(FieldNames.phone)
      field.setStringValue(phone)
      doc.add(field)
      writer.updateDocument(new Term(FieldNames.id, id), doc)
    }

Should it work? The documents have many fields and it takes 35 minutes to rebuild the index from scratch. I'd like to be able to run smaller "patch" tasks like this.

Rob
indexing all suffixes to support leading wildcard?
Hi,

I've got some short fields (phone num, email) that I'd like to search using good old string matching. (The full query is a boolean "or" that also uses real text fields.) I see the warnings about wildcard queries that start with *, and I'm wondering... do you think it would be a good idea to index all the suffixes? E.g., a phone num 5551234 would become 7 values for the "phoneNum" field: 4, 34, 234, etc. So "512*" would be a hit.

And maybe do something with the boosts so it doesn't overvalue the match when it hits multiple values?

Rob
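The suffix idea in this message can be checked with a few lines of plain Java. This is a sketch of the string logic only, with hypothetical names (`SuffixSketch`, `suffixes`, `prefixMatches`). It does not touch Lucene and says nothing about index size or scoring.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of "index every suffix, then use a trailing wildcard".
final class SuffixSketch {
    // All suffixes of the value, shortest first: 4, 34, 234, ...
    static List<String> suffixes(String value) {
        List<String> out = new ArrayList<>();
        for (int start = value.length() - 1; start >= 0; start--) {
            out.add(value.substring(start));
        }
        return out;
    }

    // A trailing-wildcard query "prefix*" hits if any indexed term starts with prefix.
    static boolean prefixMatches(List<String> terms, String prefix) {
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> terms = suffixes("5551234");
        System.out.println(terms);
        System.out.println(prefixMatches(terms, "512"));
    }
}
```

For 5551234 the query "512*" hits because the suffix 51234 starts with 512. The cost is one term per character of the value, which is tolerable for short fields like phone numbers.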
Re: How to not span fields with phrase query?
Thank you for the explanation. I subclassed Analyzer and overrode `getPositionIncrementGap` for this field. It appears to have worked.

Rob

On Thu, Aug 28, 2014 at 10:21 AM, Michael Sokolov <msoko...@safaribooksonline.com> wrote:

> Usually that's referred to as multiple "values" for the same field; in the
> index there is no distinction between title:C and title:X as far as which
> field they are in -- they're in the same field.
>
> If you want to prevent phrase queries from matching B C X, insert a
> position gap between C and X; so A B C would be positions 0, 1, 2, and X,
> Y, Z might be 4, 5, 6 instead of 3, 4, 5, which is probably what you have
> now
>
> -Mike
>
> On 08/28/2014 09:53 AM, Rob Nikander wrote:
>
>> Hi,
>>
>> If I have document with multiple fields "title"
>>
>> title: A B C
>> title: X Y Z
>>
>> A phrase search for title:"B C X" matches this document. Can I prevent
>> that?
>>
>> thanks,
>> Rob
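Mike's position arithmetic can be modelled without Lucene. The sketch below uses hypothetical names (`PositionGapSketch`, `positions`, `phraseMatches`) and assumes every token is unique, which holds for the A B C / X Y Z example but not for real text. It assigns positions to the tokens of a multi-valued field with a configurable gap between values, then checks an exact phrase by requiring consecutive positions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of token positions in a multi-valued field.
final class PositionGapSketch {
    // The two "title" values from the question.
    static List<String[]> titleValues() {
        return Arrays.asList(new String[] {"A", "B", "C"}, new String[] {"X", "Y", "Z"});
    }

    // Assigns a position to each token; `gap` extra positions are skipped between values.
    // Assumes tokens are unique so one position per token is enough.
    static Map<String, Integer> positions(List<String[]> values, int gap) {
        Map<String, Integer> pos = new HashMap<>();
        int next = 0;
        for (String[] tokens : values) {
            for (String token : tokens) {
                pos.put(token, next++);
            }
            next += gap;
        }
        return pos;
    }

    // Exact phrase: every term must sit exactly one position after the previous one.
    static boolean phraseMatches(Map<String, Integer> pos, String... phrase) {
        if (phrase.length == 0 || !pos.containsKey(phrase[0])) {
            return false;
        }
        for (int i = 1; i < phrase.length; i++) {
            Integer prev = pos.get(phrase[i - 1]);
            Integer cur = pos.get(phrase[i]);
            if (cur == null || cur - prev != 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Integer> noGap = positions(titleValues(), 0);
        Map<String, Integer> withGap = positions(titleValues(), 1);
        System.out.println("gap 0, \"B C X\": " + phraseMatches(noGap, "B", "C", "X"));
        System.out.println("gap 1, \"B C X\": " + phraseMatches(withGap, "B", "C", "X"));
        System.out.println("gap 1, \"X Y Z\": " + phraseMatches(withGap, "X", "Y", "Z"));
    }
}
```

With a gap of 0 the phrase "B C X" matches across the two values; with a gap of 1 it no longer does, while "X Y Z" still matches. A phrase query with slop can bridge a small gap, so a larger value is a common choice when sloppy phrases are in use.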
How to not span fields with phrase query?
Hi,

If I have a document with multiple "title" fields:

title: A B C
title: X Y Z

A phrase search for title:"B C X" matches this document. Can I prevent that?

thanks,
Rob
Consistent colors from highlighter, multiple fields?
Hi,

I'm using FastVectorHighlighter, and I wanted to get highlights from every field that matched, so I called `highlighter.getBestFragment` for each field. It returns null if it has nothing for that field. The problem is that the colors don't match across fields, so it looks confusing. For example, I search for "apple orange", and my HTML shows the term "apple" in green in field1 but in red in field2. Any recommendations on how to fix this?

thanks,
Rob
Re: stemming irregular plurals?
Ah, yes, that does it. Thank you both.

Rob

On Jul 29, 2014, at 10:30 AM, Alexandre Patry wrote:

> On 29/07/2014 10:28, Rob Nikander wrote:
>> Mmm. I don’t see a way to construct one, except passing an FST, which isn’t
>> exactly a map. I look at the FST javadoc; it’s a rabbit hole.
>
> You probably want to look at
> http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilter.Builder.html
>
> Hope this help,
>
> Alexandre
>
>> Rob
>>
>> On Jul 29, 2014, at 10:14 AM, Robert Muir wrote:
>>
>>> You can put this thing before your stemmer, with a custom map of exceptions:
>>>
>>> http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilter.html
>>>
>>> On Tue, Jul 29, 2014 at 10:03 AM, Robert Nikander wrote:
>>>> Hi,
>>>>
>>>> I created an Analyzer with a PorterStemFilter, and I’m searching some test
>>>> documents. Normal plurals work; searching for “zebra” finds text with
>>>> “zebras”. But searching for “goose” doesn’t find “geese”. Is that
>>>> expected? Does it give up on irregular English? Is there a way to make
>>>> that work, or a reason that it can’t?
>>>>
>>>> thanks,
>>>> Rob
>
> --
> Alexandre Patry, Ph.D
> Chercheur Principal / Principal Researcher
> http://KeaText.com
Re: stemming irregular plurals?
Mmm. I don’t see a way to construct one, except passing an FST, which isn’t exactly a map. I looked at the FST javadoc; it’s a rabbit hole.

Rob

On Jul 29, 2014, at 10:14 AM, Robert Muir wrote:

> You can put this thing before your stemmer, with a custom map of exceptions:
>
> http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilter.html
>
> On Tue, Jul 29, 2014 at 10:03 AM, Robert Nikander wrote:
>> Hi,
>>
>> I created an Analyzer with a PorterStemFilter, and I’m searching some test
>> documents. Normal plurals work; searching for “zebra” finds text with
>> “zebras”. But searching for “goose” doesn’t find “geese”. Is that expected?
>> Does it give up on irregular English? Is there a way to make that work, or
>> a reason that it can’t?
>>
>> thanks,
>> Rob
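The "map of exceptions before the stemmer" idea from this thread can be sketched in plain Java. Nothing here is Lucene code: `StemOverrideSketch` is a hypothetical name, the override entries are made up for illustration, and `naiveStem` is a deliberately crude stand-in that strips one trailing "s" and is not the Porter algorithm. It only shows the order of operations: consult the exception map first and fall back to the rule-based stemmer otherwise.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of a stemmer override: exceptions win over the rule-based stemmer.
final class StemOverrideSketch {
    // Illustrative exception map for irregular plurals (entries are assumptions).
    static final Map<String, String> OVERRIDES = new HashMap<>();
    static {
        OVERRIDES.put("geese", "goose");
        OVERRIDES.put("mice", "mouse");
    }

    // Crude stand-in for a real stemmer: strips a single trailing "s".
    static String naiveStem(String term) {
        if (term.length() > 2 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Looks the term up in the overrides first, then falls back to the stemmer.
    static String normalize(String term) {
        String lower = term.toLowerCase(Locale.ROOT);
        String override = OVERRIDES.get(lower);
        return override != null ? override : naiveStem(lower);
    }

    public static void main(String[] args) {
        System.out.println(normalize("zebras"));
        System.out.println(normalize("geese"));
        System.out.println(normalize("goose"));
    }
}
```

Because "geese" and "goose" both normalize to the same term, a search for one finds the other, as long as the same override map is applied at index time and at query time.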