Re: Index keeps growing, then shrinks on restart

2014-11-11 Thread Rob Nikander
On Tue, Nov 11, 2014 at 4:26 AM, Ian Lea  wrote:

> Telling us the version of lucene and the OS you're running on is
> always a good idea.
>

Oops, yes.  Lucene 4.10.0, Linux.


> A guess here is that you aren't closing index readers, so the JVM will
> be holding on to deleted files until it exits.
>

That's probably it. I found a code path where it seems I thought the
reader's `close()` would be called by GC/finalize.

Rob
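
For readers hitting the same symptom: the fix discussed above is to close every reader explicitly instead of relying on GC/finalize. A minimal sketch of one way to do that in Scala is the loan pattern below. `FakeReader` and `withResource` are hypothetical names used only for illustration; `FakeReader` stands in for a real reader, which (like Lucene's `IndexReader`) implements `java.io.Closeable`.

```scala
// Hypothetical stand-in for a reader that holds open file handles.
final class FakeReader extends java.io.Closeable {
  private var open = true
  def isOpen: Boolean = open
  def numDocs: Int = {
    require(open, "reader is closed")
    42 // made-up value, for illustration only
  }
  override def close(): Unit = { open = false }
}

object ReaderLoan {
  // Runs `body` with the resource and always closes it afterwards,
  // even when `body` throws.
  def withResource[R <: java.io.Closeable, A](resource: R)(body: R => A): A =
    try body(resource)
    finally resource.close()
}
```

Used as `ReaderLoan.withResource(new FakeReader)(_.numDocs)`, the reader is closed on every code path, so no path is left waiting for a finalizer.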


Index keeps growing, then shrinks on restart

2014-11-10 Thread Rob Nikander
Hi,

I have an index that's about 700 MB, and it grows over days until it
causes problems with disk size, at about 5 GB.  If the JVM process ends, the
index shrinks back to about 700 MB.  I'm calling IndexWriter.commit() all the
time.  What else do you call to get it to compact its use of space?

thank you,
Rob


phrase query, stop words, and highlighting?

2014-09-22 Thread Rob Nikander
Hi,

I just noticed that a search like "rooms to go" is failing to highlight. (I
use FastVectorHighlighter). I know it's caused by the stop word (to). Is there
a recommended way to fix this?  I may just re-index without stop words, and
see if that causes any problems.

thanks,
Rob


Re: indexing all suffixes to support leading wildcard?

2014-08-29 Thread Rob Nikander
Doh. Never mind, I see it. I was searching with the same analyzer that I used to
index. Usually that's right, but in this case, no.

Rob
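
The analyzer mismatch described above can be illustrated in plain Scala, without Lucene. This is only a sketch: `NGramSketch`, `ngrams`, and the gram sizes 1 to 4 are assumptions for illustration, not the NGramTokenFilter implementation or its defaults. The index side holds every n-gram of a value, while the query is looked up as one whole term instead of being split into n-grams as well.

```scala
object NGramSketch {
  // All substrings of `text` whose length is between minGram and maxGram.
  def ngrams(text: String, minGram: Int, maxGram: Int): Set[String] =
    (for {
      n <- minGram to maxGram
      // sliding(n) can yield one short window when text is shorter than n,
      // so keep only full-length grams.
      gram <- text.sliding(n) if gram.length == n
    } yield gram).toSet
}
```

With the phone number 5551234 indexed as `NGramSketch.ngrams("5551234", 1, 4)`, the whole term `512` is among the indexed grams, so a query for it matches. If the query `512` were itself split into n-grams, its single-character gram `2` would also hit an unrelated value such as 2000.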


On Fri, Aug 29, 2014 at 10:59 AM, Rob Nikander 
wrote:

> Thanks. That got the search working. Do you know if there's a trick for
> using FastVectorHighlighter with ngrams?  I followed that doc's advice to
> use NGramTokenizer, and right now if the search matches "1234" it will only
> highlight "123".
>
> Rob
>
>
>
> On Fri, Aug 29, 2014 at 12:18 AM, Jack Krupansky 
> wrote:
>
>> Use the ngram token filter, and then a query of 512 would match by itself:
>> http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Erick Erickson
>> Sent: Thursday, August 28, 2014 11:52 PM
>> To: java-user
>> Subject: Re: indexing all suffixes to support leading wildcard?
>>
>>
>> The "usual" approach is to index to a second field but backwards.
>> See ReverseStringFilter... Then all your leading wildcards
>> are really trailing wildcards in the reversed field.
>>
>> Best,
>> Erick
>>
>>
>> On Thu, Aug 28, 2014 at 10:38 AM, Rob Nikander 
>> wrote:
>>
>>  Hi,
>>>
>>> I've got some short fields (phone num, email) that I'd like to search
>>> using
>>> good old string matching.  (The full query is a boolean "or" that also
>>> uses
>>> real text fields.) I see the warnings about wildcard queries that start
>>> with *, and I'm wondering... do you think it would be a good idea to
>>> index
>>> all the suffixes?  Eg, a phone num 5551234, would become 7 values for the
>>> "phoneNum" field: 4, 34, 234, etc.  So "512*" would be a hit.
>>>
>>> And maybe do something with the boosts so it doesn't overvalue the match
>>> when it hits multiple values.  ?
>>>
>>> Rob
>>>
>>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>


Re: indexing all suffixes to support leading wildcard?

2014-08-29 Thread Rob Nikander
Thanks. That got the search working. Do you know if there's a trick for
using FastVectorHighlighter with ngrams?  I followed that doc's advice to
use NGramTokenizer, and right now if the search matches "1234" it will only
highlight "123".

Rob


On Fri, Aug 29, 2014 at 12:18 AM, Jack Krupansky 
wrote:

> Use the ngram token filter, and then a query of 512 would match by itself:
> http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html
>
> -- Jack Krupansky
>
> -Original Message- From: Erick Erickson
> Sent: Thursday, August 28, 2014 11:52 PM
> To: java-user
> Subject: Re: indexing all suffixes to support leading wildcard?
>
>
> The "usual" approach is to index to a second field but backwards.
> See ReverseStringFilter... Then all your leading wildcards
> are really trailing wildcards in the reversed field.
>
> Best,
> Erick
>
>
> On Thu, Aug 28, 2014 at 10:38 AM, Rob Nikander 
> wrote:
>
>  Hi,
>>
>> I've got some short fields (phone num, email) that I'd like to search
>> using
>> good old string matching.  (The full query is a boolean "or" that also
>> uses
>> real text fields.) I see the warnings about wildcard queries that start
>> with *, and I'm wondering... do you think it would be a good idea to index
>> all the suffixes?  Eg, a phone num 5551234, would become 7 values for the
>> "phoneNum" field: 4, 34, 234, etc.  So "512*" would be a hit.
>>
>> And maybe do something with the boosts so it doesn't overvalue the match
>> when it hits multiple values.  ?
>>
>> Rob
>>
>>
>


Re: Can I update one field in doc?

2014-08-28 Thread Rob Nikander
I used the "Luke" tool to look at my documents. It shows that the positions
and offsets in the term vectors get wiped out, in all fields.  I'm using
Lucene 4.8.  I guess I'll just rebuild the entire doc.

Rob


On Thu, Aug 28, 2014 at 5:33 PM, Rob Nikander 
wrote:

> I tried something like this, to loop through all docs in my index and
> patch a field.  But it appears to wipe out some parts of the stored values
> in the document. For example, highlighting stopped working.
>
> [ scala code ]
> val q = new MatchAllDocsQuery()
> val topDocs = searcher.search(q, 100)
> val field = new StringField(FieldNames.phone, "", Field.Store.YES)
>
> for (sdoc <- topDocs.scoreDocs) {
>    val doc = searcher.doc(sdoc.doc)
>    val id = doc.get(FieldNames.id)
>    var phone = doc.get(FieldNames.phone)
>    phone = phone + " changed"
>    doc.removeField(FieldNames.phone)
>    field.setStringValue(phone)
>    doc.add(field)
>    writer.updateDocument(new Term(FieldNames.id, id), doc)
> }
>
> Should it work?  The documents have many fields and it takes 35 minutes to
> rebuild the index from scratch. I'd like to be able to run smaller "patch"
> tasks like this.
>
> Rob
>


Can I update one field in doc?

2014-08-28 Thread Rob Nikander
I tried something like this, to loop through all docs in my index and patch
a field.  But it appears to wipe out some parts of the stored values in the
document. For example, highlighting stopped working.

[ scala code ]
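import org.apache.lucene.document.{Field, StringField}
import org.apache.lucene.index.Term
import org.apache.lucene.search.MatchAllDocsQuery

// `searcher`, `writer` and `FieldNames` are defined elsewhere in my code.
val q = new MatchAllDocsQuery()
// Only the top 100 hits are returned here.
val topDocs = searcher.search(q, 100)
// One field instance, reused for every document.
val field = new StringField(FieldNames.phone, "", Field.Store.YES)

for (sdoc <- topDocs.scoreDocs) {
   // Load the stored fields of this hit.
   val doc = searcher.doc(sdoc.doc)
   val id = doc.get(FieldNames.id)
   var phone = doc.get(FieldNames.phone)
   phone = phone + " changed"
   // Swap the old phone field for the patched value.
   doc.removeField(FieldNames.phone)
   field.setStringValue(phone)
   doc.add(field)
   // Replace the document that has this id.
   writer.updateDocument(new Term(FieldNames.id, id), doc)
}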

Should it work?  The documents have many fields and it takes 35 minutes to
rebuild the index from scratch. I'd like to be able to run smaller "patch"
tasks like this.

Rob


indexing all suffixes to support leading wildcard?

2014-08-28 Thread Rob Nikander
Hi,

I've got some short fields (phone num, email) that I'd like to search using
good old string matching.  (The full query is a boolean "or" that also uses
real text fields.) I see the warnings about wildcard queries that start
with *, and I'm wondering... do you think it would be a good idea to index
all the suffixes?  Eg, a phone num 5551234, would become 7 values for the
"phoneNum" field: 4, 34, 234, etc.  So "512*" would be a hit.

And maybe do something with the boosts so it doesn't overvalue the match
when it hits multiple values.  ?

Rob
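
The suffix idea above can be sketched in a few lines of plain Scala. `SuffixSketch`, `suffixes`, and `matchesPrefix` are hypothetical names for illustration; this is not Lucene code, only a model of what would be indexed and how a trailing-wildcard query such as `512*` relates to it.

```scala
object SuffixSketch {
  // Every non-empty suffix of `value`, shortest first.
  def suffixes(value: String): Seq[String] =
    (1 to value.length).map(n => value.takeRight(n))

  // A trailing-wildcard query such as "512*" is a prefix test on each term.
  def matchesPrefix(terms: Seq[String], prefix: String): Boolean =
    terms.exists(_.startsWith(prefix))
}
```

For 5551234 this yields the 7 values 4, 34, 234, 1234, 51234, 551234, and 5551234, and `512*` matches through the suffix 51234.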


Re: How to not span fields with phrase query?

2014-08-28 Thread Rob Nikander
Thank you for the explanation. I subclassed Analyzer and overrode
`getPositionIncrementGap` for this field.  It appears to have worked.

Rob
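
The effect of the position gap explained in the quoted reply below can be modelled in plain Scala. This is a sketch under assumptions: `PositionGapSketch`, `positions`, and `phraseMatches` are hypothetical names, and the matching is a simplified exact-phrase check, not Lucene's PhraseQuery.

```scala
object PositionGapSketch {
  // Assigns a position to each token of each value of a multi-valued field.
  // `gap` extra positions are skipped between consecutive values.
  def positions(values: Seq[Seq[String]], gap: Int): Seq[(String, Int)] = {
    var next = 0
    values.flatMap { tokens =>
      val placed = tokens.zipWithIndex.map { case (t, i) => (t, next + i) }
      next += tokens.length + gap
      placed
    }
  }

  // Exact phrase match: the phrase tokens must sit at consecutive positions.
  def phraseMatches(indexed: Seq[(String, Int)], phrase: Seq[String]): Boolean = {
    val byPos = indexed.map { case (t, p) => p -> t }.toMap
    indexed.exists { case (_, start) =>
      phrase.zipWithIndex.forall { case (t, i) => byPos.get(start + i).contains(t) }
    }
  }
}
```

With a gap of 0 the two title values occupy positions 0 to 5 and the phrase "B C X" matches. With a gap of 1, X Y Z move to 4, 5, 6 and the phrase no longer matches, while "X Y Z" still does. A larger gap is often chosen so that sloppy phrase queries also stay within one value.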


On Thu, Aug 28, 2014 at 10:21 AM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> Usually that's referred to as multiple "values" for the same field; in the
> index there is no distinction between title:C and title:X as far as which
> field they are in -- they're in the same field.
>
> If you want to prevent phrase queries from matching B C X, insert a
> position gap between C and X; so A B C would be positions 0, 1, 2, and X,
> Y, Z might be 4, 5, 6 instead of 3, 4, 5, which is probably what you have
> now
>
> -Mike
>
>
> On 08/28/2014 09:53 AM, Rob Nikander wrote:
>
>> Hi,
>>
>> If I have a document with multiple "title" fields
>>
>>  title: A B C
>>  title: X Y Z
>>
>> A phrase search for title:"B C X" matches this document. Can I prevent
>> that?
>>
>> thanks,
>> Rob
>>
>>
>


How to not span fields with phrase query?

2014-08-28 Thread Rob Nikander
Hi,

If I have a document with multiple "title" fields

title: A B C
title: X Y Z

A phrase search for title:"B C X" matches this document. Can I prevent
that?

thanks,
Rob


Consistent colors from highlighter, multiple fields?

2014-08-25 Thread Rob Nikander
Hi,

I'm using FastVectorHighlighter, and I wanted to get highlights from
multiple fields that matched, so I called `highlighter.getBestFragment` for
each field. It returns null if it had nothing for that field.  The problem
is the colors don't match, so it looks confusing.  For example, I search
for "apple orange", and my HTML shows the term "apple" in green in the
field1, but red in the field2.

Any recommendations on how to fix this?

thanks,
Rob


Re: stemming irregular plurals?

2014-07-29 Thread Rob Nikander
Ah, yes, that does it.  Thank you both.

Rob
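
The "map of exceptions before the stemmer" idea from the quoted replies below can be sketched in plain Scala. Everything here is an assumption for illustration: `StemOverrideSketch`, `toyStem`, and `stemWithOverrides` are hypothetical names, and `toyStem` is a deliberately crude stand-in, not the Porter algorithm and not the StemmerOverrideFilter implementation.

```scala
object StemOverrideSketch {
  // Toy stand-in for a real stemmer: strips one trailing "s" from longer words.
  def toyStem(token: String): String =
    if (token.length > 3 && token.endsWith("s")) token.dropRight(1) else token

  // Exceptions are looked up first; only tokens without an override are stemmed.
  def stemWithOverrides(token: String, overrides: Map[String, String]): String =
    overrides.getOrElse(token, toyStem(token))
}
```

With `Map("geese" -> "goose")` as the exception map, "geese" and "goose" end up as the same term, while regular plurals such as "zebras" still go through the stemmer.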


On Jul 29, 2014, at 10:30 AM, Alexandre Patry  
wrote:

> 
> On 29/07/2014 10:28, Rob Nikander wrote:
>> Mmm. I don’t see a way to construct one, except passing an FST, which isn’t 
>> exactly a map. I looked at the FST javadoc; it’s a rabbit hole.
> You probably want to look at 
> http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilter.Builder.html
> 
> Hope this helps,
> 
> Alexandre
>> 
>> Rob
>> 
>> On Jul 29, 2014, at 10:14 AM, Robert Muir  wrote:
>> 
>>> You can put this thing before your stemmer, with a custom map of exceptions:
>>> 
>>> http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilter.html
>>> 
>>> On Tue, Jul 29, 2014 at 10:03 AM, Robert Nikander
>>>  wrote:
>>>> Hi,
>>>> 
>>>> I created an Analyzer with a PorterStemFilter, and I’m searching some test 
>>>> documents.  Normal plurals work; searching for “zebra” finds text with 
>>>> “zebras”. But searching for “goose” doesn’t find “geese”.  Is that 
>>>> expected?  Does it give up on irregular English?  Is there a way to make 
>>>> that work, or a reason that it can’t?
>>>> 
>>>> thanks,
>>>> Rob
>>>> 
>>>> 
>>>> 
> 
> 
> -- 
> Alexandre Patry, Ph.D
> Chercheur Principal / Principal Researcher
> http://KeaText.com
> 
> 



Re: stemming irregular plurals?

2014-07-29 Thread Rob Nikander
Mmm. I don’t see a way to construct one, except passing an FST, which isn’t 
exactly a map. I looked at the FST javadoc; it’s a rabbit hole.

Rob

On Jul 29, 2014, at 10:14 AM, Robert Muir  wrote:

> You can put this thing before your stemmer, with a custom map of exceptions:
> 
> http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilter.html
> 
> On Tue, Jul 29, 2014 at 10:03 AM, Robert Nikander
>  wrote:
>> Hi,
>> 
>> I created an Analyzer with a PorterStemFilter, and I’m searching some test 
>> documents.  Normal plurals work; searching for “zebra” finds text with 
>> “zebras”. But searching for “goose” doesn’t find “geese”.  Is that expected? 
>>  Does it give up on irregular English?  Is there a way to make that work, or 
>> a reason that it can’t?
>> 
>> thanks,
>> Rob
>> 
>> 
>> 