RE: PorterStemmer / Levenshtein Distance

2004-11-05 Thread Tate Avery
Yousef,

If you are interested in using the Levenshtein algorithm outside of Lucene, it is 
available in the Jakarta StringUtils class...
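
For reference, the algorithm itself is small enough to inline.  A minimal plain-Java sketch of the standard dynamic-programming formulation (this is not the Jakarta implementation, and the class name is mine):

```java
public class Levenshtein {
    // Classic DP: d[i][j] = edit distance between a[0..i) and b[0..j),
    // built up from insert, delete, and substitute costs of 1.
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // delete
                                            d[i][j - 1] + 1),  // insert
                                   d[i - 1][j - 1] + cost);    // substitute
            }
        }
        return d[a.length()][b.length()];
    }
}
```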



T

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, November 05, 2004 3:44 AM
To: Lucene Users List
Subject: Re: PorterStemmer / Levenshtein Distance


For the distance algorithm, check out FuzzyQuery and its dependent code.

Within Lucene's codebase is also a PorterStemFilter.  However, we'd 
like to deprecate this in favor of the Snowball stemming code that 
currently lives in the Lucene Sandbox repository.

Erik


On Nov 4, 2004, at 9:12 PM, Yousef Ourabi wrote:

> Hey,
> On the site it says Lucene uses the Levenshtein distance
> algorithm for fuzzy matching.  Where is this in the
> source code?  Also, I would like to use the Porter
> stemming algorithm for something else.  Are there any
> documents on the Lucene implementation of the Porter
> Stemmer?
>
> Best,
> Yousef
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]





WordListLoader's whereabouts

2004-09-27 Thread Tate Avery
Hello,

I am trying to compile the analyzers from the Lucene sandbox contributions.  Many of 
them seem to import org.apache.lucene.analysis.WordlistLoader which is not currently 
in my classpath.

Does anyone know where I can find this class?  It does not appear to be in Lucene 1.4, 
so I assume it is perhaps another contribution.  Any help in tracking it down 
would be appreciated.

Also, some of the analyzers appear to have their own copy of this class (i.e. 
org.apache.lucene.analysis.nl.WordlistLoader).  Could I just relocate that one to the 
shared package, perhaps?

Thanks,
Tate




RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Tate Avery

I get a NullPointerException shown (via Apache) when I try to access 
http://www.searchmorph.com/kat/spell.jsp


T

-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 3:23 PM
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene


Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>>
>> I can/should send the code out. The logic is that for any terms in a 
>> query that have zero matches, go thru all the terms(!) and calculate 
>> the Levenshtein string distance, and return the best matches. A more 
>> intelligent way of doing this is to instead look for terms that also 
>> match on the 1st "n" (prob 3) chars.
> 
> 
> ...or prepare in advance a fast lookup index - split all existing terms 
> to bi- or trigrams, create a separate lookup index, and then simply for 
> each term ask a phrase query (phrase = all n-grams from an input term), 
> with a slop > 0, to get similar existing terms. This should be fast, and 
> you could provide a "did you mean" function too...
> 

Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.

[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like "recursixe" 
or whatnot to see what suggestions it returns. Note this is not a normal 
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp

[3] Here's the javadoc:

http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html

[4] Here's source in HTML:

http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152

[5] A few more details:

Based on a subsequent mail in this thread I set boosts for the words in 
the ngram index. The background is each word (er..term for a given 
field) in the orig index is a separate Document in the ngram index. This 
Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
that more frequent words will tend to be suggested more often.

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher
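
The two ingredients described above can be sketched as follows (plain Java; the class and method names are mine, not from the attached NGramSpeller source):

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    // All substrings of length n, e.g. ngrams("recursive", 3)
    // yields rec, ecu, cur, urs, rsi, siv, ive.
    public static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Document boost so more frequent words are suggested more often:
    // log(word_freq) / log(num_docs), as described in the mail.
    public static float boost(int wordFreq, int numDocs) {
        return (float) (Math.log(wordFreq) / Math.log(numDocs));
    }
}
```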

[6]

If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
  Dave







RE: AnalyZer HELP Please

2004-08-18 Thread Tate Avery

Basically, Google uses its stop lists selectively.  To me, the 2 rules appear to be:

1) Do not use stop list for items in quotes (i.e. exact phrase)
2) Do not use stop list if the query is ONLY stop words

Furthermore, they DO let you do silly things like find out that there are approximately 
5,760,000,000 pages containing the word 'the' (see rule #2). :)

Anyway, that is how I interpreted my Google tests.  And, as an observation, one would 
need to be a bit creative to get the same behaviour with Lucene given the current 
analyzer setup (IMO).

In short, it is interesting that they do it that way, but it certainly doesn't mean 
that it is THE way it should be done.
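
Those two rules could be approximated in front of an analyzer along these lines (a hypothetical pre-filter of my own, not actual Google or Lucene code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SelectiveStopFilter {
    // Rule 1: leave quoted phrases alone.
    // Rule 2: if every term is a stop word, leave the query alone.
    // Otherwise, drop the stop words.
    public static List<String> filter(List<String> terms, boolean quotedPhrase,
                                      Set<String> stopWords) {
        if (quotedPhrase) return terms;                       // rule 1
        boolean allStops = true;
        for (String t : terms) {
            if (!stopWords.contains(t)) { allStops = false; break; }
        }
        if (allStops) return terms;                           // rule 2
        List<String> kept = new ArrayList<String>();
        for (String t : terms) {
            if (!stopWords.contains(t)) kept.add(t);
        }
        return kept;
    }
}
```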


T


-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 18, 2004 2:00 PM
To: Lucene Users List
Subject: Re: AnalyZer HELP Please


Thanks for doing the legwork.  My favorite example is "to be or not to 
be" with and without quotes.  The top hit without quotes is quite 
funny.

So, Google doesn't throw away stop words, but they do special query 
processing to keep you from doing silly things like "show me all 
documents with 'the' in them".  Look at Nutch for how it does something 
very similar.

Erik

On Aug 18, 2004, at 11:52 AM, Tate Avery wrote:

>
> That is interesting.
>
> I went to look up the cases for this (on Google).
> Here are my 4 queries and the results:
>
>
> a) of the from it
>
>   - 25,500,000 matches containing 'of' and 'the' and 'from' and 'it'
>   - i.e. stop list NOT used if query is only stopwords
>
> b) "of the from it"
>
>   - 49 results for exact phrase match 'of the from it'
>   - i.e. stop list NOT used (see next 2 for real phrase effect)
>
> c) of the from it test
>
>   - The following words are very common and were not included in your 
> search: of the from it.
>   - In short, 241,000,000 matching the word 'test'
>   - i.e. stop list used if there is a non-stopword in the query
>
> d) "of the from it test"
>   
>   - 0 matches for this exact phrase
>   - i.e. stoplist NOT used for any words in a phrase query
>
>
> Tate
>
> p.s.  Um... did you say that was a rhetorical question?  ;-)
>
>
> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 18, 2004 6:17 AM
> To: Lucene Users List
> Subject: Re: AnalyZer HELP Please
>
>
>
>
> On Aug 18, 2004, at 3:41 AM, Karthik N S wrote:
>> Hi Guys
>>
>>   Finally, with lots of experimentation, I came to know that
>>
>> a word such as 'new' already present in the Analyzer's stop list
>>
>> will not return any hits [even when enclosed in quotes "\""],
>>
>> such as "New Year".
>>
>>
>> That's really interesting :(
>
>
> That's why it's called stop word *removal*.  The purpose of removing
> words is to save space and eliminate words that are ultra common.
> Tuning the analysis process to your domain/environment is by far the
> trickiest part of using Lucene, and often is not even much of a
> consideration as the built-in Analyzers suffice.  It sounds to me that
> your stop word list is far too aggressive and you should consider
> trimming down the list of words that are removed.
>
> Or, even consider not removing words at all.  From the What Would
> Google Do (WWGD)? category: does Google remove stop words?  I'll
> leave that as a rhetorical question for now :)
>
>   Erik
>
>>
>>
>> Thx
>> Karthik
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
>> Sent: Tuesday, August 17, 2004 7:35 PM
>> To: Lucene Users List
>> Subject: Re: AnalyZer HELP Please
>>
>>
>> On Aug 17, 2004, at 9:47 AM, Karthik N S wrote:
>>> I did as Erik replied in his mail,
>>> and searched for the complete word "\"New Year\"",
>>> but the QueryParser still returns a hit for "Year" only.
>>>
>>> [ The Analyzer I use has 555 English Stop words  with  "new" present
>>> in it ]
>>
>> No wonder!
>>
>>> That's when I checked up with the Analyzers to verify.
>>> If you look at the listed Analyzers' output,
>>> GrammerAnalyzer is the one that has 555 English STOPWORDS.
>>>
>>> Do you think this is a bug in my code?
>>
>> Whether this is a "bug" or not is really for your users to determine :)

RE: AnalyZer HELP Please

2004-08-18 Thread Tate Avery

That is interesting.  

I went to look up the cases for this (on Google).  
Here are my 4 queries and the results:


a) of the from it

- 25,500,000 matches containing 'of' and 'the' and 'from' and 'it'
- i.e. stop list NOT used if query is only stopwords

b) "of the from it"

- 49 results for exact phrase match 'of the from it'
- i.e. stop list NOT used (see next 2 for real phrase effect)

c) of the from it test

- The following words are very common and were not included in your search: of 
the from it.
- In short, 241,000,000 matching the word 'test'
- i.e. stop list used if there is a non-stopword in the query

d) "of the from it test"

- 0 matches for this exact phrase
- i.e. stoplist NOT used for any words in a phrase query


Tate

p.s.  Um... did you say that was a rhetorical question?  ;-)


-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 18, 2004 6:17 AM
To: Lucene Users List
Subject: Re: AnalyZer HELP Please




On Aug 18, 2004, at 3:41 AM, Karthik N S wrote:
> Hi Guys
>
>   Finally, with lots of experimentation, I came to know that
>
> a word such as 'new' already present in the Analyzer's stop list
>
> will not return any hits [even when enclosed in quotes "\""],
>
> such as "New Year".
>
>
> That's really interesting :(


That's why it's called stop word *removal*.  The purpose of removing 
words is to save space and eliminate words that are ultra common.  
Tuning the analysis process to your domain/environment is by far the 
trickiest part of using Lucene, and often is not even much of a 
consideration as the built-in Analyzers suffice.  It sounds to me that 
your stop word list is far too aggressive and you should consider 
trimming down the list of words that are removed.

Or, even consider not removing words at all.  From the What Would 
Google Do (WWGD)? category: does Google remove stop words?  I'll 
leave that as a rhetorical question for now :)

Erik

>
>
> Thx
> Karthik
>
>
>
>
>
> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 17, 2004 7:35 PM
> To: Lucene Users List
> Subject: Re: AnalyZer HELP Please
>
>
> On Aug 17, 2004, at 9:47 AM, Karthik N S wrote:
>> I did as Erik replied in his mail,
>> and searched for the complete word "\"New Year\"",
>> but the QueryParser still returns a hit for "Year" only.
>>
>> [ The Analyzer I use has 555 English Stop words  with  "new" present
>> in it ]
>
> No wonder!
>
>> That's when I checked up with the Analyzers to verify.
>> If you look at the listed Analyzers' output,
>> GrammerAnalyzer is the one that has 555 English STOPWORDS.
>>
>> Do you think this is a bug in my code?
>
> Whether this is a "bug" or not is really for your users to determine :)
>   But it is absolutely the expected behavior.  QueryParser analyzes the
> expression too.  Even if you somehow changed QueryParser, if you never
> indexed the word "new" then you certainly cannot expect to search on it
> and find it.
>
>   Erik
>
>





RE: Finding All?

2004-08-13 Thread Tate Avery

I had to do this once and I put a field called "all" with a value of "true" for every 
document.

_doc.add(Field.Keyword("all", "true"));

Then, if there was an empty query, I would substitute it for the query "all:true".  
And, of course, every doc would match this.

There might be a MUCH more elegant solution, but this certainly worked for me and was 
quite easy to incorporate.  And, it appears to order the documents by the order in 
which they were indexed.
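
The empty-query substitution mentioned above might look like this (a sketch; the class and method names are mine):

```java
public class AllQuery {
    // Substitute the catch-all query when the user typed nothing,
    // so every document (indexed with all:true) matches.
    public static String normalize(String userQuery) {
        if (userQuery == null || userQuery.trim().length() == 0) {
            return "all:true";
        }
        return userQuery;
    }
}
```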

T

p.s. You can probably do something using IndexReader directly... but the nice thing 
about this approach is that you are still just using a simple query.

-Original Message-
From: Patrick Burleson [mailto:[EMAIL PROTECTED]
Sent: Friday, August 13, 2004 3:25 PM
To: Lucene Users List
Subject: Finding All?


Is there a way for lucene to find all documents?  Say I have a
search input and someone puts nothing in; I want to go ahead and
return everything.  Passing "*" to QueryParser was not pretty.

Thanks,
Patrick




RE: boost keywords

2004-08-13 Thread Tate Avery

Well, as far as I know you can boost 3 different things:

- Field
- Document
- Query

So, I think you need to craft a solution using one of those.

Here are some possibilities for each:

1) Field
- make a keyword field which is alongside your content field
- boost your keyword field during indexing
- expand user queries to search 'content' and 'keywords'

2) Document
- I don't really think this one helps you in any way

3) Query
- Scan a user query and selectively boost words that are known keywords
- This requires a keyword list and is not really scalable

That is all that comes to mind, at first glance.  So, IMO, the winner IS #1.

For example:

Field _headline = Field.Text("headline", "...");
_headline.setBoost(3);

Field _content = Field.Text("content", "...");

_document.add(_headline);
_document.add(_content);


But, the tricky part is modifying queries to use both fields.  If a user enters 
"virus", it is easy (i.e. "content:(virus) OR headline:(virus)").  But, it quickly 
gets more complex with more complex queries (especially boolean queries with AND and 
such); you would probably need something roughly like this: "a AND b" = "content:(a 
AND b) OR headline:(a AND b) OR (content:a AND headline:b) OR (headline:a AND 
content:b)", and so on.
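
For the simple case where the user query is a plain AND of single terms, one way to avoid that combinatorial explosion is to expand per term rather than per combination: "a AND b" becomes (content:a OR headline:a) AND (content:b OR headline:b), which covers the cross-field matches in one pass.  A sketch (the class name is mine; the field names are from the example above):

```java
import java.util.List;

public class PerTermExpander {
    // Expand each term across all fields, then AND the per-term groups:
    // terms [a, b] over fields [content, headline] ->
    // "(content:a OR headline:a) AND (content:b OR headline:b)"
    public static String expand(List<String> terms, List<String> fields) {
        StringBuilder q = new StringBuilder();
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) q.append(" AND ");
            q.append('(');
            for (int j = 0; j < fields.size(); j++) {
                if (j > 0) q.append(" OR ");
                q.append(fields.get(j)).append(':').append(terms.get(i));
            }
            q.append(')');
        }
        return q.toString();
    }
}
```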

That's my 2 cents.

T



-Original Message-
From: news [mailto:[EMAIL PROTECTED] Behalf Of Leos Literak
Sent: Friday, August 13, 2004 8:52 AM
To: [EMAIL PROTECTED]
Subject: Re: boost keywords


Gerard Sychay wrote:
> Well, there is always the Lucene wiki. There's not a patterns page per
> se, but you could start one..

Of course I could, if I had something to add :-)

But back to my issue: no reaction?  So many people using
Lucene and no one knows?  I would be grateful for any
advice.  Thanks

Leos





RE: Understanding Boolean Queries

2004-04-29 Thread Tate Avery

Sorry, I retract this statement...

>> 1) With _numClauses= and _required=false (for example), I have no
>> problems.  (This is confusing since  is more than maxClauseCount...
>> but I won't complain).

My little test app was using lucene-1.3-rc1.jar.  But, my REAL app is using
lucene-1.3-final.jar.

So, with _numClauses=1024 and _required=false, I have no problems.
And, with _numClauses=1025 and _required=false, I get the TooManyClauses
exception.

All of that is fine and good.  My main concern is the 32/33 threshold when
_required=true (see details below).


Tate



-Original Message-
From: Tate Avery [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 29, 2004 1:30 PM
To: 'Lucene Users List'
Cc: [EMAIL PROTECTED]
Subject: RE: Understanding Boolean Queries


Thank you for the response.

I am not using the QueryParser directly... it was just part of my overall
understanding of how this exception is coming about.  Same thing,
essentially, with the maxClauseCount.


Here is some code to illustrate what is confusing me and what I am trying to
ascertain:

int _numClauses = XXX;
boolean _required = XXX;  // 3 examples of these var settings below

BooleanQuery _query = new BooleanQuery();

for (int _i = 0; _i < _numClauses; _i++)
{
    _query.add(
        new BooleanClause(
            new TermQuery(new Term("body", "term" + _i)),
            _required,
            false));
}

Hits _hits = new IndexSearcher(INDEX_DIR).search(_query);


1) With _numClauses= and _required=false (for example), I have no
problems.
(This is confusing since  is more than maxClauseCount... but I won't
complain).

2) With _numClauses=32 and _required=true, I also have no problems.

3) With _numClauses=33 and _required=true, I get
"java.lang.IndexOutOfBoundsException: More than 32 required/prohibited
clauses in query." as a runtime exception.


So, I guess I am trying to ask the following:

Is a query like (T1 AND T2 AND ... AND T32 AND T33) just completely illegal
for Lucene?
OR is there some way to extend this limit?
OR am I missing something that is clouding my understanding?



Thanks,
Tate



-Original Message-
From: Stephane James Vaucher [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 29, 2004 1:10 PM
To: Lucene Users List; [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: Understanding Boolean Queries


On Thu, 29 Apr 2004, Tate Avery wrote:

> Hello,
>
> I have been reviewing some of the code related to boolean queries and I
> wanted to see if my understanding is approximately correct regarding how
> they are handled and, more importantly, the limitations.

You can always submit requests for enhancements in bugzilla, so as to keep
track of this issue.

> Here is what I have come to understand so far:
>
> 1) The QueryParser code generated from javacc will parse my boolean query
> and determine for each clause whether or not it is 'required' (based on a few
> conditions, but, in short, whether or not it was introduced or followed by
> 'AND') or 'prohibited' (based, in short, on it being preceded by 'NOT').

Your usage seems pretty particular, why are you using the javacc
QueryParser?

> 2) As my BooleanQuery is being constructed, it will throw a
> BooleanQuery.TooManyClauses exception if I exceed
> BooleanQuery.maxClauseCount (which defaults to 1024).

It's configurable through sys properties or by
BooleanQuery.setMaxClauseCount(int maxClauseCount)
>
> 3) The maxClauseCount threshold appears not to care whether or not my
> clauses are 'required' or 'prohibited'... only how many of them there
> are in total.
>
> 4) My BooleanQuery will prepare its own Scorer instance (i.e.
> BooleanScorer).  And, during this step, it will identify to the scorer
> which clauses are 'required' or 'prohibited'.  And, if more than 32
> fall into this category, an IndexOutOfBoundsException ("More than 32
> required/prohibited clauses in query.") is thrown.
>
> That's as far as I got.
>
> Now, I am a bit confused at this point.  Does this mean I can make a
> boolean query consisting of up to 1024 clauses as long as no more than
> 32 of them are required or prohibited?  This doesn't seem right.  So,
> am I missing something in the way I am understanding this?
>
> I am (as you may have guessed) generating large boolean queries.  And,
> in some rare cases, I am receiving the exception identified in #4
> (above).  So, I am trying to figure out whether or not I need to
> change/filter my queries in a special way in order to avoid this
> exception.  And, in order to do this, I want to understand how these
> queries are being handled.

RE: Understanding Boolean Queries

2004-04-29 Thread Tate Avery
Thank you for the response.

I am not using the QueryParser directly... it was just part of my overall
understanding of how this exception is coming about.  Same thing,
essentially, with the maxClauseCount.


Here is some code to illustrate what is confusing me and what I am trying to
ascertain:

int _numClauses = XXX;
boolean _required = XXX;  // 3 examples of these var settings below

BooleanQuery _query = new BooleanQuery();

for (int _i = 0; _i < _numClauses; _i++)
{
    _query.add(
        new BooleanClause(
            new TermQuery(new Term("body", "term" + _i)),
            _required,
            false));
}

Hits _hits = new IndexSearcher(INDEX_DIR).search(_query);


1) With _numClauses= and _required=false (for example), I have no
problems.
(This is confusing since  is more than maxClauseCount... but I won't
complain).

2) With _numClauses=32 and _required=true, I also have no problems.

3) With _numClauses=33 and _required=true, I get
"java.lang.IndexOutOfBoundsException: More than 32 required/prohibited
clauses in query." as a runtime exception.


So, I guess I am trying to ask the following:

Is a query like (T1 AND T2 AND ... AND T32 AND T33) just completely illegal
for Lucene?
OR is there some way to extend this limit?
OR am I missing something that is clouding my understanding?



Thanks,
Tate



-Original Message-
From: Stephane James Vaucher [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 29, 2004 1:10 PM
To: Lucene Users List; [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: Understanding Boolean Queries


On Thu, 29 Apr 2004, Tate Avery wrote:

> Hello,
>
> I have been reviewing some of the code related to boolean queries and I
> wanted to see if my understanding is approximately correct regarding how
> they are handled and, more importantly, the limitations.

You can always submit requests for enhancements in bugzilla, so as to keep
track of this issue.

> Here is what I have come to understand so far:
>
> 1) The QueryParser code generated from javacc will parse my boolean query
> and determine for each clause whether or not it is 'required' (based on a few
> conditions, but, in short, whether or not it was introduced or followed by
> 'AND') or 'prohibited' (based, in short, on it being preceded by 'NOT').

Your usage seems pretty particular, why are you using the javacc
QueryParser?

> 2) As my BooleanQuery is being constructed, it will throw a
> BooleanQuery.TooManyClauses exception if I exceed
> BooleanQuery.maxClauseCount (which defaults to 1024).

It's configurable through sys properties or by
BooleanQuery.setMaxClauseCount(int maxClauseCount)
>
> 3) The maxClauseCount threshold appears not to care whether or not my
> clauses are 'required' or 'prohibited'... only how many of them there
> are in total.
>
> 4) My BooleanQuery will prepare its own Scorer instance (i.e.
> BooleanScorer).  And, during this step, it will identify to the scorer
> which clauses are 'required' or 'prohibited'.  And, if more than 32
> fall into this category, an IndexOutOfBoundsException ("More than 32
> required/prohibited clauses in query.") is thrown.
>
> That's as far as I got.
>
> Now, I am a bit confused at this point.  Does this mean I can make a
> boolean query consisting of up to 1024 clauses as long as no more than
> 32 of them are required or prohibited?  This doesn't seem right.  So,
> am I missing something in the way I am understanding this?
>
> I am (as you may have guessed) generating large boolean queries.  And,
> in some rare cases, I am receiving the exception identified in #4
> (above).  So, I am trying to figure out whether or not I need to
> change/filter my queries in a special way in order to avoid this
> exception.  And, in order to do this, I want to understand how these
> queries are being handled.
>
> Finally, is there something related to the query syntax that could be
> my mistake?  For example, what is the difference between:
>   "A B" AND "C D" AND "D E"
> ... and ...
>   ("A B") AND ("C D") AND ("D E")
> ... could that be the crux of it?

I can't help you here, and the doc seems rather thin (or nonexistent for
this class). I don't know the relation between the query and how the
scorer will process it.

Sorry I can't be of assistance,
sv

> Thank you for your time,
> Tate Avery
>
>



Understanding Boolean Queries

2004-04-29 Thread Tate Avery
Hello,

I have been reviewing some of the code related to boolean queries and I
wanted to see if my understanding is approximately correct regarding how
they are handled and, more importantly, the limitations.


Here is what I have come to understand so far:

1) The QueryParser code generated from javacc will parse my boolean query
and determine for each clause whether or not it is 'required' (based on a few
conditions, but, in short, whether or not it was introduced or followed by
'AND') or 'prohibited' (based, in short, on it being preceded by 'NOT').

2) As my BooleanQuery is being constructed, it will throw a
BooleanQuery.TooManyClauses exception if I exceed
BooleanQuery.maxClauseCount (which defaults to 1024).

3) The maxClauseCount threshold appears not to care whether or not my
clauses are 'required' or 'prohibited'... only how many of them there are
in total.

4) My BooleanQuery will prepare its own Scorer instance (i.e.
BooleanScorer).  And, during this step, it will identify to the scorer which
clauses are 'required' or 'prohibited'.  And, if more than 32 fall into this
category, an IndexOutOfBoundsException ("More than 32 required/prohibited
clauses in query.") is thrown.

That's as far as I got.

Now, I am a bit confused at this point.  Does this mean I can make a boolean
query consisting of up to 1024 clauses as long as no more than 32 of them
are required or prohibited?  This doesn't seem right.  So, am I missing
something in the way I am understanding this?

I am (as you may have guessed) generating large boolean queries.  And, in
some rare cases, I am receiving the exception identified in #4 (above).  So,
I am trying to figure out whether or not I need to change/filter my queries
in a special way in order to avoid this exception.  And, in order to do
this, I want to understand how these queries are being handled.

Finally, is there something related to the query syntax that could be my
mistake?  For example, what is the difference between:
  "A B" AND "C D" AND "D E"
... and ...
  ("A B") AND ("C D") AND ("D E")
... could that be the crux of it?

Thank you for your time,
Tate Avery
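
One possible workaround for the 32 required/prohibited-clause limit described in #4 (my own sketch, not something Lucene 1.3 provides out of the box): group the required terms into sub-queries of at most 32 clauses each, then add each group as a single required clause of a top-level BooleanQuery, so no one BooleanQuery level exceeds the BooleanScorer limit.  The grouping itself is just list partitioning:

```java
import java.util.ArrayList;
import java.util.List;

public class ClausePartitioner {
    // Split terms into groups of at most maxPerGroup.  Each group would
    // become one required sub-BooleanQuery; the groups are then ANDed
    // together at the top level.
    public static List<List<String>> partition(List<String> terms, int maxPerGroup) {
        List<List<String>> groups = new ArrayList<List<String>>();
        for (int i = 0; i < terms.size(); i += maxPerGroup) {
            groups.add(new ArrayList<String>(
                terms.subList(i, Math.min(i + maxPerGroup, terms.size()))));
        }
        return groups;
    }
}
```

With 33 required terms this yields two groups (32 + 1), so the top-level query has only 2 required clauses.  Whether the resulting scoring is acceptable for a given application is worth checking.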





RE: BooleanScorer - 32 required/prohibited clause limit

2004-04-27 Thread Tate Avery

Or, if I overlooked some previous post or thread that covers this, please
help me track it down.

Thank you,
Tate

-Original Message-
From: Tate Avery [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 27, 2004 10:20 AM
To: [EMAIL PROTECTED]
Subject: BooleanScorer - 32 required/prohibited clause limit


Hello,

I am using Lucene 1.3 and I ran into the following exception:

java.lang.IndexOutOfBoundsException: More than 32 required/prohibited
clauses in query.
at org.apache.lucene.search.BooleanScorer.add(BooleanScorer.java:98)

Is there any easy way to fix/adjust this (like the
BooleanQuery.maxClauseCount, for example)?
Strangely, I couldn't find mention of the BooleanScorer class in my javadoc.


Thank you for any tips.

Tate

p.s.  Yes, I am intentionally generating some rather long boolean queries.
:)





BooleanScorer - 32 required/prohibited clause limit

2004-04-27 Thread Tate Avery
Hello,

I am using Lucene 1.3 and I ran into the following exception:

java.lang.IndexOutOfBoundsException: More than 32 required/prohibited
clauses in query.
at org.apache.lucene.search.BooleanScorer.add(BooleanScorer.java:98)

Is there any easy way to fix/adjust this (like the
BooleanQuery.maxClauseCount, for example)?
Strangely, I couldn't find mention of the BooleanScorer class in my javadoc.


Thank you for any tips.

Tate

p.s.  Yes, I am intentionally generating some rather long boolean queries.
:)





RE: Software for suggesting alternative words or sentences

2004-04-16 Thread Tate Avery

Also...

http://jazzy.sourceforge.net/


-Original Message-
From: Felix Huber [mailto:[EMAIL PROTECTED]
Sent: Friday, April 16, 2004 1:17 PM
To: Lucene Users List
Subject: Re: Software for suggesting alternative words or sentences


Check http://www.iu.hio.no/~frodes/sprell/sprell.html - it includes a german
and a norwegian dictionary.

Regards,
Felix Huber



Venu Durgam wrote:
> I was wondering if there is any open source software for suggesting
> alternative words or sentences for search queries like Google.
>
> Thanks
> Venu Durgam





Numeric field data

2004-04-02 Thread Tate Avery
Hello,

Is there a way (direct or indirect) to support a field with numeric data?

More specifically, I would be interested in doing a range search on numeric
data and having something like:

number:[1 TO 2]

... and not have it return 11 or 103, etc.  But, return 1.5, for example.

Is there any support in current and/or upcoming versions for this type of
thing?
Or, has anyone figured out a creative workaround to obtain the desired
result?


Thank you for any comments,
Tate

p.s.  Ideally, I would be able to do equal, greater than, less than and
these in combination with each other (i.e. ranges, greater than or equal to,
etc.).
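
One common workaround (an assumption on my part that it fits your data, not a built-in Lucene feature) is to index numbers as fixed-width, zero-padded strings with a fixed number of decimal places, so that lexicographic term order matches numeric order and range queries behave.  A sketch:

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class NumberPadding {
    // Fixed-width, zero-padded rendering: for non-negative values, string
    // order matches numeric order, so number:[0000000001.000 TO 0000000002.000]
    // behaves as a numeric range over lexicographic terms.
    public static String pad(double value, int intDigits, int fracDigits) {
        StringBuilder pattern = new StringBuilder();
        for (int i = 0; i < intDigits; i++) pattern.append('0');
        pattern.append('.');
        for (int i = 0; i < fracDigits; i++) pattern.append('0');
        DecimalFormat f = new DecimalFormat(pattern.toString(),
                new DecimalFormatSymbols(Locale.US));
        return f.format(value);
    }
}
```

With this, 1.5 is indexed as 0000000001.500 and falls inside [0000000001.000 TO 0000000002.000], while 11 and 103 fall outside.  Negative numbers would need extra handling (e.g. adding a constant offset before padding).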





RE: Nested category strategy

2004-04-01 Thread Tate Avery

Could you put them all into a tab-delimited string and store that as a
single field, then use a TabTokenizer on the field to search?

And, if you need to, do a .split("\t") on the field value in order to break
them back up into individual categories.
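
A sketch of that round trip, plus the prefix test from point 2 of the original question (class and method names are mine):

```java
public class CategoryField {
    // Join the category paths into one tab-delimited value for a single field.
    public static String join(String[] categories) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < categories.length; i++) {
            if (i > 0) sb.append('\t');
            sb.append(categories[i]);
        }
        return sb.toString();
    }

    // Split the stored value back into the individual category paths.
    public static String[] split(String joined) {
        return joined.split("\t");
    }

    // Does any of the document's categories fall under the given prefix
    // (e.g. "/Science/Medicine/")?
    public static boolean matchesPrefix(String joined, String prefix) {
        for (String c : split(joined)) {
            if (c.startsWith(prefix)) return true;
        }
        return false;
    }
}
```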




-Original Message-
From: David Black [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 01, 2004 2:49 PM
To: [EMAIL PROTECTED]
Subject: Nested category strategy


Hey All,

I'm trying to figure out the best approach to something.

Each document I index has an array of categories which looks like the
following example

/Science/Medicine/Serology/blood gas
/Biology/Fluids/Blood/

etc.

Anyway, there's a couple things I'm trying to deal with.

1. The fact that we have an undefined array size.  I can't just shove
these into a single field.  I could explode them into multiple fields
on the fly, like category_1, category_2, etc.

2. The fact that a search like "category:/Science/Medicine/*" would
need to return all items within that category.

Thanks in advance to anyone who can give me some help here.

Thanks





Searching in "all"

2004-04-01 Thread Tate Avery
Hello,

If I have, for example, 3 fields in my document (title, body, notes)... is there some easy way to search 'all'?


Below are the only 2 ideas I currently have/use:

1) If I want to search for 'x' in all, I do something like:
title:x OR body:x OR notes:x

... but this does not really work if you are searching for (a AND b) and a is in the title and b is in the notes, etc... leading to an explosion of boolean combinations, it seems.


2) Actually index an 'all' field for my document by just concatenating the content 
from the title, body, and notes fields.
... but this doubles my index size.  :(


So, is there a better way out there?

Thanks,
Tate
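
Option 2 can be sketched against the Lucene 1.4-era API (field names are illustrative). Making the 'all' field unstored at least avoids duplicating the stored copy, although the indexed terms are still duplicated:

```java
Document doc = new Document();
doc.add(Field.Text("title", title));
doc.add(Field.Text("body", body));
doc.add(Field.Text("notes", notes));
// Indexed but not stored: searchable as "all:x" without a second stored copy.
doc.add(Field.UnStored("all", title + " " + body + " " + notes));
```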




Natural Language Queries

2004-01-26 Thread Tate Avery
Hello,

Has anyone come across a good (preferably open-source) module for parsing natural 
language queries into Lucene queries?
I.e., identifying concepts (single- vs. multi-word), concept expansion (via thesauri), filtering extraneous words, etc.


Any information would be appreciated.


Thank you,
Tate




RE: Displaying Query

2003-12-17 Thread Tate Avery

Try:

String larequet = query.toString("default field name here");

Example:

String larequet = query.toString("texte");

That should give you the string version of the query.


-Original Message-
From: Gayo Diallo [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 17, 2003 10:46 AM
To: [EMAIL PROTECTED]
Subject: Displaying Query


Hi all,

I use this code
Query query = QueryParser.parse(q, "Contenu", new Analyseur());

String larequet = query.toString();

System.out.println("la requête à traiter est: " + larequet);

And I have as this line displayed "[EMAIL PROTECTED]"

I don't know why my query string isn't displayed correctly. Can someone help me?

Best regards,

Gayo




RE: SearchBlox J2EE Search Component Version 1.1 released

2003-12-02 Thread Tate Avery

If you buy it, apparently:
http://www.searchblox.com/buy.html



-Original Message-
From: Tun Lin [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 02, 2003 10:43 AM
To: 'Lucene Users List'; [EMAIL PROTECTED]
Subject: RE: SearchBlox J2EE Search Component Version 1.1 released


Hi,

Just a feedback.

SearchBlox can only search HTML files. Will SearchBlox support PDF, XML and Word documents in the future? It would be perfect if it could support all the document types mentioned above.

-Original Message-
From: Robert Selvaraj [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, December 02, 2003 10:42 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: SearchBlox J2EE Search Component Version 1.1 released

SearchBlox is a J2EE search component that enables you to add search
functionality to your applications, intranets or portals in a matter of minutes.
SearchBlox uses Lucene Search API and features integrated HTTP and File System
crawlers, support for different document formats, support for indexing and
searching content in 15 languages and customizable search results, all
controlled from a browser-based Admin Console.


Main features in this update:
=
- Asian language support. SearchBlox now supports Japanese, Chinese Simplified,
Chinese Traditional and Korean language content.
- Performance enhancements to search
- Improved Hit Highlighting

SearchBlox is available as a Web Archive (WAR) and is deployable on any Servlet
2.3/JSP 1.2 compliant server. SearchBlox Getting-Started Guides are available
for the following servers:

JBoss - http://www.searchblox.com/gettingstarted_jboss.html
Jetty - http://www.searchblox.com/gettingstarted_jetty.html
JRun - http://www.searchblox.com/gettingstarted_jrun.html
Pramati - http://www.searchblox.com/gettingstarted_pramati.html
Resin - http://www.searchblox.com/gettingstarted_resin.html
Tomcat - http://www.searchblox.com/gettingstarted_tomcat.html
Weblogic - http://www.searchblox.com/gettingstarted_weblogic.html
Websphere - http://www.searchblox.com/gettingstarted_websphere.html


The SearchBlox FREE Edition is available free of charge and can index up to 1000
HTML documents.

The software can be downloaded from http://www.searchblox.com






RE: New Lucene-powered Website

2003-12-02 Thread Tate Avery
Hello,

This is the first time that I noticed this.

Is the 'powered by Lucene' a legal requirement?  Or just a suggestion?
Does it apply to any system embedding Lucene (web pages, applications, etc)?
That is not covered in the Apache Software License, I believe.

Just curious...

Tate



-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 02, 2003 9:26 AM
To: Lucene Users List
Subject: Re: New Lucene-powered Website


There was discussion about it, yes.  I don't think we ever reached any
conclusions, and the powered.html still says 'include the logo'.

Otis

--- Erik Hatcher <[EMAIL PROTECTED]> wrote:
> On Tuesday, December 2, 2003, at 07:34  AM, Otis Gospodnetic wrote:
> > Could you add a Lucene logo somewhere on your search results, as
> noted
> > here:
> > http://jakarta.apache.org/lucene/docs/powered.html ?
> 
> I thought we were going to loosen up the requirement to have the logo
> 
> on a search results page?
> 
> 
> 






RE: Ask something about lucene

2003-11-19 Thread Tate Avery
Have a look at the API

http://jakarta.apache.org/lucene/docs/api/

For example, the Hits object has a score
see: org.apache.lucene.search.Hits  (score)

And the IndexReader allows you to get num docs in the index and term data, etc.
see: org.apache.lucene.index.IndexReader   (numDocs, docFreq)


Tate
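
Pulled together, those calls look roughly like this (Lucene 1.4-era API; the index path and field name are assumptions):

```java
IndexReader reader = IndexReader.open("/path/to/index");
int numDocs = reader.numDocs();                               // documents in the index
int docFreq = reader.docFreq(new Term("contents", "lucene")); // docs containing the term

Searcher searcher = new IndexSearcher(reader);
Hits hits = searcher.search(
        QueryParser.parse("lucene", "contents", new StandardAnalyzer()));
for (int i = 0; i < hits.length(); i++) {
    float score = hits.score(i);   // relevance score for each hit
}
```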


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 19, 2003 9:29 AM
To: [EMAIL PROTECTED]
Subject: Ask something about lucene



Dear all,

I am a newcomer to this Lucene e-mail list. I am doing research on distributed information retrieval, so I would like to have a few different systems. I ran the simple demo of Lucene; it does not seem to satisfy all my requirements.

I hope that, for the documents in the result, there is a score for each of them to indicate the possible relevance of the document to the given query.

Some other questions: does the index include any statistics about the collection? For example, how many documents in the collection have been indexed? What is the frequency of a particular term?

Generally speaking, I would like to have a system that can run queries as in the TREC conference. If Lucene does not work like that, is it possible for me to turn Lucene into such a system with modest effort? Or has anyone done something like that?

Especially those of you with an inside understanding of Lucene, please help me. Thanks a lot.

Shengli

 




Which operations change document ids?

2003-11-17 Thread Tate Avery
Hello,

I am considering using the document id in order to implement a fast 'join' during 
relational search.

My first question is:  should I steer clear of this all together?  And why?  If not, I 
need to know which Lucene operations can cause document ids to change.

I am assuming that the following can cause potential changes:

1) Add document
- since it might trigger a merge

2) Optimize index
- since it does trigger a merge

3) Update document
- since it is a delete + add

What else could cause a document id to change?  Could delete provoke a doc id change?

And, I am assuming that the following DO NOT change the document id:

1) Query the index


Also, am I missing any others that will or will not cause a document id to change?  

Thank you,

Tate


P.S. It appears (to me) that the SearchBean (in lucene sandbox) sorting makes use of 
the Hits.id(int _n) method.  How does it cope, if at all, with changes to the 
underlying document ids?




RE: Document Clustering

2003-11-11 Thread Tate Avery
Categorization typically assigns documents to a node in a pre-defined taxonomy.

For clustering, however, the categorization 'structure' is emergent... i.e. the 
clusters (which are analogous to taxonomy nodes) are created dynamically based on the 
content of the documents at hand.


-Original Message-
From: petite_abeille [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 11, 2003 10:50 AM
To: Lucene Users List
Subject: Re: Document Clustering


Hi Otis,

On Nov 11, 2003, at 16:41, Otis Gospodnetic wrote:

> How is document clustering different/related to text categorization?

Not that I'm an expert in any of this, but clustering is a much more 
"holistic" approach than categorization. Usually, categorization is 
understood as a more precise endeavor (e.g. dmoz.org), while clustering 
is much more "fuzzy" and non-deterministic. Both try to achieve the 
same goal though. So perhaps this is just a question of jargon.

I'm confident that the owner of this site could help bring some light 
on the finer point of clustering vs categorization:

http://www.lissus.com/resources/index.htm

Cheers,

PA.





Relational Search

2003-11-04 Thread Tate Avery
Hello,

I want to perform a 'relational search', meaning that I want to search 2 indexes and perform an intersection between the 2.  It would be very much like a table join in an SQL statement in terms of overall result.

So, I might have an index of documents of type A that would allow me to retrieve IDs.
And an index of documents of type B that would also return IDs of A type documents 
(like a foreign key).

So, I want to find all documents of type A that match query X and also match query Y 
on the B-type index.

Using SQL Server (with full-text search) it might be something like:

SELECT a.name FROM index1 a, index2 b 
WHERE CONTAINS(a.body, 'x') 
   AND b.id = a.id AND CONTAINS(b.title, 'y')

Doing my own intersection is not desirable (since loading the IDs from both results 
can be very time-consuming).
And, building a huge boolean query that performs the intersection (by loading the 
smaller ID set only) is still too slow.
I don't mind doing any extra work so long as I get a reasonably efficient relational search in the end.

Does anyone have any creative ideas for tackling this problem with Lucene?

Thank you,
Tate


p.s.  Using the Multisearcher allows for easy unions  :) ... but not intersections.




RE: large index query time

2003-10-24 Thread Tate Avery

Below are some posts from Doug (circa 2001) that I found very helpful with regard to 
understanding Lucene scalability.  I am assuming that they are still generally 
applicable.  You might also find them useful.

Tate


---


Performance for large indices is frequently governed by i/o performance.  If
an index is larger than RAM then searches will need to read data from disk.
This can quickly become a bottleneck.  A search for a term that occurs in a
million documents can require over 1MB of data, which can take some time to
read.  With multiple searching threads, the disk can easily become a
bottleneck.  Disk arrays can alleviate this, more RAM helps even more!

For some folks, queries that take over a second are unacceptable, for
others, ten seconds is okay.

Performance should be more-or-less linear: a two-million document index will
be almost twice as slow to search as a one-million document index.  There
are lots of factors, including document size, CPU-speed, RAM-size, i/o
subsystem, but a rough rule-of-thumb for Lucene performance might be that,
in a "typical" configuration, it can search a million documents per second.

So if you need to search 20 million 100kB documents on a 100Mhz 386 with 8MB
of RAM with sub-second response time, Lucene will probably fail.  But if you
need to search two million 2kB documents on a 500Mhz Pentium with 128MB of
RAM in a couple of seconds per query, you're probably okay.

- Doug Cutting (10/08/2001)


Some more precise statements: The cost to search for a term is proportional
to the number of documents that contain that term.  The cost to search for a
phrase is proportional to the sum of the number of occurrences of its
constituent terms.  The cost to execute a boolean query is the sum of the
costs of its sub-queries.  Longer documents contain more terms: usually both
more unique terms and more occurrences.

Total vocabulary size is not a big factor in search performance.  When you
open an index Lucene does read one out of every 128 unique terms into a
table, so an index with a large number of unique terms will be slower to
open.  Searching that table for query terms is also slower for bigger
indexes, but the time to search that table is not significant in overall
performance.  Lucene also reads at index open one byte per document per
indexed field (the normalization factor).  So an index with lots of
documents and fields will also be slower to open.  But, once opened, the
cost of searching is largely dependent on the frequency characteristics of
query terms.  And, since IndexReaders and Searchers are thread safe, you
don't need to open indexes very often.

- Doug Cutting (10/08/2001)





-Original Message-
From: Dan Quaroni [mailto:[EMAIL PROTECTED]
Sent: October 24, 2003 1:33 PM
To: 'Lucene Users List'
Subject: RE: large index query time


My experience is that the query time (and memory usage) can be affected
greatly by booleans that retrieve lots of results.

Are you finding it slow when doing a simple query that should return only a
handful of results, or is it on more complex queries?

-Original Message-
From: Maurice Coyle [mailto:[EMAIL PROTECTED]
Sent: Friday, October 24, 2003 1:29 PM
To: Lucene Users List
Subject: large index query time


hi,
i recently merged a whole lot of indexes into one big index for testing
purposes.  however, now the programs i use to search the index are taking
much longer.  this may be a stupid question (or very simple) and please tell
me if it is, but should this be the case?  i mean, i realise it'll take
longer to search over a larger collection, but it's taking an order of
magnitude longer.  this is the reaosn i'm asking, since if lucene is capable
of handling large-scale search apps presumably it's set up to search large
collections rapidly.

maybe there's some steps i can take to speed things up (i optimised the big
index when it was finished being created) or something i'm missing?  if i
can give any information which will help the diagnosis of this problem
please specify it.

thanks,
maurice





RE: Exact Match

2003-10-22 Thread Tate Avery

To ensure I understand...

If you have:

1)  A B C
2)  B C
3)  B C D
4)  C

You want "B C" to match #2 only
But, "C" to match #1, #2, #3, and #4

If so, you can have a tokenized field and an untokenized one...

Use the untokenized for matching 'exact' strings
Use the tokenized for finding a single word in the string

I.e.  check "B C" against untokenized
  check "C" against tokenized

That is, if you don't mind indexing the same data into 2 different fields.
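
As a sketch (Lucene 1.x API; field names are made up): Field.Text produces a tokenized field, while Field.Keyword indexes the whole string as a single term.

```java
Document doc = new Document();
doc.add(Field.Text("phrase", "B C"));         // tokenized: matches individual words
doc.add(Field.Keyword("phraseExact", "B C")); // untokenized: one term, exact match only

// "B C" as an exact match: a TermQuery against the untokenized field.
Query exact = new TermQuery(new Term("phraseExact", "B C"));
// "C" as a word match: a TermQuery against the tokenized field
// (lowercased here, assuming an analyzer that lowercases).
Query word = new TermQuery(new Term("phrase", "c"));
```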


-Original Message-
From: Wilton, Reece [mailto:[EMAIL PROTECTED]
Sent: October 22, 2003 12:49 PM
To: Lucene Users List
Subject: RE: Exact Match


If I use an untokenized field, would "fox" match this as well?  I need
to support both exact match searches and searches where one word exists
in the field.

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 22, 2003 9:44 AM
To: Lucene Users List
Subject: Re: Exact Match

Wilton, Reece wrote:
> Does Lucene support exact matching on a tokenized field?
> 
> So for example... if I add these three phrases to the index:
> - "The quick brown fox"
> - "The quick brown fox jumped"
> - "brown fox"
> 
> I want to be able to do an exact field match so when I search for
"brown
> fox" I only get the last one returned.  I can do this in my own code
by
> storing the data and then comparing it to the search phrase.  Is that
> the best way of doing this?

Why not just use an untokenized field?  Then just use a TermQuery, 
searching for the term "brown fox".

Doug





RE: Lucene on Windows

2003-10-21 Thread Tate Avery
Doug,

Re: high merge factor.  I was building test indexes, writing out 300 segments of 300 docs each; merging them every 90,000 docs kept the 'merging' time to a minimum (for my slowish HD).

I was assuming that 11 of these large merges during the indexing of 1,000,000 docs 
(plus a final optimize) would be faster than 10,000 little merges if the mergeFactor 
was set to 10 (for the same corpus).

Maybe this is not the case.




Tate


-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: October 21, 2003 12:37 PM
To: Lucene Users List
Subject: Re: Lucene on Windows


Tate Avery wrote:
> You might have trouble with "too many open files" if you set your mergeFactor too 
> high.  For example, on my Win2k, I can go up to mergeFactor=300 (or so).  At 400 I 
> get a too many open files error.  Note: the default mergeFactor of 10 should give no 
> trouble.

Please note that it is never recommended that you set mergeFactor 
anywhere near this high.  I don't know why folks do this.  It really 
doesn't make indexing much faster, and it makes searching slower if you 
don't optimize.  It's a bad idea.  The default setting of 10 works 
pretty well.  I've also had good experience setting it as high as 50 on 
big batch indexing runs, but do not recommend setting it much higher 
than that.  Even then, this can cause problems if you need to use 
several indexes at once, or you have lots of fields.

Doug
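
For reference, mergeFactor in Lucene 1.x is a public field on IndexWriter; a minimal sketch (the index path is an assumption):

```java
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
// Default is 10; Doug suggests at most around 50 for big batch runs.
writer.mergeFactor = 50;
// ... add documents ...
writer.optimize();  // merge down before searching if a high factor was used
writer.close();
```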





RE: Lucene on Windows

2003-10-20 Thread Tate Avery

You might have trouble with "too many open files" if you set your mergeFactor too 
high.  For example, on my Win2k, I can go up to mergeFactor=300 (or so).  At 400 I get 
a too many open files error.  Note: the default mergeFactor of 10 should give no 
trouble.

FYI - On my linux box, I got the 'too many open' error on mergeFactor=300 (and 200).  
So, I am using 100.


Tate


-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: October 20, 2003 12:11 PM
To: Lucene Users List
Subject: Re: Lucene on Windows



On Monday, October 20, 2003, at 12:00  PM, Steve Jenkins wrote:
> Hi,
>
> Wonder if anyone can help. Has anyone used Lucene on a Windows 
> environment?
> Anyone know of any documentation specifically focused on doing that?
> Or anyone know of any gotchas to avoid?

Yup, used Lucene on Windows lots.  Is there a specific issue you feel 
is Windows related?  Its pure Java and works the same on all supported 
platforms.  So no real gotchas with respect to Windows.

Erik

