Re: QueryParser Rules article (Erik Hatcher)

2003-11-12 Thread Erik Hatcher
On Wednesday, November 12, 2003, at 11:52  PM, Tomcat Programmer wrote:
> I thought Erik's article was great. There was one
> unanswered brainbender I had which I was hoping was in
> there, but... Maybe you can add this topic to the next
> one, Erik?
Well, I'm not sure another article on QueryParser is warranted (yet),
but I'll offer a response here.

> When using the QueryParser class, the parse method
> will throw a TokenMgrError when there is a syntax
> error even as simple as a missing quote at the end of
> a phrase query. According to the javadoc, you should
> never see this class derived from Error being thrown
> (oops?)
You must be using the instance parse method, rather than the static 
one.  The static one does this:

try {
  QueryParser parser = new QueryParser(field, analyzer);
  return parser.parse(query);
}
catch (TokenMgrError tme) {
  // wrap the unchecked Error in the checked ParseException
  throw new ParseException(tme.getMessage());
}
But the instance parse method is declared to throw a TokenMgrError.

Why is that?   I'd be happy to put that same try/catch in the instance 
parse method, although I want to double check (CC'ing lucene-dev on 
this one).

Any reason not to remove the TokenMgrError exception from the instance 
parse method?

> Has anyone discovered a good practice for trapping
> syntax problems and then returning an informative
> message to the user on how to fix their query? I would
> be interested in code samples as well if you have any
> :)
There is the JavaScript piece in the sandbox that could help by
pre-parsing expressions for validity.  Otherwise, simply displaying
acceptable examples of expressions is what I'd do.
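For what it's worth, a rough Java analogue of that kind of pre-parse validity check might look like this. This is only a sketch: the class name and messages are invented, and it only catches the most common slips (an unbalanced quote or parenthesis) before the string ever reaches QueryParser.

```java
// Hypothetical pre-flight validator, in the spirit of the sandbox
// JavaScript validity check: reject obviously malformed expressions
// with a user-facing hint instead of letting the parser throw.
class QueryPreCheck {

    /** Returns null if the expression looks balanced, else a hint for the user. */
    static String check(String q) {
        int parens = 0;
        boolean inQuote = false;
        for (int i = 0; i < q.length(); i++) {
            char c = q.charAt(i);
            if (c == '"') {
                inQuote = !inQuote;
            } else if (!inQuote && c == '(') {
                parens++;
            } else if (!inQuote && c == ')') {
                parens--;
            }
            if (parens < 0) {
                return "Unmatched ')' at position " + i;
            }
        }
        if (inQuote) {
            return "Your phrase is missing a closing quote (\")";
        }
        if (parens != 0) {
            return "Unmatched '(' in your query";
        }
        return null;
    }
}
```

Only if the check passes would you hand the string to parse(); otherwise you show the hint plus a few acceptable example queries.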

	Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


QueryParser Rules article (Erik Hatcher)

2003-11-12 Thread Tomcat Programmer
I thought Erik's article was great. There was one
unanswered brainbender I had which I was hoping was in
there, but... Maybe you can add this topic to the next
one, Erik? 

Here is my issue: 

When using the QueryParser class, the parse method
will throw a TokenMgrError when there is a syntax
error even as simple as a missing quote at the end of
a phrase query. According to the javadoc, you should
never see this class derived from Error being thrown
(oops?)

I did some searching on the archive for this list, and
turned up some old articles from 2001 in which Brian
Goetz was asking Paul Friedman for an example of a
query like that, so he could fix it. I saw that Paul
posted a sample, but I never saw a response back from
Brian.  Looking in the CHANGES.txt file all the way
back to 1.0 there is no mention of any modification
regarding exceptions or errors. 

Has anyone discovered a good practice for trapping
syntax problems and then returning an informative
message to the user on how to fix their query? I would
be interested in code samples as well if you have any
:)

Thanks a lot! 

-Tom






RE: Reopen IndexWriter after delete?

2003-11-12 Thread Wilton, Reece
I agree it's a bit of a strange design.

It seems that there should be one class that handles all modifications
of the index.  Usually you'd only have one instance of this, so you
wouldn't need to open and close it all the time.  (I'm basically writing
one of these classes myself to simplify my code; I'm sure other people
have written a similar class.)  There should be another class that is
responsible for searching.  You may have multiple instances of this so
you can have multiple headends searching the index.

The IndexWriter and IndexReader almost do this separation.  It seems
that if the IndexWriter had the delete functionality, instead of the
IndexReader, things would be a lot simpler (from a synchronization
standpoint).  Maybe Otis, Erik or Doug could suggest why this may or may
not be a good idea.
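For what it's worth, the switching logic such a wrapper class needs might be sketched like this. All names here are invented (this is not an existing Lucene class), and the actual Lucene calls are elided as comments; the point is only the writer/reader state machine that callers no longer have to juggle themselves.

```java
// Sketch of a single "modifications" facade: it owns the writer/reader
// switch, so callers never worry about open/close ordering. Lucene
// calls appear only as comments.
class SingleWriterFacade {
    private boolean writerOpen = false;
    private boolean readerOpen = false;

    synchronized void addDocument() {          // would take (Document doc)
        ensureWriter();
        // writer.addDocument(doc);
    }

    synchronized void deleteDocuments() {      // would take (Term term)
        ensureReader();
        // reader.delete(term);
    }

    private void ensureWriter() {
        if (readerOpen) { /* reader.close(); */ readerOpen = false; }
        if (!writerOpen) { /* writer = new IndexWriter(dir, analyzer, false); */ writerOpen = true; }
    }

    private void ensureReader() {
        if (writerOpen) { /* writer.close(); */ writerOpen = false; }
        if (!readerOpen) { /* reader = IndexReader.open(dir); */ readerOpen = true; }
    }

    boolean writerIsOpen() { return writerOpen; }
    boolean readerIsOpen() { return readerOpen; }
}
```

Since every method is synchronized on the one instance, adds and deletes from different threads also serialize automatically.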

-Reece

-Original Message-
From: Dror Matalon [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 12, 2003 12:06 PM
To: Lucene Users List
Subject: Re: Reopen IndexWriter after delete?

Which begs the question: why do you need to use an IndexReader rather
than an IndexWriter to delete an item?

On Tue, Nov 11, 2003 at 02:46:37PM -0800, Otis Gospodnetic wrote:
> > 1).  If I delete a term using an IndexReader, can I use an existing
> > IndexWriter to write to the index?  Or do I need to close and reopen
> > the IndexWriter?
> 
> No.  You should close IndexWriter first, then open IndexReader, then
> call delete, then close IndexReader, and then open a new IndexWriter.
> 
> > 2).  Is it safe to call IndexReader.delete(term) while an IndexWriter
> > is writing?  Or should I be synchronizing these two tasks so only one
> > occurs at a time?
> 
> No, it is not safe.  You should close the IndexWriter, then delete the
> document and close IndexReader, and then get a new IndexWriter and
> continue writing.
> 
> Incidentally, I just wrote a section about concurrency issues and about
> locking in Lucene for the upcoming Lucene book.
> 
> Otis

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com




RE: Can use Lucene be used for this

2003-11-12 Thread Majerus, John P.
Hello,
This has probably been put forth on the list before, but how about the following 
approach for leftmost wildcard searches, at least for single term searches?

Reverse the character order of all words after they're stemmed and before
they're added to a special reverse-character-order (RCO) index. Any time a
wildcard is found at the beginning of a search term, the special index would
be engaged. A search for "*bar" would then be converted to a search for
"rab*" on the RCO index, the search would find "raboof", and this result
would be unreversed upon display to yield "foobar".

Rene's special index could be several times larger in entry count, depending on the 
average length of the contained terms. A reverse-character-order index is the same 
size as its regular counterpart.
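The two transformations involved (reversing terms at index time, rewriting a leading-wildcard query for the RCO index) might be sketched as follows; the class and method names are invented for illustration.

```java
// Sketch of the reverse-character-order (RCO) idea: terms are stored
// reversed, and a leading-wildcard query becomes a plain prefix query
// against the reversed index.
class ReverseOrderIndex {

    /** Reverse a term before adding it to the RCO index ("foobar" -> "raboof"). */
    static String reverse(String term) {
        return new StringBuffer(term).reverse().toString();
    }

    /** Rewrite a leading-wildcard query for the RCO index ("*bar" -> "rab*"). */
    static String rewriteForRco(String query) {
        if (!query.startsWith("*")) {
            return query;   // no leading wildcard: the normal index handles it
        }
        return reverse(query.substring(1)) + "*";
    }
}
```

Hits coming back from the RCO index would be reversed once more before display, recovering the original term.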

Cheers,
John
-Original Message-
From: Hackl, Rene [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 12, 2003 6:34 AM
To: 'Lucene Users List'
Subject: Re: Can use Lucene be used for this


>> col2 like %aa%

> Lucene doesn't handle queries where the start of the term is not known
> very efficiently.

Is it really able to handle them at all? I thought "*foo"-type queries were
not supported.

That's because I build two indexes for the purpose of simultaneous left and
right truncation: one "normal" index and another special one, which takes
tokens and breaks them down; for instance, "foobar" would also be indexed as
"oobar" and "obar". For a query "*oba*", the left wildcard would cause the
special index to be searched for "oba*"; queries without left truncation
would search the normal index.

The special index is created with maxFieldLength = 10

build-time specialIndex vs. normalIndex: +60%
index size specialIndex vs. normalIndex: +240%
index size specialIndex vs. originalDocSize: +60%

Query execution is still very fast on a 3GB specialIndex. 

I guess the usability depends on how large your document collection is and
what kind of search functionality you need. The drawbacks of this approach
are that proximity and phrase searches on the special index are busted. 

Would it make sense to prevent creating the prx-file to reduce index size
when not offering that kind of search anyway? Is it possible at all?

Best regards,
René




Poor Performance when searching for 500+ terms

2003-11-12 Thread Jie Yang
I know this is rare, but I am building an application
that submits searches having 500+ search terms. A
general example would be:

field1:w1 OR field1:w2 OR ... OR field1:w500

For 1 million documents, the performance is OK if
field1 in each document has fewer than 50 terms; I can
get results in under 1 second. But if field1 averages
more than 400 terms per document, the performance
degrades to around 6 seconds.

Is there anyway to improve this? 

And my second question is that my query often comes
with an AND condition with another search word, for
example:

field2:w AND (field1:w1 OR field1:w2, ... field1:w500)

field2:w will return fewer than 1,000 records out
of 1 million, so I thought I could use a
StringFilter object, i.e. search on field2:w first and
thus limit the 500-term OR to only those 1,000
results, somewhat like a join in a database. But I
checked the code and see that IndexSearcher always
performs the 500 disk lookups before calling the
filter object. Any suggestions on this?
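If memory serves, this "restrict first" idea is essentially what Lucene's Filter abstraction is for: a filter yields a java.util.BitSet of allowed documents that the searcher consults. The joining step itself can be sketched with plain BitSets (the class and method names here are invented):

```java
import java.util.BitSet;

// Sketch of restricting an expensive 500-term OR to the hits of a cheap
// query: run field2:w once, keep its hits as a BitSet, and intersect.
class RestrictedSearch {

    /** Documents matching the big OR, restricted to the cheap query's hits. */
    static BitSet restrict(BitSet cheapHits, BitSet orHits) {
        BitSet result = (BitSet) cheapHits.clone();
        result.and(orHits);
        return result;
    }
}
```

The open question in the message stands, though: whether the searcher can be made to skip the per-term disk reads for documents already excluded by the filter, rather than filtering only at scoring time.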

Also, does Lucene cache results in memory? I see the
performance tends to get better after a few runs,
especially on searches on fields having a small number
of terms. If so, can I manipulate the cache size
somehow to accommodate fields with a large number of
terms?

Many thanks.






Re: Reopen IndexWriter after delete?

2003-11-12 Thread Dror Matalon
Which begs the question: why do you need to use an IndexReader rather
than an IndexWriter to delete an item?

On Tue, Nov 11, 2003 at 02:46:37PM -0800, Otis Gospodnetic wrote:
> > 1).  If I delete a term using an IndexReader, can I use an existing
> > IndexWriter to write to the index?  Or do I need to close and reopen
> > the IndexWriter?
> 
> No.  You should close IndexWriter first, then open IndexReader, then
> call delete, then close IndexReader, and then open a new IndexWriter.
> 
> > 2).  Is it safe to call IndexReader.delete(term) while an IndexWriter
> > is
> > writing?  Or should I be synchronizing these two tasks so only one
> > occurs at a time?
> 
> No, it is not safe.  You should close the IndexWriter, then delete the
> document and close IndexReader, and then get a new IndexWriter and
> continue writing.
> 
> Incidentally, I just wrote a section about concurrency issues and about
> locking in Lucene for the upcoming Lucene book.
> 
> Otis

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com




Latent Semantic Indexing

2003-11-12 Thread Ralf Bierig
Does Lucene implement Latent Semantic Indexing? Examples? 

Ralf






Vector Space Model in Lucene?

2003-11-12 Thread ambiesense
Hi,

does Lucene implement a Vector Space Model? If yes, does anybody have an
example of how to use it?

Cheers,
Ralf






Connection Pooling

2003-11-12 Thread Elsa Hernandez
Hi! Does anyone have code for a connection pool? I am using JDK 1.3.1.
Thank you!


Re: Index pdf files with your content in lucene.

2003-11-12 Thread Ernesto De Santis
Hello

Well, zipping the files did not work.

I can send the files by personal email to anybody who wants them.

And if somebody can post them on a web site, very cool.
I can't post them on a web site myself.

Ernesto.





Re: Boost in Query Parser

2003-11-12 Thread Erik Hatcher
On Wednesday, November 12, 2003, at 10:53  AM, MOYSE Gilles (Cetelem) 
wrote:
> Hello.
>
> I've made a Filter which recognizes special words and returns them in a
> "boosted form", in a QueryParser sense. For instance, when the filter
> receives "special_word", it returns "special_word^3", so as to boost it.
> The problem is that the QueryParser understands the boost syntax when the
> string is given as an argument to the "parse" function, but does not get
> it when it is generated by a filter in the Analyzer. So, when my filter
> transforms "special_word" to "special_word^3", the QueryParser does not
> create a Query object with "special_word" as the value to look for and a
> boost of 3, but with "special_word^3" to search and a boost of 1. Of
> course, it does not match anything.
>
> Does anyone know a solution to that problem? Do I have to write my own
> QueryParser from the beginning, or do I just have to correct 2 or 3 lines
> of the original QueryParser to make it work the way I'd like it to work?
One idea is to pre-process the string before handing it to QueryParser
and do a string replacement with the boosting (^3) added appropriately.

Writing your own QueryParser is certainly a possibility.  There is
nothing really to "correct" with the original QueryParser in this
regard, as it is working by design, and there really is no way to feed
expressions from the analysis back into the parsing - that doesn't
really seem like a good idea to me.  You can probably get away with
subclassing QueryParser and overriding getFieldQuery to do what you
want with the String passed in, and calling setBoost (rather than
trying to inject "^3").
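To make that division of labor concrete: the word-to-boost mapping can live in a tiny lookup class, and the (Lucene-dependent) QueryParser subclass would consult it from getFieldQuery and call setBoost on the resulting Query. A sketch, with all names invented:

```java
import java.util.HashMap;
import java.util.Map;

// The boost table the suggestion implies: a QueryParser subclass would
// look terms up here inside getFieldQuery and call setBoost on the
// resulting Query, instead of injecting "^3" into the token text.
class BoostTable {
    private final Map boosts = new HashMap();

    void put(String word, float boost) {
        boosts.put(word, new Float(boost));
    }

    /** Boost for a term; ordinary words default to 1.0f. */
    float boostFor(String word) {
        Float b = (Float) boosts.get(word);
        return (b == null) ? 1.0f : b.floatValue();
    }

    // In the QueryParser subclass, roughly:
    //   Query q = super.getFieldQuery(field, analyzer, text);
    //   q.setBoost(table.boostFor(text));
    //   return q;
}
```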

	Erik



Re: Wildcard search and HOST tokens

2003-11-12 Thread Erik Hatcher
On Wednesday, November 12, 2003, at 10:43  AM, Pascal Nadal wrote:
> the HostFilter I wrote (that tokenizes again HOST tokens) works
> wonderfully.

I wonder if this has been fixed since Lucene 1.2 - could you try the
latest 1.3RC build available and see if it works without your
HostFilter?

	Erik



Boost in Query Parser

2003-11-12 Thread MOYSE Gilles (Cetelem)
Hello.

I've made a Filter which recognizes special words and returns them in a
"boosted form", in a QueryParser sense.
For instance, when the filter receives "special_word", it returns
"special_word^3", so as to boost it.
The problem is that the QueryParser understands the boost syntax when the
string is given as an argument to the "parse" function, but does not get it
when it is generated by a filter in the Analyzer.
So, when my filter transforms "special_word" to "special_word^3", the
QueryParser does not create a Query object with "special_word" as the value
to look for and a boost of 3, but with "special_word^3" to search and a
boost of 1. Of course, it does not match anything.

Does anyone know a solution to that problem? Do I have to write my own
QueryParser from the beginning, or do I just have to correct 2 or 3 lines of
the original QueryParser to make it work the way I'd like it to work?

Thanks a lot.

Gilles Moyse.

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 12, 2003 15:16
To: Lucene Users List
Subject: Re: Can use Lucene be used for this


On Wednesday, November 12, 2003, at 07:34  AM, Hackl, Rene wrote:
>>> col2 like %aa%
>
>> Lucene doesn't handle queries where the start of the term is not known
>> very efficiently.
>
> Is it really able to handle them at all? I thought "*foo"-type queries
> were not supported.

They are not supported by the QueryParser, but an API-created
WildcardQuery supports it.

I certainly do not recommend using prefix-style wildcard queries 
though, knowing what happens under the covers.

Erik




Re: Re: Wildcard search and HOST tokens

2003-11-12 Thread Pascal Nadal
when I do a query.toString(), it prints exactly my query.

example: title:FE.MENU*  gives  title:FE.MENU* FE.MENU*  when I search in
the default field and the field 'title'.

the HostFilter I wrote (that tokenizes again HOST tokens) works wonderfully.

PS: thanks Erik

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 12, 2003 12:43
To: Lucene Users List
Subject: Re: Wildcard search and HOST tokens

On Wednesday, November 12, 2003, at 05:55 AM, Pascal Nadal wrote:

> My lucene indexes contain fields with values like this www.xxx.yyy.zzz
> which are treated as HOST tokens.
> My problem is the following: search results never contain documents with
> such fields when doing a wildcard query or a fuzzy query. Only searches
> on full field values work.
>
> example queries: www* www.* www.xxx* www?xxx?yyy www.yyy.y~ or just yyy
>
> I'm using Lucene 1.2 and the StandardAnalyzer. It seems that the '.' is
> the problem.
>
> Is it a bug?

What does query.toString("") return? This generally has
a lot of clues on what happened in QueryParser.

Erik



Re: Can use Lucene be used for this

2003-11-12 Thread Erik Hatcher
On Wednesday, November 12, 2003, at 07:34  AM, Hackl, Rene wrote:
>>> col2 like %aa%
>>
>> Lucene doesn't handle queries where the start of the term is not known
>> very efficiently.
>
> Is it really able to handle them at all? I thought "*foo"-type queries
> were not supported.

They are not supported by the QueryParser, but an API-created
WildcardQuery supports it.

I certainly do not recommend using prefix-style wildcard queries
though, knowing what happens under the covers.

	Erik



Re: Overview to Lucene

2003-11-12 Thread petite_abeille
Hi Ralf,

On Nov 12, 2003, at 14:06, [EMAIL PROTECTED] wrote:

> Does anybody know good articles which demonstrate parts of that or
> give a good start into Lucene?

Otis Gospodnetic's articles are a good starting point:

"Introduction to Text Indexing with Apache Jakarta Lucene"
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html
"Advanced Text Indexing with Lucene"
http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html
Cheers,

PA.



Overview to Lucene

2003-11-12 Thread ambiesense
Hello group,

can somebody give me an overview of Lucene? What high-level components does
it include? In particular, I want to answer the following questions regarding
available functionality:

1) Does Lucene provide a Vector Space IR Model (with TF/IDF and Cosine
Similarity)?
2) Does Lucene provide any collaborative filtering algorithms like
correlation / user ranking etc.?
3) Does Lucene provide a Probabilistic Model?
4) Does Lucene provide anything for indexing XML documents and using XML
document structure for that? Or does it just work on flat text files?

Does anybody know good articles which demonstrate parts of that or give a
good start into Lucene?

Thanks,
Ralf






Re: Can use Lucene be used for this

2003-11-12 Thread Hackl, Rene
>> col2 like %aa%

> Lucene doesn't handle queries where the start of the term is not known
> very efficiently.

Is it really able to handle them at all? I thought "*foo"-type queries were
not supported.

That's because I build two indexes for the purpose of simultaneous left and
right truncation: one "normal" index and another special one, which takes
tokens and breaks them down; for instance, "foobar" would also be indexed as
"oobar" and "obar". For a query "*oba*", the left wildcard would cause the
special index to be searched for "oba*"; queries without left truncation
would search the normal index.

The special index is created with maxFieldLength = 10

build-time specialIndex vs. normalIndex: +60%
index size specialIndex vs. normalIndex: +240%
index size specialIndex vs. originalDocSize: +60%

Query execution is still very fast on a 3GB specialIndex. 
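For what it's worth, the break-down step behind the special index might be sketched as follows. The class and method names are invented, and the minimum suffix length shown is an assumption (chosen to reproduce the "foobar" -> "oobar", "obar" example).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the suffix expansion behind the special index: every token
// is indexed along with its suffixes, so a query "*oba*" can run as a
// plain prefix query "oba*" against this index.
class SuffixExpander {

    /** All suffixes of the token that are at least minLength chars long. */
    static List expand(String token, int minLength) {
        List suffixes = new ArrayList();
        for (int i = 0; i <= token.length() - minLength; i++) {
            suffixes.add(token.substring(i));
        }
        return suffixes;
    }
}
```

The minimum length bounds the blow-up in term count; the +240% index-size figure above suggests roughly what that trade-off costs in practice.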

I guess the usability depends on how large your document collection is and
what kind of search functionality you need. The drawbacks of this approach
are that proximity and phrase searches on the special index are busted. 

Would it make sense to prevent creating the prx-file to reduce index size
when not offering that kind of search anyway? Is it possible at all?

Best regards,
René




Re: Reopen IndexWriter after delete?

2003-11-12 Thread Otis Gospodnetic
Correct.  write.lock is used for that.

Otis

--- Morus Walter <[EMAIL PROTECTED]> wrote:
> Otis Gospodnetic writes:
> > 
> > No, it is not safe.  You should close the IndexWriter, then delete
> the
> > document and close IndexReader, and then get a new IndexWriter and
> > continue writing.
> > 
> IIRC lucene takes care that you do so.
> Locking prevents you from having an open IndexWriter and
> modifying the index with an IndexReader (and vice versa).
> 
> Morus
> 






Re: Wildcard search and HOST tokens

2003-11-12 Thread Erik Hatcher
On Wednesday, November 12, 2003, at 05:55  AM, Pascal Nadal wrote:
> My lucene indexes contain fields with values like this www.xxx.yyy.zzz
> which are treated as HOST tokens.
> My problem is the following: search results never contain documents with
> such fields when doing a wildcard query or a fuzzy query. Only searches
> on full field values work.
>
> example queries: www*  www.* www.xxx* www?xxx?yyy www.yyy.y~ or just yyy
>
> I'm using Lucene 1.2 and the StandardAnalyzer. It seems that the '.' is
> the problem.
>
> Is it a bug?

What does query.toString("") return?  This generally has
a lot of clues on what happened in QueryParser.

	Erik



Wildcard search and HOST tokens

2003-11-12 Thread Pascal Nadal
My lucene indexes contain fields with values like this  www.xxx.yyy.zzz
which are treated as HOST tokens.
My problem is the following: search results never contain documents with
such fields when doing a wildcard query or a fuzzy query. Only searches on
full field values work.
 
example queries: www*  www.* www.xxx* www?xxx?yyy www.yyy.y~ or just yyy
 
I'm using Lucene 1.2 and the StandardAnalyzer. It seems that the '.' is the
problem.

Is it a bug?

I wrote a HostFilter class which re-tokenizes HOST tokens, and it seems to
work fine (full field values or wildcard queries).


Re: Can use Lucene be used for this

2003-11-12 Thread Eric Jain
> I need to retrieve the value with simple queries on the data like:

> col1 like %ab&,

What does the ampersand mean?

> col2 like %aa%

Lucene doesn't handle queries where the start of the term is not known
very efficiently.

> and col3 sounds like ;

No experience with this, but you could probably use the Soundex encoder
from http://jakarta.apache.org/commons/codec/ for transforming words
before indexing them (and before searching for them).
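For illustration, a stripped-down Soundex encoder might look like this. This is a simplified sketch (it ignores the H/W edge cases the full algorithm handles), so in practice prefer the codec-package encoder linked above; the class name is invented.

```java
// Simplified Soundex sketch: keep the first letter, map remaining
// consonants to digit classes, drop vowels, collapse adjacent repeats,
// pad or truncate to 4 characters.
class SimpleSoundex {
    // Digit class for each letter a..z; '0' marks vowels and near-vowels.
    private static final String CODES = "01230120022455012623010202";

    static String encode(String word) {
        String w = word.toUpperCase();
        StringBuffer out = new StringBuffer();
        out.append(w.charAt(0));
        char last = codeOf(w.charAt(0));
        for (int i = 1; i < w.length() && out.length() < 4; i++) {
            char c = codeOf(w.charAt(i));
            if (c != '0' && c != last) {
                out.append(c);
            }
            last = c;
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    private static char codeOf(char c) {
        return (c < 'A' || c > 'Z') ? '0' : CODES.charAt(c - 'A');
    }
}
```

Indexing encode(word) alongside (or instead of) the raw word, and encoding query terms the same way, gives the "sounds like" behavior.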

--
Eric Jain





Re: Document Clustering

2003-11-12 Thread Eric Jain
> I was basically thinking of using lucene to generate document
> vectors, and writing my custom similarity algorithms for measuring
> distance.
>
> I could then run this data through k-means or SOM algorithms for
> calculating clusters

First of all, I think it would already be great if there was some
functionality for simply storing document vectors during the indexing
process, so you could later on use

  IndexSearcher.docTerms(int i)

to retrieve a BitSet or an array of floats that are weighted so that
frequent terms have lower values.

One difficulty I see here is that terms don't seem to have any unique
identifiers, guess you'd have to manage those yourself...

--
Eric Jain





Re: Reopen IndexWriter after delete?

2003-11-12 Thread Morus Walter
Otis Gospodnetic writes:
> 
> No, it is not safe.  You should close the IndexWriter, then delete the
> document and close IndexReader, and then get a new IndexWriter and
> continue writing.
> 
IIRC lucene takes care that you do so.
Locking prevents you from having an open IndexWriter and
modifying the index with an IndexReader (and vice versa).

Morus

