[PLAN]: SAXIndexer, indexing database via XML gateway

2003-06-06 Thread Che Dong
The current weblucene project includes a SAX-based XML source indexer:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/weblucene/weblucene/webapp/WEB-INF/src/com/chedong/weblucene/index/

It can parse XML data sources like the following example:

<?xml version="1.0" encoding="GB2312"?>
<Table>
 <Record id="1">
  <Field name="Id">39314</Field>
  <Field name="Title">title of document</Field>
  <Field name="Author">chedong</Field>
  <Field name="Content">blah blah</Field>
  <Field name="PubTime">2003-06-06</Field>
  <Index name="FullIndex">Title,Content</Index>
  <Index name="TitleIndex" token="no">Author</Index>
 </Record>
 ...
</Table>

I use two Index elements in each Record block to specify the field => index mapping. The
SAXIndexer will parse the Id, Title, Author, Content and PubTime fields of this XML source
into Lucene store-only fields and create another two index fields:
one index field with Title + Content
one index field with Author, without tokenizing

Recently I have noticed that more and more applications provide an XML interface very similar to RSS.
For example, you can even dump a table into XML output from phpMyAdmin, like the following:
<?xml version="1.0" encoding="iso-8859-1"?>
<mysql>
  <!-- Table user -->
<user>
<Host>localhost</Host>
<User>root</User>
<Password></Password>
<Select_priv>Y</Select_priv>
<Insert_priv>Y</Insert_priv>
<Update_priv>Y</Update_priv>
<Delete_priv>Y</Delete_priv>
<Create_priv>Y</Create_priv>
<Drop_priv>Y</Drop_priv>
<Reload_priv>Y</Reload_priv>
<Shutdown_priv>Y</Shutdown_priv>
<Process_priv>Y</Process_priv>
<File_priv>Y</File_priv>
<Grant_priv>Y</Grant_priv>
<References_priv>Y</References_priv>
<Index_priv>Y</Index_priv>
<Alter_priv>Y</Alter_priv>
<Show_db_priv>Y</Show_db_priv>
<Super_priv>Y</Super_priv>
<Create_tmp_table_priv>Y</Create_tmp_table_priv>
<Lock_tables_priv>Y</Lock_tables_priv>
<Execute_priv>Y</Execute_priv>
<Repl_slave_priv>Y</Repl_slave_priv>
<Repl_client_priv>Y</Repl_client_priv>
<ssl_type></ssl_type>
<ssl_cipher></ssl_cipher>
<x509_issuer></x509_issuer>
<x509_subject></x509_subject>
<max_questions>0</max_questions>
<max_updates>0</max_updates>
<max_connections>0</max_connections>
</user>
...
</mysql>

The SAXIndexer would be able to index a database XML dump directly if the SAXIndexer let
an external program specify the field => index mapping rule,
for example:
java IndexRunner -c field_index_mapping.conf -i http://localhost/table_dump.xml

# the config file looks like the following:
FullIndex    Title,Content
AuthorIndex  Author  no
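As an illustration, a minimal sketch of how such a mapping file could be parsed (a hypothetical helper class, not part of weblucene or Lucene):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MappingConfig {
    /** One index definition: the source fields and whether to tokenize. */
    public static class IndexRule {
        public final String[] sourceFields;
        public final boolean tokenize;
        IndexRule(String[] sourceFields, boolean tokenize) {
            this.sourceFields = sourceFields;
            this.tokenize = tokenize;
        }
    }

    /**
     * Parses lines of the form "IndexName  Field1,Field2  [no]".
     * A trailing "no" switches tokenizing off for that index field.
     */
    public static Map<String, IndexRule> parse(String config) {
        Map<String, IndexRule> rules = new LinkedHashMap<String, IndexRule>();
        for (String line : config.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            String[] parts = line.split("\\s+");
            boolean tokenize = !(parts.length > 2 && parts[2].equals("no"));
            rules.put(parts[0], new IndexRule(parts[1].split(","), tokenize));
        }
        return rules;
    }

    public static void main(String[] args) {
        Map<String, IndexRule> rules = parse(
            "# field => index mapping\n" +
            "FullIndex   Title,Content\n" +
            "AuthorIndex  Author  no\n");
        IndexRule full = rules.get("FullIndex");
        System.out.println(full.sourceFields.length + " " + full.tokenize);   // 2 true
        IndexRule author = rules.get("AuthorIndex");
        System.out.println(author.sourceFields[0] + " " + author.tokenize);   // Author false
    }
}
```

An indexer could then walk the parsed rules and, for each Record, concatenate the named source fields into the corresponding index field.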

I hope this SAXIndexer can be added into the Lucene demos, so that Lucene end users can build a
Lucene index from their current database applications.

Regards

Che, Dong
http://www.chedong.com/

RE: java.lang.IllegalArgumentException: attempt to access a deleted document

2003-06-06 Thread Rob Outar
I added the following code:

   for (int i = 0; i < numOfDocs; i++) {
       if (!reader.isDeleted(i)) {
           doc = reader.document(i);
           docs[i] = doc.get(SearchEngineConstants.REPOSITORY_PATH);
       }
   }
   return docs;

but it never goes in the if statement, for every value of i, isDeleted(i) is
returning true?!?  Am I doing something wrong?  I was trying to do what Doug
outlined below.


Thanks,

Rob
-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 04, 2003 12:34 PM
To: Lucene Users List
Subject: Re: java.lang.IllegalArgumentException: attempt to access a
deleted document


Rob Outar wrote:
  public synchronized String[] getDocuments() throws IOException {

 IndexReader reader = null;
 try {
 reader = IndexReader.open(this.indexLocation);
 int numOfDocs  = reader.numDocs();
 String[] docs  = new String[numOfDocs];
 Document doc   = null;

 for (int i = 0; i < numOfDocs; i++) {
 doc = reader.document(i);
 docs[i] = doc.get(SearchEngineConstants.REPOSITORY_PATH);
 }
 return docs;
 }
 finally {
 if (reader != null) {
 reader.close();
 }
 }
 }

The limit of your iteration should be IndexReader.maxDoc(), not
IndexReader.numDocs():

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#maxDoc()

Also, you should first check that each document is not deleted before
calling IndexReader.document(int):

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#isDeleted(int)

Doug


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



String similarity search vs. typical IR application...

2003-06-06 Thread Jim Hargrave
Our application is a string similarity searcher: the query is an input string and
we want to find all fuzzy variants of the input string in the DB.  The score is
basically Dice's coefficient, 2C/(Q+D), where C is the number of terms (n-grams) in
common, Q is the number of unique query terms and D is the number of unique document
terms. Our documents will be sentences.
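For illustration, Dice's coefficient over unique character bigrams can be computed in plain Java like this (a sketch, not Lucene code):

```java
import java.util.HashSet;
import java.util.Set;

public class DiceSimilarity {
    /** Collects the unique character n-grams of a string. */
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<String>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    /** Dice's coefficient 2C/(Q+D) over unique n-grams. */
    static double dice(String query, String doc, int n) {
        Set<String> q = ngrams(query, n);
        Set<String> d = ngrams(doc, n);
        int qSize = q.size(), dSize = d.size();
        q.retainAll(d);                       // q now holds the common n-grams (C)
        return 2.0 * q.size() / (qSize + dSize);
    }

    public static void main(String[] args) {
        // identical strings score 1.0; partially overlapping strings score in between
        System.out.println(dice("night", "night", 2));   // 1.0
        System.out.println(dice("night", "nacht", 2));   // 0.25 (only "ht" in common)
    }
}
```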
 
I know Lucene has a fuzzy search capability - but I assume this would be very slow 
since it must search through the entire term list to find candidates.
 
In order to do the calculation I will need to have 'C' - the number of terms in common 
between query and document. Is there an API that I can call to get this info? Any 
hints on what it will take to modify Lucene to handle these kinds of queries? 
 
BTW: 
Ever consider using Lucene for DNA searching? - this technique could also be used to 
search large DNA databases.
 
Thanks!
 
Jim Hargrave


--
This message may contain confidential information, and is intended only for the use of 
the individual(s) to whom it is addressed.


==


Re: String similarity search vs. typical IR application...

2003-06-06 Thread Jim Hargrave
Probably shouldn't have added that last bit. Our app isn't a DNA searcher. But 
DASG+Lev does look interesting.
 
Our app is a linguistic application. We want to search for sentences which have many 
n-grams in common and rank them based on the score below. It is similar to the TELLTALE 
system (do a Google search for TELLTALE + ngrams), but we are not interested in IR per se; 
we want to compute a score based on pure string similarity. Sentences are docs, 
n-grams are terms.
 
Jim

 [EMAIL PROTECTED] 06/05/03 03:55PM 
AFAIK Lucene is not able to look DNA strings up effectively. You would 
use DASG+Lev (see my previous post - 05/30/2003 1916CEST).

-g-

Jim Hargrave wrote:

Our application is a string similarity searcher where the query is an input string 
and we want to find all fuzzy variants of the input string in the DB.  The Score is 
basically Dice's coefficient: 2C/(Q+D), where C is the number of terms (n-grams) in 
common, Q is the number of unique query terms and D is the number of unique document 
terms. Our documents will be sentences.
 
I know Lucene has a fuzzy search capability - but I assume this would be very slow 
since it must search through the entire term list to find candidates.
 
In order to do the calculation I will need to have 'C' - the number of terms in 
common between query and document. Is there an API that I can call to get this info? 
Any hints on what it will take to modify Lucene to handle these kinds of queries? 
  




-
To unsubscribe, e-mail: [EMAIL PROTECTED] 
For additional commands, e-mail: [EMAIL PROTECTED] 







Re: String similarity search vs. typical IR application...

2003-06-06 Thread Leo Galambos
I see. Are you looking for this: 
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html

On the other hand, if n is not fixed, you still have a problem. As far 
as I read this list, it seems that Lucene reads a dictionary (of terms) 
into memory, and it also allocates one file handle for each of the 
acting terms. This implies you could not break the terms up into n-grams 
and, as a result, you would be left with a slow look-up over the dictionary. I do 
not know if I express it correctly, but my personal feeling is that you 
would rather write your application from scratch.

BTW: If you have nice terms, you could find all their n-gram 
occurrences in the dictionary, and compute a boost factor for each of 
the inverted lists. I.e., bbc is a term in a query, and for the i-list of 
abba, the factor is 1 (bigram bb is there); for the i-list of bbb, the 
factor is 2 (bb appears 2x). Then you use the Similarity class, and it is 
solved. Nevertheless, if the n-grams are not nice and the query is long, 
you will lose a lot of time in the dictionary look-up phase.
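The boost-factor idea above (counting how often a query term's n-grams occur in a dictionary term) can be sketched in plain Java as follows; this is an illustration only, not Lucene code:

```java
import java.util.ArrayList;
import java.util.List;

public class NgramBoost {
    /** Extracts all (possibly repeated) character n-grams of a string. */
    static List<String> ngrams(String s, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    /** Counts occurrences of the query term's n-grams in the dictionary term, with multiplicity. */
    static int boostFactor(String queryTerm, String dictTerm, int n) {
        int factor = 0;
        for (String qGram : ngrams(queryTerm, n)) {
            for (String dGram : ngrams(dictTerm, n)) {
                if (qGram.equals(dGram)) factor++;
            }
        }
        return factor;
    }

    public static void main(String[] args) {
        // the example from the text: query term "bbc" against dictionary terms "abba" and "bbb"
        System.out.println(boostFactor("bbc", "abba", 2));   // 1 ("bb" occurs once)
        System.out.println(boostFactor("bbc", "bbb", 2));    // 2 ("bb" occurs twice)
    }
}
```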

-g-

PS: I'm sorry for my English, just learning...

Jim Hargrave wrote:

Probably shouldn't have added that last bit. Our app isn't a DNA searcher. But DASG+Lev does look interesting.

Our app is a linguistic application. We want to search for sentences which have many ngrams in common and rank them based on the score below. Similar to the TELLTALE system (do a google search TELLTALE + ngrams) - but we are not interested in IR per se - we want to compute a score based on pure string similarity. Sentences are docs, ngrams are terms.

Jim

 



Special Character Search

2003-06-06 Thread Ramrakhiani, Vikas
Hi,

I am trying to implement special character search.
If I do a search with the query title:java\-perl, then documents with the title
java-perl as well as java+perl come up. While the first result is desirable, the
second one is not.
I want to know what is going wrong here?
Also, I am using StandardAnalyzer and would like to retain its tokenizing
feature.
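For what it's worth, the behaviour is consistent with the analyzer splitting on punctuation, so that both titles produce the same tokens. A rough plain-Java imitation of such tokenization (not the actual StandardAnalyzer) shows this:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeDemo {
    /** Very rough approximation of a tokenizer that splits on punctuation. */
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) result.add(t);
        }
        return result;
    }

    public static void main(String[] args) {
        // both titles tokenize to the same terms, so a query on either matches both
        System.out.println(tokens("java-perl"));   // [java, perl]
        System.out.println(tokens("java+perl"));   // [java, perl]
    }
}
```

If the '-' and '+' are both discarded at analysis time, no query against the analyzed field can distinguish the two titles.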

thanks for your help,
vikas.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



problems with search on Russian content

2003-06-06 Thread Vladimir
Hi!

I have lucene-1.3-rc1 and jdk1.3.1.

What do I need to change in the demonstration example to search 
HTML files in the Cp1251 encoding?

Thanks,
Vladimir.
---
Professional hosting for everyone - http://www.host.ru


Trouble running web demo

2003-06-06 Thread psethi
hi,

 When I run the web demo I get an error that says:


 ERROR opening the Index - contact sysadmin!

 While parsing query: /opt/lucene/index not a directory

 I do not have permission to modify /opt, so I have not created an index
 directory in it. Thus I do not use the given default /opt/lucene/index.
 I have changed the configuration files and, as far as I can tell, modified
 the luceneweb.war file accordingly. Is there any other file that I should be
 modifying? Also I might be making an error redeploying luceneweb.war.

 How do I redeploy the file, and what other errors could I be making?

 Thanks,
 Prerak


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: String similarity search vs. typical IR application...

2003-06-06 Thread Frank Burough
I have seen some interesting work done on storing DNA sequence as a set of common 
patterns with unique sequence between them. If one uses an analyzer to break sequence 
into its set of patterns and unique sequence then Lucene could be used to search for 
exact pattern matches. I know of only one sequence search tool that was based on this 
approach. I don't know if it ever left the lab and made it into the mainstream. If I 
have time I will explore this a bit.

Frank Burough



 -Original Message-
 From: Leo Galambos [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, June 05, 2003 5:55 PM
 To: Lucene Users List
 Subject: Re: String similarity search vs. typcial IR application...
 
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: String similarity search vs. typical IR application...

2003-06-06 Thread Leo Galambos
Exact matches are not ideal for DNA applications, I guess. I am not a 
DNA expert, but those guys often need the feature that is termed 
``fuzzy''[*] in Lucene. They need Levenshtein's and Hamming's metrics, 
and I think that Lucene has many drawbacks which prevent effective 
implementations. On the other hand, I am very interested in the method you 
mentioned. Could you give me a reference, please? Thank you.

-g-

[*] why do you use the label ``fuzzy''? It has nothing to do with fuzzy 
logic or fuzzy IR, I guess.

Frank Burough wrote:

I have seen some interesting work done on storing DNA sequence as a set of common patterns with unique sequence between them. If one uses an analyzer to break sequence into its set of patterns and unique sequence then Lucene could be used to search for exact pattern matches. I know of only one sequence search tool that was based on this approach. I don't know if it ever left the lab and made it into the mainstream. If I have time I will explore this a bit.

Frank Burough



 



RE: Trouble running web demo

2003-06-06 Thread xx28
Try changing the permissions on the index directory to 777.

= Original Message From Lucene Users List 
[EMAIL PROTECTED] =

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: String similarity search vs. typical IR application...

2003-06-06 Thread Frank Burough
The method I mention was based on Lempel-Ziv algorithms, as used in LZ compression. It 
relied only on exact matches of short stretches of DNA separated by non-matching sequence. 
The idea was to find stretches of sequence that had patterns in common, then regenerate the 
original sequences and display alignments. I never paid any attention to its match scores, 
since they seemed pretty arbitrary. The focus of this work was on looking for common repeats 
in human sequence, to assist in masking them prior to doing analysis. I have lost touch with 
the original author of this, but I am digging to see if I can extract the papers from my 
filing system. I will post them shortly (I hope!).

Frank


 -Original Message-
 From: Leo Galambos [mailto:[EMAIL PROTECTED] 
 Sent: Friday, June 06, 2003 10:00 AM
 To: Lucene Users List
 Subject: Re: String similarity search vs. typcial IR application...
 
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Where to get stopword lists?

2003-06-06 Thread Ulrich Mayring
Hello,

does anyone know of good stopword lists for use with Lucene? I'm 
interested in English and German lists.

The default lists aren't very complete; for example the English list 
doesn't contain words like "every", "because" or "until", and the German 
list misses "dem" and "des" (definite articles).

Kind regards,

Ulrich



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Where to get stopword lists?

2003-06-06 Thread Doug Cutting
Ulrich Mayring wrote:
does anyone know of good stopword lists for use with Lucene? I'm 
interested in English and German lists.

The Snowball project has good stop lists.

See:

  http://snowball.tartarus.org/
  http://snowball.tartarus.org/english/stop.txt
  http://snowball.tartarus.org/german/stop.txt
Snowball stemmers are pre-packaged for use with Lucene at:

  http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/

This project should be updated to include the Snowball stop lists too. 
I have not had the time to do this.  This would be a great contribution 
if someone who is qualified has the time.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Where to get stopword lists?

2003-06-06 Thread Otis Gospodnetic
There is a much more complete list of English stop words included in
the introductory Lucene article on ONJava.com.
I can't help you with German stop words.

Otis



__
Do you Yahoo!?
Yahoo! Calendar - Free online calendar with sync to Outlook(TM).
http://calendar.yahoo.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Where to get stopword lists?

2003-06-06 Thread Ulrich Mayring
Doug Cutting wrote:
Snowball stemmers are pre-packaged for use with Lucene at:

  http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/

These look interesting. Am I right in assuming that in order to use 
these stemmers, I have to write an Analyzer and in its tokenStream 
method I return a SnowballFilter?

I'm a bit new to Lucene, as you might gather :)

Kind regards,

Ulrich



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Where to get stopword lists?

2003-06-06 Thread Bryan LaPlante
I found some handy tools in the org.apache.lucene.analysis.de package:
using the WordListLoader class you can load up your stop words in a variety
of ways, including a line-delimited text file, thanks to Gerhard Schwarz.

Bryan LaPlante

- Original Message -
From: Ulrich Mayring [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, June 06, 2003 11:36 AM
Subject: Re: Where to get stopword lists?




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Where to get stopword lists?

2003-06-06 Thread Anthony Eden
There is already an analyzer available in the sandbox.  Take a look 
here: http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/

Sincerely,
Anthony Eden


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Where to get stopword lists?

2003-06-06 Thread Leo Galambos
Ulrich Mayring wrote:

does anyone know of good stopword lists for use with Lucene? I'm 
interested in English and German lists.

What does ``good'' mean? It depends on your corpus IMHO. The best way 
to get a ``good'' stop-list is an analysis based on idf: index your 
documents, list out all the terms with a low idf, save them in a file, 
and use them in the next indexing round.

Just a thought...

-g-
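The idf-based approach above can be sketched in plain Java (an illustration with toy documents, not Lucene code; idf is taken as ln(N/df)):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class StopwordFinder {
    /** Returns terms whose idf = ln(N/df) falls below the threshold. */
    static List<String> lowIdfTerms(String[] docs, double threshold) {
        // document frequency: in how many documents each term appears
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (String doc : docs) {
            for (String term : new HashSet<String>(java.util.Arrays.asList(doc.split("\\s+")))) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<String> stopwords = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            double idf = Math.log((double) docs.length / e.getValue());
            if (idf < threshold) stopwords.add(e.getKey());
        }
        java.util.Collections.sort(stopwords);
        return stopwords;
    }

    public static void main(String[] args) {
        String[] docs = {
            "the cat sat",
            "the dog ran",
            "the bird flew"
        };
        // "the" appears in every document, so its idf is 0 and it becomes a stopword
        System.out.println(lowIdfTerms(docs, 0.5));   // [the]
    }
}
```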



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: String similarity search vs. typical IR application...

2003-06-06 Thread Ype Kingma
On Thursday 05 June 2003 14:12, Jim Hargrave wrote:
 Our application is a string similarity searcher where the query is an input
 string and we want to find all fuzzy variants of the input string in the
 DB.  The score is basically Dice's coefficient: 2C/(Q+D), where C is the
 number of terms (n-grams) in common, Q is the number of unique query terms
 and D is the number of unique document terms. Our documents will be
 sentences.

 I know Lucene has a fuzzy search capability - but I assume this would be
 very slow since it must search through the entire term list to find
 candidates.

Fuzzy search is not as fast as searching with direct terms or truncation,
but it does not search _all_ the terms.

 In order to do the calculation I will need to have 'C' - the number of
 terms in common between query and document. Is there an API that I can call
 to get this info? Any hints on what it will take to modify Lucene to handle
 these kinds of queries?

Have a look at the coord() call in the Similarity interface of Lucene 1.3.
It gets called per document with the overlap and the number of query terms when you
search with your own Similarity implementation. It's based on terms
and not on n-grams, so it might be no good in your case.
You might try indexing a 1-gram as a Lucene Term.
In case a 1-gram is only C, A, T or G (the DNA bases), this might be too much
overhead for Lucene to handle...

Good luck,
Ype


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]