[PLAN]: SAXIndexer, indexing database via XML gateway

2003-06-06 Thread Che Dong
The current weblucene project includes a SAX-based XML source indexer: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/weblucene/weblucene/webapp/WEB-INF/src/com/chedong/weblucene/index/ It can parse an XML data source like the following example: <?xml version="1.0" encoding="GB2312"?> <Table> <Record id="1">

RE: java.lang.IllegalArgumentException: attempt to access a deleted document

2003-06-06 Thread Rob Outar
I added the following code: for (int i = 0; i < numOfDocs; i++) { if (!reader.isDeleted(i)) { doc = reader.document(i); docs[i] = doc.get(SearchEngineConstants.REPOSITORY_PATH); } } return docs;

String similarity search vs. typical IR application...

2003-06-06 Thread Jim Hargrave
Our application is a string similarity searcher where the query is an input string and we want to find all fuzzy variants of the input string in the DB. The score is basically Dice's coefficient: 2C/(Q+D), where C is the number of terms (n-grams) in common, Q is the number of unique query terms
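The score described above can be sketched in plain Java, without any Lucene API. This is only an illustration under stated assumptions: the snippet is cut off before D is defined, so D is assumed here to be the number of unique n-grams in the candidate string, and the class and method names (DiceSimilarity, ngrams, dice) are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class DiceSimilarity {

    /** Returns the set of unique character n-grams of s. */
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    /**
     * Dice's coefficient 2C / (Q + D), where C is the number of shared
     * n-grams, Q the number of unique query n-grams, and D (an assumption,
     * see above) the number of unique n-grams in the candidate string.
     */
    static double dice(String query, String candidate, int n) {
        Set<String> q = ngrams(query, n);
        Set<String> d = ngrams(candidate, n);
        if (q.isEmpty() && d.isEmpty()) {
            return 0.0; // both strings shorter than n: nothing to compare
        }
        Set<String> common = new HashSet<>(q);
        common.retainAll(d);
        return 2.0 * common.size() / (q.size() + d.size());
    }

    public static void main(String[] args) {
        // "night" -> {ni, ig, gh, ht}, "nacht" -> {na, ac, ch, ht}; one bigram in common.
        System.out.println(dice("night", "nacht", 2)); // prints 0.25
    }
}
```

Scoring every stored string this way is a linear scan; an index over n-grams would only narrow the candidate set before this score is applied.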

Re: String similarity search vs. typical IR application...

2003-06-06 Thread Jim Hargrave
Probably shouldn't have added that last bit. Our app isn't a DNA searcher. But DASG+Lev does look interesting. Our app is a linguistic application. We want to search for sentences which have many ngrams in common and rank them based on the score below. Similar to the TELLTALE system (do a

Re: String similarity search vs. typical IR application...

2003-06-06 Thread Leo Galambos
I see. Are you looking for this: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html On the other hand, if n is not fixed, you still have a problem. As far as I can tell from this list, it seems that Lucene reads a dictionary (of terms) into memory, and it also allocates

Special Character Search

2003-06-06 Thread Ramrakhiani, Vikas
Hi, I am trying to implement special character search. If I do a search with the query title:java\-perl then documents with the title java-perl as well as java+perl come up. While the first result is desirable, the second one is not. I want to know what is going wrong here? Also, I am using

problems with search on Russian content

2003-06-06 Thread Vladimir
Hi! I have lucene-1.3-rc1 and jdk1.3.1. What should I change in the demo example to search HTML files encoded in Cp1251? Thanks, Vladimir.

Trouble running web demo

2003-06-06 Thread psethi
Hi, when I run the web demo I get an error that says: ERROR opening the Index - contact sysadmin! While parsing query: /opt/lucene/index not a directory. I do not have permission to modify /opt, so I have not created an index directory in it. Thus I do not use the default as given

RE: String similarity search vs. typical IR application...

2003-06-06 Thread Frank Burough
I have seen some interesting work done on storing DNA sequence as a set of common patterns with unique sequence between them. If one uses an analyzer to break sequence into its set of patterns and unique sequence then Lucene could be used to search for exact pattern matches. I know of only one

Re: String similarity search vs. typical IR application...

2003-06-06 Thread Leo Galambos
Exact matches are not ideal for DNA applications, I guess. I am not a DNA expert, but those guys often need a feature that is termed ``fuzzy''[*] in Lucene. They need Levenshtein's and Hamming's metrics, and I think that Lucene has many drawbacks which disallow effective implementations. On the
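For reference, the two metrics named above can be sketched in plain Java. This is a textbook formulation, not Lucene's fuzzy matching code; the class and method names (EditDistances, hamming, levenshtein) are hypothetical.

```java
final class EditDistances {

    /** Hamming distance: number of differing positions; defined only for equal-length strings. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    /** Levenshtein distance: minimum number of single-character insertions, deletions, or substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous row of the DP table
        int[] curr = new int[b.length() + 1]; // distances for the row being filled
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

The Levenshtein routine costs time proportional to the product of the two string lengths per comparison, which is one reason applying it to every term or sequence in a large collection is expensive.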

RE: Trouble running web demo

2003-06-06 Thread xx28
Try changing the permissions of the index directory to 777. = Original Message From Lucene Users List [EMAIL PROTECTED] = hi, When i run the web demo i get an error that says ERROR opening the Index - contact sysadmin! While parsing query: /opt/lucene/index not a directory i do not

RE: String similarity search vs. typical IR application...

2003-06-06 Thread Frank Burough
The method I mention was based on using lempel-ziv (I expect my spelling is way off on this) algorithms used in lz compression. It relied only on exact matches of short stretches of DNA separated by non-matching sequence. The idea was to find stretches of sequence that had patterns in common,

Where to get stopword lists?

2003-06-06 Thread Ulrich Mayring
Hello, does anyone know of good stopword lists for use with Lucene? I'm interested in English and German lists. The default lists aren't very complete, for example the English list doesn't contain words like every, because or until and the German list misses dem and des (definite articles).

Re: Where to get stopword lists?

2003-06-06 Thread Doug Cutting
Ulrich Mayring wrote: does anyone know of good stopword lists for use with Lucene? I'm interested in English and German lists. The Snowball project has good stop lists. See: http://snowball.tartarus.org/ http://snowball.tartarus.org/english/stop.txt

Re: Where to get stopword lists?

2003-06-06 Thread Otis Gospodnetic
There is a much more complete list of English stop words included in the Lucene article (the intro one) on Onjava.com. I can't help you with German stop words. Otis --- Ulrich Mayring [EMAIL PROTECTED] wrote: Hello, does anyone know of good stopword lists for use with Lucene? I'm

Re: Where to get stopword lists?

2003-06-06 Thread Ulrich Mayring
Doug Cutting wrote: Snowball stemmers are pre-packaged for use with Lucene at: http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/ These look interesting. Am I right in assuming that in order to use these stemmers, I have to write an Analyzer and in its tokenStream method I return

Re: Where to get stopword lists?

2003-06-06 Thread Bryan LaPlante
I found some handy tools in the org.apache.lucene.analysis.de package. Using the WordListLoader class, you can load your stop words in a variety of ways, including from a line-delimited text file, thanks to Gerhard Schwarz. Bryan LaPlante - Original Message - From: Ulrich Mayring [EMAIL

Re: Where to get stopword lists?

2003-06-06 Thread Anthony Eden
There is already an analyzer available in the sandbox. Take a look here: http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/ Sincerely, Anthony Eden Ulrich Mayring wrote: Doug Cutting wrote: Snowball stemmers are pre-packaged for use with Lucene at:

Re: Where to get stopword lists?

2003-06-06 Thread Leo Galambos
Ulrich Mayring wrote: Hello, does anyone know of good stopword lists for use with Lucene? I'm interested in English and German lists. What does ``good'' mean? It depends on your corpus IMHO. The best way to get a ``good'' stop list is an analysis based on idf. Thus, index
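One way to read the idf suggestion above: terms that occur in nearly every document have a very low idf and are natural stop-word candidates. The following plain-Java sketch works on already tokenized documents rather than on a Lucene index; the class name StopListByDf, the method candidates, and the minFraction threshold are hypothetical and not taken from the message.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class StopListByDf {

    /**
     * Returns the terms that appear in at least minFraction of the documents.
     * A high document frequency corresponds to a low idf.
     */
    static Set<String> candidates(List<List<String>> docs, double minFraction) {
        // Document frequency: number of documents containing each term at least once.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) { // count each term once per document
                df.merge(term, 1, Integer::sum);
            }
        }
        Set<String> stopWords = new TreeSet<>(); // sorted for stable output
        for (Map.Entry<String, Integer> entry : df.entrySet()) {
            if (entry.getValue() >= minFraction * docs.size()) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }
}
```

The threshold is corpus-dependent, which matches the point that a ``good'' list depends on the collection being indexed; the resulting candidates would still need a manual review.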

Re: String similarity search vs. typical IR application...

2003-06-06 Thread Ype Kingma
On Thursday 05 June 2003 14:12, Jim Hargrave wrote: Our application is a string similarity searcher where the query is an input string and we want to find all fuzzy variants of the input string in the DB. The score is basically Dice's coefficient: 2C/(Q+D), where C is the number of terms