String similarity search vs. typcial IR application...

2003-06-06 Thread Jim Hargrave
Our application is a string similarity searcher where the query is an input string and 
we want to find all "fuzzy" variants of the input string in the DB.  The Score is 
basically dice's coefficient: 2C/Q+D, where C is the number of terms (n-grams) in 
common, Q is the number of unique query terms and D is the number of unique document 
terms. Our documents will be sentences.
 
I know Lucene has a fuzzy search capability - but I assume this would be very slow 
since it must search through the entire term list to find candidates.
 
In order to do the calculation I will need to have 'C' - the number of terms in common 
between query and document. Is there an API that I can call to get this info? Any 
hints on what it will take to modify Lucene to handle these kinds of queries? 
 
BTW: 
Ever consider using Lucene for DNA searching? - this technique could also be used to 
search large DNA databases.
 
Thanks!
 
Jim Hargrave


--
This message may contain confidential information, and is intended only for the use of 
the individual(s) to whom it is addressed.


==


Re: String similarity search vs. typcial IR application...

2003-06-06 Thread Leo Galambos
AFAIK Lucene is not able to look DNA strings up effectively. You would 
use DASG+Lev (see my previous post - 05/30/2003 1916CEST).

-g-

Jim Hargrave wrote:

Our application is a string similarity searcher where the query is an input string and we want to find all "fuzzy" variants of the input string in the DB.  The Score is basically dice's coefficient: 2C/Q+D, where C is the number of terms (n-grams) in common, Q is the number of unique query terms and D is the number of unique document terms. Our documents will be sentences.

I know Lucene has a fuzzy search capability - but I assume this would be very slow since it must search through the entire term list to find candidates.

In order to do the calculation I will need to have 'C' - the number of terms in common between query and document. Is there an API that I can call to get this info? Any hints on what it will take to modify Lucene to handle these kinds of queries? 
 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: String similarity search vs. typcial IR application...

2003-06-06 Thread Jim Hargrave
Probably shouldn't have added that last bit. Our app isn't a DNA searcher. But 
DASG+Lev does look interesting.
 
Our app is a linguistic application. We want to search for sentences which have many 
ngrams in common and rank them based on the score below. Similar to the TELLTALE 
system (do a google search TELLTALE + ngrams) - but we are not interested in IR per se 
- we want to compute a score based on pure string similarity. Sentences are docs, 
ngrams are terms.
 
Jim

>>> [EMAIL PROTECTED] 06/05/03 03:55PM >>>
AFAIK Lucene is not able to look DNA strings up effectively. You would 
use DASG+Lev (see my previous post - 05/30/2003 1916CEST).

-g-

Jim Hargrave wrote:

>Our application is a string similarity searcher where the query is an input string 
>and we want to find all "fuzzy" variants of the input string in the DB.  The Score is 
>basically dice's coefficient: 2C/Q+D, where C is the number of terms (n-grams) in 
>common, Q is the number of unique query terms and D is the number of unique document 
>terms. Our documents will be sentences.
> 
>I know Lucene has a fuzzy search capability - but I assume this would be very slow 
>since it must search through the entire term list to find candidates.
> 
>In order to do the calculation I will need to have 'C' - the number of terms in 
>common between query and document. Is there an API that I can call to get this info? 
>Any hints on what it will take to modify Lucene to handle these kinds of queries? 
>  
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED] 
For additional commands, e-mail: [EMAIL PROTECTED] 





--
This message may contain confidential information, and is intended only for the use of 
the individual(s) to whom it is addressed.


==


Re: String similarity search vs. typcial IR application...

2003-06-06 Thread Leo Galambos
I see. Are you looking for this: 
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html

On the other hand, if n is not fixed, you still have a problem. As far 
as I read this list it seems, that Lucene reads a dictionary (of terms) 
into memory, and it also allocates one file handle for each of the 
acting terms. It implies you would not break the terms up into n-grams 
and, as a result, you would use a slow look-up over the dictionary. I do 
not know if I express it correctly, but my personal feeling is, that you 
would rather write your application from scratch.

BTW: If you have "nice terms", you could find all their n-grams 
occurencies in the dictionary, and compute a boost factor for each of 
the inverted lists. I.e., "bbc" is a term in a query, and for i-list of 
"abba", the factor is 1 (bigram "bb" is there), for i-list of "bbb", the 
factor is 2 ("bb" 2x). Then you use the Similarity class, and it is 
solved. Nevertheless, if the n-grams are not nice and the query is long, 
you will lost a lot of time in the dictionary look-up phase.

-g-

PS: I'm sorry for my English, just learning...

Jim Hargrave wrote:

Probably shouldn't have added that last bit. Our app isn't a DNA searcher. But DASG+Lev does look interesting.

Our app is a linguistic application. We want to search for sentences which have many ngrams in common and rank them based on the score below. Similar to the TELLTALE system (do a google search TELLTALE + ngrams) - but we are not interested in IR per se - we want to compute a score based on pure string similarity. Sentences are docs, ngrams are terms.

Jim

 

[EMAIL PROTECTED] 06/05/03 03:55PM >>>
   

AFAIK Lucene is not able to look DNA strings up effectively. You would 
use DASG+Lev (see my previous post - 05/30/2003 1916CEST).

-g-

Jim Hargrave wrote:

 

Our application is a string similarity searcher where the query is an input string and we want to find all "fuzzy" variants of the input string in the DB.  The Score is basically dice's coefficient: 2C/Q+D, where C is the number of terms (n-grams) in common, Q is the number of unique query terms and D is the number of unique document terms. Our documents will be sentences.

I know Lucene has a fuzzy search capability - but I assume this would be very slow since it must search through the entire term list to find candidates.

In order to do the calculation I will need to have 'C' - the number of terms in common between query and document. Is there an API that I can call to get this info? Any hints on what it will take to modify Lucene to handle these kinds of queries? 

   



-
To unsubscribe, e-mail: [EMAIL PROTECTED] 
For additional commands, e-mail: [EMAIL PROTECTED] 





--
This message may contain confidential information, and is intended only for the use of 
the individual(s) to whom it is addressed.
==

 





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: String similarity search vs. typcial IR application...

2003-06-06 Thread Frank Burough
I have seen some interesting work done on storing DNA sequence as a set of common 
patterns with unique sequence between them. If one uses an analyzer to break sequence 
into its set of patterns and unique sequence then Lucene could be used to search for 
exact pattern matches. I know of only one sequence search tool that was based on this 
approach. I don't know if it ever left the lab and made it into the mainstream. If I 
have time I will explore this a bit.

Frank Burough



> -Original Message-
> From: Leo Galambos [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, June 05, 2003 5:55 PM
> To: Lucene Users List
> Subject: Re: String similarity search vs. typcial IR application...
> 
> 
> AFAIK Lucene is not able to look DNA strings up effectively. 
> You would 
> use DASG+Lev (see my previous post - 05/30/2003 1916CEST).
> 
> -g-
> 
> Jim Hargrave wrote:
> 
> >Our application is a string similarity searcher where the 
> query is an 
> >input string and we want to find all "fuzzy" variants of the 
> input string in the DB.  The Score is basically dice's 
> coefficient: 2C/Q+D, where C is the number of terms (n-grams) 
> in common, Q is the number of unique query terms and D is the 
> number of unique document terms. Our documents will be sentences.
> > 
> >I know Lucene has a fuzzy search capability - but I assume 
> this would 
> >be very slow since it must search through the entire term 
> list to find candidates.
> > 
> >In order to do the calculation I will need to have 'C' - the 
> number of 
> >terms in common between query and document. Is there an API 
> that I can call to get this info? Any hints on what it will 
> take to modify Lucene to handle these kinds of queries?
> >  
> >
> 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: String similarity search vs. typcial IR application...

2003-06-06 Thread Leo Galambos
Exact matches are not ideal for DNA applications, I guess. I am not a 
DNA expert, but those guys often need a feature that is termed 
``fuzzy''[*] in Lucene. They need Levenstein's and Hamming's metrics, 
and I think that Lucene has many drawbacks which disallow effective 
implementations. On the other hand, I am very interested in a method you 
mentioned. Could you give me a reference, please? Thank you.

-g-

[*] why do you use the label ``fuzzy''? It has nothing to do with fuzzy 
logic or fuzzy IR, I guess.

Frank Burough wrote:

I have seen some interesting work done on storing DNA sequence as a set of common patterns with unique sequence between them. If one uses an analyzer to break sequence into its set of patterns and unique sequence then Lucene could be used to search for exact pattern matches. I know of only one sequence search tool that was based on this approach. I don't know if it ever left the lab and made it into the mainstream. If I have time I will explore this a bit.

Frank Burough



 

-Original Message-
From: Leo Galambos [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 05, 2003 5:55 PM
To: Lucene Users List
Subject: Re: String similarity search vs. typcial IR application...

AFAIK Lucene is not able to look DNA strings up effectively. 
You would 
use DASG+Lev (see my previous post - 05/30/2003 1916CEST).

-g-

Jim Hargrave wrote:

   

Our application is a string similarity searcher where the 
 

query is an 
   

input string and we want to find all "fuzzy" variants of the 
 

input string in the DB.  The Score is basically dice's 
coefficient: 2C/Q+D, where C is the number of terms (n-grams) 
in common, Q is the number of unique query terms and D is the 
number of unique document terms. Our documents will be sentences.
   

I know Lucene has a fuzzy search capability - but I assume 
 

this would 
   

be very slow since it must search through the entire term 
 

list to find candidates.
   

In order to do the calculation I will need to have 'C' - the 
 

number of 
   

terms in common between query and document. Is there an API 
 

that I can call to get this info? Any hints on what it will 
take to modify Lucene to handle these kinds of queries?
   



 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


 





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: String similarity search vs. typcial IR application...

2003-06-06 Thread Frank Burough
The method I mention was based on using lempel-ziv (I expect my spelling is way off on 
this) algorithms used in lz compression. It relied only on exact matches of short 
stretches of DNA separated by non-matching sequence. The idea was to find stretches of 
sequence that had patterns in common, then generate the original sequences back and 
display alignments. I never paid any attention to its match scores since they seemed 
pretty arbitrary. The focus on this work was in looking for common repeats in human 
sequence to assist in masking them prior to doing analysis. I have lost touch with the 
original author of this but I am digging to see if I can extract the papers from my 
"filing system". I will post them shortly (I hope!).

Frank


> -Original Message-
> From: Leo Galambos [mailto:[EMAIL PROTECTED] 
> Sent: Friday, June 06, 2003 10:00 AM
> To: Lucene Users List
> Subject: Re: String similarity search vs. typcial IR application...
> 
> 
> Exact matches are not ideal for DNA applications, I guess. I am not a 
> DNA expert, but those guys often need a feature that is termed 
> ``fuzzy''[*] in Lucene. They need Levenstein's and Hamming's metrics, 
> and I think that Lucene has many drawbacks which disallow effective 
> implementations. On the other hand, I am very interested in a 
> method you 
> mentioned. Could you give me a reference, please? Thank you.
> 
> -g-
> 
> [*] why do you use the label ``fuzzy''? It has nothing to do 
> with fuzzy 
> logic or fuzzy IR, I guess.
> 
> Frank Burough wrote:
> 
> >I have seen some interesting work done on storing DNA 
> sequence as a set 
> >of common patterns with unique sequence between them. If one uses an 
> >analyzer to break sequence into its set of patterns and 
> unique sequence 
> >then Lucene could be used to search for exact pattern 
> matches. I know 
> >of only one sequence search tool that was based on this approach. I 
> >don't know if it ever left the lab and made it into the 
> mainstream. If 
> >I have time I will explore this a bit.
> >
> >Frank Burough
> >
> >
> >
> >  
> >
> >>-Original Message-
> >>From: Leo Galambos [mailto:[EMAIL PROTECTED]
> >>Sent: Thursday, June 05, 2003 5:55 PM
> >>To: Lucene Users List
> >>Subject: Re: String similarity search vs. typcial IR application...
> >>
> >>
> >>AFAIK Lucene is not able to look DNA strings up effectively.
> >>You would 
> >>use DASG+Lev (see my previous post - 05/30/2003 1916CEST).
> >>
> >>-g-
> >>
> >>Jim Hargrave wrote:
> >>
> >>
> >>
> >>>Our application is a string similarity searcher where the
> >>>  
> >>>
> >>query is an
> >>
> >>
> >>>input string and we want to find all "fuzzy" variants of the
> >>>  
> >>>
> >>input string in the DB.  The Score is basically dice's
> >>coefficient: 2C/Q+D, where C is the number of terms (n-grams) 
> >>in common, Q is the number of unique query terms and D is the 
> >>number of unique document terms. Our documents will be sentences.
> >>
> >>
> >>>I know Lucene has a fuzzy search capability - but I assume
> >>>  
> >>>
> >>this would
> >>
> >>
> >>>be very slow since it must search through the entire term
> >>>  
> >>>
> >>list to find candidates.
> >>
> >>
> >>>In order to do the calculation I will need to have 'C' - the
> >>>  
> >>>
> >>number of
> >>
> >>
> >>>terms in common between query and document. Is there an API
> >>>  
> >>>
> >>that I can call to get this info? Any hints on what it will
> >>take to modify Lucene to handle these kinds of queries?
> >>
> >>
> >>> 
> >>>
> >>>  
> >>>
> >>
> >>
> -
> >>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >>
> >>
> >
> >-
> >To unsubscribe, e-mail: [EMAIL PROTECTED]
> >For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
> >  
> >
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: String similarity search vs. typcial IR application...

2003-06-06 Thread Ype Kingma
On Thursday 05 June 2003 14:12, Jim Hargrave wrote:
> Our application is a string similarity searcher where the query is an input
> string and we want to find all "fuzzy" variants of the input string in the
> DB.  The Score is basically dice's coefficient: 2C/Q+D, where C is the
> number of terms (n-grams) in common, Q is the number of unique query terms
> and D is the number of unique document terms. Our documents will be
> sentences.
>
> I know Lucene has a fuzzy search capability - but I assume this would be
> very slow since it must search through the entire term list to find
> candidates.

Fuzzy search is not as fast as searching with direct terms or truncation,
but it does not search _all_ the terms.

> In order to do the calculation I will need to have 'C' - the number of
> terms in common between query and document. Is there an API that I can call
> to get this info? Any hints on what it will take to modify Lucene to handle
> these kinds of queries?

Have a look at the coord() call in the Similarity interface of Lucene 1.3.
It gets called per document with overlap and nr. of query terms when you
search with your own similarity implementation. It's based on terms
and not on n-grams, so it might be no good in your case.
You might try indexing a 1-gram as a Lucene Term.
In case a 1-gram is pnly C, A, T or G (DNA proteins) this might be too much
overhead for Lucene to handle...

Good luck,
Ype


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]