I have created a slightly hairy document collection that contains tens of 
millions of DNA sequence words, which I wish to process to find the rare and 
unique words. Each word is between 100 and 1000 characters (nucleotides) in 
length.

I have been able to use WildcardQuery and FuzzyQuery to select words: with 
the query “*ubst*” I can recover subst, substring, etc.

I am having more trouble selecting words in the reciprocal direction. If I 
start with a long word such as “sequence”, what would be the most appropriate 
way to select the words in the database that are contained within it, e.g. 
sequ, quenc and ence?
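
To make the reciprocal lookup concrete, here is a minimal brute-force sketch in 
plain Java rather than against the Lucene API. It enumerates every substring of 
the long word within a length range and keeps those that exist as exact terms. 
The class SubwordLookup, its method names, and the in-memory Set standing in for 
the index are all assumptions made for illustration; against a real index, each 
candidate would presumably be checked with an exact term lookup instead.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene: finds indexed words that are
// contained within a longer query word.
final class SubwordLookup {

    // Enumerate every distinct substring of `word` whose length lies in
    // [minLen, maxLen], shortest first, left to right.
    static Set<String> substrings(String word, int minLen, int maxLen) {
        Set<String> out = new LinkedHashSet<>();
        int upper = Math.min(maxLen, word.length());
        for (int len = minLen; len <= upper; len++) {
            for (int start = 0; start + len <= word.length(); start++) {
                out.add(word.substring(start, start + len));
            }
        }
        return out;
    }

    // Keep only the candidates that exist as exact terms.
    // `indexedTerms` is an in-memory stand-in for the real index (assumption).
    static Set<String> containedWords(String word, Set<String> indexedTerms,
                                      int minLen, int maxLen) {
        Set<String> hits = new LinkedHashSet<>();
        for (String candidate : substrings(word, minLen, maxLen)) {
            if (indexedTerms.contains(candidate)) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> terms = Set.of("sequ", "quenc", "ence", "subst");
        System.out.println(containedWords("sequence", terms, 4, 5));
    }
}
```

The number of candidates is roughly the word length multiplied by the number of 
allowed lengths, so for words of 100 to 1000 nucleotides this may become 
expensive without tighter bounds.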

Is there a simple logical way that this could or should be done? A few pointers 
would be very much appreciated.

Cheers

Stephen




