I have created a somewhat unwieldy document collection containing tens of millions of DNA sequence words, which I wish to process to find the rarer and unique words. Each word is between 100 and 1000 characters (nucleotides) long.
I have been able to use WildcardQuery and FuzzyQuery to select words: using the query “*ubst*” I can recover subst, substring, etc. I am struggling to select words in the reciprocal direction. If I start with a long word such as “sequence”, what would be the most appropriate way to select the words in the database that are contained within it, e.g. sequ, quenc and ence? Is there a simple, logical way that this could or should be done?

A few pointers would be very much appreciated.

Cheers,
Stephen
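
For concreteness, the reciprocal lookup asked about here can be sketched in plain Java with no Lucene calls: enumerate every contiguous substring of the long word and keep those that exist as exact terms. Everything named below (`SubwordLookup`, `containedTerms`, the `minLen`/`maxLen` bounds, and the in-memory `Set` standing in for the index) is a hypothetical illustration, not part of Lucene. Against a real index the membership test would presumably become an exact-term lookup such as a `TermQuery`, but that is an assumption and not a tested recipe.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

class SubwordLookup {

    /**
     * Returns every indexed term that occurs as a contiguous substring of
     * the query word, in order of first occurrence. The Set is a stand-in
     * for the real index (an assumption made for illustration only).
     */
    static Set<String> containedTerms(String query, Set<String> indexedTerms,
                                      int minLen, int maxLen) {
        Set<String> hits = new LinkedHashSet<>();
        for (int start = 0; start < query.length(); start++) {
            // Longest substring that still fits when starting at this offset.
            int longest = Math.min(maxLen, query.length() - start);
            for (int len = minLen; len <= longest; len++) {
                String candidate = query.substring(start, start + len);
                if (indexedTerms.contains(candidate)) {
                    hits.add(candidate);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> indexed = new HashSet<>(
                Arrays.asList("sequ", "quenc", "ence", "subst", "substring"));
        // Prints [sequ, quenc, ence]
        System.out.println(containedTerms("sequence", indexed, 4, 5));
    }
}
```

The number of candidate substrings grows roughly with the query length times the width of the length window, so for words of 100 to 1000 nucleotides the `minLen` and `maxLen` bounds matter: restricting them to the lengths actually present in the collection keeps the number of lookups manageable.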