Bendick Mahleko <[EMAIL PROTECTED]> wrote on 08/12/2005 07:48:27 AM:

> Hello Mark,
> 
> I am indexing scientific data, where each word is potentially more than 
> 255 in length. So the point is, there doesn't seem to be a way to change 
> the maximum word length (via 'ft_max_word_len' - the parameter defining 
> the maximum length of any word as you pointed out) beyond 255. What are 
> my alternatives?
> 
> Thanks in advance.
> 
> Bendick
> 
> Mark Leith wrote:
> 
> >>-----Original Message-----
> >>From: Bendick Mahleko [mailto:[EMAIL PROTECTED] 
> >>Sent: 12 August 2005 12:22
> >>To: mysql@lists.mysql.com
> >>Subject: how to change ft_max_word_len value beyond 254
> >>
> >>Hello,
> >>I want to index a table using a TEXT value, with length > 
> >>255. I tried changing ft_max_word_len but each time I check 
> >>the status of variables, I notice the changes are not taken. 
> >>It defaults to 254. I am able to change this value to 
> >>anything below 254. Is there any other way to enforce this 
> >>ft_max_word_len value to some arbitrary value above 254?
> >>
> >>
> >>The point is, because my index length is being limited to 
> >>only 254, I am having false misses in my SELECT queries, 
> >>based on the TEXT index.
> >>
> >>Bendick
> >>
> > 
> > 
> > Hi Bendick,
> > 
> > Am I missing something here? The ft_max_word_len variable sets the maximum
> > length of any word that fulltext will index, *not* the maximum length of the
> > field that you are indexing. 
> > 
> > Now, unless you are indexing some scientific data, with for instance some
> > strange, long virus name - I don't know of any word, in the English language
> > at least, that is longer than 254 characters. I recently built a dictionary
> > table for fun, with ~500,000 words from the English language in the table,
> > so I can verify this for you if you want ;)
> > 
> > Perhaps your false misses are due to something else, such as
> > ft_min_word_len, or the values being in more than 50% of the rows etc. 
> > 
> > Mark
> > 
> > Mark Leith
> > Cool-Tools UK Limited
> > http://www.cool-tools.co.uk
> > http://leithal.cool-tools.co.uk 
> > 
> > 
> 

With bioinformatics being such a hot topic today, and because you didn't 
say exactly what kind of long "scientific" data you are trying to index, 
it occurred to me that you may be storing gene sequences. DNA sequences 
can be represented as LONG strings of A, C, T, and G, but this doesn't 
leave any word breaks for the index to pick up on. 

With that in mind, you may be able to substitute any one of those letters 
with a character that the full-text parser treats as a word break, and so 
enable full-text indexing. Here is a visual example:

AGACATATACCCGCGTA
A.ACATATACCC.C.TA


I substituted a period for all G's in this sequence. I could have used any 
other punctuation or whitespace character. So long as no run of bases 
between two occurrences of the "delimiter nucleotide" is longer than the 
ft_max_word_len ceiling (254, by your report), the FT index should be able 
to properly capture the entire sequence.
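
To make the substitution concrete, here is a minimal Python sketch of it. 
The function names and the MAX_WORD_LEN constant are my own, made up for 
illustration, and the 254 ceiling is simply the value reported earlier in 
this thread. It only prepares the text and checks run lengths; it does not 
talk to MySQL at all. One caveat I would expect to matter: runs shorter than 
ft_min_word_len are skipped by the index too, so that setting probably needs 
to be lowered for short fragments to be found.

```python
# Assumed ceiling for ft_max_word_len, taken from the value reported in this thread.
MAX_WORD_LEN = 254


def delimit(sequence, base="G", delimiter="."):
    """Replace every occurrence of one nucleotide with a delimiter character."""
    return sequence.replace(base, delimiter)


def longest_run(delimited, delimiter="."):
    """Return the length of the longest run of characters between delimiters."""
    return max((len(token) for token in delimited.split(delimiter)), default=0)


encoded = delimit("AGACATATACCCGCGTA")
print(encoded)  # A.ACATATACCC.C.TA

# True when every "word" the index would see fits under the assumed ceiling.
print(longest_run(encoded) <= MAX_WORD_LEN)
```

If longest_run() ever comes back above the ceiling for one of your 
sequences, a different "delimiter nucleotide" may split that sequence more 
evenly.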

When searching, just convert your "target" nucleotide to your "stop 
character" in the search term and continue as usual. Could this technique 
help to reduce the number of false negatives in your application? The 
same idea applies to other kinds of data. For instance, you might 
replace all occurrences of the extremely common "amino" or "methyl" in 
chemical names with a "%" or "$" character. Not only could it help to 
compress the data but it introduces artificial "word breaks" into 
extremely long "words" without losing any information from the actual 
data.
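
A rough Python sketch of the chemical-name variant follows. The 
SUBSTITUTIONS table and the encode/decode names are hypothetical, and I am 
assuming that "%" and "$" are treated as word breaks by the full-text 
parser, which you should verify on your server. The same encode() would be 
applied to the stored text and to each search term. Encoding refuses input 
that already contains one of the symbols, since otherwise the round trip 
would no longer be lossless.

```python
# Hypothetical fragment-to-symbol table; assumes the symbols act as word breaks.
SUBSTITUTIONS = {"amino": "%", "methyl": "$"}


def encode(text):
    """Swap each common fragment for its symbol, introducing word breaks."""
    for symbol in SUBSTITUTIONS.values():
        if symbol in text:
            raise ValueError("input already contains reserved symbol %r" % symbol)
    for fragment, symbol in SUBSTITUTIONS.items():
        text = text.replace(fragment, symbol)
    return text


def decode(text):
    """Restore the original fragments from their symbols."""
    for fragment, symbol in SUBSTITUTIONS.items():
        text = text.replace(symbol, fragment)
    return text


stored = encode("dimethylaminopyridine")
print(stored)          # di$%pyridine
print(decode(stored))  # dimethylaminopyridine
```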

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine
