Bendick Mahleko <[EMAIL PROTECTED]> wrote on 08/12/2005 07:48:27 AM:
> Hello Mark,
>
> I am indexing scientific data, where each word is potentially more than
> 255 in length. So the point is, there doesn't seem to be a way to change
> the maximum word length (via 'ft_max_word_len' - the parameter defining
> the maximum length of any word as you pointed out) beyond 255. What are
> my alternatives?
>
> Thanks in advance.
>
> Bendick
>
> Mark Leith wrote:
>
> >>-----Original Message-----
> >>From: Bendick Mahleko [mailto:[EMAIL PROTECTED]
> >>Sent: 12 August 2005 12:22
> >>To: mysql@lists.mysql.com
> >>Subject: how to change ft_max_word_len value beyond 254
> >>
> >>Hello,
> >>I want to index a table using a TEXT value, with length >
> >>255. I tried changing ft_max_word_len but each time I check
> >>the status of variables, I notice the changes are not taken.
> >>It defaults to 254. I am able to change this value to
> >>anything below 254. Is there any other way to enforce this
> >>ft_max_word_len value to some arbitrary value above 254?
> >>
> >>
> >>The point is, because my index length is being limited to
> >>only 254, I am having false misses in my SELECT queries,
> >>based on the TEXT index.
> >>
> >>Bendick
> >>
> >
> >
> > Hi Bendick,
> >
> > Am I missing something here? The ft_max_word_len variable sets the maximum
> > length of any word that fulltext will index, *not* the maximum length of the
> > field that you are indexing.
> >
> > Now, unless you are indexing some scientific data, with for instance some
> > strange, long virus name - I don't know of any word, in the English language
> > at least, that is longer than 254 characters. I recently built a dictionary
> > table for fun, with ~500,000 words from the English language in the table,
> > so I can verify this for you if you want ;)
> >
> > Perhaps your false misses are due to something else, such as
> > ft_min_word_len, or the values being in more than 50% of the rows etc.
> >
> > Mark
> >
> > Mark Leith
> > Cool-Tools UK Limited
> > http://www.cool-tools.co.uk
> > http://leithal.cool-tools.co.uk
> >
> >
>

With bioinformatics being such a hot topic today, and because you didn't
say exactly what kind of long, "scientific" data you are trying to index,
an idea occurred to me: you may be storing gene sequences. DNA sequences
can be represented as LONG strings of A, C, T, and G, but this doesn't
leave any word breaks for the index to pick up on.

With that in mind, you may be able to substitute any one of those letters
with one of the stop letters and enable full-text indexing. Here is a
visual example:

AGACATATACCCGCGTA
A.ACATATACCC.C.TA

I substituted a period for all G's in this sequence. I could have used any
other punctuation or whitespace character. So long as you never exceed 255
base pair combinations between any two occurrences of the "delimiter
nucleotide", the FT Index should be able to properly capture the entire
sequence. When searching, just convert your "target" nucleotide to your
"stop character" and continue as usual.

Could this technique help to reduce the number of false negatives in your
application? The same idea applies to other kinds of data. For instance,
you might replace all occurrences of the extremely common "amino" or
"methyl" in chemical names with a "%" or "$" character. Not only could it
help to compress the data, but it introduces artificial "word breaks" into
extremely long "words" without losing any information from the actual data.

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine
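
The substitution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not tested against a real full-text index: the helper names (`delimit`, `restore`, `longest_fragment`) and the `MAX_WORD_LEN` constant are hypothetical, and 254 is simply the ceiling reported earlier in this thread. Note also that fragments shorter than ft_min_word_len (4 by default) would not be indexed, so very short pieces between delimiters may still be missed.

```python
# Assumption: 254 is the ft_max_word_len ceiling reported in this thread.
MAX_WORD_LEN = 254


def delimit(sequence: str, target: str = "G", stop: str = ".") -> str:
    """Replace every occurrence of the target nucleotide with a stop character."""
    return sequence.replace(target, stop)


def restore(delimited: str, target: str = "G", stop: str = ".") -> str:
    """Undo delimit(); valid only if the stop character never occurs in raw data."""
    return delimited.replace(stop, target)


def longest_fragment(delimited: str, stop: str = ".") -> int:
    """Length of the longest run between two stop characters."""
    return max(len(part) for part in delimited.split(stop))


sequence = "AGACATATACCCGCGTA"
stored = delimit(sequence)          # value to store in the indexed column
search_term = delimit("CCCGCG")     # convert search input the same way

print(stored)                                      # A.ACATATACCC.C.TA
print(search_term)                                 # CCC.C.
print(longest_fragment(stored) <= MAX_WORD_LEN)    # True
```

Applying the same conversion to both the stored value and the search input keeps the two consistent, and `restore` recovers the original sequence when the data is read back.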