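Hi,

File.readlines returns an array with one string per line, which I think is the root cause of the problem: each element is presumably tokenized on its own, so the offsets start over on every line. Using File.read instead, which returns the whole file as a single string, should solve your problem.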
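To illustrate the difference, here is a minimal sketch in plain Ruby. It does not use Ferret at all: token_offsets is a hypothetical stand-in for a tokenizer, and it reports character offsets, which equal byte offsets only for ASCII text. The string content plays the role of the file from your example.

```ruby
# Hypothetical stand-in for a tokenizer (NOT Ferret's StandardTokenizer):
# returns [text, start_offset] pairs for every word in the given string.
def token_offsets(str)
  offsets = []
  str.scan(/\w+/) { offsets << [$~[0], $~.begin(0)] }
  offsets
end

# Same contents as the two-line file in the example.
content = "this is a\nsentence\n"

# Shape of File.readlines(path): an array with one string per line.
# Tokenizing each element separately restarts the offsets on every line.
lines = content.lines
per_line = lines.map { |line| token_offsets(line) }

# Shape of File.read(path): a single string for the whole file.
# Offsets are now relative to the start of the field.
whole = token_offsets(content)

puts per_line.inspect
puts whole.inspect
```

In the per-line case "sentence" comes out at offset 0, in the whole-string case at offset 10. For the indexing call in the quoted message this would mean passing File.read(path) as :content rather than file.readlines.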
Cheers,
Jens

On Mon, Apr 28, 2008 at 03:04:36AM -0400, S D wrote:
> It's my understanding that the tokens in a token_stream consist of text
> along with start/stop positions that represent the byte positions of the
> text within the corresponding document field. The documentation I've been
> reading (i.e., O'Reilly - Ferret - page 67) suggests that these byte
> positions represent positions within the entire field but based on my
> testing it appears that the byte positions are with respect to the line that
> contains the corresponding text within the field. I read my fields following
> Brian McCallister:
>
>   index.add_document :file => path,
>                      :content => file.readlines
>
>
> Hence, if I have a file that contains carriage returns, the token positions
> will be reset with each new line. For example, the following file contents
> (File A)
>   this is a sentence
> will result in a token for the text "sentence" with start position equal to
> 10 (assume "this" starts in position 0) while a file with a carriage return
>   this is a
>   sentence
> will result in a token for the text "sentence" with start position equal to
> 0. I get the same results for my custom tokenizer as well as
> StandardTokenizer. The above does not seem consistent with the documentation
> but more importantly, it seems that global positions are more useful than
> line-based positions (e.g., for highlighting).
>
> Digging a little deeper it seems that the tokenizer's initialize method is
> called each time the token_stream method of the containing analyzer is
> called:
>
> class CustomAnalyzer
>   def token_stream(field, str)
>     ts = StandardTokenizer.new(str)
>   end
> end
>
> Am I missing something here? Are the start/stop byte positions intended to
> be with respect to the line? Is there a way for token_stream to only be
> called once for an entire string sequence (even if carriage returns are
> contained)?
>
> Thanks,
> John
> _______________________________________________
> Ferret-talk mailing list
> [email protected]
> http://rubyforge.org/mailman/listinfo/ferret-talk

-- 
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

