That was it. Stupid mistake on my part. Thanks!

John
On Mon, Apr 28, 2008 at 6:37 AM, Jens Kraemer <[EMAIL PROTECTED]> wrote:
> Hi,
>
> File.readlines returns an array, which I think is the root cause of the
> problem. Just using File.read instead should solve your problem.
>
> Cheers,
> Jens
>
> On Mon, Apr 28, 2008 at 03:04:36AM -0400, S D wrote:
> > It's my understanding that the tokens in a token_stream consist of text
> > along with start/stop positions that represent the byte positions of
> > the text within the corresponding document field. The documentation
> > I've been reading (i.e., O'Reilly - Ferret - page 67) suggests that
> > these byte positions represent positions within the entire field, but
> > based on my testing it appears that the byte positions are with respect
> > to the line that contains the corresponding text within the field. I
> > read my fields following Brian McCallister:
> >
> >   index.add_document :file => path,
> >                      :content => file.readlines
> >
> > Hence, if I have a file that contains carriage returns, the token
> > positions will be reset with each new line. For example, the following
> > file contents (File A)
> >
> >   this is a sentence
> >
> > will result in a token for the text "sentence" with start position
> > equal to 10 (assume "this" starts in position 0), while a file with a
> > carriage return
> >
> >   this is a
> >   sentence
> >
> > will result in a token for the text "sentence" with start position
> > equal to 0. I get the same results for my custom tokenizer as well as
> > StandardTokenizer. The above does not seem consistent with the
> > documentation, but more importantly, it seems that global positions are
> > more useful than line-based positions (e.g., for highlighting).
> >
> > Digging a little deeper, it seems that the tokenizer's initialize
> > method is called each time the token_stream method of the containing
> > analyzer is called:
> >
> >   class CustomAnalyzer
> >     def token_stream(field, str)
> >       ts = StandardTokenizer.new(str)
> >     end
> >   end
> >
> > Am I missing something here? Are the start/stop byte positions intended
> > to be with respect to the line? Is there a way for token_stream to only
> > be called once for an entire string sequence (even if carriage returns
> > are contained)?
> >
> > Thanks,
> > John
> >
> > _______________________________________________
> > Ferret-talk mailing list
> > [email protected]
> > http://rubyforge.org/mailman/listinfo/ferret-talk
>
> --
> Jens Krämer
> Finkenlust 14, 06449 Aschersleben, Germany
> VAT Id DE251962952
> http://www.jkraemer.net/ - Blog
> http://www.omdb.org/ - The new free film database
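The root cause Jens points to can be reproduced in plain Ruby, without Ferret at all: File.readlines returns an array of lines, so anything that computes byte offsets per element restarts at 0 on every line, while File.read returns one string with field-wide offsets. A minimal sketch (the temp-file name is arbitrary):

```ruby
require 'tempfile'

# Recreate the two-line file from the example in the thread.
f = Tempfile.new('field')
f.write("this is a\nsentence")
f.close

lines = File.readlines(f.path)  # => ["this is a\n", "sentence"]
whole = File.read(f.path)       # => "this is a\nsentence"

# Tokenizing each array element separately, offsets restart at 0
# on every line, so "sentence" appears to start at byte 0:
puts lines[1].index("sentence")  # => 0

# Tokenizing the single string keeps offsets relative to the
# whole field, matching the documented behavior:
puts whole.index("sentence")     # => 10
```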
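Ferret's tokenizers aside, what "global" start/stop positions mean once the whole field is passed as one string can be sketched with a toy whitespace tokenizer (illustrative only — this Token struct and tokenize method are not Ferret's API):

```ruby
# Each token records its text plus start/stop offsets measured
# against the entire input string, so a newline does not reset them.
Token = Struct.new(:text, :start, :stop)

def tokenize(str)
  tokens = []
  str.scan(/\S+/) do
    m = Regexp.last_match
    tokens << Token.new(m[0], m.begin(0), m.end(0))
  end
  tokens
end

# "this is a\n" occupies bytes 0-9, so "sentence" starts at 10
# even though it sits on the second line:
tokenize("this is a\nsentence").last
# => #<struct Token text="sentence", start=10, stop=18>
```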

