That was it. Stupid mistake on my part.

Thanks!
John

On Mon, Apr 28, 2008 at 6:37 AM, Jens Kraemer <[EMAIL PROTECTED]> wrote:

> Hi,
>
> File.readlines returns an array, which I think is the root cause of
> the problem. Just using File.read instead should solve it.
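>
> For example (a minimal sketch, reusing the index and path from your
> snippet):
>
>       index.add_document :file    => path,
>                          :content => File.read(path)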
>
> Cheers,
> Jens
>
> On Mon, Apr 28, 2008 at 03:04:36AM -0400, S D wrote:
> > It's my understanding that the tokens in a token_stream consist of
> > text along with start/stop positions that represent the byte positions
> > of the text within the corresponding document field. The documentation
> > I've been reading (i.e., O'Reilly - Ferret - page 67) suggests that
> > these byte positions represent positions within the entire field, but
> > based on my testing it appears that the byte positions are with respect
> > to the line that contains the corresponding text within the field. I
> > read my fields following Brian McCallister:
> >
> >       index.add_document :file => path,
> >                          :content => file.readlines
> >
> >
> > Hence, if I have a file that contains carriage returns, the token
> > positions will be reset with each new line. For example, the following
> > file contents (File A)
> >
> >           this is a sentence
> >
> > will result in a token for the text "sentence" with start position
> > equal to 10 (assume "this" starts at position 0), while a file with a
> > carriage return
> >
> >           this is a
> >           sentence
> >
> > will result in a token for the text "sentence" with start position
> > equal to 0. I get the same results for my custom tokenizer as well as
> > StandardTokenizer. The above does not seem consistent with the
> > documentation, but more importantly, global positions seem more useful
> > than line-based positions (e.g., for highlighting).
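> >
> > One way to check what the tokenizer itself reports, independent of how
> > the field is fed in, is to drive it directly. A minimal sketch,
> > assuming Ferret's Token exposes text, start, and end:
> >
> >       # tokenize the two-line string as a single input
> >       tz = StandardTokenizer.new("this is a\nsentence")
> >       while token = tz.next
> >         puts "#{token.text}: #{token.start}..#{token.end}"
> >       end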
> >
> > Digging a little deeper, it seems that the tokenizer's initialize
> > method is called each time the token_stream method of the containing
> > analyzer is called:
> >
> > class CustomAnalyzer
> >   def token_stream(field, str)
> >     # a new tokenizer (and so a fresh initialize) on every call
> >     StandardTokenizer.new(str)
> >   end
> > end
> >
> > Am I missing something here? Are the start/stop byte positions
> > intended to be with respect to the line? Is there a way for
> > token_stream to be called only once for an entire string sequence
> > (even if carriage returns are contained)?
> >
> > Thanks,
> > John
>
> --
> Jens Krämer
> Finkenlust 14, 06449 Aschersleben, Germany
> VAT Id DE251962952
> http://www.jkraemer.net/ - Blog
> http://www.omdb.org/     - The new free film database
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
