Re: [Ferret-talk] Handling Carriage Returns

Jens Kraemer Mon, 28 Apr 2008 03:37:28 -0700

Hi,

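File.readlines returns an array of lines rather than a single string,
which I think is the root cause of the problem: each line is probably
tokenized on its own, so the offsets start again at 0 on every line.
Using File.read instead, which returns the whole file as one string,
should solve your problem.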
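
To illustrate the difference, here is a minimal sketch in plain Ruby
that does not use Ferret at all. It only shows what the two calls return
and where "sentence" sits in each result; that Ferret analyzes each
array element separately is my assumption, not something the script
verifies. The temp file and the variable names are made up for the
example.

```ruby
require "tempfile"

# Write a two-line sample file with the same content as the second
# example in the quoted message (hypothetical temp file).
sample = Tempfile.new("ferret_sample")
sample.write("this is a\nsentence\n")
sample.close

# File.readlines gives one array element per line ...
lines = File.readlines(sample.path)
# ... while File.read gives the whole file as a single string.
whole = File.read(sample.path)

# Offset of "sentence" relative to its own line vs. the whole file.
offset_in_line = lines.last.index("sentence")
offset_in_file = whole.index("sentence")

puts "readlines: #{lines.class} with #{lines.size} elements"
puts "read:      #{whole.class}"
puts "offset within its line: #{offset_in_line}"
puts "offset within the file: #{offset_in_file}"
```

If that assumption holds, passing :content => File.read(path) to
index.add_document should give offsets relative to the whole field.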


Cheers,
Jens

On Mon, Apr 28, 2008 at 03:04:36AM -0400, S D wrote:
> It's my understanding that the tokens in a token_stream consist of text
> along with start/stop positions that represent the byte positions of the
> text within the corresponding document field. The documentation I've been
> reading (i.e., O'Reilly - Ferret - page 67) suggests that these byte
> positions represent positions within the entire field but based on my
> testing it appears that the byte positions are with respect to the line that
> contains the corresponding text within the field. I read my fields following
> Brian McCallister:
> 
>       index.add_document :file => path,
>                          :content => file.readlines
> 
> 
> Hence, if I have a file that contains carriage returns, the token positions
> will be reset with each new line. For example, the following file contents
> (File A)
>           this is a sentence
> will result in a token for the text "sentence" with start position equal to
> 10 (assume "this" starts in position 0) while a file with a carriage return
>           this is a
>           sentence
> will result in a token for the text "sentence" with start position equal to
> 0. I get the same results for my custom tokenizer as well as
> StandardTokenizer. The above does not seem consistent with the documentation
> but more importantly, it seems that global positions are more useful than
> line-based positions (e.g., for highlighting).
> 
> Digging a little deeper it seems that the tokenizer's initialize method is
> called each time the token_stream method of the containing analyzer is
> called:
> 
> class CustomAnalyzer
>   def token_stream(field, str)
>     ts = StandardTokenizer.new(str)
>   end
> end
> 
> Am I missing something here? Are the start/stop byte positions intended to
> be with respect to the line? Is there a way for token_stream to only be
> called once for an entire string sequence (even if carriage returns are
> contained)?
> 
> Thanks,
> John

> _______________________________________________
> Ferret-talk mailing list
> [email protected]
> http://rubyforge.org/mailman/listinfo/ferret-talk

-- 
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/     - The new free film database
