That's why I called it the simplest way, not the best way
performance-wise. For what it's worth, I'm using RegExpAnalyzer to
index some information in an index of hundreds of thousands of
documents, and I'm not hitting any ceilings in terms of memory usage
or performance.

StandardAnalyzer relies on spaces to find tokens, while also taking
stop words and hyphens into consideration, right? Do correct me if I'm
wrong. I don't know how Persian "works", but if you have any
expression that's not space-separated, your users won't get any
results back unless they happen to query for the entire expression.
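
To make the difference concrete, here is a rough sketch in plain Ruby.
It does not call Ferret at all; the helper names (whitespace_tokens,
char_tokens, contains_run?) and the sample text are made up for
illustration, and the "phrase match" is only a stand-in for what an
index would do with one-character tokens.

```ruby
# Plain-Ruby illustration only; no Ferret calls are made here.

# Whitespace-based tokenization: an unspaced run becomes one opaque token.
def whitespace_tokens(text)
  text.split
end

# Per-character tokenization: every non-space character is its own token.
def char_tokens(text)
  text.scan(/\S/)
end

# True if all of +needle+ appears as a contiguous run inside +haystack+.
def contains_run?(haystack, needle)
  return false if needle.empty?
  (0..haystack.length - needle.length).any? do |i|
    haystack[i, needle.length] == needle
  end
end

doc   = "searchengine library" # hypothetical text with an unspaced compound
query = "engine"

# Whole-token match fails: "engine" is buried inside "searchengine".
word_hit = whitespace_tokens(doc).include?(query)

# Phrase-style match over character tokens succeeds.
char_hit = contains_run?(char_tokens(doc), char_tokens(query))

puts "whitespace tokens match: #{word_hit}"
puts "character tokens match:  #{char_hit}"
```

Note that the character version would also match across the old word
boundary (e.g. "enginelib"), which is the kind of imprecision I mean.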

The best solution for a mixed-text scenario, as far as I can tell, is
to have an analyzer that's complex enough to work out the language of
every character/word, and apply some sort of sub-analyzer for each
language it finds. This might require you to make several passes
over the same string.
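
A minimal sketch of that idea, again in plain Ruby rather than Ferret.
It guesses the "language" purely from the Unicode script of each run
of characters, which is an assumption on my part and much cruder than
real language detection. SUB_ANALYZERS, SCRIPT_RUN and analyze_mixed
are hypothetical names, and the sub-analyzers are placeholders for
whatever per-language processing you'd actually want.

```ruby
# Hypothetical per-script sub-analyzers; real ones would add stemming,
# stop words, and so on.
SUB_ANALYZERS = {
  arabic: ->(run) { [run] },          # keep Arabic-script runs unchanged
  latin:  ->(run) { [run.downcase] }  # lowercase Latin-script runs
}.freeze

# One alternative per script; anything else (spaces, punctuation) is skipped.
SCRIPT_RUN = /(?<arabic>\p{Arabic}+)|(?<latin>[\p{Latin}\d]+)/

# Split the text into single-script runs and hand each run to the
# sub-analyzer registered for its script.
def analyze_mixed(text)
  tokens = []
  text.scan(SCRIPT_RUN) do
    match  = Regexp.last_match
    script = match[:arabic] ? :arabic : :latin
    tokens.concat(SUB_ANALYZERS.fetch(script).call(match[0]))
  end
  tokens
end

# "\u06A9\u062A\u0627\u0628" is a Persian word written with escapes.
sample = "Ferret \u06A9\u062A\u0627\u0628 Index"
p analyze_mixed(sample)
```

This one gets away with a single pass because the script alone
decides which sub-analyzer runs; anything smarter probably won't.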

So to sum it up, it's not a matter of reinventing the wheel. It's a
quick hack that will get you imprecise results sometimes, but will
work with mixed text for sure, since your analyzer doesn't assume any
"westernisms" to be there when tokenizing text.


On 4/23/07, Reza Yeganeh <[EMAIL PROTECTED]> wrote:
> > ... I think the simplest way to make it work is to use a RegexpAnalyzer
> > that takes every character for a token.
>
> David's code uses StandardAnalyzer. It's implemented in C and is fast
> and advanced. I don't want to re-invent the wheel (e.g. www.example.com,
> emails, punctuation etc.). PerFieldAnalyzer is not a good solution for
> me too (I have mixed text). Persian is very similar to English, in
> punctuations (it has some extra marks), word foundation, and even stems.
>
>
> --
> Posted via http://www.ruby-forum.com/.
> _______________________________________________
> Ferret-talk mailing list
> [email protected]
> http://rubyforge.org/mailman/listinfo/ferret-talk
>


-- 
Julio C. Ody
http://rootshell.be/~julioody
