Re: finding the analyzer for a language...

2010-09-26 Thread Itamar Syn-Hershko
Shai, I was referring to your #2, which you already indicated in your reply wasn't part of the discussion. Itamar. On 26/9/2010 10:10 AM, Shai Erera wrote: The mapping is simply about returning the right Analyzer for the given Locale. You decide up front (as the Factory developer) what Analyze

Re: finding the analyzer for a language...

2010-09-26 Thread Shai Erera
The mapping is simply about returning the right Analyzer for the given Locale. You decide up front (as the Factory developer) what Analyzer / Tokenizer + TokenFilters combination you want to return for each language, and then when that language is input, you return it. That's it. Can you define mi
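
A minimal sketch of the Locale-to-Analyzer mapping described in this message, assuming a hypothetical generic `AnalyzerFactory` class (the class, its methods, and the fallback behaviour are illustrative and not taken from the thread). The type parameter `A` stands in for Lucene's `org.apache.lucene.analysis.Analyzer` so that the example compiles without Lucene on the classpath. It keys on the language code only and does not address mixed-language content.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

/**
 * Hypothetical sketch: returns an analyzer of type A for a given Locale.
 * The factory developer decides up front which analyzer each language gets.
 */
final class AnalyzerFactory<A> {
    // Keyed by ISO 639 language code ("en", "fr", ...), ignoring country and variant.
    private final Map<String, Supplier<A>> byLanguage = new HashMap<>();
    // Used when no analyzer was registered for the requested language (an assumption).
    private final Supplier<A> fallback;

    AnalyzerFactory(Supplier<A> fallback) {
        this.fallback = fallback;
    }

    /** Registers the analyzer supplier to use for the locale's language. */
    AnalyzerFactory<A> register(Locale locale, Supplier<A> supplier) {
        byLanguage.put(locale.getLanguage(), supplier);
        return this;
    }

    /** Returns the analyzer registered for the locale's language, or the fallback. */
    A forLocale(Locale locale) {
        return byLanguage.getOrDefault(locale.getLanguage(), fallback).get();
    }
}
```

With Lucene available, `A` would be `Analyzer` and each supplier would construct the Analyzer (or Tokenizer plus TokenFilters chain) chosen for that language; a language identifier's output would first be converted to a `Locale` before the lookup.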

Re: finding the analyzer for a language...

2010-09-25 Thread Itamar Syn-Hershko
I may be missing the point here, but how do you define an analyzer <-> language match? What do you do in cases of mixed content, for example? Itamar. On 25/9/2010 10:27 PM, Shai Erera wrote: Shai Erera brought a similar idea up before, to use Locale, but my concerns are it would be limited by

Re: finding the analyzer for a language...

2010-09-25 Thread Shai Erera
> Shai Erera brought a similar idea up before, to use Locale, but my concerns are it would be limited by Java's Locale mechanism... but we can figure this out.
It really depends how sophisticated you want such an AnalyzerFactory (that's what I call it in my code) to be. We can define it to

Re: finding the analyzer for a language...

2010-09-25 Thread Bill Janssen
Robert Muir wrote:
> On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen wrote:
> > I thought that since I'm updating UpLib's Lucene code, I should tackle the issue of document languages, as well. Right now I'm using an off-the-shelf language identifier, textcat, to figure out which langua

Re: finding the analyzer for a language...

2010-09-25 Thread Robert Muir
On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen wrote:
> I thought that since I'm updating UpLib's Lucene code, I should tackle the issue of document languages, as well. Right now I'm using an off-the-shelf language identifier, textcat, to figure out which language a Web page or PDF is (main

finding the analyzer for a language...

2010-09-24 Thread Bill Janssen
I thought that since I'm updating UpLib's Lucene code, I should tackle the issue of document languages, as well. Right now I'm using an off-the-shelf language identifier, textcat, to figure out which language a Web page or PDF is (mainly) written in. I then want to analyze that document with an a