Re: Language support

2016-08-23 Thread Walter Underwood
Synonyms are also domain specific. A synonym set for one area may be completely wrong in another. In cooking, arugula and rocket are the same thing. In military or aerospace, missile and rocket are very similar. I would start with librarians. They maintain controlled vocabularies (called “thes
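
A minimal Python sketch of the point above, assuming synonym sets are kept per domain and looked up with the domain as part of the key. The table `SYNONYMS`, the domain names, and the function `expand` are hypothetical, not taken from the thread, and treating "missile" and "rocket" as one set is a simplification of "very similar".

```python
# Hypothetical per-domain synonym sets; a real deployment would load these
# from a curated vocabulary rather than hard-code them.
SYNONYMS = {
    "cooking": [{"arugula", "rocket"}],
    "aerospace": [{"missile", "rocket"}],
}


def expand(term, domain):
    """Return the term plus its synonyms within the given domain only."""
    expanded = {term}
    for group in SYNONYMS.get(domain, []):
        if term in group:
            expanded |= group
    return expanded
```

With this layout, `expand("rocket", "cooking")` never pulls in "missile", and an unknown domain leaves the term unexpanded.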

Re: Language support

2008-03-20 Thread Benson Margulies
Oh, Walter! Hello! I thought that name was familiar. Greetings from Basis. All that makes sense. On Thu, Mar 20, 2008 at 1:00 PM, Walter Underwood <[EMAIL PROTECTED]> wrote: > Extreme, but guaranteed to work and it avoids bad IDF when there are > inter-language collisions. In Ultraseek, we only s

Re: Language support

2008-03-20 Thread Walter Underwood
Extreme, but guaranteed to work and it avoids bad IDF when there are inter-language collisions. In Ultraseek, we only stored the hash, so the size of the source token didn't matter. Trademarks are a bad source of collisions and anomalous IDF. If you have LaserJet support docs in 20 languages, the
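
A rough Python sketch of why marking each token with its language changes IDF, under one reading of the message above. The helpers `term_key`, `document_frequencies`, and `idf`, and the toy corpus, are hypothetical. SHA-1 is only a stand-in, since the thread does not say which hash Ultraseek used.

```python
import hashlib
import math


def term_key(lang, token):
    # Only a fixed-size digest of "lang:token" is kept, so the length of the
    # source token does not affect storage. (Assumed scheme, for illustration.)
    return hashlib.sha1(f"{lang}:{token}".encode("utf-8")).hexdigest()[:16]


def document_frequencies(docs, mark_language):
    """Count, per term, how many documents contain it."""
    df = {}
    for lang, tokens in docs:
        for token in set(tokens):
            key = term_key(lang, token) if mark_language else token
            df[key] = df.get(key, 0) + 1
    return df


def idf(df, key, n_docs):
    # Plain log(N / df); real engines use smoothed variants.
    return math.log(n_docs / df[key])


# Toy corpus: "die" is a very common German article and a rare English noun.
docs = [
    ("de", ["die", "katze"]),
    ("de", ["die", "maus"]),
    ("de", ["die", "tür"]),
    ("en", ["the", "die", "is", "cast"]),
    ("en", ["the", "cat"]),
    ("en", ["the", "mouse"]),
]

plain = document_frequencies(docs, mark_language=False)
marked = document_frequencies(docs, mark_language=True)
```

In the unmarked counts, the English "die" inherits the document frequency of the German article. In the marked counts, the two are separate terms with separate IDF values.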

Re: Language support

2008-03-20 Thread Benson Margulies
Token-by-token seems a bit extreme. Are you concerned with macaronic documents? On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood <[EMAIL PROTECTED]> wrote: > Nice list. > > You may still need to mark the language of each document. There are > plenty of cross-language collisions: "die" and "boot

Re: Language support

2008-03-20 Thread Benson Margulies
You can store in one field if you manage to hide a language code with the text. XML is overkill but effective for this. At one point, we'd investigated how to allow a Lucene analyzer to see more than one field (the language code as well as the text) but I don't think we came up with anything. On
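
A minimal Python sketch of carrying the language code inside a single text field with a small XML wrapper. The element name `text`, the attribute `lang`, and the helpers `wrap` and `unwrap` are assumptions for illustration. The thread does not describe the actual format, and this is not a Lucene or Solr API.

```python
import xml.etree.ElementTree as ET


def wrap(text, lang):
    """Serialize text with its language code into one string field value."""
    elem = ET.Element("text", {"lang": lang})
    elem.text = text
    # encoding="unicode" returns a str and escapes &, <, > in the content.
    return ET.tostring(elem, encoding="unicode")


def unwrap(field_value):
    """Recover (lang, text) from a wrapped field value."""
    elem = ET.fromstring(field_value)
    return elem.get("lang"), elem.text or ""
```

An analyzer that receives only the field value could call `unwrap` first and then pick its tokenizer and stemmer from the recovered language code.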

Re: Language support

2008-03-20 Thread Walter Underwood
Nice list. You may still need to mark the language of each document. There are plenty of cross-language collisions: "die" and "boot" have different meanings in German and English. Proper nouns ("LaserJet") may be the same in all languages, a different problem if you are trying to get answers in on

Re: Language support

2008-03-20 Thread David King
> Unless you can come up with language-neutral tokenization and stemming, you need to: a) know the language of each document. b) run a different analyzer depending on the language. c) force the user to tell you the language of the query. d) run the query through the same analyzer.
I can do all o

Re: Language support

2008-03-20 Thread Benson Margulies
Unless you can come up with language-neutral tokenization and stemming, you need to: a) know the language of each document. b) run a different analyzer depending on the language. c) force the user to tell you the language of the query. d) run the query through the same analyzer. On Thu, Mar 20,
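
Steps a) through d) can be sketched in Python as follows. This is a toy under stated assumptions: the analyzers `analyze_en` and `analyze_de` use crude suffix stripping in place of real stemmers, and the in-memory index and every name here are hypothetical.

```python
import re


def analyze_en(text):
    # Toy English analyzer: lowercase, keep letter runs, strip a plural "s".
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def analyze_de(text):
    # Toy German analyzer: lowercase, keep letter runs, strip a final "en".
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return [t[:-2] if t.endswith("en") and len(t) > 4 else t for t in tokens]


# b) a different analyzer depending on the language
ANALYZERS = {"en": analyze_en, "de": analyze_de}


def analyze(text, lang):
    try:
        return ANALYZERS[lang](text)
    except KeyError:
        raise ValueError(f"no analyzer for language {lang!r}")


def index_document(index, doc_id, text, lang):
    # a) the caller must supply the language of each document
    for token in analyze(text, lang):
        index.setdefault((lang, token), set()).add(doc_id)


def search(index, query, lang):
    # c) the caller supplies the query language;
    # d) the query goes through the same analyzer as the documents
    results = None
    for token in analyze(query, lang):
        postings = index.get((lang, token), set())
        results = postings if results is None else results & postings
    return results or set()
```

Keying postings by `(lang, token)` is one way to keep "boot" in English apart from "Boot" in German. Separate fields or separate indexes per language would serve the same purpose.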

Re: Language support

2008-03-20 Thread David King
> You may be interested in a recent discussion that took place on a similar subject: http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html
Interesting, yes. But since it doesn't actually exist, it's not much help. I guess what I'm asking is, if my approach seems convoluted, I

RE: Language support

2008-03-20 Thread nicolas . dessaigne
You may be interested in a recent discussion that took place on a similar subject: http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html Nicolas -Original Message- From: David King [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 19, 2008 20:07 To: solr-user@lucene.apache.