Synonyms are also domain specific. A synonym set for one area may be completely
wrong in another.
In cooking, arugula and rocket are the same thing. In military or aerospace,
missile and rocket are very similar.
I would start with librarians. They maintain controlled vocabularies (called
“thesauri”).
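The domain-specific synonym point above can be sketched in a few lines. This is an illustrative toy, not Solr's SynonymFilter; the function and table names (`expand_synonyms`, `COOKING`, `AEROSPACE`) are hypothetical:

```python
# Per-domain synonym tables: the same token maps differently by domain.
COOKING = {"rocket": {"arugula"}, "arugula": {"rocket"}}
AEROSPACE = {"rocket": {"missile"}, "missile": {"rocket"}}

def expand_synonyms(token, domain):
    """Return the token plus its synonyms for the given domain table."""
    return {token} | domain.get(token, set())

# "rocket" expands to {"rocket", "arugula"} in cooking,
# but to {"rocket", "missile"} in aerospace.
```

In a real Solr deployment this would be a per-core synonyms file rather than an in-memory dict, but the key design point is the same: one synonym set per domain, never one global set.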
Oh, Walter! Hello! I thought that name was familiar. Greetings from Basis.
All that makes sense.
On Thu, Mar 20, 2008 at 1:00 PM, Walter Underwood <[EMAIL PROTECTED]>
wrote:
> Extreme, but guaranteed to work and it avoids bad IDF when there are
> inter-language collisions. In Ultraseek, we only stored the hash, so
> the size of the source token didn't matter.
Extreme, but guaranteed to work and it avoids bad IDF when there are
inter-language collisions. In Ultraseek, we only stored the hash, so
the size of the source token didn't matter.
Trademarks are a bad source of collisions and anomalous IDF. If you have
LaserJet support docs in 20 languages, the same token shows up in every
language.
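The scheme described above can be sketched as follows. This is a toy illustration of the idea, not Ultraseek's actual code; `index_token` and the hash choice are assumptions:

```python
import hashlib

def index_token(token, lang):
    """Prefix the token with its language code, then store only a
    fixed-size hash. The source token's length never matters, and
    "boot" in English and "Boot" in German become distinct terms,
    so their document frequencies no longer pollute each other."""
    tagged = f"{lang}|{token.lower()}"
    return hashlib.md5(tagged.encode("utf-8")).hexdigest()[:16]
```

Because the language code is folded into the hash, the index needs no separate per-language fields to keep IDF statistics clean.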
Token-by-token seems a bit extreme. Are you concerned with macaronic
documents?
On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood <[EMAIL PROTECTED]>
wrote:
> Nice list.
>
> You may still need to mark the language of each document. There are
> plenty of cross-language collisions: "die" and "boot" have different
> meanings in German and English.
You can store in one field if you manage to hide a language code with the
text. XML is overkill but effective for this. At one point, we'd
investigated how to allow a Lucene analyzer to see more than one field (the
language code as well as the text) but I don't think we came up with
anything.
Nice list.
You may still need to mark the language of each document. There are
plenty of cross-language collisions: "die" and "boot" have different
meanings in German and English. Proper nouns ("Laserjet") may be the
same in all languages, a different problem if you are trying to get
answers in one language.
Unless you can come up with language-neutral tokenization and
stemming, you
need to:
a) know the language of each document.
b) run a different analyzer depending on the language.
c) force the user to tell you the language of the query.
d) run the query through the same analyzer.
I can do all of these.
Unless you can come up with language-neutral tokenization and stemming, you
need to:
a) know the language of each document.
b) run a different analyzer depending on the language.
c) force the user to tell you the language of the query.
d) run the query through the same analyzer.
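Steps (a) through (d) above can be sketched as an analyzer-dispatch table. The analyzers here are deliberately naive toys (real systems would use per-language stemmers such as Snowball); all names are hypothetical:

```python
def analyze_en(text):
    # Naive English analyzer: lowercase, strip a trailing plural "s".
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t
            for t in text.lower().split()]

def analyze_de(text):
    # Naive German analyzer: lowercase only.
    return text.lower().split()

ANALYZERS = {"en": analyze_en, "de": analyze_de}

def index_doc(text, lang):
    # (a) the document's language is known;
    # (b) pick the analyzer for that language.
    return ANALYZERS[lang](text)

def analyze_query(query, lang):
    # (c) the user tells us the query language;
    # (d) run the query through the same analyzer as the documents.
    return ANALYZERS[lang](query)
```

The point of (d) is that a query analyzed differently from the index can never match: "boots" stemmed at index time must also be stemmed at query time.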
You may be interested in a recent discussion that took place on a similar
subject:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html
Interesting, yes. But since it doesn't actually exist, it's not much
help.
I guess what I'm asking is whether my approach seems convoluted.
You may be interested in a recent discussion that took place on a similar
subject:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html
Nicolas
-----Original Message-----
From: David King [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 19, 2008 8:07 PM
To: solr-user@lucene.apache.org