Re: N-gram layer

2004-03-13 Thread Andrzej Bialecki
karl wettin wrote: On Sun, 1 Feb 2004 13:12:32 -0800 (PST) Otis Gospodnetic <[EMAIL PROTECTED]> wrote: Looking forward to the contribution. Sorry for the delay, but I've had quite some workload lately, and then I moved between apartments. I'm back and I'm ready to spend some time. I gave up dete

Re: N-gram layer

2004-03-11 Thread karl wettin
On Sun, 1 Feb 2004 13:12:32 -0800 (PST) Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > Looking forward to the contribution. Sorry for the delay, but I've had quite some workload lately, and then I moved between apartments. I'm back and I'm ready to spend some time. I gave up detecting the languag

RE: N-gram layer

2004-02-09 Thread Nestel, Frank IZ/HZA-IOL
> Sent: Sunday, February 01, 2004 10:07 PM > To: [EMAIL PROTECTED] > Subject: N-gram layer > > > > Hello list, > > I'm Karl, and I just started testing Lucene the other day. > It's a great core engine, but feel there are some things > missing I'd be

AW: AW: N-gram layer and language guessing

2004-02-06 Thread Karsten Konrad
in [mailto:[EMAIL PROTECTED] Sent: Friday, 6 February 2004 07:58 To: Lucene Developers List Subject: Re: AW: N-gram layer and language guessing On Tue, 3 Feb 2004 11:39:40 +0100 "Karsten Konrad" <[EMAIL PROTECTED]> wrote: > > Anyway, XtraMind's ngram language gue

Re: AW: N-gram layer and language guessing

2004-02-05 Thread karl wettin
On Tue, 3 Feb 2004 11:39:40 +0100 "Karsten Konrad" <[EMAIL PROTECTED]> wrote: > > Anyway, XtraMind's ngram language guesser gives the following > best five results on the swedish examples discussed previously: > > "jag heter kalle" > > swedish 100,00 % > norwegian 17,51 % > danish 10,02 % > af

Re: N-gram layer

2004-02-03 Thread Tatu Saloranta
On Tuesday 03 February 2004 02:18, karl wettin wrote: > On Tue, 3 Feb 2004 09:54:19 +0100 > > karl wettin <[EMAIL PROTECTED]> wrote: > > test has a weight of 1731 in Swedish > > test has a weight of 1726 in Danish > > Oh dear. Mine fails too. Considering swedish, danish and norwegian languages are

AW: AW: AW: N-gram layer and language guessing

2004-02-03 Thread Karsten Konrad
ge documents etc. Have fun with ngram, Karsten -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Tuesday, 3 February 2004 14:01 To: Lucene Developers List Subject: Re: AW: AW: N-gram layer and language guessing On Tue, 3 Feb 2004 13:36:35 +0100 "

Re: AW: AW: N-gram layer and language guessing

2004-02-03 Thread karl wettin
On Tue, 3 Feb 2004 13:36:35 +0100 "Karsten Konrad" <[EMAIL PROTECTED]> wrote: > > If you use ngrams consistently, you can leave out stemming and spend > your time with something different (like buying a bigger hard disk for > your indexes, you probably will need them then :) I didn't get your poin

AW: AW: N-gram layer and language guessing

2004-02-03 Thread Karsten Konrad
aMind Technologies GmbH Stuhlsatzenhausweg 3 D-66123 Saarbrücken Phone: +49 (681) 3025113 Fax: +49 (681) 3025109 [EMAIL PROTECTED] www.xtramind.com -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Tuesday, 3 February 2004 12:58 To: Lucene Developers L

Re: AW: N-gram layer and language guessing

2004-02-03 Thread karl wettin
On Tue, 03 Feb 2004 12:47:06 +0100 Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > Karsten Konrad wrote: > > The guesser uses only tri- and quad-grams and is based on > > a sophisticated machine learning algorithm instead of a raw > > TF/IDF-weighting. The upside of this is the "confidence" > > val

Re: AW: N-gram layer and language guessing

2004-02-03 Thread Andrzej Bialecki
Karsten Konrad wrote: Hi, does anybody here use a ngram-layer for fault-tolerant searching on *larger* texts? I ask because you can expect to see far more ngrams than words emerging from a text once you use at least quad-grams - and the number of different tokens indexed seems to be the most im
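Karsten's point about index size can be checked on a small scale. The following is a minimal sketch, not code from the thread: the helper names `distinct_words` and `distinct_char_ngrams` are hypothetical, and it assumes n-grams are taken inside each whitespace-separated word without padding.

```python
def distinct_words(text):
    """Return the set of distinct lowercase whitespace-separated words."""
    return set(text.lower().split())


def distinct_char_ngrams(text, n=4):
    """Return the set of distinct character n-grams taken inside each word.

    Assumption for this sketch: no padding and no n-grams spanning word
    boundaries; words shorter than n contribute nothing.
    """
    grams = set()
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams


sample = "search searching searched"
print(len(distinct_words(sample)), len(distinct_char_ngrams(sample)))
```

On this sample, three distinct words produce eight distinct quad-grams. How quickly the ratio grows on real text depends on the corpus and on the n-gram scheme.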

AW: N-gram layer and language guessing

2004-02-03 Thread Karsten Konrad
Intelligence Lab XtraMind Technologies GmbH Stuhlsatzenhausweg 3 D-66123 Saarbrücken Phone: +49 (681) 3025113 Fax: +49 (681) 3025109 [EMAIL PROTECTED] www.xtramind.com -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Tuesday, 3 February 2004 09:27 To: Lucene Dev

Re: N-gram layer

2004-02-03 Thread Andrzej Bialecki
karl wettin wrote: On Tue, 03 Feb 2004 09:27:25 +0100 Andrzej Bialecki <[EMAIL PROTECTED]> wrote: If I run the above example, I get the following: "jag heter kalle" - SV: 0.7197875 What is index 1.0 ? 1.0 - completely dissimilar language profiles 0.0 - completely similar language profiles
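A score on the scale described here (0.0 for identical profiles, 1.0 for completely dissimilar ones) can be produced by normalizing a rank-based profile distance. The sketch below assumes a TextCat-style "out-of-place" measure. It is not Andrzej's actual implementation, and the function names, the underscore padding and the `top_k` cutoff are illustrative assumptions.

```python
from collections import Counter


def ngram_profile(text, n_min=1, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    # Assumption: words are joined and padded with "_" to mark boundaries.
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def profile_distance(doc_profile, lang_profile):
    """Out-of-place rank distance scaled to the range 0.0 to 1.0."""
    if not doc_profile or not lang_profile:
        return 1.0
    max_penalty = len(lang_profile)
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    total = 0
    for i, gram in enumerate(doc_profile):
        if gram in rank:
            # Penalty is the rank displacement, capped at max_penalty.
            total += min(abs(i - rank[gram]), max_penalty)
        else:
            # N-grams missing from the language profile get the full penalty.
            total += max_penalty
    return total / (max_penalty * len(doc_profile))
```

With this normalization, the language whose profile yields the lowest distance is the guess. A value such as 0.72 for a three-word input would then mean the short text shares only part of its n-gram ranking with the language profile.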

Re: N-gram layer

2004-02-03 Thread karl wettin
On Tue, 3 Feb 2004 09:54:19 +0100 karl wettin <[EMAIL PROTECTED]> wrote: > > > test has a weight of 1731 in Swedish > test has a weight of 1726 in Danish Oh dear. Mine fails too. -- karl - To unsubscribe, e-mail: [EMAIL PR

Re: N-gram layer

2004-02-03 Thread karl wettin
On Tue, 03 Feb 2004 09:27:25 +0100 Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > > However, for the text "vad heter du" (what's your name) the detection > fails... :-) I'm sorry for my multiple replies.. 1->5 grams and penalty: vad heter du test has a weight of 1731 in Swedish test has a wei

Re: N-gram layer

2004-02-03 Thread karl wettin
On Tue, 03 Feb 2004 09:27:25 +0100 Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > > If I run the above example, I get the following: > > "jag heter kalle" > - SV: 0.7197875 What is index 1.0 ? -- karl

Re: N-gram layer

2004-02-03 Thread karl wettin
On Tue, 03 Feb 2004 09:27:25 +0100 Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > > A question: what was your source for the representative hi-frequency > words in various languages? Was it your training corpus or some publication? I use the data supplied with Gertjan van Noord's TextCat distrib

Re: N-gram layer

2004-02-03 Thread Andrzej Bialecki
karl wettin wrote: On Mon, 2 Feb 2004 20:10:57 +0100 "Jean-Francois Halleux" <[EMAIL PROTECTED]> wrote: during the past days, I've developed such a language guesser myself as a basis for a Lucene analyzer. It is based on trigrams. It is already working but not yet in a "publishable" state. If you

Re: N-gram layer

2004-02-03 Thread karl wettin
On Mon, 2 Feb 2004 20:10:57 +0100 "Jean-Francois Halleux" <[EMAIL PROTECTED]> wrote: > during the past days, I've developed such a language guesser myself > as a basis for a Lucene analyzer. It is based on trigrams. It is > already working but not yet in a "publishable" state. If you or others >

Re: N-gram layer

2004-02-02 Thread Erik Hatcher
On Feb 2, 2004, at 2:10 PM, Jean-Francois Halleux wrote: Hi Karl, during the past days, I've developed such a language guesser myself as a basis for a Lucene analyzer. It is based on trigrams. It is already working but not yet in a "publishable" state. If you or others are interested I can of

RE: N-gram layer

2004-02-02 Thread Jean-Francois Halleux
rancois Halleux -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Sunday, 1 February 2004 22:07 To: [EMAIL PROTECTED] Subject: N-gram layer Hello list, I'm Karl, and I just started testing Lucene the other day. It's a great core engine, but feel there are some thing

Re: N-gram layer

2004-02-01 Thread karl wettin
On Sun, 1 Feb 2004 22:15:26 -0600 "Robert Engels" <[EMAIL PROTECTED]> wrote: > Actually, you do not always need to store it in a field. > > See the Phonetic Query patch I posted (which does Soundex, Metaphone, > and can actually do any 'secondary' info query). Now it hit me, I really don't need

RE: N-gram layer

2004-02-01 Thread Robert Engels
01, 2004 3:07 PM To: [EMAIL PROTECTED] Subject: N-gram layer Hello list, I'm Karl, and I just started testing Lucene the other day. It's a great core engine, but feel there are some things missing I'd be happy to contribute with. I started by writing a simple N-gram classifie

Re: N-gram layer

2004-02-01 Thread Otis Gospodnetic
The best Analyzer documentation so far is Erik Hatcher's "Parser Rulez" article. Link is under Resources page on Lucene's site. Looking forward to the contribution. Otis --- karl wettin <[EMAIL PROTECTED]> wrote: > > Hello list, > > I'm Karl, and I just started testing Lucene the other day.

N-gram layer

2004-02-01 Thread karl wettin
Hello list, I'm Karl, and I just started testing Lucene the other day. It's a great core engine, but feel there are some things missing I'd be happy to contribute with. I started by writing a simple N-gram classifier to detect language of a text in order to automatically cluster documents by l