karl wettin wrote:
On Sun, 1 Feb 2004 13:12:32 -0800 (PST)
Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
Looking forward to the contribution.
Sorry for the delay, but I've had quite some workload lately, and then I
moved between apartments. I'm back and I'm ready to spend some time.
I gave up dete
On Sun, 1 Feb 2004 13:12:32 -0800 (PST)
Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Looking forward to the contribution.
Sorry for the delay, but I've had quite some workload lately, and then I
moved between apartments. I'm back and I'm ready to spend some time.
I gave up detecting the languag
> Sent: Sunday, February 01, 2004 10:07 PM
> To: [EMAIL PROTECTED]
> Subject: N-gram layer
>
>
>
> Hello list,
>
> I'm Karl, and I just started testing Lucene the other day.
> It's a great core engine, but I feel there are some things
> missing I'd be
in [mailto:[EMAIL PROTECTED]
Sent: Friday, 6 February 2004 07:58
To: Lucene Developers List
Subject: Re: AW: N-gram layer and language guessing
On Tue, 3 Feb 2004 11:39:40 +0100
"Karsten Konrad" <[EMAIL PROTECTED]> wrote:
>
> Anyway, XtraMind's ngram language gue
On Tue, 3 Feb 2004 11:39:40 +0100
"Karsten Konrad" <[EMAIL PROTECTED]> wrote:
>
> Anyway, XtraMind's ngram language guesser gives the following
> best five results on the Swedish examples discussed previously:
>
> "jag heter kalle"
>
> swedish 100,00 %
> norwegian 17,51 %
> danish 10,02 %
> af
On Tuesday 03 February 2004 02:18, karl wettin wrote:
> On Tue, 3 Feb 2004 09:54:19 +0100
>
> karl wettin <[EMAIL PROTECTED]> wrote:
> > test has a weight of 1731 in Swedish
> > test has a weight of 1726 in Danish
>
> Oh dear. Mine fails too.
Considering that the Swedish, Danish and Norwegian languages are
ge documents
etc.
Have fun with ngram,
Karsten
-Original Message-
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 3 February 2004 14:01
To: Lucene Developers List
Subject: Re: AW: AW: N-gram layer and language guessing
On Tue, 3 Feb 2004 13:36:35 +0100
"Karsten Konrad" <[EMAIL PROTECTED]> wrote:
>
> If you use ngrams consistently, you can leave out stemming and spend
> your time with something different (like buying a bigger hard disk for
> your indexes, you probably will need them then :)
I didn't get your poin
aMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com
-Original Message-
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 3 February 2004 12:58
To: Lucene Developers L
On Tue, 03 Feb 2004 12:47:06 +0100
Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Karsten Konrad wrote:
> > The guesser uses only tri- and quad-grams and is based on
> > a sophisticated machine learning algorithm instead of a raw
> > TF/IDF-weighting. The upside of this is the "confidence"
> > val
Karsten Konrad wrote:
Hi,
does anybody here use a ngram-layer for fault-tolerant searching
on *larger* texts? I ask because you can expect to see far more
ngrams than words emerging from a text once you use at least
quad-grams - and the number of different tokens indexed seems to
be the most im
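
A rough way to see the effect described in this message is to count distinct tokens both ways. The sketch below is illustrative only: the helper names are made up for this example, it is plain Python rather than Lucene code, and it takes quad-grams over the whole text with spaces included, which may differ from how a particular n-gram layer tokenizes.

```python
def distinct_words(text):
    """Distinct whitespace-separated, lower-cased words."""
    return set(text.lower().split())


def distinct_char_ngrams(text, n=4):
    """Distinct character n-grams over the whole text, spaces included."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


sample = "the quick brown fox jumps over the lazy dog"
print(len(distinct_words(sample)), "distinct words")
print(len(distinct_char_ngrams(sample)), "distinct quad-grams")
```

Even on a single short sentence the quad-gram set is several times larger than the word set. How the ratio develops on larger texts depends on the language and the corpus, so it is worth measuring on real data before sizing an index.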
Intelligence Lab
XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
[EMAIL PROTECTED]
www.xtramind.com
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 3 February 2004 09:27
To: Lucene Dev
karl wettin wrote:
On Tue, 03 Feb 2004 09:27:25 +0100
Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
If I run the above example, I get the following:
"jag heter kalle"
- SV: 0.7197875
What is index 1.0 ?
1.0 - completely dissimilar language profiles
0.0 - completely similar language profiles
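
One common way to obtain a score with these endpoints is cosine distance over character n-gram counts. The thread does not say which measure produced the 0.7197875 above, so the following is only a sketch under that assumption; `char_ngrams`, `profile_distance`, the `_` word padding and the 1-to-3 gram range are all choices made for this example.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max, padding each word with '_'."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts


def profile_distance(a, b):
    """Cosine distance between two n-gram count profiles, in [0.0, 1.0]."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    if norm == 0:
        return 1.0  # an empty profile shares nothing with any other profile
    return max(0.0, 1.0 - dot / norm)  # clamp tiny negative rounding errors


first = char_ngrams("jag heter kalle")
second = char_ngrams("vad heter du")
print(profile_distance(first, first))   # a profile compared with itself
print(profile_distance(first, second))  # two different short Swedish phrases
```

Because all counts are non-negative, the result stays between 0.0 (same n-gram distribution) and 1.0 (no n-gram in common), matching the scale described above. A real guesser would compare the input against profiles trained on much larger samples per language.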
On Tue, 3 Feb 2004 09:54:19 +0100
karl wettin <[EMAIL PROTECTED]> wrote:
>
>
> test has a weight of 1731 in Swedish
> test has a weight of 1726 in Danish
Oh dear. Mine fails too.
--
karl
-
To unsubscribe, e-mail: [EMAIL PR
On Tue, 03 Feb 2004 09:27:25 +0100
Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> However, for the text "vad heter du" (what's your name) the detection
> fails... :-)
I'm sorry for my multiple replies.
1->5 grams and penalty:
vad heter du
test has a weight of 1731 in Swedish
test has a wei
On Tue, 03 Feb 2004 09:27:25 +0100
Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> If I run the above example, I get the following:
>
> "jag heter kalle"
> - SV: 0.7197875
What is index 1.0 ?
--
karl
On Tue, 03 Feb 2004 09:27:25 +0100
Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> A question: what was your source for the representative hi-frequency
> words in various languages? Was it your training corpus or some publication?
I use the data supplied with Gertjan van Noord's TextCat distrib
karl wettin wrote:
On Mon, 2 Feb 2004 20:10:57 +0100
"Jean-Francois Halleux" <[EMAIL PROTECTED]> wrote:
during the past days, I've developed such a language guesser myself
as a basis for a Lucene analyzer. It is based on trigrams. It is
already working but not yet in a "publishable" state. If you
On Mon, 2 Feb 2004 20:10:57 +0100
"Jean-Francois Halleux" <[EMAIL PROTECTED]> wrote:
> during the past days, I've developed such a language guesser myself
> as a basis for a Lucene analyzer. It is based on trigrams. It is
> already working but not yet in a "publishable" state. If you or others
>
On Feb 2, 2004, at 2:10 PM, Jean-Francois Halleux wrote:
Hi Karl,
during the past days, I've developed such a language guesser myself
as a
basis for a Lucene analyzer. It is based on trigrams. It is already
working
but not yet in a "publishable" state. If you or others are interested
I can
of
rancois Halleux
-Original Message-
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: Sunday, 1 February 2004 22:07
To: [EMAIL PROTECTED]
Subject: N-gram layer
Hello list,
I'm Karl, and I just started testing Lucene the other day. It's a great
core engine, but I feel there are some thing
On Sun, 1 Feb 2004 22:15:26 -0600
"Robert Engels" <[EMAIL PROTECTED]> wrote:
> Actually, you do not always need to store it in a field.
>
> See the Phonetic Query patch I posted (which does Soundex, Metaphone,
> and can actually do any 'secondary' info query).
Now it hit me, I really don't need
01, 2004 3:07 PM
To: [EMAIL PROTECTED]
Subject: N-gram layer
Hello list,
I'm Karl, and I just started testing Lucene the other day. It's a great
core engine, but I feel there are some things missing that I'd be happy to
contribute.
I started by writing a simple N-gram classifie
The best Analyzer documentation so far is Erik Hatcher's "Parser Rulez"
article. Link is under Resources page on Lucene's site.
Looking forward to the contribution.
Otis
--- karl wettin <[EMAIL PROTECTED]> wrote:
>
> Hello list,
>
> I'm Karl, and I just started testing Lucene the other day.
Hello list,
I'm Karl, and I just started testing Lucene the other day. It's a great
core engine, but I feel there are some things missing that I'd be happy to
contribute.
I started by writing a simple N-gram classifier to detect the language of
a text in order to automatically cluster documents by l