Hi Everyone,
My question is related to the Arabic analysis package under:
org.apache.lucene.analysis.ar
It is cool and it is doing a great job, but it uses a special tokenizer:
ArabicLetterTokenizer
The problem with this tokenizer is that it fails to handle emails, URLs,
and acronyms.
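To illustrate the failure mode: a letter-only tokenizer (the behavior ArabicLetterTokenizer inherits from LetterTokenizer) ends a token at every non-letter character, so '@', '.', and ':' shred emails and URLs into their letter runs. This is a minimal stdlib sketch of that behavior, not the actual Lucene class:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a letter-only tokenizer: any non-letter
// character (including '@', '.', '/') ends the current token.
public class LetterSplitDemo {
    static List<String> letterTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                cur.append(c);
            } else if (cur.length() > 0) {
                tokens.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) tokens.add(cur.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // The email address is shredded into its letter runs.
        System.out.println(letterTokenize("contact user@example.com today"));
        // -> [contact, user, example, com, today]
    }
}
```

A tokenizer with grammar rules for emails and URLs (as StandardTokenizer has) would instead keep "user@example.com" as one token.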
It's been a few years since I've worked on Arabic, but it sounds
reasonable. Care to submit a patch with unit tests showing the
StandardTokenizer properly handling all Arabic characters? http://wiki.apache.org/lucene-java/HowToContribute
On Feb 20, 2009, at 6:22 AM, Yusuf Aaji wrote:
Hi
I'm not sure why using a PhraseQuery allows you to search within a
sentence. PhraseQuery just makes sure that the terms appear next to
each other (or within some slop), but it isn't aware of sentence or
paragraph boundaries.
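The point above can be sketched in code: slop is purely positional, so a sloppy phrase match can cross a sentence boundary without noticing it. This is a rough illustrative model (ordered terms only, first occurrence only), not Lucene's actual PhraseQuery scoring:

```java
import java.util.Arrays;
import java.util.List;

// Rough model of PhraseQuery slop: two terms match when the number of
// intervening token positions is at most `slop`. Nothing here knows
// about sentence or paragraph boundaries.
public class SlopDemo {
    static boolean withinSlop(List<String> tokens, String first, String second, int slop) {
        int pa = tokens.indexOf(first);
        int pb = tokens.indexOf(second);
        if (pa < 0 || pb < 0 || pb <= pa) return false;
        // count of positions between the two terms
        return (pb - pa - 1) <= slop;
    }

    public static void main(String[] args) {
        // "fox." ends a sentence, but positions don't record that.
        List<String> tokens = Arrays.asList(
            "the", "quick", "brown", "fox", "the", "lazy", "dog");
        // Matches with slop 2 even though the pair spans two sentences.
        System.out.println(withinSlop(tokens, "fox", "lazy", 2)); // -> true
        System.out.println(withinSlop(tokens, "fox", "lazy", 0)); // -> false
    }
}
```

If you need sentence-scoped matching, the usual approach is to encode boundaries at index time, e.g. with a position-increment gap between sentences large enough that no realistic slop can bridge it.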
See
Yusuf,
You are 100% correct; it is unfortunate that this uses a custom tokenizer.
This was my motivation for attacking it from this angle:
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
(unfinished)
otherwise, at some point jflex
Hi,
Has there been any work done on getting confidence scores at runtime, so
that scores of documents can be compared across queries? I found one
reference in the mailing list to some work in 2003, but couldn't find any
follow-up:
http://osdir.com/ml/jakarta.lucene.user/2003-12/msg00093.html
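One common (and imperfect) workaround, since raw Lucene scores are not comparable across queries, is to normalize each result list by its own top score. The values become relative ranks within a query, not true confidences; this is an illustrative sketch, not a Lucene API:

```java
import java.util.Arrays;

// Normalize a query's raw scores by the top score of that same query.
// The output is in [0, 1] per query, but is still not a probability of
// relevance and says nothing about absolute match quality.
public class ScoreNormalizeDemo {
    static float[] normalizeByMax(float[] scores) {
        float max = 0f;
        for (float s : scores) max = Math.max(max, s);
        if (max == 0f) return scores.clone();
        float[] out = new float[scores.length];
        for (int i = 0; i < scores.length; i++) out[i] = scores[i] / max;
        return out;
    }

    public static void main(String[] args) {
        float[] rawScores = {8.0f, 4.0f, 2.0f}; // hypothetical hit scores
        System.out.println(Arrays.toString(normalizeByMax(rawScores)));
        // -> [1.0, 0.5, 0.25]
    }
}
```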
The explanations of scores for the same document returned from 2 similar
queries differ in an unexpected way. There are 2 fields involved, 'contents'
and 'literals'. The 'literals' field has setBoost = 0. As you can see from
the explanations below, the total weight of the matching terms from the
: In 2.3.2 if the token "Cómo" came through this it would get changed to
: "como" by the time it made it through the filters. In 2.4.0 this isn't
: the case. It treats this one token as two, so we get "co" and "mo". So
: instead of searching "como" or "Cómo" to get all the hits we now have
some changes were made to the StandardTokenizer.jflex grammar (you can svn
diff the two URLs fairly trivially) to better deal with correctly identifying
word characters, but from what i can tell that should have reduced the number
of splits, not increased them.
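One thing worth ruling out: 'ó' (U+00F3) is a Unicode letter, so a tokenizer that splits on non-letters should keep "Cómo" as a single token. If it arrives as two tokens, the accented character may have been lost or mis-decoded (e.g. bytes read with the wrong charset) before tokenization. A small stdlib check of that assumption:

```java
// Count letter-run tokens in a string. Character.isLetter('\u00f3')
// is true, so a correctly decoded "Cómo" is one token; if the accented
// character was replaced by a non-letter during decoding, it splits.
public class UnicodeLetterDemo {
    static int countTokens(String text) {
        int tokens = 0;
        boolean inToken = false;
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                if (!inToken) { tokens++; inToken = true; }
            } else {
                inToken = false;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(countTokens("C\u00f3mo")); // -> 1 (one word)
        System.out.println(countTokens("C?mo"));      // -> 2 (mangled input splits)
    }
}
```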
it's hard to tell from your
Thanks, Erick.
I have got the latest Indexer example from LIA2 working properly.
Thanks a lot,
Seid M
On 2/19/09, Michael McCandless luc...@mikemccandless.com wrote:
The early access version of LIA2 (accessible at
http://www.manning.com/hatcher3/)
has updated this example to work with recent