Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

2009-02-20 Thread Yusuf Aaji
Hi Everyone, My question is related to the Arabic analysis package under org.apache.lucene.analysis.ar. It is cool and it is doing a great job, but it uses a special tokenizer: ArabicLetterTokenizer. The problem with this tokenizer is that it fails to handle emails, URLs, and acronyms.
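
A sketch of the "best of both worlds" idea: keep the contrib Arabic filters but drive them from StandardTokenizer so emails, URLs, and acronyms survive tokenization. This is against the 2.4-era API; the class name is made up here, and whether StandardTokenizer handles all Arabic characters correctly is exactly the open question in this thread.

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.ar.ArabicNormalizationFilter;
    import org.apache.lucene.analysis.ar.ArabicStemFilter;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // Hypothetical analyzer: StandardTokenizer up front, Arabic filters after.
    public class StandardArabicAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(reader); // keeps emails/URLs/acronyms whole
        stream = new StandardFilter(stream);                // strips possessives, acronym dots
        stream = new LowerCaseFilter(stream);               // for Latin-script tokens
        stream = new ArabicNormalizationFilter(stream);     // normalizes alef/yeh/teh marbuta
        stream = new ArabicStemFilter(stream);              // light Arabic stemming
        return stream;
      }
    }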

Re: Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

2009-02-20 Thread Grant Ingersoll
It's been a few years since I've worked on Arabic, but it sounds reasonable. Care to submit a patch with unit tests showing the StandardTokenizer properly handling all Arabic characters? http://wiki.apache.org/lucene-java/HowToContribute On Feb 20, 2009, at 6:22 AM, Yusuf Aaji wrote: Hi
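
A patch like that would typically carry tests along these lines. This is only a sketch using the 2.4-era TokenStream API (JUnit 3 style, as Lucene used then); the expected token lists are assumptions that the tests themselves are meant to verify.

    import java.io.StringReader;
    import junit.framework.TestCase;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class TestStandardTokenizerArabic extends TestCase {

      public void testArabicWordsAreKeptWhole() throws Exception {
        // "a new book": two tokens, each kept intact
        assertTokenizesTo("كتاب جديد", new String[] { "كتاب", "جديد" });
      }

      public void testEmailSurvives() throws Exception {
        assertTokenizesTo("user@example.com", new String[] { "user@example.com" });
      }

      private void assertTokenizesTo(String input, String[] expected) throws Exception {
        TokenStream ts = new StandardTokenizer(new StringReader(input));
        final Token reusable = new Token();
        for (int i = 0; i < expected.length; i++) {
          Token tok = ts.next(reusable);
          assertNotNull("ran out of tokens at " + i, tok);
          assertEquals(expected[i], tok.term());
        }
        assertNull("extra tokens after " + expected.length, ts.next(reusable));
      }
    }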

Re: searching a sentence or paragraph

2009-02-20 Thread Grant Ingersoll
I'm not sure why using a PhraseQuery allows you to search within a sentence. PhraseQuery just makes sure that the terms appear next to each other (or within some slop), but it isn't aware of sentence or paragraph boundaries. See
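
To illustrate the point: a PhraseQuery only constrains positional distance between terms. The field name and terms below are hypothetical.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    // Matches docs where "quick" and "fox" occur within 2 position moves
    // of each other -- regardless of sentence or paragraph boundaries.
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("contents", "quick"));
    pq.add(new Term("contents", "fox"));
    pq.setSlop(2);

One common approach to genuine sentence-scoped matching is to insert a large position-increment gap at sentence boundaries during analysis, so that no realistic slop value can match across them.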

Re: Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

2009-02-20 Thread Robert Muir
Yusuf, You are 100% correct, it is bad that this uses a custom tokenizer. This was my motivation for attacking it from this angle: https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel (unfinished). Otherwise, at some point jflex

Confidence scores at search time

2009-02-20 Thread Ken Williams
Hi, Has there been any work done on getting confidence scores at runtime, so that scores of documents can be compared across queries? I found one reference in the mailing list to some work in 2003, but couldn't find any follow-up: http://osdir.com/ml/jakarta.lucene.user/2003-12/msg00093.html

queryNorm effect on score

2009-02-20 Thread Peter Keegan
The explanations of scores for the same document, returned from 2 similar queries, differ in an unexpected way. There are 2 fields involved, 'contents' and 'literals'. The 'literals' field has setBoost = 0. As you can see from the explanations below, the total weight of the matching terms from the
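
For context, this is roughly the setup being described, sketched with hypothetical field values against the 2.4-era API. One plausible source of the surprise: an index-time boost of 0 zeroes the field's norm, but under DefaultSimilarity the query-side term weights still feed into sumOfSquaredWeights, and queryNorm = 1/sqrt(sumOfSquaredWeights) is applied to every clause, so two similar-looking queries can normalize the same matching terms differently.

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class BoostZeroSketch {
      // Indexing side: 'literals' is searchable, but setBoost(0) zeroes its
      // fieldNorm, so matches in it add nothing to the document score.
      static Document buildDoc(String body, String tags) {
        Document doc = new Document();
        doc.add(new Field("contents", body, Field.Store.NO, Field.Index.ANALYZED));
        Field literals = new Field("literals", tags, Field.Store.NO, Field.Index.ANALYZED);
        literals.setBoost(0f);
        doc.add(literals);
        return doc;
      }

      // Search side: explain() prints the full score tree, including the
      // queryNorm factor, which is recomputed for every query.
      static void showScore(IndexSearcher searcher, Query query, int docId) throws IOException {
        Explanation exp = searcher.explain(query, docId);
        System.out.println(exp.toString());
      }
    }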

Re: 2.3.2 - 2.4.0 StandardTokenizer issue

2009-02-20 Thread Chris Hostetter
: In 2.3.2, if the token "Cómo" came through this it would get changed to "como" by the time it made it through the filters. In 2.4.0 this isn't the case. It treats this one token as two, so we get "co" and "mo". So instead of searching "como" or "Cómo" to get all the hits we now have
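
The split into "co" and "mo" suggests the accent here is a combining character that the 2.4.0 grammar treats as a token boundary. One way to pin this down is to run the same input through each version's tokenizer and dump the tokens; a diagnostic sketch (run it once with the 2.3.2 jar on the classpath and once with 2.4.0):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // Prints each token's text and type so the two versions can be compared.
    public class DumpTokens {
      public static void main(String[] args) throws Exception {
        String text = args.length > 0 ? args[0] : "Cómo";
        TokenStream ts = new StandardTokenizer(new StringReader(text));
        // termText() is deprecated in 2.4 but exists in both 2.3.2 and 2.4.0
        for (Token tok = ts.next(); tok != null; tok = ts.next()) {
          System.out.println(tok.termText() + " [" + tok.type() + "]");
        }
      }
    }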

RE: 2.3.2 - 2.4.0 StandardTokenizer issue

2009-02-20 Thread Philip Puffinburger
Some changes were made to the StandardTokenizer.jflex grammar (you can svn diff the two URLs fairly trivially) to better deal with correctly identifying word characters, but from what I can tell that should have reduced the number of splits, not increased them. It's hard to tell from your
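
For anyone wanting to run that diff, something along these lines should work. The tag paths are my best guess at the repository layout, and in the 2.3/2.4 source tree the grammar file may be named StandardTokenizerImpl.jflex rather than StandardTokenizer.jflex; adjust to what the repository actually contains.

    svn diff \
      http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_3_2/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex \
      http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_4_0/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex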

Re: Indexer.Java problem

2009-02-20 Thread Seid Mohammed
Thanks Erick, I have got the latest Indexer example from LIA2 working properly. Thanks a lot, Seid M On 2/19/09, Michael McCandless luc...@mikemccandless.com wrote: The early access version of LIA2 (accessible at http://www.manning.com/hatcher3/) has updated this example to work with recent
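
For anyone hitting the same compile problem, the gist of the updated example is the 2.4-style IndexWriter constructor that takes a MaxFieldLength. This sketch is merely in the spirit of the LIA2 Indexer, not the book's actual code, and the field names are illustrative.

    import java.io.File;
    import java.io.FileReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class Indexer {
      public static void main(String[] args) throws Exception {
        String indexDir = args[0]; // where the index is written
        String dataDir = args[1];  // directory of .txt files to index

        // 2.4 API: older IndexWriter constructors were deprecated in favor
        // of ones that take an explicit MaxFieldLength.
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(),
            true, IndexWriter.MaxFieldLength.UNLIMITED);

        File[] files = new File(dataDir).listFiles();
        for (int i = 0; i < files.length; i++) {
          if (!files[i].getName().endsWith(".txt")) continue;
          Document doc = new Document();
          doc.add(new Field("contents", new FileReader(files[i]))); // tokenized, not stored
          doc.add(new Field("filename", files[i].getName(),
              Field.Store.YES, Field.Index.NOT_ANALYZED));
          writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
      }
    }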