handling token created/deleted events in an Index

2008-06-16 Thread Mathieu Lecarme
With LUCENE-1297, the SpellChecker will be able to choose how to
estimate the distance between two words.


Here are some other enhancements:
 * The capacity to synchronize the main Index and the SpellChecker
Index. Handling token creation is easy: a simple TokenFilter can do
the work. Token deletion is a bit harder. Lazy deletion can be used if
token popularity is checked in the main Index each time. That is a pull
strategy; a push from the Directory should be lighter.
 * Choosing the similarity strategy. For now, it's only an n-gram
computation. Homophony could be nice, for example.
 * The spell Index can be used for dynamic similarity without disturbing
the main Index. For example, Snowball is nice for grouping words by
their roots, but it disturbs the Index if you want to run a "starts
with" query.
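The lazy-deletion idea in the first bullet can be sketched in plain Java. This is a hypothetical illustration, not SpellChecker API: the spell index records tokens as a filter sees them, and a periodic pull pass re-checks each word's popularity against the main index (mocked here as a map) and drops words that no longer exist.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch of the "lazy delete" pull strategy described above.
public class LazySpellIndex {
  private final Map<String, Integer> words = new HashMap<String, Integer>();

  // Called from a TokenFilter as tokens are created in the main index.
  public void addWord(String word) {
    Integer n = words.get(word);
    words.put(word, n == null ? 1 : n + 1);
  }

  // Pull strategy: re-check each word's frequency in the main index
  // (mocked as a map) and lazily drop words that no longer exist there.
  public int prune(Map<String, Integer> mainIndexFreqs) {
    int removed = 0;
    Iterator<String> it = words.keySet().iterator();
    while (it.hasNext()) {
      Integer freq = mainIndexFreqs.get(it.next());
      if (freq == null || freq == 0) {
        it.remove();
        removed++;
      }
    }
    return removed;
  }

  public boolean contains(String word) {
    return words.containsKey(word);
  }
}
```

A push from the Directory would instead notify this structure on segment merges, avoiding the full scan that prune() does.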


Some time ago, I suggested a patch, LUCENE-1190, but I guess it's too
monolithic. A more modular approach would be better.


Any comments or suggestions?

M.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: WebLuke - include Jetty in Lucene binary distribution?

2008-04-25 Thread Mathieu Lecarme

markharw00d wrote:




Any word on getting this committed as a contrib?
Not really changed the code since the message below. I can commit
pretty much the contents of the zip file below any time you want.
Do folks still feel comfortable with the bloat this adds to the
Lucene source distro? The gwt-dev-windows.jar contains the
Java-to-JavaScript compiler necessary for building, and alone accounts for
10 MB. Including Jetty adds another ~6 MB on top of that.


OK with this?


Why not use Ivy or Maven for that?

M.





Re: Storing phrases in index

2008-04-10 Thread Mathieu Lecarme

palexv wrote:

Thanks!
Can you help me get the ShingleFilter class? It is absent in version 2.3.1.
How can I get it?
  
It's in the SVN version. You can backport it, or build your own,
with a Stack.


M.




Re: Optimise Indexing time using lucene..

2008-04-09 Thread Mathieu Lecarme

lucene4varma wrote:

Hi all,

I am new to Lucene and am using it for text search in my web application,
and for that I need to index records in a database.
We are using a JDBC directory to store the indexes. Now the problem is: when I
start the process of indexing the records for the first time, it takes a
huge amount of time. Following is the code for indexing.


rs = st.executeQuery(); // returns 2 million records
while (rs.next()) {
    // create a Java object ...
    // index the Java record into the JDBC directory ...
}

The above process takes a huge amount of time for 2 million records;
approximately 3-4 business days.
Can anyone please suggest an approach by which I could cut down this
time.
  
A JDBC directory is not a good idea. It's only useful when you need a
central repository.

Use a large maxBufferedDocs in your IndexWriter.
With a large amount of data, you'll hit bottlenecks: database reading,
index writing, RAM for buffered docs, maybe CPU.
If your database reading is huge and you are in a hurry, you can shard the
index between multiple computers, and when it's finished, merge all the
indexes, with champagne.


M.




Re: Storing phrases in index

2008-04-09 Thread Mathieu Lecarme

palexv wrote:

Hello all.
I have a question for the advanced Lucene users.
I have a set of phrases which I need to store in an index.
Is there a way of storing phrases as terms in the index?


What is the best way of writing such an index? Should this field be tokenized?
  

Not tokenized.

What is the best way of searching phrases by mask in such an index? Should I
use BooleanQuery, WildcardQuery or SpanQuery?
If you search for the complete phrase, just use a Term; if you search for part
of a phrase, use ShingleFilter.


 
What is the best way to avoid the maxClauseCount exception when searching
with something like a*?

Index the indexed terms.

M.




Re: shingles and punctuations

2008-04-08 Thread Mathieu Lecarme

Setting a flag in a filter is easy:

8<---

package org.apache.lucene.analysis.shingle;

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * @author Mathieu Lecarme
 *
 */
public class SentenceCutterFilter extends TokenFilter {
  public static final int FLAG = 42;
  public Token previous = null;

  protected SentenceCutterFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    Token current = input.next();
    if (current == null)
      return null;
    // A gap of more than one character since the previous token suggests
    // punctuation (or a typo), i.e. a sentence boundary.
    if (previous == null || (current.startOffset() - previous.endOffset()) > 1)
      current.setFlags(FLAG);
    previous = current;
    return current;
  }
}

8<---
and using it in the right place is tricky:
8<---

String test = "This is a test, a big test";
TokenStream stream =
  new StopFilter(
    new ShingleFilter(
      new SentenceCutterFilter(
        new LowerCaseFilter(
          new ISOLatin1AccentFilter(
            new StandardTokenizer(new StringReader(test))))),
      3),
    new String[]{"is", "a"});

8<---

But I must be too tired; I can't manage to patch the ShingleFilter to handle
the flag.

I guess the flag should be a single bit, tested with a mask.
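Testing a flag with a mask rather than comparing the whole int is what lets independent filters coexist: each filter claims one bit. A minimal sketch (the names are hypothetical, not Lucene API; note that 42 as a flag value spans three bits and would collide with any filter using bit 1, 3 or 5):

```java
// Sketch of bit flags tested with a mask, so independent filters can
// each own one bit of the Token flags int without colliding.
public class TokenFlags {
  public static final int SENTENCE_START = 1 << 0; // claimed by one filter
  public static final int KEYWORD        = 1 << 1; // claimed by another

  // Set a bit without disturbing bits owned by other filters.
  public static int set(int flags, int bit) {
    return flags | bit;
  }

  // Test with a mask instead of comparing the whole int.
  public static boolean has(int flags, int mask) {
    return (flags & mask) != 0;
  }
}
```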

M.



On Apr 6, 2008, at 22:53, Grant Ingersoll wrote:
For now, it's up to your app to know, unfortunately :-(  I think the
WikipediaTokenizer is the only one currently using flags in Lucene.



On Apr 6, 2008, at 10:43 PM, Mathieu Lecarme wrote:

I'll use Token flags to specify the first token in a sentence, but how
does it work? How are flag collisions avoided? To keep it simple, I'll
take 1 as the flag, but what happens if another filter uses the same
flag?


M.

On Apr 6, 2008, at 20:13, Grant Ingersoll wrote:
I think you need sentence detection to take place further  
upstream.  Then you could use the Token type or Token flags to  
indicate punctuation, sentences, whatever and we could patch the  
shingle filter to ignore these things, or break and move onto the  
next one.


-Grant

On Apr 6, 2008, at 7:23 PM, Mathieu Lecarme wrote:

The new ShingleFilter is very helpful for fetching groups of words,
but it doesn't handle punctuation or any separator.
If you feed it multiple sentences, you will get shingles that start in
one sentence and end in the next.
To avoid that, you can look at token positions: if there is a gap of
more than one character from the previous token, it should be
punctuation (or a typo).

Any suggestions for handling only shingles within the same sentence?

M.



shingles and punctuations

2008-04-06 Thread Mathieu Lecarme
The new ShingleFilter is very helpful for fetching groups of words, but
it doesn't handle punctuation or any separator.
If you feed it multiple sentences, you will get shingles that start in
one sentence and end in the next.
To avoid that, you can look at token positions: if there is a gap of
more than one character from the previous token, it should be punctuation
(or a typo).

Any suggestions for handling only shingles within the same sentence?

M.




Re: WordNet synonyms overhead

2008-03-18 Thread Mathieu Lecarme

Harald Näger wrote:

Hi,

I am especially interested in the WordNet synonym expansion that was
discussed in the Lucene in Action book. Is there anyone here on the list
who has experience with this approach?


I'm curious about how much the synonym expansion will increase the size of an
index. Are there any reliable figures from real-life applications?
  
Query expansion is better than index expansion: faster to use, a smaller
index, and less noise when you search.
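As a rough illustration of query expansion (the expander and synonym map here are hypothetical, not the WordNet contrib API), each query term can be ORed with down-weighted synonyms so that exact matches still rank first, and the index itself stays untouched:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of query expansion: a term is ORed with its
// synonyms, each carrying a boost below 1.0 so the original term wins.
public class QueryExpander {
  public static String expand(String term, Map<String, Float> synonyms) {
    StringBuilder sb = new StringBuilder("(").append(term);
    for (Map.Entry<String, Float> e : synonyms.entrySet()) {
      // Boost syntax as accepted by the Lucene query parser: term^weight.
      sb.append(" OR ").append(e.getKey()).append('^').append(e.getValue());
    }
    return sb.append(')').toString();
  }
}
```

Index expansion would instead bake every synonym into the postings, inflating the index and making the weighting impossible to tune afterwards.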


M.




Re: [jira] Created: (LUCENE-1229) NGramTokenFilter optimization in query phase

2008-03-14 Thread Mathieu Lecarme

Hiroaki Kawai (JIRA) wrote:

NGramTokenFilter optimization in query phase


 Key: LUCENE-1229
 URL: https://issues.apache.org/jira/browse/LUCENE-1229
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Hiroaki Kawai


I found that an NGramTokenFilter-ed token stream could be optimized at query
time.

A standard 1,2 NGramTokenFilter will generate a token stream from "abcde" as
follows:
a ab b bc c cd d de e

When we index "abcde", we'll use all of the tokens.

But when we query, we only need:
ab cd de
  

I don't understand why you would index something that you will never query.
Why don't you use a bigram?
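For reference, the 1,2-gram stream quoted above can be reproduced with a small sketch (a plain-Java illustration, not the contrib NGramTokenFilter itself); setting minGram = maxGram = 2 gives the pure bigram stream suggested in the reply:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an n-gram generator matching the "a ab b bc c cd d de e"
// stream described above for "abcde" with min=1, max=2.
public class NGrams {
  public static List<String> grams(String s, int minGram, int maxGram) {
    List<String> out = new ArrayList<String>();
    for (int i = 0; i < s.length(); i++) {
      for (int n = minGram; n <= maxGram && i + n <= s.length(); n++) {
        out.add(s.substring(i, i + n));
      }
    }
    return out;
  }
}
```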

M.




Re: an API for synonym in Lucene-core

2008-03-13 Thread Mathieu Lecarme

I'll slice my contrib into small parts:

Synonyms
1) Synonym (Token + a weight)
2) Synonym provider from the OO.o thesaurus
3) SynonymTokenFilter
4) Query expander which applies a filter (and a boost) on each of its TermQuerys
5) a Synonym filter for the query expander
6) to be efficient, a Synonym can be excluded if it doesn't exist in the Index
7) Stemming can be used as a dynamic Synonym

Spell checking, or the "did you mean?" pattern
1) The main concept is in the SpellCheck contrib, but in a non-expandable
implementation
2) In some languages, like French, homophony is very important in
misspelling; there is more than one way to write a sound
3) Homophony rules are provided by Aspell in a neutral language (just
like Snowball for stemming); I implemented a translator to build Java
classes from Aspell files (it's the same format in the Aspell descendants,
MySpell and Hunspell, which are used in the OO.o and Firefox families)

https://issues.apache.org/jira/browse/LUCENE-956

Storing information about words found in an index
1) It's the Dictionary used in the SpellCheck contrib, in a more open way:
a lexicon. It's a plain old Lucene index; a word becomes a Document, and
Fields store computed information like size, n-gram tokens and homophony.
All of this uses filters taken from TokenFilter, so code duplication is avoided.
2) This information may be out of sync with the index, in order not to
slow down the indexing process, so some information needs to be checked
lazily (does this synonym already exist in the index?), and lexicon
correction can be done on the fly (if the synonym doesn't exist, write
it in the lexicon for the next time). There is some work here to find
the best and fastest way to keep information synchronized between index
and lexicon (hard link, log for nightly replay, complete iteration over
the index to find deleted and new stuff ...)

3) Similar (more than only Synonym) and Near (misspelled) words use the Lexicon.
https://issues.apache.org/jira/browse/LUCENE-1190

Extending it
1) The Lexicon can be used to store nouns, i.e. words that work better
together, like "New York", "Apple II" or "Alexander the Great".
Extracting nouns from a thesaurus is very hard, but the Wikipedia people
have done part of the work; article titles can be a good start for building
a noun list. And it works in many languages.
Nouns can be used as an intuitive PhraseQuery, or as suggestions for
refining results.


Implementing it well in Lucene
The SpellCheck and WordNet contribs do part of it, but in a specific and
non-extensible way. I think it's better when the foundation is vetted by the
Lucene maintainers, and afterwards contribs are built on top of this foundation.


M.


Otis Gospodnetic wrote:

Grant, I think Mathieu is hinting at his JIRA contribution (I looked at it 
briefly the other day, but haven't had the chance to really understand it).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Mathieu Lecarme [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Wednesday, March 12, 2008 5:47:40 AM
Subject: an API for synonym in Lucene-core

Why doesn't Lucene have a clean synonym API?
The WordNet contrib is not an answer: it provides an interface for its own
needs, and most of the world doesn't speak English.
Compass provides a tool, just like Solr. Lucene is the framework for
applications like Solr, Nutch and Compass, so why not backport the low-level
features of these projects?
A synonym API should provide a TokenFilter, an abstract storage that
maps a token to similar tokens with weights, and tools for expanding queries.
The OpenOffice dictionary project can provide data in different
languages, with compatible licences, I presume.


M.




an API for synonym in Lucene-core

2008-03-12 Thread Mathieu Lecarme

Why doesn't Lucene have a clean synonym API?
The WordNet contrib is not an answer: it provides an interface for its own
needs, and most of the world doesn't speak English.
Compass provides a tool, just like Solr. Lucene is the framework for
applications like Solr, Nutch and Compass, so why not backport the low-level
features of these projects?
A synonym API should provide a TokenFilter, an abstract storage that
maps a token to similar tokens with weights, and tools for expanding queries.
The OpenOffice dictionary project can provide data in different
languages, with compatible licences, I presume.


M.




[jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

2008-03-07 Thread Mathieu Lecarme (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12576415#action_12576415
 ] 

Mathieu Lecarme commented on LUCENE-1190:
-

A simpler preview of Lexicon features:
http://blog.garambrogne.net/index.php?post/2008/03/07/A-lexicon-approach-for-Lucene-index


 a lexicon object for merging spellchecker and synonyms from stemming
 

 Key: LUCENE-1190
 URL: https://issues.apache.org/jira/browse/LUCENE-1190
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*, Search
Affects Versions: 2.3
Reporter: Mathieu Lecarme
 Attachments: aphone+lexicon.patch, aphone+lexicon.patch


 Some Lucene features need a list of referring words. Spellchecking is the
 basic example, but synonyms are another use. Other tools can be used more
 smoothly with a list of words, without disturbing the main index: stemming
 and other simplifications of words (anagram, phonetic ...).
 For that, I suggest a Lexicon object, which contains words (Term + frequency),
 and which can be built from a Lucene Directory, or from plain text files.
 Classical TokenFilters can be used with Lexicon (LowerCaseFilter and
 ISOLatin1AccentFilter should be the most useful).
 Lexicon uses a Lucene Directory; each Word is a Document, and each meta is a
 Field (word, ngram, phonetic, fields, anagram, size ...).
 Above a minimum size, the number of different words used in an index can be
 considered stable. So a standard Lexicon (built from Wikipedia, for
 example) can be used.
 A similarTokenFilter is provided.
 A spellchecker will come soon.
 A fuzzySearch implementation and a neutral synonym TokenFilter can be done.
 Unused words can be removed on demand (lazy delete?)
 Any criticism or suggestions?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: [jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

2008-03-02 Thread Mathieu Lecarme

Hmm, the quote and the question disappeared.

On March 2, 2008, at 13:32, Mathieu Lecarme (JIRA) wrote:



   [ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12574214 
#action_12574214 ]


Mathieu Lecarme commented on LUCENE-1190:
-


 For example, I don't know what you mean by "Some Lucene features
need a list of referring word." Do you mean a list of associated
words?



With a FuzzyQuery, for example, you iterate over the Terms in the index,
looking for the nearest one. PrefixQuery or regular expressions work in
a similar way.
If you say that fuzzy querying will never give a word whose size differs
by more than 1 (size+1 or size-1), you can restrict the list of candidates,
and an ngram index can help you even more.

Some token filters destroy the word. Stemmers, for example. If you want
to search wide, a stemmer can help you, but you can't use PrefixQuery with
a stemmed word. So you can stem words in a lexicon and use them as
synonyms. You index "dog" and look for "doggy", "dogs" and "dog".
The Lexicon can use a static list of words, from a hunspell index or
Wikipedia parsing, or words extracted from your index.


 "Each meta is a Field": what do you mean by that?  Could you
please give an example?
For the word "lucene":

word:lucene
pop:42
anagram.anagram:celnu
aphone.start:LS
aphone.gram:LS
aphone.gram:SN
aphone.end:SN
aphone.size:3
aphone.phonem:LSN
ngram.start:lu
ngram.gram:lu
ngram.gram:uc
ngram.gram:ce
ngram.gram:en
ngram.gram:ne
ngram.end:ne
ngram.size:6
stemmer.stem:lucen
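A sketch of how the ngram.* metas above could be derived for a word. The field names follow the listing; this is a plain-Java illustration, not the actual Lexicon code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical derivation of the "ngram.*" metas shown above: leading
// bigram, all bigrams, trailing bigram, and the word length as size.
public class NgramMetas {
  public static List<String> metas(String word) {
    List<String> out = new ArrayList<String>();
    out.add("word:" + word);
    out.add("ngram.start:" + word.substring(0, 2));
    for (int i = 0; i + 2 <= word.length(); i++) {
      out.add("ngram.gram:" + word.substring(i, i + 2));
    }
    out.add("ngram.end:" + word.substring(word.length() - 2));
    out.add("ngram.size:" + word.length());
    return out;
  }
}
```

The aphone.* metas would be produced the same way from the phoneme string instead of the raw word.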




 Hm, not sure I know what you mean.  Are you saying that once you
create a sufficiently large lexicon/dictionary/index, the number of
new terms starts decreasing? (Heaps' law?
http://en.wikipedia.org/wiki/Heaps'_law)

Yes.


a lexicon object for merging spellchecker and synonyms from stemming


   Key: LUCENE-1190
   URL: https://issues.apache.org/jira/browse/LUCENE-1190
   Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*, Search
  Affects Versions: 2.3
  Reporter: Mathieu Lecarme
   Attachments: aphone+lexicon.patch, aphone+lexicon.patch


Some Lucene features need a list of referring words. Spellchecking
is the basic example, but synonyms are another use. Other tools can
be used more smoothly with a list of words, without disturbing the
main index: stemming and other simplifications of words (anagram,
phonetic ...).
For that, I suggest a Lexicon object, which contains words (Term +
frequency), and which can be built from a Lucene Directory, or from
plain text files.
Classical TokenFilters can be used with Lexicon (LowerCaseFilter and
ISOLatin1AccentFilter should be the most useful).
Lexicon uses a Lucene Directory; each Word is a Document, and each meta
is a Field (word, ngram, phonetic, fields, anagram, size ...).
Above a minimum size, the number of different words used in an index
can be considered stable. So a standard Lexicon (built from
Wikipedia, for example) can be used.

A similarTokenFilter is provided.
A spellchecker will come soon.
A fuzzySearch implementation and a neutral synonym TokenFilter can be
done.

Unused words can be removed on demand (lazy delete?)
Any criticism or suggestions?


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

2008-02-29 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-1190:


Attachment: aphone+lexicon.patch

 a lexicon object for merging spellchecker and synonyms from stemming
 

 Key: LUCENE-1190
 URL: https://issues.apache.org/jira/browse/LUCENE-1190
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*, Search
Affects Versions: 2.3
Reporter: Mathieu Lecarme
 Attachments: aphone+lexicon.patch, aphone+lexicon.patch


 Some Lucene features need a list of referring words. Spellchecking is the
 basic example, but synonyms are another use. Other tools can be used more
 smoothly with a list of words, without disturbing the main index: stemming
 and other simplifications of words (anagram, phonetic ...).
 For that, I suggest a Lexicon object, which contains words (Term + frequency),
 and which can be built from a Lucene Directory, or from plain text files.
 Classical TokenFilters can be used with Lexicon (LowerCaseFilter and
 ISOLatin1AccentFilter should be the most useful).
 Lexicon uses a Lucene Directory; each Word is a Document, and each meta is a
 Field (word, ngram, phonetic, fields, anagram, size ...).
 Above a minimum size, the number of different words used in an index can be
 considered stable. So a standard Lexicon (built from Wikipedia, for
 example) can be used.
 A similarTokenFilter is provided.
 A spellchecker will come soon.
 A fuzzySearch implementation and a neutral synonym TokenFilter can be done.
 Unused words can be removed on demand (lazy delete?)
 Any criticism or suggestions?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Created: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

2008-02-25 Thread Mathieu Lecarme (JIRA)
a lexicon object for merging spellchecker and synonyms from stemming


 Key: LUCENE-1190
 URL: https://issues.apache.org/jira/browse/LUCENE-1190
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*, Search
Affects Versions: 2.3
Reporter: Mathieu Lecarme
 Attachments: aphone+lexicon.patch

Some Lucene features need a list of referring words. Spellchecking is the basic
example, but synonyms are another use. Other tools can be used more smoothly
with a list of words, without disturbing the main index: stemming and other
simplifications of words (anagram, phonetic ...).
For that, I suggest a Lexicon object, which contains words (Term + frequency),
and which can be built from a Lucene Directory, or from plain text files.
Classical TokenFilters can be used with Lexicon (LowerCaseFilter and
ISOLatin1AccentFilter should be the most useful).
Lexicon uses a Lucene Directory; each Word is a Document, and each meta is a
Field (word, ngram, phonetic, fields, anagram, size ...).
Above a minimum size, the number of different words used in an index can be
considered stable. So a standard Lexicon (built from Wikipedia, for example)
can be used.
A similarTokenFilter is provided.
A spellchecker will come soon.
A fuzzySearch implementation and a neutral synonym TokenFilter can be done.
Unused words can be removed on demand (lazy delete?)

Any criticism or suggestions?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

2008-02-25 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-1190:


Attachment: aphone+lexicon.patch

 a lexicon object for merging spellchecker and synonyms from stemming
 

 Key: LUCENE-1190
 URL: https://issues.apache.org/jira/browse/LUCENE-1190
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*, Search
Affects Versions: 2.3
Reporter: Mathieu Lecarme
 Attachments: aphone+lexicon.patch


 Some Lucene features need a list of referring words. Spellchecking is the
 basic example, but synonyms are another use. Other tools can be used more
 smoothly with a list of words, without disturbing the main index: stemming
 and other simplifications of words (anagram, phonetic ...).
 For that, I suggest a Lexicon object, which contains words (Term + frequency),
 and which can be built from a Lucene Directory, or from plain text files.
 Classical TokenFilters can be used with Lexicon (LowerCaseFilter and
 ISOLatin1AccentFilter should be the most useful).
 Lexicon uses a Lucene Directory; each Word is a Document, and each meta is a
 Field (word, ngram, phonetic, fields, anagram, size ...).
 Above a minimum size, the number of different words used in an index can be
 considered stable. So a standard Lexicon (built from Wikipedia, for
 example) can be used.
 A similarTokenFilter is provided.
 A spellchecker will come soon.
 A fuzzySearch implementation and a neutral synonym TokenFilter can be done.
 Unused words can be removed on demand (lazy delete?)
 Any criticism or suggestions?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-956) phonem conversion from aspell dictionnary

2008-02-21 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-956:
---

Attachment: aphone.patch

New version, with more languages (bg, br, da, de, el, en, fo, fr, is, ru), and
a usable token filter. The usage is similar to the stem token filter.

 phonem conversion from aspell dictionnary
 -

 Key: LUCENE-956
 URL: https://issues.apache.org/jira/browse/LUCENE-956
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Affects Versions: 2.2
Reporter: Mathieu Lecarme
 Attachments: aphone.patch, aphone.patch


 First step to improve the Spellchecker's suggestions: phoneme conversion for
 different languages.
 The conversion code is built from Aspell file descriptions. The patch contains
 classes for managing English, French, Walloon and Swedish. If it works well,
 other available dictionaries from the Aspell project can be built.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: Need help for ordering results by specific order

2007-07-19 Thread Mathieu Lecarme
If I understand your needs correctly:
You ask Lucene for a set of words.
You want to sort results by the number of different words which match?
The query is not good; it would be

+content:(aleden bob carray)

I don't understand how you can sort at indexing time with information
known only at querying time.

M.
savageboy wrote:
 Yes, Mathieu.
 I just have the book Lucene in Action at hand; it is the Chinese language
 version, covering Lucene 1.4. I hope it is not too old.
 If I use SortComparatorSource, does that mean it will do the sort work at
 user query time?
 Can I sort (maybe score it at indexing time)?



 Mathieu Lecarme wrote:

 Have a look at the book Lucene in Action, ch 6.1: "using a custom
 sort method".

 SortComparatorSource might be your friend: Lucene selects the stuff,
 and you sort, just like you want.

 M.
 On July 18, 2007, at 10:29, savageboy wrote:

 
 Hi,
 I am new to Lucene.
 I have a project for a search engine using Lucene 2.0. But near the
 project's end, my boss wants me to order the results by the sort below:

 the query looks like '+content:aleden bob carray'

 content                                      date        order
 alden bob carray ...                         2005/12/23  1
 alden... alden ... bob... bob... carray...   2005/12/01  2
 alden... alden ... bob... carray             2005/11/28  3
 alden... carray                              2005/12/24  4
 alden... bob                                 2005/12/24  5

 The meaning of the sort above: no matter how many times the terms match
 in the field content, there will be four situations: 3 matched, 2 matched,
 1 matched, 0 matched. In the "3 matched" group, I need to sort the
 results by date descending, and the "2 matched" group is the same...

 But I don't know HOW to get these results in Lucene...
 Should I override the scoring method? (tf(t in d) term in field, idf(t)
 inverse doc frequency)
 Could you give me some references about it?

 I am really stuck, and need your help!!


 -- 
 View this message in context: http://www.nabble.com/Need-help-for- 
 ordering-results-by-specific-order-tf4101844.html#a11664583
 Sent from the Lucene - Java Developer mailing list archive at  
 Nabble.com.





Re: Need help for ordering results by specific order

2007-07-18 Thread Mathieu Lecarme
Have a look at the book Lucene in Action, ch 6.1: "using a custom
sort method".

SortComparatorSource might be your friend: Lucene selects the stuff,
and you sort, just like you want.

M.
On July 18, 2007, at 10:29, savageboy wrote:



Hi,
I am new to Lucene.
I have a project for a search engine using Lucene 2.0. But near the
project's end, my boss wants me to order the results by the sort below:

the query looks like '+content:aleden bob carray'

content                                      date        order
alden bob carray ...                         2005/12/23  1
alden... alden ... bob... bob... carray...   2005/12/01  2
alden... alden ... bob... carray             2005/11/28  3
alden... carray                              2005/12/24  4
alden... bob                                 2005/12/24  5

The meaning of the sort above: no matter how many times the terms match
in the field content, there will be four situations: 3 matched, 2 matched,
1 matched, 0 matched. In the "3 matched" group, I need to sort the
results by date descending, and the "2 matched" group is the same...

But I don't know HOW to get these results in Lucene...
Should I override the scoring method? (tf(t in d) term in field, idf(t)
inverse doc frequency)
Could you give me some references about it?

I am really stuck, and need your help!!


--
View this message in context: http://www.nabble.com/Need-help-for- 
ordering-results-by-specific-order-tf4101844.html#a11664583
Sent from the Lucene - Java Developer mailing list archive at  
Nabble.com.






Re: for a better spellchecker

2007-07-13 Thread Mathieu Lecarme
The SpellChecker code mixes indexing functions, n-gram treatment, and
querying functions. Extending it will not produce clean code.
Would it be relevant to first refactor the SpellChecker code to extract
the dictionary reading function and the indexing/searching functions?
SpellChecker would get a method to add a SpellEngine interface which
looks like:

interface SpellEngine {
  public void addWord(String word);
  public String[] suggestSimilar(String word, int numSug);
}

and something to sort suggestions, like distance from the suggested word.
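A minimal in-memory implementation of that interface shape might rank candidates by bigram overlap. This is a naive sketch under stated assumptions (an index-backed engine with a proper distance measure is the real goal; the class name and ranking are illustrative only):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Naive in-memory sketch of the SpellEngine shape proposed above,
// ranking known words by how many bigrams they share with the query.
public class NaiveSpellEngine {
  private final List<String> words = new ArrayList<String>();

  public void addWord(String word) {
    words.add(word);
  }

  public String[] suggestSimilar(String word, int numSug) {
    final Set<String> target = bigrams(word);
    List<String> ranked = new ArrayList<String>(words);
    // Sort descending by bigram overlap with the query word.
    ranked.sort((a, b) -> overlap(target, bigrams(b)) - overlap(target, bigrams(a)));
    return ranked.subList(0, Math.min(numSug, ranked.size())).toArray(new String[0]);
  }

  private static Set<String> bigrams(String s) {
    Set<String> out = new HashSet<String>();
    for (int i = 0; i + 2 <= s.length(); i++)
      out.add(s.substring(i, i + 2));
    return out;
  }

  private static int overlap(Set<String> a, Set<String> b) {
    int n = 0;
    for (String g : b)
      if (a.contains(g)) n++;
    return n;
  }
}
```

The point of the interface split is that this ranking strategy (n-gram, phonetic, edit distance...) becomes swappable without touching the indexing side.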

M.

On July 9, 2007, at 02:38, Chris Hostetter wrote:



: Now, SpellChecker uses the trigram algorithm to find similar words. It
: works well for keyboard fumbles, but not well enough for short words
: and for languages like French where the same sound can be written
: differently.
: Spellchecking is a classical computer task, and aspell provides some
: nice and free (it's GNU) sound dictionaries. Lots of dictionaries are
: available.

The topic of spell correction as it pertains to Lucene users can  
really

have two meanings:
  a) an attempt to suggest potential spell correction of query strings
provided by a user as a form of input pre-processing
  b) to use Lucene as a tool to suggest spell corrections based on  
a known

corpus.

The contrib/spellchecker code is an application of B -- it may in  
fact

be useful for A but that doesn't mean there aren't other non-Lucene
tools for achieving A as well.

: I did a Python parser which writes translation code in different
: languages: Python, PHP and Java. A bit like the Snowball stuff.
: A little work remains to generate Lucene-compliant code. But is the
: Python generator good enough for Lucene, or must a translation be
: done in Java to put it in the Lucene source?

the Lucene-Java repository tends to be about java code, but
contrib/javascript is an example of code that may be of general use to
Lucene-Java users that isn't java.


-Hoss










[jira] Created: (LUCENE-956) phonem conversion from aspell dictionnary

2007-07-11 Thread Mathieu Lecarme (JIRA)
phonem conversion from aspell dictionnary
-

 Key: LUCENE-956
 URL: https://issues.apache.org/jira/browse/LUCENE-956
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Affects Versions: 2.2
Reporter: Mathieu Lecarme


First step to improve the SpellChecker's suggestions: phoneme conversion for
different languages.
The conversion code is built from the aspell file description. The patch contains
classes for managing English, French, Walloon and Swedish. If it works well,
other dictionaries available from the aspell project can be built.
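[Editor's note] To illustrate the idea (not the actual patch): a phonetic key maps spellings that sound alike to the same string, so a misspelling collides with the intended word. The rules below are hypothetical, heavily simplified stand-ins for the tables aspell ships; the LUCENE-956 patch generates real rules from aspell's data files.

```java
import java.util.*;

public class PhoneticKey {
    // A few illustrative French spelling-to-sound collapses; a real table
    // would have many more rules, applied in a specific order.
    static String key(String word) {
        String k = word.toLowerCase(Locale.FRENCH);
        k = k.replace("eau", "o");  // "bateau" -> "bato"
        k = k.replace("au", "o");
        k = k.replace("ph", "f");   // "philosophie" -> "filosofie"
        k = k.replace("qu", "k");   // "quand" -> "kand"
        return k;
    }
}
```

Two words that sound alike then share a key, which is what lets the spellchecker index group them.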

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-956) phonem conversion from aspell dictionnary

2007-07-11 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-956:
---

Attachment: aphone.patch

 phonem conversion from aspell dictionnary
 -

 Key: LUCENE-956
 URL: https://issues.apache.org/jira/browse/LUCENE-956
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Affects Versions: 2.2
Reporter: Mathieu Lecarme
 Attachments: aphone.patch


 First step to improve the SpellChecker's suggestions: phoneme conversion for
 different languages.
 The conversion code is built from the aspell file description. The patch contains
 classes for managing English, French, Walloon and Swedish. If it works well,
 other dictionaries available from the aspell project can be built.






build.xml for a contrib wich depend on an other contrib

2007-07-10 Thread Mathieu Lecarme
The first version of the aspell-format phoneme converter in Java is almost
finished. The source builds with ant on its own, but it fails in the Lucene
trunk: the build depends on the SpellChecker contrib, which is built after
it. How can I fix it? A static spellchecker.jar in my contrib's lib
directory? A depends attribute in the right place on my compile target?






Re: [jira] Updated: (LUCENE-906) Elision filter for simple french analyzing

2007-06-28 Thread Mathieu Lecarme
Any news about the integration of this patch?

M.

Mathieu Lecarme (JIRA) wrote:
  [ 
 https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
  ]

 Mathieu Lecarme updated LUCENE-906:
 ---

 Attachment: elision-0.2.patch

 All suggested corrections are done.

   
 Elision filter for simple french analyzing
 --

 Key: LUCENE-906
 URL: https://issues.apache.org/jira/browse/LUCENE-906
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Mathieu Lecarme
 Attachments: elision-0.2.patch, elision.patch


 If you don't want to use stemming, StandardAnalyzer misses some French
 peculiarities like elision.
 "l'avion", which means "the plane", must be tokenized as "avion" (plane).
 This filter could be used with other Latin languages where elision exists.
 

   





[jira] Updated: (LUCENE-906) Elision filter for simple french analyzing

2007-06-13 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-906:
---

Attachment: elision-0.2.patch

All suggested corrections are done.

 Elision filter for simple french analyzing
 --

 Key: LUCENE-906
 URL: https://issues.apache.org/jira/browse/LUCENE-906
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Mathieu Lecarme
 Attachments: elision.patch


 If you don't want to use stemming, StandardAnalyzer misses some French
 peculiarities like elision.
 "l'avion", which means "the plane", must be tokenized as "avion" (plane).
 This filter could be used with other Latin languages where elision exists.






[jira] Updated: (LUCENE-906) Elision filter for simple french analyzing

2007-06-13 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-906:
---

Attachment: (was: elision-0.2.patch)

 Elision filter for simple french analyzing
 --

 Key: LUCENE-906
 URL: https://issues.apache.org/jira/browse/LUCENE-906
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Mathieu Lecarme
 Attachments: elision.patch


 If you don't want to use stemming, StandardAnalyzer misses some French
 peculiarities like elision.
 "l'avion", which means "the plane", must be tokenized as "avion" (plane).
 This filter could be used with other Latin languages where elision exists.






[jira] Updated: (LUCENE-906) Elision filter for simple french analyzing

2007-06-05 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-906:
---

Attachment: elision.patch

 Elision filter for simple french analyzing
 --

 Key: LUCENE-906
 URL: https://issues.apache.org/jira/browse/LUCENE-906
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Mathieu Lecarme
 Attachments: elision.patch


 If you don't want to use stemming, StandardAnalyzer misses some French
 peculiarities like elision.
 "l'avion", which means "the plane", must be tokenized as "avion" (plane).
 This filter could be used with other Latin languages where elision exists.






[jira] Created: (LUCENE-906) Elision filter for simple french analyzing

2007-06-05 Thread Mathieu Lecarme (JIRA)
Elision filter for simple french analyzing
--

 Key: LUCENE-906
 URL: https://issues.apache.org/jira/browse/LUCENE-906
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Mathieu Lecarme


If you don't want to use stemming, StandardAnalyzer misses some French
peculiarities like elision.
"l'avion", which means "the plane", must be tokenized as "avion" (plane).
This filter could be used with other Latin languages where elision exists.






using a french specific analyser without stemming

2007-06-04 Thread Mathieu Lecarme
For a project with a lot of Lucene search (via Compass), I had some
trouble with stemming. Stemming is nice for widening the search range,
but it makes completion strange.
So FrenchAnalyzer was not usable. A simpler StandardAnalyzer does the
job right, except for some French specifics, like elision. In French,
"the plane" translates to "l'avion" and not "le avion", and the
StandardTokenizer used by StandardAnalyzer can't tokenize it correctly.
So I wrote a specific filter (ElisionFilter); how can I contribute it
to Lucene? With a Jira ticket, or via the mailing list?


M.
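[Editor's note] For illustration, the core transformation such a filter performs can be sketched without the Lucene TokenFilter plumbing. The article list below is a plausible sample of French elided forms, not necessarily the one used in the patch:

```java
import java.util.*;

public class Elision {
    // Articles and pronouns that elide before a vowel in French.
    // In a real filter this list would be a constructor parameter.
    static final Set<String> CANDIDATES =
            new HashSet<>(Arrays.asList("l", "m", "t", "qu", "n", "s", "j", "d", "c"));

    // Core transformation: "l'avion" -> "avion". Only the ASCII apostrophe
    // is handled here; a full filter would also cover U+2019.
    static String stripElision(String token) {
        int i = token.indexOf('\'');
        if (i < 0) return token;
        return CANDIDATES.contains(token.substring(0, i).toLowerCase())
                ? token.substring(i + 1)
                : token;
    }
}
```

Note that a word like "aujourd'hui" passes through untouched, because "aujourd" is not an eliding article; only recognized prefixes are stripped.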

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]