[jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

2008-03-07 Thread Mathieu Lecarme (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576415#action_12576415
 ] 

Mathieu Lecarme commented on LUCENE-1190:
-

A simpler preview of the Lexicon features:
http://blog.garambrogne.net/index.php?post/2008/03/07/A-lexicon-approach-for-Lucene-index


> a lexicon object for merging spellchecker and synonyms from stemming
> 
>
> Key: LUCENE-1190
> URL: https://issues.apache.org/jira/browse/LUCENE-1190
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*, Search
>Affects Versions: 2.3
>Reporter: Mathieu Lecarme
> Attachments: aphone+lexicon.patch, aphone+lexicon.patch
>
>
> Some Lucene features need a reference list of words. Spellchecking is the
> basic example, but synonyms are another use. Other tools work more
> smoothly with a list of words, without disturbing the main index: stemming
> and other simplifications of words (anagram, phonetic ...).
> For that, I suggest a Lexicon object, which contains words (Term + frequency)
> and which can be built from a Lucene Directory or from plain text files.
> Classical TokenFilters can be used with the Lexicon (LowerCaseFilter and
> ISOLatin1AccentFilter should be the most useful).
> The Lexicon uses a Lucene Directory: each Word is a Document, each meta is a
> Field (word, ngram, phonetic, fields, anagram, size ...).
> Above a minimum size, the number of distinct words used in an index can be
> considered stable. So a standard Lexicon (built from Wikipedia, for
> example) can be used.
> A similarTokenFilter is provided.
> A spellchecker will come soon.
> A fuzzySearch implementation and a neutral synonym TokenFilter can be done.
> Unused words can be removed on demand (lazy delete?)
> Any criticism or suggestions?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

2008-03-02 Thread Mathieu Lecarme (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574214#action_12574214
 ] 

Mathieu Lecarme commented on LUCENE-1190:
-


With a FuzzyQuery, for example, you iterate over the Terms in the index,
looking for the nearest ones. PrefixQuery and regular-expression queries
work in a similar way.
If you decide that fuzzy querying will never return a word whose size
differs by more than 1 (size+1 or size-1), you can restrict the list of
candidates, and an ngram index can help you even more.
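
As a rough illustration of that restriction, here is a minimal Java sketch
that does not use Lucene at all. The class and method names
(FuzzyCandidates, bigrams, restrict) are hypothetical and are not taken from
the patch; the in-memory list stands in for the Lexicon, and "shares at least
one bigram" is only an assumed way of using the ngram fields.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the attached patch.
final class FuzzyCandidates {

    // Character bigrams of a word, e.g. "dog" -> {"do", "og"}.
    static Set<String> bigrams(String word) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            grams.add(word.substring(i, i + 2));
        }
        return grams;
    }

    // Keep only the words whose length differs from the query by at most 1
    // and that share at least one bigram with it (assumed criterion).
    static List<String> restrict(String query, List<String> lexicon) {
        Set<String> queryGrams = bigrams(query);
        List<String> candidates = new ArrayList<>();
        for (String word : lexicon) {
            if (Math.abs(word.length() - query.length()) > 1) {
                continue; // size filter: only size-1, size, size+1 pass
            }
            Set<String> shared = bigrams(word);
            shared.retainAll(queryGrams);
            if (!shared.isEmpty()) {
                candidates.add(word);
            }
        }
        return candidates;
    }
}
```

Only the surviving candidates would then need the costly edit-distance
comparison.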

Some token filters destroy the word; a stemmer, for example. If you want
to search widely, a stemmer can help you, but you can't use PrefixQuery
with stemmed words. So you can stem the words in a lexicon and use the
stems as synonyms: you index "dog" and look for "doggy", "dogs" and "dog".
A Lexicon can use a static list of words, from a hunspell index or Wikipedia
parsing, or words extracted from your index.

For the word "Lucene", the fields are:

word:lucene
pop:42
anagram.anagram:celnu
aphone.start:LS
aphone.gram:LS
aphone.gram:SN
aphone.end:SN
aphone.size:3
aphone.phonem:LSN
ngram.start:lu
ngram.gram:lu
ngram.gram:uc
ngram.gram:ce
ngram.gram:en
ngram.gram:ne
ngram.end:ne
ngram.size:6
stemmer.stem:lucen
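
The ngram and anagram lines of this listing can be reproduced with a few
lines of plain Java. This is only a sketch: the class LexiconFields and its
methods are hypothetical names, and the anagram rule (distinct letters,
sorted) is inferred from the single "celnu" example rather than taken from
the patch. The pop, aphone and stemmer fields are left out because they need
the index frequency, the aspell rules and a stemmer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical helper, not part of the attached patch.
final class LexiconFields {

    // All contiguous n-grams of the word, in order of appearance.
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Assumed anagram key: the distinct letters of the word, sorted.
    static String anagramKey(String word) {
        TreeSet<Character> letters = new TreeSet<>();
        for (char c : word.toCharArray()) {
            letters.add(c);
        }
        StringBuilder key = new StringBuilder();
        for (char c : letters) {
            key.append(c);
        }
        return key.toString();
    }

    // Field lines in the "name:value" layout used in the listing.
    static List<String> describe(String word, int n) {
        String w = word.toLowerCase(Locale.ROOT);
        List<String> grams = ngrams(w, n);
        List<String> fields = new ArrayList<>();
        fields.add("word:" + w);
        fields.add("anagram.anagram:" + anagramKey(w));
        if (!grams.isEmpty()) {
            fields.add("ngram.start:" + grams.get(0));
            for (String gram : grams) {
                fields.add("ngram.gram:" + gram);
            }
            fields.add("ngram.end:" + grams.get(grams.size() - 1));
        }
        fields.add("ngram.size:" + w.length());
        return fields;
    }
}
```

Calling LexiconFields.describe("Lucene", 2) yields the word, anagram and
ngram lines shown in the listing.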


Yes.

M.





[jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

2008-02-29 Thread Mathieu Lecarme (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12573907#action_12573907
 ] 

Mathieu Lecarme commented on LUCENE-1190:
-

New features:
A helper to extend a query with the words similar to each term:
+type:dog +name:rintint*
will become:
+type:(+dog (dogs doggy)^0.7) +name:rintint*
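
A minimal sketch of that rewriting, working on plain strings only. The
class SimilarityExpander and its expand method are hypothetical names; the
helper in the patch presumably works on Lucene Query objects, which this
sketch does not try to reproduce. The list of similar words is assumed to
come from the Lexicon.

```java
import java.util.List;

// Hypothetical helper, not part of the attached patch.
final class SimilarityExpander {

    // Builds "+field:(+term (alt1 alt2)^boost)": the original term stays
    // required, the similar words are optional and weighted down.
    // With no similar words, the clause is left as "+field:term".
    static String expand(String field, String term, List<String> similar, float boost) {
        if (similar.isEmpty()) {
            return "+" + field + ":" + term;
        }
        return "+" + field + ":(+" + term
                + " (" + String.join(" ", similar) + ")^" + boost + ")";
    }
}
```

For example, expand("type", "dog", Arrays.asList("dogs", "doggy"), 0.7f)
returns the first clause of the rewritten query above.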

A "Did you mean" pattern packaged over IndexSearcher. If the number of search
results is under a threshold, a sorted suggestion list is provided for each
term, along with a rewritten query sentence:
truc:brawn
will become:
truc:brown 
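
The suggestion step can be sketched in plain Java as follows. Everything
here is an assumption made for illustration: the class DidYouMean, the use
of Levenshtein distance, the sort order (nearest first, then most frequent)
and the in-memory map of word to frequency that stands in for the Lexicon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the attached patch.
final class DidYouMean {

    // Classic dynamic-programming Levenshtein distance, two rows at a time.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Words within maxDistance of the term: nearest first, then most frequent.
    static List<String> suggest(String term, Map<String, Integer> lexicon, int maxDistance) {
        List<String> out = new ArrayList<>();
        for (String word : lexicon.keySet()) {
            if (!word.equals(term) && distance(term, word) <= maxDistance) {
                out.add(word);
            }
        }
        out.sort((x, y) -> {
            int byDistance = Integer.compare(distance(term, x), distance(term, y));
            if (byDistance != 0) {
                return byDistance;
            }
            int byFrequency = Integer.compare(lexicon.get(y), lexicon.get(x));
            return byFrequency != 0 ? byFrequency : x.compareTo(y);
        });
        return out;
    }

    // Rewrites "field:term" only when the hit count is under the threshold.
    static String rewrite(String field, String term, int hits, int threshold,
                          Map<String, Integer> lexicon) {
        if (hits >= threshold) {
            return field + ":" + term;
        }
        List<String> suggestions = suggest(term, lexicon, 1);
        return field + ":" + (suggestions.isEmpty() ? term : suggestions.get(0));
    }
}
```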







[jira] Updated: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

2008-02-29 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-1190:


Attachment: aphone+lexicon.patch




[jira] Updated: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

2008-02-25 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-1190:


Attachment: aphone+lexicon.patch




[jira] Created: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

2008-02-25 Thread Mathieu Lecarme (JIRA)
a lexicon object for merging spellchecker and synonyms from stemming


 Key: LUCENE-1190
 URL: https://issues.apache.org/jira/browse/LUCENE-1190
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*, Search
Affects Versions: 2.3
Reporter: Mathieu Lecarme
 Attachments: aphone+lexicon.patch

Some Lucene features need a reference list of words. Spellchecking is the basic
example, but synonyms are another use. Other tools work more smoothly with
a list of words, without disturbing the main index: stemming and other
simplifications of words (anagram, phonetic ...).
For that, I suggest a Lexicon object, which contains words (Term + frequency)
and which can be built from a Lucene Directory or from plain text files.
Classical TokenFilters can be used with the Lexicon (LowerCaseFilter and
ISOLatin1AccentFilter should be the most useful).
The Lexicon uses a Lucene Directory: each Word is a Document, each meta is a
Field (word, ngram, phonetic, fields, anagram, size ...).
Above a minimum size, the number of distinct words used in an index can be
considered stable. So a standard Lexicon (built from Wikipedia, for example)
can be used.
A similarTokenFilter is provided.
A spellchecker will come soon.
A fuzzySearch implementation and a neutral synonym TokenFilter can be done.
Unused words can be removed on demand (lazy delete?)

Any criticism or suggestions?
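
To make the "Term + frequency" idea concrete, here is a small sketch of
building such a word list from plain text lines, without Lucene. The class
PlainTextLexicon is a hypothetical name; lowercasing stands in for
LowerCaseFilter, and splitting on non-letters is an assumed tokenization,
not the one used by the patch.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the attached patch.
final class PlainTextLexicon {

    // Counts lowercased words; anything that is not a letter separates words.
    static Map<String, Integer> build(List<String> lines) {
        Map<String, Integer> frequencies = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    frequencies.merge(token, 1, Integer::sum);
                }
            }
        }
        return frequencies;
    }
}
```

Each entry of the resulting map would correspond to one Document of the
Lexicon, with the count stored as its frequency.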





[jira] Updated: (LUCENE-956) phonem conversion from aspell dictionnary

2008-02-21 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-956:
---

Attachment: aphone.patch

New version, with more languages (bg, br, da, de, el, en, fo, fr, is, ru) and
a usable token filter. The use case is similar to that of a stem token filter.

> phonem conversion from aspell dictionnary
> -
>
> Key: LUCENE-956
> URL: https://issues.apache.org/jira/browse/LUCENE-956
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Affects Versions: 2.2
>Reporter: Mathieu Lecarme
> Attachments: aphone.patch, aphone.patch
>
>
> First step to improve the Spellchecker's suggestions: phoneme conversion for
> different languages.
> The conversion code is built from the aspell description files. The patch
> contains classes for handling English, French, Walloon and Swedish. If it
> works well, the other dictionaries available from the aspell project can be built.




[jira] Updated: (LUCENE-956) phonem conversion from aspell dictionnary

2007-07-11 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-956:
---

Attachment: aphone.patch




[jira] Created: (LUCENE-956) phonem conversion from aspell dictionnary

2007-07-11 Thread Mathieu Lecarme (JIRA)
phonem conversion from aspell dictionnary
-

 Key: LUCENE-956
 URL: https://issues.apache.org/jira/browse/LUCENE-956
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Affects Versions: 2.2
Reporter: Mathieu Lecarme


First step to improve the Spellchecker's suggestions: phoneme conversion for
different languages.
The conversion code is built from the aspell description files. The patch
contains classes for handling English, French, Walloon and Swedish. If it
works well, the other dictionaries available from the aspell project can be built.




[jira] Updated: (LUCENE-906) Elision filter for simple french analyzing

2007-06-13 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-906:
---

Attachment: elision-0.2.patch

All suggested corrections are done.

> Elision filter for simple french analyzing
> --
>
> Key: LUCENE-906
> URL: https://issues.apache.org/jira/browse/LUCENE-906
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Reporter: Mathieu Lecarme
> Attachments: elision-0.2.patch, elision.patch
>
>
> If you don't want to use stemming, StandardAnalyzer misses some French
> peculiarities such as elision.
> "l'avion", which means "the plane", must be tokenized as "avion" (plane).
> This filter could be used with other Latin languages in which elision exists.




[jira] Updated: (LUCENE-906) Elision filter for simple french analyzing

2007-06-13 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-906:
---

Attachment: (was: elision-0.2.patch)




[jira] Updated: (LUCENE-906) Elision filter for simple french analyzing

2007-06-13 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-906:
---

Attachment: elision-0.2.patch

All suggested corrections are done.




[jira] Updated: (LUCENE-906) Elision filter for simple french analyzing

2007-06-05 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-906:
---

Attachment: elision.patch




[jira] Created: (LUCENE-906) Elision filter for simple french analyzing

2007-06-05 Thread Mathieu Lecarme (JIRA)
Elision filter for simple french analyzing
--

 Key: LUCENE-906
 URL: https://issues.apache.org/jira/browse/LUCENE-906
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Mathieu Lecarme


If you don't want to use stemming, StandardAnalyzer misses some French
peculiarities such as elision.
"l'avion", which means "the plane", must be tokenized as "avion" (plane).
This filter could be used with other Latin languages in which elision exists.
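
The core of such a filter can be sketched on plain strings, outside the
TokenFilter API. The class name Elision, the strip method and above all the
list of elided forms are assumptions made for this example; they are not
copied from the patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the attached patch.
final class Elision {

    // Assumed list of elided French articles and pronouns.
    static final Set<String> ARTICLES = new HashSet<>(
            Arrays.asList("l", "d", "j", "m", "t", "s", "n", "c", "qu"));

    // Drops a leading elided form: "l'avion" -> "avion".
    // Tokens whose prefix is not in the list are returned unchanged.
    static String strip(String token) {
        int apos = token.indexOf('\'');
        if (apos < 0) {
            apos = token.indexOf('\u2019'); // typographic apostrophe
        }
        if (apos > 0 && apos < token.length() - 1
                && ARTICLES.contains(token.substring(0, apos).toLowerCase(Locale.ROOT))) {
            return token.substring(apos + 1);
        }
        return token;
    }
}
```

Checking the prefix against a list keeps words such as "aujourd'hui"
intact, since "aujourd" is not an elided article.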
