Handling token created/deleted events in an Index
With LUCENE-1297, the SpellChecker will be able to choose how to estimate the distance between two words. Here are some other enhancements:
* The capacity to synchronize the main Index and the SpellChecker Index. Handling token creation is easy; a simple TokenFilter can do the work. But token deletion is harder. Lazy deletion can be used if the token's popularity is checked in the main Index each time. That is a pull strategy; a push from the Directory should be lighter.
* Choosing the similarity strategy. Currently it is only an n-gram computation. Homophony could be nice, for example.
* The spell Index can be used for dynamic similarity without disturbing the main Index. For example, Snowball is nice for grouping words by their roots, but it disturbs the Index if you want to make a starts-with query.
Some time ago I suggested a patch, LUCENE-1190, but I guess it was too monolithic. A more modular approach would be better. Any comments or suggestions? M. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
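The lazy-deletion (pull) idea above can be sketched in plain Java. This is not Lucene API: SpellLexicon, isAlive and the frequency map are hypothetical names standing in for the spell index and the main Index.

```java
import java.util.*;

// Hypothetical sketch of lazy deletion: a spell lexicon pruned on demand
// against the main index. The Map stands in for the main Index's term
// frequencies; none of these names are Lucene classes.
class SpellLexicon {
    private final Set<String> words = new HashSet<>();
    private final Map<String, Integer> mainIndexFreq;

    SpellLexicon(Map<String, Integer> mainIndexFreq) {
        this.mainIndexFreq = mainIndexFreq;
    }

    void addWord(String word) { words.add(word); }

    // Pull strategy: before using a word, check its popularity in the main
    // index; drop it from the lexicon if it no longer occurs there.
    boolean isAlive(String word) {
        if (mainIndexFreq.getOrDefault(word, 0) == 0) {
            words.remove(word); // lazy deletion
            return false;
        }
        return true;
    }

    Set<String> words() { return words; }
}
```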
Re: WebLuke - include Jetty in Lucene binary distribution?
markharw00d wrote: Any word on getting this committed as a contrib? Not really changed the code since the message below. I can commit pretty much the contents of the zip file below any time you want. Do folks still feel comfortable with the bloat this adds to the Lucene source distro? The gwt-dev-windows.jar contains the Java-to-JavaScript compiler necessary for building and alone accounts for 10 MB. Including Jetty adds another ~6 MB on top of that. OK with this? Why not use Ivy or Maven for that? M.
Re: Storing phrases in index
palexv wrote: Thanks! Can you help me get the ShingleFilter class? It is absent in version 2.3.1. How can I get it? It's in the SVN version. You can backport it, or build your own with a Stack. M.
Re: Optimise Indexing time using lucene..
lucene4varma wrote: Hi all, I am new to Lucene and am using it for text search in my web application, and for that I need to index records from a database. We are using a JDBC directory to store the indexes. Now the problem is when I start the process of indexing the records for the first time, it takes a huge amount of time. Following is the code for indexing: rs = st.executeQuery(); // returns 2 million records while(rs.next()) { create java object ...; index java record into JDBC directory ...; } The above process takes a huge amount of time for 2 million records. Approximately it takes 3-4 business days to run. Can anyone please suggest an approach by which I could cut down this time? A JDBC directory is not a good idea; it's only useful when you need a central repository. Use a large maxBufferedDocs in your IndexWriter. With a large amount of data, you'll hit a bottleneck: database reading, index writing, RAM for buffered docs, maybe CPU. If database reading is the slow part and you are in a hurry, you can shard the index between multiple computers and, when it's finished, merge all the indexes, with champagne. M.
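To see why a larger maxBufferedDocs helps, a back-of-the-envelope sketch: the writer flushes a segment roughly every maxBufferedDocs documents, so the buffer size directly controls the number of slow disk flushes. FlushEstimate is our illustrative name, not a Lucene class.

```java
// Illustrative arithmetic only: with maxBufferedDocs = B, the writer flushes
// a new segment roughly every B documents, so a larger buffer means fewer
// (slow) disk flushes.
class FlushEstimate {
    static long flushes(long docs, int maxBufferedDocs) {
        // ceil(docs / B)
        return (docs + maxBufferedDocs - 1) / maxBufferedDocs;
    }
}
```

With 2 million records, raising the buffer from 10 to 10,000 documents cuts the segment flushes from 200,000 down to 200.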
Re: Storing phrases in index
palexv wrote: Hello all. I have a question for the Lucene-advanced. I have a set of phrases which I need to store in an index. Is there a way of storing phrases as terms in the index? What is the best way of writing such an index? Should this field be tokenized? Not tokenized. What is the best way of searching phrases by mask in such an index? Should I use BooleanQuery, WildcardQuery or SpanQuery? If you search for the complete phrase, just use a Term; if you search for part of a phrase, use ShingleFilter. What is the best way to escape the maxClauses exception when searching something like a*? Index the expanded terms. M.
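A minimal sketch of what ShingleFilter produces, assuming a simple word n-gram definition (the Shingles class below is illustrative, not the contrib code):

```java
import java.util.*;

// Word n-grams ("shingles") from a token list, so a phrase fragment can be
// matched as a single term instead of an expensive multi-clause query.
class Shingles {
    static List<String> shingles(List<String> tokens, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        return out;
    }
}
```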
Re: shingles and punctuations
Setting a flag in a filter is easy:
8<---
package org.apache.lucene.analysis.shingle;

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * @author Mathieu Lecarme
 */
public class SentenceCutterFilter extends TokenFilter {
  public static final int FLAG = 42;
  private Token previous = null;

  protected SentenceCutterFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    Token current = input.next();
    if (current == null)
      return null;
    // a gap of more than one character hints at punctuation: new sentence
    if (previous == null || (current.startOffset() - previous.endOffset()) > 1)
      current.setFlags(FLAG);
    previous = current;
    return current;
  }
}
8<---
and using it at the right place is tricky:
8<---
String test = "This is a test, a big test";
TokenStream stream = new StopFilter(
    new ShingleFilter(
        new SentenceCutterFilter(
            new LowerCaseFilter(
                new ISOLatin1AccentFilter(
                    new StandardTokenizer(new StringReader(test))))), 3),
    new String[]{"is", "a"});
8<---
But I must be too tired: I can't manage to patch the ShingleFilter to handle the flag. I guess a flag should be a bit, tested with a mask. M. On Apr 6, 2008, at 22:53, Grant Ingersoll wrote: For now, it's up to your app to know, unfortunately :-( I think the WikipediaTokenizer is the only one using flags currently in Lucene. On Apr 6, 2008, at 10:43 PM, Mathieu Lecarme wrote: I'll use Token flags to specify the first token in a sentence, but how does it work? How are flag collisions avoided? To keep it simple, I'll take 1 as the flag, but what happens if another filter uses the same flag? M. On Apr 6, 2008, at 20:13, Grant Ingersoll wrote: I think you need sentence detection to take place further upstream. Then you could use the Token type or Token flags to indicate punctuation, sentences, whatever, and we could patch the shingle filter to ignore these things, or break and move on to the next one. 
-Grant On Apr 6, 2008, at 7:23 PM, Mathieu Lecarme wrote: The new ShingleFilter is very helpful to fetch groups of words, but it doesn't handle punctuation or any separation. If you feed it with multiple sentences, you will get shingles that start in one sentence and end in the next. To avoid that, you can look at token positions: if there is a gap of more than one character from the previous token, it should be punctuation (or a typo). Any suggestions to keep shingles within the same sentence? M.
shingles and punctuations
The new ShingleFilter is very helpful to fetch groups of words, but it doesn't handle punctuation or any separation. If you feed it with multiple sentences, you will get shingles that start in one sentence and end in the next. To avoid that, you can look at token positions: if there is a gap of more than one character from the previous token, it should be punctuation (or a typo). Any suggestions to keep shingles within the same sentence? M.
Re: WordNet synonyms overhead
Harald Näger wrote: Hi, I am especially interested in the WordNet synonym expansion that was discussed in the Lucene in Action book. Is there anyone here on the list who has experience with this approach? I'm curious about how much the synonym expansion will increase the size of an index. Are there any reliable figures from real-life applications? Query expansion is better than index expansion: faster to use, smaller index, less noise when you search. M.
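Query expansion, as suggested above, can be sketched as rewriting the user's term into an OR group at query time instead of writing synonyms into the index. The synonym map and the QueryExpander name are illustrative, not Lucene API.

```java
import java.util.*;

// Sketch of query expansion: the index stays untouched; the user's term is
// widened with its synonyms at query time.
class QueryExpander {
    static String expand(String term, Map<String, List<String>> synonyms) {
        List<String> alts = new ArrayList<>();
        alts.add(term);
        alts.addAll(synonyms.getOrDefault(term, Collections.emptyList()));
        // no synonyms known: leave the term untouched
        return alts.size() == 1 ? term : "(" + String.join(" OR ", alts) + ")";
    }
}
```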
Re: [jira] Created: (LUCENE-1229) NGramTokenFilter optimization in query phase
Hiroaki Kawai (JIRA) wrote: NGramTokenFilter optimization in query phase Key: LUCENE-1229 URL: https://issues.apache.org/jira/browse/LUCENE-1229 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: Hiroaki Kawai I found that an NGramTokenFilter-ed token stream could be optimized in queries. A standard 1,2 NGramTokenFilter will generate a token stream from abcde as follows: a ab b bc c cd d de e When we index abcde, we'll use all of the tokens. But when we query, we only need: ab cd de I don't understand why you would index something that you will never query. Why don't you use a bigram? M.
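The token streams discussed above can be reproduced with a small n-gram helper (grouped by gram size rather than interleaved like the analyzer's output; Ngrams is our name, not the contrib class):

```java
import java.util.*;

// Character n-grams of a string, for all sizes between min and max.
// grams("abcde", 2, 2) is the plain bigram stream the reply suggests;
// grams("abcde", 1, 2) contains every token a 1,2 NGramTokenFilter emits.
class Ngrams {
    static List<String> grams(String s, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int n = min; n <= max; n++)
            for (int i = 0; i + n <= s.length(); i++)
                out.add(s.substring(i, i + n));
        return out;
    }
}
```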
Re: an API for synonym in Lucene-core
I'll slice my contrib into small parts.
Synonyms:
1) Synonym (Token + a weight)
2) Synonym provider from the OO.o thesaurus
3) SynonymTokenFilter
4) Query expander which applies a filter (and a boost) on each of its TermQuery
5) a Synonym filter for the query expander
6) to be efficient, Synonyms can be excluded if they don't exist in the Index
7) Stemming can be used as a dynamic Synonym
Spell checking, or the "did you mean?" pattern:
1) The main concept is in the SpellCheck contrib, but in a non-expandable implementation
2) In some languages, like French, homophony is very important in misspelling; there is more than one way to write the same sound
3) Homophony rules are provided by Aspell in a neutral language (just like Snowball for stemming); I implemented a translator to build Java classes from aspell files (it's the same format in the aspell descendants myspell and hunspell, which are used in the OO.o and Firefox families) https://issues.apache.org/jira/browse/LUCENE-956
Storing information about words found in an index:
1) It's the Dictionary used in the SpellCheck contrib, in a more open way: a lexicon. It's a plain old Lucene index; a word becomes a Document, and Fields store computed information like size, n-gram tokens and homophony. All of it uses filters taken from TokenFilter; code duplication is avoided.
2) This information need not be synchronized with the index, in order not to slow down the indexing process, so some information needs to be checked lazily (does this synonym already exist in the index?), and lexicon correction can be done on the fly (if the synonym doesn't exist, write it in the lexicon for the next time). There is some work here to find the best and fastest way to keep information synchronized between index and lexicon (hard links, a log for nightly replay, a complete iteration over the index to find deleted and new stuff ...)
3) Similar (more than only Synonym) and Near (misspelled) words use the Lexicon. 
https://issues.apache.org/jira/browse/LUCENE-1190
Extending it:
1) The Lexicon can be used to store Nouns, i.e. words that work better together, like New York, Apple II or Alexander the Great. Extracting nouns from a thesaurus is very hard, but the Wikipedia people have done part of the work: article titles can be a good start for building a noun list. And it works in many languages. A Noun can be used as an intuitive PhraseQuery, or as a suggestion for refining results.
Implementing it well in Lucene: the SpellCheck and WordNet contribs do part of it, but in a specific and non-extensible way. I think it's better when the foundation is checked by the Lucene maintainers, and afterwards the contrib is built on top of this foundation. M.
Otis Gospodnetic wrote: Grant, I think Mathieu is hinting at his JIRA contribution (I looked at it briefly the other day, but haven't had the chance to really understand it). Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Mathieu Lecarme [EMAIL PROTECTED] To: java-dev@lucene.apache.org Sent: Wednesday, March 12, 2008 5:47:40 AM Subject: an API for synonym in Lucene-core Why doesn't Lucene have a clean synonym API? The WordNet contrib is not an answer; it provides an interface for its own needs, and most of the world doesn't speak English. Compass provides a tool, just like Solr. Lucene is the framework for applications like Solr, Nutch or Compass; why not backport low-level features from those projects? A synonym API should provide a TokenFilter, an abstract storage that maps a token to similar tokens with weights, and tools for expanding queries. The OpenOffice dictionary project can provide data in different languages, with compatible licences, I presume. M.
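Points 1 and 2 of the synonym plan might look like this in code; all names here (Synonym, SynonymProvider, MapSynonymProvider) are a hypothetical sketch, not the attached patch:

```java
import java.util.*;

// A Synonym is a token plus a weight; a provider maps a word to its synonyms.
class Synonym {
    final String token;
    final float weight;
    Synonym(String token, float weight) {
        this.token = token;
        this.weight = weight;
    }
}

interface SynonymProvider {
    List<Synonym> synonymsOf(String word);
}

// Simplest possible provider: an in-memory map. A real one could read the
// OO.o thesaurus files instead.
class MapSynonymProvider implements SynonymProvider {
    private final Map<String, List<Synonym>> map = new HashMap<>();

    void add(String word, String synonym, float weight) {
        map.computeIfAbsent(word, k -> new ArrayList<>())
           .add(new Synonym(synonym, weight));
    }

    public List<Synonym> synonymsOf(String word) {
        return map.getOrDefault(word, Collections.emptyList());
    }
}
```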
an API for synonym in Lucene-core
Why doesn't Lucene have a clean synonym API? The WordNet contrib is not an answer; it provides an interface for its own needs, and most of the world doesn't speak English. Compass provides a tool, just like Solr. Lucene is the framework for applications like Solr, Nutch or Compass; why not backport low-level features from those projects? A synonym API should provide a TokenFilter, an abstract storage that maps a token to similar tokens with weights, and tools for expanding queries. The OpenOffice dictionary project can provide data in different languages, with compatible licences, I presume. M.
[jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming
[ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12576415#action_12576415 ] Mathieu Lecarme commented on LUCENE-1190: - A simpler preview of the Lexicon features: http://blog.garambrogne.net/index.php?post/2008/03/07/A-lexicon-approach-for-Lucene-index a lexicon object for merging spellchecker and synonyms from stemming Key: LUCENE-1190 URL: https://issues.apache.org/jira/browse/LUCENE-1190 Project: Lucene - Java Issue Type: New Feature Components: contrib/*, Search Affects Versions: 2.3 Reporter: Mathieu Lecarme Attachments: aphone+lexicon.patch, aphone+lexicon.patch Some Lucene features need a list of referring word. Spellchecking is the basic example, but synonyms are another use. Other tools can be used more smoothly with a list of words, without disturbing the main index: stemming and other simplifications of words (anagram, phonetic ...). For that, I suggest a Lexicon object, which contains words (Term + frequency) and which can be built from a Lucene Directory, or from plain text files. Classical TokenFilters can be used with the Lexicon (LowerCaseFilter and ISOLatin1AccentFilter should be the most useful). The Lexicon uses a Lucene Directory; each Word is a Document, each meta is a Field (word, ngram, phonetic, fields, anagram, size ...). Above a minimum size, the number of different words used in an index can be considered stable, so a standard Lexicon (built from Wikipedia, for example) can be used. A SimilarTokenFilter is provided. A spellchecker will come soon. A fuzzySearch implementation and a neutral synonym TokenFilter can be done. Unused words can be removed on demand (lazy delete?) Any criticism or suggestions? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming
Hum, the quote and question disappeared. On Mar 2, 2008, at 13:32, Mathieu Lecarme (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12574214 #action_12574214 ] Mathieu Lecarme commented on LUCENE-1190: - For example, I don't know what you mean by "Some Lucene features need a list of referring word". Do you mean a list of associated words? With a FuzzyQuery, for example, you iterate over the Terms in the index, looking for the nearest one. PrefixQuery or regular expressions work in a similar way. If you say that fuzzy querying will never give a word whose size differs by more than 1 (size+1 or size-1), you can restrict the list of candidates, and an ngram index can help you even more. Some token filters destroy the word: stemmers, for example. If you want to search wide, a stemmer can help you, but you can't use a PrefixQuery with a stemmed word. So you can stem words in a lexicon and use the stem as a synonym: you index dog and look for doggy, dogs and dog. "Each meta is a Field": what do you mean by that? Could you please give an example? For the word Lucene:
word:lucene
pop:42
anagram.anagram:celnu
aphone.start:LS aphone.gram:LS aphone.gram:SN aphone.end:SN aphone.size:3 aphone.phonem:LSN
ngram.start:lu ngram.gram:lu ngram.gram:uc ngram.gram:ce ngram.gram:en ngram.gram:ne ngram.end:ne ngram.size:6
stemmer.stem:lucen
Hm, not sure I know what you mean. Are you saying that once you create a sufficiently large lexicon/dictionary/index, the number of new terms starts decreasing? (Heap's Law? http://en.wikipedia.org/wiki/Heaps'_law ) Yes. 
a lexicon object for merging spellchecker and synonyms from stemming Key: LUCENE-1190 URL: https://issues.apache.org/jira/browse/LUCENE-1190 Project: Lucene - Java Issue Type: New Feature Components: contrib/*, Search Affects Versions: 2.3 Reporter: Mathieu Lecarme Attachments: aphone+lexicon.patch, aphone+lexicon.patch Some Lucene features need a list of referring word. Spellchecking is the basic example, but synonyms are another use. Other tools can be used more smoothly with a list of words, without disturbing the main index: stemming and other simplifications of words (anagram, phonetic ...). For that, I suggest a Lexicon object, which contains words (Term + frequency) and which can be built from a Lucene Directory, or from plain text files. Classical TokenFilters can be used with the Lexicon (LowerCaseFilter and ISOLatin1AccentFilter should be the most useful). The Lexicon uses a Lucene Directory; each Word is a Document, each meta is a Field (word, ngram, phonetic, fields, anagram, size ...). Above a minimum size, the number of different words used in an index can be considered stable, so a standard Lexicon (built from Wikipedia, for example) can be used. A SimilarTokenFilter is provided. A spellchecker will come soon. A fuzzySearch implementation and a neutral synonym TokenFilter can be done. Unused words can be removed on demand (lazy delete?) Any criticism or suggestions?
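The ngram.* fields shown for the word lucene can be rebuilt with a few lines; LexiconMeta is an illustrative name, and the sketch assumes words of at least two characters:

```java
import java.util.*;

// Rebuilds the ngram.* metadata of a lexicon Word: bigrams, the leading and
// trailing gram, and the word's size. Assumes word.length() >= 2.
class LexiconMeta {
    static Map<String, Object> ngramMeta(String word) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= word.length(); i++)
            grams.add(word.substring(i, i + 2));
        Map<String, Object> meta = new LinkedHashMap<>();
        meta.put("ngram.start", grams.get(0));
        meta.put("ngram.gram", grams);
        meta.put("ngram.end", grams.get(grams.size() - 1));
        meta.put("ngram.size", word.length());
        return meta;
    }
}
```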
[jira] Updated: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming
[ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mathieu Lecarme updated LUCENE-1190: Attachment: aphone+lexicon.patch a lexicon object for merging spellchecker and synonyms from stemming Key: LUCENE-1190 URL: https://issues.apache.org/jira/browse/LUCENE-1190 Project: Lucene - Java Issue Type: New Feature Components: contrib/*, Search Affects Versions: 2.3 Reporter: Mathieu Lecarme Attachments: aphone+lexicon.patch, aphone+lexicon.patch Some Lucene features need a list of referring word. Spellchecking is the basic example, but synonyms are another use. Other tools can be used more smoothly with a list of words, without disturbing the main index: stemming and other simplifications of words (anagram, phonetic ...). For that, I suggest a Lexicon object, which contains words (Term + frequency) and which can be built from a Lucene Directory, or from plain text files. Classical TokenFilters can be used with the Lexicon (LowerCaseFilter and ISOLatin1AccentFilter should be the most useful). The Lexicon uses a Lucene Directory; each Word is a Document, each meta is a Field (word, ngram, phonetic, fields, anagram, size ...). Above a minimum size, the number of different words used in an index can be considered stable, so a standard Lexicon (built from Wikipedia, for example) can be used. A SimilarTokenFilter is provided. A spellchecker will come soon. A fuzzySearch implementation and a neutral synonym TokenFilter can be done. Unused words can be removed on demand (lazy delete?) Any criticism or suggestions?
[jira] Created: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming
a lexicon object for merging spellchecker and synonyms from stemming Key: LUCENE-1190 URL: https://issues.apache.org/jira/browse/LUCENE-1190 Project: Lucene - Java Issue Type: New Feature Components: contrib/*, Search Affects Versions: 2.3 Reporter: Mathieu Lecarme Attachments: aphone+lexicon.patch Some Lucene features need a list of referring word. Spellchecking is the basic example, but synonyms are another use. Other tools can be used more smoothly with a list of words, without disturbing the main index: stemming and other simplifications of words (anagram, phonetic ...). For that, I suggest a Lexicon object, which contains words (Term + frequency) and which can be built from a Lucene Directory, or from plain text files. Classical TokenFilters can be used with the Lexicon (LowerCaseFilter and ISOLatin1AccentFilter should be the most useful). The Lexicon uses a Lucene Directory; each Word is a Document, each meta is a Field (word, ngram, phonetic, fields, anagram, size ...). Above a minimum size, the number of different words used in an index can be considered stable, so a standard Lexicon (built from Wikipedia, for example) can be used. A SimilarTokenFilter is provided. A spellchecker will come soon. A fuzzySearch implementation and a neutral synonym TokenFilter can be done. Unused words can be removed on demand (lazy delete?) Any criticism or suggestions?
[jira] Updated: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming
[ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mathieu Lecarme updated LUCENE-1190: Attachment: aphone+lexicon.patch a lexicon object for merging spellchecker and synonyms from stemming Key: LUCENE-1190 URL: https://issues.apache.org/jira/browse/LUCENE-1190 Project: Lucene - Java Issue Type: New Feature Components: contrib/*, Search Affects Versions: 2.3 Reporter: Mathieu Lecarme Attachments: aphone+lexicon.patch Some Lucene features need a list of referring word. Spellchecking is the basic example, but synonyms are another use. Other tools can be used more smoothly with a list of words, without disturbing the main index: stemming and other simplifications of words (anagram, phonetic ...). For that, I suggest a Lexicon object, which contains words (Term + frequency) and which can be built from a Lucene Directory, or from plain text files. Classical TokenFilters can be used with the Lexicon (LowerCaseFilter and ISOLatin1AccentFilter should be the most useful). The Lexicon uses a Lucene Directory; each Word is a Document, each meta is a Field (word, ngram, phonetic, fields, anagram, size ...). Above a minimum size, the number of different words used in an index can be considered stable, so a standard Lexicon (built from Wikipedia, for example) can be used. A SimilarTokenFilter is provided. A spellchecker will come soon. A fuzzySearch implementation and a neutral synonym TokenFilter can be done. Unused words can be removed on demand (lazy delete?) Any criticism or suggestions?
[jira] Updated: (LUCENE-956) phonem conversion from aspell dictionnary
[ https://issues.apache.org/jira/browse/LUCENE-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mathieu Lecarme updated LUCENE-956: --- Attachment: aphone.patch New version, with more languages (bg, br, da, de, el, en, fo, fr, is, ru), and a usable token filter. The usage pattern is similar to a stemming token filter. phonem conversion from aspell dictionnary - Key: LUCENE-956 URL: https://issues.apache.org/jira/browse/LUCENE-956 Project: Lucene - Java Issue Type: New Feature Components: Analysis Affects Versions: 2.2 Reporter: Mathieu Lecarme Attachments: aphone.patch, aphone.patch First step to improve the Spellchecker's suggestions: phoneme conversion for different languages. The conversion code is built from the aspell file description. The patch contains classes for managing English, French, Walloon and Swedish. If it works well, other available dictionaries from the aspell project can be built.
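To illustrate the idea of a phonetic token filter, here is a deliberately naive folding; these few substitutions are ours for demonstration only, while the real rules come from the aspell data files:

```java
// A toy phonetic folding: collapse a few spellings that sound alike, so
// misspellings land on the same key. Not the aspell rules; illustration only.
class NaivePhonetic {
    static String fold(String word) {
        return word.toLowerCase()
                   .replace("ph", "f")
                   .replace("qu", "k")
                   .replace("c", "k")
                   .replaceAll("(.)\\1", "$1"); // collapse doubled letters
    }
}
```

A real filter would emit this folded form as an extra token, the same way a stemming filter emits the stem.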
Re: Need help for ordering results by specific order
If I understand your needs correctly: you ask Lucene for a set of words, and you want to sort results by the number of different words which match? The query is not right; it should be +content:(aleden bob carray). I don't understand how you could sort at indexing time using information known only at query time. M. savageboy wrote: Yes, Mathieu. I just have the book Lucene in Action at hand; it is the Chinese language version, about Lucene 1.4, hope it is not too old. If I use SortComparatorSource, does that mean it will do the sort work at user query time? Can I sort (maybe score it) at indexing time? Mathieu Lecarme wrote: Have a look at the book Lucene in Action, ch. 6.1: using a custom sort method. SortComparatorSource might be your friend. Lucene selects the stuff, and you sort it just like you want. M. On Jul 18, 2007, at 10:29, savageboy wrote: Hi, I am new to Lucene. I have a search engine project with Lucene 2.0. But near the end of the project, my boss wants me to order the results by the sort below: the query is like '+content:aleden bob carray'
content | date | order
alden bob carray ... | 2005/12/23 | 1
alden... alden ... bob... bob... carray... | 2005/12/01 | 2
alden... alden ... bob... carray | 2005/11/28 | 3
alden... carray | 2005/12/24 | 4
alden... bob | 2005/12/24 | 5
The meaning of the sort above is: no matter how many times the terms match in the content field, there are four situations: 3 matched, 2 matched, 1 matched, 0 matched. Within the 3-matched group, I need to sort the results by date descending, and the 2-matched group is the same... But I don't know HOW to get these results in Lucene... Should I override the scoring method? (tf(t in d) term in field, idf(t) inverse doc frequency) Could you give me some references about it? I am really stuck, and need your help!! -- View this message in context: http://www.nabble.com/Need-help-for-ordering-results-by-specific-order-tf4101844.html#a11664583 Sent from the Lucene - Java Developer mailing list archive at Nabble.com. 
Re: Need help for ordering results by specific order
Have a look at the book Lucene in Action, ch. 6.1: using a custom sort method. SortComparatorSource might be your friend. Lucene selects the stuff, and you sort it just like you want. M. On Jul 18, 2007, at 10:29, savageboy wrote: Hi, I am new to Lucene. I have a search engine project with Lucene 2.0. But near the end of the project, my boss wants me to order the results by the sort below: the query is like '+content:aleden bob carray'
content | date | order
alden bob carray ... | 2005/12/23 | 1
alden... alden ... bob... bob... carray... | 2005/12/01 | 2
alden... alden ... bob... carray | 2005/11/28 | 3
alden... carray | 2005/12/24 | 4
alden... bob | 2005/12/24 | 5
The meaning of the sort above is: no matter how many times the terms match in the content field, there are four situations: 3 matched, 2 matched, 1 matched, 0 matched. Within the 3-matched group, I need to sort the results by date descending, and the 2-matched group is the same... But I don't know HOW to get these results in Lucene... Should I override the scoring method? (tf(t in d) term in field, idf(t) inverse doc frequency) Could you give me some references about it? I am really stuck, and need your help!! -- View this message in context: http://www.nabble.com/Need-help-for-ordering-results-by-specific-order-tf4101844.html#a11664583 Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
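The requested order (matched-term count descending, then date descending) is plain comparator logic; in Lucene it would live behind SortComparatorSource, but the sketch below is standalone Java with illustrative Hit/HitSorter names:

```java
import java.util.*;

// A search hit reduced to the two sort keys the poster needs. A yyyy/MM/dd
// date string sorts correctly with plain lexicographic comparison.
class Hit {
    final int matchedTerms;
    final String date;
    Hit(int matchedTerms, String date) {
        this.matchedTerms = matchedTerms;
        this.date = date;
    }
}

class HitSorter {
    // Primary key: distinct matched terms, descending.
    // Secondary key: date, descending.
    static void sort(List<Hit> hits) {
        hits.sort((a, b) -> {
            if (a.matchedTerms != b.matchedTerms)
                return b.matchedTerms - a.matchedTerms;
            return b.date.compareTo(a.date);
        });
    }
}
```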
Re: for a better spellchecker
The SpellChecker code mixes indexing functions, n-gram treatment, and querying functions. Extending it will not produce clean code. Would it be relevant to first refactor the SpellChecker code to extract the dictionary-reading functions and the indexing/searching functions? SpellChecker would get a method to add a SpellEngine interface, which looks like:
interface SpellEngine {
  public void addWord(String word);
  public String[] suggestSimilar(String word, int numSug);
}
and something to sort suggestions, like the distance from the suggested word. M. On Jul 9, 2007, at 02:38, Chris Hostetter wrote: : Now, SpellChecker uses the trigram algorithm to find similar words. It : works well for keyboard fumbles, but not well enough for short words : and for languages like French where the same sound can be written : differently. : Spellchecking is a classical computer task, and aspell provides some : nice and free (it's GNU) sound dictionaries. Lots of dictionaries are : available. The topic of spell correction as it pertains to Lucene users can really have two meanings: a) an attempt to suggest potential spell corrections of query strings provided by a user, as a form of input pre-processing; b) using Lucene as a tool to suggest spell corrections based on a known corpus. The contrib/spellchecker code is an application of (b) -- it may in fact be useful for (a), but that doesn't mean there aren't other non-Lucene tools for achieving (a) as well. : I did a python parser which writes translation code in different : languages : python, php and java. A bit like the snowball stuff. : A little work is needed to generate Lucene-compliant code. But is the : python generator good enough for Lucene, or must a translation be : done in Java to put it in the Lucene source? the Lucene-Java repository tends to be about java code, but contrib/javascript is an example of code that may be of general use to Lucene-Java users that isn't java. 
-Hoss
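A toy implementation of the SpellEngine interface proposed above, ranking candidates by Levenshtein distance. EditDistanceEngine is a hypothetical name, and a real engine would use an n-gram index rather than this linear scan.

```java
import java.util.*;

interface SpellEngine {
    void addWord(String word);
    String[] suggestSimilar(String word, int numSug);
}

// Naive engine: keep every word, rank by edit distance to the input.
class EditDistanceEngine implements SpellEngine {
    private final List<String> words = new ArrayList<>();

    public void addWord(String word) { words.add(word); }

    public String[] suggestSimilar(String word, int numSug) {
        return words.stream()
                .sorted(Comparator.comparingInt((String w) -> distance(w, word)))
                .limit(numSug)
                .toArray(String[]::new);
    }

    // Classic dynamic-programming Levenshtein distance.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }
}
```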
[jira] Created: (LUCENE-956) phonem conversion from aspell dictionnary
phonem conversion from aspell dictionnary - Key: LUCENE-956 URL: https://issues.apache.org/jira/browse/LUCENE-956 Project: Lucene - Java Issue Type: New Feature Components: Analysis Affects Versions: 2.2 Reporter: Mathieu Lecarme First step to improve the Spellchecker's suggestions: phoneme conversion for different languages. The conversion code is built from the aspell file description. The patch contains classes for managing English, French, Walloon and Swedish. If it works well, other available dictionaries from the aspell project can be built.
[jira] Updated: (LUCENE-956) phoneme conversion from aspell dictionary
[ https://issues.apache.org/jira/browse/LUCENE-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mathieu Lecarme updated LUCENE-956:
---
Attachment: aphone.patch

phoneme conversion from aspell dictionary
- Key: LUCENE-956
URL: https://issues.apache.org/jira/browse/LUCENE-956
Project: Lucene - Java
Issue Type: New Feature
Components: Analysis
Affects Versions: 2.2
Reporter: Mathieu Lecarme
Attachments: aphone.patch

A first step to improve the SpellChecker's suggestions: phoneme conversion for different languages. The conversion code is built from the aspell file description. The patch contains classes for handling English, French, Walloon and Swedish. If they work well, other dictionaries available from the aspell project can be built.
build.xml for a contrib which depends on another contrib
The first version of the aspell-format phoneme converter in Java is almost finished. The source builds fine with ant on its own, but in the Lucene trunk the build fails: my contrib depends on SpellChecker, which is built after it. How can I fix this? A static spellchecker.jar in the lib directory of my contrib? A depends in the right place in my compile target?
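One way to express such a cross-contrib dependency in the contrib's own build.xml is to build spellchecker explicitly before compiling and add its jar to the classpath. This is only a sketch: every directory, property, and target name below is an assumption about the trunk layout, not verified against it.

```xml
<!-- Sketch only: paths, property names and target names are assumed. -->

<!-- Build the spellchecker contrib first (its default target). -->
<target name="build-spellchecker">
  <ant antfile="build.xml" dir="../spellchecker" inheritall="false"/>
</target>

<!-- Make this contrib's compile step depend on that build. -->
<target name="compile-core"
        depends="build-spellchecker, common.compile-core"/>

<!-- Add the freshly built jar to the compile classpath. -->
<path id="classpath">
  <pathelement location="../spellchecker/spellchecker.jar"/>
  <path refid="base.classpath"/>
</path>
```

This avoids checking a static jar into lib, so the contrib always compiles against the current trunk SpellChecker.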
Re: [jira] Updated: (LUCENE-906) Elision filter for simple French analyzing
Any news about the integration of this patch?

M.

Mathieu Lecarme (JIRA) wrote:

[ https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mathieu Lecarme updated LUCENE-906:
---
Attachment: elision-0.2.patch

All suggested corrections are done.

Elision filter for simple French analyzing
--
Key: LUCENE-906
URL: https://issues.apache.org/jira/browse/LUCENE-906
Project: Lucene - Java
Issue Type: New Feature
Components: Analysis
Reporter: Mathieu Lecarme
Attachments: elision-0.2.patch, elision.patch

If you don't want to use stemming, StandardAnalyzer misses some French peculiarities like elision: l'avion, which means "the plane", must be tokenized as avion (plane). This filter could be used with other Latin languages where elision exists.
[jira] Updated: (LUCENE-906) Elision filter for simple French analyzing
[ https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mathieu Lecarme updated LUCENE-906:
---
Attachment: elision-0.2.patch

All suggested corrections are done.

Elision filter for simple French analyzing
--
Key: LUCENE-906
URL: https://issues.apache.org/jira/browse/LUCENE-906
Project: Lucene - Java
Issue Type: New Feature
Components: Analysis
Reporter: Mathieu Lecarme
Attachments: elision.patch

If you don't want to use stemming, StandardAnalyzer misses some French peculiarities like elision: l'avion, which means "the plane", must be tokenized as avion (plane). This filter could be used with other Latin languages where elision exists.
[jira] Updated: (LUCENE-906) Elision filter for simple French analyzing
[ https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mathieu Lecarme updated LUCENE-906:
---
Attachment: (was: elision-0.2.patch)

Elision filter for simple French analyzing
--
Key: LUCENE-906
URL: https://issues.apache.org/jira/browse/LUCENE-906
Project: Lucene - Java
Issue Type: New Feature
Components: Analysis
Reporter: Mathieu Lecarme
Attachments: elision.patch

If you don't want to use stemming, StandardAnalyzer misses some French peculiarities like elision: l'avion, which means "the plane", must be tokenized as avion (plane). This filter could be used with other Latin languages where elision exists.
[jira] Updated: (LUCENE-906) Elision filter for simple French analyzing
[ https://issues.apache.org/jira/browse/LUCENE-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mathieu Lecarme updated LUCENE-906:
---
Attachment: elision.patch

Elision filter for simple French analyzing
--
Key: LUCENE-906
URL: https://issues.apache.org/jira/browse/LUCENE-906
Project: Lucene - Java
Issue Type: New Feature
Components: Analysis
Reporter: Mathieu Lecarme
Attachments: elision.patch

If you don't want to use stemming, StandardAnalyzer misses some French peculiarities like elision: l'avion, which means "the plane", must be tokenized as avion (plane). This filter could be used with other Latin languages where elision exists.
[jira] Created: (LUCENE-906) Elision filter for simple French analyzing
Elision filter for simple French analyzing
--
Key: LUCENE-906
URL: https://issues.apache.org/jira/browse/LUCENE-906
Project: Lucene - Java
Issue Type: New Feature
Components: Analysis
Reporter: Mathieu Lecarme

If you don't want to use stemming, StandardAnalyzer misses some French peculiarities like elision: l'avion, which means "the plane", must be tokenized as avion (plane). This filter could be used with other Latin languages where elision exists.
Using a French-specific analyser without stemming
For a project with a lot of Lucene search (via Compass), I had some trouble with stemming. Stemming is nice for enlarging the search range, but it makes completion strange, so FrenchAnalyzer was not usable. The simpler StandardAnalyzer does the job right, except for some French peculiarities like elision. In French, "the plane" is translated as l'avion, not le avion, and the StandardTokenizer used by StandardAnalyzer can't tokenize it correctly. So I made a specific filter (ElisionFilter); how should I contribute it to Lucene? With a Jira ticket, or on the mailing list?

M.
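The core of such an elision filter, stripping a leading article plus apostrophe so that l'avion is indexed as avion, can be sketched standalone like this (the article set, class, and method names are illustrative; the real contribution would wrap this logic in a Lucene TokenFilter):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Core logic of an elision filter: if a token starts with a known
// French elided article followed by an apostrophe, drop that prefix.
// The article list here is illustrative, not exhaustive.
class Elision {
    private static final Set<String> ARTICLES = new HashSet<>(
            Arrays.asList("l", "m", "t", "qu", "n", "s", "j", "d", "c"));

    static String strip(String token) {
        int apos = token.indexOf('\'');
        if (apos >= 0
                && ARTICLES.contains(token.substring(0, apos).toLowerCase())) {
            return token.substring(apos + 1);
        }
        return token;
    }
}
```

In a real TokenFilter this transformation would be applied to each token's text after StandardTokenizer, leaving tokens without a recognized elided prefix untouched.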