Re: PorterStemfilter
--- Tea Yu <[EMAIL PROTECTED]> wrote: > David, > > For me, I don't want a search for "in print" to give > results from "in printer". > I'll consider that the over-stemmed case. Here the "in" won't be considered, as it is a stopword in most of the analyzers; I know it is in StandardAnalyzer. So searching for 'in print' will not return the document containing 'in printer', because stem('printer') is 'printer' and not 'print', so 'printer' is what gets stored in the index. Enclosing the phrase in double quotes does not prevent stemming. > I'm also not satisfied when "effective" was > recently stemmed to "effect" by > Snowball I have tested this with PorterStemFilter, and "effective" is stemmed to "effect" there as well. There are more serious problems: "printable" is stemmed to "printabl". Thanks, George - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
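George's standalone test is explained by Porter's "measure" condition: step 4 of the published algorithm only strips "-er" when the remaining stem has measure m > 1, and m("print") is just 1; conversely, step 5a drops the final "e" of "printable" because m("printabl") is 2. Here is a minimal sketch of the measure computation (an illustration of the published algorithm, not Lucene's actual PorterStemFilter code):

```java
public class PorterMeasure {
    // Porter's vowel definition: a/e/i/o/u, or 'y' preceded by a consonant.
    static boolean isVowel(String w, int i) {
        char c = w.charAt(i);
        if ("aeiou".indexOf(c) >= 0) return true;
        return c == 'y' && i > 0 && !isVowel(w, i - 1);
    }

    // Porter's measure m: the number of vowel-to-consonant transitions,
    // i.e. the m in the word form [C](VC)^m[V].
    static int measure(String w) {
        int m = 0;
        for (int i = 1; i < w.length(); i++)
            if (isVowel(w, i - 1) && !isVowel(w, i)) m++;
        return m;
    }

    public static void main(String[] args) {
        System.out.println(measure("print"));    // 1 -> "(m>1) ER ->" does not fire, so "printer" stems to itself
        System.out.println(measure("printabl")); // 2 -> step 5a drops the trailing "e" of "printable"
    }
}
```

In other words, the filter is working as designed: Porter stems are conflation keys, not linguistic roots, so 'printer' legitimately stems to itself and 'printabl' is a perfectly valid (if ugly) stem.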
Re: PorterStemfilter
David, For me, I don't want a search for "in print" to give results from "in printer". I'll consider that the over-stemmed case. I'm also not satisfied that "effective" was recently stemmed to "effect" by Snowball. Cheers Tea > Hi David > > I like KStem more than Porter / Snowball - but it still has limitations, > although it performs better, as it has a dictionary to augment the rules. > > Note that KStem will also treat "print" and "printer" as two distinct terms, > probably treating them as verb and noun respectively. > > Cheers > > Pete Lewis > > - Original Message - > From: "David Spencer" <[EMAIL PROTECTED]> > To: "Lucene Users List" <[EMAIL PROTECTED]> > Sent: Tuesday, September 14, 2004 7:19 PM > Subject: Re: PorterStemfilter > > > > Honey George wrote: > > > > > Hi, > > > This might be more of a question related to the > > > PorterStemmer algorithm rather than to Lucene, but > > > if anyone has the knowledge please share. > > > > You might want to also try the Snowball stemmer: > > > > http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/ > > > > And KStem: > > > > http://ciir.cs.umass.edu/downloads/ > > > > > > I am using the PorterStemFilter that comes with Lucene, > > > and it turns out that searching for the word 'printer' > > > does not return a document containing the text > > > 'print'. To narrow down the problem, I have tested the > > > PorterStemFilter in a standalone program, and it turns > > > out that the stem of 'printer' is 'printer' and not > > > 'print'. That is, 'printer' is not stemmed to 'print' + > > > 'er'; the whole word is the stem. Can somebody > > > explain this behavior? > > > > > > Thanks & Regards, > > > George
Re: frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote: David Spencer wrote: [1] The user enters a query like: recursize descent parser [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("descent" and "parser") and suggests alternatives to "recursize"... thus if any term is in the index, regardless of frequency, it is left as-is. I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all). Almost. If the user enters "a recursize purser", then: "a", which is in, say, >50% of the documents, is probably spelled correctly, and "recursize", which is in zero documents, is probably misspelled. But what about "purser"? If we run the spell check algorithm on "purser" and generate "parser", should we show it to the user? If "purser" occurs in 1% of documents and "parser" occurs in 5%, then we probably should, since "parser" is a more common word than "purser". But if "parser" only occurs in 1% of the documents and "purser" occurs in 5%, then we probably shouldn't bother suggesting "parser". OK, sure, got it. I'll give it a think and try to add this option to my just-submitted spelling code. If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does "purser" or "parser" occur more frequently near "descent"? But that gets expensive. Yeah, expensive for a large-scale search engine, but probably appropriate for a desktop engine. 
Doug
Re: Similarity score computation documentation
Your analysis sounds correct. At base, a weight is a normalized tf*idf. So a document weight is: docTf * idf * docNorm and a query weight is: queryTf * idf * queryNorm where queryTf is always one. So the product of these is (docTf * idf * docNorm) * (idf * queryNorm), which indeed contains idf twice. I think the best documentation fix would be to add another idf(t) clause at the end of the formula, next to queryNorm(q), so this is clear. Does that sound right to you? Doug Ken McCracken wrote: Hi, I was looking through the score computation when running search, and I think there may be a discrepancy between what is _documented_ in the org.apache.lucene.search.Similarity class overview Javadocs and what actually occurs in the code. I believe the problem is only with the documentation. I'm pretty sure that there should be an idf^2 in the sum. Look at org.apache.lucene.search.TermQuery, the inner class TermWeight. You can see that first sumOfSquaredWeights() is called, followed by normalize(), during search. Further, the resulting value stored in the field "value" is set as the "weightValue" on the TermScorer. If we look at what happens in TermWeight, sumOfSquaredWeights() sets "queryWeight" to idf * boost. During normalize(), "queryWeight" is multiplied by the query norm, and "value" is set to queryWeight * idf == idf * boost * queryNorm * idf == idf^2 * boost * queryNorm. This becomes the "weightValue" in the TermScorer that is then used to multiply with the appropriate tf, etc., values. The remaining terms in the Similarity description are properly appended. I also see that the queryNorm effectively "cancels out" one of the idfs dimensionally (since it is 1 over the square root of a sum of squared idfs), so the formula still ends up being roughly a TF-IDF formula. But the idf^2 should still be there, along with the expansion of queryNorm. Am I mistaken, or is the documentation off? 
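Ken's derivation can be checked with a few lines of arithmetic that mirror TermWeight's two steps (the variable names below follow the discussion, not actual Lucene code; the single-term case is shown, where the norm cancels one idf exactly):

```java
public class TermWeightCheck {
    // Mirrors the two steps discussed: sumOfSquaredWeights() then normalize().
    static double weightValue(double idf, double boost) {
        // sumOfSquaredWeights(): queryWeight = idf * boost
        double queryWeight = idf * boost;
        // normalize(): queryNorm = 1 / sqrt(sum of squared weights)
        double queryNorm = 1.0 / Math.sqrt(queryWeight * queryWeight);
        queryWeight *= queryNorm;
        // value = queryWeight * idf == idf^2 * boost * queryNorm
        return queryWeight * idf;
    }

    public static void main(String[] args) {
        double idf = 2.0, boost = 1.0;
        double queryNorm = 1.0 / Math.sqrt(idf * boost * idf * boost);
        // the computed weightValue equals idf^2 * boost * queryNorm
        System.out.println(weightValue(idf, boost) == idf * idf * boost * queryNorm);
    }
}
```

So the score really is proportional to idf squared, which is why adding a second idf(t) clause to the documented formula makes it match the code.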
Thanks for your help, -Ken
Hits.doc(x) and range queries
Hi guys! I've posted previously that Hits.doc(x) was taking a long time. Turns out it has to do with a date range in our query. We usually do date ranges like this: Date:[(lucene date field) - (lucene date field)] Sometimes the begin date is "0", which is what we get from DateField.dateToString( new Date( 0 ) ). This is when getting our search results from the Hits object takes an absurd amount of time. It's usually each time the Hits object attempts to get more results from an IndexSearcher (i.e., every 100 hits?). It also takes up more memory... I was wondering why it affects the search so much even though we're only returning 350 or so results. Does the QueryParser do something similar to the DateFilter on range queries? Would it be better to use a DateFilter? We're using Lucene 1.2 (with plans to upgrade). Do newer versions of Lucene have this problem? Roy.
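One plausible explanation, sketched below with made-up encoded date terms (the real DateField encoding differs): a range query in early Lucene is expanded against every distinct indexed term that falls inside the range, so a begin date of new Date(0) makes the range cover essentially every date term in the index, and the work grows with the number of distinct dates rather than with the 350 hits returned. A DateFilter sidesteps building that huge expanded query, which is why it tends to be the better choice for wide ranges.

```java
import java.util.List;

public class RangeExpansion {
    // A term-expanding range query pays one clause per distinct indexed
    // term inside [lo, hi], independent of how many documents match.
    static int clauseCount(List<String> sortedTerms, String lo, String hi) {
        int n = 0;
        for (String t : sortedTerms)
            if (t.compareTo(lo) >= 0 && t.compareTo(hi) <= 0) n++;
        return n;
    }

    public static void main(String[] args) {
        // hypothetical encoded date terms, one per distinct timestamp
        List<String> dateTerms = List.of("0g1jb", "0hax7", "0i9k2", "0j5m8");
        // a lower bound at the epoch covers every date term in the index
        System.out.println(clauseCount(dateTerms, "00000", "zzzzz")); // 4
        // a narrow range touches only the terms that actually fall in it
        System.out.println(clauseCount(dateTerms, "0h", "0i9k2"));    // 2
    }
}
```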
Re: frequent terms - Re: combining open office spellchecker with Lucene
David Spencer wrote: [1] The user enters a query like: recursize descent parser [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("descent" and "parser") and suggests alternatives to "recursize"... thus if any term is in the index, regardless of frequency, it is left as-is. I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all). Almost. If the user enters "a recursize purser", then: "a", which is in, say, >50% of the documents, is probably spelled correctly, and "recursize", which is in zero documents, is probably misspelled. But what about "purser"? If we run the spell check algorithm on "purser" and generate "parser", should we show it to the user? If "purser" occurs in 1% of documents and "parser" occurs in 5%, then we probably should, since "parser" is a more common word than "purser". But if "parser" only occurs in 1% of the documents and "purser" occurs in 5%, then we probably shouldn't bother suggesting "parser". If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does "purser" or "parser" occur more frequently near "descent"? But that gets expensive. Doug
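Doug's heuristic above boils down to a comparison of document frequencies. A toy version (a hypothetical helper, not Lucene API; real code would get the counts from something like IndexReader.docFreq):

```java
public class SuggestHeuristic {
    // Doug's rule: if the query term is absent from the index, suggest any
    // indexed candidate; if the term is present, suggest the candidate only
    // when the candidate is a strictly more frequent word.
    static boolean shouldSuggest(int dfTerm, int dfCandidate) {
        if (dfTerm == 0) return dfCandidate > 0;
        return dfCandidate > dfTerm;
    }

    public static void main(String[] args) {
        // "purser" in 1% of 1000 docs, "parser" in 5% -> suggest "parser"
        System.out.println(shouldSuggest(10, 50)); // true
        // the reverse case -> don't bother
        System.out.println(shouldSuggest(50, 10)); // false
        // "recursize" in zero docs -> suggest any real candidate
        System.out.println(shouldSuggest(0, 10));  // true
    }
}
```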
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Andrzej Bialecki wrote: I was wondering about the way you build the n-gram queries. You basically don't care about their position in the input term. Originally I thought about using PhraseQuery with a slop - however, after checking the source of PhraseQuery I realized that this probably wouldn't be that fast... You use BooleanQuery and start/end boosts instead, which may give similar results in the end but much cheaper. Sloppy PhraseQuerys are slower than BooleanQueries, but not horribly slower. The problem is that they don't handle the case where phrase elements are missing altogether, while a BooleanQuery does. So what you really need is maybe a variation of a sloppy PhraseQuery that scores matches that do not contain all of the terms... Doug
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Andrzej Bialecki wrote: David Spencer wrote: ...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existing terms. This should be fast, and you could provide a "did you mean" function too... Based on this mail I wrote an "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word. The background for this suggestion was that I was playing some time ago with a Luke plugin that builds various sorts of ancillary indexes, but then I never finished it... Kudos for actually making it work ;-) Sure, it was a fun little edge project. For the most part the code was done last week right after this thread appeared, but it always takes a while to get it from 95 to 100%. [1] Source is attached and I'd like to contribute it to the sandbox, esp if someone can validate that what it's doing is reasonable and useful. There have been many requests for this or similar functionality in the past; I believe it should go into the sandbox. I was wondering about the way you build the n-gram queries. You basically don't care about their position in the input term. Originally I thought about using PhraseQuery with a slop - however, after checking the source of PhraseQuery I realized that this probably wouldn't be that fast... You use BooleanQuery and start/end boosts instead, which may give similar results in the end but much cheaper. I also wonder how this algorithm would behave for smaller values of start/end lengths (e.g. 2,3,4). Sure, I'll try to rebuild the demo w/ lengths 2-5 (and then the query page can test any contiguous combo). 
In a sense, the smaller the n-gram length, the more "fuzziness" you introduce, which may or may not be desirable (increased recall at the cost of precision - for small indexes this may be useful from the user's perspective because you will always get a plausible hit, for huge indexes it's a loss). [2] Here's a demo page. I built an ngram index for ngrams of length 3 and 4 based on the existing index I have of approx 100k javadoc-generated pages. You type in a misspelled word like "recursixe" or whatnot to see what suggestions it returns. Note this is not a normal search index query -- rather this is a test page for spelling corrections. http://www.searchmorph.com/kat/spell.jsp Very nice demo! Thanks, kinda designed for ngram-nerds if you know what I mean :) I bet it's running way faster than the linear search over terms :-), even though you have to build the index in advance. But if you work with static or mostly static indexes this doesn't matter. Indeed, this is almost zero time, whereas the simple and dumb linear search was taking me 10sec. I will have to redo the site's main search page so it uses this new code, TBD, prob tomorrow. Based on a subsequent mail in this thread I set boosts for the words in the ngram index. The background is that each word (er.. term for a given field) in the orig index is a separate Document in the ngram index. This Doc contains all ngrams (in my test case, like #2 above, of length 3 and 4) of the word. I also set a boost of log(word_freq)/log(num_docs) so that more frequent words will tend to be suggested more often. You may want to experiment with 2 <= n <= 5. Some n-gram based techniques use all lengths together, some others use just single length; results also vary depending on the language... Yep, will do prob tomorrow. 
I think in "plain" English then the way a word is suggested as a spelling correction is: - frequently occurring words score higher - words that share more ngrams with the orig word score higher - words that share rare ngrams with the orig word score higher I think this is a reasonable heuristic. Reading the code I would present it this way: ok, thx, will update - words that share more ngrams with the orig word score higher, and words that share rare ngrams with the orig word score higher (as a natural consequence of using BooleanQuery), - and, frequently occurring words score higher (as a consequence of using per-Document boosts), - from reading the source code I see that you use Levenshtein distance to prune the result set of too long/too short results. I think also that because you don't use the positional information about the input n-grams you may be getting some really weird hits. Good point, though I haven't seen this yet. Might be due to the prefix boost and maybe some Markov chain magic tending to only show reasonable words. You could prune them by simply checking if you find a (threshold) of input ngrams in the right sequence in the found terms. This shouldn't be too costly because you operate on a small result set. Good point, I'll try to add that in as an optional parameter.
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
David Spencer wrote: ...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existing terms. This should be fast, and you could provide a "did you mean" function too... Based on this mail I wrote an "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word. The background for this suggestion was that I was playing some time ago with a Luke plugin that builds various sorts of ancillary indexes, but then I never finished it... Kudos for actually making it work ;-) [1] Source is attached and I'd like to contribute it to the sandbox, esp if someone can validate that what it's doing is reasonable and useful. There have been many requests for this or similar functionality in the past; I believe it should go into the sandbox. I was wondering about the way you build the n-gram queries. You basically don't care about their position in the input term. Originally I thought about using PhraseQuery with a slop - however, after checking the source of PhraseQuery I realized that this probably wouldn't be that fast... You use BooleanQuery and start/end boosts instead, which may give similar results in the end but much cheaper. I also wonder how this algorithm would behave for smaller values of start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram length, the more "fuzziness" you introduce, which may or may not be desirable (increased recall at the cost of precision - for small indexes this may be useful from the user's perspective because you will always get a plausible hit, for huge indexes it's a loss). [2] Here's a demo page. 
I built an ngram index for ngrams of length 3 and 4 based on the existing index I have of approx 100k javadoc-generated pages. You type in a misspelled word like "recursixe" or whatnot to see what suggestions it returns. Note this is not a normal search index query -- rather this is a test page for spelling corrections. http://www.searchmorph.com/kat/spell.jsp Very nice demo! I bet it's running way faster than the linear search over terms :-), even though you have to build the index in advance. But if you work with static or mostly static indexes this doesn't matter. Based on a subsequent mail in this thread I set boosts for the words in the ngram index. The background is that each word (er.. term for a given field) in the orig index is a separate Document in the ngram index. This Doc contains all ngrams (in my test case, like #2 above, of length 3 and 4) of the word. I also set a boost of log(word_freq)/log(num_docs) so that more frequent words will tend to be suggested more often. You may want to experiment with 2 <= n <= 5. Some n-gram based techniques use all lengths together, some others use just single length; results also vary depending on the language... I think in "plain" English then the way a word is suggested as a spelling correction is: - frequently occurring words score higher - words that share more ngrams with the orig word score higher - words that share rare ngrams with the orig word score higher I think this is a reasonable heuristic. 
Reading the code I would present it this way: - words that share more ngrams with the orig word score higher, and words that share rare ngrams with the orig word score higher (as a natural consequence of using BooleanQuery), - and, frequently occurring words score higher (as a consequence of using per-Document boosts), - from reading the source code I see that you use Levenshtein distance to prune the result set of too long/too short results. I think also that because you don't use the positional information about the input n-grams you may be getting some really weird hits. You could prune them by simply checking if you find a (threshold) of input ngrams in the right sequence in the found terms. This shouldn't be too costly because you operate on a small result set. [6] If people want to vote me in as a committer to the sandbox then I can Well, someone needs to maintain the code after all... ;-) -- Best regards, Andrzej Bialecki - Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator - FreeBSD developer (http://www.freebsd.org)
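The sequence check Andrzej proposes (accept a candidate only if some threshold fraction of the input word's n-grams appear in it in left-to-right order) might look like this. Names are invented for illustration; this is not part of the NGramSpeller source:

```java
import java.util.ArrayList;
import java.util.List;

public class NGramOrderCheck {
    // All contiguous n-grams of a word, e.g. ngrams("print", 3) -> [pri, rin, int].
    static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++)
            out.add(word.substring(i, i + n));
        return out;
    }

    // Fraction of the input word's n-grams that occur in the candidate in
    // left-to-right order; candidates below some threshold get pruned.
    static double inOrderFraction(String input, String candidate, int n) {
        List<String> grams = ngrams(input, n);
        int found = 0, from = 0;
        for (String g : grams) {
            int pos = candidate.indexOf(g, from);
            if (pos >= 0) { found++; from = pos + 1; }
        }
        return grams.isEmpty() ? 0.0 : (double) found / grams.size();
    }

    public static void main(String[] args) {
        // 5 of "recursixe"'s 7 trigrams occur in order in "recursive"
        System.out.println(inOrderFraction("recursixe", "recursive", 3));
        // only 3 of 7 occur in order in "excursion" -> likely pruned
        System.out.println(inOrderFraction("recursixe", "excursion", 3));
    }
}
```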
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Tate Avery wrote: I get a NullPointerException shown (via Apache) when I try to access http://www.searchmorph.com/kat/spell.jsp How embarrassing! Sorry! Fixed! T -Original Message- From: David Spencer [mailto:[EMAIL PROTECTED]] Sent: Tuesday, September 14, 2004 3:23 PM To: Lucene Users List Subject: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene Andrzej Bialecki wrote: David Spencer wrote: I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead look for terms that also match on the 1st "n" (prob 3) chars. ...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existing terms. This should be fast, and you could provide a "did you mean" function too... Based on this mail I wrote an "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word. Let's see. [1] Source is attached and I'd like to contribute it to the sandbox, esp if someone can validate that what it's doing is reasonable and useful. [2] Here's a demo page. I built an ngram index for ngrams of length 3 and 4 based on the existing index I have of approx 100k javadoc-generated pages. You type in a misspelled word like "recursixe" or whatnot to see what suggestions it returns. Note this is not a normal search index query -- rather this is a test page for spelling corrections. 
http://www.searchmorph.com/kat/spell.jsp [3] Here's the javadoc: http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html [4] Here's the source in HTML: http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152 [5] A few more details: Based on a subsequent mail in this thread I set boosts for the words in the ngram index. The background is that each word (er.. term for a given field) in the orig index is a separate Document in the ngram index. This Doc contains all ngrams (in my test case, like #2 above, of length 3 and 4) of the word. I also set a boost of log(word_freq)/log(num_docs) so that more frequent words will tend to be suggested more often. I think in "plain" English then the way a word is suggested as a spelling correction is: - frequently occurring words score higher - words that share more ngrams with the orig word score higher - words that share rare ngrams with the orig word score higher [6] If people want to vote me in as a committer to the sandbox then I can check this code in - though again, I'd appreciate feedback. thx, Dave
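For reference, the boost formula from [5] is just a base-num_docs logarithm of the word frequency, so it runs from 0 (a word occurring once) up to 1 (a word occurring in every doc); for instance a word in 100 of 10,000 docs gets boost 0.5:

```java
public class NGramBoost {
    // Per-word boost so more frequent words are suggested more often:
    // log(word_freq) / log(num_docs), i.e. the log base num_docs of word_freq.
    static double boost(int wordFreq, int numDocs) {
        return Math.log(wordFreq) / Math.log(numDocs);
    }

    public static void main(String[] args) {
        System.out.println(boost(100, 10000));   // 0.5
        System.out.println(boost(10000, 10000)); // 1.0
        System.out.println(boost(1, 10000));     // 0.0 - hapax words get no boost
    }
}
```

One caveat visible here: a word that occurs only once gets a boost of exactly zero, which effectively suppresses it as a suggestion.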
RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
I get a NullPointerException shown (via Apache) when I try to access http://www.searchmorph.com/kat/spell.jsp T -Original Message- From: David Spencer [mailto:[EMAIL PROTECTED]] Sent: Tuesday, September 14, 2004 3:23 PM To: Lucene Users List Subject: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene Andrzej Bialecki wrote: > David Spencer wrote: > >> >> I can/should send the code out. The logic is that for any terms in a >> query that have zero matches, go thru all the terms(!) and calculate >> the Levenshtein string distance, and return the best matches. A more >> intelligent way of doing this is to instead look for terms that also >> match on the 1st "n" (prob 3) chars. > > > ...or prepare in advance a fast lookup index - split all existing terms > to bi- or trigrams, create a separate lookup index, and then simply for > each term ask a phrase query (phrase = all n-grams from an input term), > with a slop > 0, to get similar existing terms. This should be fast, and > you could provide a "did you mean" function too... > Based on this mail I wrote an "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word. Let's see. [1] Source is attached and I'd like to contribute it to the sandbox, esp if someone can validate that what it's doing is reasonable and useful. [2] Here's a demo page. I built an ngram index for ngrams of length 3 and 4 based on the existing index I have of approx 100k javadoc-generated pages. You type in a misspelled word like "recursixe" or whatnot to see what suggestions it returns. Note this is not a normal search index query -- rather this is a test page for spelling corrections. 
http://www.searchmorph.com/kat/spell.jsp [3] Here's the javadoc: http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html [4] Here's the source in HTML: http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152 [5] A few more details: Based on a subsequent mail in this thread I set boosts for the words in the ngram index. The background is that each word (er.. term for a given field) in the orig index is a separate Document in the ngram index. This Doc contains all ngrams (in my test case, like #2 above, of length 3 and 4) of the word. I also set a boost of log(word_freq)/log(num_docs) so that more frequent words will tend to be suggested more often. I think in "plain" English then the way a word is suggested as a spelling correction is: - frequently occurring words score higher - words that share more ngrams with the orig word score higher - words that share rare ngrams with the orig word score higher [6] If people want to vote me in as a committer to the sandbox then I can check this code in - though again, I'd appreciate feedback. thx, Dave
NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Andrzej Bialecki wrote: David Spencer wrote: I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead look for terms that also match on the 1st "n" (prob 3) chars. ...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get similar existing terms. This should be fast, and you could provide a "did you mean" function too... Based on this mail I wrote an "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word. Let's see. [1] Source is attached and I'd like to contribute it to the sandbox, esp if someone can validate that what it's doing is reasonable and useful. [2] Here's a demo page. I built an ngram index for ngrams of length 3 and 4 based on the existing index I have of approx 100k javadoc-generated pages. You type in a misspelled word like "recursixe" or whatnot to see what suggestions it returns. Note this is not a normal search index query -- rather this is a test page for spelling corrections. http://www.searchmorph.com/kat/spell.jsp [3] Here's the javadoc: http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html [4] Here's the source in HTML: http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152 [5] A few more details: Based on a subsequent mail in this thread I set boosts for the words in the ngram index. The background is that each word (er.. term for a given field) in the orig index is a separate Document in the ngram index. 
This Doc contains all ngrams (in my test case, like #2 above, of length 3 and 4) of the word. I also set a boost of log(word_freq)/log(num_docs) so that more frequent words will tend to be suggested more often. I think in "plain" English then the way a word is suggested as a spelling correction is: - frequently occurring words score higher - words that share more ngrams with the orig word score higher - words that share rare ngrams with the orig word score higher [6] If people want to vote me in as a committer to the sandbox then I can check this code in - though again, I'd appreciate feedback. thx, Dave [attachment: org.apache.lucene.spell.NGramSpeller source; the Apache Software License 1.1 header and the rest of the file are truncated in the archive]
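Both the brute-force speller David describes at the top of the thread and NGramSpeller's result-set pruning rely on Levenshtein string distance; the standard dynamic-programming formulation (a generic sketch, not the code from the attachment) is short:

```java
public class EditDistance {
    // Classic Levenshtein distance: minimum number of single-character
    // insertions, deletions and substitutions turning a into b.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + sub);
            }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("recursixe", "recursive")); // 1
        System.out.println(distance("purser", "parser"));       // 1
        System.out.println(distance("print", "printer"));       // 2
    }
}
```

Running this over every term in the index is what made the brute-force approach take seconds; the n-gram lookup index reduces the candidate set so the distance only has to be computed for a handful of terms.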
Re: PorterStemfilter
Hi David,

I like KStem more than Porter / Snowball - it still has limitations, although it performs better because it has a dictionary to augment the rules.

Note that KStem will also treat "print" and "printer" as two distinct terms, probably treating them as a verb and a noun respectively.

Cheers

Pete Lewis

- Original Message -
From: "David Spencer" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, September 14, 2004 7:19 PM
Subject: Re: PorterStemfilter

> Honey George wrote:
> > Hi,
> > This might be more of a question related to the PorterStemmer algorithm rather than to Lucene, but if anyone has the knowledge please share.
>
> You might want to also try the Snowball stemmer:
>
> http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/
>
> And KStem:
>
> http://ciir.cs.umass.edu/downloads/
>
> > I am using the PorterStemFilter that comes with Lucene and it turns out that searching for the word 'printer' does not return a document containing the text 'print'. To narrow down the problem, I have tested the PorterStemFilter in a standalone program and it turns out that the stem of 'printer' is 'printer' and not 'print'. That is, 'printer' is not 'print' + 'er'; the whole word is the stem. Can somebody explain the behavior?
> >
> > Thanks & Regards,
> > George

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: PorterStemfilter
Hi George,

There are lots of problems with Porter stemmers - they are not great for English, and they get worse for other languages. If you look at:

http://snowball.tartarus.org/demo.php

you'll see the Snowball demo - this is basically another instance of Porter. If you enter "print" and "printer" and submit, the results will be "print" and "printer" - showing that the Porter-stemmed versions are the same as the originals. Therefore they are both distinct terms in their own right, and searches on one will not hit the other.

Cheers

Pete Lewis

- Original Message -
From: "Honey George" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, September 14, 2004 6:57 PM
Subject: PorterStemfilter

> Hi,
> This might be more of a question related to the PorterStemmer algorithm rather than to Lucene, but if anyone has the knowledge please share.
>
> I am using the PorterStemFilter that comes with Lucene and it turns out that searching for the word 'printer' does not return a document containing the text 'print'. To narrow down the problem, I have tested the PorterStemFilter in a standalone program and it turns out that the stem of 'printer' is 'printer' and not 'print'. That is, 'printer' is not 'print' + 'er'; the whole word is the stem. Can somebody explain the behavior?
>
> Thanks & Regards,
> George
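The consequence Pete describes - "print" and "printer" survive Porter stemming unchanged, so they live in the index as distinct terms - can be illustrated with a toy inverted index. This is only a sketch (class and method names are made up; Lucene's real index structure is far more elaborate), but it shows why a query term must exactly equal an indexed term to match:

```java
import java.util.*;

// Toy inverted index: maps each indexed term to the ids of the documents
// containing it. Terms are stored exactly as the analyzer emits them.
public class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void addDocument(int docId, String... terms) {
        for (String t : terms)
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
    }

    // Exact lookup: no substring or prefix matching happens here.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        // Porter leaves both words unchanged, so these are the terms indexed.
        index.addDocument(1, "printer");
        index.addDocument(2, "print");
        System.out.println(index.search("print"));   // hits doc 2 only, not doc 1
    }
}
```

A stemmer only helps here if it maps both surface forms to the *same* term at index time and at query time; since Porter maps each word to itself, the two postings lists never meet.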
Re: PorterStemfilter
Honey George wrote:
> Hi,
> This might be more of a question related to the PorterStemmer algorithm rather than to Lucene, but if anyone has the knowledge please share.

You might want to also try the Snowball stemmer:

http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/

And KStem:

http://ciir.cs.umass.edu/downloads/

> I am using the PorterStemFilter that comes with Lucene and it turns out that searching for the word 'printer' does not return a document containing the text 'print'. To narrow down the problem, I have tested the PorterStemFilter in a standalone program and it turns out that the stem of 'printer' is 'printer' and not 'print'. That is, 'printer' is not 'print' + 'er'; the whole word is the stem. Can somebody explain the behavior?
>
> Thanks & Regards,
> George
RE: Help for text based indexing
You could receive the group name as an input from the user and construct a BooleanQuery internally which will query only the group field based on the user input. So the user need not append the group name to the search string.

Thanks,
George

--- mahaveer jain <[EMAIL PROTECTED]> wrote:
> If I have rightly understood, you mean to say that the query for search has to be
>
> "Group1" AND "Hello" (if hello is what I want to search?)
>
> Cocula Remi <[EMAIL PROTECTED]> wrote:
> A keyword is not tokenized; that's why you won't be able to search over a part of it. You'd rather use a Text field.
>
> About creating a special field :
>
> IndexWriter Ir =
> File f =
> Document doc = new Document();
> if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group1")) {
> doc.add(Field.Text("group", "Group1"));
> }
> if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group2")) {
> doc.add(Field.Text("group", "Group2"));
> }
> doc.add(Field.Text("content", getContent(f)));
> Ir.addDocument(doc);
>
> Then you can search in group1 with a query like that :
>
> group:Group1 AND rest_of_the_query.
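George's suggestion - require a term on the "group" field and AND it with the user's query - can be sketched with plain collections. This toy model is an assumption-laden stand-in: in Lucene itself it would roughly be a BooleanQuery joining a TermQuery on the "group" field with the parsed user query, both clauses required; the class, field names, and tokenization below are all hypothetical.

```java
import java.util.*;

// Toy model of "required group term AND user query": a document only
// matches if its group field equals the requested group AND its content
// contains the query word.
public class GroupSearch {
    static class Doc {
        final String group, content;
        Doc(String group, String content) { this.group = group; this.content = content; }
    }

    // Both conditions are required, mirroring two required BooleanQuery clauses.
    static List<Doc> search(List<Doc> docs, String group, String word) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs)
            if (d.group.equals(group)
                    && Arrays.asList(d.content.split("\\s+")).contains(word))
                hits.add(d);
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("group1", "hello world"),
            new Doc("group2", "hello again"));
        // Only the group1 document matches "hello" when group1 is required,
        // so the user never has to type the group name into the query box.
        System.out.println(search(docs, "group1", "hello").size());  // 1
    }
}
```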
PorterStemfilter
Hi,

This might be more of a question related to the PorterStemmer algorithm rather than to Lucene, but if anyone has the knowledge please share.

I am using the PorterStemFilter that comes with Lucene and it turns out that searching for the word 'printer' does not return a document containing the text 'print'. To narrow down the problem, I have tested the PorterStemFilter in a standalone program and it turns out that the stem of 'printer' is 'printer' and not 'print'. That is, 'printer' is not 'print' + 'er'; the whole word is the stem. Can somebody explain the behavior?

Thanks & Regards,
George
RE: Help for text based indexing
If I have rightly understood, you mean to say that the query for search has to be

"Group1" AND "Hello" (if hello is what I want to search?)

Cocula Remi <[EMAIL PROTECTED]> wrote:
A keyword is not tokenized; that's why you won't be able to search over a part of it. You'd rather use a Text field.

About creating a special field :

IndexWriter Ir =
File f =
Document doc = new Document();
if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group1")) {
doc.add(Field.Text("group", "Group1"));
}
if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group2")) {
doc.add(Field.Text("group", "Group2"));
}
doc.add(Field.Text("content", getContent(f)));
Ir.addDocument(doc);

Then you can search in group1 with a query like that :

group:Group1 AND rest_of_the_query.

-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 18:03
À : Lucene Users List
Objet : RE: Help for text based indexing

Well in my case the path is KeyWord. I had tried that earlier and it does not seem to work in a single index file.

Can you explain a bit more about adding group1 and group2 ?

Cocula Remi wrote:
Well you could add a field to each of your Documents whose value would be either "group1" or "group2". Or you could use the path to your files ...

-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:49
À : [EMAIL PROTECTED]
Objet : RE: Help for text based indexing

I am clear with looping recursively to index all the files under the Root folder. But the problem is if I want to search only in group1 or group2. Is it possible to search in only one of the group folders ?

Cocula Remi wrote:
You just have to loop recursively over the C:\tomcat\webapps\Root tree to create your index. Yes you can index databases; you will just have to write a mechanism that is able to create org.apache.lucene.document.Document from database.
For instance :
- connect JDBC
- run a query for obtaining a ResultSet
- loop for each row of that ResultSet : Create a new org.apache.lucene.document.Document from ResultSet data and add this document to the Index.
end loop.

For incremental indexing, I suppose you have to store some timestamp field in your index; but it's up to you. Note that Lucene is very fast and I don't think that incremental indexing is required for small or medium amounts of data.

-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:22
À : [EMAIL PROTECTED]
Objet : Help for text based indexing

Hi

I have implemented Text based search using lucene. It was wonderful playing around with it. Now I want to enhance the application. I have a Root folder, under that I have many other folders, that are group specific, say (group1, group2, .. so on). The Root folder is in C:\tomcat\webapps\Root and group folders within that. Now I am indexing these groups separately, ie, I have indexes as C:/index/group1, C:/index/group2, C:/index/group3 and so on. I want to know if I can have only one index for all these, say C:/index/Root (this has the index for all the folders), and I should be able to search using C:\tomcat\webapps\Root\group1 (if I want to search group1), similarly for the other groups. Let me know if this is possible and whether anybody has tried this.

2nd question

Is lucene good to index databases ? How do we support incremental indexing ? (Right now I am using LIKE for searching)

Thanks in Advance

Mahaveer
RE: Help for text based indexing
A keyword is not tokenized; that's why you won't be able to search over a part of it. You'd rather use a Text field.

About creating a special field :

IndexWriter Ir =
File f =
Document doc = new Document();
if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group1")) {
doc.add(Field.Text("group", "Group1"));
}
if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group2")) {
doc.add(Field.Text("group", "Group2"));
}
doc.add(Field.Text("content", getContent(f)));
Ir.addDocument(doc);

Then you can search in group1 with a query like that :

group:Group1 AND rest_of_the_query.

-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 18:03
À : Lucene Users List
Objet : RE: Help for text based indexing

Well in my case the path is KeyWord. I had tried that earlier and it does not seem to work in a single index file.

Can you explain a bit more about adding group1 and group2 ?

Cocula Remi <[EMAIL PROTECTED]> wrote:
Well you could add a field to each of your Documents whose value would be either "group1" or "group2". Or you could use the path to your files ...

-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:49
À : [EMAIL PROTECTED]
Objet : RE: Help for text based indexing

I am clear with looping recursively to index all the files under the Root folder. But the problem is if I want to search only in group1 or group2. Is it possible to search in only one of the group folders ?

Cocula Remi wrote:
You just have to loop recursively over the C:\tomcat\webapps\Root tree to create your index. Yes you can index databases; you will just have to write a mechanism that is able to create org.apache.lucene.document.Document from database. For instance :
- connect JDBC
- run a query for obtaining a ResultSet
- loop for each row of that ResultSet : Create a new org.apache.lucene.document.Document from ResultSet data and add this document to the Index.
end loop.
For incremental indexing, I suppose you have to store some timestamp field in your index; but it's up to you. Note that Lucene is very fast and I don't think that incremental indexing is required for small or medium amounts of data.

-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:22
À : [EMAIL PROTECTED]
Objet : Help for text based indexing

Hi

I have implemented Text based search using lucene. It was wonderful playing around with it. Now I want to enhance the application. I have a Root folder, under that I have many other folders, that are group specific, say (group1, group2, .. so on). The Root folder is in C:\tomcat\webapps\Root and group folders within that. Now I am indexing these groups separately, ie, I have indexes as C:/index/group1, C:/index/group2, C:/index/group3 and so on. I want to know if I can have only one index for all these, say C:/index/Root (this has the index for all the folders), and I should be able to search using C:\tomcat\webapps\Root\group1 (if I want to search group1), similarly for the other groups. Let me know if this is possible and whether anybody has tried this.

2nd question

Is lucene good to index databases ? How do we support incremental indexing ? (Right now I am using LIKE for searching)

Thanks in Advance

Mahaveer
RE: ANT +BUILD + LUCENE
Hi,

I've used the following Ant targets for build scripts that required platform-dependent work. In the example here, the property "catalina.home" is set according to what platform we're running on. You can adapt as needed.

>>> "Karthik N S" <[EMAIL PROTECTED]> 09/13/04 10:34PM >>>
Hi Erik

1) Using Ant and Build.xml I want to run the org.apache.lucene.demo.IndexFiles to create an Index folder

2) Problem is the same Build.xml is to be used across the O/s for creating the Index

3) The path of Lucene1-4-final.jar is in respective directories for the O/s...

[ Note :- The Path of Lucene_home, I/P and O/p directories are also O/s specific, should be in the Build.xml, and should be triggered something by this type, or I hope u get the situation. :{

With regards
Karthik
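The Ant targets themselves were lost from the archived message; a minimal sketch of the usual pattern they describe is below. The property names and paths are illustrative only, following the message's "catalina.home" example - the original build file almost certainly differed.

```xml
<!-- Detect the platform once; <condition>/<os> are standard Ant tasks. -->
<condition property="is.windows">
  <os family="windows"/>
</condition>

<!-- Each init target fires only on its platform. -->
<target name="init-windows" if="is.windows">
  <property name="catalina.home" value="C:/tomcat"/>
</target>

<target name="init-unix" unless="is.windows">
  <property name="catalina.home" value="/usr/local/tomcat"/>
</target>

<!-- Other targets depend on "init" and just use ${catalina.home}. -->
<target name="init" depends="init-windows,init-unix"/>
```

The same pattern works for locating Lucene's jar per platform: set a platform-specific property in the conditional targets and reference it in the `<classpath>`.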
RE: Help for text based indexing
Well in my case the path is KeyWord. I had tried that earlier and it does not seem to work in a single index file.

Can you explain a bit more about adding group1 and group2 ?

Cocula Remi <[EMAIL PROTECTED]> wrote:
Well you could add a field to each of your Documents whose value would be either "group1" or "group2". Or you could use the path to your files ...

-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:49
À : [EMAIL PROTECTED]
Objet : RE: Help for text based indexing

I am clear with looping recursively to index all the files under the Root folder. But the problem is if I want to search only in group1 or group2. Is it possible to search in only one of the group folders ?

Cocula Remi wrote:
You just have to loop recursively over the C:\tomcat\webapps\Root tree to create your index. Yes you can index databases; you will just have to write a mechanism that is able to create org.apache.lucene.document.Document from database. For instance :
- connect JDBC
- run a query for obtaining a ResultSet
- loop for each row of that ResultSet : Create a new org.apache.lucene.document.Document from ResultSet data and add this document to the Index.
end loop.

For incremental indexing, I suppose you have to store some timestamp field in your index; but it's up to you. Note that Lucene is very fast and I don't think that incremental indexing is required for small or medium amounts of data.

-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:22
À : [EMAIL PROTECTED]
Objet : Help for text based indexing

Hi

I have implemented Text based search using lucene. It was wonderful playing around with it. Now I want to enhance the application. I have a Root folder, under that I have many other folders, that are group specific, say (group1, group2, .. so on). The Root folder is in C:\tomcat\webapps\Root and group folders within that.
Now I am indexing these groups separately, ie, I have indexes as C:/index/group1, C:/index/group2, C:/index/group3 and so on. I want to know if I can have only one index for all these, say C:/index/Root (this has the index for all the folders), and I should be able to search using C:\tomcat\webapps\Root\group1 (if I want to search group1), similarly for the other groups. Let me know if this is possible and whether anybody has tried this.

2nd question

Is lucene good to index databases ? How do we support incremental indexing ? (Right now I am using LIKE for searching)

Thanks in Advance

Mahaveer
RE: Help for text based indexing
Well you could add a field to each of your Documents whose value would be either "group1" or "group2". Or you could use the path to your files ... -Message d'origine- De : mahaveer jain [mailto:[EMAIL PROTECTED] Envoyé : mardi 14 septembre 2004 17:49 À : [EMAIL PROTECTED] Objet : RE: Help for text based indexing I am clear with looping recursively to index all the file under Root folder. But the problem is if I want to search only in group1 or group2.Is that possible to search only in one of the group folder ? Cocula Remi <[EMAIL PROTECTED]> wrote: You just have to loop recurssively over the C:\tomcat\webapps\Root tree to create your index. Yes you can index databases; you will just have to write a mechanism that is able to create org.apache.lucene.document.Document from database. For instance : - connect JDBC - run a query for obtaining a ResultSet - loop for each row of that ResultSet : Create a new org.apache.lucene.document.Document from ResultSet data and add this document to the Index. end loop. For incremental indexing, I suppose you have to store some timestamp field in your index; but it's up to you. Note that Lucene is very fast and I don't think that incremetal indexing is required for small or medium amout of data. -Message d'origine- De : mahaveer jain [mailto:[EMAIL PROTECTED] Envoyé : mardi 14 septembre 2004 17:22 À : [EMAIL PROTECTED] Objet : Help for text based indexing Hi I have implemented Text based search using lucene. I was wonderful playing around with it. Now I want to enchance the application. I have a Root folder, under that I have many other folder, that are group specific, say (group1, group2, .. so on). The Root folder is in C:\tomcat\webapps\Root and group folder within that. 
Now I am indexing these groups separately, ie, I have indexes as C:/index/group1, C:/index/group2, C:/index/group3 and so on. I want to know if I can have only one index for all these, say C:/index/Root (this has the index for all the folders), and I should be able to search using C:\tomcat\webapps\Root\group1 (if I want to search group1), similarly for the other groups. Let me know if this is possible and whether anybody has tried this.

2nd question

Is lucene good to index databases ? How do we support incremental indexing ? (Right now I am using LIKE for searching)

Thanks in Advance

Mahaveer
RE: Help for text based indexing
I am clear with looping recursively to index all the files under the Root folder. But the problem is if I want to search only in group1 or group2. Is it possible to search in only one of the group folders ?

Cocula Remi <[EMAIL PROTECTED]> wrote:
You just have to loop recursively over the C:\tomcat\webapps\Root tree to create your index.

Yes you can index databases; you will just have to write a mechanism that is able to create org.apache.lucene.document.Document from database. For instance :
- connect JDBC
- run a query for obtaining a ResultSet
- loop for each row of that ResultSet : Create a new org.apache.lucene.document.Document from ResultSet data and add this document to the Index.
end loop.

For incremental indexing, I suppose you have to store some timestamp field in your index; but it's up to you. Note that Lucene is very fast and I don't think that incremental indexing is required for small or medium amounts of data.

-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:22
À : [EMAIL PROTECTED]
Objet : Help for text based indexing

Hi

I have implemented Text based search using lucene. It was wonderful playing around with it. Now I want to enhance the application.

I have a Root folder, under that I have many other folders, that are group specific, say (group1, group2, .. so on). The Root folder is in C:\tomcat\webapps\Root and group folders within that.

Now I am indexing these groups separately, ie, I have indexes as C:/index/group1, C:/index/group2, C:/index/group3 and so on.

I want to know if I can have only one index for all these, say C:/index/Root (this has the index for all the folders), and I should be able to search using C:\tomcat\webapps\Root\group1 (if I want to search group1), similarly for the other groups. Let me know if this is possible and whether anybody has tried this.

2nd question

Is lucene good to index databases ? How do we support incremental indexing ?
(Right now I am using LIKE for searching)

Thanks in Advance

Mahaveer
RE: Help for text based indexing
You just have to loop recursively over the C:\tomcat\webapps\Root tree to create your index.

Yes, you can index databases; you will just have to write a mechanism that is able to create org.apache.lucene.document.Document from the database. For instance :
- connect via JDBC
- run a query to obtain a ResultSet
- loop over each row of that ResultSet : create a new org.apache.lucene.document.Document from the ResultSet data and add this document to the Index.
end loop.

For incremental indexing, I suppose you have to store some timestamp field in your index; but it's up to you. Note that Lucene is very fast and I don't think that incremental indexing is required for small or medium amounts of data.

-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:22
À : [EMAIL PROTECTED]
Objet : Help for text based indexing

Hi

I have implemented Text based search using lucene. It was wonderful playing around with it. Now I want to enhance the application.

I have a Root folder, under that I have many other folders, that are group specific, say (group1, group2, .. so on). The Root folder is in C:\tomcat\webapps\Root and group folders within that.

Now I am indexing these groups separately, ie, I have indexes as C:/index/group1, C:/index/group2, C:/index/group3 and so on.

I want to know if I can have only one index for all these, say C:/index/Root (this has the index for all the folders), and I should be able to search using C:\tomcat\webapps\Root\group1 (if I want to search group1), similarly for the other groups. Let me know if this is possible and whether anybody has tried this.

2nd question

Is lucene good to index databases ? How do we support incremental indexing ? (Right now I am using LIKE for searching)

Thanks in Advance

Mahaveer
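Remi's row-to-Document loop can be sketched in plain Java. Everything here is assumed: the field-to-map shape stands in for org.apache.lucene.document.Document, the list stands in for IndexWriter.addDocument, and the column names are invented. In real code the rows would come from a JDBC ResultSet, roughly `while (rs.next()) { index.add(documentFromRow(rs.getString("id"), ...)); }`.

```java
import java.util.*;

// Sketch of indexing database rows: turn each row into a field->value map
// (stand-in for a Lucene Document) and collect the "documents".
public class DbIndexer {

    // Build one "document" from a row's columns.
    static Map<String, String> documentFromRow(String id, String title, String body) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", id);        // in Lucene this would be Field.Keyword: stored, not tokenized
        doc.put("title", title);  // Field.Text: stored and tokenized
        doc.put("body", body);
        return doc;
    }

    public static void main(String[] args) {
        List<Map<String, String>> index = new ArrayList<>();
        // Stand-in for looping over a ResultSet.
        String[][] rows = { {"1", "Lucene", "full-text search"},
                            {"2", "Ant", "build tool"} };
        for (String[] r : rows)
            index.add(documentFromRow(r[0], r[1], r[2]));  // stand-in for IndexWriter.addDocument
        System.out.println(index.size());  // 2
    }
}
```

For the incremental case Remi mentions, the usual trick is to store a timestamp field per document and re-index only rows whose database timestamp is newer.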
Help for text based indexing
Hi

I have implemented text-based search using Lucene. It was wonderful playing around with it. Now I want to enhance the application.

I have a Root folder, and under that I have many other folders that are group specific, say (group1, group2, .. so on). The Root folder is in C:\tomcat\webapps\Root with the group folders within that.

Now I am indexing these groups separately, i.e., I have indexes at C:/index/group1, C:/index/group2, C:/index/group3 and so on.

I want to know if I can have only one index for all these, say C:/index/Root (holding the index for all the folders), and still be able to search using C:\tomcat\webapps\Root\group1 (if I want to search group1), and similarly for the other groups. Let me know if this is possible and whether anybody has tried this.

2nd question

Is Lucene good for indexing databases ? How do we support incremental indexing ? (Right now I am using LIKE for searching.)

Thanks in Advance

Mahaveer
Re: ANT +BUILD + LUCENE
Karthik, You are still being a bit cryptic and making it hard for me to comprehend what the problem is, but here are some general pieces of advice with Ant related to what I think you are doing: * There is no need to use conditional logic to have a different set of properties for different operating systems. There is an implicit and declarative way to do this: But whitespace gets in the way, so you could use the ant-contrib (http://ant-contrib.sourceforge.net/tasks/index.html) which would be cleaner than the value of ${os.name}. * Using IndexFiles from the demo is awkward, to me. Why not give the sandbox task a try? * Ant has a task that might be handy for you. Please post how you are using (I can only presume), if that is the issue. Erik On Sep 13, 2004, at 10:34 PM, Karthik N S wrote: Hi Erik 1) Using Ant and Build.xml I want to run the org.apache.lucene.demo.IndexFiles to create an Indexfolder 2) Problem is The same Build.xml is to be used Across the O/s for creating Index 3) The path of Lucene1-4-final.jar are in respective directories for the O/s... [ Note :- The Path of Lucene_home,I/P and O/p directories are also O/s Specific should be in the Build.xml and should be trigged somthing by this type or I hope u get the situation. :{ With regards Karthik -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 7:37 PM To: Lucene Users List Subject: Re: ANT +BUILD + LUCENE I'm not following what you want very clearly, but there is an task in Lucene's Sandbox. Please post what you are trying, and I'd be happy to help once I see the details. Erik On Sep 12, 2004, at 4:44 PM, Karthik N S wrote: Hi Guys Apologies.. The Task for me is to build the Index folder using Lucene & a simple Build.xml for ANT The Problem .. Same 'Build .xml' should be used for differnet O/s... [ Win / Linux ] The glitch is respective jar files such as Lucene-1.4 .jar & other jar files are not in same dir for the O/s. 
Also the input and output indexer paths for source/target may vary. Please, somebody help me. :( With regards Karthik WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
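Erik's advice above might look like the following build.xml fragment. This is an untested sketch: the file names windows.properties / unix.properties and the property keys lucene.jar and docs.dir are invented for illustration, and each properties file would define the OS-specific paths.

```xml
<project name="index" default="index">
  <!-- Pick a per-OS property file declaratively; <condition> only sets
       props.file once, for the first matching OS family. -->
  <condition property="props.file" value="windows.properties">
    <os family="windows"/>
  </condition>
  <condition property="props.file" value="unix.properties">
    <os family="unix"/>
  </condition>
  <property file="${props.file}"/>

  <!-- windows.properties / unix.properties define lucene.jar and
       docs.dir for that platform, e.g.
         lucene.jar=C:/libs/lucene-1.4-final.jar
         docs.dir=C:/tomcat/webapps/Root -->
  <target name="index">
    <java classname="org.apache.lucene.demo.IndexFiles" fork="true">
      <classpath path="${lucene.jar}"/>
      <arg value="${docs.dir}"/>
    </java>
  </target>
</project>
```

The same build.xml then runs unchanged on Windows and Linux; only the properties files differ.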
Re: Search PhraseQuery
Natarajan.T wrote: Ok, you are correct... Suppose I type "what java"; then how can I handle that... You don't have to handle it; Lucene does it. If you don't like how Lucene handles it, then you may extend the functionality. If you use the same analyzer for indexing and searching, then you will find the results with both search strings, "what java" and "what is java". At least I obtain them in both cases. That's right, you will obtain "what java" if you search for "what is java"; in my case that is acceptable. If it is not acceptable in your project, I suggest you try to create a new Analyzer. I wish you luck, Sergiu Regards, Natarajan. -Original Message- From: sergiu gordea [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 7:38 PM To: Lucene Users List Subject: Re: Search PhraseQuery Natarajan.T wrote: Hi, Thanks for your response. For example, the search keyword is like below... Language "what is java" Token 1: language Token 2: what is java (like Google) Regards, Natarajan. Lucene works exactly as you describe above, with a simple correction... The analyzer has a list of stop words, and I bet "is" is one of them for your analyzer. I don't mind about this right now, so I won't dig for a solution to this problem, but the resolution should be sought around the Analyzer classes. All the best, Sergiu -Original Message- From: Aad Nales [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 5:19 PM To: 'Lucene Users List' Subject: RE: Search PhraseQuery Hi, Not sure if this is what you need, but I created a last-name filter, which in Dutch means potential double last names like "van der Vaart". In order to process these I created a finite state machine that queried these last names. Since I only needed the filter at index time and I never use it for querying, this may not be what you are looking for. It should be simple to index 'what is java' as a single token and to search for that same token. However, you will need to create a list of accepted 'tokens'.
If this is what you need let me know; I will make the code available... cheers, Aad Nales -Original Message- From: Honey George [mailto:[EMAIL PROTECTED] Sent: Tuesday, 14 September, 2004 13:39 To: Lucene Users List Subject: Re: Search PhraseQuery --- "Natarajan.T" <[EMAIL PROTECTED]> wrote: Hi All, How do I implement the PhraseQuery API? What exactly do you mean by implement? Are you trying to extend the current behavior or only trying to find out the usage? Thanks, George
RE: ANT +BUILD + LUCENE
Hi Erik 1) Using Ant and build.xml I want to run org.apache.lucene.demo.IndexFiles to create an index folder. 2) The problem is that the same build.xml is to be used across operating systems for creating the index. 3) The path of lucene-1.4-final.jar differs per OS... [Note: the paths of LUCENE_HOME and the input and output directories are also OS-specific, should be in the build.xml, and should be triggered somehow; I hope you get the situation. :{ ] With regards Karthik -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 7:37 PM To: Lucene Users List Subject: Re: ANT +BUILD + LUCENE I'm not following what you want very clearly, but there is an <index> task in Lucene's Sandbox. Please post what you are trying, and I'd be happy to help once I see the details. Erik On Sep 12, 2004, at 4:44 PM, Karthik N S wrote: > Hi > > Guys > > Apologies... > > My task is to build the index folder using Lucene and a simple > build.xml for Ant. > > The problem: the same build.xml should be used for different > operating systems [Win / Linux]. > > The glitch is that the jar files, such as lucene-1.4.jar and other jar > files, are not in the same directory on each OS. > Also the input and output indexer paths for source/target may vary. > > Please, somebody help me. :( > > with regards > Karthik > > WITH WARM REGARDS > HAVE A NICE DAY > [ N.S.KARTHIK]
RE: Search PhraseQuery
Ok, you are correct... Suppose I type "what java"; then how can I handle that... Regards, Natarajan. -Original Message- From: sergiu gordea [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 7:38 PM To: Lucene Users List Subject: Re: Search PhraseQuery Natarajan.T wrote: >Hi, > >Thanks for your response. > >For example, the search keyword is like below... > >Language "what is java" > >Token 1: language >Token 2: what is java (like Google) > > >Regards, >Natarajan. > > > Lucene works exactly as you describe above, with a simple correction ... The analyzer has a list of stop words, and I bet "is" is one of them for your analyzer. I don't mind about this right now, so I won't dig for a solution to this problem, but the resolution should be sought around the Analyzer classes. All the best, Sergiu > > > >-Original Message- >From: Aad Nales [mailto:[EMAIL PROTECTED] >Sent: Tuesday, September 14, 2004 5:19 PM >To: 'Lucene Users List' >Subject: RE: Search PhraseQuery > >Hi, > >Not sure if this is what you need, but I created a last-name filter, which >in Dutch means potential double last names like "van der Vaart". In >order to process these I created a finite state machine that queried >these last names. Since I only needed the filter at index time and I >never use it for querying, this may not be what you are looking for. >It should be simple to index 'what is java' as a single token and to >search for that same token. However, you will need to create a list of >accepted 'tokens'. If this is what you need let me know; I will make the >code available... > >cheers, >Aad Nales > >-Original Message- >From: Honey George [mailto:[EMAIL PROTECTED] >Sent: Tuesday, 14 September, 2004 13:39 >To: Lucene Users List >Subject: Re: Search PhraseQuery > > > --- "Natarajan.T" <[EMAIL PROTECTED]> >wrote: > > >>Hi All, >> >> >> >>How do I implement the PhraseQuery API? >> >> > >What exactly do you mean by implement?
Are you trying to >extend the current behavior or only trying to find out >the usage? >Thanks, > George
Re: Addition to contributions page
Perhaps we should @deprecate the contributions page like we did with the Powered By page, and migrate it to the wiki? Erik On Sep 13, 2004, at 6:50 PM, Daniel Naber wrote: On Friday 10 September 2004 15:48, Chas Emerick wrote: PDFTextStream should be added to the 'Document Converters' section, with this URL < http://snowtide.com >, and perhaps this heading: 'PDFTextStream -- PDF text and metadata extraction'. The 'Author' field should probably be left blank, since there's no single creator. I just added it. Regards Daniel -- http://www.danielnaber.de
Re: ANT +BUILD + LUCENE
I'm not following what you want very clearly, but there is an <index> task in Lucene's Sandbox. Please post what you are trying, and I'd be happy to help once I see the details. Erik On Sep 12, 2004, at 4:44 PM, Karthik N S wrote: Hi Guys, Apologies... My task is to build the index folder using Lucene and a simple build.xml for Ant. The problem: the same build.xml should be used for different operating systems [Win / Linux]. The glitch is that the jar files, such as lucene-1.4.jar and other jar files, are not in the same directory on each OS. Also the input and output indexer paths for source/target may vary. Please, somebody help me. :( with regards Karthik WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]
Re: Search PhraseQuery
Natarajan.T wrote: Hi, Thanks for your response. For example, the search keyword is like below... Language "what is java" Token 1: language Token 2: what is java (like Google) Regards, Natarajan. Lucene works exactly as you describe above, with a simple correction... The analyzer has a list of stop words, and I bet "is" is one of them for your analyzer. I don't mind about this right now, so I won't dig for a solution to this problem, but the resolution should be sought around the Analyzer classes. All the best, Sergiu -Original Message- From: Aad Nales [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 5:19 PM To: 'Lucene Users List' Subject: RE: Search PhraseQuery Hi, Not sure if this is what you need, but I created a last-name filter, which in Dutch means potential double last names like "van der Vaart". In order to process these I created a finite state machine that queried these last names. Since I only needed the filter at index time and I never use it for querying, this may not be what you are looking for. It should be simple to index 'what is java' as a single token and to search for that same token. However, you will need to create a list of accepted 'tokens'. If this is what you need let me know; I will make the code available... cheers, Aad Nales -Original Message- From: Honey George [mailto:[EMAIL PROTECTED] Sent: Tuesday, 14 September, 2004 13:39 To: Lucene Users List Subject: Re: Search PhraseQuery --- "Natarajan.T" <[EMAIL PROTECTED]> wrote: Hi All, How do I implement the PhraseQuery API? What exactly do you mean by implement? Are you trying to extend the current behavior or only trying to find out the usage? Thanks, George
Indexing object graphs
Interesting! http://kasparov.skife.org/blog/2004/09/13#lucene-graphs
RE: Search PhraseQuery
Hi, Thanks for your response. For example, the search keyword is like below... Language "what is java" Token 1: language Token 2: what is java (like Google) Regards, Natarajan. -Original Message- From: Aad Nales [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 5:19 PM To: 'Lucene Users List' Subject: RE: Search PhraseQuery Hi, Not sure if this is what you need, but I created a last-name filter, which in Dutch means potential double last names like "van der Vaart". In order to process these I created a finite state machine that queried these last names. Since I only needed the filter at index time and I never use it for querying, this may not be what you are looking for. It should be simple to index 'what is java' as a single token and to search for that same token. However, you will need to create a list of accepted 'tokens'. If this is what you need let me know; I will make the code available... cheers, Aad Nales -Original Message- From: Honey George [mailto:[EMAIL PROTECTED] Sent: Tuesday, 14 September, 2004 13:39 To: Lucene Users List Subject: Re: Search PhraseQuery --- "Natarajan.T" <[EMAIL PROTECTED]> wrote: > Hi All, > > How do I implement the PhraseQuery API? What exactly do you mean by implement? Are you trying to extend the current behavior or only trying to find out the usage? Thanks, George
Document Relevance
Hi, I am new to Lucene. Could anyone tell me how to control the relevance by which search results are ranked and displayed? Are any online examples available on this topic? I welcome your suggestions. Thanks & Regards, E. Faisal
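Lucene already returns Hits ranked by relevance score; to influence that ranking at index time you can boost a document (or field). The following is a hedged sketch against the Lucene 1.4 API; the index path, field name, and sample text are made up for illustration.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class RelevanceDemo {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

        Document important = new Document();
        important.add(Field.Text("contents", "lucene search engine"));
        important.setBoost(2.0f);  // score contributions from this doc are scaled up
        writer.addDocument(important);

        Document ordinary = new Document();
        ordinary.add(Field.Text("contents", "lucene search engine"));
        writer.addDocument(ordinary);
        writer.close();

        IndexSearcher searcher = new IndexSearcher("index");
        Query query = QueryParser.parse("lucene", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(query);  // Hits come back sorted by score
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.score(i) + " " + hits.doc(i).get("contents"));
        }
        searcher.close();
    }
}
```

The boosted document should sort ahead of the ordinary one for the same matching text; query-time boosts (e.g. `term^2` in QueryParser syntax) are another lever.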
RE: Search PhraseQuery
--- "Natarajan.T" <[EMAIL PROTECTED]> wrote: > I am trying to extend the current behavior. You might have already seen a mail from Cocula Remi on this. Please provide more details of the problem for specific comments: basically, the problem you are facing and/or what behavior you are trying to extend. This was not clear from your email. An example will make things clearer. Thanks & Regards, George
RE: Search PhraseQuery
I am trying to extend the current behavior. Regards, Natarajan. -Original Message- From: Honey George [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 5:09 PM To: Lucene Users List Subject: Re: Search PhraseQuery --- "Natarajan.T" <[EMAIL PROTECTED]> wrote: > Hi All, > > How do I implement the PhraseQuery API? What exactly do you mean by implement? Are you trying to extend the current behavior or only trying to find out the usage? Thanks, George
RE: Search PhraseQuery
Hi, Not sure if this is what you need, but I created a last-name filter, which in Dutch means potential double last names like "van der Vaart". In order to process these I created a finite state machine that queried these last names. Since I only needed the filter at index time and I never use it for querying, this may not be what you are looking for. It should be simple to index 'what is java' as a single token and to search for that same token. However, you will need to create a list of accepted 'tokens'. If this is what you need let me know; I will make the code available... cheers, Aad Nales -Original Message- From: Honey George [mailto:[EMAIL PROTECTED] Sent: Tuesday, 14 September, 2004 13:39 To: Lucene Users List Subject: Re: Search PhraseQuery --- "Natarajan.T" <[EMAIL PROTECTED]> wrote: > Hi All, > > How do I implement the PhraseQuery API? What exactly do you mean by implement? Are you trying to extend the current behavior or only trying to find out the usage? Thanks, George
Re: Search PhraseQuery
--- "Natarajan.T" <[EMAIL PROTECTED]> wrote: > Hi All, > > How do I implement the PhraseQuery API? What exactly do you mean by implement? Are you trying to extend the current behavior or only trying to find out the usage? Thanks, George
RE: Search PhraseQuery
Hi Sergiu, String queryString = "\"what is java\""; Query q = QueryParser.parse(queryString, "field", new StandardAnalyzer()); System.out.println(q.toString()); This is enough to get started; consult the Lucene API for more information. Have you tested the above query? This search keyword is not a single keyword. Regards, Natarajan. -Original Message- From: sergiu gordea [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 4:34 PM To: Lucene Users List Subject: Re: Search PhraseQuery String queryString = "\"what is java\""; Query q = QueryParser.parse(queryString, "field", new StandardAnalyzer()); System.out.println(q.toString()); This is enough to get started; consult the Lucene API for more information. Sergiu Natarajan.T wrote: >Hi, > >Thanks for your mail; that link is only theoretical, but I need a sample. > > >Regards, >Natarajan. > > >-Original Message- >From: Cocula Remi [mailto:[EMAIL PROTECTED] >Sent: Tuesday, September 14, 2004 2:58 PM >To: Lucene Users List >Subject: RE: Search PhraseQuery > >Use QueryParser. >Please take a look at >http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html >It's pretty clear. > > >-Original Message- >From: Natarajan.T [mailto:[EMAIL PROTECTED] >Sent: Tuesday, September 14, 2004 11:26 >To: 'Lucene Users List' >Subject: Search PhraseQuery > >Hi All, > >How do I implement the PhraseQuery API? Please send me some sample code. (How can I handle "java is platform" as a single word?) > >Regards, > >Natarajan.
Re: Search PhraseQuery
String queryString = "\"what is java\""; Query q = QueryParser.parse(queryString, "field", new StandardAnalyzer()); System.out.println(q.toString()); This is enough to get started; consult the Lucene API for more information. Sergiu Natarajan.T wrote: Hi, Thanks for your mail; that link is only theoretical, but I need a sample. Regards, Natarajan. -Original Message- From: Cocula Remi [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 2:58 PM To: Lucene Users List Subject: RE: Search PhraseQuery Use QueryParser. Please take a look at http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html It's pretty clear. -Original Message- From: Natarajan.T [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 11:26 To: 'Lucene Users List' Subject: Search PhraseQuery Hi All, How do I implement the PhraseQuery API? Please send me some sample code. (How can I handle "java is platform" as a single word?) Regards, Natarajan.
RE: Search PhraseQuery
Hi, Thanks for your mail; that link is only theoretical, but I need a sample. Regards, Natarajan. -Original Message- From: Cocula Remi [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 2:58 PM To: Lucene Users List Subject: RE: Search PhraseQuery Use QueryParser. Please take a look at http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html It's pretty clear. -Original Message- From: Natarajan.T [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 11:26 To: 'Lucene Users List' Subject: Search PhraseQuery Hi All, How do I implement the PhraseQuery API? Please send me some sample code. (How can I handle "java is platform" as a single word?) Regards, Natarajan.
RE: Search PhraseQuery
Use QueryParser. Please take a look at http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html It's pretty clear. -Original Message- From: Natarajan.T [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 11:26 To: 'Lucene Users List' Subject: Search PhraseQuery Hi All, How do I implement the PhraseQuery API? Please send me some sample code. (How can I handle "java is platform" as a single word?) Regards, Natarajan.
Search PhraseQuery
Hi All, How do I implement the PhraseQuery API? Please send me some sample code. (How can I handle "java is platform" as a single word?) Regards, Natarajan.
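A phrase query can also be built by hand instead of going through QueryParser. This is a small sketch against the Lucene 1.4 API; the field name "contents" is an assumption. The slop setting is one way to tolerate the position gap left when a stop-word analyzer drops "is" from "what is java".

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PhraseQueryDemo {
    public static void main(String[] args) {
        // "what is java" with "is" removed by a stop-word analyzer leaves a
        // two-term phrase; slop 1 lets the remaining terms sit one extra
        // position apart, so the indexed text still matches.
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("contents", "what"));
        phrase.add(new Term("contents", "java"));
        phrase.setSlop(1);
        System.out.println(phrase.toString("contents"));
    }
}
```

With slop 0 the terms must be adjacent; QueryParser builds the equivalent PhraseQuery from a quoted string like "\"what is java\"".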
Re: OutOfMemory example
On Tuesday 14 September 2004 08:32, Jiří Kuhn wrote: > The error is thrown at exactly the same point as before. This morning I > downloaded Lucene from CVS, now the jar is lucene-1.5-rc1-dev.jar, the JVM > is 1.4.2_05-b04, on both Linux and Windows. Now I can reproduce the problem. I first tried running the code inside Eclipse, but the exception doesn't occur there. It does occur on the command line. Regards Daniel -- http://www.danielnaber.de