Re: PorterStemfilter

2004-09-14 Thread Honey George
 --- Tea Yu <[EMAIL PROTECTED]> wrote: 
> David,
> 
> For me, I don't want a search for "in print" to
> give results from "in printer".
> I'd consider that a case of over-stemming.
Here the "in" won't be considered, as it is a stopword
in most of the analyzers; I know it is in
StandardAnalyzer. So searching for 'in print' will not
return the document containing 'in printer', because
stem('printer') is 'printer' and not 'print', so
'printer' is what gets stored in the index.
Enclosing the phrase in double quotes does not prevent stemming.

> I'm also not happy that "effective" is stemmed to "effect" by
> Snowball recently


I have tested this with PorterStemFilter, and there
too "effective" is stemmed to "effect". There are
more serious problems: "printable" is stemmed to
"printabl".
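For what it's worth, Porter leaves "printer" alone because of its measure condition: the rule that strips "-er" only fires when the measure m of the remaining stem is greater than 1. A rough Python sketch of m (simplified: only a/e/i/o/u count as vowels, and the 'y' rule is ignored):

```python
def measure(stem: str) -> int:
    """Porter's m: the number of vowel-to-consonant transitions in the stem."""
    m = 0
    prev_vowel = False
    for ch in stem:
        vowel = ch in "aeiou"
        if prev_vowel and not vowel:
            m += 1
        prev_vowel = vowel
    return m

print(measure("print"))   # 1 -> "-er" kept: stem("printer") == "printer"
print(measure("airlin"))  # 2 -> "-er" stripped: "airliner" -> "airlin"
```

Since m("print") is only 1, "printer" survives unstemmed, while a word like "airliner" (one of Porter's own examples) loses its "-er".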

Thanks,
  George





___ALL-NEW Yahoo! Messenger - 
all new features - even more fun!  http://uk.messenger.yahoo.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: PorterStemfilter

2004-09-14 Thread Tea Yu
David,

For me, I don't want a search for "in print" to give results from "in printer".
I'd consider that a case of over-stemming.

I'm also not happy that "effective" is stemmed to "effect" by
Snowball recently

Cheers
Tea


> Hi David
>
> I like KStem more than Porter / Snowball - it still has limitations,
> but performs better as it has a dictionary to augment the rules.
>
> Note that KStem will also treat "print" and "printer" as two distinct
> terms, probably treating them as a verb and a noun respectively.
>
> Cheers
>
> Pete Lewis
>
> - Original Message - 
> From: "David Spencer" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Tuesday, September 14, 2004 7:19 PM
> Subject: Re: PorterStemfilter
>
>
> > Honey George wrote:
> >
> > > Hi,
> > >  This might be more of a question related to the
> > > PorterStemmer algorithm than to Lucene, but
> > > if anyone has the knowledge, please share.
> >
> > You might want to also try the Snowball stemmer:
> >
> > http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/
> >
> > And KStem:
> >
> > http://ciir.cs.umass.edu/downloads/
> > >
> > > I am using the PorterStemFilter that comes with Lucene,
> > > and it turns out that searching for the word 'printer'
> > > does not return a document containing the text
> > > 'print'. To narrow down the problem, I have tested the
> > > PorterStemFilter in a standalone program, and it turns
> > > out that the stem of 'printer' is 'printer' and not
> > > 'print'. That is, 'printer' is not equal to 'print' +
> > > 'er'; the whole word is the stem. Can somebody
> > > explain the behavior?
> > >
> > > Thanks & Regards,
> > >George
> > >
> > >
> > >
> > >
> > >



Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Doug Cutting wrote:
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a 
term in the index, but the next 2 are. So it leaves the last 2 terms 
("descent" and "parser") as-is and suggests alternatives to 
"recursize"...thus if any term is in the index, regardless of 
frequency, it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare 
terms in the query might be misspelled (i.e. not what the user 
intended) and we suggest alternatives to these words too (in addition 
to the words in the query that are not in the index at all).

Almost.
If the user enters "a recursize purser", then: "a", which is in, say, 
more than 50% of the documents, is probably spelled correctly and "recursize", 
which is in zero documents, is probably misspelled.  But what about 
"purser"?  If we run the spell check algorithm on "purser" and generate 
"parser", should we show it to the user?  If "purser" occurs in 1% of 
documents and "parser" occurs in 5%, then we probably should, since 
"parser" is a more common word than "purser".  But if "parser" only 
occurs in 1% of the documents and purser occurs in 5%, then we probably 
shouldn't bother suggesting "parser".
OK, sure, got it.
I'll give it a think and try to add this option to my just submitted 
spelling code.


If you wanted to get really fancy then you could check how frequently 
combinations of query terms occur, i.e., does "purser" or "parser" occur 
more frequently near "descent".  But that gets expensive.
Yeah, expensive for a large scale search engine, but probably 
appropriate for a desktop engine.

Doug


Re: Similarity score computation documentation

2004-09-14 Thread Doug Cutting
Your analysis sounds correct.
At base, a weight is a normalized tf*idf.  So a document weight is:
  docTf * idf * docNorm
and a query weight is:
  queryTf * idf * queryNorm
where queryTf is always one.
So the product of these is (docTf * idf * docNorm) * (idf * queryNorm), 
which indeed contains idf twice.  I think the best documentation fix 
would be to add another idf(t) clause at the end of the formula, next to 
queryNorm(q), so this is clear.  Does that sound right to you?

Doug
Ken McCracken wrote:
Hi,
I was looking through the score computation when running search, and
think there may be a discrepancy between what is _documented_ in the
org.apache.lucene.search.Similarity class overview Javadocs, and what
actually occurs in the code.
I believe the problem is only with the documentation.
I'm pretty sure that there should be an idf^2 in the sum.  Look at
org.apache.lucene.search.TermQuery, the inner class TermWeight.  You
can see that first sumOfSquaredWeights() is called, followed by
normalize(), during search.  Further, the resulting value stored in
the field "value" is set as the "weightValue" on the TermScorer.
If we look at what happens to TermWeight, sumOfSquaredWeights() sets
"queryWeight" to idf * boost.  During normalize(), "queryWeight" is
multiplied by the query norm, and "value" is set to queryWeight * idf
== idf * boost * query norm * idf == idf^2 * boost * query norm.  This
becomes the "weightValue" in the TermScorer that is then used to
multiply with the appropriate tf, etc., values.
The remaining terms in the Similarity description are properly
appended.  I also see that the queryNorm effectively "cancels out"
(dimensionally, since it is 1 / the square root of a sum of squares of
idfs) one of the idfs, so the formula still ends up being roughly a
TF-IDF formula.  But the idf^2 should still be there, along with the
expansion of queryNorm.
Am I mistaken, or is the documentation off?
Thanks for your help,
-Ken
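Ken's trace of TermWeight can be checked numerically with a small sketch (Python, with made-up idf and boost values; this just mirrors the algebra described above, not Lucene's actual classes):

```python
import math

def term_weight_value(idf, boost):
    # sumOfSquaredWeights(): queryWeight = idf * boost
    query_weight = idf * boost
    sum_sq = query_weight ** 2            # single-term query
    query_norm = 1.0 / math.sqrt(sum_sq)  # the norm passed to normalize()
    # normalize(): queryWeight *= queryNorm; value = queryWeight * idf
    query_weight *= query_norm
    return query_weight * idf             # == idf^2 * boost * queryNorm

idf, boost = 2.0, 1.5
v = term_weight_value(idf, boost)
qn = 1.0 / math.sqrt((idf * boost) ** 2)
assert abs(v - idf**2 * boost * qn) < 1e-9
print(v)  # 2.0: queryNorm cancels one idf (and the boost), one idf survives
```

So the weightValue really does carry idf twice before the norm knocks one copy back out.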


Hits.doc(x) and range queries

2004-09-14 Thread roy-lucene-user
Hi guys!

I've posted previously that Hits.doc(x) was taking a long time.  It turns out it
has to do with a date range in our query.  We usually do date ranges like this:
Date:[(lucene date field) - (lucene date field)]

Sometimes the begin date is "0", which is what we get from
DateField.dateToString( new Date( 0 ) ).

This is when getting our search results from the Hits object takes an absurd
amount of time.  It's usually each time the Hits object attempts to get more
results from an IndexSearcher ( i.e., every 100? ).

It also takes up more memory...

I was wondering why it affects the search so much even though we're only
returning 350 or so results.  Does the QueryParser do something similar to the
DateFilter on range queries?  Would it be better to use a DateFilter?
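One possible explanation, sketched with invented numbers: a range query is typically rewritten into a BooleanQuery with one clause per indexed term that falls inside the range, so a range whose low end is Date(0) can pull in nearly every date term in the index:

```python
def range_clause_count(indexed_terms, lo, hi):
    # One boolean clause per indexed term falling inside the range.
    return sum(lo <= t <= hi for t in indexed_terms)

dates = range(100_000)  # hypothetical unique date terms in an index
print(range_clause_count(dates, 0, 99_999))       # 100000: the Date(0) case
print(range_clause_count(dates, 99_000, 99_999))  # 1000: a narrow range
```

If that is what's happening here, a filter that marks matching documents in a bitset instead of building a giant query should behave much better.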

We're using Lucene 1.2 (with plans to upgrade).  Do newer versions of Lucene
have this problem?

Roy.




Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a term 
in the index, but the next 2 are. So it leaves the last 2 terms 
("descent" and "parser") as-is and suggests alternatives to 
"recursize"...thus if any term is in the index, regardless of frequency, 
it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare terms 
in the query might be misspelled (i.e. not what the user intended) and 
we suggest alternatives to these words too (in addition to the words in 
the query that are not in the index at all).
Almost.
If the user enters "a recursize purser", then: "a", which is in, say, 
>50% of the documents, is probably spelled correctly and "recursize", 
which is in zero documents, is probably misspelled.  But what about 
"purser"?  If we run the spell check algorithm on "purser" and generate 
"parser", should we show it to the user?  If "purser" occurs in 1% of 
documents and "parser" occurs in 5%, then we probably should, since 
"parser" is a more common word than "purser".  But if "parser" only 
occurs in 1% of the documents and purser occurs in 5%, then we probably 
shouldn't bother suggesting "parser".
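Doug's rule of thumb could be sketched like this (Python, with the document frequencies invented for illustration):

```python
def filter_suggestions(query_term, candidates, doc_freq):
    """Keep only corrections that are more common than what the user typed.
    doc_freq maps term -> fraction of documents containing it."""
    return [c for c in candidates
            if doc_freq.get(c, 0.0) > doc_freq.get(query_term, 0.0)]

# "purser" in 1% of docs, "parser" in 5%: worth suggesting.
print(filter_suggestions("purser", ["parser"], {"purser": 0.01, "parser": 0.05}))  # ['parser']
# Reversed frequencies: don't bother.
print(filter_suggestions("purser", ["parser"], {"purser": 0.05, "parser": 0.01}))  # []
```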

If you wanted to get really fancy then you could check how frequently 
combinations of query terms occur, i.e., does "purser" or "parser" occur 
more frequently near "descent".  But that gets expensive.

Doug


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
Andrzej Bialecki wrote:
I was wondering about the way you build the n-gram queries. You 
basically don't care about their position in the input term. Originally 
I thought about using PhraseQuery with a slop - however, after checking 
the source of PhraseQuery I realized that this probably wouldn't be that 
fast... You use BooleanQuery and start/end boosts instead, which may 
give similar results in the end but much cheaper.
Sloppy PhraseQueries are slower than BooleanQueries, but not horribly 
so.  The problem is that they don't handle the case where phrase 
elements are missing altogether, while a BooleanQuery does.  So what you 
really need is maybe a variation of a sloppy PhraseQuery that scores 
matches that do not contain all of the terms...

Doug


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
...or prepare in advance a fast lookup index - split all existing
terms to bi- or trigrams, create a separate lookup index, and then
simply for each term ask a phrase query (phrase = all n-grams from
an input term), with a slop > 0, to get similar existing terms.
This should be fast, and you could provide a "did you mean"
function too...
Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
phases. First you build a "fast lookup index" as mentioned above.
Then to correct a word you do a query in this index based on the
ngrams in the misspelled word.
The background for this suggestion was that I was playing some time ago 
with a Luke plugin that builds various sorts of ancillary indexes, but 
then I never finished it... Kudos for actually making it work ;-)
Sure, it was a fun little edge project. For the most part the code was 
done last week right after this thread appeared, but it always takes a 
while to get it from 95% to 100%.


[1] Source is attached and I'd like to contribute it to the sandbox,
esp if someone can validate that what it's doing is reasonable and
useful.

There have been many requests for this or similar functionality in the 
past, I believe it should go into sandbox.

I was wondering about the way you build the n-gram queries. You 
basically don't care about their position in the input term. Originally 
I thought about using PhraseQuery with a slop - however, after checking 
the source of PhraseQuery I realized that this probably wouldn't be that 
fast... You use BooleanQuery and start/end boosts instead, which may 
give similar results in the end but much cheaper.

I also wonder how this algorithm would behave for smaller values of 
Sure, I'll try to rebuild the demo w/ lengths 2-5 (and then the query 
page can test any contiguous combo).

start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram 
length, the more "fuzziness" you introduce, which may or may not be 
desirable (increased recall at the cost of precision - for small indexes 
this may be useful from the user's perspective because you will always 
get a plausible hit, for huge indexes it's a loss).

[2] Here's a demo page. I built an ngram index for ngrams of length 3
 and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like
"recursixe" or whatnot to see what suggestions it returns. Note this
is not a normal search index query -- rather this is a test page for
spelling corrections.

http://www.searchmorph.com/kat/spell.jsp

Very nice demo! 
Thanks, kinda designed for ngram-nerds if you know what I mean :)
I bet it's running way faster than the linear search 
Indeed, this is almost zero time, whereas the simple and dumb linear 
search was taking me 10 sec. I will have to redo the site's main search 
page so it uses this new code, TBD, prob tomorrow.

over terms :-), even though you have to build the index in advance. But 
if you work with static or mostly static indexes this doesn't matter.

Based on a subsequent mail in this thread I set boosts for the words
in the ngram index. The background is each word (er..term for a given
 field) in the orig index is a separate Document in the ngram index.
This Doc contains all ngrams (in my test case, like #2 above, of
length 3 and 4) of the word. I also set a boost of
log(word_freq)/log(num_docs) so that more frequent words will tend
to be suggested more often.

You may want to experiment with 2 <= n <= 5. Some n-gram based 
Yep, will do prob tomorrow.
techniques use all lengths together, some others use just single length, 
results also vary depending on the language...

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher

I think this is a reasonable heuristic. Reading the code I would 
present it this way:
ok, thx, will update
- words that share more ngrams with the orig word score higher, and
  words that share rare ngrams with the orig word score higher
  (as a natural consequence of using BooleanQuery),
- and, frequently occurring words score higher (as a consequence of using
  per-Document boosts),
- from reading the source code I see that you use Levenshtein distance
  to prune the resultset of too long/too short results,
I think also that because you don't use the positional information about 
 the input n-grams you may be getting some really weird hits.
Good point, though I haven't seen this yet. Might be due to the prefix 
boost and maybe some Markov chain magic tending to only show reasonable 
words.

You could 
prune them by simply checking if you find a (threshold) of input ngrams 
in the right sequence in the found terms. This shouldn't be too costly 
Good point, I'll try to add that in as an optional parameter.
because you operate on a small result set.

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Andrzej Bialecki
David Spencer wrote:
...or prepare in advance a fast lookup index - split all existing
terms to bi- or trigrams, create a separate lookup index, and then
simply for each term ask a phrase query (phrase = all n-grams from
an input term), with a slop > 0, to get similar existing terms.
This should be fast, and you could provide a "did you mean"
function too...
Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
phases. First you build a "fast lookup index" as mentioned above.
Then to correct a word you do a query in this index based on the
ngrams in the misspelled word.
The background for this suggestion was that I was playing some time ago 
with a Luke plugin that builds various sorts of ancillary indexes, but 
then I never finished it... Kudos for actually making it work ;-)

[1] Source is attached and I'd like to contribute it to the sandbox,
esp if someone can validate that what it's doing is reasonable and
useful.
There have been many requests for this or similar functionality in the 
past, I believe it should go into sandbox.

I was wondering about the way you build the n-gram queries. You 
basically don't care about their position in the input term. Originally 
I thought about using PhraseQuery with a slop - however, after checking 
the source of PhraseQuery I realized that this probably wouldn't be that 
fast... You use BooleanQuery and start/end boosts instead, which may 
give similar results in the end but much cheaper.
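The construction being discussed might look roughly like this in outline (plain Python standing in for the BooleanQuery; the gram length and boost values are invented for illustration):

```python
def spell_query(word, n=3, start_boost=2.0, end_boost=1.5):
    """One SHOULD-style clause per n-gram of the misspelled word;
    the first and last grams get extra boost (values here are made up)."""
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    boosts = {g: 1.0 for g in grams}
    boosts[grams[0]] = start_boost   # the word-start gram anchors the match
    boosts[grams[-1]] = end_boost    # so does the word-end gram
    return boosts

q = spell_query("recursixe")
print(sorted(q))           # ['cur', 'ecu', 'ixe', 'rec', 'rsi', 'six', 'urs']
print(q["rec"], q["ixe"])  # 2.0 1.5
```

Since any clause can match on its own, candidates missing some grams still score, which is exactly what a sloppy PhraseQuery can't do.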

I also wonder how this algorithm would behave for smaller values of 
start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram 
length, the more "fuzziness" you introduce, which may or may not be 
desirable (increased recall at the cost of precision - for small indexes 
this may be useful from the user's perspective because you will always 
get a plausible hit, for huge indexes it's a loss).

[2] Here's a demo page. I built an ngram index for ngrams of length 3
 and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like
"recursixe" or whatnot to see what suggestions it returns. Note this
is not a normal search index query -- rather this is a test page for
spelling corrections.

http://www.searchmorph.com/kat/spell.jsp
Very nice demo! I bet it's running way faster than the linear search 
over terms :-), even though you have to build the index in advance. But 
if you work with static or mostly static indexes this doesn't matter.

Based on a subsequent mail in this thread I set boosts for the words
in the ngram index. The background is each word (er..term for a given
 field) in the orig index is a separate Document in the ngram index.
This Doc contains all ngrams (in my test case, like #2 above, of
length 3 and 4) of the word. I also set a boost of
log(word_freq)/log(num_docs) so that more frequent words will tend
to be suggested more often.
You may want to experiment with 2 <= n <= 5. Some n-gram based 
techniques use all lengths together, some others use just single length, 
results also vary depending on the language...

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher
I think this is a reasonable heuristic. Reading the code I would 
present it this way:

- words that share more ngrams with the orig word score higher, and
  words that share rare ngrams with the orig word score higher
  (as a natural consequence of using BooleanQuery),
- and, frequently occurring words score higher (as a consequence of using
  per-Document boosts),
- from reading the source code I see that you use Levenshtein distance
  to prune the resultset of too long/too short results,
I think also that because you don't use the positional information about 
 the input n-grams you may be getting some really weird hits. You could 
prune them by simply checking if you find a (threshold) of input ngrams 
in the right sequence in the found terms. This shouldn't be too costly 
because you operate on a small result set.

[6]
If people want to vote me in as a committer to the sandbox then I can
Well, someone needs to maintain the code after all... ;-)
--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)
-


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Tate Avery wrote:
I get a NullPointerException shown (via Apache) when I try to access http://www.searchmorph.com/kat/spell.jsp
How embarrassing!
Sorry!
Fixed!


T
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 3:23 PM
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene
Andrzej Bialecki wrote:

David Spencer wrote:

I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate 
the Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.

...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop > 0, to get similar existing terms. This should be fast, and 
you could provide a "did you mean" function too...


Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like "recursixe" 
or whatnot to see what suggestions it returns. Note this is not a normal 
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp
[3] Here's the javadoc:
http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
[5] A few more details:
Based on a subsequent mail in this thread I set boosts for the words in 
the ngram index. The background is each word (er..term for a given 
field) in the orig index is a separate Document in the ngram index. This 
Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
that more frequent words will tend to be suggested more often.
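The boost formula works out like this (Python sketch; the word frequencies below are invented for a ~100k-document index like the demo's):

```python
import math

def word_boost(word_freq, num_docs):
    # log(word_freq)/log(num_docs): more frequent words get a larger
    # per-Document boost and so surface as suggestions more often.
    return math.log(word_freq) / math.log(num_docs)

print(word_boost(1, 100_000))       # 0.0: a one-off word gets no lift
print(word_boost(10_000, 100_000))  # 0.8
```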

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher

[6]
If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
  Dave



RE: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Tate Avery

I get a NullPointerException shown (via Apache) when I try to access 
http://www.searchmorph.com/kat/spell.jsp


T

-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 3:23 PM
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene


Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>>
>> I can/should send the code out. The logic is that for any terms in a 
>> query that have zero matches, go thru all the terms(!) and calculate 
>> the Levenshtein string distance, and return the best matches. A more 
>> intelligent way of doing this is to instead look for terms that also 
>> match on the 1st "n" (prob 3) chars.
> 
> 
> ...or prepare in advance a fast lookup index - split all existing terms 
> to bi- or trigrams, create a separate lookup index, and then simply for 
> each term ask a phrase query (phrase = all n-grams from an input term), 
> with a slop > 0, to get similar existing terms. This should be fast, and 
> you could provide a "did you mean" function too...
> 

Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.

[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like "recursixe" 
or whatnot to see what suggestions it returns. Note this is not a normal 
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp

[3] Here's the javadoc:

http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html

[4] Here's source in HTML:

http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152

[5] A few more details:

Based on a subsequent mail in this thread I set boosts for the words in 
the ngram index. The background is each word (er..term for a given 
field) in the orig index is a separate Document in the ngram index. This 
Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
that more frequent words will tend to be suggested more often.

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher

[6]

If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
  Dave







NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate 
the Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.
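The Levenshtein string distance mentioned above is the classic dynamic-programming edit distance; a compact sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes,
    and substitutions needed to turn a into b (two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("recursixe", "recursive"))  # 1
print(levenshtein("purser", "parser"))        # 1
```

Running this over every term in the index is what makes the naive approach linear (and slow), hence the prefix-match shortcut or the n-gram index.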

...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop > 0, to get similar existing terms. This should be fast, and 
you could provide a "did you mean" function too...

Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like "recursixe" 
or whatnot to see what suggestions it returns. Note this is not a normal 
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp
[3] Here's the javadoc:
http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
[5] A few more details:
Based on a subsequent mail in this thread I set boosts for the words in 
the ngram index. The background is each word (er..term for a given 
field) in the orig index is a separate Document in the ngram index. This 
Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
that more frequent words will tend to be suggested more often.

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher

[6]
If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
 Dave

package org.apache.lucene.spell;

/* 
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2001-2003 The Apache Software Foundation.  All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *notice, this list of conditions and the following disclaimer in
 *the documentation and/or other materials provided with the
 *distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *if any, must include the following acknowledgment:
 *   "This product includes software developed by the
 *Apache Software Foundation (http://www.apache.org/)."
 *Alternately, this acknowledgment may appear in the software itself,
 *if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names "Apache" and "Apache Software Foundation" and
 *"Apache Lucene" must not be used to endorse or promote products
 *derived from this software without prior written permission. For
 *written permission, please contact [EMAIL PROTECTED]
 *
 * 5. Products derived from this software may not be called "Apache",
 *"Apache Lucene", nor may "Apache" appear in their name, without
 *prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * ===

Re: PorterStemfilter

2004-09-14 Thread Pete Lewis
Hi David

I like KStem more than Porter / Snowball - but still has limitations
although performs better as it has a dictionary to augment the rules.

Note that KStem will also treat "print" and "printer" as two distinct terms,
probably treating it as verb and noun respectively.

Cheers

Pete Lewis

- Original Message - 
From: "David Spencer" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, September 14, 2004 7:19 PM
Subject: Re: PorterStemfilter


> Honey George wrote:
>
> > Hi,
> >  This might be more of a question related to the
> > PorterStemmer algorithm rather than to lucene, but
> > if anyone has the knowledge please share.
>
> You might want to also try the Snowball stemmer:
>
> http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/
>
> And KStem:
>
> http://ciir.cs.umass.edu/downloads/
> >
> > I am using the PorterStemFilter that comes with lucene
> > and it turns out that searching for the word 'printer'
> > does not return a document containing the text
> > 'print'. To narrow down the problem, I have tested the
> > PorterStemFilter in a standalone program and it turns
> > out that the stem of 'printer' is 'printer' and not
> > 'print'. That is, 'printer' is not treated as 'print' +
> > 'er'; the whole word is the stem. Can somebody
> > explain the behavior?
> >
> > Thanks & Regards,
> >George
> >
> >
> >
> >
> >
> > ___ALL-NEW
Yahoo! Messenger - all new features - even more fun!
http://uk.messenger.yahoo.com
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: PorterStemfilter

2004-09-14 Thread Pete Lewis
Hi George

There are lots of problems with Porter stemmers: they are not great for English, and get
worse for other languages.

If you look at:

http://snowball.tartarus.org/demo.php

You'll see the Snowball demo - this is basically another instance of Porter.

If you enter "print" and "printer" and submit, then the results will be
"print" and "printer" - hence showing that the Porter-stemmed versions are
the same as the originals.  Therefore they are both distinct terms in their
own right and searches on one will not hit the other.
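The index/query symmetry behind this can be shown with a toy sketch; the stem values below are hard-coded to mirror what the demo reports for these words, standing in for running the real filter:

```java
import java.util.HashMap;
import java.util.Map;

public class StemMatchDemo {

    // Hard-coded Porter stems for a few words (as the online demo reports
    // them); a stand-in for invoking the real PorterStemFilter.
    static final Map<String, String> STEMS = new HashMap<String, String>();
    static {
        STEMS.put("print", "print");
        STEMS.put("prints", "print");
        STEMS.put("printing", "print");
        STEMS.put("printer", "printer"); // NOT reduced to "print"
    }

    // Both indexed terms and query terms pass through the same stemmer,
    // so a query hits a term only when the two stems are equal.
    static boolean matches(String queryWord, String indexedWord) {
        return STEMS.get(queryWord).equals(STEMS.get(indexedWord));
    }

    public static void main(String[] args) {
        System.out.println(matches("printing", "prints")); // true
        System.out.println(matches("printer", "print"));   // false
    }
}
```

So "printing" and "prints" find each other (both stem to "print"), but "printer" stays its own term.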

Cheers

Pete Lewis

- Original Message - 
From: "Honey George" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, September 14, 2004 6:57 PM
Subject: PorterStemfilter


> Hi,
>  This might be more of a question related to the
> PorterStemmer algorithm rather than to lucene, but
> if anyone has the knowledge please share.
>
> I am using the PorterStemFilter that comes with lucene
> and it turns out that searching for the word 'printer'
> does not return a document containing the text
> 'print'. To narrow down the problem, I have tested the
> PorterStemFilter in a standalone program and it turns
> out that the stem of 'printer' is 'printer' and not
> 'print'. That is, 'printer' is not treated as 'print' +
> 'er'; the whole word is the stem. Can somebody
> explain the behavior?
>
> Thanks & Regards,
>George
>
>
>
>
>
> ___ALL-NEW Yahoo!
Messenger - all new features - even more fun!  http://uk.messenger.yahoo.com
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: PorterStemfilter

2004-09-14 Thread David Spencer
Honey George wrote:
Hi,
 This might be more of a question related to the
PorterStemmer algorithm rather than to lucene, but
if anyone has the knowledge please share.
You might want to also try the Snowball stemmer:
http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/
And KStem:
http://ciir.cs.umass.edu/downloads/
I am using the PorterStemFilter that comes with lucene
and it turns out that searching for the word 'printer'
does not return a document containing the text
'print'. To narrow down the problem, I have tested the
PorterStemFilter in a standalone program and it turns
out that the stem of 'printer' is 'printer' and not
'print'. That is, 'printer' is not treated as 'print' +
'er'; the whole word is the stem. Can somebody
explain the behavior?
Thanks & Regards,
   George



___ALL-NEW Yahoo! Messenger - 
all new features - even more fun!  http://uk.messenger.yahoo.com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Help for text based indexing

2004-09-14 Thread Honey George
You could receive the group name as an input from the
user and construct a BooleanQuery internally which
will query only the group field based on the user
input. So the user need not append the group name to
the search string.
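At the query-string level that suggestion can be sketched as follows (a minimal illustration; in real code you would more likely add a TermQuery on the group field plus the parsed user query to a BooleanQuery):

```java
public class GroupQueryBuilder {

    // Prepend the group clause so the user only ever types the search terms.
    static String buildQuery(String group, String userQuery) {
        return "group:" + group + " AND (" + userQuery + ")";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("Group1", "Hello"));
        // group:Group1 AND (Hello)
    }
}
```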

Thanks,
   George
 --- mahaveer jain <[EMAIL PROTECTED]> wrote: 
> If I have understood rightly, you mean to say that
> the query for the search has to be
>  
> "Group1" AND "Hello" (if hello is what I want to
> search ?)
>  
> Cocula Remi <[EMAIL PROTECTED]> wrote:
> A keyword is not tokenized, which is why you won't be
> able to search over a part of it. You'd rather use a
> Text field.
> 
> About creating a special field : 
> 
> IndexWriter Ir = 
> 
> File f = 
> Document doc = new Document();
> if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group1"))
> {
> doc.add(Field.Text("group", "Group1"));
> }
> if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group2"))
> {
> doc.add(Field.Text("group", "Group2"));
> }
> doc.add(Field.Text("content", getContent(f)));
> Ir.addDocument(doc);
> 
> 
> 
> Then you can search in group1 with query like that :
> 
> 
> group:Group1 AND rest_of_the_query.
> 
> 
> 
> -Message d'origine-
> De : mahaveer jain [mailto:[EMAIL PROTECTED]
> Envoyé : mardi 14 septembre 2004 18:03
> À : Lucene Users List
> Objet : RE: Help for text based indexing
> 
> 
> Well in my case the path is KeyWord. I had tried
> that earlier and it does not seem to work in a
> single index file.
> 
> Can you explain a bit more about adding group1 and
> group2 ?
> 
> Cocula Remi wrote:
> Well you could add a field to each of your Documents
> whose value would be either "group1" or "group2".
> Or you could use the path to your files ...
> 
> 
> 
> -Message d'origine-
> De : mahaveer jain [mailto:[EMAIL PROTECTED]
> Envoyé : mardi 14 septembre 2004 17:49
> À : [EMAIL PROTECTED]
> Objet : RE: Help for text based indexing
> 
> 
> I am clear with looping recursively to index all the
> files under the Root folder.
> But the problem is if I want to search only in
> group1 or group2. Is it possible to search in only
> one of the group folders?
> 
> Cocula Remi wrote:
> You just have to loop recursively over the
> C:\tomcat\webapps\Root tree to create your index.
> Yes you can index databases; you will just have to
> write a mechanism that is able to create
> org.apache.lucene.document.Document from database.
> For instance : 
> - connect JDBC
> - run a query for obtaining a ResultSet
> - loop for each row of that ResultSet :
> Create a new org.apache.lucene.document.Document
> from ResultSet data
> and add this document to the Index.
> end loop.
> 
> For incremental indexing, I suppose you have to
> store some timestamp field in your index; but it's
> up to you.
> Note that Lucene is very fast and I don't think that
> incremental indexing is required for small or medium
> amounts of data.
> 
> 
> 
> -Message d'origine-
> De : mahaveer jain [mailto:[EMAIL PROTECTED]
> Envoyé : mardi 14 septembre 2004 17:22
> À : [EMAIL PROTECTED]
> Objet : Help for text based indexing
> 
> 
> 
> Hi
> 
> I have implemented Text based search using lucene. I
> had a wonderful time playing around with it.
> 
> Now I want to enhance the application.
> 
> I have a Root folder, under that I have many other
> folders that are group specific, say (group1,
> group2, .. so on). The Root folder is in
> C:\tomcat\webapps\Root and the group folders within that.
> 
> Now I am indexing these groups separately, i.e., I
> have indexes at C:/index/group1, C:/index/group2,
> C:/index/group3 and so on.
> 
> I want to know if I can have only one index for all
> these, say C:/index/Root (holding the index for all
> the folders), and be able to search using
> C:\tomcat\webapps\Root\group1 (if I want to search for
> group1) and similarly for the other groups.
> 
> Let me know if this is possible and whether anybody
> has tried this.
> 
> 2nd question
> 
> Is lucene good to index databases ? How do we
> support incremental indexing ?
> 
> (Right now I am using LIKE for searching )
> 
> Thanks in Advance
> 
> Mahaveer
> 
> 
> 
> -
> Do you Yahoo!?
> vote.yahoo.com - Register online to vote today!
> 
>
-
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 
> 
> 
> -
> Do you Yahoo!?
> vote.yahoo.com - Register online to vote today!
> 
>
-
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 
> 
> 
> -
> Do you Yahoo!?
> vote.yahoo.com - Register online to vote today!
> 
>
-
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 
> 
>   
> -

PorterStemfilter

2004-09-14 Thread Honey George
Hi,
 This might be more of a question related to the
PorterStemmer algorithm rather than to lucene, but
if anyone has the knowledge please share.

I am using the PorterStemFilter that comes with lucene
and it turns out that searching for the word 'printer'
does not return a document containing the text
'print'. To narrow down the problem, I have tested the
PorterStemFilter in a standalone program and it turns
out that the stem of 'printer' is 'printer' and not
'print'. That is, 'printer' is not treated as 'print' +
'er'; the whole word is the stem. Can somebody
explain the behavior?

Thanks & Regards,
   George





___ALL-NEW Yahoo! Messenger - 
all new features - even more fun!  http://uk.messenger.yahoo.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Help for text based indexing

2004-09-14 Thread mahaveer jain
If I have understood rightly, you mean to say that the query for the search has to be 
 
"Group1" AND "Hello" (if hello is what I want to search ?)
 
Cocula Remi <[EMAIL PROTECTED]> wrote:
A keyword is not tokenized, which is why you won't be able to search over a part of it. 
You'd rather use a Text field.

About creating a special field : 

IndexWriter Ir = 

File f = 
Document doc = new Document();
if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group1"))
{
doc.add(Field.Text("group", "Group1"));
}
if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group2"))
{
doc.add(Field.Text("group", "Group2"));
}
doc.add(Field.Text("content", getContent(f)));
Ir.addDocument(doc);



Then you can search in group1 with query like that : 

group:Group1 AND rest_of_the_query.



-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 18:03
À : Lucene Users List
Objet : RE: Help for text based indexing


Well in my case the path is KeyWord. I had tried that earlier and it does not seem to 
work in a single index file.

Can you explain a bit more about adding group1 and group2 ?

Cocula Remi wrote:
Well you could add a field to each of your Documents whose value would be either 
"group1" or "group2".
Or you could use the path to your files ...



-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:49
À : [EMAIL PROTECTED]
Objet : RE: Help for text based indexing


I am clear with looping recursively to index all the files under the Root folder.
But the problem is if I want to search only in group1 or group2. Is it possible to 
search in only one of the group folders?

Cocula Remi wrote:
You just have to loop recursively over the C:\tomcat\webapps\Root tree to create your 
index.
Yes you can index databases; you will just have to write a mechanism that is able to 
create org.apache.lucene.document.Document from database.
For instance : 
- connect JDBC
- run a query for obtaining a ResultSet
- loop for each row of that ResultSet :
Create a new org.apache.lucene.document.Document from ResultSet data
and add this document to the Index.
end loop.

For incremental indexing, I suppose you have to store some timestamp field in your 
index; but it's up to you.
Note that Lucene is very fast and I don't think that incremental indexing is required 
for small or medium amounts of data.



-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:22
À : [EMAIL PROTECTED]
Objet : Help for text based indexing



Hi

I have implemented Text based search using lucene. I had a wonderful time playing around 
with it.

Now I want to enhance the application.

I have a Root folder, under that I have many other folders that are group specific, 
say (group1, group2, .. so on). The Root folder is in C:\tomcat\webapps\Root and the 
group folders within that.

Now I am indexing these groups separately, i.e., I have indexes at C:/index/group1, 
C:/index/group2, C:/index/group3 and so on.

I want to know if I can have only one index for all these, say C:/index/Root (holding 
the index for all the folders), and be able to search using 
C:\tomcat\webapps\Root\group1 (if I want to search for group1) and similarly for the 
other groups.

Let me know if this is possible and whether anybody has tried this.

2nd question

Is lucene good to index databases ? How do we support incremental indexing ?

(Right now I am using LIKE for searching )

Thanks in Advance

Mahaveer



-
Do you Yahoo!?
vote.yahoo.com - Register online to vote today!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
Do you Yahoo!?
vote.yahoo.com - Register online to vote today!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
Do you Yahoo!?
vote.yahoo.com - Register online to vote today!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
Do you Yahoo!?
Yahoo! Mail - 50x more storage than other providers!

RE: Help for text based indexing

2004-09-14 Thread Cocula Remi
A keyword is not tokenized, which is why you won't be able to search over a part of it. 
You'd rather use a Text field.

About creating a special field  : 

IndexWriter Ir = 

File f = 
Document  doc = new Document();
if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group1"))
{
doc.add(Field.Text("group", "Group1"));
}
if (f.toString().startsWith("C:\\tomcat\\webapps\\Root\\Group2"))
{
doc.add(Field.Text("group", "Group2"));
}
doc.add(Field.Text("content", getContent(f)));
Ir.addDocument(doc);



Then you can search in group1 with query like that : 

 group:Group1 AND rest_of_the_query.



-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 18:03
À : Lucene Users List
Objet : RE: Help for text based indexing


Well in my case the path is KeyWord. I had tried that earlier and it does not seem to 
work in a single index file.
 
Can you explain a bit more about adding group1 and group2 ?
 
Cocula Remi <[EMAIL PROTECTED]> wrote:
Well you could add a field to each of your Documents whose value would be either 
"group1" or "group2".
Or you could use the path to your files ...



-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:49
À : [EMAIL PROTECTED]
Objet : RE: Help for text based indexing


I am clear with looping recursively to index all the files under the Root folder.
But the problem is if I want to search only in group1 or group2. Is it possible to 
search in only one of the group folders?

Cocula Remi wrote:
You just have to loop recursively over the C:\tomcat\webapps\Root tree to create your 
index.
Yes you can index databases; you will just have to write a mechanism that is able to 
create org.apache.lucene.document.Document from database.
For instance : 
- connect JDBC
- run a query for obtaining a ResultSet
- loop for each row of that ResultSet :
Create a new org.apache.lucene.document.Document from ResultSet data
and add this document to the Index.
end loop.

For incremental indexing, I suppose you have to store some timestamp field in your 
index; but it's up to you.
Note that Lucene is very fast and I don't think that incremental indexing is required 
for small or medium amounts of data.



-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:22
À : [EMAIL PROTECTED]
Objet : Help for text based indexing



Hi

I have implemented Text based search using lucene. I had a wonderful time playing around 
with it.

Now I want to enhance the application.

I have a Root folder, under that I have many other folders that are group specific, 
say (group1, group2, .. so on). The Root folder is in C:\tomcat\webapps\Root and the 
group folders within that.

Now I am indexing these groups separately, i.e., I have indexes at C:/index/group1, 
C:/index/group2, C:/index/group3 and so on.

I want to know if I can have only one index for all these, say C:/index/Root (holding 
the index for all the folders), and be able to search using 
C:\tomcat\webapps\Root\group1 (if I want to search for group1) and similarly for the 
other groups.

Let me know if this is possible and whether anybody has tried this.

2nd question

Is lucene good to index databases ? How do we support incremental indexing ?

(Right now I am using LIKE for searching )

Thanks in Advance

Mahaveer



-
Do you Yahoo!?
vote.yahoo.com - Register online to vote today!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
Do you Yahoo!?
vote.yahoo.com - Register online to vote today!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
Do you Yahoo!?
vote.yahoo.com - Register online to vote today!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: ANT +BUILD + LUCENE

2004-09-14 Thread Gerard Sychay
Hi,

I've used the following Ant targets for build scripts that required
platform-dependent work. In the example here, the property
"catalina.home" is set according to what platform we're running on. You
can adapt as needed.

[the <condition> targets were stripped by the mail archiver]
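The targets themselves did not survive the archiving; a rough reconstruction of the usual `<condition>`/`<os>` pattern (property values and paths here are illustrative, not the originals from this mail) would be:

```xml
<!-- Illustrative reconstruction; paths are examples, not the originals. -->
<condition property="catalina.home" value="C:/tomcat">
  <os family="windows"/>
</condition>
<condition property="catalina.home" value="/usr/local/tomcat">
  <os family="unix"/>
</condition>
```

The first `<condition>` whose nested `<os>` matches sets the property; later conditions are no-ops because Ant properties are immutable.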

>>> "Karthik N S" <[EMAIL PROTECTED]> 09/13/04 10:34PM >>>
Hi

  Erik


   1) Using Ant and Build.xml I want to run the
org.apache.lucene.demo.IndexFiles to create an Indexfolder

   2) Problem is The same Build.xml is to be used Across the O/s for
creating Index

   3) The path of Lucene1-4-final.jar  are in respective directories
for the
O/s...

[ Note :- The Path of Lucene_home,I/P and O/p directories are
also
O/s Specific should be in the Build.xml  and
should be triggered by something of this type


[the example XML was stripped by the mail archiver]
I hope u get the situation. :{


With regards
Karthik



-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 14, 2004 7:37 PM
To: Lucene Users List
Subject: Re: ANT +BUILD + LUCENE


I'm not following what you want very clearly, but there is an <index> 
task in Lucene's Sandbox.

Please post what you are trying, and I'd be happy to help once I see
the details.

Erik

On Sep 12, 2004, at 4:44 PM, Karthik N S wrote:

> Hi
>
> Guys
>
>
> Apologies..
>
>
> The Task for me is to build the Index folder using Lucene &  a
simple
> Build.xml  for ANT
>
> The Problem .. Same 'Build.xml'  should be used for different
> O/s...
> [ Win / Linux ]
>
> The glitch is  respective jar files such as Lucene-1.4 .jar & other
jar
> files are not in same dir for the O/s.
> Also the  I/p , O/p Indexer path for source/target may also vary.
>
>
> Please Somebody Help me. :(
>
>
>
> with regards
> Karthik
>
>
>
>
>   WITH WARM REGARDS
>   HAVE A NICE DAY
>   [ N.S.KARTHIK]
>
>
>
>
>
-
> To unsubscribe, e-mail: [EMAIL PROTECTED] 
> For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED] 
For additional commands, e-mail: [EMAIL PROTECTED] 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Help for text based indexing

2004-09-14 Thread mahaveer jain
Well in my case the path is KeyWord. I had tried that earlier and it does not seem to 
work in a single index file.
 
Can you explain a bit more about adding group1 and group2 ?
 
Cocula Remi <[EMAIL PROTECTED]> wrote:
Well you could add a field to each of your Documents whose value would be either 
"group1" or "group2".
Or you could use the path to your files ...



-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:49
À : [EMAIL PROTECTED]
Objet : RE: Help for text based indexing


I am clear with looping recursively to index all the files under the Root folder.
But the problem is if I want to search only in group1 or group2. Is it possible to 
search in only one of the group folders?

Cocula Remi wrote:
You just have to loop recursively over the C:\tomcat\webapps\Root tree to create your 
index.
Yes you can index databases; you will just have to write a mechanism that is able to 
create org.apache.lucene.document.Document from database.
For instance : 
- connect JDBC
- run a query for obtaining a ResultSet
- loop for each row of that ResultSet :
Create a new org.apache.lucene.document.Document from ResultSet data
and add this document to the Index.
end loop.

For incremental indexing, I suppose you have to store some timestamp field in your 
index; but it's up to you.
Note that Lucene is very fast and I don't think that incremental indexing is required 
for small or medium amounts of data.



-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:22
À : [EMAIL PROTECTED]
Objet : Help for text based indexing



Hi

I have implemented Text based search using lucene. I had a wonderful time playing around 
with it.

Now I want to enhance the application.

I have a Root folder, under that I have many other folders that are group specific, 
say (group1, group2, .. so on). The Root folder is in C:\tomcat\webapps\Root and the 
group folders within that.

Now I am indexing these groups separately, i.e., I have indexes at C:/index/group1, 
C:/index/group2, C:/index/group3 and so on.

I want to know if I can have only one index for all these, say C:/index/Root (holding 
the index for all the folders), and be able to search using 
C:\tomcat\webapps\Root\group1 (if I want to search for group1) and similarly for the 
other groups.

Let me know if this is possible and whether anybody has tried this.

2nd question

Is lucene good to index databases ? How do we support incremental indexing ?

(Right now I am using LIKE for searching )

Thanks in Advance

Mahaveer



-
Do you Yahoo!?
vote.yahoo.com - Register online to vote today!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
Do you Yahoo!?
vote.yahoo.com - Register online to vote today!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
Do you Yahoo!?
vote.yahoo.com - Register online to vote today!

RE: Help for text based indexing

2004-09-14 Thread Cocula Remi
Well you could add a field to each of your Documents whose value would be either 
"group1" or "group2".
Or you could use the path to your files ...



-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:49
À : [EMAIL PROTECTED]
Objet : RE: Help for text based indexing


I am clear with looping recursively to index all the files under the Root folder.
But the problem is if I want to search only in group1 or group2. Is it possible to 
search in only one of the group folders?
 
Cocula Remi <[EMAIL PROTECTED]> wrote:
You just have to loop recursively over the C:\tomcat\webapps\Root tree to create your 
index.
Yes you can index databases; you will just have to write a mechanism that is able to 
create org.apache.lucene.document.Document from database.
For instance : 
- connect JDBC
- run a query for obtaining a ResultSet
- loop for each row of that ResultSet :
Create a new org.apache.lucene.document.Document from ResultSet data
and add this document to the Index.
end loop.

For incremental indexing, I suppose you have to store some timestamp field in your 
index; but it's up to you.
Note that Lucene is very fast and I don't think that incremental indexing is required 
for small or medium amounts of data.



-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:22
À : [EMAIL PROTECTED]
Objet : Help for text based indexing



Hi

I have implemented Text based search using lucene. I had a wonderful time playing around 
with it.

Now I want to enhance the application.

I have a Root folder, under that I have many other folders that are group specific, 
say (group1, group2, .. so on). The Root folder is in C:\tomcat\webapps\Root and the 
group folders within that.

Now I am indexing these groups separately, i.e., I have indexes at C:/index/group1, 
C:/index/group2, C:/index/group3 and so on.

I want to know if I can have only one index for all these, say C:/index/Root (holding 
the index for all the folders), and be able to search using 
C:\tomcat\webapps\Root\group1 (if I want to search for group1) and similarly for the 
other groups.

Let me know if this is possible and whether anybody has tried this.

2nd question

Is lucene good to index databases ? How do we support incremental indexing ?

(Right now I am using LIKE for searching )

Thanks in Advance

Mahaveer



-
Do you Yahoo!?
vote.yahoo.com - Register online to vote today!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
Do you Yahoo!?
vote.yahoo.com - Register online to vote today!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Help for text based indexing

2004-09-14 Thread mahaveer jain
I am clear with looping recursively to index all the files under the Root folder.
But the problem is if I want to search only in group1 or group2. Is it possible to 
search in only one of the group folders?
 
Cocula Remi <[EMAIL PROTECTED]> wrote:
You just have to loop recursively over the C:\tomcat\webapps\Root tree to create your 
index.
Yes you can index databases; you will just have to write a mechanism that is able to 
create org.apache.lucene.document.Document from database.
For instance : 
- connect JDBC
- run a query for obtaining a ResultSet
- loop for each row of that ResultSet :
Create a new org.apache.lucene.document.Document from ResultSet data
and add this document to the Index.
end loop.

For incremental indexing, I suppose you have to store some timestamp field in your 
index; but it's up to you.
Note that Lucene is very fast and I don't think that incremental indexing is required 
for small or medium amounts of data.



-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:22
À : [EMAIL PROTECTED]
Objet : Help for text based indexing



Hi

I have implemented Text based search using lucene. I had a wonderful time playing around 
with it.

Now I want to enhance the application.

I have a Root folder, under that I have many other folders that are group specific, 
say (group1, group2, .. so on). The Root folder is in C:\tomcat\webapps\Root and the 
group folders within that.

Now I am indexing these groups separately, i.e., I have indexes at C:/index/group1, 
C:/index/group2, C:/index/group3 and so on.

I want to know if I can have only one index for all these, say C:/index/Root (holding 
the index for all the folders), and be able to search using 
C:\tomcat\webapps\Root\group1 (if I want to search for group1) and similarly for the 
other groups.

Let me know if this is possible and whether anybody has tried this.

2nd question

Is lucene good to index databases ? How do we support incremental indexing ?

(Right now I am using LIKE for searching )

Thanks in Advance

Mahaveer



-
Do you Yahoo!?
vote.yahoo.com - Register online to vote today!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
Do you Yahoo!?
vote.yahoo.com - Register online to vote today!

RE: Help for text based indexing

2004-09-14 Thread Cocula Remi
You just have to loop recursively over the C:\tomcat\webapps\Root tree to create your 
index.
Yes you can index databases; you will just have to write a mechanism that is able to 
create org.apache.lucene.document.Document from database.
For instance : 
- connect JDBC
- run a query for obtaining a ResultSet
- loop for each row of that ResultSet :
Create a new org.apache.lucene.document.Document from ResultSet data
and add this document to the Index.
end loop.

For incremental indexing, I suppose you have to store some timestamp field in your 
index; but it's up to you.
Note that Lucene is very fast and I don't think that incremental indexing is required 
for small or medium amounts of data.
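That row-to-Document loop can be sketched with plain collections standing in for the JDBC ResultSet and the Lucene Document (names here are illustrative, not real Lucene API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DbIndexLoopSketch {

    // Stand-in for building an org.apache.lucene.document.Document
    // from one row of a ResultSet.
    static Map<String, String> toDocument(Map<String, String> row) {
        Map<String, String> doc = new HashMap<String, String>(row);
        // A stored timestamp lets a later incremental pass skip
        // rows unchanged since the previous indexing run.
        doc.put("indexed_at", "20040914");
        return doc;
    }

    public static void main(String[] args) {
        // Pretend result set with one row.
        List<Map<String, String>> resultSet = new ArrayList<Map<String, String>>();
        Map<String, String> row = new HashMap<String, String>();
        row.put("id", "1");
        row.put("content", "hello world");
        resultSet.add(row);

        // "loop for each row of that ResultSet"
        List<Map<String, String>> index = new ArrayList<Map<String, String>>();
        for (Map<String, String> r : resultSet) {
            index.add(toDocument(r)); // Ir.addDocument(doc)
        }
        System.out.println(index.size());
    }
}
```

In real code the loop body would read columns via ResultSet getters and add them as Lucene Fields instead of map entries.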



-Message d'origine-
De : mahaveer jain [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 17:22
À : [EMAIL PROTECTED]
Objet : Help for text based indexing



Hi

I have implemented Text based search using lucene. I had a wonderful time playing around 
with it.

Now I want to enhance the application.

I have a Root folder, under that I have many other folders that are group specific, 
say (group1, group2, .. so on). The Root folder is in C:\tomcat\webapps\Root and the 
group folders within that.

Now I am indexing these groups separately, i.e., I have indexes at C:/index/group1, 
C:/index/group2, C:/index/group3 and so on.

I want to know if I can have only one index for all these, say C:/index/Root (holding 
the index for all the folders), and be able to search using 
C:\tomcat\webapps\Root\group1 (if I want to search for group1) and similarly for the 
other groups.

Let me know if this is possible and whether anybody has tried this.

2nd question

Is lucene good to index databases ? How do we support incremental indexing ?

(Right now I am using LIKE for searching )

Thanks in Advance

Mahaveer



-
Do you Yahoo!?
vote.yahoo.com - Register online to vote today!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Help for text based indexing

2004-09-14 Thread mahaveer jain

Hi

I have implemented Text based search using lucene. I was wonderful playing around with 
it.

Now I want to enchance the application.

I have a Root folder, and under it many other folders that are group specific, 
say (group1, group2, and so on). The Root folder is in C:\tomcat\webapps\Root, with the 
group folders inside it.

Right now I index these groups separately, i.e., I have indexes at C:/index/group1, 
C:/index/group2, C:/index/group3, and so on.

I want to know if I can have only one index for all of these, say C:/index/Root (an 
index covering all the folders), and still be able to search scoped to 
C:\tomcat\webapps\Root\group1 (if I want to search only group1), and similarly for the 
other groups.

Let me know if this is possible and whether anybody has tried it.

2nd question:

Is Lucene good for indexing databases? How do we support incremental indexing?

(Right now I am using SQL LIKE for searching.)

Thanks in Advance

Mahaveer




Re: ANT +BUILD + LUCENE

2004-09-14 Thread Erik Hatcher
Karthik,
You are still being a bit cryptic and making it hard for me to 
comprehend what the problem is, but here are some general pieces of 
advice with Ant related to what I think you are doing:

* There is no need to use conditional logic to have a different set of 
properties for different operating systems.  There is an implicit and 
declarative way to do this:

 
But whitespace gets in the way, so you could use the ant-contrib 
 (http://ant-contrib.sourceforge.net/tasks/index.html) which 
would be cleaner than the value of ${os.name}.

* Using IndexFiles from the demo is awkward, to me.  Why not give the 
sandbox  task a try?

* Ant has a  task that might be handy for you.
Please post how you are using  (I can only presume), if that is 
the issue.

Erik
On Sep 13, 2004, at 10:34 PM, Karthik N S wrote:
Hi
  Erik
   1) Using Ant and Build.xml I want to run the
org.apache.lucene.demo.IndexFiles to create an Indexfolder
   2) Problem is The same Build.xml is to be used Across the O/s for
creating Index
   3) The path of Lucene1-4-final.jar  are in respective directories 
for the
O/s...

[ Note :- The Path of Lucene_home,I/P and O/p directories are 
also
O/s specific, should be in the Build.xml, and
should be triggered by something of this type

 
  
  
   or

  

I hope you get the situation. :{
With regards
Karthik

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 7:37 PM
To: Lucene Users List
Subject: Re: ANT +BUILD + LUCENE
I'm not following what you want very clearly, but there is an 
task in Lucene's Sandbox.
Please post what you are trying, and I'd be happy to help once I see
the details.
Erik
On Sep 12, 2004, at 4:44 PM, Karthik N S wrote:
Hi
Guys
Apologies..
The Task for me is to build the Index folder using Lucene &  a simple
Build.xml  for ANT
The problem: the same 'Build.xml' should be used for different
O/s...
[ Win / Linux ]
The glitch is  respective jar files such as Lucene-1.4 .jar & other 
jar
files are not in same dir for the O/s.
Also the  I/p , O/p Indexer path for source/target may also vary.

Please Somebody Help me. :(

with regards
Karthik

  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]



Re: Search PharseQuery

2004-09-14 Thread sergiu gordea
Natarajan.T wrote:
Ok you are correct ...
Suppose I type "what java"; then how can I handle it...
 

You don't have to handle it; Lucene does. If you don't like how 
Lucene handles it, then you may extend
the functionality.

If you use the same analyzer for indexing and searching then you will 
find the results with both search strings:

"what java" and "what is java".
At least I obtain them in both cases. 
That's right: you will obtain 
"what java" if you search for "what is java"; in my case that is acceptable.

If it is not acceptable in your project, I suggest trying to create a new Analyzer.
 I wish you luck,
 Sergiu
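The stop-word behavior being discussed can be reproduced with plain string handling. A sketch only: the stop list here is a tiny assumed subset, not Lucene's actual list, and the class name is hypothetical:

```java
import java.util.*;

// Sketch of why "what is java" and "what java" end up matching the same
// documents: stop words are dropped before terms reach the index or the
// query. STOP is a tiny assumed subset of a typical English stop list.
public class StopWordSketch {
    static final Set<String> STOP = new HashSet<>(Arrays.asList("is", "a", "the", "in"));

    public static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty() && !STOP.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }
}
```

Because both phrases reduce to the same token sequence, a query built with the same analyzer matches the same documents for either input.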



Regards,
Natarajan.
-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 14, 2004 7:38 PM
To: Lucene Users List
Subject: Re: Search PharseQuery

Natarajan.T wrote:
 

Hi,
Thanks for your response.
For example search keyword is like below...
Language "what is java"
Token 1:  language
Token 2: what is java(like google)
Regards,
Natarajan.

   

Lucene works exactly as you describe above, with a simple correction:
the analyzer has a list of stop words, and I bet "is" is one of
them for your analyzer.
I don't mind this right now, so I won't dig for a solution to
this problem, but the solution
should be sought around the Analyzer classes.
All the best,
 Sergiu

 

-Original Message-
From: Aad Nales [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 14, 2004 5:19 PM
To: 'Lucene Users List'
Subject: RE: Search PharseQuery

Hi,
Not sure if this is what you need but I created a lastname filter which
in Dutch means potential double last names like:"van der Vaart". In
order to process these I created a finite state machine that queried
these last names. Since I only needed the filter at index time and I
never use it for querying, this may not be what you are looking for.
It should be simple to index 'what is java' as a single token and to
search for that same token. However, you will need to create a list of
accepted 'tokens'. If this is what you need let me know, I will make
   

the
 

code available...
cheers,
Aad Nales
-Original Message-
From: Honey George [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 14 September, 2004 13:39
To: Lucene Users List
Subject: Re: Search PharseQuery

--- "Natarajan.T" <[EMAIL PROTECTED]>
wrote: 

   

Hi All,

How do I implement PharseQuery API?
  

 

What exactly do you mean by implement? Are you trying to
extend the current behavior or only trying to find out
the usage?
Thanks,
George






RE: ANT +BUILD + LUCENE

2004-09-14 Thread Karthik N S
Hi

  Erik


   1) Using Ant and Build.xml I want to run the
org.apache.lucene.demo.IndexFiles to create an Indexfolder

   2) Problem is The same Build.xml is to be used Across the O/s for
creating Index

   3) The path of Lucene1-4-final.jar  are in respective directories for the
O/s...

[ Note :- The Path of Lucene_home,I/P and O/p directories are also
O/s specific, should be in the Build.xml, and
should be triggered by something of this type


 
  
  

   or


  



I hope you get the situation. :{


With regards
Karthik



-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 7:37 PM
To: Lucene Users List
Subject: Re: ANT +BUILD + LUCENE


I'm not following what you want very clearly, but there is an 
task in Lucene's Sandbox.

Please post what you are trying, and I'd be happy to help once I see
the details.

Erik

On Sep 12, 2004, at 4:44 PM, Karthik N S wrote:

> Hi
>
> Guys
>
>
> Apologies..
>
>
> The Task for me is to build the Index folder using Lucene &  a simple
> Build.xml  for ANT
>
> The problem: the same 'Build.xml' should be used for different
> O/s...
> [ Win / Linux ]
>
> The glitch is  respective jar files such as Lucene-1.4 .jar & other jar
> files are not in same dir for the O/s.
> Also the  I/p , O/p Indexer path for source/target may also vary.
>
>
> Please Somebody Help me. :(
>
>
>
> with regards
> Karthik
>
>
>
>
>   WITH WARM REGARDS
>   HAVE A NICE DAY
>   [ N.S.KARTHIK]
>
>
>
>



RE: Search PharseQuery

2004-09-14 Thread Natarajan.T
Ok you are correct ...


Suppose I type "what java"; then how can I handle it...

Regards,
Natarajan.

-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 14, 2004 7:38 PM
To: Lucene Users List
Subject: Re: Search PharseQuery

Natarajan.T wrote:

>Hi,
>
>Thanks for your response.
>
>For example search keyword is like below...
>
>Language "what is java"
>
>Token 1:  language
>Token 2: what is java(like google)
>
>
>Regards,
>Natarajan.
>
>  
>
Lucene works exactly as you describe above, with a simple correction:
the analyzer has a list of stop words, and I bet "is" is one of
them for your analyzer.
I don't mind this right now, so I won't dig for a solution to
this problem, but the solution
should be sought around the Analyzer classes.
 All the best,

  Sergiu




>
>
>
>-Original Message-
>From: Aad Nales [mailto:[EMAIL PROTECTED] 
>Sent: Tuesday, September 14, 2004 5:19 PM
>To: 'Lucene Users List'
>Subject: RE: Search PharseQuery
>
>Hi,
>
>Not sure if this is what you need but I created a lastname filter which
>in Dutch means potential double last names like:"van der Vaart". In
>order to process these I created a finite state machine that queried
>these last names. Since I only needed the filter at index time and I
>never use it for querying, this may not be what you are looking for.
>It should be simple to index 'what is java' as a single token and to
>search for that same token. However, you will need to create a list of
>accepted 'tokens'. If this is what you need let me know, I will make
the
>code available...
>
>cheers,
>Aad Nales
>
>-Original Message-
>From: Honey George [mailto:[EMAIL PROTECTED] 
>Sent: Tuesday, 14 September, 2004 13:39
>To: Lucene Users List
>Subject: Re: Search PharseQuery
>
>
> --- "Natarajan.T" <[EMAIL PROTECTED]>
>wrote: 
>  
>
>>Hi All,
>>
>> 
>>
>>How do I implement PharseQuery API?
>>
>>
>
>What exactly do you mean by implement? Are you trying to
>extend the current behavior or only trying to find out
>the usage?
>Thanks,
>  George
>
>
>



Re: Addition to contributions page

2004-09-14 Thread Erik Hatcher
Perhaps we should @deprecate the contributions page like we did with 
the Powered By page, and migrate it to the wiki?

Erik
On Sep 13, 2004, at 6:50 PM, Daniel Naber wrote:
On Friday 10 September 2004 15:48, Chas Emerick wrote:
PDFTextStream should be added to the 'Document Converters' section,
with this URL < http://snowtide.com >, and perhaps this heading:
'PDFTextStream -- PDF text and metadata extraction'.  The 'Author'
field should probably be left blank, since there's no single creator.
I just added it.
Regards
 Daniel
--
http://www.danielnaber.de


Re: ANT +BUILD + LUCENE

2004-09-14 Thread Erik Hatcher
I'm not following what you want very clearly, but there is an  
task in Lucene's Sandbox.

Please post what you are trying, and I'd be happy to help once I see 
the details.

Erik
On Sep 12, 2004, at 4:44 PM, Karthik N S wrote:
Hi
Guys
Apologies..
The Task for me is to build the Index folder using Lucene &  a simple
Build.xml  for ANT
The problem: the same 'Build.xml' should be used for different 
O/s...
[ Win / Linux ]

The glitch is  respective jar files such as Lucene-1.4 .jar & other jar
files are not in same dir for the O/s.
Also the  I/p , O/p Indexer path for source/target may also vary.
Please Somebody Help me. :(

with regards
Karthik

  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]



Re: Search PharseQuery

2004-09-14 Thread sergiu gordea
Natarajan.T wrote:
Hi,
Thanks for your response.
For example search keyword is like below...
Language "what is java"
Token 1:  language
Token 2: what is java(like google)
Regards,
Natarajan.
 

Lucene works exactly as you describe above, with a simple correction:
the analyzer has a list of stop words, and I bet "is" is one of
them for your analyzer.
I don't mind this right now, so I won't dig for a solution to
this problem, but the solution
should be sought around the Analyzer classes.
All the best,

 Sergiu



-Original Message-
From: Aad Nales [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 14, 2004 5:19 PM
To: 'Lucene Users List'
Subject: RE: Search PharseQuery

Hi,
Not sure if this is what you need but I created a lastname filter which
in Dutch means potential double last names like:"van der Vaart". In
order to process these I created a finite state machine that queried
these last names. Since I only needed the filter at index time and I
never use it for querying, this may not be what you are looking for.
It should be simple to index 'what is java' as a single token and to
search for that same token. However, you will need to create a list of
accepted 'tokens'. If this is what you need let me know, I will make the
code available...
cheers,
Aad Nales
-Original Message-
From: Honey George [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 14 September, 2004 13:39
To: Lucene Users List
Subject: Re: Search PharseQuery

--- "Natarajan.T" <[EMAIL PROTECTED]>
wrote: 
 

Hi All,

How do I implement PharseQuery API?
   

What exactly do you mean by implement? Are you trying to
extend the current behavior or only trying to find out
the usage?
Thanks,
 George






Indexing object graphs

2004-09-14 Thread Erik Hatcher
Interesting!
http://kasparov.skife.org/blog/2004/09/13#lucene-graphs


RE: Search PharseQuery

2004-09-14 Thread Natarajan.T
Hi,

Thanks for your response.

For example search keyword is like below...

Language "what is java"

Token 1:  language
Token 2: what is java(like google)


Regards,
Natarajan.





-Original Message-
From: Aad Nales [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 14, 2004 5:19 PM
To: 'Lucene Users List'
Subject: RE: Search PharseQuery

Hi,

Not sure if this is what you need but I created a lastname filter which
in Dutch means potential double last names like:"van der Vaart". In
order to process these I created a finite state machine that queried
these last names. Since I only needed the filter at index time and I
never use it for querying, this may not be what you are looking for.
It should be simple to index 'what is java' as a single token and to
search for that same token. However, you will need to create a list of
accepted 'tokens'. If this is what you need let me know, I will make the
code available...

cheers,
Aad Nales

-Original Message-
From: Honey George [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 14 September, 2004 13:39
To: Lucene Users List
Subject: Re: Search PharseQuery


 --- "Natarajan.T" <[EMAIL PROTECTED]>
wrote: 
> Hi All,
> 
>  
> 
> How do I implement PharseQuery API?

What exactly do you mean by implement? Are you trying to
extend the current behavior or only trying to find out
the usage?
Thanks,
  George









Document Relevance

2004-09-14 Thread ebrahim . faisal
Hi

I am new to Lucene.

Could anyone tell me how to control the relevance order in which the search results
are displayed.

Are any online examples available on this topic?

I welcome your suggestions.

Thanx & Regards
E.Faisal





RE: Search PharseQuery

2004-09-14 Thread Honey George
 --- "Natarajan.T" <[EMAIL PROTECTED]>
wrote: 
> I am trying to extend the current behavior.
You might have already seen a mail from Cocula Remi on
this. Please provide more details of the problem for
specific comments - basically the problem you are
facing and/or what behavior you are trying to extend.
This was not clear from your email. An example will
make things clearer.

Thanks & Regards,
   George










RE: Search PharseQuery

2004-09-14 Thread Natarajan.T
I am trying to extend the current behavior.

Regards,
Natarajan.

-Original Message-
From: Honey George [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 14, 2004 5:09 PM
To: Lucene Users List
Subject: Re: Search PharseQuery

 --- "Natarajan.T" <[EMAIL PROTECTED]>
wrote: 
> Hi All,
> 
>  
> 
> How do I implement PharseQuery API? 

What exactly do you mean by implement? Are you trying to
extend the current behavior or only trying to find out
the usage?
Thanks,
  George









RE: Search PharseQuery

2004-09-14 Thread Aad Nales
Hi,

Not sure if this is what you need but I created a lastname filter which
in Dutch means potential double last names like:"van der Vaart". In
order to process these I created a finite state machine that queried
these last names. Since I only needed the filter at index time and I
never use it for querying, this may not be what you are looking for.
It should be simple to index 'what is java' as a single token and to
search for that same token. However, you will need to create a list of
accepted 'tokens'. If this is what you need let me know, I will make the
code available...

cheers,
Aad Nales

-Original Message-
From: Honey George [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 14 September, 2004 13:39
To: Lucene Users List
Subject: Re: Search PharseQuery


 --- "Natarajan.T" <[EMAIL PROTECTED]>
wrote: 
> Hi All,
> 
>  
> 
> How do I implement PharseQuery API?

What exactly do you mean by implement? Are you trying to
extend the current behavior or only trying to find out
the usage?
Thanks,
  George









Re: Search PharseQuery

2004-09-14 Thread Honey George
 --- "Natarajan.T" <[EMAIL PROTECTED]>
wrote: 
> Hi All,
> 
>  
> 
> How do I implement PharseQuery API? 

What exactly do you mean by implement? Are you trying to
extend the current behavior or only trying to find out
the usage?
Thanks,
  George









RE: Search PharseQuery

2004-09-14 Thread Natarajan.T
Hi Serigu,


String queryString = "\"waht is java\"";
Query q = QueryParser.parse(queryString, "field", new
StandardAnalyzer());
System.out.println(q.toString());

This is enough to get started; consult the Lucene API for more information.

Have you tested the above query? This search string is not a single
keyword.


Regards,
Natarajan.








-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 14, 2004 4:34 PM
To: Lucene Users List
Subject: Re: Search PharseQuery

String queryString = "\"waht is java\"";
Query q = QueryParser.parse(queryString, "field", new
StandardAnalyzer());
System.out.println(q.toString());

This is enough to get started; consult the Lucene API for more information.

   Sergiu


Natarajan.T wrote:

>Hi,
>
>Thanks for your mail, that link says only theoretically but I need some
>sample
>
>
>Regards,
>Natarajan.
>
>
>-Original Message-
>From: Cocula Remi [mailto:[EMAIL PROTECTED] 
>Sent: Tuesday, September 14, 2004 2:58 PM
>To: Lucene Users List
>Subject: RE: Search PharseQuery
>
>Use QueryParser. 
>please take a look at
>http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html
>It's pretty clear.
>
>
>-Original Message-
>From: Natarajan.T [mailto:[EMAIL PROTECTED]
>Sent: Tuesday, September 14, 2004 11:26
>To: 'Lucene Users List'
>Subject: Search PharseQuery
>
>
>Hi All,
>
> 
>
>How do I implement PharseQuery API? Please send me some sample code. (How
>can I handle "java is platform" as a single word?)
>
>  
>
>Regards,
>
>Natarajan.
>
> 
>
> 
>
>



Re: Search PharseQuery

2004-09-14 Thread sergiu gordea
String queryString = "\"waht is java\"";
Query q = QueryParser.parse(queryString, "field", new StandardAnalyzer());
System.out.println(q.toString());
This is enough to get started; consult the Lucene API for more information.
  Sergiu
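For intuition, phrase matching over token positions can be sketched without Lucene. This is a simplified model, not Lucene's implementation (which works on per-term position lists in the index); the class name is hypothetical:

```java
import java.util.*;

// Simplified model of a phrase query: the phrase matches a document when
// its tokens appear at consecutive positions in the document's token
// stream (after analysis has removed stop words, etc.).
public class PhraseMatchSketch {
    public static boolean matches(List<String> docTokens, List<String> phrase) {
        if (phrase.isEmpty()) return false;
        for (int i = 0; i + phrase.size() <= docTokens.size(); i++) {
            // Compare the phrase against each window of consecutive tokens.
            if (docTokens.subList(i, i + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }
}
```

This also shows why the analyzer matters: if "is" is stripped from both the document and the query, the phrase "what is java" and the document text "what java" reduce to the same consecutive tokens.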
Natarajan.T wrote:
Hi,
Thanks for your mail, that link says only theoretically but I need some
sample
Regards,
Natarajan.
-Original Message-
From: Cocula Remi [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 14, 2004 2:58 PM
To: Lucene Users List
Subject: RE: Search PharseQuery

Use QueryParser. 
please take a look at
http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html
It's pretty clear.

-Original Message-
From: Natarajan.T [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 11:26
To: 'Lucene Users List'
Subject: Search PharseQuery
Hi All,

How do I implement PharseQuery API? Please send me some sample code. (How
can I handle "java is platform" as a single word?)
 

Regards,
Natarajan.




RE: Search PharseQuery

2004-09-14 Thread Natarajan.T
Hi,

Thanks for your mail, that link says only theoretically but I need some
sample


Regards,
Natarajan.


-Original Message-
From: Cocula Remi [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 14, 2004 2:58 PM
To: Lucene Users List
Subject: RE: Search PharseQuery

Use QueryParser. 
please take a look at
http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html
It's pretty clear.


-Original Message-
From: Natarajan.T [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 11:26
To: 'Lucene Users List'
Subject: Search PharseQuery


Hi All,

 

How do I implement PharseQuery API? Please send me some sample code. (How
can I handle "java is platform" as a single word?)

  

Regards,

Natarajan.

 

 





RE: Search PharseQuery

2004-09-14 Thread Cocula Remi
Use QueryParser. 
please take a look at 
http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html
It's pretty clear.


-Original Message-
From: Natarajan.T [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 11:26
To: 'Lucene Users List'
Subject: Search PharseQuery


Hi All,

 

How do I implement PharseQuery API? Please send me some sample code. (How
can I handle "java is platform" as a single word?)

  

Regards,

Natarajan.

 

 





Search PharseQuery

2004-09-14 Thread Natarajan.T
Hi All,

 

How do I implement PharseQuery API? Please send me some sample code. (How
can I handle "java is platform" as a single word?)

  

Regards,

Natarajan.

 

 



Re: OutOfMemory example

2004-09-14 Thread Daniel Naber
On Tuesday 14 September 2004 08:32, Jiří Kuhn wrote:

> The error is thrown in exactly the same point as before. This morning I
> downloaded Lucene from CVS, now the jar is lucene-1.5-rc1-dev.jar, JVM
> is 1.4.2_05-b04, both Linux and Windows.

Now I can reproduce the problem. I first tried running the code inside 
Eclipse, but the Exception doesn't occur there. It does occur on the 
command line.

Regards
 Daniel

-- 
http://www.danielnaber.de
