LUCENE & eCommerce

2005-04-01 Thread Karthik N S



 
Hi Guys,

Apologies. Has anybody out on the forum implemented the Lucene API in
eCommerce (search-based shopping), something similar to
http://www.bizrate.com/?

Please help me.

WITH WARM REGARDS
HAVE A NICE DAY
[N.S.KARTHIK]


RE: using different analyzer for searching

2005-04-01 Thread Karthik N S
Hi

First, try using the AnalysisDemo.java code from
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html?page=last#thread
from java.net on the content you seem to be experimenting with, and verify
which analyzer to use.

This will probably give you some idea about analyzers.

with regards
Karthik



-Original Message-
From: pashupathinath [mailto:[EMAIL PROTECTED]
Sent: Friday, April 01, 2005 9:19 AM
To: java-user@lucene.apache.org
Subject: Re: using different analyzer for searching


hi erik,
   i'm creating a blogger application where users can
create blogs, upload pictures, post comments,
and so on.
   i'm storing all the information in a mysql
database. i'm indexing the database contents and
searching against that index, using lucene to implement
this feature.
   i give the user the option to search by
BlogTitle, BlogDesc, or BlogCategory. the main purpose of the
search is that whenever a user enters a query related
to blogtitle, blogdesc, or blogcategory, it should
return all the documents matching that search
string.
   the real problem i'm facing is that whenever the user
enters only part of the full string, the search returns
zero hits, because i was using a WhitespaceAnalyzer, which
requires the complete term. i should look into using
WildcardQuery, which i think will solve my problem to
some extent.
   i should do even more analysis, as you suggested,
before deciding which analyzer to use.
what about writing a
custom analyzer to solve this, one that returns all the
documents containing even a part of the search string?
how would i go about implementing that logic in a
custom analyzer?
   any insight into this would be very helpful,
especially with regard to performance.

thanks,
pashupathinath.k

--- Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
> On Mar 31, 2005, at 11:44 AM, pashupathinath wrote:
>
> >   is it possible to index using a predefined
> analyzer
> > and search using a custom analyzer ??
>
> Yes, it's perfectly fine to do so, with the caveat
> that you end up
> searching for the terms exactly as they were
> indexed.
>
> I end up doing this in most applications, actually,
> primarily because
> untokenized fields need to use the KeywordAnalyzer
> during searching.
>
> >   i'm searching using the built in whitespace
> > analyser. the problem is when i'm searching for a
> part
> > of a string the search results are zero.
> >   i'm using white space analyzer. for example if
> the
> > statement is "my name is abc123" the search for
> > abc or
> > 123 doesn't return any hits.
> >   any insight into this ??
>
> The exact terms indexed using WhitespaceAnalyzer are
> like this (using
> the Lucene in Action AnalyzerDemo - "ant
> AnalyzerDemo"):
>
>  [input] String to analyze: [This string will be
> analyzed.]
> my name is abc123
>   [echo] Running lia.analysis.AnalyzerDemo...
>   [java] Analyzing "my name is abc123"
>   [java]   WhitespaceAnalyzer:
>   [java] [my] [name] [is] [abc123]
>
>   [java]   SimpleAnalyzer:
>   [java] [my] [name] [is] [abc]
>
>   [java]   StopAnalyzer:
>   [java] [my] [name] [abc]
>
>   [java]   StandardAnalyzer:
>   [java] [my] [name] [abc123]
>
> So you indexed "abc123" and searches must search for
> that term
> *exactly*.  You can search for "abc*" as a
> PrefixQuery or WildcardQuery
> and find "abc123".  "*123" will also find it though
> QueryParser does
> not support leading wildcard characters (but the API
> does).  Wildcard
> queries are not ideal, though, as they tend to
> be much slower for
> large indexes.
>
> You may need to do specialized analysis.  Perhaps
> you could share your
> real needs with the list and we could offer
> recommendations.  It is
> possible to index "abc123", "abc", and "123" all
> within the same
> position in the index if you do some clever analysis
> and that meshes
> with what you're after.
>
>   Erik
>
>
>

Send instant messages to your online friends http://uk.messenger.yahoo.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





indexing performance of little documents

2005-04-01 Thread Fabien Le Floc'h
Hello,
 
I want to index a 1 GB file that contains a list of lines of
approximately 100 characters each, so that I can later retrieve the lines
containing some particular text. The natural way of doing this with Lucene
would be to create one Lucene Document per line. It works well, except that
it is too slow for my needs, even after tweaking all possible parameters of
IndexWriter and using the CVS version of Lucene.
 
I can get 10x the indexing performance by indexing the file as one Lucene
Document. Lucene builds a good index with all the terms, and I am able to
get the number of terms matching a query, but not the absolute position
in the original file (I only get the relative token position). A minor
quirk with this approach is that I need to split the document in order
to avoid an OutOfMemoryError when the document is too big. It would
probably be possible for me to customize Lucene for my needs (create a more
flexible Term class), but that's just a hack. I was wondering why there
should be such a performance difference.
 
I see that plenty of work is done for each document, which seems
necessary, and then there is even more work while merging segments.
Things could probably be faster if documents were first aggregated and
the work then done on them. But I think this would imply huge changes in
Lucene. Any advice for indexing millions of tiny docs?
 
 
 
Regards,
 
Fabien.
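[Editor's note: the IndexWriter parameters alluded to above can be tuned for
batches of tiny documents. The sketch below targets the Lucene 1.4-era API
(mergeFactor and minMergeDocs were public fields then; later versions use
setters such as setMergeFactor()/setMaxBufferedDocs()). The index path and
values are illustrative, and the trade is memory for fewer, larger merges.]

```java
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TuneWriter {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("index", new SimpleAnalyzer(), true);
        writer.mergeFactor = 50;     // merge less often: fewer, bigger merges
        writer.minMergeDocs = 1000;  // buffer more tiny documents in RAM
        // ... add documents here ...
        writer.optimize();           // collapse segments once, at the end
        writer.close();
    }
}
```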


Re[2]: Analyzer don't work with wildcard queries, snowball analyzer.

2005-04-01 Thread Sven Duzont
Hello Erik,

Since wildcard queries are not analyzed, how can we deal with accents?
For instance (in French) a query like "ingé*" will not match documents
containing "ingénieur", but the query "inge*" will.

Thanks

---
 sven

Le jeudi 31 mars 2005 à 17:51:25, vous écriviez :

EH> Wildcard terms simply are not analyzed.  How could it be possible to do
EH> this?  What if I search for "a*" - how could you stem that?

EH> Erik

EH> On Mar 31, 2005, at 9:51 AM, Ernesto De Santis wrote:

>> Hi
>>
>> I get unexpected behavior when using wildcards in my queries.
>> I use an EnglishAnalyzer built on SnowballAnalyzer, version
>> 1.1_dev from the Lucene in Action lib.
>>
>> Analysis case:
>> When wildcards appear in the middle of a word, the word is not analyzed.
>> Examples:
>>
>>QueryParser qp = new QueryParser("body", analyzer);
>>Query q = qp.parse("ex?mple");
>>String strq = q.toString();
>>assertEquals("body:ex?mpl", strq);
>> //FAIL strq == body:ex?mple
>>
>>qp = new QueryParser("body", analyzer);
>>q = qp.parse("ex*ple");
>>strq = q.toString();
>>assertEquals("body:ex*pl", strq);
>> //FAIL strq == body:ex*ple
>>
>> With this behavior, the search does not find any document.
>>
>> Bye
>> Ernesto.
>>
>> -- 
>> Ernesto De Santis - Colaborativa.net
>> Córdoba 1147 Piso 6 Oficinas 3 y 4
>> (S2000AWO) Rosario, SF, Argentina.
>>
>>
>>







RE: LUCENE & eCommerce

2005-04-01 Thread William W
Hi Karthik,
Take a look at the e-commerce website www.shoptime.com from Brazil.
William.
From: "Karthik N S" <[EMAIL PROTECTED]>
Reply-To: java-user@lucene.apache.org
To: "LUCENE" 
Subject: LUCENE & eCommerce
Date: Fri, 1 Apr 2005 14:11:57 +0530

Hi Guys
Apologies.
Has anybody out on the forum implemented the Lucene API in eCommerce
(search-based shopping), something similar to http://www.bizrate.com/?

Please help me.


WITH WARM REGARDS
HAVE A NICE DAY
[ N.S.KARTHIK]



Re: indexing performance of little documents

2005-04-01 Thread Karl Øie
This might sound a bit lame, but it has worked for me. I have had the
same problem, where a large number of small Lucene documents slows down the
building of large indexes.

Search is pretty fast, and read-only, so in my case I just created
three indexes and saved every third Lucene document into one of them.
Upon a search I then merge the results from the three smaller
indexes. The only thing to consider is to store all parts of a source
document in the same index, so that booleans still work. I have even
threaded out the searching, so searches on the three indexes are performed
in parallel.

By the way: stop-word filters can also do wonders for an index full of
text...

Mvh Karl Øie
On 1. apr. 2005, at 11.43, Fabien Le Floc'h wrote:
[...]

Regards,
Fabien.
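[Editor's note: the fan-out-and-merge pattern Karl describes can be sketched
without any Lucene types. Here `search` is a stand-in for querying one
shard's IndexSearcher; the class and names are illustrative, and the
anonymous Callable style matches the Java 5 concurrency API of the period.]

```java
import java.util.*;
import java.util.concurrent.*;

public class ShardedSearch {
    // Stand-in for one shard's IndexSearcher: return the matching docs.
    static List<String> search(List<String> shard, String term) {
        List<String> hits = new ArrayList<String>();
        for (String doc : shard)
            if (doc.contains(term)) hits.add(doc);
        return hits;
    }

    // Query all shards in parallel and merge the per-shard results.
    static List<String> searchAll(List<List<String>> shards, final String term)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();
        for (final List<String> shard : shards)
            futures.add(pool.submit(new Callable<List<String>>() {
                public List<String> call() { return search(shard, term); }
            }));
        List<String> merged = new ArrayList<String>();
        for (Future<List<String>> f : futures) merged.addAll(f.get());
        pool.shutdown();
        return merged;
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> shards = Arrays.asList(
            Arrays.asList("ant build", "lucene index"),
            Arrays.asList("lucene query"),
            Arrays.asList("java io"));
        System.out.println(searchAll(shards, "lucene"));
    }
}
```

As Karl notes, this only works cleanly if all parts of a source document
live in the same shard.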



Re: Re[2]: Analyzer don't work with wildcard queries, snowball analyzer.

2005-04-01 Thread Erik Hatcher
On Apr 1, 2005, at 8:09 AM, Sven Duzont wrote:
Since wildcard queries are not analyzed, how can we deal with accents?
For instance (in French) a query like "ingé*" will not match documents
containing "ingénieur", but the query "inge*" will.
I presume your analyzer normalizes accented characters?  Which analyzer
is that?

You will need to employ some form of character normalization on 
wildcard queries too.

Erik

[...]




Re: using different analyzer for searching

2005-04-01 Thread Erik Hatcher
On Mar 31, 2005, at 10:49 PM, pashupathinath wrote:
   i should do even more analysis as suggested by you
before i should come to a decision of which analyser i
should be using to solve this. what about writing a
custom analyzer to solve this ??? how can i go abt the
logic of implementing this in a custom analyzer..
where this returns all the documents that has even a
part of  the search string.
   any insight into this would be very helpful
especially in terms of performance wise.
This is an involved topic, and one that is covered in great detail in 
the analysis chapter of Lucene in Action (shameless plug, yes, I 
know!).

I recommend you analyze the types of queries that need to be made and 
what type of user interface you will present for this - then determine 
what makes the most sense analysis-wise.  WhitespaceAnalyzer is not 
going to be good enough, as I suspect you'll want case-insensitive 
searches at least.

Erik
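[Editor's note: the "clever analysis" Erik mentions, indexing "abc123",
"abc", and "123" all at the same position, comes down to splitting a mixed
token into its letter and digit runs and giving every sub-token after the
first a position increment of 0 (Lucene's marker for "same position as the
previous token"). The class below is an illustrative, self-contained sketch
of that splitting logic; a real implementation would wrap it in a
TokenFilter.]

```java
import java.util.*;

public class SubTokenDemo {
    // Split a token like "abc123" into itself plus its letter-only and
    // digit-only runs. Each entry pairs the token text with a position
    // increment: "1" advances the position, "0" stays on the same one.
    static List<String[]> subTokens(String token) {
        List<String[]> out = new ArrayList<String[]>();
        out.add(new String[] { token, "1" });
        List<String> runs = new ArrayList<String>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean boundary = run.length() > 0
                && Character.isDigit(c) != Character.isDigit(run.charAt(run.length() - 1));
            if (boundary) { runs.add(run.toString()); run.setLength(0); }
            run.append(c);
        }
        if (run.length() > 0) runs.add(run.toString());
        if (runs.size() > 1)            // only emit parts for mixed tokens
            for (String r : runs) out.add(new String[] { r, "0" });
        return out;
    }

    public static void main(String[] args) {
        for (String[] t : subTokens("abc123"))
            System.out.println(t[0] + " (posIncr=" + t[1] + ")");
    }
}
```

With this scheme a search for "abc123", "abc", or "123" would all hit the
same document, without wildcard queries.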

thanks,
pashupathinath.k
--- Erik Hatcher <[EMAIL PROTECTED]> wrote:
[...]



Re[4]: Analyzer don't work with wildcard queries, snowball analyzer.

2005-04-01 Thread Sven Duzont
EH> I presume your analyzer normalized accented characters?  Which analyzer
EH> is that?

Yes, I'm using a custom analyzer for indexing / searching. It consists
of:
- FrenchStopFilter
- IsoLatinFilter (this is the one that replaces accented
characters)
- LowerCaseFilter
- ApostropheFilter (in order to handle terms with apostrophes;
for instance "l'expérience" will be decomposed into two tokens: "l"
and "expérience")

EH> You will need to employ some form of character normalization on 
EH> wildcard queries too.

Thanks, it works successfully; code snippet follows.

---
 sven

/*--- CODE */

private static Query CreateCustomQuery(Query query)
{
  if (query instanceof BooleanQuery)  {
    final BooleanClause[] bClauses = ((BooleanQuery) query).getClauses();

    // The first clause is required
    if (bClauses[0].prohibited != true)
      bClauses[0].required = true;

    // Walk each clause and remove accents if needed
    Term term;
    for (int i = 0; i < bClauses.length; i++) {
      if (bClauses[i].query instanceof WildcardQuery)  {
        term = ((WildcardQuery) bClauses[i].query).getTerm();
        bClauses[i].query = new WildcardQuery(new Term(term.field(),
            ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
      }
      if (bClauses[i].query instanceof PrefixQuery)  {
        term = ((PrefixQuery) bClauses[i].query).getPrefix();
        bClauses[i].query = new PrefixQuery(new Term(term.field(),
            ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
        // toLowerCase because the text is lowercased during indexing
      }
    }
  }
  else if (query instanceof WildcardQuery)  {
    final Term term = ((WildcardQuery) query).getTerm();
    query = new WildcardQuery(new Term(term.field(),
        ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
  }
  else if (query instanceof PrefixQuery)  {
    final Term term = ((PrefixQuery) query).getPrefix();
    query = new PrefixQuery(new Term(term.field(),
        ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
  }
  return query;
}

/*--- END OF CODE */

EH> Erik







RE: FilteredQuery and Boolean AND

2005-04-01 Thread Kipping, Peter
Any ideas on this?  I have purchased your book, Lucene in Action, which
is quite good.  To make things easier, consider the example on p212.  In
item 4, when you combine the queries, what happens when you combine them in
an AND fashion?  The book only has OR, which works.  It may
work because the book only has one filtered query, but what if you made
them both filtered queries and ANDed them?

Thanks,
Peter

-Original Message-
From: Kipping, Peter [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 25, 2005 10:34 AM
To: java-user@lucene.apache.org
Subject: FilteredQuery and Boolean AND

I have the following query structure:

BooleanQuery q2 = new BooleanQuery();
TermQuery tq = new TermQuery(new Term("all_entries", "y"));
FilteredQuery fq = new FilteredQuery(tq, ft);
FilteredQuery fq2 = new FilteredQuery(tq, ft2);
q2.add(fq, false, false);
q2.add(fq2, false, false);

The two filters are searches over numeric ranges.  I'm using filters so
I don't get the TooManyBooleanClauses Exception.  And my TermQuery tq is
just a field that has 'y' in every document so I can filter over the
entire index.  In the last two lines I am creating a boolean OR, and
everything works fine.  I get back 30 documents which is correct.

However when I change the last two lines to create an AND:

q2.add(fq, true, false);
q2.add(fq2, true, false);

I still get back 30 documents, which is not correct.  It should be 0.
What's going on with FilteredQuery?

Thanks,
Peter









Deeply nested boolean query performance

2005-04-01 Thread Erik Hatcher
I will soon create some tests for this scenario, but wanted to run this 
by the list as well.

What performance differences would be seen between a query like this:
a AND b AND c AND d
and this one:
((a AND b) AND c) AND d
In other words, will building a query with nested boolean queries be 
substantially slower than a single boolean query with many clauses?  Or 
might it be the other way around?

Thanks,
Erik


Re: Re[4]: Analyzer don't work with wildcard queries, snowball analyzer.

2005-04-01 Thread Erik Hatcher
On Apr 1, 2005, at 11:07 AM, Sven Duzont wrote:
EH> I presume your analyzer normalizes accented characters?  Which analyzer
EH> is that?

Yes, I'm using a custom analyzer for indexing / searching. It consists
of:
- FrenchStopFilter
- IsoLatinFilter (this is the one that replaces accented
characters)
Could you share that filter with the community?
EH> You will need to employ some form of character normalization on
EH> wildcard queries too.
Thanks, it works successfully; code snippet follows.
---
 sven
/*--- CODE */
private static Query CreateCustomQuery(Query query)
{
  if(query instanceof BooleanQuery)  {
final BooleanClause[] bClauses = ((BooleanQuery) 
query).getClauses();

// The first clause is required
if(bClauses[0].prohibited != true)
  bClauses[0].required = true;
Why do you flip the required flag like this?
// Will parse each clause to remove accents if needed
Term term;
for (int i = 0; i < bClauses.length; i++){
  if(bClauses[i].query instanceof WildcardQuery)  {
term = ((WildcardQuery)bClauses[i].query).getTerm();
bClauses[i].query = new WildcardQuery(new Term(term.field(),
    ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
  }
What about handling BooleanQuery's nested within a BooleanQuery?  
You'll need some recursion.

Erik

  if(bClauses[i].query instanceof PrefixQuery)  {
term = ((PrefixQuery)bClauses[i].query).getPrefix();
bClauses[i].query = new PrefixQuery(new Term(term.field(),
    ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
  // toLowerCase because the text is lowercased during indexation
  }
}
  }
  else if(query instanceof WildcardQuery)  {
final Term term = ((WildcardQuery)query).getTerm();
query = new WildcardQuery(new Term(term.field(),
    ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
  }
  else if(query instanceof PrefixQuery)  {
final Term term = ((PrefixQuery)query).getPrefix();
query = new PrefixQuery(new Term(term.field(),
    ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
  }
  return query;
}

/*--- END OF CODE */
EH>  Erik



Plural Stemming

2005-04-01 Thread Miles Barr
Are there any Lucene extensions that can do simple stemming, i.e. just
for plurals? Or is the only stemming package available Snowball?



Cheers
-- 
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.





Re: Plural Stemming

2005-04-01 Thread Andrzej Bialecki
Miles Barr wrote:
Are there any Lucene extensions that can do simple stemming, i.e. just
for plurals? Or is the only stemming package available Snowball?
For which language? Stemming is always language-specific...
If it's for English, there is also a built-in PorterStemmer. If you know 
what you are doing, you could disable some of the stemming rules to get such 
"under-stemming".

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
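[Editor's note: a minimal plural-only stemmer of the kind Miles asks about
can be written as a small rule set over strings and then wrapped in a
TokenFilter. The class below is an illustrative sketch only; it is English-
specific, handles just a few suffix patterns, and deliberately mishandles
irregular plurals and words like "goes".]

```java
public class PluralStemmer {
    // Strip common English plural suffixes; very rough, by design.
    static String stem(String w) {
        if (w.endsWith("ies") && w.length() > 4)
            return w.substring(0, w.length() - 3) + "y";   // queries -> query
        if (w.endsWith("ses") || w.endsWith("xes") || w.endsWith("zes")
                || w.endsWith("ches") || w.endsWith("shes"))
            return w.substring(0, w.length() - 2);          // boxes -> box
        if (w.endsWith("s") && !w.endsWith("ss") && !w.endsWith("us")
                && !w.endsWith("is"))
            return w.substring(0, w.length() - 1);          // plurals -> plural
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[] { "queries", "boxes", "plurals", "analysis" })
            System.out.println(w + " -> " + stem(w));
    }
}
```

Because this under-stems on purpose, it avoids the aggressive conflations
a full Porter or Snowball stemmer would introduce.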


Re: Plural Stemming

2005-04-01 Thread Miles Barr
On Fri, 2005-04-01 at 19:24 +0200, Andrzej Bialecki wrote:
> Miles Barr wrote:
> > Are there any Lucene extensions that can do simple stemming, i.e. just
> > for plurals? Or is the only stemming package available Snowball?
> 
> For which language? Stemming is always language-specific...
> 
> If for English, then there is also a built-in PorterStemmer. If you know 
> what you do, you could disable some of the stemming rules to get such 
> "under-stemming".

Sorry, I should have said: at the moment I'm only going to be handling
English, but potentially other languages in the future.

I'll take a look at the PorterStemmer.


Thanks
-- 
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.





Re: Deeply nested boolean query performance

2005-04-01 Thread Paul Elschot
On Friday 01 April 2005 18:14, Erik Hatcher wrote:
> I will soon create some tests for this scenario, but wanted to run this 
> by the list as well

Great, see below.

> What performance differences would be seen between a query like this:
> 
>   a AND b AND c AND d

This will use a single ConjunctionScorer, and it is the fastest form.
 
> and this one:
> 
>   ((a AND b) AND c) AND d

> In other words, will building a query with nested boolean queries be 
> substantially slower than a single boolean query with many clauses?  Or 
> might it be the other way around?

This will use a ConjunctionScorer for (a AND b), assuming a and
b are terms. For the other AND operators a BooleanScorer will be
used in 1.4.3. The development version will use a ConjunctionScorer
at each AND operator.

The main difference between a ConjunctionScorer and a BooleanScorer
is the use of skipTo(), i.e. the forwarding information in the term docs
index, which allows a 'fast forward' to a given document.
This 'fast forward' is useful for AND queries, and ConjunctionScorer does it;
BooleanScorer simply uses next() instead. The next() method iterates
over all documents in a term docs index.

In other words, the nested form should be significantly slower than
the flat form in 1.4.3, and just a bit slower in the development version.

Another skipTo advantage comes from this form:
(a OR b) AND c
In 1.4.3, this uses a BooleanScorer for both operators, making this
as much work as:
(a OR b) OR c.
In the development version, the OR operator gets a DisjunctionScorer,
and the AND operator a ConjunctionScorer, both allowing the use
of skipTo(), even on the a and b terms.

In this context (a OR b) can also be for example a fuzzy query or a prefix 
query.

The development version also uses skipTo() on b in the following situations:
+a b
a -b

So, when you measure, please use both 1.4.3 and the development version
to see the differences. And, of course, the larger your index, the better.
As the code is still a bit young, you might be in for some surprises, too.
skipTo() has the biggest advantages when the index data is not
available in any cache.

Regards,
Paul Elschot.
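[Editor's note: the 'fast forward' that skipTo() enables is the classic
leapfrog intersection of sorted posting lists. Stripped of Lucene's classes,
the idea looks like the sketch below; plain int arrays stand in for
term-docs iterators, and the linear skipTo here is where a real index would
consult its skip data instead of scanning.]

```java
import java.util.*;

public class Leapfrog {
    // Position i at the first element of a[] that is >= target
    // (a stand-in for TermDocs.skipTo(target)).
    static int skipTo(int[] a, int i, int target) {
        while (i < a.length && a[i] < target) i++;
        return i;
    }

    // Intersect two sorted doc-id lists by leapfrogging, the way a
    // ConjunctionScorer does, instead of visiting every posting with next().
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<Integer>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i = skipTo(a, i, b[j]);
            else j = skipTo(b, j, a[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] a = { 1, 5, 9, 200, 350 };
        int[] b = { 5, 80, 200, 500 };
        System.out.println(intersect(a, b));   // [5, 200]
    }
}
```

The sparser the rarest term's list, the more postings the other lists get
to skip, which is why the flat AND form that keeps all clauses in one
ConjunctionScorer wins.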





Re: FilteredQuery and Boolean AND

2005-04-01 Thread Erik Hatcher
Peter,
Could you provide a straightforward test case that indexes a few 
documents into a RAMDirectory and demonstrates the problem you're 
having with AND'd FilteredQuery's?

Give me something concrete and simple and I'll dig into it further.
Erik
On Apr 1, 2005, at 11:13 AM, Kipping, Peter wrote:
[...]


Re: Deeply nested boolean query performance

2005-04-01 Thread Erik Hatcher
Paul,
Thanks for your very thorough response.  It is very helpful.
For all my projects, I'm using the latest Subversion codebase and 
staying current with any changes there, so that is very good news.

Erik
On Apr 1, 2005, at 1:10 PM, Paul Elschot wrote:
[...]


Re: proximity search in lucene

2005-04-01 Thread Sujatha Das

Hi,
Does Lucene support "SpanNear" or phrase queries where the clauses or terms 
are not of the same field?
If not, could someone let me know how to support proximity
searches with terms belonging to different fields?

Thanks much,
Sujatha Das

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Performance Question

2005-04-01 Thread Omar Didi
I have 5 indexes, each one 6GB. I need 512MB of heap in order to open
the indexes and run all types of queries. My question is: is it better to just
have one large 30GB index? Will increasing the heap size improve performance?
Can I store an instance of MultiSearcher (or just a Searcher, in case one big
index is better) in an application variable, since I have 3 servlets that open
the index? Would implementing a listener to reopen the index be useful, knowing
that the index changes once a month?
Any suggestions would be very helpful,
Thanks,
omar
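On the sharing question: an IndexSearcher (or MultiSearcher) is safe to use from multiple threads, so the usual pattern is to open it once, share the same instance across all servlets, and swap it only when the index is rebuilt. The holder below is a plain-Java sketch of that pattern; SearcherHolder and the factory are illustrative names standing in for the real open-the-Lucene-searcher code, not Lucene API.

```java
/**
 * Sketch of a share-one-searcher pattern: open the (expensive) searcher once,
 * hand the same instance to every caller, and swap it atomically when the
 * index is rebuilt (e.g. once a month). Names here are illustrative.
 */
public class SearcherHolder<T> {

    private volatile T current;                           // the shared, read-only searcher
    private final java.util.function.Supplier<T> factory; // how to open a fresh one

    public SearcherHolder(java.util.function.Supplier<T> factory) {
        this.factory = factory;
        this.current = factory.get();                     // open once at startup
    }

    /** Every servlet calls this; no per-request open/close. */
    public T get() {
        return current;
    }

    /** Call after the monthly index rebuild; in-flight readers finish on the old instance. */
    public synchronized void reopen() {
        current = factory.get();
    }
}
```

In a servlet environment the holder would typically live in the ServletContext so all three servlets see the same instance.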

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, April 01, 2005 1:29 PM
To: java-user@lucene.apache.org
Subject: Re: FilteredQuery and Boolean AND


Peter,

Could you provide a straight-forward test case that indexes a few 
documents into a RAMDirectory and demonstrates the problem you're 
having with AND'd FilteredQuery's?

Give me something concrete and simple and I'll dig into it further.

Erik

On Apr 1, 2005, at 11:13 AM, Kipping, Peter wrote:

> Any ideas on this?  I have purchased your book, Lucene in Action, which
> is quite good.  To make things easier, consider the example on p212.  
> In item 4, what happens when you combine the queries in an AND fashion?
> The book only has OR, which works, but perhaps only because it uses a
> single filtered query.  What if you made them both filtered queries
> and ANDed them?
>
> Thanks,
> Peter
>
> -Original Message-
> From: Kipping, Peter [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 25, 2005 10:34 AM
> To: java-user@lucene.apache.org
> Subject: FilteredQuery and Boolean AND
>
> I have the following query structure:
>
> BooleanQuery q2 = new BooleanQuery();
> TermQuery tq = new TermQuery(new Term("all_entries", "y"));
> FilteredQuery fq = new FilteredQuery(tq, ft);
> FilteredQuery fq2 = new FilteredQuery(tq, ft2);
> q2.add(fq, false, false);
> q2.add(fq2, false, false);
>
> The two filters are searches over numeric ranges.  I'm using filters so
> I don't get the TooManyBooleanClauses Exception.  And my TermQuery tq 
> is
> just a field that has 'y' in every document so I can filter over the
> entire index.  In the last two lines I am creating a boolean OR, and
> everything works fine.  I get back 30 documents which is correct.
>
> However when I change the last two lines to create an AND:
>
> q2.add(fq, true, false);
> q2.add(fq2, true, false);
>
> I still get back 30 documents, which is not correct.  It should be 0.
> What's going on with FilteredQuery?
>
> Thanks,
> Peter
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Re: proximity search in lucene

2005-04-01 Thread Erik Hatcher
On Apr 1, 2005, at 2:29 PM, Sujatha Das wrote:

Hi,
Does Lucene support "SpanNear" or phrase queries where the clauses or 
terms are not of the same field?
If not, could someone let me know how to support
proximity searches with terms belonging to different fields.
No, it does not support cross-field proximity.  I don't even remotely
understand what that would mean.  Could you provide an example
of what you're after?

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Plural Stemming

2005-04-01 Thread Chris Hostetter

: > > Are there any Lucene extensions that can do simple stemming, i.e. just
: > > for plurals? Or is the only stemming package available Snowball?

LIA has a case study of jGuru which uses a very specific, home-grown
utility method called "stripEnglishPlural" ... since it's in the case
study chapter, I'm not sure if it's included in the book's source code, but
it is included verbatim in the book...

   http://lucenebook.com/search?query=stripEnglishPlural



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Plural Stemming

2005-04-01 Thread Erik Hatcher
On Apr 1, 2005, at 7:03 PM, Chris Hostetter wrote:
: > > Are there any Lucene extensions that can do simple stemming, 
i.e. just
: > > for plurals? Or is the only stemming package available Snowball?

LIA has a case study of jGuru which uses a very specific, home-grown
utility method called "stripEnglishPlural" ... since it's in the case
study chapter, I'm not sure if it's included in the book's source code,
but it is included verbatim in the book...

   http://lucenebook.com/search?query=stripEnglishPlural
Thanks for the reminder, Chris.  I'm sure jGuru wouldn't mind us 
posting it, so I've pasted it below.  It is not included in the LIA 
source code - only the code Otis and I wrote ourselves is included 
there, and we didn't get the source code from any of the case studies 
(other than Bob Carpenter's LingPipe stuff).

Erik
/** A useful, but not particularly efficient plural stripper */
public static String stripEnglishPlural(String word) {
// too small?
if ( word.length()
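(The archive truncates the method mid-line here. The jGuru original is not reproduced below; this is only my own guess at the general shape of such a plural stripper, for readers who want something runnable.)

```java
/** A rough plural-stripper sketch (NOT the jGuru original, which the archive cuts off). */
public class PluralSketch {

    public static String stripEnglishPlural(String word) {
        // too small to be a plural?
        if (word.length() < 4) {
            return word;
        }
        // don't mangle common -ss/-us/-is endings: "glass", "bonus", "basis"
        if (word.endsWith("ss") || word.endsWith("us") || word.endsWith("is")) {
            return word;
        }
        if (word.endsWith("ies")) {                      // ponies -> pony
            return word.substring(0, word.length() - 3) + "y";
        }
        if (word.endsWith("shes") || word.endsWith("ches")
                || word.endsWith("xes") || word.endsWith("ses")) {
            return word.substring(0, word.length() - 2); // boxes -> box
        }
        if (word.endsWith("s")) {                        // dogs -> dog
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stripEnglishPlural("ponies")); // pony
        System.out.println(stripEnglishPlural("boxes"));  // box
        System.out.println(stripEnglishPlural("dogs"));   // dog
    }
}
```

Like the original, this is a heuristic, not a real stemmer; irregular plurals ("mice", "children") pass through unchanged.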
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: FilteredQuery and Boolean AND

2005-04-01 Thread Chris Hostetter

Peter's problem intrigued me, so I wrote my own test case using two simple
Filters that filter out all but the first (or last) doc.  I seem to be
getting the same results he is.  See the attached test case.

While this definitely seems like a bug, it also seems like a fairly
inefficient way of approaching the problem in general.  Instead of:
  BooleanQuery containing:
a) FilteredQuery wrapping:
Query for "all" -- filtered by -- RangeFilter #1
b) FilteredQuery wrapping:
Query for "all" -- filtered by -- RangeFilter #2

...it seems like it would make more sense to use...

  FilteredQuery wrapping:
Query for "all" -- filtered by -- ChainedFilter containing:
  a) RangeFilter #1
  b) RangeFilter #2
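In Lucene 1.4 a Filter produces a BitSet of allowed documents, and the sandbox ChainedFilter combines several filters by applying boolean logic to those BitSets. The AND step Hoss suggests reduces to a BitSet intersection, sketched here in plain Java (illustrative code, not the ChainedFilter source):

```java
import java.util.BitSet;

/** Sketch of how a chained AND of filters reduces to BitSet intersection. */
public class ChainedFilterSketch {

    /** ANDs per-filter BitSets into one: a doc survives only if every filter allows it. */
    static BitSet andChain(BitSet... filterBits) {
        BitSet result = (BitSet) filterBits[0].clone();
        for (int i = 1; i < filterBits.length; i++) {
            result.and(filterBits[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet range1 = new BitSet();   // docs passing range filter #1
        range1.set(0); range1.set(2); range1.set(5);
        BitSet range2 = new BitSet();   // docs passing range filter #2
        range2.set(2); range2.set(3); range2.set(5);
        System.out.println(andChain(range1, range2)); // {2, 5}
    }
}
```

This also shows why the chained form is cheaper than two FilteredQuerys in a BooleanQuery: the single wrapped query runs once and the filters are combined as bit operations.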




: Date: Fri, 1 Apr 2005 13:29:04 -0500
: From: Erik Hatcher <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Re: FilteredQuery and Boolean AND
:
: Peter,
:
: Could you provide a straight-forward test case that indexes a few
: documents into a RAMDirectory and demonstrates the problem you're
: having with AND'd FilteredQuery's?
:
: Give me something concrete and simple and I'll dig into it further.
:
:   Erik
:
: On Apr 1, 2005, at 11:13 AM, Kipping, Peter wrote:
:
: > Any ideas on this?  I have purchased your book, Lucene in Action, which
: > is quite good.  To make things easier, consider the example on p212.
: > In item 4, what happens when you combine the queries in an AND fashion?
: > The book only has OR, which works, but perhaps only because it uses a
: > single filtered query.  What if you made them both filtered queries
: > and ANDed them?
: >
: > Thanks,
: > Peter
: >
: > -Original Message-
: > From: Kipping, Peter [mailto:[EMAIL PROTECTED]
: > Sent: Friday, March 25, 2005 10:34 AM
: > To: java-user@lucene.apache.org
: > Subject: FilteredQuery and Boolean AND
: >
: > I have the following query structure:
: >
: > BooleanQuery q2 = new BooleanQuery();
: > TermQuery tq = new TermQuery(new Term("all_entries", "y"));
: > FilteredQuery fq = new FilteredQuery(tq, ft);
: > FilteredQuery fq2 = new FilteredQuery(tq, ft2);
: > q2.add(fq, false, false);
: > q2.add(fq2, false, false);
: >
: > The two filters are searches over numeric ranges.  I'm using filters so
: > I don't get the TooManyBooleanClauses Exception.  And my TermQuery tq
: > is
: > just a field that has 'y' in every document so I can filter over the
: > entire index.  In the last two lines I am creating a boolean OR, and
: > everything works fine.  I get back 30 documents which is correct.
: >
: > However when I change the last two lines to create an AND:
: >
: > q2.add(fq, true, false);
: > q2.add(fq2, true, false);
: >
: > I still get back 30 documents, which is not correct.  It should be 0.
: > What's going on with FilteredQuery?
: >
: > Thanks,
: > Peter
: >
: >
: > -
: > To unsubscribe, e-mail: [EMAIL PROTECTED]
: > For additional commands, e-mail: [EMAIL PROTECTED]
: >
: >
: >
: >
: > -
: > To unsubscribe, e-mail: [EMAIL PROTECTED]
: > For additional commands, e-mail: [EMAIL PROTECTED]
:
:
: -
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:



-Hoss

import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.DateField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.FilteredQuery;

import java.io.IOException;
import java.util.Random;
import java.util.BitSet;

import junit.framework.TestCase;

public class TestKippingPeterBug extends TestCase {

public static final boolean F = false;
public static final boolean T = true;

public static String[] data = new String [] {
"a b   m n",
"a x   q r",
"a b   s t",
"a x   e f",
"a"
};


RAMDirectory index = new RAMDirectory();
IndexReader r;
IndexSearcher s;
Query ALL = new TermQuery(new Term("data","a"));


public TestKippingPeterBug(String name) {
	super(name);
}
public TestKippingPeterBug() {
super();
}

public void setUp() throws Exception {

/* build an index */
IndexWriter writer = new IndexWriter(index,