LUCENE & eCommerce
Hi guys, apologies for the interruption. Has anybody on the forum implemented the Lucene API for e-commerce (search-based shopping), something similar to http://www.bizrate.com/? Please help me. With warm regards, have a nice day. [N.S. Karthik]
RE: using different analyzer for searching
Hi, First try using the AnalysisDemo.java code from http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html?page=last#thread on java.net against the content you are experimenting with, and verify which analyzer to use. That should give you some idea of how the analyzers behave. With regards, Karthik

-----Original Message----- From: pashupathinath [mailto:[EMAIL PROTECTED]] Sent: Friday, April 01, 2005 9:19 AM To: java-user@lucene.apache.org Subject: Re: using different analyzer for searching

Hi Erik, I'm creating a blogger application where users can create blogs, upload pictures, post comments, and so on. I store all the information in a MySQL database, index the database contents, and search on that index using Lucene. I give the user the option to search by BlogTitle, BlogDesc, or BlogCategory. Whenever a user enters a query related to any of these fields, the search should return all documents matching that search string. The real problem I'm facing is that whenever the user enters only part of the main string, the search returns zero hits, because I am using a WhitespaceAnalyzer, which needs the complete string. I should look into using WildcardQuery, which I think will solve my problem to some extent, and do more analysis, as you suggested, before deciding which analyzer to use. What about writing a custom analyzer to solve this? How can I go about implementing the logic in a custom analyzer so that it returns all documents that contain even a part of the search string? Any insight into this would be very helpful, especially performance-wise. Thanks, pashupathinath.k

--- Erik Hatcher <[EMAIL PROTECTED]> wrote:
> On Mar 31, 2005, at 11:44 AM, pashupathinath wrote:
> > is it possible to index using a predefined analyzer and search using a custom analyzer ??
> Yes, it's perfectly fine to do so, with the caveat that you end up searching for the terms exactly as they were indexed. I end up doing this in most applications, actually, primarily because untokenized fields need to use the KeywordAnalyzer during searching.
>
> > i'm searching using the built-in whitespace analyzer. the problem is when i'm searching for a part of a string the search results are zero. for example if the statement is "my name is abc123" the search for abc or 123 doesn't return any hits. any insight into this ??
>
> The exact terms indexed using WhitespaceAnalyzer look like this (using the Lucene in Action AnalyzerDemo - "ant AnalyzerDemo"):
>
>     [input] String to analyze: [This string will be analyzed.] my name is abc123
>     [echo] Running lia.analysis.AnalyzerDemo...
>     [java] Analyzing "my name is abc123"
>     [java] WhitespaceAnalyzer:
>     [java]   [my] [name] [is] [abc123]
>     [java] SimpleAnalyzer:
>     [java]   [my] [name] [is] [abc]
>     [java] StopAnalyzer:
>     [java]   [my] [name] [abc]
>     [java] StandardAnalyzer:
>     [java]   [my] [name] [abc123]
>
> So you indexed "abc123", and searches must match that term *exactly*. You can search for "abc*" as a PrefixQuery or WildcardQuery and find "abc123". "*123" will also find it, though QueryParser does not support leading wildcard characters (the API does). Wildcard queries are not ideally what you want, as they tend to be much slower on large indexes.
>
> You may need to do specialized analysis. Perhaps you could share your real needs with the list and we could offer recommendations. It is possible to index "abc123", "abc", and "123" all within the same position in the index if you do some clever analysis and that meshes with what you're after.
>
> Erik
indexing performance of little documents
Hello, I want to index a 1GB file containing lines of approximately 100 characters each, so that I can later retrieve lines containing some particular text. The natural way of doing this with Lucene would be to create one Lucene Document per line. That works, except it is too slow for my needs, even after tweaking all possible parameters of IndexWriter and using the CVS version of Lucene. I can get 10x the indexing performance by indexing the file as a single Lucene Document. Lucene builds a good index with all the terms, and I am able to get the number of terms matching a query, but not their absolute positions in the original file (I only get the token positions relative to the document). A minor quirk with this approach is that I need to split the document to avoid an OutOfMemoryError when the document is too big. It would probably be possible to customize Lucene for my needs (create a more flexible Term class), but that's just a hack. I was wondering why there should be such a performance difference. I see that plenty of work is done for each document, but that seems necessary, and then there is even more work while merging segments. Things could probably be faster if documents were first aggregated and the work done on them afterwards, but I think this would imply huge changes in Lucene. Any advice for indexing millions of tiny docs? Regards, Fabien.
Re[2]: Analyzer don't work with wildcard queries, snowball analyzer.
Hello Erik, Since wildcard queries are not analyzed, how can we deal with accents? For instance (in French) a query like "ingé*" will not match documents containing "ingénieur", but the query "inge*" will. Thanks --- sven

On Thursday, March 31, 2005 at 17:51:25, you wrote:
EH> Wildcard terms simply are not analyzed. How could it be possible to do this? What if I search for "a*" - how could you stem that?
EH> Erik
EH> On Mar 31, 2005, at 9:51 AM, Ernesto De Santis wrote:
>> Hi, I get unexpected behavior when I use wildcards in my queries. I use an EnglishAnalyzer built with the SnowballAnalyzer, version 1.1_dev from the Lucene in Action lib.
>> Analysis case: when I use wildcards in the middle of a word, the word is not analyzed. Examples:
>>
>>     QueryParser qp = new QueryParser("body", analyzer);
>>     Query q = qp.parse("ex?mple");
>>     String strq = q.toString();
>>     assertEquals("body:ex?mpl", strq);  // FAILS: strq == body:ex?mple
>>
>>     qp = new QueryParser("body", analyzer);
>>     q = qp.parse("ex*ple");
>>     strq = q.toString();
>>     assertEquals("body:ex*pl", strq);   // FAILS: strq == body:ex*ple
>>
>> With this behavior, the search does not find any document. Bye, Ernesto.
>> -- Ernesto De Santis - Colaborativa.net Córdoba 1147 Piso 6 Oficinas 3 y 4 (S2000AWO) Rosario, SF, Argentina.
RE: LUCENE & eCommerce
Hi Karthik, Have a look at the e-commerce website www.shoptime.com from Brazil. William.

From: "Karthik N S" <[EMAIL PROTECTED]> Subject: LUCENE & eCommerce Date: Fri, 1 Apr 2005 14:11:57 +0530
Re: indexing performance of little documents
This might sound a bit lame, but it has worked for me. I have had the same problem, where the number of small Lucene documents slows down the building of large indexes. Search is pretty fast, and read-only, so in my case I just created three indexes and saved every third Lucene document into each one. Upon a search, I merge the results from the three smaller indexes. The only thing to consider is to store all parts of a source document in the same index, so that boolean queries still work. I have even threaded out the searching, so the searches on the three indexes are performed in parallel. By the way, stop word filters can also do wonders for an index full of text. Best regards, Karl Øie

On 1. apr. 2005, at 11.43, Fabien Le Floc'h wrote:
> Any advice for indexing millions of tiny docs?
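Karl's partition-and-merge scheme can be sketched in plain Java. Everything here is illustrative: `searchOneIndex` is a stand-in for a real per-index Lucene search, and the "indexes" are just lists of strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionedSearch {
    // Stand-in for a real per-index Lucene search; here each "index"
    // is just a list of documents that we filter by substring match.
    static List<String> searchOneIndex(List<String> index, String term) {
        List<String> hits = new ArrayList<>();
        for (String doc : index)
            if (doc.contains(term)) hits.add(doc);
        return hits;
    }

    // Search all partitions in parallel and merge the hit lists,
    // the way Karl describes searching his three indexes.
    public static List<String> search(List<List<String>> partitions, String term) {
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> p : partitions) {
                Callable<List<String>> task = () -> searchOneIndex(p, term);
                futures.add(pool.submit(task));
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures)
                merged.addAll(f.get());   // blocks until each partition finishes
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Note the caveat Karl raises: all parts of one source document must land in the same partition, or AND-style queries spanning those parts will silently fail.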
Re: Re[2]: Analyzer don't work with wildcard queries, snowball analyzer.
On Apr 1, 2005, at 8:09 AM, Sven Duzont wrote:
> Since wildcard queries are not analyzed, how can we deal with accents? For instance (in French) a query like "ingé*" will not match documents containing "ingénieur", but the query "inge*" will.

I presume your analyzer normalizes accented characters? Which analyzer is that? You will need to employ some form of character normalization on wildcard queries too.

Erik
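One portable way to get the normalization Erik describes is Unicode decomposition followed by stripping combining marks. This is a sketch with names of my own choosing; it assumes `java.text.Normalizer` (available since Java 6) rather than a Lucene filter:

```java
import java.text.Normalizer;

public class AccentUtil {
    // Decompose characters (é -> e + combining acute accent),
    // then strip the combining diacritical marks (\p{M}).
    public static String removeAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }
}
```

Applying `removeAccents` to the lowercased wildcard term before building the WildcardQuery makes "ingé*" and "inge*" behave the same, matching what the index-time analyzer produced.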
Re: using different analyzer for searching
On Mar 31, 2005, at 10:49 PM, pashupathinath wrote:
> what about writing a custom analyzer to solve this? how can i go about implementing the logic in a custom analyzer so that it returns all the documents that contain even a part of the search string? any insight into this would be very helpful, especially performance-wise.

This is an involved topic, and one that is covered in great detail in the analysis chapter of Lucene in Action (shameless plug, yes, I know!). I recommend you analyze the types of queries that need to be made and what type of user interface you will present, and then determine what makes the most sense analysis-wise. WhitespaceAnalyzer is not going to be good enough, as I suspect you'll want case-insensitive searches at the very least.

Erik
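The "clever analysis" Erik refers to (indexing "abc123", "abc", and "123" together) boils down to emitting subtokens at letter/digit boundaries alongside the original token. Below is a Lucene-free sketch of just the splitting logic; in a real analyzer this would live in a TokenFilter that emits the subtokens with a position increment of 0 so they overlay the original term:

```java
import java.util.ArrayList;
import java.util.List;

public class SubtokenSplitter {
    // Split a token at letter/digit boundaries and return the original
    // token plus its parts, e.g. "abc123" -> [abc123, abc, 123].
    public static List<String> split(String token) {
        List<String> out = new ArrayList<>();
        out.add(token);
        int start = 0;
        for (int i = 1; i <= token.length(); i++) {
            // A boundary is the end of the string or a digit/letter transition.
            if (i == token.length()
                    || Character.isDigit(token.charAt(i)) != Character.isDigit(token.charAt(i - 1))) {
                String part = token.substring(start, i);
                if (!part.equals(token)) out.add(part);
                start = i;
            }
        }
        return out;
    }
}
```

With the three terms indexed at the same position, searches for "abc", "123", and "abc123" all hit the document, and exact phrase queries still behave sensibly.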
Re[4]: Analyzer don't work with wildcard queries, snowball analyzer.
EH> I presume your analyzer normalized accented characters? Which analyzer
EH> is that?

Yes, I'm using a custom analyzer for indexing / searching. It consists of:
- FrenchStopFilter
- IsoLatinFilter (this is the one that replaces accented characters)
- LowerCaseFilter
- ApostropheFilter (to handle terms with apostrophes; for instance "l'expérience" is decomposed into two tokens: "l" and "expérience")

EH> You will need to employ some form of character normalization on
EH> wildcard queries too.

Thanks, it works successfully; code snippet follows. --- sven

/*--- CODE */
private static Query CreateCustomQuery(Query query) {
    if (query instanceof BooleanQuery) {
        final BooleanClause[] bClauses = ((BooleanQuery) query).getClauses();
        // The first clause is required
        if (bClauses[0].prohibited != true)
            bClauses[0].required = true;
        // Walk each clause to remove accents if needed
        Term term;
        for (int i = 0; i < bClauses.length; i++) {
            if (bClauses[i].query instanceof WildcardQuery) {
                term = ((WildcardQuery) bClauses[i].query).getTerm();
                bClauses[i].query = new WildcardQuery(new Term(term.field(),
                        ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
            }
            if (bClauses[i].query instanceof PrefixQuery) {
                term = ((PrefixQuery) bClauses[i].query).getPrefix();
                // toLowerCase because the text is lowercased during indexing
                bClauses[i].query = new PrefixQuery(new Term(term.field(),
                        ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
            }
        }
    } else if (query instanceof WildcardQuery) {
        final Term term = ((WildcardQuery) query).getTerm();
        query = new WildcardQuery(new Term(term.field(),
                ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
    } else if (query instanceof PrefixQuery) {
        final Term term = ((PrefixQuery) query).getPrefix();
        query = new PrefixQuery(new Term(term.field(),
                ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
    }
    return query;
}
/*--- END OF CODE */

EH> Erik
RE: FilteredQuery and Boolean AND
Any ideas on this? I have purchased your book, Lucene in Action, which is quite good. To make things easier, consider the example on p. 212. In item 4, when you combine the queries, what happens when you combine them in an AND fashion? The book only has OR, which works. It may work because the book only has one filtered query, but what if you made them both filtered queries and ANDed them? Thanks, Peter

-----Original Message----- From: Kipping, Peter [mailto:[EMAIL PROTECTED]] Sent: Friday, March 25, 2005 10:34 AM To: java-user@lucene.apache.org Subject: FilteredQuery and Boolean AND

I have the following query structure:

    BooleanQuery q2 = new BooleanQuery();
    TermQuery tq = new TermQuery(new Term("all_entries", "y"));
    FilteredQuery fq = new FilteredQuery(tq, ft);
    FilteredQuery fq2 = new FilteredQuery(tq, ft2);
    q2.add(fq, false, false);
    q2.add(fq2, false, false);

The two filters are searches over numeric ranges. I'm using filters so I don't get the TooManyBooleanClauses exception, and my TermQuery tq is just a field that has 'y' in every document, so I can filter over the entire index. In the last two lines I am creating a boolean OR, and everything works fine: I get back 30 documents, which is correct.

However, when I change the last two lines to create an AND:

    q2.add(fq, true, false);
    q2.add(fq2, true, false);

I still get back 30 documents, which is not correct; it should be 0. What's going on with FilteredQuery? Thanks, Peter
Deeply nested boolean query performance
I will soon create some tests for this scenario, but wanted to run this by the list as well. What performance differences would be seen between a query like this:

    a AND b AND c AND d

and this one:

    ((a AND b) AND c) AND d

In other words, will building a query with nested boolean queries be substantially slower than a single boolean query with many clauses? Or might it be the other way around? Thanks, Erik
Re: Re[4]: Analyzer don't work with wildcard queries, snowball analyzer.
On Apr 1, 2005, at 11:07 AM, Sven Duzont wrote:
> Yes, I'm using a custom analyzer for indexing / searching. It consists of: - FrenchStopFilter - IsoLatinFilter (this is the one that replaces accented characters)

Could you share that filter with the community?

> // The first clause is required
> if(bClauses[0].prohibited != true) bClauses[0].required = true;

Why do you flip the required flag like this?

> for (int i = 0; i < bClauses.length; i++) { if (bClauses[i].query instanceof WildcardQuery) { ... } }

What about handling BooleanQuery instances nested within a BooleanQuery? You'll need some recursion.

Erik
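The recursion Erik asks for is plain tree-walking. Here is a schematic sketch over minimal stand-in classes (deliberately not the Lucene API); "normalizing" a leaf here just lowercases it, where Sven's code would also strip accents:

```java
import java.util.ArrayList;
import java.util.List;

public class QueryRewriter {
    // Minimal stand-ins for Lucene's query classes, for illustration only.
    interface Query {}
    static class TermQuery implements Query {
        final String text;
        TermQuery(String text) { this.text = text; }
    }
    static class BooleanQuery implements Query {
        final List<Query> clauses = new ArrayList<>();
    }

    // Recursively rebuild the tree, normalizing every leaf term.
    static Query normalize(Query q) {
        if (q instanceof BooleanQuery) {
            BooleanQuery out = new BooleanQuery();
            for (Query clause : ((BooleanQuery) q).clauses)
                out.clauses.add(normalize(clause));   // recurse into nested queries
            return out;
        }
        if (q instanceof TermQuery)
            return new TermQuery(((TermQuery) q).text.toLowerCase());
        return q;   // unknown query types pass through unchanged
    }
}
```

Sven's flat loop only rewrites the top-level clauses; routing every clause back through `normalize` is what makes arbitrarily nested BooleanQuery trees come out consistent.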
Plural Stemming
Are there any Lucene extensions that can do simple stemming, i.e. just for plurals? Or is the only stemming package available Snowball? Cheers -- Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.
Re: Plural Stemming
Miles Barr wrote:
> Are there any Lucene extensions that can do simple stemming, i.e. just for plurals?

For which language? Stemming is always language-specific... If it's for English, there is also a built-in PorterStemmer. If you know what you are doing, you could disable some of its stemming rules to get such "under-stemming". -- Best regards, Andrzej Bialecki, http://www.sigram.com Contact: info at sigram dot com
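If even a trimmed-down PorterStemmer is more than needed, plural-only stripping is small enough to hand-roll. A minimal sketch, covering only the most regular English plural endings (roughly the spirit of Porter's step 1a, with no attempt at irregular plurals like "mice"):

```java
public class PluralStemmer {
    // Strip common English plural endings; a rough approximation,
    // not a full stemmer.
    public static String stem(String word) {
        if (word.endsWith("ies") && word.length() > 4)
            return word.substring(0, word.length() - 3) + "y";   // ponies -> pony
        if (word.endsWith("es") && word.length() > 3) {
            String stem = word.substring(0, word.length() - 2);
            // boxes -> box, churches -> church
            if (stem.endsWith("s") || stem.endsWith("x") || stem.endsWith("z")
                    || stem.endsWith("ch") || stem.endsWith("sh"))
                return stem;
        }
        if (word.endsWith("s") && !word.endsWith("ss") && word.length() > 2)
            return word.substring(0, word.length() - 1);         // cats -> cat
        return word;
    }
}
```

Wrapped in a TokenFilter and applied at both index and query time, this keeps "cat" and "cats" matching without the heavier conflation a full stemmer performs.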
Re: Plural Stemming
On Fri, 2005-04-01 at 19:24 +0200, Andrzej Bialecki wrote:
> For which language? Stemming is always language-specific...

Sorry, I should have said: at the moment I'm only going to be handling English, but potentially other languages in the future. I'll take a look at the PorterStemmer. Thanks -- Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.
Re: Deeply nested boolean query performance
On Friday 01 April 2005 18:14, Erik Hatcher wrote:
> I will soon create some tests for this scenario, but wanted to run this by the list as well

Great, see below.

> What performance differences would be seen between a query like this:
> a AND b AND c AND d

This will use a single ConjunctionScorer, and it is the fastest form.

> and this one:
> ((a AND b) AND c) AND d
> In other words, will building a query with nested boolean queries be substantially slower than a single boolean query with many clauses? Or might it be the other way around?

This will use a ConjunctionScorer for (a AND b), assuming a and b are terms. For the other AND operators a BooleanScorer will be used in 1.4.3. The development version will use a ConjunctionScorer at each AND operator.

The main difference between a ConjunctionScorer and a BooleanScorer is the use of skipTo(), i.e. the forwarding information in the term docs index, which allows 'fast forwarding' to a given document. This fast forward is useful for AND queries, and ConjunctionScorer does it; BooleanScorer simply uses next() instead, and the next() method iterates over all documents in a term docs index. In other words, the nested form should be significantly slower than the flat form in 1.4.3, and just a bit slower in the development version.

Another skipTo() advantage comes from this form:

    (a OR b) AND c

In 1.4.3, this uses a BooleanScorer for both operators, making it as much work as (a OR b) OR c. In the development version, the OR operator gets a DisjunctionScorer and the AND operator a ConjunctionScorer, both allowing the use of skipTo(), even on the a and b terms. In this context (a OR b) can also be, for example, a fuzzy query or a prefix query. The development version also uses skipTo() on b in the following situations:

    +a b
    a -b

So, when you measure, please use both 1.4.3 and the development version to see the differences. And, of course, the larger your index, the better. As the code is still a bit young, you might be in for some surprises, too. skipTo() has the biggest advantages when the index data is not available in any cache. Regards, Paul Elschot.
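The difference Paul describes, advancing with skipTo() instead of stepping with next(), can be illustrated on plain sorted doc-id arrays. This is a sketch of the idea, not Lucene's scorer code; the binary search stands in for the skip data kept in the term docs index:

```java
public class Conjunction {
    // Intersect two sorted doc-id lists the way a ConjunctionScorer does:
    // skip the lagging list forward to the leading candidate instead of
    // stepping through every document with next().
    public static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int n = 0, i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out[n++] = a[i];
                i++; j++;
            } else if (a[i] < b[j]) {
                i = skipTo(a, i, b[j]);   // fast-forward past non-candidates
            } else {
                j = skipTo(b, j, a[i]);
            }
        }
        return java.util.Arrays.copyOf(out, n);
    }

    // Return the first index >= from whose doc id is >= target.
    static int skipTo(int[] docs, int from, int target) {
        int lo = from, hi = docs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

On sparse, poorly overlapping posting lists the skip-based loop touches far fewer entries than a next()-based scan, which is exactly why the flat conjunction form wins in 1.4.3.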
Re: FilteredQuery and Boolean AND
Peter, Could you provide a straightforward test case that indexes a few documents into a RAMDirectory and demonstrates the problem you're having with ANDed FilteredQuery instances? Give me something concrete and simple and I'll dig into it further. Erik

On Apr 1, 2005, at 11:13 AM, Kipping, Peter wrote:
> Any ideas on this? ... what if you made them both filtered queries and ANDed them?
Re: Deeply nested boolean query performance
Paul, Thanks for your very thorough response. It is very helpful. For all my projects, I'm using the latest Subversion codebase and staying current with any changes there, so that is very good news. Erik

On Apr 1, 2005, at 1:10 PM, Paul Elschot wrote:
> So, when you measure, please use both 1.4.3 and the development version to see the differences.
Re: proximity search in lucene
Hi, Does Lucene support "SpanNear" or phrase queries where the clauses or terms are not in the same field? If not, could someone let me know how to support proximity searches with terms belonging to different fields? Thanks much, Sujatha Das
Performance Question
I have 5 indexes, each one is 6GB...I need 512MB of Heap size in order to open the index and have all type of queries. My question is, is it better to just have on large Index 30GB? will increasing the Heap size increase performance? can I store an instance of MultiSearcher(OR just Searcher in case on big index is better) in the application variable since I have 3 servlets that opens the index? would implementing a listner to open the index would be useful knowing that the index changes onece a mounth? any suggestions would be very helpful, Thanks, omar -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Friday, April 01, 2005 1:29 PM To: java-user@lucene.apache.org Subject: Re: FilteredQuery and Boolean AND Peter, Could you provide a straight-forward test case that indexes a few documents into a RAMDirectory and demonstrates the problem you're having with AND'd FilteredQuery's? Give me something concrete and simple and I'll dig into it further. Erik On Apr 1, 2005, at 11:13 AM, Kipping, Peter wrote: > Any ideas on this? I have purchased your book, Lucene in Action, which > is quite good. To make things easier, consider the example on p212. > In > item 4, when you combine the queries, what happens you combine them in > and AND fashion? The book only has OR, which works. Although it may > work since the book only has one filtered query, but what if you made > them both filtered queries and ANDed them? 
> > Thanks, > Peter > > -Original Message- > From: Kipping, Peter [mailto:[EMAIL PROTECTED] > Sent: Friday, March 25, 2005 10:34 AM > To: java-user@lucene.apache.org > Subject: FilteredQuery and Boolean AND > > I have the following query structure: > > BooleanQuery q2 = new BooleanQuery(); > TermQuery tq = new TermQuery(new Term("all_entries", "y")); > FilteredQuery fq = new FilteredQuery(tq, ft); > FilteredQuery fq2 = new FilteredQuery(tq, ft2); > q2.add(fq, false, false); > q2.add(fq2, false, false); > > The two filters are searches over numeric ranges. I'm using filters so > I don't get the TooManyBooleanClauses exception. And my TermQuery tq is > just on a field that has 'y' in every document, so I can filter over the > entire index. In the last two lines I am creating a boolean OR, and > everything works fine. I get back 30 documents, which is correct. > > However, when I change the last two lines to create an AND: > > q2.add(fq, true, false); > q2.add(fq2, true, false); > > I still get back 30 documents, which is not correct. It should be 0. > What's going on with FilteredQuery? > > Thanks, > Peter - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
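On Omar's question above about sharing one searcher across three servlets: the usual pattern is to open the (Multi)Searcher once, hand the same instance to every servlet, and reopen it only when the index actually changes (monthly, in his case). Below is a minimal sketch of such a holder; the class name is hypothetical and a plain Supplier stands in for "new MultiSearcher(...)", so this is the sharing pattern, not a Lucene API.

```java
import java.util.function.Supplier;

public class SharedSearcherHolder<T> {
    private final Supplier<T> factory;   // e.g. () -> new MultiSearcher(...)
    private volatile T current;

    public SharedSearcherHolder(Supplier<T> factory) { this.factory = factory; }

    // Every servlet calls get(); the resource is built at most once per reopen.
    public T get() {
        T c = current;
        if (c == null) {
            synchronized (this) {
                if (current == null) current = factory.get();
                c = current;
            }
        }
        return c;
    }

    // Call when the index has been rebuilt (here, once a month).
    public synchronized void reopen() { current = factory.get(); }
}
```

This answers the "application variable" part of the question directly: store the holder once (e.g. in the ServletContext) and let all three servlets call get(). Whether one 30GB index beats five 6GB ones is best settled by measuring, since merging mostly trades per-index overhead against merge cost.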
Re: proximity search in lucene
On Apr 1, 2005, at 2:29 PM, Sujatha Das wrote: Hi, Does Lucene support "SpanNear" or phrase queries where the clauses or terms are not of the same field? If not, could someone let me know which is the way to support proximity searches with terms belonging to different fields. No, it does not support cross-field proximity. I'm not sure I understand what that would even mean, though. Could you provide an example of what you're after? Erik
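Since Lucene's proximity queries (PhraseQuery, SpanNearQuery) operate within a single field, a common workaround when something like cross-field matching is wanted is to concatenate the per-field text into one extra catch-all field at index time and run ordinary single-field queries against that. A minimal sketch of building that field's text follows; the class and method names are hypothetical, and position gaps between the joined fields are ignored here for simplicity.

```java
import java.util.Map;
import java.util.StringJoiner;

public class CatchAllField {
    // Join every field's text into one string; index the result as an extra
    // field so single-field proximity queries can reach across the originals.
    public static String buildCatchAll(Map<String, String> fields) {
        StringJoiner sj = new StringJoiner(" ");
        for (String value : fields.values()) sj.add(value);
        return sj.toString();
    }
}
```

One caveat: terms from the end of one field end up "near" terms from the start of the next, so in real use you would want to insert a large position gap between fields to avoid false proximity matches.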
Re: Plural Stemming
: > > Are there any Lucene extensions that can do simple stemming, i.e. just : > > for plurals? Or is the only stemming package available Snowball? LIA has a case study of jGuru which uses a very specific, home-grown utility method called "stripEnglishPlural" ... since it's in the case study chapter, i'm not sure if it's included in the book's source code, but it is included verbatim in the book... http://lucenebook.com/search?query=stripEnglishPlural -Hoss
Re: Plural Stemming
On Apr 1, 2005, at 7:03 PM, Chris Hostetter wrote: : > > Are there any Lucene extensions that can do simple stemming, i.e. just : > > for plurals? Or is the only stemming package available Snowball? LIA has a case study of jGuru which uses a very specific, home grown utility method called "stripEnglishPlural" ... since it's in the case study chapter, i'm not sure if it's included in the books source code, but is included verbatim in the book... http://lucenebook.com/search?query=stripEnglishPlural Thanks for the reminder, Chris. I'm sure jGuru wouldn't mind us posting it, so I've pasted it below. It is not included in the LIA source code; only the code Otis and I wrote ourselves is included there, and we didn't get the source code from any of the case studies (other than Bob Carpenter's LingPipe stuff). Erik /** A useful, but not particularly efficient plural stripper */ public static String stripEnglishPlural(String word) { // too small? if ( word.length()
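The pasted method is cut off in the archive above. What follows is not jGuru's actual code, just a hypothetical reconstruction of the same idea: strip the most common regular English plural endings while leaving short words and "-ss"/"-us"/"-is" endings alone, with no attempt at irregular plurals.

```java
public class PluralStripper {
    /** A simple, but not particularly accurate, English plural stripper. */
    public static String stripEnglishPlural(String word) {
        // too small, or not plural-looking at all?
        if (word.length() < 4 || !word.endsWith("s")) return word;
        // endings that are usually not plurals: glass, focus, basis
        if (word.endsWith("ss") || word.endsWith("us") || word.endsWith("is"))
            return word;
        // entries -> entry
        if (word.endsWith("ies") && word.length() > 4)
            return word.substring(0, word.length() - 3) + "y";
        // boxes -> box, matches -> match, glasses -> glass
        if (word.endsWith("ches") || word.endsWith("shes")
                || word.endsWith("xes") || word.endsWith("sses"))
            return word.substring(0, word.length() - 2);
        // blogs -> blog
        return word.substring(0, word.length() - 1);
    }
}
```

As with any heuristic stemmer, this will mangle some words (e.g. irregular plurals pass through unchanged), which is the trade-off the thread is discussing versus a full Snowball stemmer.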
Re: FilteredQuery and Boolean AND
Peter's problem intrigued me, so I wrote my own test case using two simple Filters that filter out all but the first (or last) doc. I seem to be getting the same results he is, which is certainly not right. See the attached test case. While this definitely seems like a bug, it also seems like a fairly inefficient way of approaching the problem in general. Instead of: BooleanQuery containing: a) FilteredQuery wrapping: Query for "all" -- filtered by -- RangeFilter #1 b) FilteredQuery wrapping: Query for "all" -- filtered by -- RangeFilter #2 ...it seems like it would make more sense to use... FilteredQuery wrapping: Query for "all" -- filtered by -- ChainedFilter containing: a) RangeFilter #1 b) RangeFilter #2 : Date: Fri, 1 Apr 2005 13:29:04 -0500 : From: Erik Hatcher <[EMAIL PROTECTED]> : Reply-To: java-user@lucene.apache.org : To: java-user@lucene.apache.org : Subject: Re: FilteredQuery and Boolean AND : : Peter, : : Could you provide a straight-forward test case that indexes a few : documents into a RAMDirectory and demonstrates the problem you're : having with AND'd FilteredQuery's? : : Give me something concrete and simple and I'll dig into it further. : : Erik : : On Apr 1, 2005, at 11:13 AM, Kipping, Peter wrote: : : > Any ideas on this? I have purchased your book, Lucene in Action, which : > is quite good. To make things easier, consider the example on p212. In : > item 4, what happens when you combine the queries in an AND fashion? : > The book only has OR, which works. It may work only because the book : > has a single filtered query, but what if you made them both filtered : > queries and ANDed them? 
: > : > Thanks, : > Peter : > : > -Original Message- : > From: Kipping, Peter [mailto:[EMAIL PROTECTED] : > Sent: Friday, March 25, 2005 10:34 AM : > To: java-user@lucene.apache.org : > Subject: FilteredQuery and Boolean AND : > : > I have the following query structure: : > : > BooleanQuery q2 = new BooleanQuery(); : > TermQuery tq = new TermQuery(new Term("all_entries", "y")); : > FilteredQuery fq = new FilteredQuery(tq, ft); : > FilteredQuery fq2 = new FilteredQuery(tq, ft2); : > q2.add(fq, false, false); : > q2.add(fq2, false, false); : > : > The two filters are searches over numeric ranges. I'm using filters so : > I don't get the TooManyBooleanClauses Exception. And my TermQuery tq : > is : > just a field that has 'y' in every document so I can filter over the : > entire index. The last two lines I am creating a boolean OR, and : > everything works fine. I get back 30 documents which is correct. : > : > However when I change the last two lines to create an AND: : > : > q2.add(fq, true, false); : > q2.add(fq2, true, false); : > : > I still get back 30 documents, which is not correct. It should be 0. : > What's going on with FilteredQuery? 
: > : > Thanks, : > Peter : > : > : > - : > To unsubscribe, e-mail: [EMAIL PROTECTED] : > For additional commands, e-mail: [EMAIL PROTECTED] : > : > : > : > : > - : > To unsubscribe, e-mail: [EMAIL PROTECTED] : > For additional commands, e-mail: [EMAIL PROTECTED] : : : - : To unsubscribe, e-mail: [EMAIL PROTECTED] : For additional commands, e-mail: [EMAIL PROTECTED] : -Hoss import org.apache.lucene.index.Term; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.search.Query; import org.apache.lucene.search.Hits; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.store.RAMDirectory; import org.apache.lucene.analysis.SimpleAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.document.DateField; import org.apache.lucene.search.Query; import org.apache.lucene.search.TermQuery; import org.apache.lucene.search.BooleanQuery; import org.apache.lucene.search.Filter; import org.apache.lucene.search.FilteredQuery; import java.io.IOException; import java.util.Random; import java.util.BitSet; import junit.framework.TestCase; public class TestKippingPeterBug extends TestCase { public static final boolean F = false; public static final boolean T = true; public static String[] data = new String [] { "a b m n", "a x q r", "a b s t", "a x e f", "a" }; RAMDirectory index = new RAMDirectory(); IndexReader r; IndexSearcher s; Query ALL = new TermQuery(new Term("data","a")); public TestKippingPeterBug(String name) { super(name); } public TestKippingPeterBug() { super(); } public void setUp() throws Exception { /* build an index */ IndexWriter writer = new IndexWriter(index,
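The reason Hoss's ChainedFilter suggestion makes sense: in Lucene 1.4 a Filter's bits() method returns a java.util.BitSet with one bit per matching document, so AND-ing two filters is just BitSet.and(). The toy version below chains plain BitSets directly (the class name is made up); a real chained filter would apply the same operation to the sets each filter produces from an IndexReader.

```java
import java.util.BitSet;

public class AndChain {
    // Intersect any number of per-document bit sets, as an AND-mode
    // chained filter would. The inputs are left unmodified.
    public static BitSet andAll(BitSet... filters) {
        BitSet result = (BitSet) filters[0].clone();
        for (int i = 1; i < filters.length; i++) {
            result.and(filters[i]);
        }
        return result;
    }
}
```

Besides sidestepping the FilteredQuery bug discussed above, this does a single filtered search instead of scoring the catch-all "all" query once per filter, which is the efficiency point Hoss is making.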