Re: Nested Queries
Hi Kapil,

I am not sure exactly what you are asking. Could you give an example of the correct response? Also, are you truly using numbers, or are they just substitutes for text? And are they part of a bigger problem requiring Lucene? If it is just numbers, a DB might be the better way to go, since you would have set operations that may make this easier. I'm not saying Lucene can't do what you want, just thinking there are other ways.

-Grant

On Dec 26, 2006, at 4:47 AM, Kapil Chhabra wrote:

> Just to mention, I have tokenized FIELD2 on "," and indexed it.
>
> FIELD2:3 should return 1,2
> FIELD2:(FIELD2:3) should return something like the output of:
> FIELD2:1 OR FIELD2:2
>
> Regards,
> kapilChhabra
>
> Kapil Chhabra wrote:
>> Hi,
>> Please see the following data structure:
>>
>> +--------+----------+
>> | FIELD1 | FIELD2   |
>> +--------+----------+
>> | 1      | 2,3,4,6, |
>> | 2      | 3,1,5,7, |
>> | 3      | 1,2,     |
>> | 4      | 1,8,10,  |
>> | 5      | 2,9,     |
>> | 6      | 1,       |
>> | 7      | 2,9,     |
>> | 8      | 4,9,     |
>> | 9      | 5,7,8,   |
>> | 10     | 4,       |
>> +--------+----------+
>>
>> My requirement is to find all values in FIELD1 where FIELD2 contains all values of FIELD1 where FIELD2 contains 3.
>> Which means something like FIELD2:(FIELD2:3)
>>
>> Is it possible to achieve this in a single query? If yes, then how should I go about it?
>>
>> Thanks in anticipation,
>> kapilChhabra

--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
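The set-operation route Grant mentions can be sketched in plain Java as two passes over Kapil's table. This is only an illustration, not Lucene code: the class and method names are made up for this note, and it follows Kapil's own expected expansion (FIELD2:1 OR FIELD2:2), so the outer step matches rows containing any of the inner results.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of Lucene.
class NestedLookup {
    // Kapil's table: FIELD1 value -> tokens of FIELD2.
    static Map<Integer, List<Integer>> table() {
        Map<Integer, List<Integer>> rows = new LinkedHashMap<>();
        rows.put(1, Arrays.asList(2, 3, 4, 6));
        rows.put(2, Arrays.asList(3, 1, 5, 7));
        rows.put(3, Arrays.asList(1, 2));
        rows.put(4, Arrays.asList(1, 8, 10));
        rows.put(5, Arrays.asList(2, 9));
        rows.put(6, Arrays.asList(1));
        rows.put(7, Arrays.asList(2, 9));
        rows.put(8, Arrays.asList(4, 9));
        rows.put(9, Arrays.asList(5, 7, 8));
        rows.put(10, Arrays.asList(4));
        return rows;
    }

    // FIELD1 values of every row whose FIELD2 contains at least one wanted value.
    static Set<Integer> field1WhereField2ContainsAny(Map<Integer, List<Integer>> rows,
                                                    Set<Integer> wanted) {
        Set<Integer> hits = new TreeSet<>();
        for (Map.Entry<Integer, List<Integer>> row : rows.entrySet()) {
            for (Integer value : row.getValue()) {
                if (wanted.contains(value)) {
                    hits.add(row.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> rows = table();
        // Step 1: the inner "query" FIELD2:3.
        Set<Integer> inner = field1WhereField2ContainsAny(rows, new TreeSet<>(Arrays.asList(3)));
        // Step 2: feed the inner result back in as FIELD2:1 OR FIELD2:2.
        Set<Integer> outer = field1WhereField2ContainsAny(rows, inner);
        System.out.println(inner); // [1, 2]
        System.out.println(outer); // [1, 2, 3, 4, 5, 6, 7]
    }
}
```

The same two-step shape would apply with Lucene: run the inner query, collect the FIELD1 values of the hits, then build an OR query over FIELD2 from them.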
toomanyclauses exception
Hi All,

I'm getting a 'TooManyClauses' exception and I'm not sure how to fix it. Here's a sample query that I'm using:

+(+freeform_text:exhibit* +(+freeform_text:dispaly +freeform_text:event*) +(+freeform_text:sale* +freeform_text:sells +freeform_text:develop*) +(+freeform_text:trade +freeform_text:show +freeform_text:trade +freeform_text:shows)) +degree_type:5 +position_desired:ftp +city:washington~0.5 +state:dc +ncountry:usa +last_modified:[2005-12-26 TO 2006-12-26]

Here's the exception I'm getting:

org.apache.lucene.search.BooleanQuery$TooManyClauses
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:160)
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:151)
    at org.apache.lucene.search.PrefixQuery.rewrite(PrefixQuery.java:52)
    at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:372)
    at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:372)
    at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:372)
    at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:137)
    at org.apache.lucene.search.Query.weight(Query.java:93)
    at org.apache.lucene.search.Hits.<init>(Hits.java:41)
    at org.apache.lucene.search.Searcher.search(Searcher.java:44)
    at org.apache.lucene.search.Searcher.search(Searcher.java:36)
    at net.mainsequence.pcr.lucene.LuceneHandler.multiSearch(LuceneHandler.java:382)
    at net.mainsequence.pcr.lucene.LuceneServlet.searchIndex(LuceneServlet.java:169)
    at net.mainsequence.pcr.lucene.LuceneServlet.processRequest(LuceneServlet.java:83)
    at net.mainsequence.pcr.lucene.LuceneServlet.doPost(LuceneServlet.java:72)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:709)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
    at java.lang.Thread.run(Unknown Source)

Is there any way to increase the number of clauses Lucene can take? This kind of large query is not uncommon, so any help would be greatly appreciated.

Chris Salem
440.946.5214 x5458
[EMAIL PROTECTED]
Re: toomanyclauses exception
Chris,

On Wednesday 27 December 2006 15:42, Chris Salem wrote:
> Hi All,
> I'm getting a 'TooManyClauses' Exception and I'm not sure how to fix this. Here's a sample query that I'm using:
> +(+freeform_text:exhibit* +(+freeform_text:dispaly +freeform_text:event*) +(+freeform_text:sale* +freeform_text:sells +freeform_text:develop*) +(+freeform_text:trade +freeform_text:show +freeform_text:trade +freeform_text:shows)) +degree_type:5 +position_desired:ftp +city:washington~0.5 +state:dc +ncountry:usa +last_modified:[2005-12-26 TO 2006-12-26]
> Here's the exception I'm getting:
> org.apache.lucene.search.BooleanQuery$TooManyClauses
>     at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:160)
>     at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:151)
>     at org.apache.lucene.search.PrefixQuery.rewrite(PrefixQuery.java:52)

One of the prefix queries is causing this, possibly event* or sale*. Since they seem to be specific enough, increasing the maximum number of boolean clauses that can be added to a boolean query appears to be a good way to fix this; see BooleanQuery.setMaxClauseCount().

Regards,
Paul Elschot
Re: toomanyclauses exception
Also, see the thread on this list titled "I just don't get wildcards at all" for an extensive discussion of this issue, as well as wildcards in general. You might also search the archive for "wildcards".

The short form is that any wildcard (including prefix queries) expands under the covers to create a clause for each possible entry in the index for that field. For instance, say a field had the following values:

abcd
abck
abt

Searching for ab* would expand to searching for abcd, abck and abt under the covers. When the number of possibilities gets above the default value of 1024, you see a TooManyClauses exception.

Expanding the number of clauses *may* fix you right up, but on any reasonably sized index, you can come up with a query that'll exceed whatever number you set. Or you'll get to an unacceptable performance/memory footprint. Imagine your query with things like a*.

Think seriously about how you're going to deal with this. There are several options:

1. Use filters for all your wildcard clauses and create your own BooleanQuery. Be aware that using filters affects scoring.
2. Assume that any query that throws a TooManyClauses exception (after you've set a suitable max as Paul suggested) is too broad to be useful and respond to the user with some polite phrase asking them to refine the query.
3. Look over the SrndQuery classes. I don't fully understand these, but they certainly behave much differently in this area. Note that SrndQuery limits wildcards to having at least three non-wildcard characters.
4. Ask whether stemming is a complete or partial solution. Ditto for Soundex. There's a good chance these won't apply, but they may.
5. Insert the solution to your specific problem here.

This is a sticky wicket that will probably consume more time than you think to handle. It's easy for your product manager to claim that "Of course, we must support arbitrary wildcards", but I'd urge you to seriously ask what value *arbitrary* wildcards bring to the product.

When you start getting thousands of responses to a query, is it actually valuable to return them to the user? Or do you give her just as much value (and deliver product sooner) by telling her up front that she's getting too many responses to be useful? With this last strategy, you just catch the TooManyClauses exception and respond with "refine your query".

Best
Erick
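Erick's description of prefix expansion, and his suggestion to catch the exception and ask the user to refine the query, can be sketched without Lucene. This is a toy model under stated assumptions: the term dictionary is a sorted set of strings, and TooManyClausesException and the method names are hypothetical stand-ins, not Lucene classes. In real Lucene the cap is the one Paul points to, BooleanQuery.setMaxClauseCount().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model of prefix expansion; not Lucene code.
class PrefixExpansionSketch {
    // Hypothetical stand-in for Lucene's BooleanQuery.TooManyClauses.
    static class TooManyClausesException extends RuntimeException {
        TooManyClausesException(String message) {
            super(message);
        }
    }

    // Expand a prefix into one clause per matching indexed term,
    // failing as soon as the clause cap would be exceeded.
    static List<String> expand(SortedSet<String> terms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so no later term can match
            }
            if (clauses.size() >= maxClauses) {
                throw new TooManyClausesException("more than " + maxClauses + " clauses for " + prefix + "*");
            }
            clauses.add(term);
        }
        return clauses;
    }

    // Option 2 from the list above: treat an overflow as "query too broad".
    static String search(SortedSet<String> terms, String prefix, int maxClauses) {
        try {
            return "searching for " + expand(terms, prefix, maxClauses);
        } catch (TooManyClausesException e) {
            return "Too many matches; please refine your query.";
        }
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(Arrays.asList("abcd", "abck", "abt", "xyz"));
        System.out.println(search(terms, "ab", 1024)); // searching for [abcd, abck, abt]
        System.out.println(search(terms, "ab", 2));    // Too many matches; please refine your query.
    }
}
```

Raising the cap only moves the point at which the second branch is taken, which is the trade-off described above.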
Clustering Lucene with 40 Servers
I'm currently investigating the best ways of clustering Lucene. I've heard of both Solr and Terracotta but do not know how well they scale. Their examples talk of a 4-node cluster. This is way too small for my needs.

I have 30 JVMs, each handling 3 requests/sec and each having its own Lucene index. The index changes are propagated to the cluster members using JGroups messages. This solution has more than reached its limit, as JGroups has become unstable and a source of many JVM crashes. Based on current traffic trends I anticipate needing to upgrade to 40+ JVMs very soon.

Can anybody suggest a way to effectively cluster / replicate document changes?

P.S.: JMS is not a possible solution, as it was our solution prior to JGroups. We had too many memory problems, full queues, etc. crashing the servers.

--
View this message in context: http://www.nabble.com/Clustering-Lucene-with-40-Servers-tf2886546.html#a8064135
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: toomanyclauses exception
On Wednesday 27 December 2006 16:53, Erick Erickson wrote:
> ...
> 3. Look over the SrndQuery classes. I don't fully understand these, but they certainly behave much differently in this area. Note that SrndQuery limits wildcards to having at least three non-wildcard characters.

In Lucene, the limit on the number of clauses is applied per (sub)query that expands to a BooleanQuery. In contrib/surround, the limit on the maximum number of clauses is applied to a full query, including all subqueries.

The reason for the limitation in both cases is that each TermScorer needs some buffer space, and without the limit Query.rewrite() will occasionally run out of memory.

Regards,
Paul Elschot
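The difference Paul describes between the two limits can be shown with a small counting sketch. It is a simplification with hypothetical names and made-up expansion sizes (600 and 700 terms); it models only the scope of the check, not how either implementation actually counts clauses internally.

```java
// Counting sketch only; neither method is a Lucene or contrib/surround API.
class ClauseLimitScope {
    // Per-subquery scope: each expanding subquery is checked on its own.
    static boolean passesPerSubquery(int[] expansionSizes, int maxClauses) {
        for (int size : expansionSizes) {
            if (size > maxClauses) {
                return false;
            }
        }
        return true;
    }

    // Whole-query scope: all subqueries share one clause budget.
    static boolean passesWholeQuery(int[] expansionSizes, int maxClauses) {
        long total = 0;
        for (int size : expansionSizes) {
            total += size;
        }
        return total <= maxClauses;
    }

    public static void main(String[] args) {
        int[] sizes = {600, 700}; // assumed expansions of two prefix subqueries
        System.out.println(passesPerSubquery(sizes, 1024)); // true
        System.out.println(passesWholeQuery(sizes, 1024));  // false
    }
}
```

Under the per-subquery rule both prefixes fit; under the whole-query rule the same query is rejected because 600 + 700 exceeds 1024.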
Modelling Relational Lucene Index
Hi Erick,

Thank you for the detailed response. First I would like to mention that my application has an index with the company id and name indexed for each article, for the following reasons:

1. A search interface where we search across articles and companies.
2. Paging - I need to page the results after loading the hits, so I don't want to separate the text search from the article-company matching logic. I want to load the articles using one single Lucene query.

I am using a MySQL database to store the relations. But since I need to search across companies and keywords in articles, I am also storing the company name and id in the index.

Option 3 looks good to me, but I am concerned about degrading the performance of the existing system if I make the search a two-step process. However, I will try to evaluate your suggestions in detail.

Thank you again,
Harini

Erick Erickson wrote:
> First, it probably would have been a good thing to start a new thread on this topic, since it's only vaguely related to disk space <G>...
>
> That said, sure. Note that there's no requirement in Lucene that all documents in an index have the same fields. Also, there's no reason you can't use two separate indexes. Finally, you have to think about how many times you are going to add/update a given article when choosing your approach.
>
> Here are several possibilities.
>
> 1. Add a field (tokenized) to each article in your index that contains IDs of the companies you want to associate with that article. The downside here is that you need to delete and re-add the document every time you want to add a company to that article.
> 2. Create a separate index that contains that relationship.
> 3. Have two kinds of documents in your index, one that indexes articles and one that relates those to companies. Something like this:
>    Articles are indexed with text and artid fields. (NOTE: artid is NOT the Lucene document ID, those change.)
>    Relations are indexed with id and company id fields.
>    id and artid are your relationship. You *don't* want to name the field the same for both kinds of documents since they would be indexed together.
>
>    Now, given a search over some text, you get back a bunch of article IDs. You then search on the id field of the relations documents to extract company id fields. You may be able to do some interesting things with termdocs/termenums to make this efficient, but don't go there unless you need to.
>
> At this point, though, I've got to ask if you have access to a database in your application. If you do, why not store the relations there? Lucene is a text-search engine, not a relational database. This kind of relation may be perfectly valid to implement in Lucene, but you want to be careful if you find yourself trying to do any more RDBMS-like things.
>
> Best
> Erick
>
> On 12/26/06, Harini Raghavan [EMAIL PROTECTED] wrote:
>> Hi,
>> I have another related problem. I am adding news articles for a company to the Lucene index. As of now, if an article is mapped to more than one company, it is added that many times to the index. As the number of companies mapped to each article increases, this will not be a scalable implementation, as documents will be duplicated in the index. Is there a way to model the Lucene index in a relational way, such that the articles can be stored in an index and the article-company mapping can be modelled separately?
>>
>> Thanks,
>> Harini

Harini Raghavan
Software Engineer
Office : +91-40-23556255
[EMAIL PROTECTED]
we think, you sell
www.InsideView.com
InsideView
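Erick's option 3, with article documents and relation documents looked up in two steps, can be sketched with in-memory collections standing in for the index. Everything here is illustrative: the class name, the sample articles and the company ids are invented for this note, and the "text search" is a plain word match rather than a Lucene query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// In-memory stand-in for the two kinds of documents; not Lucene code.
class ArticleCompanySketch {
    // A relation "document": the id field matches an article's artid field.
    static final class Relation {
        final String id;
        final String companyId;

        Relation(String id, String companyId) {
            this.id = id;
            this.companyId = companyId;
        }
    }

    // Step 1: text search over article documents (artid -> text), returning artids.
    static Set<String> searchArticles(Map<String, String> articles, String word) {
        Set<String> artIds = new TreeSet<>();
        String wanted = word.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> article : articles.entrySet()) {
            String[] tokens = article.getValue().toLowerCase(Locale.ROOT).split("\\W+");
            if (Arrays.asList(tokens).contains(wanted)) {
                artIds.add(article.getKey());
            }
        }
        return artIds;
    }

    // Step 2: look up relation documents by id to extract company ids.
    static Set<String> companiesFor(List<Relation> relations, Set<String> artIds) {
        Set<String> companyIds = new TreeSet<>();
        for (Relation relation : relations) {
            if (artIds.contains(relation.id)) {
                companyIds.add(relation.companyId);
            }
        }
        return companyIds;
    }

    public static void main(String[] args) {
        Map<String, String> articles = new LinkedHashMap<>();
        articles.put("a1", "Acme opens new plant");
        articles.put("a2", "Quarterly results for Acme and Globex");
        articles.put("a3", "Weather report");

        List<Relation> relations = new ArrayList<>();
        relations.add(new Relation("a1", "c1"));
        relations.add(new Relation("a2", "c1"));
        relations.add(new Relation("a2", "c2"));

        Set<String> hits = searchArticles(articles, "acme");
        System.out.println(hits);                          // [a1, a2]
        System.out.println(companiesFor(relations, hits)); // [c1, c2]
    }
}
```

Each article is stored once, and adding a company to an article only adds one relation entry, at the cost of the second lookup step that Harini is concerned about.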
RE: Clustering Lucene with 40 Servers
Sorry if this seems naïve (I am new to Lucene), but why not keep one copy of the Lucene index on a NAS and have it shared by all servers?

-----Original Message-----
From: Biggy [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 27, 2006 7:57 AM
To: java-user@lucene.apache.org
Subject: Clustering Lucene with 40 Servers
RE: Clustering Lucene with 40 Servers
Well, try having, say, 30 servers write to the index at the same time and 10 others read from it. You'll get enough locks to make a grown man cry. :)

Scott Sellman wrote:
> Sorry if this seems naïve (I am new to Lucene), but why not keep one copy of the Lucene index on a NAS and have it shared by all servers?

--
View this message in context: http://www.nabble.com/Clustering-Lucene-with-40-Servers-tf2886546.html#a8065033