Re: Nested Queries

2006-12-27 Thread Grant Ingersoll

Hi Kapil,

I am not sure exactly what you are asking. Could you give an example of
the correct response? Also, are you truly using numbers, or are they
just substitutes for text? And are they part of a bigger problem
requiring Lucene? If it is just numbers, a DB might be the
better way to go, since you would have SET operations that may make
this easier. I'm not saying Lucene can't do what you want, just thinking
there are other ways.


-Grant

On Dec 26, 2006, at 4:47 AM, Kapil Chhabra wrote:


Just to mention, I have tokenized FIELD2 on "," and indexed it.

FIELD2:3 should return 1,2
FIELD2:(FIELD2:3) should return something like the output of:

FIELD2:1 OR FIELD2:2

Regards,
kapilChhabra

Kapil Chhabra wrote:

Hi,

Please see the following data-structure
+--------+----------+
| FIELD1 | FIELD2   |
+--------+----------+
| 1      | 2,3,4,6, |
| 2      | 3,1,5,7, |
| 3      | 1,2,     |
| 4      | 1,8,10,  |
| 5      | 2,9,     |
| 6      | 1,       |
| 7      | 2,9,     |
| 8      | 4,9,     |
| 9      | 5,7,8,   |
| 10     | 4,       |
+--------+----------+

My requirement is to find all values in FIELD1 where FIELD2  
contains all values of FIELD1 where FIELD2 contains 3

Which means something like
FIELD2:(FIELD2:3)

Is it possible to achieve this in a single query? If yes, then how  
should I go about it?




Thanks in anticipation,
kapilChhabra
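
A minimal sketch of the two-step approach to the question above, assuming the nested query means "FIELD2 contains any of the values" (the OR reading Kapil gives in his follow-up message). It uses plain Java collections rather than the Lucene API, and the class and method names are hypothetical. With Lucene, the same idea would be two searches, the second query being built from the hits of the first.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class NestedQuerySketch {

    // Convenience helper (hypothetical) to build a set of integers.
    static Set<Integer> setOf(Integer... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    // FIELD1 -> values of FIELD2, copied from the table in the question.
    static Map<Integer, Set<Integer>> table() {
        Map<Integer, Set<Integer>> rows = new LinkedHashMap<>();
        rows.put(1, setOf(2, 3, 4, 6));
        rows.put(2, setOf(3, 1, 5, 7));
        rows.put(3, setOf(1, 2));
        rows.put(4, setOf(1, 8, 10));
        rows.put(5, setOf(2, 9));
        rows.put(6, setOf(1));
        rows.put(7, setOf(2, 9));
        rows.put(8, setOf(4, 9));
        rows.put(9, setOf(5, 7, 8));
        rows.put(10, setOf(4));
        return rows;
    }

    // All FIELD1 values whose FIELD2 contains at least one of the wanted values.
    static Set<Integer> field1WhereField2ContainsAny(Map<Integer, Set<Integer>> rows,
                                                     Set<Integer> wanted) {
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<Integer, Set<Integer>> row : rows.entrySet()) {
            if (!Collections.disjoint(row.getValue(), wanted)) {
                result.add(row.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> rows = table();
        // Step 1: the inner query, FIELD2:3
        Set<Integer> inner = field1WhereField2ContainsAny(rows, setOf(3));
        // Step 2: the outer query, FIELD2:(results of step 1)
        Set<Integer> outer = field1WhereField2ContainsAny(rows, inner);
        System.out.println(inner); // [1, 2]
        System.out.println(outer); // [1, 2, 3, 4, 5, 6, 7]
    }
}
```

The inner result matches the "FIELD2:3 should return 1,2" expectation; the outer result is what FIELD2:1 OR FIELD2:2 selects from the same table.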





--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



toomanyclauses exception

2006-12-27 Thread Chris Salem
Hi All,

I'm getting a 'TooManyClauses' Exception and I'm not sure how to fix this.  
Here's a sample query that I'm using:

+(+freeform_text:exhibit* +(+freeform_text:dispaly +freeform_text:event*) 
+(+freeform_text:sale* +freeform_text:sells +freeform_text:develop*) 
+(+freeform_text:trade +freeform_text:show +freeform_text:trade 
+freeform_text:shows)) +degree_type:5 +position_desired:ftp 
+city:washington~0.5 +state:dc +ncountry:usa +last_modified:[2005-12-26 TO 
2006-12-26]

Here's the exception I'm getting:

org.apache.lucene.search.BooleanQuery$TooManyClauses
 at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:160)
 at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:151)
 at org.apache.lucene.search.PrefixQuery.rewrite(PrefixQuery.java:52)
 at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:372)
 at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:372)
 at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:372)
 at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:137)
 at org.apache.lucene.search.Query.weight(Query.java:93)
 at org.apache.lucene.search.Hits.init(Hits.java:41)
 at org.apache.lucene.search.Searcher.search(Searcher.java:44)
 at org.apache.lucene.search.Searcher.search(Searcher.java:36)
 at net.mainsequence.pcr.lucene.LuceneHandler.multiSearch(LuceneHandler.java:382)
 at net.mainsequence.pcr.lucene.LuceneServlet.searchIndex(LuceneServlet.java:169)
 at net.mainsequence.pcr.lucene.LuceneServlet.processRequest(LuceneServlet.java:83)
 at net.mainsequence.pcr.lucene.LuceneServlet.doPost(LuceneServlet.java:72)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:709)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
 at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
 at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
 at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
 at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
 at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
 at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
 at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
 at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
 at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
 at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
 at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
 at java.lang.Thread.run(Unknown Source)

Is there any way to increase the number of clauses Lucene can take?  This kind
of large query is not uncommon, so any help would be greatly appreciated.


Chris Salem
440.946.5214 x5458
[EMAIL PROTECTED] 





Re: toomanyclauses exception

2006-12-27 Thread Paul Elschot
Chris,

On Wednesday 27 December 2006 15:42, Chris Salem wrote:
 Hi All,

 I'm getting a 'TooManyClauses' Exception and I'm not sure how to fix this.
 Here's a sample query that I'm using:

 +(+freeform_text:exhibit* +(+freeform_text:dispaly +freeform_text:event*)
 +(+freeform_text:sale* +freeform_text:sells +freeform_text:develop*)
 +(+freeform_text:trade +freeform_text:show +freeform_text:trade
 +freeform_text:shows)) +degree_type:5 +position_desired:ftp
 +city:washington~0.5 +state:dc +ncountry:usa +last_modified:[2005-12-26 TO
 2006-12-26]

 Here's the exception I'm getting:

 org.apache.lucene.search.BooleanQuery$TooManyClauses
  at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:160)
  at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:151)
  at org.apache.lucene.search.PrefixQuery.rewrite(PrefixQuery.java:52)

One of the prefix queries is causing this, possibly event* or sale*.
Since they seem to be specific enough, increasing the maximum number
of boolean clauses that can be added to a boolean query appears to be
a good way to fix this; see BooleanQuery.setMaxClauseCount().

Regards,
Paul Elschot




Re: toomanyclauses exception

2006-12-27 Thread Erick Erickson

Also, see the thread on this list titled "I just don't get wildcards at all"
to see an extensive discussion of this issue, as well as wildcards in
general. You might also search the archive for "wildcards". The short form is
that any wildcard (including prefix queries) expands under the covers to
create a clause for each matching entry in the index for that field. For
instance, say a field had the following values:

abcd
abck
abt

Searching for ab* would expand to searching for abcd, abck and abt under the
covers. When the number of possibilities gets above the default value of
1024, you see a TooManyClauses exception. Expanding the number of clauses
*may* fix you right up, but on any reasonably sized index, you can come up
with a query that'll exceed whatever number you set. Or you'll get to an
unacceptable performance/memory footprint. Imagine your query with things
like a*.
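
The expansion described above can be sketched in plain Java. This is a simplified simulation and not Lucene's implementation: the sorted set stands in for the term dictionary of one field, and the class, method and exception names are hypothetical. The limit of 1024 is the default mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class PrefixExpansionSketch {

    // Stand-in for BooleanQuery.TooManyClauses (hypothetical class).
    static class TooManyClausesSketch extends RuntimeException {
        TooManyClausesSketch(String message) {
            super(message);
        }
    }

    // Expands prefix* into one clause per matching term, failing once the
    // number of clauses would exceed maxClauses.
    static List<String> expandPrefix(NavigableSet<String> terms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // Terms sharing a prefix are contiguous in sorted order, starting at the prefix.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (clauses.size() >= maxClauses) {
                throw new TooManyClausesSketch(prefix + "* expands to more than " + maxClauses + " terms");
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(Arrays.asList("abcd", "abck", "abt", "xyz"));
        System.out.println(expandPrefix(terms, "ab", 1024)); // [abcd, abck, abt]
        try {
            expandPrefix(terms, "ab", 2); // a limit smaller than the number of matches
        } catch (TooManyClausesSketch e) {
            System.out.println("too many clauses: " + e.getMessage());
        }
    }
}
```

Raising the limit only moves the point at which a broad prefix fails; the number of clauses still grows with the number of matching terms in the index.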

Think seriously about how you're going to deal with this. There are several
options:
1. Use filters for all your wildcard clauses and create your own
BooleanQuery. Be aware that using filters affects scoring.
2. Assume that any query that throws a TooManyClauses exception (after
you've set a suitable max as Paul suggested) is too broad to be useful and
respond to the user with some polite phrase asking them to refine the query.
3. Look over the SrndQuery classes. I don't fully understand these, but they
certainly behave much differently in this area. Note that SrndQuery limits
wildcards to having at least three non-wildcard characters.
4. Ask whether stemming is a complete or partial solution. Ditto for
Soundex. There's a good chance these won't apply, but they may.
5. Insert the solution to your specific problem here.
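
As an illustration of the "at least three non-wildcard characters" rule mentioned in option 3, here is a minimal pre-check that an application could run on each wildcard term before handing the query to Lucene. This is a sketch of the idea only, not how SrndQuery implements it; the class name, method name and the treatment of * and ? as the only wildcard characters are assumptions.

```java
class WildcardGuardSketch {

    // Minimum number of literal (non-wildcard) characters, per the rule above.
    static final int MIN_LITERAL_CHARS = 3;

    // Returns true when the term has enough literal characters to be worth expanding.
    static boolean isSpecificEnough(String term) {
        int literalChars = 0;
        for (char c : term.toCharArray()) {
            if (c != '*' && c != '?') {
                literalChars++;
            }
        }
        return literalChars >= MIN_LITERAL_CHARS;
    }

    public static void main(String[] args) {
        System.out.println(isSpecificEnough("sale*")); // true
        System.out.println(isSpecificEnough("a*"));    // false
    }
}
```

Rejecting terms such as a* up front avoids paying for an expansion that is almost certain to hit the clause limit.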

This is a sticky wicket that will probably consume more time than you think
to handle. It's easy for your product manager to claim that "Of course, we
must support arbitrary wildcards", but I'd urge you to seriously ask what
value *arbitrary* wildcards bring to the product. When you start getting
thousands of responses to a query, is it actually valuable to return them to
the user? Or do you give her just as much value (and deliver product sooner)
by telling her up front that she's getting too many responses to be useful?
With this last strategy, you just catch the TooManyClauses exception and
respond with "refine your query".
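
A minimal sketch of that last strategy, assuming a hypothetical search function that throws when a query expands too far. The exception class here is a stand-in; in a real application the catch block would name Lucene's BooleanQuery.TooManyClauses instead, and the search function would wrap the actual searcher call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class RefineQuerySketch {

    // Stand-in for BooleanQuery.TooManyClauses (hypothetical class).
    static class TooManyClausesSketch extends RuntimeException {
    }

    // Runs the search and turns a clause overflow into a polite message for the user.
    static String respond(String query, Function<String, List<String>> search) {
        try {
            List<String> hits = search.apply(query);
            return hits.size() + " hit(s)";
        } catch (TooManyClausesSketch e) {
            return "Your query matches too many terms; please refine your query.";
        }
    }

    public static void main(String[] args) {
        // A search that succeeds, and one that overflows the clause limit.
        System.out.println(respond("sale*", q -> Arrays.asList("doc1", "doc2")));
        System.out.println(respond("a*", q -> { throw new TooManyClausesSketch(); }));
    }
}
```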

Best
Erick



Clustering Lucene with 40 Servers

2006-12-27 Thread Biggy

I'm currently investigating the best ways of clustering Lucene.
I've heard of both Solr and Terracotta but do not know how well they scale.
Their examples talk of a 4-node cluster. This is way too small for my needs.

I have 30 JVMs, each handling 3 requests/sec and each having its own
Lucene index. The index changes are propagated to the cluster members using
JGroups messages. This solution has more than reached its limit, as JGroups
has become unstable and a source of many JVM crashes. Based on current
traffic trends I anticipate needing to upgrade to 40+ JVMs very soon.

Can anybody suggest a way to effectively cluster / replicate document
changes?

P.S.:
JMS is not a possible solution, as it was what we used before JGroups. We
had too many memory problems, full queues, etc., which then crashed the servers.

-- 
View this message in context: 
http://www.nabble.com/Clustering-Lucene-with-40-Servers-tf2886546.html#a8064135
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: toomanyclauses exception

2006-12-27 Thread Paul Elschot
On Wednesday 27 December 2006 16:53, Erick Erickson wrote:
...
 3 Look over the SrndQuery classes. I don't fully understand these, but they
 certainly behave much differently in this area. Note that SrndQuery limits
 wildcards to having at least three non-wildcard characters.

In Lucene, the limit on the number of clauses is applied per (sub)query
that expands to a BooleanQuery.
In contrib/surround the limit on the maximum number of clauses
is applied for a full query including all subqueries.

The reason for the limitation in both cases is that each TermScorer
needs some buffer space, and without the limit Query.rewrite()
will occasionally run out of memory.
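
The difference between the two scopes can be illustrated with a small sketch. This only models the counting rule described above, with each wildcard subquery represented by the number of terms it expands to; the class and method names are hypothetical and the real accounting in Lucene and contrib/surround may differ in detail.

```java
class ClauseLimitScopeSketch {

    // Per-subquery rule: every expansion must fit within the limit on its own.
    static boolean withinPerSubqueryLimit(int[] expansionSizes, int maxClauses) {
        for (int size : expansionSizes) {
            if (size > maxClauses) {
                return false;
            }
        }
        return true;
    }

    // Whole-query rule: the expansions of all subqueries together must fit within the limit.
    static boolean withinWholeQueryLimit(int[] expansionSizes, int maxClauses) {
        long total = 0;
        for (int size : expansionSizes) {
            total += size;
        }
        return total <= maxClauses;
    }

    public static void main(String[] args) {
        // Two prefix subqueries expanding to 600 and 700 terms, with a limit of 1024.
        int[] sizes = {600, 700};
        System.out.println(withinPerSubqueryLimit(sizes, 1024)); // true
        System.out.println(withinWholeQueryLimit(sizes, 1024));  // false
    }
}
```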

Regards,
Paul Elschot




Modelling Relational Lucene Index

2006-12-27 Thread Harini Raghavan

Hi Erick,

Thank you for the detailed response.

First I would like to mention that my application has an index with
the company id and name indexed for each article, for the following reasons:

1. A search interface where we search across articles and companies.
2. Paging - I need to page the results after loading the hits, so
I don't want to separate the text search from the article-company
matching logic. I want to load the articles using one single Lucene query.


I am using a MySQL database to store the relations. But since I need to
search across companies and keywords in articles, I am also storing the
company name and id in the index. Option 3 looks good to me, but I
am concerned about degrading the performance of the existing system if I
make the search a two-step process.


However I will try to evaluate your suggestions in detail.

Thank you again,
Harini

Erick Erickson wrote:


First, it probably would have been a good thing to start a new thread on
this topic, since it's only vaguely related to disk space <G>...

That said, sure. Note that there's no requirement in Lucene that all
documents in an index have the same fields. Also, there's no reason you
can't use two separate indexes. Finally, you have to think about how many
times you are going to add/update a given article when choosing your
approach. Here are several possibilities.

1. Add a field (tokenized) to each article in your index that contains the IDs
of the companies you want to associate with that article. The downside here
is that you need to delete and re-add the document every time you want to
add a company to that article.

2. Create a separate index that contains that relationship.

3. Have two kinds of documents in your index, one that indexes articles and
one that relates those to companies. Something like this:

Articles are indexed with "text" and "artid" fields. (NOTE: artid is NOT the
Lucene document ID, those change.)
Relations are indexed with "id" and "company id" fields.

"id" and "artid" are your relationship. You *don't* want to name the field the
same for both kinds of documents since they would be indexed together.

Now, given a search over some text, you get back a bunch of article IDs. You
then search on the "id" field of the relations documents to extract the
"company id" fields.

You may be able to do some interesting things with termdocs/termenums to
make this efficient, but don't go there unless you need to.
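
The two-step lookup in option 3 can be sketched with in-memory collections standing in for the two kinds of documents. This is not Lucene code: the maps, the sample data and all names are hypothetical, and the text match is a plain whole-word comparison rather than an analyzed search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class RelationLookupSketch {

    // "Article" documents: artid -> text (made-up sample data).
    static Map<String, String> sampleArticles() {
        Map<String, String> articles = new LinkedHashMap<>();
        articles.put("a1", "acme launches widget");
        articles.put("a2", "globex earnings report");
        articles.put("a3", "widget recall announced");
        return articles;
    }

    // "Relation" documents: {id, company id} pairs, one per article-company mapping.
    static List<String[]> sampleRelations() {
        List<String[]> relations = new ArrayList<>();
        relations.add(new String[] {"a1", "c1"});
        relations.add(new String[] {"a1", "c2"});
        relations.add(new String[] {"a2", "c3"});
        relations.add(new String[] {"a3", "c1"});
        return relations;
    }

    // Step 1: text search over the article documents, returning article IDs.
    static Set<String> articlesMatching(Map<String, String> articles, String word) {
        Set<String> ids = new TreeSet<>();
        for (Map.Entry<String, String> article : articles.entrySet()) {
            List<String> words = Arrays.asList(article.getValue().toLowerCase().split("\\s+"));
            if (words.contains(word.toLowerCase())) {
                ids.add(article.getKey());
            }
        }
        return ids;
    }

    // Step 2: look up the relation documents whose id is one of the matching article IDs.
    static Set<String> companiesFor(Set<String> articleIds, List<String[]> relations) {
        Set<String> companies = new TreeSet<>();
        for (String[] relation : relations) {
            if (articleIds.contains(relation[0])) {
                companies.add(relation[1]);
            }
        }
        return companies;
    }

    public static void main(String[] args) {
        Set<String> hits = articlesMatching(sampleArticles(), "widget");
        System.out.println(hits);                                  // [a1, a3]
        System.out.println(companiesFor(hits, sampleRelations())); // [c1, c2]
    }
}
```

Each article is stored once, and adding a company to an article only adds one relation entry instead of re-adding the article.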

At this point, though, I've got to ask if you have access to a database in
your application. If you do, why not store the relations there? Lucene is a
text-search engine, not a relational database. This kind of relation may be
perfectly valid to implement in Lucene, but you want to be careful if you
find yourself trying to do any more RDBMS-like things.

Best
Erick

On 12/26/06, Harini Raghavan [EMAIL PROTECTED] wrote:



Hi,

I have another related problem. I am adding news articles for a company
to the Lucene index. As of now, if an article is mapped to more than
one company, it is added that many times to the index. As the number of
companies mapped to each article increases, this will not be a scalable
implementation, as documents will be duplicated in the index. Is there a
way to model the Lucene index in a relational way, such that the articles
can be stored in an index and the article-company mapping can be modelled
separately?

Thanks,
Harini



Harini Raghavan
Software Engineer
Office : +91-40-23556255
[EMAIL PROTECTED]
we think, you sell
www.InsideView.com
InsideView 






RE: Clustering Lucene with 40 Servers

2006-12-27 Thread Scott Sellman
Sorry if this seems naïve (I am new to Lucene), but why not keep one copy of 
the Lucene index on a NAS and have it shared by all servers?  




RE: Clustering Lucene with 40 Servers

2006-12-27 Thread Biggy

Well, try having, say, 30 servers write to the index at the same time and
10 others read from it. You'll get enough locks to make a grown man cry. :)


Scott Sellman wrote:
 
 Sorry if this seems naïve (I am new to Lucene), but why not keep one copy
 of the Lucene index on a NAS and have it shared by all servers?  
 
