Hello,

I'm trying to implement subdirectory searches, for queries like: 

  "testo site:example.com/directory"

that would search all documents within /directory.

I made some small changes to the query-site plugin:

In SiteIndexingFilter.java, I add the path as an indexed-only field:

  doc.add(new Field("path", url.getPath(), false, true, false));

And in SiteQueryFilter.java, my thought was to use a Lucene PrefixQuery
to also match the nested directories,

  "example.com/directory/1/2"
  "example.com/directory/news"
  etc.

<snippet>

  PrefixQuery pathClause = new PrefixQuery(new Term("path", url.getPath()));
  pathClause.setBoost(0.0f);
  output.add(pathClause, c.isRequired(), c.isProhibited());

</snippet>

This seems to work well for small numbers of documents, but for large crawls it starts
to fail with this backtrace:

StandardWrapperValve[jsp]: Servlet.service() for servlet jsp threw exception
org.apache.lucene.search.BooleanQuery$TooManyClauses
        at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:79)
        at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:71)
        at org.apache.lucene.search.PrefixQuery.rewrite(PrefixQuery.java:50)
        at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:243)
        at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:162)
        at org.apache.lucene.search.Query.weight(Query.java:84)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
        at net.nutch.searcher.LuceneQueryOptimizer.optimize(LuceneQueryOptimizer.java:79)
        at net.nutch.searcher.IndexSearcher.search(IndexSearcher.java:70)
        at net.nutch.searcher.NutchBean.search(NutchBean.java:149)
        at org.apache.jsp.search_jsp._jspService(search_jsp.java:163)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:810)
        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:298)
        at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
        at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:810)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:237)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:157)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
        at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
        at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
        at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:793)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:702)
        at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:571)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:644)
        at java.lang.Thread.run(Thread.java:534)

It seems that Lucene rewrites the PrefixQuery by adding one clause per matching
path to a BooleanQuery, and that this is the source of the
BooleanQuery.TooManyClauses exception.
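
To illustrate the expansion described above, here is a minimal stand-alone
sketch. It does not call Lucene: the class PrefixExpansionSketch, the method
expand and the constant MAX_CLAUSES are hypothetical names for illustration.
The limit of 1024 is Lucene's default maximum clause count for a BooleanQuery;
the sorted set stands in for the index's term dictionary for the "path" field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Stand-alone simulation of prefix expansion; no Lucene classes are used.
class PrefixExpansionSketch {

    // Assumed stand-in for the BooleanQuery clause limit (Lucene's default is 1024).
    static final int MAX_CLAUSES = 1024;

    // Collects one "clause" per indexed term that starts with the prefix and
    // fails as soon as adding another clause would exceed the limit.
    static List<String> expand(SortedSet<String> indexedTerms, String prefix) {
        List<String> clauses = new ArrayList<>();
        // Terms are sorted, so every match is contiguous, starting at the prefix.
        for (String term : indexedTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the last term sharing the prefix
            }
            if (clauses.size() >= MAX_CLAUSES) {
                throw new IllegalStateException("too many clauses for prefix " + prefix);
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Small index: only the two terms under /directory become clauses.
        SortedSet<String> small = new TreeSet<>(Arrays.asList(
            "/about", "/directory/1/2", "/directory/news"));
        System.out.println(expand(small, "/directory").size());

        // Large index: 5000 distinct paths under /directory exceed the limit.
        SortedSet<String> large = new TreeSet<>();
        for (int i = 0; i < 5000; i++) {
            large.add("/directory/page" + i);
        }
        try {
            expand(large, "/directory");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions the failure depends on how many distinct indexed paths
share the prefix, not on the total number of documents, which would be
consistent with the query working on small crawls and failing on large ones.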

Have I gone about this incorrectly, or does anyone have any thoughts on how to search 
within a specific directory?

Thanks,

-Robert

