Robert,
Could this not be done with a URL search?
i.e.
"testo url:example.com/directory"
It seems to me that this would work.
It would also find URLs like this one, though:
http://otherdomain.net/example/com/directory
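To illustrate why the other domain would match too, here is a minimal sketch. It assumes the url field is tokenized by splitting on every non-alphanumeric character and that the search is a consecutive-token phrase match. The class and method names are hypothetical, and the real Nutch analyzer may tokenize differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class UrlPhraseSketch {

    // Assumption: the url field is tokenized by splitting on every run of
    // non-alphanumeric characters. The real analyzer may behave differently.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // A phrase matches when its tokens occur consecutively, in order, in the URL.
    static boolean phraseMatches(String url, String phrase) {
        return Collections.indexOfSubList(tokenize(url), tokenize(phrase)) >= 0;
    }

    public static void main(String[] args) {
        String phrase = "example.com/directory";
        System.out.println(phraseMatches("http://example.com/directory/news", phrase));
        System.out.println(phraseMatches("http://otherdomain.net/example/com/directory", phrase));
        System.out.println(phraseMatches("http://example.com/other", phrase));
    }
}
```

Under that assumption, "example.com/directory" and "example/com/directory" both reduce to the tokens example, com, directory, so the two URLs cannot be told apart.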
Just a thought
Andy
Robert Hopson wrote:
Hello,
I'm trying to implement subdirectory searches, for queries like:
"testo site:example.com/directory"
that would search all documents within /directory.
I made some small changes to the query-site plugin:
In SiteIndexingFilter.java, I add the path as an indexed-only field:
doc.add(new Field("path", url.getPath(), false, true, false));
And in SiteQueryFilter.java, my thought was to use a Lucene PrefixQuery to also match the nested directories,
"example.com/directory/1/2" "example.com/directory/news" etc.
<snippet>
PrefixQuery pathClause = new PrefixQuery(new Term("path", url.getPath()));
pathClause.setBoost(0.0f);
output.add(pathClause, c.isRequired(), c.isProhibited());
</snippet>
This seems to work well for small numbers of documents, but for large crawls it starts to fail with this backtrace:
StandardWrapperValve[jsp]: Servlet.service() for servlet jsp threw exception
org.apache.lucene.search.BooleanQuery$TooManyClauses
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:79)
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:71)
    at org.apache.lucene.search.PrefixQuery.rewrite(PrefixQuery.java:50)
    at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:243)
    at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:162)
    at org.apache.lucene.search.Query.weight(Query.java:84)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
    at net.nutch.searcher.LuceneQueryOptimizer.optimize(LuceneQueryOptimizer.java:79)
    at net.nutch.searcher.IndexSearcher.search(IndexSearcher.java:70)
    at net.nutch.searcher.NutchBean.search(NutchBean.java:149)
    at org.apache.jsp.search_jsp._jspService(search_jsp.java:163)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:810)
    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:298)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:810)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:237)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:157)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
    at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:793)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:702)
    at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:571)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:644)
    at java.lang.Thread.run(Thread.java:534)
It seems that Lucene rewrites the PrefixQuery by adding one clause per matching path to a BooleanQuery, and that is the source of the BooleanQuery.TooManyClauses exception.
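The behaviour described above can be imitated without Lucene. The following is only a simulation, not Lucene's code: it keeps the indexed paths in a sorted set, collects one "clause" per path that starts with the prefix, and gives up once a limit is exceeded. The class, the method names, and the sample paths are hypothetical, and the limit of 1024 is assumed to correspond to Lucene's default maximum clause count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class PrefixExpansionSketch {

    // Assumed to mirror Lucene's default maximum clause count.
    static final int MAX_CLAUSES = 1024;

    // Collect one clause per indexed term that starts with the prefix,
    // failing as soon as the number of clauses would exceed the limit.
    static List<String> expand(TreeSet<String> terms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // In sorted order, all terms sharing the prefix are contiguous,
        // starting at the first term that is >= the prefix.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (clauses.size() >= maxClauses) {
                throw new IllegalStateException("too many clauses (limit " + maxClauses + ")");
            }
            clauses.add(term);
        }
        return clauses;
    }

    // Convenience wrapper: true if the expansion exceeds the limit.
    static boolean overflows(TreeSet<String> terms, String prefix, int maxClauses) {
        try {
            expand(terms, prefix, maxClauses);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Build a hypothetical index with `count` distinct paths under /directory.
    static TreeSet<String> samplePaths(int count) {
        TreeSet<String> terms = new TreeSet<>();
        terms.add("/other/page.html");
        for (int i = 0; i < count; i++) {
            terms.add("/directory/page" + i + ".html");
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(expand(samplePaths(10), "/directory", MAX_CLAUSES).size());
        System.out.println(overflows(samplePaths(2000), "/directory", MAX_CLAUSES));
    }
}
```

With a small crawl the expansion stays under the limit. Once more distinct paths than the limit share the prefix, the expansion fails, which would be consistent with the query working for small indexes and breaking on large ones.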
Have I gone about this incorrectly, or does anyone have any thoughts on how to search within a specific directory?
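One possible alternative, offered only as a sketch and not as something the query-site plugin does: index every ancestor directory of a document's path as its own untokenized term, so that a directory search becomes a single exact-term lookup instead of a prefix expansion. The helper below only computes those ancestor paths; the class and method names are hypothetical, and wiring the result into the indexing and query filters is left out.

```java
import java.util.ArrayList;
import java.util.List;

public class AncestorPathsSketch {

    // For "/directory/1/2" this yields "/directory", "/directory/1" and
    // "/directory/1/2". The root "/" itself is not emitted.
    static List<String> ancestorPaths(String path) {
        List<String> result = new ArrayList<>();
        // Start at index 1 to skip the leading slash.
        int slash = path.indexOf('/', 1);
        while (slash >= 0) {
            result.add(path.substring(0, slash));
            slash = path.indexOf('/', slash + 1);
        }
        // Add the full path unless it is empty, the root, or ends with a slash
        // (in which case the loop above already emitted it without the slash).
        if (path.length() > 1 && !path.endsWith("/")) {
            result.add(path);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ancestorPaths("/directory/1/2"));
        System.out.println(ancestorPaths("/directory/"));
    }
}
```

The trade-off in this sketch is a larger index (several terms per document) in exchange for a query that needs only one clause per directory, whatever the size of the crawl.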
Thanks,
-Robert
-------------------------------------------------------
This SF.Net email is sponsored by: YOU BE THE JUDGE. Be one of 170
Project Admins to receive an Apple iPod Mini FREE for your judgement on
who ports your project to Linux PPC the best. Sponsored by IBM.
Deadline: Sept. 24. Go here: http://sf.net/ppc_contest.php
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
