RE: Using Nutch for crawling and Lucene for searching (Wildcard/Fuzzy)

Matthias W. Mon, 03 Nov 2008 05:51:38 -0800


Patrick Markiewicz wrote:
> 
> I'm not sure what you're using for searching, but wherever you
> reference an analyzer in Lucene, you need to change that from
> StandardAnalyzer to
> AnalyzerFactory.get(NutchConfiguration.create().get("en")) (which may
> require importing nutch-specific classes).
> 
I changed:
Analyzer analyzer = new StandardAnalyzer();


to:
Configuration nutchConfig = NutchConfiguration.create();
AnalyzerFactory an = new AnalyzerFactory(nutchConfig);
NutchAnalyzer analyzer = an.get(nutchConfig.get("en"));

Now I get the following error message from Tomcat:
org.apache.jasper.JasperException
        org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:372)
        org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
        org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
        sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        java.lang.reflect.Method.invoke(Method.java:585)
        org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
        java.security.AccessController.doPrivileged(Native Method)
        javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
        org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:272)
        org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161)

root cause

java.lang.NullPointerException
        java.io.Reader.<init>(Reader.java:61)
        java.io.BufferedReader.<init>(BufferedReader.java:76)
        java.io.BufferedReader.<init>(BufferedReader.java:91)
        org.apache.nutch.analysis.CommonGrams.init(CommonGrams.java:152)
        org.apache.nutch.analysis.CommonGrams.<init>(CommonGrams.java:52)
        org.apache.nutch.analysis.NutchDocumentAnalyzer$ContentAnalyzer.<init>(NutchDocumentAnalyzer.java:64)
        org.apache.nutch.analysis.NutchDocumentAnalyzer.<init>(NutchDocumentAnalyzer.java:55)
        org.apache.nutch.analysis.AnalyzerFactory.<init>(AnalyzerFactory.java:49)
        org.apache.jsp.results_jsp._jspService(results_jsp.java:167)
        org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
        org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
        org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
        sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        java.lang.reflect.Method.invoke(Method.java:585)
        org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
        java.security.AccessController.doPrivileged(Native Method)
        javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
        org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:272)
        org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161)
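
As far as I can tell, the top three frames of the root cause can be reproduced without Nutch at all: wrapping a null Reader in a BufferedReader fails inside the Reader constructor. The sketch below is plain Java and only an illustration. The class name NullReaderDemo is made up, and it is my assumption (the trace does not say so outright) that CommonGrams.init ends up with a reader for a resource that was never found.

```java
import java.io.BufferedReader;
import java.io.Reader;

public class NullReaderDemo {

    // Returns the class name of the exception thrown when a null Reader is
    // wrapped in a BufferedReader, or "none" if the constructor succeeds.
    static String wrapNullReader() {
        Reader missing = null; // stands in for a resource lookup that found nothing (assumption)
        try {
            new BufferedReader(missing);
            return "none";
        } catch (NullPointerException e) {
            return e.getClass().getName();
        }
    }

    public static void main(String[] args) {
        System.out.println(wrapNullReader());
    }
}
```

If that reading is right, the exception says less about the analyzer itself than about whatever file CommonGrams tries to open during construction.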


Full source code of results.jsp:
<%@ page import="org.apache.hadoop.conf.*"
  import="org.apache.nutch.util.NutchConfiguration"
  import="org.apache.nutch.analysis.*"
  import = "  javax.servlet.*, javax.servlet.http.*, java.io.*,
org.apache.lucene.document.*, org.apache.lucene.index.*,
org.apache.lucene.search.*, org.apache.lucene.queryParser.*,
org.apache.lucene.demo.*, org.apache.lucene.demo.html.Entities,
java.net.URLEncoder"
  
%>

<%
/*
        Author: Andrew C. Oliver, SuperLink Software, Inc.
        ([EMAIL PROTECTED])

        This jsp page is deliberately written in the horrible java directly
        embedded in the page style for an easy and concise demonstration of Lucene.
        Do note...if you write pages that look like this...sooner or later
        you'll have a maintenance nightmare.  If you use jsps...use taglibs
        and beans!  That being said, this should be acceptable for a small
        page demonstrating how one uses Lucene in a web app.

        This is also deliberately overcommented. ;-)

*/
%>
<%!
public String escapeHTML(String s) {
  s = s.replaceAll("&", "&amp;");
  s = s.replaceAll("<", "&lt;");
  s = s.replaceAll(">", "&gt;");
  s = s.replaceAll("\"", "&quot;");
  s = s.replaceAll("'", "&apos;");
  return s;
}
%>
<%@include file="header.jsp"%>
<%
        boolean error = false;                  //used to control flow for error messages
        String indexName = indexLocation;       //local copy of the configuration variable
        IndexSearcher searcher = null;          //the searcher used to open/search the index
        Query query = null;                     //the Query created by the QueryParser
        Hits hits = null;                       //the search results
        int startindex = 0;                     //the first index displayed on this page
        int maxpage    = 50;                    //the maximum items displayed on this page
        String queryString = null;              //the query entered in the previous page
        String startVal    = null;              //string version of startindex
        String maxresults  = null;              //string version of maxpage
        int thispage = 0;                       //used for the for/next either maxpage or
                                                //hits.length() - startindex - whichever is
                                                //less

        try {
          searcher = new IndexSearcher(indexName);      //create an indexSearcher for our page
                                                        //NOTE: this operation is slow for large
                                                        //indices (much slower than the search itself)
                                                        //so you might want to keep an IndexSearcher
                                                        //open

        } catch (Exception e) {                         //any error that happens is probably due
                                                        //to a permission problem or nonexistent
                                                        //or otherwise corrupt index
%>
                <p>ERROR opening the Index - contact sysadmin!</p>
                <p>Error message: <%=escapeHTML(e.getMessage())%></p>   
<%                error = true;                                  //don't do anything up to the footer
        }
%>
<%
       if (error == false) {                                           //did we open the index?
                queryString = request.getParameter("query");           //get the search criteria
                startVal    = request.getParameter("startat");         //get the start index
                maxresults  = request.getParameter("maxresults");      //get max results per page
                try {
                        maxpage    = Integer.parseInt(maxresults);     //parse the max results first
                        startindex = Integer.parseInt(startVal);       //then the start index
                } catch (Exception e) { } //we don't care if something happens we'll just start at 0
                                          //or end at 50

                

                if (queryString == null)
                        throw new ServletException("no query "+       //if you don't have a query then
                                                   "specified");      //you probably played on the
                                                                      //query string so you get the
                                                                      //treatment

                Configuration nutchConfig = NutchConfiguration.create();
                AnalyzerFactory an = new AnalyzerFactory(nutchConfig);
                NutchAnalyzer analyzer = an.get(nutchConfig.get("en")); //construct our usual analyzer
                try {
                        QueryParser qp = new QueryParser("contents", analyzer);
                        query = qp.parse(queryString);                //parse the
                } catch (ParseException e) {                          //query and construct the Query
                                                                      //object
                                                                      //if it's just "operator error"
                                                                      //send them a nice error HTML
%>
                        <p>Error while parsing query: <%=escapeHTML(e.getMessage())%></p>
<%
                        error = true;                                 //don't bother with the rest of
                                                                      //the page
                }
        }
%>
<%
        if (error == false && searcher != null) {                     // if we've had no errors
                                                                      // searcher != null was to handle
                                                                      // a weird compilation bug
                thispage = maxpage;                                   // default last element to maxpage
                hits = searcher.search(query);                        // run the query
                if (hits.length() == 0) {                             // if we got no results tell the user
%>
                        <p> I'm sorry I couldn't find what you were looking for. </p>
<%
                        error = true;                                 // don't bother with the rest of the
                                                                      // page
                }
        }

        if (error == false && searcher != null) {                   
%>
                <table>
                <tr>
                        <td>Document</td>
                        <td>Summary</td>
                </tr>
<%
                if ((startindex + maxpage) > hits.length()) {
                        thispage = hits.length() - startindex;      // set the max index to maxpage or last
                }                                                   // actual search result whichever is less

                for (int i = startindex; i < (thispage + startindex); i++) { // for each element
%>
                <tr>
<%
                        Document doc = hits.doc(i);                    //get the next document
                        String doctitle = doc.get("title");            //get its title
                        String url = doc.get("path");                  //get its path field
                        if (url != null && url.startsWith("../webapps/")) { // strip off ../webapps prefix if present
                                url = url.substring(10);
                        }
                        if ((doctitle == null) || doctitle.equals("")) //use the path if it has no title
                                doctitle = url;
                                                                       //then output!
%>
                        <td><a href="<%=url%>"><%=doctitle%></a></td>
                        <td><%=doc.get("summary")%></td>
                </tr>
<%
                }
%>
<%                if ( (startindex + maxpage) < hits.length()) {   //if there are more results...display
                                                                   //the more link

                        String moreurl="results.jsp?query=" +
                                       URLEncoder.encode(queryString) + //construct the "more" link
                                       "&amp;maxresults=" + maxpage +
                                       "&amp;startat=" + (startindex + maxpage);
%>
                <tr>
                        <td></td><td><a href="<%=moreurl%>">More Results>></a></td>
                </tr>
<%
                }
%>
                </table>

<%       }                                            //then include our footer.
         if (searcher != null)
                searcher.close();
%>
<%@include file="footer.jsp"%>
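
One check I can think of, since the exception comes from a Reader that was apparently never opened: see whether the Nutch configuration files are visible on the webapp's classpath at all. The following is only a rough sketch in plain Java. The class name ClasspathCheck is made up, and the three file names are my assumption about what Nutch looks for, so they may need adjusting.

```java
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

public class ClasspathCheck {

    // Looks up each resource name through the given class loader and records
    // whether it was found (true) or not (false), in the order given.
    static Map<String, Boolean> findResources(ClassLoader loader, String... names) {
        Map<String, Boolean> found = new LinkedHashMap<String, Boolean>();
        for (String name : names) {
            URL url = loader.getResource(name);
            found.put(name, url != null);
        }
        return found;
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        // Assumed file names, not confirmed by the stack trace.
        Map<String, Boolean> result =
            findResources(loader, "nutch-default.xml", "nutch-site.xml", "common-terms.utf8");
        for (Map.Entry<String, Boolean> e : result.entrySet()) {
            System.out.println(e.getKey() + " -> " + (e.getValue() ? "found" : "MISSING"));
        }
    }
}
```

Inside Tomcat the same lookup would have to run from the JSP or a servlet, so that it uses the webapp's class loader rather than the one of a standalone JVM.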


What can I do now?
-- 
View this message in context: 
http://www.nabble.com/Using-Nutch-for-crawling-and-Lucene-for-searching-%28Wildcard-Fuzzy%29-tp19990219p20303116.html
Sent from the Nutch - User mailing list archive at Nabble.com.
