[ https://issues.apache.org/jira/browse/NUTCH-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12570117#action_12570117 ]

Dennis Kubes commented on NUTCH-613:
------------------------------------

It seems to me that this code inside the basic indexing filter is wrong and 
is what is causing the problem:

    // url is both stored and indexed, so it's both searchable and returned
    doc.add(new Field("url",
                      reprUrlString == null ? urlString : reprUrlString,
                      Field.Store.YES, Field.Index.TOKENIZED));
    
    if (reprUrlString != null) {
      // also store original url as both stored and indexed
      doc.add(new Field("orig", urlString,
                        Field.Store.YES, Field.Index.TOKENIZED));
    }

OK, some background.  The Fetcher goes to fetch a page, sourceA, and gets 
redirected to targetA.  Both sourceA and targetA are stored in the segments and 
the crawldb, but sourceA has no parseText, parseData, or Content, only crawl 
fetch.  targetA has everything.  targetA in its metadata has a reprUrl, possibly 
pointing to itself, possibly to a different version of itself due to 
normalization, but more likely pointing to its source, in this case sourceA.  

Now we come to the indexer.  Here we add the reprUrl, sourceA, as the url field 
and targetA as the orig field.  Then when getting the summary (before the 
patch), the searcher used the url field, sourceA, which had no parse objects 
and hence no summaries and no content, so we got a NullPointerException when 
trying to get the cached page.  IMO, url should point to targetA and orig 
should point to sourceA.  Essentially flipped from what it is here.  
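To make the proposed flip concrete, here is a minimal, self-contained sketch of the intended field assignment (the `indexFields` helper and example URLs are hypothetical, standing in for the Lucene `Document`/`Field` calls in the snippet above): "url" gets the fetched page (targetA, which has the parse objects), and "orig" gets the representative source url (sourceA).

```java
import java.util.HashMap;
import java.util.Map;

public class ReprUrlSketch {

  // Hypothetical helper illustrating the proposed assignment, flipped
  // from the quoted code: urlString is the fetched (redirect target)
  // url, reprUrlString the representative (source) url, if any.
  static Map<String, String> indexFields(String urlString,
                                         String reprUrlString) {
    Map<String, String> fields = new HashMap<String, String>();
    // url field: the page we actually fetched and parsed (targetA),
    // so summaries and cached-page lookups find real content
    fields.put("url", urlString);
    if (reprUrlString != null) {
      // orig field: the original source url (sourceA)
      fields.put("orig", reprUrlString);
    }
    return fields;
  }

  public static void main(String[] args) {
    Map<String, String> f =
        indexFields("http://example.com/targetA",
                    "http://example.com/sourceA");
    System.out.println("url  = " + f.get("url"));
    System.out.println("orig = " + f.get("orig"));
  }
}
```

With this arrangement, the summary and cached-page code that reads the url field always lands on a page that has parseText and Content.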

> Empty Summaries and Cached Pages
> --------------------------------
>
>                 Key: NUTCH-613
>                 URL: https://issues.apache.org/jira/browse/NUTCH-613
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, searcher, web gui
>    Affects Versions: 0.9.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 0.9.0, 1.0.0
>
>         Attachments: NUTCH-613-1-20080219.patch
>
>
> There is a bug where some search results do not have summaries, and viewing 
> their cached pages causes a NullPointerException.  This bug is due to 
> redirects getting stored under the new url while the getUrl method of 
> FetchedSegments gets the wrong (old) url, which is stored in the crawldb but 
> has no content or parse objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.