[ https://issues.apache.org/jira/browse/NUTCH-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12570117#action_12570117 ]
Dennis Kubes commented on NUTCH-613:
------------------------------------

It seems to me that this code inside the basic indexing filter is wrong and is what is causing the problem:

    // url is both stored and indexed, so it's both searchable and returned
    doc.add(new Field("url", reprUrlString == null ? urlString : reprUrlString,
                      Field.Store.YES, Field.Index.TOKENIZED));

    if (reprUrlString != null) {
      // also store original url as both stored and indexed
      doc.add(new Field("orig", urlString, Field.Store.YES, Field.Index.TOKENIZED));
    }

OK, some background. The fetcher goes to get page A, called sourceA, and gets redirected to targetA. Both sourceA and targetA are stored in the segments and the crawldb, but sourceA doesn't have parseText, parseData, or Content, only crawl fetch. targetA has everything. In its metadata, targetA has a reprURL possibly pointing to itself, possibly to a different version of itself due to normalization, but more likely pointing to its source, in this case sourceA.

Now we come to the indexer. Here we add the reprURL, sourceA, as the url field and targetA as the orig field. Then, when getting the summary (before the patch), it got the url field, sourceA, which has no parse objects and hence no summaries and no content, so we hit a null pointer when trying to get the cached page.

IMO, url should point to targetA and orig should point to sourceA. Essentially flipped from what it is here.

> Empty Summaries and Cached Pages
> --------------------------------
>
>                 Key: NUTCH-613
>                 URL: https://issues.apache.org/jira/browse/NUTCH-613
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, searcher, web gui
>    Affects Versions: 0.9.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 0.9.0, 1.0.0
>
>         Attachments: NUTCH-613-1-20080219.patch
>
>
> There is a bug where some search results do not have summaries and viewing
> their cached pages causes a NullPointer.
> This bug is due to redirects getting stored under the new url and the getURL
> method of FetchedSegments getting the wrong (old) url which is stored in
> crawldb but has no content or parse objects.
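
For illustration only, the flip proposed in the comment (url pointing at the page that was actually fetched and parsed, orig pointing at the representative url) might look roughly like the sketch below. It is not the attached patch. To stay self-contained it models the document as a plain map rather than a Lucene Document, and the class name UrlFieldSketch, the method urlFields, and the example.com urls are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UrlFieldSketch {

  /**
   * Hypothetical helper, not Nutch code: chooses the values for the
   * "url" and "orig" index fields with the assignment flipped as the
   * comment suggests.
   *
   * @param urlString     url the content and parse data are stored under (targetA)
   * @param reprUrlString representative url from the metadata, may be null (sourceA)
   */
  public static Map<String, String> urlFields(String urlString, String reprUrlString) {
    Map<String, String> fields = new LinkedHashMap<>();

    // "url" always names the page that has content and parse objects,
    // so summary and cached-page lookups by this field can succeed.
    fields.put("url", urlString);

    if (reprUrlString != null) {
      // Keep the representative (pre-redirect) url in "orig".
      fields.put("orig", reprUrlString);
    }
    return fields;
  }

  public static void main(String[] args) {
    // Redirect case: sourceA redirected to targetA.
    System.out.println(urlFields("http://example.com/targetA", "http://example.com/sourceA"));
    // No representative url recorded: only "url" is set.
    System.out.println(urlFields("http://example.com/targetA", null));
  }
}
```

In the real indexing filter, each map entry would correspond to a doc.add(new Field(...)) call like the ones quoted in the comment.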