[
https://issues.apache.org/jira/browse/NUTCH-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16132993#comment-16132993
]
hussein Al_Ahmad edited comment on NUTCH-1690 at 8/19/17 3:15 PM:
------------------------------------------------------------------
You should check whether status == CrawlStatus.STATUS_DUPLICATED in the
indexingJob and skip the page if so. Otherwise the duplicated page will be
indexed in the next cycle if you're using -all for batchId and the URL isn't
generated in that cycle.
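
A minimal, self-contained sketch of the check suggested above. It does not use
the real Nutch classes: Page, the status constants (including the numeric
values), and selectForIndexing are hypothetical stand-ins, assumed only for
illustration. The point is that the filter looks at the page status alone, so a
duplicated page is skipped even when -all selects every batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an indexing pass; not the actual Nutch IndexingJob.
final class DuplicateSkipSketch {
    // Assumed status codes; the real values are defined by Nutch's CrawlStatus.
    static final int STATUS_FETCHED = 2;
    static final int STATUS_DUPLICATED = 99; // hypothetical value

    // Hypothetical, simplified page record (URL plus crawl status only).
    static final class Page {
        final String url;
        final int status;

        Page(String url, int status) {
            this.url = url;
            this.status = status;
        }
    }

    // A page is indexable unless it has been flagged as a duplicate.
    static boolean shouldIndex(Page page) {
        return page.status != STATUS_DUPLICATED;
    }

    // Returns the URLs that would be sent to the index, skipping duplicates.
    static List<String> selectForIndexing(List<Page> pages) {
        List<String> urls = new ArrayList<>();
        for (Page page : pages) {
            if (shouldIndex(page)) {
                urls.add(page.url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
            new Page("http://example.com/a", STATUS_FETCHED),
            new Page("http://example.com/b", STATUS_DUPLICATED));
        System.out.println(selectForIndexing(pages));
    }
}
```
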
was (Author: opethema):
If you are using -all for batchId, you should also remove UPDATEDB_MARK (if it
exists); otherwise the duplicated URLs will be indexed again if they aren't
generated in the next cycle.
> IndexClean: mark url as unindexed after clean to not delete again
> -----------------------------------------------------------------
>
> Key: NUTCH-1690
> URL: https://issues.apache.org/jira/browse/NUTCH-1690
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Tien Nguyen Manh
> Priority: Minor
> Fix For: 2.5
>
> Attachments: NUTCH-1690.patch
>
>
> We should mark a deleted page so that it is not deleted again and again. That
> can be done simply by removing the Index marker when we delete.
> I also changed solrclean to delete duplicated URLs.
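
The idea in the description can be sketched as follows. This is an
illustration under stated assumptions, not the attached patch: Page, the
"index_mark" key, the gone flag, and clean are hypothetical names, and the
sketch assumes the clean step only deletes pages that still carry the index
marker. Removing the marker at delete time makes a second clean pass a no-op
for that page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an index clean pass; not the actual Nutch code.
final class IndexCleanSketch {
    // Hypothetical marker key; the real marker is defined by Nutch.
    static final String INDEX_MARK = "index_mark";

    // Hypothetical, simplified page record.
    static final class Page {
        final String url;
        final boolean gone; // true if the page should no longer be in the index
        final Map<String, String> markers = new HashMap<>();

        Page(String url, boolean gone, boolean indexed) {
            this.url = url;
            this.gone = gone;
            if (indexed) {
                markers.put(INDEX_MARK, "batch-1"); // placeholder batch id
            }
        }
    }

    // Returns the URLs for which a delete request would be issued.
    static List<String> clean(List<Page> pages) {
        List<String> deleted = new ArrayList<>();
        for (Page page : pages) {
            // Assumption: only pages still marked as indexed need deleting.
            if (page.gone && page.markers.containsKey(INDEX_MARK)) {
                deleted.add(page.url);
                // Drop the marker so the next pass does not delete again.
                page.markers.remove(INDEX_MARK);
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<Page> pages = new ArrayList<>();
        pages.add(new Page("http://example.com/gone", true, true));
        pages.add(new Page("http://example.com/live", false, true));
        System.out.println("first pass:  " + clean(pages));
        System.out.println("second pass: " + clean(pages));
    }
}
```
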
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)