Hello Markus.

Before running the commands I dumped the crawldb and confirmed again that the
document status is 5 (db_redir_perm). I then ran both commands, with the same
result: the 301 documents still exist in Solr.


1.      sudo bin/nutch clean crawl/crawldb/

2.      sudo bin/nutch solrclean crawl/crawldb/


No exchange was configured. The documents will be routed to all index writers.
SolrIndexer: deleting 1000/1000 documents
SolrIndexer: deleting 1000/2000 documents
SolrIndexer: deleting 1000/3000 documents
SolrIndexer: deleting 1000/4000 documents
SolrIndexer: deleting 270/4270 documents

Did I miss anything here?

Regards,
Hany
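
For reference, the crawldb check described above can be sketched roughly as
follows. This is only an illustration, not part of Nutch: the helper name
list_redirects and the dump path are hypothetical, and it assumes the
plain-text dump layout in which each record starts with the URL, a tab, and
"Version: N", followed by a "Status: <code> (<name>)" line.

```shell
#!/bin/sh
# List URLs whose CrawlDb status is a redirect (db_redir_perm or db_redir_temp).
# Assumption: input is a plain-text crawldb dump where each record begins with
# "<url><TAB>Version: N" and contains a line "Status: <code> (<name>)".
list_redirects() {
  awk -F '\t' '
    $2 ~ /^Version:/ { url = $1; next }        # first line of a record: remember the URL
    /^Status: / && /db_redir_(perm|temp)/ {    # status line of the current record
      print url "\t" $0
    }
  ' "$@"
}

# Example (hypothetical output directory), after dumping with:
#   bin/nutch readdb crawl/crawldb/ -dump crawldb-dump
# list_redirects crawldb-dump/part-*
```

Each matching record is printed as the URL, a tab, and its status line, which
makes it easy to compare against what is still present in Solr.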

From: Markus Jelsma <[email protected]>
Sent: Tuesday, March 9, 2021 11:19 AM
To: [email protected]
Subject: EXTERNAL: Re: Re: 301 perm redirect pages are still in Solr

Hello Hany,

Sure, check these commands:

 solrclean         remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
 clean             remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins

Regards,
Markus

On Tue, 9 Mar 2021 at 08:49, Hany NASR <[email protected]> wrote:

> Hello Markus,
>
> I added the property in nutch-site.xml with no luck.
>
> The documents still exist in Solr; any advice?
>
> Regards,
> Hany
>
> From: Markus Jelsma <[email protected]>
> Sent: Monday, March 8, 2021 3:40 PM
> To: [email protected]
> Subject: EXTERNAL: Re: 301 perm redirect pages are still in Solr
>
> Hello Hany,
>
> You need to tell the indexer to delete those records. This will help:
>
>   <!-- delete gone and redirects -->
>  <property>
>    <name>indexer.delete</name>
>    <value>true</value>
>  </property>
>
> Regards,
> Markus
>
> On Mon, 8 Mar 2021 at 15:31, Hany NASR <[email protected]> wrote:
>
> > Hi All,
> >
> > I'm using Nutch 1.15 and found that permanent redirect pages (301)
> > are still indexed and not removed from Solr.
> >
> > When I exported the crawlDB I found the page Status: 5 (db_redir_perm).
> >
> > How can I keep the Solr index up to date and make Nutch clean these pages
> > automatically?
> >
> > Regards,
> > Hany
> >
> > -----------------------------------------
> > SAVE PAPER - THINK BEFORE YOU PRINT!
> >
> > This E-mail is confidential.
> >
> > It may also be legally privileged. If you are not the addressee you may
> > not copy, forward, disclose or use any part of it. If you have received
> > this message in error, please delete it and all copies from your system
> > and notify the sender immediately by return E-mail.
> >
> > Internet communications cannot be guaranteed to be timely secure, error
> > or virus-free.
> > The sender does not accept liability for any errors or omissions.
> >