Today I had a look at the code and wrote this class. It works here on my
test cluster.

It scans the CrawlDB for entries carrying the STATUS_DB_GONE status and
issues a delete to Solr for each of those entries.

Is that what you guys have in mind? Should I file a JIRA issue?
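
In case it helps, this is roughly how I invoke it here (a sketch only: the
crawldb path and the Solr URL are placeholders from my test setup, and it
assumes bin/nutch accepts a fully qualified class name):

  bin/nutch org.apache.nutch.indexer.solr.Solr404Deleter \
      crawl/crawldb http://localhost:8983/solr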



On 1/24/11 10:26 AM, Markus Jelsma wrote:
> Each item in the CrawlDB carries a status field. Reading the CrawlDB will
> return this information as well; the same goes for a complete dump, from
> which you could create the appropriate delete statements for your Solr
> instance.
>
> 51    /** Page no longer exists. */
> 52    public static final byte STATUS_DB_GONE = 0x03; 
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
>
>> Where is that information stored? It could then easily be used to issue
>> deletes on Solr.
>>
>> On 1/23/11 10:32 PM, Markus Jelsma wrote:
>>> Nutch can detect 404s by recrawling existing URLs. The mutation,
>>> however, is not pushed to Solr at the moment.
>>>
>>>> As far as I know, Nutch can only discover new URLs to crawl and send the
>>>> parsed content to Solr. But what about maintaining the index? Say that
>>>> you have a daily Nutch script that fetches/parses the web and updates
>>>> the Solr index. After one month, several web pages have been modified
>>>> and some have also been deleted. In other words, the Solr index is out
>>>> of sync.
>>>>
>>>> Is it possible to detect such changes in order to send update/delete
>>>> commands to Solr?
>>>>
>>>> It looks like the Aperture crawler has a workaround for this, since its
>>>> crawler handler has methods such as objectChanged(...):
>>>> http://sourceforge.net/apps/trac/aperture/wiki/Crawlers
>>>>
>>>> Erlend


-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.marte...@tis.bz.it http://www.tis.bz.it


/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.nutch.indexer.solr;

import java.io.IOException;
import java.net.MalformedURLException;
import java.text.SimpleDateFormat;
import java.util.Iterator;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.NullOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;
import org.apache.nutch.util.TimingUtil;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;

/**
 * Scans the CrawlDB looking for entries with status DB_GONE (404) and
 * sends delete requests to Solr for those documents.
 *
 * @author Claudio Martella
 */
public class Solr404Deleter implements Tool {
        public static final Log LOG = LogFactory.getLog(Solr404Deleter.class);
        private Configuration conf;
        
        @Override
        public Configuration getConf() {
                return conf;
        }

        @Override
        public void setConf(Configuration conf) {
                this.conf = conf;
        }

        public static class DBFilter implements Mapper<Text, CrawlDatum, ByteWritable, Text> {
                private final ByteWritable OUT = new ByteWritable(CrawlDatum.STATUS_DB_GONE);
                
                @Override
                public void configure(JobConf arg0) { }

                @Override
                public void close() throws IOException { }

                @Override
                public void map(Text key, CrawlDatum value,
                                OutputCollector<ByteWritable, Text> output, Reporter reporter)
                                throws IOException {
                        // Emit only URLs whose status is DB_GONE; the constant key
                        // groups them all for the reducer.
                        if (value.getStatus() == CrawlDatum.STATUS_DB_GONE) {
                                output.collect(OUT, key);
                        }
                }
        }

        public static class SolrDeleter
                        implements Reducer<ByteWritable, Text, Text, ByteWritable> {
                private static final int NUM_MAX_DELETE_REQUEST = 1000;
                private int numDeletes = 0;
                private int totalDeleted = 0;
                private SolrServer solr;
                private UpdateRequest updateRequest = new UpdateRequest();
                
                @Override
                public void configure(JobConf job) {
                        try {
                                solr = new CommonsHttpSolrServer(
                                                job.get(SolrConstants.SERVER_URL));
                        } catch (MalformedURLException e) {
                                throw new RuntimeException(e);
                        }
                }

                @Override
                public void close() throws IOException {
                        try {
                                if (numDeletes > 0) {
                                        LOG.info("Solr404Deleter: deleting "
                                                        + numDeletes + " documents");
                                        updateRequest.process(solr);
                                        totalDeleted += numDeletes;
                                }
                                // Commit so the deletes become visible to searchers.
                                solr.commit();
                                LOG.info("Solr404Deleter: deleted a total of "
                                                + totalDeleted + " documents");
                        } catch (SolrServerException e) {
                                throw new IOException(e);
                        }
                }

                @Override
                public void reduce(ByteWritable key, Iterator<Text> values,
                                OutputCollector<Text, ByteWritable> output, Reporter reporter)
                                throws IOException {
                        // Buffer deletes and flush every NUM_MAX_DELETE_REQUEST documents.
                        while (values.hasNext()) {
                                Text document = values.next();
                                updateRequest.deleteById(document.toString());
                                numDeletes++;
                                if (numDeletes >= NUM_MAX_DELETE_REQUEST) {
                                        try {
                                                LOG.info("Solr404Deleter: 
deleting " + numDeletes + " documents");
                                                updateRequest.process(solr);
                                        } catch (SolrServerException e) {
                                                throw new IOException(e);
                                        }
                                        updateRequest = new UpdateRequest();
                                        totalDeleted += numDeletes;
                                        numDeletes = 0;
                                }
                        }
                }
        }

        public void delete(String crawldb, String solrUrl) throws IOException {
                SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                long start = System.currentTimeMillis();
                LOG.info("Solr404Deleter: starting at " + sdf.format(start));
                LOG.info("Solr404Deleter: scanning " + crawldb);
                LOG.info("Solr404Deleter: Solr url: " + solrUrl);

                JobConf job = new NutchJob(getConf());

                FileInputFormat.addInputPath(job, new Path(crawldb, CrawlDb.CURRENT_NAME));
                job.set(SolrConstants.SERVER_URL, solrUrl);
                job.setInputFormat(SequenceFileInputFormat.class);
                job.setOutputFormat(NullOutputFormat.class);
                job.setMapOutputKeyClass(ByteWritable.class);
                job.setMapOutputValueClass(Text.class);
                job.setMapperClass(DBFilter.class);
                job.setReducerClass(SolrDeleter.class);

                JobClient.runJob(job);

                long end = System.currentTimeMillis();
                LOG.info("Solr404Deleter: finished at " + sdf.format(end) + ", 
elapsed: " + TimingUtil.elapsedTime(start, end));
        }

        public int run(String[] args) throws IOException {
                if (args.length != 2) {
                        System.err.println("Usage: Solr404Deleter <crawldb> <solrurl>");
                        return 1;
                }

                delete(args[0], args[1]);

                return 0;
        }

        public static void main(String[] args) throws Exception {
                int result = ToolRunner.run(NutchConfiguration.create(),
                                new Solr404Deleter(), args);
                System.exit(result);
        }
}
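
For a quick sanity check after the job finishes, I query Solr for one of the
URLs that should now be gone. A minimal SolrJ sketch, assuming the usual
Nutch schema where the page URL is stored in the uniqueKey field "id" (the
server address and the example URL are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CheckGone {
        public static void main(String[] args) throws Exception {
                // Placeholder address; point this at your Solr instance.
                SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
                // Placeholder URL; Nutch stores the page URL in the "id" field.
                SolrQuery query = new SolrQuery("id:\"http://example.com/gone-page\"");
                QueryResponse response = solr.query(query);
                // Zero hits means the delete (and the commit) went through.
                System.out.println("hits: " + response.getResults().getNumFound());
        }
}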
