Hi Sebastian

Thanks for your reply.

I was using readdb to see what was happening. It looked like this:

Two pages indexed:

  https://www2.test.le.ac.uk/sh23 Version: 7
  Status: 2 (db_fetched)
  --
  https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
  Status: 2 (db_fetched)

Deleted https://www2.test.le.ac.uk/sh23/sleepy-zebra-page

After update, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is marked as db_gone, as expected:

  https://www2.test.le.ac.uk/sh23 Version: 7
  Status: 2 (db_fetched)
  --
  https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
  Status: 3 (db_gone)

(After invertlinks, there was no change.)

After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no longer present:

  https://www2.test.le.ac.uk/sh23 Version: 7
  Status: 2 (db_fetched)
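
(For reference, the dumps above came from the readdb tool, invoked along
these lines; the output directory is just a placeholder:

  bin/nutch readdb $CRAWL_PATH/crawldb -dump <some_output_dir>

readdb -stats gives the per-status counts and -url <url> prints a single
record, which is sometimes quicker for spot checks.)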

Neither index nor clean clears the 404s from Solr.


I'm just using the commands as given in bin/crawl from Nutch 1.9:

  $bin/nutch dedup $CRAWL_PATH/crawldb

"$bin/nutch" index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT


When I added an extra clean before dedup, Solr got the instruction to remove the deleted document.
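
Concretely, the ordering that worked is roughly this (a sketch rather than a
patch against bin/crawl; variables as used in the script):

  # extra clean, while the db_gone record is still in the crawldb
  "$bin/nutch" clean -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb
  $bin/nutch dedup $CRAWL_PATH/crawldb
  "$bin/nutch" index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
  "$bin/nutch" clean -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb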

There's nothing much in nutch-site.xml. It's mostly limits to make testing easier, plus a static field added, metadata processing removed, and db.update.purge.404 enabled.

<?xml version="1.0"?>
<configuration>
 <property>
  <name>http.agent.name</name>
  <value>nutch-solr-integration</value>
 </property>
 <property>
  <name>generate.max.per.host</name>
  <value>100</value>
 </property>
 <property>
  <name>plugin.includes</name>
  
<value>protocol-httpclient|urlfilter-regex|index-(basic|more|static)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|static)</value>
 </property>
 <property>
   <name>index.static</name>
   <value>_indexname:sitecore_web_index,_created_by_nutch:true</value>
   <description>
Used by plugin index-static to adds fields with static data at indexing time. You can specify a comma-separated list of fieldname:fieldcontent per Nutch job.
  Each fieldcontent can have multiple values separated by space, e.g.,
   field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...
   It can be useful when collections can't be created by URL patterns,
  like in subcollection, but on a job-basis.
  </description>
 </property>
 <property>
  <name>http.timeout</name>
  <value>5000</value>
  <description>The default network timeout, in milliseconds.</description>
 </property>
 <property>
  <name>fetcher.server.delay</name>
  <value>0.1</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server. Note that this might get
   overridden by a Crawl-Delay from a robots.txt and is used ONLY if
   fetcher.threads.per.queue is set to 1.
   </description>
 </property>
 <property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
<description>The default number of seconds between re-fetches of a page (30 days).
  </description>
 </property>
 <property>
  <name>db.update.purge.404</name>
  <value>true</value>
  <description>If true, updatedb will add purge records with status DB_GONE
  from the CrawlDB.
  </description>
 </property>
</configuration>

Steven

On Sat, 4 Jul 2015, Sebastian Nagel wrote:

Hi Steven,

> is the ordering of dedup and index wrong
No, that's correct: it would not really be efficient to first index duplicates
and then remove them afterwards.

If I understand right, the db_gone pages have previously been indexed
(and were successfully fetched), right?

but "bin/nutch dedup" removes the records entirely
A dedup job should neither remove records entirely,
they are only set to status db_duplicate, nor should
it touch anything except db_fetched and db_notmodified.
If it does that's a bug.

Can you send the exact commands of "nutch dedup" and "nutch index"?
Have you checked the crawldb before and after using "bin/nutch readdb"
to get some hints about what's special with these URLs or documents?

Thanks,
Sebastian


On 07/03/2015 11:37 AM, Hayles, Steven wrote:
I'm using bin/crawl on Nutch 1.9 (with Solr 4.10.3)

What I see is that "bin/nutch updatedb" sets db_gone status correctly, but
"bin/nutch dedup" removes the records entirely before "bin/nutch index" can
tell Solr to remove them from its index.

Is dedup doing more than it should, is the ordering of dedup and index wrong, 
or is there some configuration that I have wrong?

Thanks

Steven Hayles
Systems Analyst

IT Services, University of Leicester,
Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK

T: +44 (0)116 229 7950
E: s...@le.ac.uk




