Hi Sebastian
Thanks for your reply.
I was using readdb to see what was happening. It looked like this:
Two pages indexed:
https://www2.test.le.ac.uk/sh23 Version: 7
Status: 2 (db_fetched)
--
https://www2.test.le.ac.uk/sh23/sleepy-zebra-page Version: 7
Status: 2 (db_fetched)
I then deleted https://www2.test.le.ac.uk/sh23/sleepy-zebra-page from the site.
After updatedb, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is marked
as db_gone, as expected:
https://www2.test.le.ac.uk/sh23 Version: 7
Status: 2 (db_fetched)
--
https://www2.test.le.ac.uk/sh23/sleepy-zebra-page Version: 7
Status: 3 (db_gone)
(After invertlinks, there was no change.)
After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no
longer present in the CrawlDB:
https://www2.test.le.ac.uk/sh23 Version: 7
Status: 2 (db_fetched)
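For reference, the dumps above came from readdb invocations along these
lines (crawl/crawldb is just my local path, so treat it as an example):

  bin/nutch readdb crawl/crawldb -dump crawldb-dump
  bin/nutch readdb crawl/crawldb -url https://www2.test.le.ac.uk/sh23/sleepy-zebra-page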
Neither index nor clean clears 404s from Solr.
I'm just using the commands as given in bin/crawl from Nutch 1.9:
$bin/nutch dedup $CRAWL_PATH/crawldb
"$bin/nutch" index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb
-linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
When I added an extra clean before dedup, Solr got the instruction to
remove the deleted document.
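To be concrete, the sequence that worked looked roughly like this; the
clean line is the one I added (reusing the clean invocation that
bin/crawl already runs after indexing):

  "$bin/nutch" clean -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb
  $bin/nutch dedup $CRAWL_PATH/crawldb
  "$bin/nutch" index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb \
    -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT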
There's nothing much in nutch-site.xml. It's mostly limits to make testing
easier, plus a static field added, metadata processing removed, and
db.update.purge.404 enabled.
<?xml version="1.0"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-solr-integration</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>100</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|index-(basic|more|static)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|static)</value>
</property>
<property>
<name>index.static</name>
<value>_indexname:sitecore_web_index,_created_by_nutch:true</value>
<description>
Used by plugin index-static to add fields with static data at indexing
time. You can specify a comma-separated list of fieldname:fieldcontent
per Nutch job. Each fieldcontent can have multiple values separated by
space, e.g.,
field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...
It can be useful when collections can't be created by URL patterns,
like in subcollection, but on a job-basis.
</description>
</property>
<property>
<name>http.timeout</name>
<value>5000</value>
<description>The default network timeout, in milliseconds.</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>0.1</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server. Note that this might get
overriden by a Crawl-Delay from a robots.txt and is used ONLY if
fetcher.threads.per.queue is set to 1.
</description>
</property>
<property>
<name>db.fetch.interval.default</name>
<value>60</value>
<description>The default number of seconds between re-fetches of a page
(stock default is 30 days; set to 60 seconds here for testing).
</description>
</property>
<property>
<name>db.update.purge.404</name>
<value>true</value>
<description>If true, updatedb will purge records with status DB_GONE
from the CrawlDB.
</description>
</property>
</configuration>
Steven
On Sat, 4 Jul 2015, Sebastian Nagel wrote:
Hi Steven,
is the ordering of dedup and index wrong
No, that's correct: it would not be very efficient to first index
duplicates and then remove them afterwards.
If I understand right, the db_gone pages have previously been indexed
(and were successfully fetched), right?
but "bin/nutch dedup" removes the records entirely
A dedup job should not remove records entirely (they are only set to
status db_duplicate), nor should it touch anything except db_fetched
and db_notmodified. If it does, that's a bug.
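A quick way to verify: compare the status counts from

  bin/nutch readdb crawl/crawldb -stats

before and after the dedup job (the crawldb path is just an example).
The total number of records should stay the same; genuine duplicates
only move from db_fetched/db_notmodified to db_duplicate.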
Can you send the exact commands of "nutch dedup" and "nutch index"?
Have you checked the crawldb before and after using "bin/nutch readdb"
to get some hints what's special with these urls or documents?
Thanks,
Sebastian
On 07/03/2015 11:37 AM, Hayles, Steven wrote:
I'm using bin/crawl on Nutch 1.9 (with Solr 4.10.3)
What I see is that "bin/nutch updatedb" sets db_gone status correctly, but
"bin/nutch dedup" removes the records entirely before "bin/nutch index" can
tell Solr to remove them from its index.
Is dedup doing more than it should, is the ordering of dedup and index wrong,
or is there some configuration that I have wrong?
Thanks
Steven Hayles
Systems Analyst
IT Services, University of Leicester,
Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK
T: +44 (0)116 229 7950
E: s...@le.ac.uk