Hi Sebastian

Thanks for the explanation.

If db.update.purge.404 is not set, would records with status DB_GONE stay in the CrawlDB forever, and would Solr be repeatedly told to remove them?

Steven Hayles
Systems Analyst

IT Services, University of Leicester,
Prospect House, 94 Regent Rd, Leicester, LE1 7DA, UK

T: +44 (0)116 229 7950
E: s...@le.ac.uk

The Queen's Anniversary Prizes 1994, 2002 & 2013
THE Awards Winners 2007-2013

Elite without being elitist

Follow us on Twitter http://twitter.com/uniofleicester or
visit our Facebook page https://facebook.com/UniofLeicester


On Tue, 14 Jul 2015, Sebastian Nagel wrote:

Hi Steven,

thanks for reporting the issue.

I tried to reproduce the problem, without success. While looking back at
the conversation, I found that this property could be the reason:

<property>
 <name>db.update.purge.404</name>
 <value>true</value>
 <description>If true, updatedb will purge records with status DB_GONE
 from the CrawlDB.
 </description>
</property>

The dedup job shares some code with the update job, namely the
CrawlDbFilter mapper, which will filter away all db_gone records if
db.update.purge.404 is true. That's not really wrong (the next update job
would remove the gone pages anyway), but it should be clearly documented.
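
If you want to double-check that on your side, the per-status counts from
readdb should show it; this is just a sketch, assuming your CrawlDB lives
at $CRAWL_PATH/crawldb as in bin/crawl:

  # per-status counts before dedup (note the db_gone line)
  $bin/nutch readdb $CRAWL_PATH/crawldb -stats

  $bin/nutch dedup $CRAWL_PATH/crawldb

  # with db.update.purge.404=true the db_gone records should now be gone
  $bin/nutch readdb $CRAWL_PATH/crawldb -stats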

Thanks again,
Sebastian

2015-07-07 10:30 GMT+02:00 Steven Hayles <s...@leicester.ac.uk>:


Created https://issues.apache.org/jira/browse/NUTCH-2060

In fact, "bin/crawl" uses "bin/nutch clean" rather than the -deleteGone
option on "bin/nutch index".

As a workaround, I've added "bin/nutch clean" before "bin/nutch dedup".
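
Concretely, the workaround just moves the cleaning step in front of dedup,
something like this (a sketch only, reusing the variables bin/crawl already
defines and the same -D solr.server.url it passes to index):

  # send deletions for the db_gone records to Solr while they are
  # still present in the CrawlDB, then dedup as before
  "$bin/nutch" clean -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb
  "$bin/nutch" dedup "$CRAWL_PATH"/crawldb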

Steven Hayles
Systems Analyst



On Mon, 6 Jul 2015, Sebastian Nagel wrote:

 Hi Steven,

 After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no
 longer present

That's a bug. It should be there, no question. Could you please open a
Jira issue [1]?

The index command needs the option
  -deleteGone
to send deletions to Solr. But if the db_gone pages disappeared, that has
no effect, of course :)
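
For reference, added to the index call your bin/crawl already runs it would
look roughly like this (a sketch only, reusing the same variables):

  "$bin/nutch" index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb \
    -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT -deleteGone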

Thanks,
Sebastian


2015-07-06 10:07 GMT+02:00 Steven Hayles <s...@leicester.ac.uk>:


Hi Sebastian

Thanks for your reply.

I was using readdb to see what was happening. It looked like this

Two pages indexed:

  https://www2.test.le.ac.uk/sh23 Version: 7
  Status: 2 (db_fetched)
  --
  https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
  Status: 2 (db_fetched)

Deleted https://www2.test.le.ac.uk/sh23/sleepy-zebra-page

After update, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is marked
as db_gone, as expected:

  https://www2.test.le.ac.uk/sh23 Version: 7
  Status: 2 (db_fetched)
  --
  https://www2.test.le.ac.uk/sh23/sleepy-zebra-page       Version: 7
  Status: 3 (db_gone)

(After invert links, there was no change)

After dedup, https://www2.test.le.ac.uk/sh23/sleepy-zebra-page is no
longer present:

  https://www2.test.le.ac.uk/sh23 Version: 7
  Status: 2 (db_fetched)

Neither index nor clean clears 404s from Solr.


I'm just using the commands as given in bin/crawl from Nutch 1.9:

  $bin/nutch dedup $CRAWL_PATH/crawldb

  "$bin/nutch" index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb
-linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT


When I added an extra clean before dedup, Solr got the instruction to
remove the deleted document.

There's nothing much in nutch-site.xml. It's mostly limits to make testing
easier: a static field added, metadata processing removed, and
db.update.purge.404 enabled.

<?xml version="1.0"?>
<configuration>
 <property>
  <name>http.agent.name</name>
  <value>nutch-solr-integration</value>
 </property>
 <property>
  <name>generate.max.per.host</name>
  <value>100</value>
 </property>
 <property>
  <name>plugin.includes</name>


<value>protocol-httpclient|urlfilter-regex|index-(basic|more|static)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|static)</value>
 </property>
 <property>
   <name>index.static</name>
   <value>_indexname:sitecore_web_index,_created_by_nutch:true</value>
   <description>
    Used by plugin index-static to add fields with static data at indexing time.
    You can specify a comma-separated list of fieldname:fieldcontent per Nutch job.
    Each fieldcontent can have multiple values separated by space, e.g.,
    field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...
    It can be useful when collections can't be created by URL patterns,
    like in subcollection, but on a job-basis.
   </description>
 </property>
 <property>
  <name>http.timeout</name>
  <value>5000</value>
  <description>The default network timeout, in
milliseconds.</description>
 </property>
 <property>
  <name>fetcher.server.delay</name>
  <value>0.1</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server. Note that this might get
   overridden by a Crawl-Delay from a robots.txt and is used ONLY if
   fetcher.threads.per.queue is set to 1.
   </description>
 </property>
 <property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page
  (set to 60 seconds here to make testing easier).
  </description>
 </property>
 <property>
  <name>db.update.purge.404</name>
  <value>true</value>
  <description>If true, updatedb will purge records with status DB_GONE
  from the CrawlDB.
  </description>
 </property>
</configuration>

Steven


On Sat, 4 Jul 2015, Sebastian Nagel wrote:

 Hi Steven,


 is the ordering of dedup and index wrong


 No, that's correct: it would not be very efficient to first index
duplicates and then remove them afterwards.

If I understand correctly, the db_gone pages have previously been indexed
(and were successfully fetched), right?

 but "bin/nutch dedup" removes the records entirely


 A dedup job should not remove records entirely (duplicates are only
set to status db_duplicate), nor should it touch anything except
db_fetched and db_notmodified records. If it does, that's a bug.

Can you send the exact commands of "nutch dedup" and "nutch index"?
Have you checked the crawldb before and after using "bin/nutch readdb"
to get some hints what's special with these urls or documents?
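
Something along these lines should do it (just a sketch; substitute your
own CrawlDB path and the URL of the affected page):

  # dump the full record (status, metadata, fetch times) for one URL
  bin/nutch readdb $CRAWL_PATH/crawldb -url <url-of-the-affected-page>

  # or dump the whole CrawlDB as text and inspect it
  bin/nutch readdb $CRAWL_PATH/crawldb -dump crawldb-dump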

Thanks,
Sebastian


On 07/03/2015 11:37 AM, Hayles, Steven wrote:

 I'm using bin/crawl on Nutch 1.9 (with Solr 4.10.3)

What I see is that "bin/nutch updatedb" sets db_gone status correctly, but
"bin/nutch dedup" removes the records entirely before "bin/nutch index" can
tell Solr to remove them from its index.

Is dedup doing more than it should, is the ordering of dedup and index
wrong, or is there some configuration that I have wrong?
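
For context, the per-segment steps that bin/crawl runs after fetching and
parsing are roughly these (paraphrased from the 1.9 script, not copied
verbatim):

  $bin/nutch updatedb $CRAWL_PATH/crawldb $CRAWL_PATH/segments/$SEGMENT
  $bin/nutch invertlinks $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
  $bin/nutch dedup $CRAWL_PATH/crawldb
  $bin/nutch index -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb \
    -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
  $bin/nutch clean -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb
  # by the time index and clean run, dedup has already dropped the
  # db_gone records, so nothing tells Solr to delete those documents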

Thanks

Steven Hayles
Systems Analyst
