Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?

lewis john mcgibbney Sat, 16 Jul 2011 04:30:20 -0700

Hi Gabriele,

At first this seems like a plausible argument; however, my question concerns
what Nutch would do if we wished to change the Solr core that we index to.


If we removed this functionality from the crawldb there would be no way to
determine what Nutch was to fetch and what it wasn't.
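
To illustrate, here is a minimal sketch of that kind of bookkeeping. It is
not Nutch's actual code: the CrawlEntry class, the status strings, the
due_for_fetch function and the example URLs are all hypothetical, and only
the three fields (url, fetch status, fetch date) come from the description
quoted below.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class CrawlEntry:
    """Hypothetical stand-in for one crawldb record (not Nutch's real class)."""
    url: str
    status: str                      # assumed values: "unfetched", "fetched", "gone"
    fetch_time: Optional[datetime]   # time of the last fetch, None if never fetched


def due_for_fetch(crawldb: Dict[str, CrawlEntry],
                  now: datetime,
                  recrawl_period: timedelta) -> List[str]:
    """Return the URLs that were never fetched or whose re-crawl period has elapsed."""
    due = []
    for entry in crawldb.values():
        if entry.status == "gone":
            continue  # pages marked as gone are skipped in this sketch
        if entry.fetch_time is None or now - entry.fetch_time >= recrawl_period:
            due.append(entry.url)
    return sorted(due)


now = datetime(2011, 7, 16)
crawldb = {
    "http://example.com/a": CrawlEntry("http://example.com/a", "fetched", datetime(2011, 7, 10)),
    "http://example.com/b": CrawlEntry("http://example.com/b", "fetched", datetime(2011, 5, 1)),
    "http://example.com/c": CrawlEntry("http://example.com/c", "unfetched", None),
}

print(due_for_fetch(crawldb, now, timedelta(days=30)))
```

In this sketch the decision uses only the crawl state, so it does not depend
on which Solr core the pages are indexed to. The entry for /c has never been
fetched, so an index of fetched pages would presumably hold no record of it.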

On Sat, Jul 16, 2011 at 1:00 AM, Gabriele Kahlout
<[email protected]> wrote:

> Hello,
>
> I had this draft lurking for a while now, and before archiving for personal
> reference I wondered if it's accurate, and if you recommend posting it to
> the wiki.
>
> Nutch maintains a crawldb (and linkdb, for that matter) of the URLs it
> has crawled, their fetch status, and the fetch date. This data is kept
> after the fetch so that pages can be re-crawled once a re-crawling period
> has elapsed.
> At the same time, Solr maintains an inverted index of all the fetched pages.
> It would seem more efficient if Nutch relied on the index instead of
> maintaining its own crawldb, to avoid storing the same URL twice.
> [BUT THAT'S JUST A KEY/ID, NOT WASTE AT ALL, WOULD ALSO END UP THE SAME IN
> SOLR]
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).




-- 
*Lewis*
