Hi again Craig,

There is a deduplicator in Nutch, but it won't prevent you from crawling
these URLs indefinitely. One option would be to change the URLFilters /
Normalisers so that they reject (or collapse) a path segment that is
repeated twice in a row.
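As a rough sketch, the stock regex-urlfilter.txt ships a rule against a
segment repeating three times; a tightened variant catching two consecutive
identical segments could look like `-.*(/[^/]+)\1/`. This is an assumption
about your setup, not a tested Nutch config — the snippet below just checks
the regex itself against the URLs from your mail:

```python
import re

# Hypothetical two-in-a-row segment rule, adapted from the style of the
# repeated-segment rule in Nutch's default regex-urlfilter.txt.
# (/[^/]+) captures one path segment including its leading slash;
# \1 requires the very same segment to follow immediately.
repeated_segment = re.compile(r"(/[^/]+)\1/")

urls = [
    # repeated "another-section" -> should be rejected
    "http://www.example.com/site/news/2014/section/subsection/"
    "another-section/another-section/file/Long-Document-Name.html",
    # no repetition -> should be kept
    "http://www.example.com/site/news/2014/section/subsection/"
    "another-section/file/Long-Document-Name.html",
]

for url in urls:
    verdict = "blocked" if repeated_segment.search(url) else "allowed"
    print(verdict, url)
```

Note that a backreference like this would also reject a legitimately
repeated segment (e.g. a site that really has /en/en/ somewhere), so it is
worth checking your crawl's URL space before enabling it.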

How do you run your crawl BTW? Do you use the crawl script?


On 9 July 2014 23:44, Craig Leinoff <[email protected]> wrote:

> For what it's worth, as a result of sending this message I have been able
> to advance a little bit in this area.
>
> It had seemed to me that new, relevant URLs were indeed being fetched and
> parsed, so it was confusing as to why Solr had such a small number of
> documents in its index. I speculated that Solr must be deduplicating
> regular results.
>
> After diving into the END of my Nutch process logs, and testing these
> "correct-looking" URLs, I see that I may well have been mistaken. The URLs
> frequently look like this:
>
>
> http://www.example.com/site/news/2014/section/subsection/another-section/another-section/file/Long-Document-Name.html
>
> http://www.example.com/site/news/2014/section/subsection/another-section/another-section/file/Long-Document-Name2.html
>
> http://www.example.com/site/news/2014/section/subsection/another-section/another-section/file/Long-Document-Name3.html
>
> What we should be noticing here is the repetition of
> "another-section/another-section". At some point Nutch hits an error page.
> The server is erroneously returning an HTTP 200 code, instead of an HTTP
> 404 "Not Found" code, and serving what is basically a broken page.
>
> This page has a number of general elements, such as a list of recent news
> articles, that exist on all pages of the site, and it contains an odious
> feature wherein the URLs of the news articles seem to just be appended
> onto the current page's URL, thereby generating ever more URLs to crawl,
> all of which resolve to identical content.
>
> I know that Nutch is configured to filter out URL segments that repeat 3
> or more times, but in this case we're already nearing 500,000 URLs to crawl.
>
> I appreciate that the website in question is, for all intents and
> purposes, "broken", and I'll do my best, but I can't rely on them to fix
> it. Is there a better methodology for identifying erroneous URLs? Perhaps
> they can be de-duped in the parsing phase, or maybe Nutch could see that
> all the CSS, JS, images, and so on are 404'ing out and somehow "guess"
> that this is a bad page?
>
> Thanks!
> Craig
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
