"Waiting for Processing" means that the URL will be retried.  There
should be a "Scheduled" value also listed which is *when* the URL will
be retried, and a "Scheduled action" column that says "Process".  If
you see these things you only need to wait until the time specified
and the document will be recrawled.
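
In case it helps to see the shape of that schedule, below is a minimal
Java sketch of the retry pattern described in the quoted messages
further down: retry at a fixed interval, and give up once an overall
window has elapsed.  The 5-minute interval, the 6-hour window, and the
class itself are illustrative assumptions for this example only; this
is not MCF's actual code or configuration.

-----
// Illustrative sketch only -- NOT MCF's implementation.  Models the
// retry behavior described below: retry a refused URL at a fixed
// interval, and give up once an overall window has elapsed.
public class RetrySchedule {
    // Assumed values for illustration; MCF's real interval and window
    // may differ and can depend on configuration.
    private static final long RETRY_INTERVAL_MS = 5L * 60L * 1000L;       // 5 minutes
    private static final long GIVE_UP_WINDOW_MS = 6L * 60L * 60L * 1000L; // 6 hours

    private final long firstFailureTime; // epoch ms of the first failure

    public RetrySchedule(long firstFailureTime) {
        this.firstFailureTime = firstFailureTime;
    }

    // Returns the next "Scheduled" time (epoch ms) for the document,
    // or -1L once the window has elapsed and the URL is given up on.
    public long nextAttempt(long lastFailureTime) {
        long next = lastFailureTime + RETRY_INTERVAL_MS;
        if (next - firstFailureTime > GIVE_UP_WINDOW_MS) {
            return -1L; // give up: the URL leaves the queue
        }
        return next;
    }
}
-----

In these terms, the "Scheduled" time the report shows is the last
failure time plus the retry interval, and "Scheduled action: Process"
means the document is still inside the retry window.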

Karl

On Wed, May 9, 2012 at 9:54 PM, Shigeki Kobayashi (Information Systems Division / Service Planning Department)
<shigeki.kobayas...@g.softbank.co.jp> wrote:
> Karl,
>
> Thanks for the reply.
>
>
>> For web crawling, no single URL failure will cause the job to
>> abort;
>
> OK, so I understand if I want it stopped, I need to manually abort the job.
>
>
>> You can check on the status of an individual URL by using the
>> Document Status report.
>
> The Document Status report says the seed URL is "Waiting for Processing",
> which makes sense because the connection is refused. The report does not
> show a retry count.
>
> The MCF log outputs an exception. Is this also expected behavior?
> -----
>
> DEBUG 2012-05-10 10:10:48,215 (Worker thread '34') - WEB: Fetch exception
> for 'http://xxx.xxx.xxx/index.html'
> java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
>     at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
>     at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>     at java.net.Socket.connect(Socket.java:529)
>     at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(Unknown Source)
>     at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(Unknown Source)
>     at org.apache.commons.httpclient.HttpConnection.open(Unknown Source)
>     at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(Unknown Source)
>     at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(Unknown Source)
>     at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(Unknown Source)
>     at org.apache.commons.httpclient.HttpClient.executeMethod(Unknown Source)
>     at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection$ExecuteMethodThread.run(ThrottledFetcher.java:1244)
>
>  WARN 2012-05-10 10:10:48,216 (Worker thread '34') - Pre-ingest service
> interruption reported for job 1335340623530 connection 'WEB': Timed out
> waiting for a connection for 'http://xxx.xxx.xxx/index.html': Connection
> refused
>
> -----
>
> Regards,
>
> Shigeki
>
>
> 2012/5/9 Karl Wright <daddy...@gmail.com>
>>
>> Hi,
>>
>> ManifoldCF's web connector is, in general, very cautious about not
>> offending the owners of sites.  If it concludes that the site has
>> blocked access to a URL, it may remove the URL from its queue for
>> politeness, which would prevent further crawling of that URL for the
>> duration of the current job.  In most cases, however, if a URL is
>> temporarily unavailable, it will be requeued for crawling at a later
>> time.  The typical pattern is to attempt to recrawl the URL
>> periodically (e.g. every 5 minutes) for many hours before giving up on
>> it.  For web crawling, no single URL failure will cause the job to
>> abort; it will continue running until all the other URLs have been
>> processed or forever (if the job is continuous).
>>
>> You can check on the status of an individual URL by using the Document
>> Status report.  This report should tell you what ManifoldCF intends to
>> do with a specific document.  If you locate one such URL and try out
>> this report, what does it say?
>>
>> Karl
>>
>>
>> On Tue, May 8, 2012 at 10:04 PM, Shigeki Kobayashi (Information Systems Division / Service Planning Department)
>> <shigeki.kobayas...@g.softbank.co.jp> wrote:
>> >
>> > Hi guys.
>> >
>> > I need some advice on stopping the MCF web crawler from a running state
>> > when a network connection is refused.
>> >
>> > I use MCF 0.5 with Solr 3.5. I was testing what would happen to the web
>> > crawler when the web site to be crawled was shut down. I checked the
>> > simple history and saw "Connection refused" with a status code of "-1",
>> > which looked fine. But as I waited, the job status never changed and
>> > remained running. The crawler never crawls in this situation, but when I
>> > brought the web site back up, it never started crawling again either.
>> >
>> > At the very least, I want the crawler to stop running when a network
>> > connection is refused, but I don't know how. Does anyone have any ideas?
>
>
