Hi Shashanka, All,
Thank you for your reply!
I'm using Nutch 1.19. I did the injection and segment generation using the
following commands:
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
When I run the fetch command below, Nutch aborts with warnings about hung
threads ("Aborting with 4 hung threads."). The fetch command output and my
nutch-site.xml are included below.
s1=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s1
My questions are:
1) What do I need to do to get Nutch to continue working even if there are
hung threads?
2) Is there a way to avoid having these hanging threads in the first place?
Thank you
Sheham
On Fri, Apr 19, 2024 at 1:04 AM Shashanka Balakuntala wrote:
> Hi Shehamizat,
> Please feel free to drop questions on the email itself. One of us/community
> will be glad to help on the same.
>
> *Regards*
> Shashanka Balakuntala Srinivasa
>
>
>
> On Fri, 19 Apr 2024 at 7:15 AM, Sheham Izat wrote:
>
> > Hi,
> >
> > I'm trying to get Nutch to work and I have issues, how can I post
> > questions on the group?
> >
> > Thank you,
> > Sheham
> >
>
[root@localhost apache-nutch-1.19]# bin/nutch fetch $s1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/opt/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/opt/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2024-04-07 22:46:27,222 INFO o.a.n.p.PluginManifestParser [main] Plugins:
looking in: /opt/apache-nutch-1.19/plugins
2024-04-07 22:46:27,353 INFO o.a.n.p.PluginRepository [main] Plugin
Auto-activation mode: [true]
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main] Registered Plugins:
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main] Regex URL
Filter (urlfilter-regex)
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main] Html Parse
Plug-in (parse-html)
2024-04-07 22:46:27,354 INFO o.a.n.p.PluginRepository [main] HTTP Framework
(lib-http)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] the nutch core
extension points (nutch-extensionpoints)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Basic Indexing
Filter (index-basic)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Anchor Indexing
Filter (index-anchor)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Tika Parser
Plug-in (parse-tika)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Basic URL
Normalizer (urlnormalizer-basic)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Regex URL
Filter Framework (lib-regex-filter)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Regex URL
Normalizer (urlnormalizer-regex)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] URL Validator
(urlfilter-validator)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] CyberNeko HTML
Parser (lib-nekohtml)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] OPIC Scoring
Plug-in (scoring-opic)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Pass-through
URL Normalizer (urlnormalizer-pass)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Http Protocol
Plug-in (protocol-http)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] SolrIndexWriter
(indexer-solr)
2024-04-07 22:46:27,355 INFO o.a.n.p.PluginRepository [main] Registered
Extension-Points:
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Content
Parser)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch URL
Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (HTML Parse
Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Scoring)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch URL
Normalizer)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch
Publisher)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch
Exchange)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch
Protocol)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch URL
Ignore Exemption Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Index
Writer)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch Segment
Merge Filter)
2024-04-07 22:46:27,356 INFO o.a.n.p.PluginRepository [main] (Nutch
Indexing Filter)
2024-04-07 22:46:27,367 INFO o.a.n.f.Fetcher [main] Fetcher: starting at
2024-04-07 22:46:27
2024-04-07 22:46:27,367 INFO o.a.n.f.Fetcher [main] Fetcher: segment:
crawl/segments/20240407224534
2024-04-07 22:46:28,109 INFO o.a.n.f.FetchItemQueues [LocalJobRunner Map Task
Executor #0] Using queue mode : byHost
2024-04-07 22:46:28,110 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] Fetcher: threads: 10
2024-04-07 22:46:28,130 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] Fetcher: time-out divisor: 2
2024-04-07 22:46:28,147 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder
finished: total 60 records
2024-04-07 22:46:28,147 INFO o.a.n.f.QueueFeeder [QueueFeeder] QueueFeeder
queuing status:
2024-04-07 22:46:28,147 INFO o.a.n.f.QueueFeeder [QueueFeeder] 60
SUCCESSFULLY_QUEUED
2024-04-07 22:46:28,147 INFO o.a.n.f.QueueFeeder [QueueFeeder] 0
ERROR_CREATE_FETCH_ITEM
2024-04-07 22:46:28,147 INFO o.a.n.f.QueueFeeder [QueueFeeder] 0
ABOVE_EXCEPTION_THRESHOLD
2024-04-07 22:46:28,147 INFO o.a.n.f.QueueFeeder [QueueFeeder] 0
HIT_BY_TIMELIMIT
2024-04-07 22:46:28,160 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map
Task Executor #0] Found 0 extensions at
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,177 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,178 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 60 fetching https://ladot.lacity.org/ (queue crawl delay=5000ms)
2024-04-07 22:46:28,188 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map
Task Executor #0] Found 0 extensions at
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,287 INFO o.a.n.p.h.Http [FetcherThread] http.proxy.host =
null
2024-04-07 22:46:28,287 INFO o.a.n.p.h.Http [FetcherThread] http.proxy.port =
8080
2024-04-07 22:46:28,287 INFO o.a.n.p.h.Http [FetcherThread]
http.proxy.exception.list = false
2024-04-07 22:46:28,287 INFO o.a.n.p.h.Http [FetcherThread] http.timeout = 10000
2024-04-07 22:46:28,287 INFO o.a.n.p.h.Http [FetcherThread] http.content.limit
= 1048576
2024-04-07 22:46:28,288 INFO o.a.n.p.h.Http [FetcherThread] http.agent =
Spirawndex Nutch Spider/Nutch-1.19
2024-04-07 22:46:28,288 INFO o.a.n.p.h.Http [FetcherThread]
http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2024-04-07 22:46:28,288 INFO o.a.n.p.h.Http [FetcherThread] http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2024-04-07 22:46:28,289 INFO o.a.n.p.h.Http [FetcherThread]
http.enable.cookie.header = true
2024-04-07 22:46:28,291 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,293 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 61 fetching https://disneyland.disney.go.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:28,302 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map
Task Executor #0] Found 0 extensions at
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,303 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,304 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 62 fetching https://www.getapp.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:28,313 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map
Task Executor #0] Found 0 extensions at
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,314 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,315 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 63 fetching https://www.kayemfoodservice.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:28,325 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map
Task Executor #0] Found 0 extensions at
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,326 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,327 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 64 fetching https://theculturetrip.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:28,337 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map
Task Executor #0] Found 0 extensions at
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,339 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,340 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 65 fetching https://www.slideshare.net/ (queue crawl delay=5000ms)
2024-04-07 22:46:28,350 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map
Task Executor #0] Found 0 extensions at
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,352 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,353 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 66 fetching https://appexchange.salesforce.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:28,363 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map
Task Executor #0] Found 0 extensions at
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,364 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,366 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 67 fetching https://www.lewisginter.org/ (queue crawl
delay=5000ms)
2024-04-07 22:46:28,376 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map
Task Executor #0] Found 0 extensions at
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,377 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,378 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 68 fetching https://maps.google.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:28,388 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map
Task Executor #0] Found 0 extensions at
point:'org.apache.nutch.net.URLExemptionFilter'
2024-04-07 22:46:28,388 INFO o.a.n.f.FetcherThread [LocalJobRunner Map Task
Executor #0] FetcherThread 54 Using queue mode : byHost
2024-04-07 22:46:28,389 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] Fetcher: throughput threshold: -1
2024-04-07 22:46:28,390 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] Fetcher: throughput threshold retries: 5
2024-04-07 22:46:28,390 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 69 fetching https://www.ballseed.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:28,928 INFO o.a.n.n.u.r.RegexURLNormalizer [FetcherThread]
can't find rules for scope 'fetcher', using default
2024-04-07 22:46:28,952 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 62 fetch of https://www.getapp.com/ failed with: Http code=403,
url=https://www.getapp.com/
2024-04-07 22:46:28,952 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue:
www.getapp.com >> delayed next fetch by 5000 ms after 1 exceptions in queue
2024-04-07 22:46:28,953 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 62 fetching https://www.youtube.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:28,955 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 68 fetching https://www.thefreedictionary.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:29,066 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 64 fetching https://mathiasconradt.medium.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:29,171 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 65 fetching https://sourceforge.net/ (queue crawl delay=5000ms)
2024-04-07 22:46:29,269 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 66 fetching https://bitnami.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:29,281 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 67 fetching https://www.hyland.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:29,389 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 69 fetching https://github.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:29,393 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=43,
fetchQueues.getQueueCount=60
2024-04-07 22:46:29,621 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 62 fetching https://microstrat.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:29,632 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 63 fetching https://www.caryillinois.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:29,650 INFO o.a.n.n.u.r.RegexURLNormalizer [FetcherThread]
can't find rules for scope 'fetcher', using default
2024-04-07 22:46:29,652 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 60 fetching https://www.wilsonappliance.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:29,688 INFO o.a.n.p.h.a.HttpRobotRulesParser [FetcherThread]
Couldn't get robots.txt for https://bitnami.com/: java.net.SocketException:
Socket is closed
2024-04-07 22:46:29,708 ERROR o.a.n.p.h.Http [FetcherThread] Failed to get
protocol output
java.net.SocketException: Socket is closed
at
sun.security.ssl.SSLSocketImpl.getOutputStream(SSLSocketImpl.java:1129) ~[?:?]
at
org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:163) ~[?:?]
at org.apache.nutch.protocol.http.Http.getResponse(Http.java:65) ~[?:?]
at
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:393)
~[?:?]
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:381)
~[apache-nutch-1.19.jar:?]
2024-04-07 22:46:29,713 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 66 fetch of https://bitnami.com/ failed with:
java.net.SocketException: Socket is closed
2024-04-07 22:46:29,714 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue:
bitnami.com >> delayed next fetch by 5000 ms after 1 exceptions in queue
2024-04-07 22:46:29,714 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 66 fetching https://www.inwoodartworks.nyc/ (queue crawl
delay=5000ms)
2024-04-07 22:46:29,816 INFO o.a.n.f.FetcherThread [FetcherThread] Denied by
robots.txt: https://sourceforge.net/
2024-04-07 22:46:29,816 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 65 fetching https://www.carahsoft.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:29,990 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 69 fetching https://www.pirch.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:30,139 INFO o.a.n.n.u.r.RegexURLNormalizer [FetcherThread]
can't find rules for scope 'fetcher', using default
2024-04-07 22:46:30,140 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 67 fetching https://www.lowes.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:30,166 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 62 fetching https://www.benjaminmoore.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:30,394 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=35,
fetchQueues.getQueueCount=60
2024-04-07 22:46:30,533 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 66 fetching https://www.jamieoliver.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:30,540 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 69 fetching https://hub.docker.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:30,607 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 64 fetching https://onsclothing.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:30,608 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 68 fetching https://www.burpeehomegardens.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:30,772 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 69 fetching https://www.crateandbarrel.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:30,786 INFO o.a.n.n.u.r.RegexURLNormalizer [FetcherThread]
can't find rules for scope 'fetcher', using default
2024-04-07 22:46:30,787 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 62 fetching https://kinto-usa.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:30,951 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 63 fetching https://dictionary.cambridge.org/ (queue crawl
delay=5000ms)
2024-04-07 22:46:30,988 INFO o.a.n.f.FetcherThread [FetcherThread] Denied by
robots.txt: https://onsclothing.com/
2024-04-07 22:46:30,989 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 64 fetching https://www.stylemepretty.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:31,175 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 69 fetch of https://www.crateandbarrel.com/ failed with: Http
code=403, url=https://www.crateandbarrel.com/
2024-04-07 22:46:31,176 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue:
www.crateandbarrel.com >> delayed next fetch by 5000 ms after 1 exceptions in
queue
2024-04-07 22:46:31,177 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 69 fetching https://www.seattlespheres.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:31,278 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 66 fetching https://www.efcontractflooring.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:31,394 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=25,
fetchQueues.getQueueCount=60
2024-04-07 22:46:31,436 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 65 fetching https://www.adu.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:31,480 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 68 fetching https://en.wiktionary.org/ (queue crawl delay=5000ms)
2024-04-07 22:46:31,502 WARN o.a.n.p.h.Http [FetcherThread] Missing or invalid
HTTP status line
org.apache.nutch.protocol.http.api.HttpException: Bad status line, no HTTP
response code:
at
org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:571)
~[?:?]
at
org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:275) ~[?:?]
at org.apache.nutch.protocol.http.Http.getResponse(Http.java:65) ~[?:?]
at
org.apache.nutch.protocol.http.api.HttpRobotRulesParser.getRobotRulesSet(HttpRobotRulesParser.java:133)
~[?:?]
at
org.apache.nutch.protocol.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:235)
~[apache-nutch-1.19.jar:?]
at
org.apache.nutch.protocol.http.api.HttpBase.getRobotRules(HttpBase.java:766)
~[?:?]
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:319)
~[apache-nutch-1.19.jar:?]
Caused by: java.lang.NumberFormatException: For input string: ""
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
~[?:?]
at java.lang.Integer.parseInt(Integer.java:662) ~[?:?]
at java.lang.Integer.parseInt(Integer.java:770) ~[?:?]
at
org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:569)
~[?:?]
... 6 more
2024-04-07 22:46:31,504 WARN o.a.n.p.h.Http [FetcherThread] No HTTP header,
assuming HTTP/0.9 for https://dictionary.cambridge.org/robots.txt
2024-04-07 22:46:31,552 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 69 fetching https://en.wikipedia.org/ (queue crawl delay=5000ms)
2024-04-07 22:46:31,561 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 60 fetching https://pitchbook.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:31,596 WARN o.a.n.p.h.Http [FetcherThread] Missing or invalid
HTTP status line
org.apache.nutch.protocol.http.api.HttpException: Bad status line, no HTTP
response code:
at
org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:571)
~[?:?]
at
org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:275) ~[?:?]
at org.apache.nutch.protocol.http.Http.getResponse(Http.java:65) ~[?:?]
at
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:393)
~[?:?]
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:381)
~[apache-nutch-1.19.jar:?]
Caused by: java.lang.NumberFormatException: For input string: ""
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
~[?:?]
at java.lang.Integer.parseInt(Integer.java:662) ~[?:?]
at java.lang.Integer.parseInt(Integer.java:770) ~[?:?]
at
org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:569)
~[?:?]
... 4 more
2024-04-07 22:46:31,600 WARN o.a.n.p.h.Http [FetcherThread] No HTTP header,
assuming HTTP/0.9 for https://dictionary.cambridge.org/
2024-04-07 22:46:31,602 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 63 fetching https://www.gartner.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:31,748 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 68 fetching https://www.crunchbase.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:31,767 INFO o.a.n.n.u.r.RegexURLNormalizer [FetcherThread]
can't find rules for scope 'fetcher', using default
2024-04-07 22:46:31,767 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 69 fetching https://access.redhat.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:31,853 INFO o.a.n.f.FetcherThread [FetcherThread] Denied by
robots.txt: https://kinto-usa.com/
2024-04-07 22:46:31,853 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 62 fetching https://www.cityofsacramento.org/ (queue crawl
delay=5000ms)
2024-04-07 22:46:31,901 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 60 fetch of https://pitchbook.com/ failed with: Http code=403,
url=https://pitchbook.com/
2024-04-07 22:46:31,901 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue:
pitchbook.com >> delayed next fetch by 5000 ms after 1 exceptions in queue
2024-04-07 22:46:31,901 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 60 fetching https://twitter.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:32,029 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 63 fetch of https://www.gartner.com/ failed with: Http code=403,
url=https://www.gartner.com/
2024-04-07 22:46:32,029 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue:
www.gartner.com >> delayed next fetch by 5000 ms after 1 exceptions in queue
2024-04-07 22:46:32,030 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 63 fetching https://www.aggressiveappliances.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:32,067 WARN c.r.SimpleRobotRulesParser [FetcherThread] Problem
processing robots.txt for https://twitter.com/
2024-04-07 22:46:32,067 WARN c.r.SimpleRobotRulesParser [FetcherThread]
Unknown line in robots.txt file (size 1350): Noindex: /i/u
2024-04-07 22:46:32,067 INFO o.a.n.f.FetcherThread [FetcherThread] Denied by
robots.txt: https://twitter.com/
2024-04-07 22:46:32,067 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 60 fetching https://www.softwareadvice.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:32,270 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 69 fetching https://www.facebook.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:32,395 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] -activeThreads=10, spinWaiting=0, fetchQueues.totalSize=13,
fetchQueues.getQueueCount=60
2024-04-07 22:46:32,414 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 64 fetching https://www.g2.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:32,496 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 60 fetching https://accounts.google.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:32,620 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 62 fetching https://www.groveresortorlando.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:32,641 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 69 fetching https://visitabingdonvirginia.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:32,690 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 60 fetching https://alfresco-content-app.netlify.app/ (queue
crawl delay=5000ms)
2024-04-07 22:46:32,801 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 64 fetch of https://www.g2.com/ failed with: Http code=403,
url=https://www.g2.com/
2024-04-07 22:46:32,801 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue:
www.g2.com >> delayed next fetch by 5000 ms after 1 exceptions in queue
2024-04-07 22:46:32,801 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 64 fetching https://www.linkedin.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:32,942 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 62 fetching https://www.trustradius.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:32,956 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 68 fetching https://www.foodnetwork.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:32,986 INFO o.a.n.f.FetcherThread [FetcherThread] Denied by
robots.txt: https://www.linkedin.com/
2024-04-07 22:46:32,987 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 64 fetching https://www.instagram.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:33,145 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 60 fetching https://www.imdb.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:33,148 INFO o.a.n.f.FetcherThread [FetcherThread] Denied by
robots.txt: https://www.instagram.com/
2024-04-07 22:46:33,149 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 64 fetching https://lolldesigns.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:33,159 WARN c.r.SimpleRobotRulesParser [FetcherThread] Problem
processing robots.txt for https://www.trustradius.com/
2024-04-07 22:46:33,160 WARN c.r.SimpleRobotRulesParser [FetcherThread]
Unknown line in robots.txt file (size 1158): Noindex: /api/
2024-04-07 22:46:33,160 WARN c.r.SimpleRobotRulesParser [FetcherThread]
Unknown line in robots.txt file (size 1158): Noindex: /share/
2024-04-07 22:46:33,160 WARN c.r.SimpleRobotRulesParser [FetcherThread]
Unknown line in robots.txt file (size 1158): Noindex: /share
2024-04-07 22:46:33,160 WARN c.r.SimpleRobotRulesParser [FetcherThread]
Unknown line in robots.txt file (size 1158): Noindex: /search/
2024-04-07 22:46:33,271 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 62 fetch of https://www.trustradius.com/ failed with: Http
code=403, url=https://www.trustradius.com/
2024-04-07 22:46:33,271 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue:
www.trustradius.com >> delayed next fetch by 5000 ms after 1 exceptions in queue
2024-04-07 22:46:33,272 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 62 fetching https://api.onlyoffice.com/ (queue crawl delay=5000ms)
2024-04-07 22:46:33,348 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 69 fetching https://www.choosechicago.com/ (queue crawl
delay=5000ms)
2024-04-07 22:46:33,364 INFO o.a.n.f.FetcherThread [FetcherThread] Denied by
robots.txt: https://lolldesigns.com/
2024-04-07 22:46:33,365 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 64 has no more work available
2024-04-07 22:46:33,365 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 64 -finishing thread FetcherThread, activeThreads=9
2024-04-07 22:46:33,383 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 68 has no more work available
2024-04-07 22:46:33,383 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 68 -finishing thread FetcherThread, activeThreads=8
2024-04-07 22:46:33,396 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] -activeThreads=8, spinWaiting=0, fetchQueues.totalSize=0,
fetchQueues.getQueueCount=8
2024-04-07 22:46:33,401 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 66 has no more work available
2024-04-07 22:46:33,402 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 66 -finishing thread FetcherThread, activeThreads=7
2024-04-07 22:46:33,731 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 69 has no more work available
2024-04-07 22:46:33,731 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 69 -finishing thread FetcherThread, activeThreads=6
2024-04-07 22:46:33,997 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 63 has no more work available
2024-04-07 22:46:33,997 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 63 -finishing thread FetcherThread, activeThreads=5
2024-04-07 22:46:34,277 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 60 has no more work available
2024-04-07 22:46:34,277 INFO o.a.n.f.FetcherThread [FetcherThread]
FetcherThread 60 -finishing thread FetcherThread, activeThreads=4
2024-04-07 22:46:34,396 INFO o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] -activeThreads=4, spinWaiting=0, fetchQueues.totalSize=0,
fetchQueues.getQueueCount=4
2024-04-07 22:46:34,397 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] Aborting with 4 hung threads.
2024-04-07 22:46:34,397 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] Thread #1 hung while processing https://disneyland.disney.go.com/
2024-04-07 22:46:34,397 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] Thread #2 hung while processing https://api.onlyoffice.com/
2024-04-07 22:46:34,397 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] Thread #5 hung while processing https://www.adu.com/
2024-04-07 22:46:34,397 WARN o.a.n.f.Fetcher [LocalJobRunner Map Task Executor
#0] Thread #7 hung while processing https://www.lowes.com/
2024-04-07 22:46:35,032 INFO o.a.n.f.Fetcher [main] Fetcher: finished at
2024-04-07 22:46:35, elapsed: 00:00:07
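
One way to read the abort above is to measure how long the fetcher had gone without starting a new request when it gave up. The Python sketch below is only an illustration. The two timestamps are copied from the log: the last "fetching" line (for https://www.choosechicago.com/) and the "Aborting with 4 hung threads." line. The helper name gap_ms is hypothetical, and it is an assumption, not something the log states, that Nutch measures inactivity from roughly the last "fetching" line.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the Nutch log lines above, e.g. "2024-04-07 22:46:33,348".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def gap_ms(earlier: str, later: str) -> int:
    """Return the whole number of milliseconds between two log timestamps."""
    delta = datetime.strptime(later, LOG_TIME_FORMAT) - datetime.strptime(
        earlier, LOG_TIME_FORMAT
    )
    return delta // timedelta(milliseconds=1)


# Copied from the log: last "fetching" line and the "Aborting" line.
last_fetch_started = "2024-04-07 22:46:33,348"
abort_logged = "2024-04-07 22:46:34,397"

print(gap_ms(last_fetch_started, abort_logged))  # prints 1049
```

The gap is about one second, well under the http.timeout = 10000 ms reported earlier in the same log. That suggests the threshold for declaring threads hung may be very low, rather than these sites being unusually slow.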
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>Nutch Spider</value>
</property>
<configuration>
<property>
<name>mapreduce.task.timeout</name>
<value>1800</value>
</property>
</configuration>
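
One setting in this file may explain the early abort. Hadoop documents mapreduce.task.timeout in milliseconds, so a value of 1800 means 1.8 seconds, not 1800 seconds. The Python sketch below works through the arithmetic under an assumption that should be checked against the Fetcher source: that Nutch 1.19 derives its hung-thread threshold as mapreduce.task.timeout divided by the "time-out divisor" printed in the fetch log (2). The function name is hypothetical.

```python
def hung_thread_threshold_ms(task_timeout_ms: int, divisor: int) -> int:
    """Assumed rule: threshold = mapreduce.task.timeout / time-out divisor."""
    return task_timeout_ms // divisor


# Value from this nutch-site.xml, divisor from the fetch log.
print(hung_thread_threshold_ms(1800, 2))       # prints 900

# The same setting if 1800 seconds (30 minutes) was intended.
print(hung_thread_threshold_ms(1_800_000, 2))  # prints 900000
```

If that assumption holds, any pause of more than 900 ms without a new request would be treated as hung threads. Setting the value to 1800000, or removing the property so that Hadoop's default of 600000 ms applies, would be a reasonable first thing to try.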