Sebastian Nagel created NUTCH-3174:
--------------------------------------
Summary: protocol-okhttp: request may hang despite http.time.limit
is set
Key: NUTCH-3174
URL: https://issues.apache.org/jira/browse/NUTCH-3174
Project: Nutch
Issue Type: Bug
Components: plugin, protocol
Affects Versions: 1.22
Reporter: Sebastian Nagel
Fix For: 1.23
The OkHttp protocol offers a configuration property {{http.time.limit}} which
sets a max. duration for the entire request, including DNS request,
establishing the connection and reading the response. The max. duration can be
configured in addition to {{http.timeout}}, which sets a timeout for individual
read operations.
The {{http.timeout}} is passed to the [OkHttp
client|https://square.github.io/okhttp/5.x/okhttp/okhttp3/-ok-http-client/index.html]
as read, write and connection timeout.
The {{http.time.limit}} is checked in
[OkHttpResponse|https://github.com/apache/nutch/blob/31c44b2afbec0f556ed9ebd6f2eca2eb07e7f4cb/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttpResponse.java#L197],
while the response body is requested in chunks of 8 KiB.
If the server sends the response in very small chunks, at regular intervals
shorter than the {{http.timeout}} (default: 10 seconds), the check for the
{{http.time.limit}} is never executed. Consequently, the request may hang
forever (ParserChecker), or it blocks a fetcher thread and causes "hung
threads" in the Fetcher.
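
The effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: it assumes the time limit is checked once per completed 8 KiB chunk, it uses a virtual clock instead of a real socket, and the names {{TrickleStream}} and {{TimeLimitDemo}} are hypothetical and not part of Nutch or OkHttp.

```java
import java.io.InputStream;

/** Simulated slow server: each read() delivers one byte and advances a virtual clock. */
class TrickleStream extends InputStream {
    private final long intervalMs; // virtual delay before every byte
    private long remaining;        // bytes the simulated server still sends
    long clockMs = 0;              // virtual time elapsed so far

    TrickleStream(long totalBytes, long intervalMs) {
        this.remaining = totalBytes;
        this.intervalMs = intervalMs;
    }

    @Override
    public int read() {
        if (remaining == 0) {
            return -1; // end of stream
        }
        remaining--;
        clockMs += intervalMs;
        return 'x';
    }
}

class TimeLimitDemo {
    /** Checks the limit only after a whole chunk was buffered. Returns the virtual stop time. */
    static long stopTimeCheckingPerChunk(TrickleStream in, int chunkSize, long limitMs) {
        while (true) {
            int filled = 0;
            while (filled < chunkSize && in.read() != -1) {
                filled++; // "blocks" until the chunk is full or the stream ends
            }
            if (filled < chunkSize || in.clockMs > limitMs) {
                return in.clockMs;
            }
        }
    }

    /** Checks the limit after every single read. Returns the virtual stop time. */
    static long stopTimeCheckingPerRead(TrickleStream in, long limitMs) {
        while (in.read() != -1) {
            if (in.clockMs > limitMs) {
                break;
            }
        }
        return in.clockMs;
    }

    public static void main(String[] args) {
        // Assumed values: one byte every 5 s, 8 KiB chunks, time limit of 60 s.
        long perChunk = stopTimeCheckingPerChunk(new TrickleStream(20_000, 5_000), 8_192, 60_000);
        long perRead = stopTimeCheckingPerRead(new TrickleStream(20_000, 5_000), 60_000);
        System.out.println("check per chunk stops after " + perChunk / 1000 + " s");
        System.out.println("check per read stops after " + perRead / 1000 + " s");
    }
}
```

With these assumed numbers, the per-chunk variant reaches its first check only after 8192 x 5 s = 40960 s (more than 11 hours), although no single read ever exceeds a 10 second read timeout. The per-read variant stops after 65 s.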
The issue is reproducible with one of
[Shadowserver|https://www.shadowserver.org/]'s sinkholes.
The response is sent byte by byte using chunked transfer encoding, at
intervals of about 5 seconds:
{noformat}
$> curl --no-progress-meter \
--dump-header cbhjhlccfkqdpknyu-org-curl.headers.log \
--trace cbhjhlccfkqdpknyu-org-curl.trace.log \
--output cbhjhlccfkqdpknyu-org-curl.txt \
--speed-limit 1 --speed-time 60 http://cbhjhlccfkqdpknyu.org/
curl: (28) Operation too slow. Less than 1 bytes/sec transferred the last 60 seconds
$> cat cbhjhlccfkqdpknyu-org-curl.headers.log
HTTP/1.1 200 OK
Server: nginx/1.21.6
Date: Thu, 07 May 2026 12:27:13 GMT
Content-Type: application/octet-stream
Transfer-Encoding: chunked
Connection: keep-alive
Keep-Alive: timeout=20
$> cat cbhjhlccfkqdpknyu-org-curl.txt
FYnpmZslCQt
$> cat cbhjhlccfkqdpknyu-org-curl.trace.log
* Host cbhjhlccfkqdpknyu.org:80 was resolved.
* IPv6: (none)
* IPv4: 216.218.185.162
* Trying 216.218.185.162:80...
* Established connection to cbhjhlccfkqdpknyu.org (216.218.185.162 port 80) from 192.168.178.110 port 50782
* using HTTP/1.x
=> Send header, 85 bytes (0x55)
0000: 47 45 54 20 2f 20 48 54 54 50 2f 31 2e 31 0d 0a GET / HTTP/1.1..
0010: 48 6f 73 74 3a 20 63 62 68 6a 68 6c 63 63 66 6b Host: cbhjhlccfk
0020: 71 64 70 6b 6e 79 75 2e 6f 72 67 0d 0a 55 73 65 qdpknyu.org..Use
0030: 72 2d 41 67 65 6e 74 3a 20 63 75 72 6c 2f 38 2e r-Agent: curl/8.
0040: 31 38 2e 30 0d 0a 41 63 63 65 70 74 3a 20 2a 2f 18.0..Accept: */
0050: 2a 0d 0a 0d 0a *....
* Request completely sent off
<= Recv header, 17 bytes (0x11)
0000: 48 54 54 50 2f 31 2e 31 20 32 30 30 20 4f 4b 0d HTTP/1.1 200 OK.
0010: 0a .
<= Recv header, 22 bytes (0x16)
0000: 53 65 72 76 65 72 3a 20 6e 67 69 6e 78 2f 31 2e Server: nginx/1.
0010: 32 31 2e 36 0d 0a 21.6..
<= Recv header, 37 bytes (0x25)
0000: 44 61 74 65 3a 20 54 68 75 2c 20 30 37 20 4d 61 Date: Thu, 07 Ma
0010: 79 20 32 30 32 36 20 31 32 3a 32 37 3a 31 33 20 y 2026 12:27:13
0020: 47 4d 54 0d 0a GMT..
<= Recv header, 40 bytes (0x28)
0000: 43 6f 6e 74 65 6e 74 2d 54 79 70 65 3a 20 61 70 Content-Type: ap
0010: 70 6c 69 63 61 74 69 6f 6e 2f 6f 63 74 65 74 2d plication/octet-
0020: 73 74 72 65 61 6d 0d 0a stream..
<= Recv header, 28 bytes (0x1c)
0000: 54 72 61 6e 73 66 65 72 2d 45 6e 63 6f 64 69 6e Transfer-Encodin
0010: 67 3a 20 63 68 75 6e 6b 65 64 0d 0a g: chunked..
<= Recv header, 24 bytes (0x18)
0000: 43 6f 6e 6e 65 63 74 69 6f 6e 3a 20 6b 65 65 70 Connection: keep
0010: 2d 61 6c 69 76 65 0d 0a -alive..
<= Recv header, 24 bytes (0x18)
0000: 4b 65 65 70 2d 41 6c 69 76 65 3a 20 74 69 6d 65 Keep-Alive: time
0010: 6f 75 74 3d 32 30 0d 0a out=20..
<= Recv header, 2 bytes (0x2)
0000: 0d 0a ..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 46 0d 0a 1..F..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 59 0d 0a 1..Y..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 6e 0d 0a 1..n..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 70 0d 0a 1..p..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 6d 0d 0a 1..m..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 5a 0d 0a 1..Z..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 73 0d 0a 1..s..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 6c 0d 0a 1..l..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 43 0d 0a 1..C..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 51 0d 0a 1..Q..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 74 0d 0a 1..t..
* Operation too slow. Less than 1 bytes/sec transferred the last 60 seconds
* closing connection #0
{noformat}
The Fetcher reports a hung thread (which is how this issue was discovered):
{noformat}
2026-05-06 14:18:43,255 WARN [main] fetcher.Fetcher: Thread #22 hung while processing http://cbhjhlccfkqdpknyu.org/
{noformat}
ParserChecker hangs forever while fetching the content:
{noformat}
$> $NUTCH_HOME/bin/nutch parsechecker \
-Dplugin.includes='protocol-okhttp|parse-(html|tika)' \
-Dhttp.timeout=10000 -Dhttp.time.limit=60 -dumpText \
http://cbhjhlccfkqdpknyu.org/
...
2026-05-07 14:37:44,811 INFO o.a.n.p.ParserChecker [main] fetching: http://cbhjhlccfkqdpknyu.org/
...
2026-05-07 14:37:45,101 INFO o.a.n.p.o.OkHttp [main] http.timeout = 10000 ms
2026-05-07 14:37:45,101 INFO o.a.n.p.o.OkHttp [main] http.time.limit = 60 seconds
...
{noformat}
A possible solution: OkHttp provides a [timeout for the complete
call|https://square.github.io/okhttp/5.x/okhttp/okhttp3/-ok-http-client/-builder/call-timeout.html].
It should be set to the value of {{http.time.limit}}.
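
The idea of a limit on the complete call, as opposed to a limit per read, can be sketched with the Java standard library alone. This is not OkHttp's API or implementation; {{CallDeadline}}, {{callWithLimit}} and {{trickle}} are hypothetical names used only for illustration, and the slow request is simulated with sleeps.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class CallDeadline {
    /**
     * Runs the task with a limit on its total duration.
     * Returns the task's result, or null if the limit was exceeded.
     */
    static <T> T callWithLimit(Callable<T> task, long limitMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(limitMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker thread
            return null;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
    }

    /** Hypothetical slow "request": makes progress every 50 ms but never finishes. */
    static String trickle() throws InterruptedException {
        while (true) {
            Thread.sleep(50);
        }
    }

    public static void main(String[] args) {
        // A fast call completes normally and returns its result.
        System.out.println(callWithLimit(() -> "done", 1_000));
        // A call that trickles forever is cut off once the overall limit is reached.
        System.out.println(callWithLimit(CallDeadline::trickle, 300));
    }
}
```

The limit is enforced from outside the read loop, so it holds no matter how often the individual reads succeed. Note that a thread interrupt does not unblock a blocking socket read in Java, so a real HTTP client has to cancel the call or close the connection instead.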