Sebastian Nagel created NUTCH-3174:
--------------------------------------

             Summary: protocol-okhttp: request may hang despite http.time.limit is set
                 Key: NUTCH-3174
                 URL: https://issues.apache.org/jira/browse/NUTCH-3174
             Project: Nutch
          Issue Type: Bug
          Components: plugin, protocol
    Affects Versions: 1.22
            Reporter: Sebastian Nagel
             Fix For: 1.23


The OkHttp protocol plugin offers a configuration property {{http.time.limit}}
which sets a maximum duration for the entire request, including the DNS lookup,
establishing the connection and reading the response. This limit can be
configured in addition to {{http.timeout}}, which sets a timeout for individual
read operations.

The {{http.timeout}} is passed to the [OkHttp 
client|https://square.github.io/okhttp/5.x/okhttp/okhttp3/-ok-http-client/index.html]
 as read, write and connection timeout.

The {{http.time.limit}} is checked in 
[OkHttpResponse|https://github.com/apache/nutch/blob/31c44b2afbec0f556ed9ebd6f2eca2eb07e7f4cb/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttpResponse.java#L197],
 while the response is requested in chunks of 8 KiB.

If the server sends the response in very small chunks, at regular intervals
shorter than the {{http.timeout}} (default: 10 seconds), the read timeout
never fires and the check for the {{http.time.limit}} is never executed.
Consequently, the request may hang forever (ParserChecker), or it blocks a
thread and causes "hung threads" in the Fetcher.
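
The effect can be modelled without a network. The following is a minimal sketch, not the code of Nutch or OkHttp: {{DripSource}}, the simulated clock and both reader methods are hypothetical names, and the sketch assumes that the time limit is only checked once a full 8 KiB chunk has been read, as the description above suggests.

```java
public class DripDeadlineDemo {

    /** Simulated clock in seconds, so that the demo runs instantly. */
    static long nowSeconds = 0;

    /** Hypothetical slow server: every read yields one byte after 5 simulated seconds. */
    static class DripSource {
        private int remaining;

        DripSource(int totalBytes) {
            this.remaining = totalBytes;
        }

        /** Returns the next byte, or -1 when the source is exhausted. */
        int read() {
            if (remaining == 0) {
                return -1;
            }
            remaining--;
            nowSeconds += 5; // shorter than a 10 s read timeout, so that timeout never fires
            return 'x';
        }
    }

    /** Checks the limit only after a whole chunk was filled. Returns the elapsed seconds. */
    static long readCheckingPerChunk(DripSource in, int chunkSize, long limitSeconds) {
        long start = nowSeconds;
        while (true) {
            int filled = 0;
            while (filled < chunkSize) {
                if (in.read() == -1) {
                    return nowSeconds - start; // end of response
                }
                filled++;
            }
            if (nowSeconds - start > limitSeconds) {
                return nowSeconds - start; // limit detected, but only now
            }
        }
    }

    /** Checks the limit after every single read. Returns the elapsed seconds. */
    static long readCheckingPerRead(DripSource in, long limitSeconds) {
        long start = nowSeconds;
        while (in.read() != -1) {
            if (nowSeconds - start > limitSeconds) {
                break; // limit detected right after the read that exceeded it
            }
        }
        return nowSeconds - start;
    }

    public static void main(String[] args) {
        long perChunk = readCheckingPerChunk(new DripSource(100_000), 8192, 60);
        long perRead = readCheckingPerRead(new DripSource(100_000), 60);
        System.out.println("limit noticed after (check per chunk): " + perChunk + " s");
        System.out.println("limit noticed after (check per read):  " + perRead + " s");
    }
}
```

With one byte every 5 seconds, the per-chunk variant notices the 60 second limit only after 8192 * 5 = 40960 simulated seconds, while the per-read variant stops after 65 seconds.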


The issue is reproducible with one of 
[Shadowserver|https://www.shadowserver.org/]'s sinkholes.

The response is sent byte per byte as chunked transfer encoding in about 5 
second intervals:

{noformat}
$> curl --no-progress-meter \
     --dump-header cbhjhlccfkqdpknyu-org-curl.headers.log \
     --trace cbhjhlccfkqdpknyu-org-curl.trace.log \
     --output cbhjhlccfkqdpknyu-org-curl.txt \
     --speed-limit 1 --speed-time 60 http://cbhjhlccfkqdpknyu.org/
curl: (28) Operation too slow. Less than 1 bytes/sec transferred the last 60 seconds

$> cat cbhjhlccfkqdpknyu-org-curl.headers.log
HTTP/1.1 200 OK
Server: nginx/1.21.6
Date: Thu, 07 May 2026 12:27:13 GMT
Content-Type: application/octet-stream
Transfer-Encoding: chunked
Connection: keep-alive
Keep-Alive: timeout=20

$> cat cbhjhlccfkqdpknyu-org-curl.txt
FYnpmZslCQt

$> cat cbhjhlccfkqdpknyu-org-curl.trace.log
* Host cbhjhlccfkqdpknyu.org:80 was resolved.
* IPv6: (none)
* IPv4: 216.218.185.162
*   Trying 216.218.185.162:80...
* Established connection to cbhjhlccfkqdpknyu.org (216.218.185.162 port 80) from 192.168.178.110 port 50782
* using HTTP/1.x
=> Send header, 85 bytes (0x55)
0000: 47 45 54 20 2f 20 48 54 54 50 2f 31 2e 31 0d 0a GET / HTTP/1.1..
0010: 48 6f 73 74 3a 20 63 62 68 6a 68 6c 63 63 66 6b Host: cbhjhlccfk
0020: 71 64 70 6b 6e 79 75 2e 6f 72 67 0d 0a 55 73 65 qdpknyu.org..Use
0030: 72 2d 41 67 65 6e 74 3a 20 63 75 72 6c 2f 38 2e r-Agent: curl/8.
0040: 31 38 2e 30 0d 0a 41 63 63 65 70 74 3a 20 2a 2f 18.0..Accept: */
0050: 2a 0d 0a 0d 0a                                  *....
* Request completely sent off
<= Recv header, 17 bytes (0x11)
0000: 48 54 54 50 2f 31 2e 31 20 32 30 30 20 4f 4b 0d HTTP/1.1 200 OK.
0010: 0a                                              .
<= Recv header, 22 bytes (0x16)
0000: 53 65 72 76 65 72 3a 20 6e 67 69 6e 78 2f 31 2e Server: nginx/1.
0010: 32 31 2e 36 0d 0a                               21.6..
<= Recv header, 37 bytes (0x25)
0000: 44 61 74 65 3a 20 54 68 75 2c 20 30 37 20 4d 61 Date: Thu, 07 Ma
0010: 79 20 32 30 32 36 20 31 32 3a 32 37 3a 31 33 20 y 2026 12:27:13 
0020: 47 4d 54 0d 0a                                  GMT..
<= Recv header, 40 bytes (0x28)
0000: 43 6f 6e 74 65 6e 74 2d 54 79 70 65 3a 20 61 70 Content-Type: ap
0010: 70 6c 69 63 61 74 69 6f 6e 2f 6f 63 74 65 74 2d plication/octet-
0020: 73 74 72 65 61 6d 0d 0a                         stream..
<= Recv header, 28 bytes (0x1c)
0000: 54 72 61 6e 73 66 65 72 2d 45 6e 63 6f 64 69 6e Transfer-Encodin
0010: 67 3a 20 63 68 75 6e 6b 65 64 0d 0a             g: chunked..
<= Recv header, 24 bytes (0x18)
0000: 43 6f 6e 6e 65 63 74 69 6f 6e 3a 20 6b 65 65 70 Connection: keep
0010: 2d 61 6c 69 76 65 0d 0a                         -alive..
<= Recv header, 24 bytes (0x18)
0000: 4b 65 65 70 2d 41 6c 69 76 65 3a 20 74 69 6d 65 Keep-Alive: time
0010: 6f 75 74 3d 32 30 0d 0a                         out=20..
<= Recv header, 2 bytes (0x2)
0000: 0d 0a                                           ..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 46 0d 0a                               1..F..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 59 0d 0a                               1..Y..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 6e 0d 0a                               1..n..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 70 0d 0a                               1..p..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 6d 0d 0a                               1..m..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 5a 0d 0a                               1..Z..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 73 0d 0a                               1..s..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 6c 0d 0a                               1..l..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 43 0d 0a                               1..C..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 51 0d 0a                               1..Q..
<= Recv data, 6 bytes (0x6)
0000: 31 0d 0a 74 0d 0a                               1..t..
* Operation too slow. Less than 1 bytes/sec transferred the last 60 seconds
* closing connection #0
{noformat}

Fetcher reports a hung thread (that helped to discover this issue):
{noformat}
2026-05-06 14:18:43,255 WARN [main] fetcher.Fetcher: Thread #22 hung while processing http://cbhjhlccfkqdpknyu.org/
{noformat}

ParserChecker hangs forever while fetching the content:
{noformat}
$> $NUTCH_HOME/bin/nutch parsechecker \
    -Dplugin.includes='protocol-okhttp|parse-(html|tika)' \
    -Dhttp.timeout=10000 -Dhttp.time.limit=60 -dumpText \
    http://cbhjhlccfkqdpknyu.org/
...
2026-05-07 14:37:44,811 INFO o.a.n.p.ParserChecker [main] fetching: http://cbhjhlccfkqdpknyu.org/
...
2026-05-07 14:37:45,101 INFO o.a.n.p.o.OkHttp [main] http.timeout = 10000 ms
2026-05-07 14:37:45,101 INFO o.a.n.p.o.OkHttp [main] http.time.limit = 60 seconds
...
...
{noformat}


A possible solution: OkHttp provides a [timeout for the complete 
call|https://square.github.io/okhttp/5.x/okhttp/okhttp3/-ok-http-client/-builder/call-timeout.html].
 It should be set to the value of {{http.time.limit}}.
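
A minimal sketch of how this could look, assuming an OkHttp version that provides {{callTimeout}} is on the classpath. The class and method names are hypothetical and this is not the actual patch. Treating non-positive values of {{http.time.limit}} as "no limit" is also an assumption.

```java
import java.util.concurrent.TimeUnit;

import okhttp3.OkHttpClient;

public class CallTimeoutSketch {

    /**
     * Builds a client from the two settings discussed in this issue.
     *
     * @param timeoutMillis    value of http.timeout (milliseconds)
     * @param timeLimitSeconds value of http.time.limit (seconds)
     */
    static OkHttpClient buildClient(int timeoutMillis, int timeLimitSeconds) {
        OkHttpClient.Builder builder = new OkHttpClient.Builder()
            // http.timeout applies to each individual connect, read and write
            .connectTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
            .readTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
            .writeTimeout(timeoutMillis, TimeUnit.MILLISECONDS);

        if (timeLimitSeconds > 0) {
            // upper bound for the complete call, from DNS lookup to the last body byte
            builder.callTimeout(timeLimitSeconds, TimeUnit.SECONDS);
        }
        // otherwise OkHttp's default of 0 (no call timeout) stays in effect
        return builder.build();
    }
}
```

With the values from the ParserChecker run above, {{buildClient(10000, 60)}} would give a client whose calls are bounded at 60 seconds in total, however slowly the server drips the response.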



