[ 
https://issues.apache.org/jira/browse/NUTCH-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079156#comment-18079156
 ] 

Sebastian Nagel commented on NUTCH-3174:
----------------------------------------

Balancing the read/write/connection timeouts against the call timeout can also
be tested using
[OkCurl|https://github.com/square/okhttp/blob/master/okcurl/README.md]:
{noformat}
java -jar okcurl-5.3.2-all.jar --verbose --read-timeout 10 --call-timeout 20 \
    http://cbhjhlccfkqdpknyu.org/
{noformat}
The jar can be downloaded from 
[maven.org|https://repo1.maven.org/maven2/com/squareup/okhttp3/okcurl/5.3.2/].
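
To illustrate why the two kinds of timeout need to be balanced, here is a minimal simulation. It is only a sketch: it uses neither OkHttp nor the network, time is a plain counter, and the class and method names ({{TimeoutSimulation}}, {{simulate}}) are hypothetical. The numbers are taken from this issue: one byte about every 5 seconds, a chunk of 8 kiB, a read timeout of 10 seconds and a call timeout of 20 seconds as in the OkCurl command above.

```java
class TimeoutSimulation {

    /**
     * Simulates filling one chunk from a server that sends one byte every
     * byteIntervalSec seconds. Time is a plain counter: no sockets, no sleeping.
     * callTimeoutSec == 0 means "no call timeout" in this sketch.
     */
    static String simulate(long byteIntervalSec, long readTimeoutSec,
                           long callTimeoutSec, int chunkSize) {
        long elapsed = 0;
        int bytesRead = 0;
        while (bytesRead < chunkSize) {
            // A single read waits until the next byte arrives or the read timeout expires.
            long wait = Math.min(byteIntervalSec, readTimeoutSec);
            // The call timeout bounds the whole exchange, however many reads it takes.
            if (callTimeoutSec > 0 && elapsed + wait > callTimeoutSec) {
                return "call timeout after " + callTimeoutSec + " s, " + bytesRead + " bytes";
            }
            // The read timeout only fires if one individual read waits too long.
            if (byteIntervalSec > readTimeoutSec) {
                return "read timeout after " + (elapsed + readTimeoutSec) + " s, " + bytesRead + " bytes";
            }
            elapsed += byteIntervalSec;
            bytesRead++;
        }
        return "chunk complete after " + elapsed + " s, " + bytesRead + " bytes";
    }

    public static void main(String[] args) {
        // Read timeout only: it never fires, because each byte arrives within 10 s.
        System.out.println(simulate(5, 10, 0, 8192));
        // prints "chunk complete after 40960 s, 8192 bytes"

        // Read timeout plus a 20 s call timeout: the exchange is cut off.
        System.out.println(simulate(5, 10, 20, 8192));
        // prints "call timeout after 20 s, 4 bytes"
    }
}
```

In this model a single 8 kiB chunk takes more than 11 hours to fill without a call timeout, so any check that runs only between chunks is effectively never reached.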

> protocol-okhttp: request may hang despite http.time.limit is set
> ----------------------------------------------------------------
>
>                 Key: NUTCH-3174
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3174
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin, protocol
>    Affects Versions: 1.22
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.23
>
>
> The OkHttp protocol offers a configuration property {{http.time.limit}} which 
> sets a max. duration for the entire request, including DNS request, 
> establishing the connection and reading the response. The max. duration can 
> be configured in addition to {{http.timeout}}, which sets a timeout for 
> individual read operations.
> The {{http.timeout}} is passed to the [OkHttp 
> client|https://square.github.io/okhttp/5.x/okhttp/okhttp3/-ok-http-client/index.html]
>  as read, write and connection timeout.
> The {{http.time.limit}} is checked in 
> [OkHttpResponse|https://github.com/apache/nutch/blob/31c44b2afbec0f556ed9ebd6f2eca2eb07e7f4cb/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttpResponse.java#L197],
>  while the response is requested in chunks of 8 kiB.
> If the server sends the response in very small chunks at regular 
> intervals shorter than the {{http.timeout}} (default: 10 seconds), the check 
> for the {{http.time.limit}} is never executed. Consequently, the request may 
> hang forever (ParserChecker) or block and cause "hung threads" in the 
> Fetcher.
> The issue is reproducible with one of 
> [Shadowserver|https://www.shadowserver.org/]'s sinkholes.
> The response is sent byte by byte using chunked transfer encoding, at 
> intervals of about 5 seconds:
> {noformat}
> $> curl --no-progress-meter \
>      --dump-header cbhjhlccfkqdpknyu-org-curl.headers.log \
>      --trace cbhjhlccfkqdpknyu-org-curl.trace.log \
>      --output cbhjhlccfkqdpknyu-org-curl.txt \
>      --speed-limit 1 --speed-time 60 http://cbhjhlccfkqdpknyu.org/
> curl: (28) Operation too slow. Less than 1 bytes/sec transferred the last 60 seconds
> $> cat cbhjhlccfkqdpknyu-org-curl.headers.log
> HTTP/1.1 200 OK
> Server: nginx/1.21.6
> Date: Thu, 07 May 2026 12:27:13 GMT
> Content-Type: application/octet-stream
> Transfer-Encoding: chunked
> Connection: keep-alive
> Keep-Alive: timeout=20
> $> cat cbhjhlccfkqdpknyu-org-curl.txt
> FYnpmZslCQt
> $> cat cbhjhlccfkqdpknyu-org-curl.trace.log
> * Host cbhjhlccfkqdpknyu.org:80 was resolved.
> * IPv6: (none)
> * IPv4: 216.218.185.162
> *   Trying 216.218.185.162:80...
> * Established connection to cbhjhlccfkqdpknyu.org (216.218.185.162 port 80) from 192.168.178.110 port 50782
> * using HTTP/1.x
> => Send header, 85 bytes (0x55)
> 0000: 47 45 54 20 2f 20 48 54 54 50 2f 31 2e 31 0d 0a GET / HTTP/1.1..
> 0010: 48 6f 73 74 3a 20 63 62 68 6a 68 6c 63 63 66 6b Host: cbhjhlccfk
> 0020: 71 64 70 6b 6e 79 75 2e 6f 72 67 0d 0a 55 73 65 qdpknyu.org..Use
> 0030: 72 2d 41 67 65 6e 74 3a 20 63 75 72 6c 2f 38 2e r-Agent: curl/8.
> 0040: 31 38 2e 30 0d 0a 41 63 63 65 70 74 3a 20 2a 2f 18.0..Accept: */
> 0050: 2a 0d 0a 0d 0a                                  *....
> * Request completely sent off
> <= Recv header, 17 bytes (0x11)
> 0000: 48 54 54 50 2f 31 2e 31 20 32 30 30 20 4f 4b 0d HTTP/1.1 200 OK.
> 0010: 0a                                              .
> <= Recv header, 22 bytes (0x16)
> 0000: 53 65 72 76 65 72 3a 20 6e 67 69 6e 78 2f 31 2e Server: nginx/1.
> 0010: 32 31 2e 36 0d 0a                               21.6..
> <= Recv header, 37 bytes (0x25)
> 0000: 44 61 74 65 3a 20 54 68 75 2c 20 30 37 20 4d 61 Date: Thu, 07 Ma
> 0010: 79 20 32 30 32 36 20 31 32 3a 32 37 3a 31 33 20 y 2026 12:27:13 
> 0020: 47 4d 54 0d 0a                                  GMT..
> <= Recv header, 40 bytes (0x28)
> 0000: 43 6f 6e 74 65 6e 74 2d 54 79 70 65 3a 20 61 70 Content-Type: ap
> 0010: 70 6c 69 63 61 74 69 6f 6e 2f 6f 63 74 65 74 2d plication/octet-
> 0020: 73 74 72 65 61 6d 0d 0a                         stream..
> <= Recv header, 28 bytes (0x1c)
> 0000: 54 72 61 6e 73 66 65 72 2d 45 6e 63 6f 64 69 6e Transfer-Encodin
> 0010: 67 3a 20 63 68 75 6e 6b 65 64 0d 0a             g: chunked..
> <= Recv header, 24 bytes (0x18)
> 0000: 43 6f 6e 6e 65 63 74 69 6f 6e 3a 20 6b 65 65 70 Connection: keep
> 0010: 2d 61 6c 69 76 65 0d 0a                         -alive..
> <= Recv header, 24 bytes (0x18)
> 0000: 4b 65 65 70 2d 41 6c 69 76 65 3a 20 74 69 6d 65 Keep-Alive: time
> 0010: 6f 75 74 3d 32 30 0d 0a                         out=20..
> <= Recv header, 2 bytes (0x2)
> 0000: 0d 0a                                           ..
> <= Recv data, 6 bytes (0x6)
> 0000: 31 0d 0a 46 0d 0a                               1..F..
> <= Recv data, 6 bytes (0x6)
> 0000: 31 0d 0a 59 0d 0a                               1..Y..
> <= Recv data, 6 bytes (0x6)
> 0000: 31 0d 0a 6e 0d 0a                               1..n..
> <= Recv data, 6 bytes (0x6)
> 0000: 31 0d 0a 70 0d 0a                               1..p..
> <= Recv data, 6 bytes (0x6)
> 0000: 31 0d 0a 6d 0d 0a                               1..m..
> <= Recv data, 6 bytes (0x6)
> 0000: 31 0d 0a 5a 0d 0a                               1..Z..
> <= Recv data, 6 bytes (0x6)
> 0000: 31 0d 0a 73 0d 0a                               1..s..
> <= Recv data, 6 bytes (0x6)
> 0000: 31 0d 0a 6c 0d 0a                               1..l..
> <= Recv data, 6 bytes (0x6)
> 0000: 31 0d 0a 43 0d 0a                               1..C..
> <= Recv data, 6 bytes (0x6)
> 0000: 31 0d 0a 51 0d 0a                               1..Q..
> <= Recv data, 6 bytes (0x6)
> 0000: 31 0d 0a 74 0d 0a                               1..t..
> * Operation too slow. Less than 1 bytes/sec transferred the last 60 seconds
> * closing connection #0
> {noformat}
> The Fetcher reports a hung thread (which is how this issue was discovered):
> {noformat}
> 2026-05-06 14:18:43,255 WARN [main] fetcher.Fetcher: Thread #22 hung while processing http://cbhjhlccfkqdpknyu.org/
> {noformat}
> ParserChecker hangs forever while fetching the content:
> {noformat}
> $> $NUTCH_HOME/bin/nutch parsechecker \
>     -Dplugin.includes='protocol-okhttp|parse-(html|tika)' \
>     -Dhttp.timeout=10000 -Dhttp.time.limit=60 -dumpText \
>     http://cbhjhlccfkqdpknyu.org/
> ...
> 2026-05-07 14:37:44,811 INFO o.a.n.p.ParserChecker [main] fetching: http://cbhjhlccfkqdpknyu.org/
> ...
> 2026-05-07 14:37:45,101 INFO o.a.n.p.o.OkHttp [main] http.timeout = 10000 ms
> 2026-05-07 14:37:45,101 INFO o.a.n.p.o.OkHttp [main] http.time.limit = 60 seconds
> ...
> {noformat}
> A possible solution: OkHttp provides a [timeout for the complete 
> call|https://square.github.io/okhttp/5.x/okhttp/okhttp3/-ok-http-client/-builder/call-timeout.html].
>  It should be set to the value of {{http.time.limit}}.
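
Note that the two properties use different units: the log output above shows {{http.timeout}} in milliseconds and {{http.time.limit}} in seconds. A minimal sketch of the conversion, using only {{java.time.Duration}}, might look like the following. The class {{TimeoutSettings}} and its methods are hypothetical, and mapping a non-positive time limit to "no call timeout" is an assumption of this sketch, not necessarily what the final patch does.

```java
import java.time.Duration;

class TimeoutSettings {

    /** http.timeout is given in milliseconds (see the "http.timeout = 10000 ms" log line). */
    static Duration readTimeout(long httpTimeoutMillis) {
        return Duration.ofMillis(httpTimeoutMillis);
    }

    /**
     * http.time.limit is given in seconds. Assumption of this sketch: a
     * non-positive value means "no limit" and is mapped to Duration.ZERO,
     * which OkHttp's call timeout treats as "no timeout".
     */
    static Duration callTimeout(long httpTimeLimitSeconds) {
        return httpTimeLimitSeconds > 0
                ? Duration.ofSeconds(httpTimeLimitSeconds)
                : Duration.ZERO;
    }

    public static void main(String[] args) {
        Duration read = readTimeout(10_000);
        Duration call = callTimeout(60);
        System.out.println("read/write/connect timeout: " + read); // PT10S
        System.out.println("call timeout: " + call);               // PT1M
        // With OkHttp on the classpath, the values could be passed to the builder:
        //   new OkHttpClient.Builder()
        //       .connectTimeout(read).readTimeout(read).writeTimeout(read)
        //       .callTimeout(call)
        //       .build();
    }
}
```

The builder calls are left as a comment so that the sketch compiles without the OkHttp dependency.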



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
