[jira] [Created] (NUTCH-3059) Generator: selector job does not count reduce output records

2024-06-05 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3059:
--

             Summary: Generator: selector job does not count reduce output records
                 Key: NUTCH-3059
                 URL: https://issues.apache.org/jira/browse/NUTCH-3059
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 1.20
            Reporter: Sebastian Nagel
             Fix For: 1.21


The selector step (job) of the Generator does not count the reduce output 
records: the counter always shows "0":
{noformat}
2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: starting
2024-06-05 13:57:09,299 INFO o.a.n.c.Generator [main] Generator: selecting best-scoring urls due for fetch.
...
         Map-Reduce Framework
                Map input records=6
                Map output records=6
                ...
                Combine input records=0
                Combine output records=0
                Reduce input groups=1
                Reduce shuffle bytes=594
                Reduce input records=6
                Reduce output records=0
                Spilled Records=12
                ...
{noformat}
This is not a big issue, but we should investigate why it happens. The other 
counters seem to work properly, and the partitioner job does show the reduce 
output records. The issue is observed in both local and distributed mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] NUTCH-3058 Fetcher: counter for hung threads [nutch]

2024-06-05 Thread via GitHub


sebastian-nagel opened a new pull request, #820:
URL: https://github.com/apache/nutch/pull/820

   - count the number of hung threads in a fetcher job
   - log and count the number of fetch items still queued when the "hard" 
timeout is reached


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-3058) Fetcher: counter for hung threads

2024-06-05 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852421#comment-17852421 ]

ASF GitHub Bot commented on NUTCH-3058:
---

sebastian-nagel opened a new pull request, #820:
URL: https://github.com/apache/nutch/pull/820

   - count the number of hung threads in a fetcher job
   - log and count the number of fetch items still queued when the "hard" 
timeout is reached




> Fetcher: counter for hung threads
> -
>
> Key: NUTCH-3058
> URL: https://issues.apache.org/jira/browse/NUTCH-3058
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
>
> The Fetcher class defines a "hard" timeout of 50% of the MapReduce task 
> timeout, see {{mapreduce.task.timeout}} and 
> {{fetcher.threads.timeout.divisor}}. If fetcher threads are running but make 
> no progress during the timeout period (in terms of newly started fetch 
> items), the Fetcher is shut down so that the task timeout is not reached and 
> the fetcher job does not fail. The "hung threads" are logged together with 
> the URL being fetched and (at DEBUG level) the Java stack.
> In addition to logging, a job counter should indicate the number of hung 
> threads. This would make it possible to see at the job level whether there 
> are issues with hung threads. Tracing the issues still requires looking into 
> the Hadoop task logs.





[jira] [Created] (NUTCH-3058) Fetcher: counter for hung threads

2024-06-05 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3058:
--

             Summary: Fetcher: counter for hung threads
                 Key: NUTCH-3058
                 URL: https://issues.apache.org/jira/browse/NUTCH-3058
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 1.20
            Reporter: Sebastian Nagel
            Assignee: Sebastian Nagel
             Fix For: 1.21


The Fetcher class defines a "hard" timeout of 50% of the MapReduce task 
timeout, see {{mapreduce.task.timeout}} and 
{{fetcher.threads.timeout.divisor}}. If fetcher threads are running but make no 
progress during the timeout period (in terms of newly started fetch items), the 
Fetcher is shut down so that the task timeout is not reached and the fetcher 
job does not fail. The "hung threads" are logged together with the URL being 
fetched and (at DEBUG level) the Java stack.

In addition to logging, a job counter should indicate the number of hung 
threads. This would make it possible to see at the job level whether there are 
issues with hung threads. Tracing the issues still requires looking into the 
Hadoop task logs.
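
To illustrate the description, here is a minimal, self-contained Java sketch 
of the timeout arithmetic and of the value such a counter might report. It is 
not the Nutch Fetcher code: the class HungThreadSketch, its method names and 
the sample numbers are hypothetical. It assumes the "hard" timeout is the task 
timeout divided by the divisor (a divisor of 2 gives the 50% mentioned above) 
and that all active threads count as hung once no fetch item has been started 
for longer than that timeout.

```java
/** Illustrative sketch only; NOT the actual Nutch Fetcher implementation. */
public class HungThreadSketch {

  /**
   * "Hard" timeout derived from the task timeout (cf. mapreduce.task.timeout)
   * and a divisor (cf. fetcher.threads.timeout.divisor). Assumption: plain
   * integer division; a divisor of 2 yields 50% of the task timeout.
   */
  public static long hardTimeoutMillis(long taskTimeoutMs, int divisor) {
    if (divisor <= 0) {
      throw new IllegalArgumentException("divisor must be positive");
    }
    return taskTimeoutMs / divisor;
  }

  /**
   * Number of threads to report as hung: all active threads if no fetch item
   * was started for longer than the hard timeout, otherwise zero.
   */
  public static int countHungThreads(int activeThreads, long lastFetchStartMs,
      long nowMs, long hardTimeoutMs) {
    boolean noProgress = (nowMs - lastFetchStartMs) > hardTimeoutMs;
    return (activeThreads > 0 && noProgress) ? activeThreads : 0;
  }

  public static void main(String[] args) {
    // Hypothetical sample values: 10 minute task timeout, divisor 2.
    long hardTimeout = hardTimeoutMillis(600_000L, 2);
    // Three threads still active, last fetch item started at t=0,
    // checked just after the hard timeout has elapsed.
    int hung = countHungThreads(3, 0L, 300_001L, hardTimeout);
    System.out.println("hard timeout (ms): " + hardTimeout);
    System.out.println("hung threads: " + hung);
  }
}
```

In a real MapReduce task the resulting number would presumably be added to a 
job counter through the task context rather than printed. The counter group 
and name are left open here because the issue does not specify them.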


