[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

Sebastian Nagel (JIRA) Mon, 18 May 2015 02:09:08 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547760#comment-14547760
 ]


Sebastian Nagel commented on NUTCH-2011:
----------------------------------------

Hi [~sujenshah], first a few questions to get a common understanding:
* "fetch round" means one fetch job, right?
* "greater depth fetch rounds" means long fetch lists? One fetch job always has 
a depth of 1 (unless "fetcher.follow.outlinks.depth" is used, but this is not 
recommended because links are not unified and the same URL/document may be 
fetched multiple times).
* Can you explain the GUI/front-end use case that the information in 
FetchNodeDb is used for?

Which URLs are going to be fetched in one round is known in advance: that's the 
generated fetch list. After each fetch job all data is stored in the segment 
the Fetcher is reading from and writing to. I don't see the need for an 
additional persistence layer. If the server is continuing a crawl, it has 
to obtain the numbers from existing data structures (mainly CrawlDb).

For real-time information about what the fetcher tasks are doing, a small queue 
(ring buffer) should be enough. The GUI can then show what's going on in each of 
the fetcher tasks.
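
A minimal sketch of such a ring buffer, to make the idea concrete. The class 
RecentFetchBuffer, the FetchEvent fields and the capacity are hypothetical 
names chosen for illustration; none of them exist in Nutch, and how fetcher 
threads would feed the buffer is left open.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fixed-size buffer of recent fetch events; not part of Nutch. */
final class RecentFetchBuffer {

  /** One fetched URL with status code and fetch time (illustrative fields). */
  static final class FetchEvent {
    final String url;
    final int status;
    final long fetchTimeMillis;

    FetchEvent(String url, int status, long fetchTimeMillis) {
      this.url = url;
      this.status = status;
      this.fetchTimeMillis = fetchTimeMillis;
    }
  }

  private final int capacity;
  private final ArrayDeque<FetchEvent> events;

  RecentFetchBuffer(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
    this.events = new ArrayDeque<>(capacity);
  }

  /** Adds an event, evicting the oldest one when the buffer is full. */
  synchronized void add(FetchEvent event) {
    if (events.size() == capacity) {
      events.pollFirst();
    }
    events.addLast(event);
  }

  /** Returns a copy of the buffered events, oldest first. */
  synchronized List<FetchEvent> snapshot() {
    return new ArrayList<>(events);
  }
}
```

Because the buffer is bounded, memory use stays constant no matter how long the 
fetch list is, and the GUI only ever sees the most recent events per task.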

For overall progress and status information, the job counters would be the most 
natural way because they provide an aggregated view over all tasks of one job, 
and it should be possible to retrieve the counters from the job tracker. It's 
also a good idea to add parser-related counters (e.g., number of outlinks) to 
provide more metrics in the GUI.
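
To illustrate what "aggregated view over all tasks" means, here is a small 
sketch in plain Java. Hadoop does this summing itself for real job counters; 
the class CounterAggregation and the counter names used in the comments are 
assumptions for illustration only, not Nutch or Hadoop API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that sums per-task counters into job-level totals. */
final class CounterAggregation {

  private CounterAggregation() {}

  /**
   * Sums counters with the same name across all tasks.
   * Example names (assumed): "success", "notfound", "outlinks".
   */
  static Map<String, Long> aggregate(List<Map<String, Long>> perTaskCounters) {
    Map<String, Long> totals = new TreeMap<>();
    for (Map<String, Long> task : perTaskCounters) {
      for (Map.Entry<String, Long> entry : task.entrySet()) {
        // merge() inserts the value or adds it to the existing total
        totals.merge(entry.getKey(), entry.getValue(), Long::sum);
      }
    }
    return totals;
  }
}
```

A front-end polling such totals would see one number per counter for the whole 
job, independent of how many fetcher tasks are running.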

> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
>                 Key: NUTCH-2011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2011
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a 
> real-time JSON response of the current/past Fetched URLs. 
> This endpoint also includes pagination of the output to reduce data transfer 
> bw in large crawls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
