[ https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547760#comment-14547760 ]
Sebastian Nagel commented on NUTCH-2011:
----------------------------------------

Hi [~sujenshah],

first a few questions to get a common understanding:
* "fetch round" means one fetch job, right?
* "greater depth fetch rounds" means long fetch lists? One fetch job always has a depth of 1 (unless "fetcher.follow.outlinks.depth" is used, but this is not recommended because links are not unified and the same URL/document may be fetched multiple times).
* Can you explain the use case in the GUI/front-end that the information in FetchNodeDb is used for?

Which URLs are going to be fetched in one round is known in advance: that's the generated fetch list. After each fetch job all data is stored in the segment the Fetcher is reading from and writing to. I don't see the need for an additional persistence layer. In case the server is continuing a crawl, it has to obtain the numbers from existing data structures (mainly CrawlDb).

To get real-time information about what the fetcher tasks are doing, a small queue (ring buffer) should be enough. The GUI can then show what's going on in each of the fetcher tasks.

For overall progress and status information, the job counters would be the most natural way because they provide an aggregated view over all tasks of one job, and it should be possible to retrieve the counters from the job tracker. And it's a good idea to add parser-related counters (e.g., number of outlinks) to provide more metrics in the GUI.

> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
>                 Key: NUTCH-2011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2011
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a
> real-time JSON response of the current/past Fetched URLs.
> This endpoint also includes pagination of the output to reduce data transfer
> bw in large crawls.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
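
For illustration, the "small queue (ring buffer)" suggested in the comment could look roughly like the sketch below. It is only a minimal sketch under stated assumptions: the class name FetchEventRing and its methods are hypothetical and not part of Nutch, and entries are plain URL strings rather than whatever record the fetcher would actually keep.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fixed-size ring buffer of recently fetched URLs (not a Nutch class). */
class FetchEventRing {
    private final String[] slots;
    private int next = 0; // index where the next entry will be written
    private int size = 0; // number of valid entries, never more than slots.length

    FetchEventRing(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        slots = new String[capacity];
    }

    /** Records a URL, overwriting the oldest entry once the buffer is full. */
    synchronized void add(String url) {
        slots[next] = url;
        next = (next + 1) % slots.length;
        if (size < slots.length) {
            size++;
        }
    }

    /** Returns a copy of the buffered URLs, oldest first. */
    synchronized List<String> snapshot() {
        List<String> out = new ArrayList<>(size);
        int start = (next - size + slots.length) % slots.length;
        for (int i = 0; i < size; i++) {
            out.add(slots[(start + i) % slots.length]);
        }
        return out;
    }
}
```

Because the buffer has a fixed capacity, memory use stays bounded no matter how long the fetch list is; a front-end would poll snapshot() to show recent activity and rely on job counters for totals.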