[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

Sujen Shah (JIRA) Fri, 18 Sep 2015 15:57:22 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876639#comment-14876639
 ]


Sujen Shah commented on NUTCH-2011:
-----------------------------------

Hi [~ahmadia], 
An implementation of this already exists in the org.apache.nutch.fetcher 
package, along with a corresponding endpoint in the REST API. The problem with 
the current implementation is that all of the fetch data (see class FetchNode) 
is held in memory (see class FetchNodeDb), so memory use grows very large on 
large crawls. 

We could discuss a few options for implementing this and come up with a more 
memory-efficient solution. Any suggestions?
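
As one possible starting point for that discussion, here is a minimal sketch 
of a size-bounded store that evicts the oldest entries and serves results in 
pages. It is not the existing FetchNodeDb code: the class BoundedFetchStore, 
its Entry type, and the capacity/offset/limit parameters are all hypothetical 
names chosen for illustration, and dropping old entries is only one of several 
trade-offs we might accept.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Hypothetical bounded store for fetched-URL records.
 * Illustration only; this is NOT Nutch's FetchNodeDb.
 */
final class BoundedFetchStore {

    /** Minimal stand-in for a fetch record; the fields are assumptions. */
    static final class Entry {
        final String url;
        final int status;

        Entry(String url, int status) {
            this.url = url;
            this.status = status;
        }
    }

    private final int capacity;
    private final Deque<Entry> entries = new ArrayDeque<>();

    BoundedFetchStore(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Adds a record, evicting the oldest one once the store is full. */
    synchronized void put(String url, int status) {
        if (entries.size() == capacity) {
            entries.removeFirst();
        }
        entries.addLast(new Entry(url, status));
    }

    /** Number of records currently held; never exceeds the capacity. */
    synchronized int size() {
        return entries.size();
    }

    /** Returns up to {@code limit} records, oldest first, skipping {@code offset}. */
    synchronized List<Entry> page(int offset, int limit) {
        List<Entry> out = new ArrayList<>();
        if (offset < 0 || limit <= 0) {
            return out;
        }
        int index = 0;
        for (Entry e : entries) {
            if (out.size() == limit) {
                break;
            }
            if (index >= offset) {
                out.add(e);
            }
            index++;
        }
        return out;
    }
}
```

With a cap like this, memory stays proportional to the capacity rather than 
to the crawl size, at the cost of losing the oldest records. Spilling evicted 
entries to disk instead of discarding them would be another option to weigh.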

> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
>                 Key: NUTCH-2011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2011
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>
> This fix will create an endpoint to query the Nutch REST service and get a 
> real-time JSON response listing the currently and previously fetched URLs. 
> The endpoint also paginates its output to reduce data-transfer bandwidth 
> in large crawls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
