[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

Andrzej Bialecki (JIRA) Wed, 11 Aug 2010 02:51:47 -0700

     [ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Andrzej Bialecki  updated NUTCH-880:
------------------------------------

    Description: 
This issue is for discussing a REST-style API for accessing Nutch.

Here's an initial idea:

* I propose to use org.restlet for handling requests and returning 
JSON/XML/whatever responses.
* hook up all regular tools so that they can be driven via this API. This would 
have to be an async API, since all Nutch operations take a long time to 
execute. It follows that we also need to be able to list running operations, 
retrieve their current status, and possibly 
abort/cancel/stop/suspend/resume/...? them. This also means that we would 
potentially have to create & manage many threads in a servlet - AFAIK this is 
frowned upon by J2EE purists...
* package this in a webapp (that includes all deps, essentially nutch.job 
content), with the restlet servlet as an entry point.
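
To make the async requirement above concrete, here is a minimal sketch of a job 
registry that such an API could sit on top of. It uses only 
java.util.concurrent; the class name JobManager, its State values and its 
method names are hypothetical and are not part of Nutch or Restlet. A submit 
call returns an id immediately, and the caller later polls the status or 
cancels the job by that id. Suspend/resume is not covered.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical registry of long-running operations (illustration only). */
class JobManager {
    /** RUNNING also covers jobs that are still queued in the pool. */
    enum State { RUNNING, FINISHED, FAILED, CANCELLED, UNKNOWN }

    // Daemon threads, so an idle pool never keeps the JVM alive.
    private final ExecutorService pool = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });
    private final Map<String, Future<?>> jobs = new ConcurrentHashMap<>();

    /** Starts the task in the background and returns its id immediately. */
    String submit(Runnable task) {
        String id = UUID.randomUUID().toString();
        jobs.put(id, pool.submit(task));
        return id;
    }

    /** Ids of all jobs submitted so far, finished or not. */
    Set<String> list() {
        return new TreeSet<>(jobs.keySet());
    }

    /** Non-blocking status check for one job. */
    State status(String id) {
        Future<?> f = jobs.get(id);
        if (f == null) return State.UNKNOWN;
        if (f.isCancelled()) return State.CANCELLED;
        if (!f.isDone()) return State.RUNNING;
        try {
            f.get(); // returns at once because the job is already done
            return State.FINISHED;
        } catch (ExecutionException e) {
            return State.FAILED; // the task threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return State.UNKNOWN;
        }
    }

    /** Interrupts the job; returns false if it is unknown or already done. */
    boolean cancel(String id) {
        Future<?> f = jobs.get(id);
        return f != null && f.cancel(true);
    }
}
```

In a webapp, each of these methods would map onto one REST resource (for 
example, a POST that calls submit and returns the id, and a GET that calls 
status). Keeping the thread pool in one bounded place like this is one way to 
limit the thread-management concern raised above.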

Open issues:

* how to implement the reading of crawl results via this API
* should we manage only crawls that use a single configuration per webapp, or 
should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops 
on them? This would be nice, because it would allow managing several 
different crawls, with different configs, in a single webapp - but it 
complicates the implementation a lot.
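
As a rough illustration of the second open issue, the sketch below shows what 
CRUD operations on named crawl contexts could look like if a context were 
simply a named set of configuration properties. The class CrawlContextStore 
and its methods are hypothetical, and the in-memory map is an assumption; 
nothing here is taken from the Nutch code base.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory store of crawl contexts (illustration only). */
class CrawlContextStore {
    // context name -> immutable snapshot of its configuration properties
    private final Map<String, Map<String, String>> contexts =
            new ConcurrentHashMap<>();

    /** Create: returns false if a context with this name already exists. */
    boolean create(String name, Map<String, String> config) {
        return contexts.putIfAbsent(name, snapshot(config)) == null;
    }

    /** Read: returns the stored configuration, or null if unknown. */
    Map<String, String> read(String name) {
        return contexts.get(name);
    }

    /** Update: replaces an existing configuration; false if unknown. */
    boolean update(String name, Map<String, String> config) {
        return contexts.replace(name, snapshot(config)) != null;
    }

    /** Delete: returns false if there was nothing to remove. */
    boolean delete(String name) {
        return contexts.remove(name) != null;
    }

    /** Names of all known contexts, sorted. */
    Set<String> names() {
        return new TreeSet<>(contexts.keySet());
    }

    // Defensive copy so later changes by the caller do not leak in.
    private static Map<String, String> snapshot(Map<String, String> config) {
        return Collections.unmodifiableMap(new HashMap<>(config));
    }
}
```

Each running operation would then carry the name of the context it was started 
with, which is where the extra implementation complexity mentioned above comes 
from.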

  was:
This issue is for discussing a REST-style API for accessing Nutch.

Here's an initial idea:

* I propose to use org.restlet for handling JSON requests
* hook up all regular tools so that they can be driven via this API. This would 
have to be an async API, since all Nutch operations take a long time to 
execute. It follows that we also need to be able to list running operations, 
retrieve their current status, and possibly 
abort/cancel/stop/suspend/resume/...? them. This also means that we would 
potentially have to create & manage many threads in a servlet - AFAIK this is 
frowned upon by J2EE purists...
* package this in a webapp (that includes all deps, essentially nutch.job 
content), with the restlet servlet as an entry point.

Open issues:

* how to implement the reading of crawl results via this API
* should we manage only crawls that use a single configuration per webapp, or 
should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops 
on them? This would be nice, because it would allow managing several 
different crawls, with different configs, in a single webapp - but it 
complicates the implementation a lot.


> REST API (and webapp) for Nutch
> -------------------------------
>
>                 Key: NUTCH-880
>                 URL: https://issues.apache.org/jira/browse/NUTCH-880
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 2.0
>            Reporter: Andrzej Bialecki 
>
> This issue is for discussing a REST-style API for accessing Nutch.
> Here's an initial idea:
> * I propose to use org.restlet for handling requests and returning 
> JSON/XML/whatever responses.
> * hook up all regular tools so that they can be driven via this API. This 
> would have to be an async API, since all Nutch operations take a long time to 
> execute. It follows that we also need to be able to list running 
> operations, retrieve their current status, and possibly 
> abort/cancel/stop/suspend/resume/...? them. This also means that we would 
> potentially have to create & manage many threads in a servlet - AFAIK this is 
> frowned upon by J2EE purists...
> * package this in a webapp (that includes all deps, essentially nutch.job 
> content), with the restlet servlet as an entry point.
> Open issues:
> * how to implement the reading of crawl results via this API
> * should we manage only crawls that use a single configuration per webapp, or 
> should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
> ops on them? This would be nice, because it would allow managing several 
> different crawls, with different configs, in a single webapp - but it 
> complicates the implementation a lot.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.