[
https://issues.apache.org/jira/browse/NUTCH-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699115#comment-13699115
]
Tejas Patil commented on NUTCH-1599:
I agree with Julien: Nutch should be described as a web crawler. Markus took it
to the next level by adding more technical detail :) So "Highly extensible and
scalable web crawler software" it is!
Obtain consensus on new description of Nutch
Key: NUTCH-1599
URL: https://issues.apache.org/jira/browse/NUTCH-1599
Project: Nutch
Issue Type: Improvement
Components: documentation
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 2.3, 1.8
As we seem to be sustaining pushes and maintenance (touch wood) of two
branches, I think it is about time we agreed on a more accurate description
of what Nutch actually is.
We currently have (taken directly from our site):
{code:xml}
Apache Nutch is an open source web-search software project. Stemming from
Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a
crawler, a link-graph database and parsing support handled by Apache Tika for
HTML and an array of other document formats.
Nutch can run on a single machine, but gains a lot of its strength from
running in a Hadoop cluster
The system can be enhanced (eg other document formats can be parsed) using a
highly flexible, easily extensible and thoroughly maintained plugin
infrastructure.
{code}
I propose something along the lines of:
{code:xml}
Apache Nutch is an open source web-search software project. Stemming from
Apache Lucene, the community now develops and maintains two branches:
* 1.x; description of 1.x here
* 2.x; description of 2.x here
Both branches add web-specifics, such as a crawler, a link-graph database and
parsing support handled by Apache Tika for HTML and an array of other document
formats.
Nutch can run on a single machine, but gains a lot of its strength from
running in a Hadoop cluster.
The system can be enhanced (e.g. other document formats can be parsed) using a
highly flexible, easily extensible and thoroughly maintained plugin
infrastructure.
{code}
Any thoughts?