[ https://issues.apache.org/jira/browse/NUTCH-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699115#comment-13699115 ]
Tejas Patil commented on NUTCH-1599:
------------------------------------

I agree with Julien: Nutch should be described as a web crawler. Markus took it to the next level by adding more technicality :)
So "Highly extensible and scalable web crawler software" it is!

> Obtain consensus on new description of Nutch
> --------------------------------------------
>
>                 Key: NUTCH-1599
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1599
>             Project: Nutch
>          Issue Type: Improvement
>          Components: documentation
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 2.3, 1.8
>
>
> As we seem to be sustaining pushes and maintenance (touch wood) of two
> branches, I think it is about time we agreed on a more accurate description
> of what Nutch actually is.
> We currently have (taken directly from our site)
> {code:xml}
> Apache Nutch is an open source web-search software project. Stemming from
> Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a
> crawler, a link-graph database and parsing support handled by Apache Tika for
> HTML and an array of other document formats.
> Nutch can run on a single machine, but gains a lot of its strength from
> running in a Hadoop cluster
> The system can be enhanced (eg other document formats can be parsed) using a
> highly flexible, easily extensible and thoroughly maintained plugin
> infrastructure.
> {code}
> I suggest/propose something along the lines of
> {code:xml}
> Apache Nutch is an open source web-search software project. Stemming from
> Apache Lucene, the community now develops and maintains two branches:
> * 1.x; description of 1.x here
> * 2.x; description of 2.x here
> Both branches add web-specifics, such as a crawler, a link-graph database and
> parsing support handled by Apache Tika for HTML and an array of other document
> formats.
> Nutch can run on a single machine, but gains a lot of its strength from
> running in a Hadoop cluster
> The system can be enhanced (eg other document formats can be parsed) using a
> highly flexible, easily extensible and thoroughly maintained plugin
> infrastructure.
> {code}
> Any thoughts?