[jira] [Commented] (NUTCH-1599) Obtain consensus on new description of Nutch

2013-07-03 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699115#comment-13699115
 ] 

Tejas Patil commented on NUTCH-1599:


I agree with Julien: Nutch should be described as a web crawler. Markus took it 
to the next level by adding more technical detail :) So "Highly extensible and 
scalable web crawler software" it is!

 Obtain consensus on new description of Nutch
 

 Key: NUTCH-1599
 URL: https://issues.apache.org/jira/browse/NUTCH-1599
 Project: Nutch
 Issue Type: Improvement
 Components: documentation
 Reporter: Lewis John McGibbney
 Assignee: Lewis John McGibbney
 Fix For: 2.3, 1.8


 As we seem to be sustaining pushes and maintenance (touch wood) of two 
 branches, I think it is about time we agreed on a more accurate description 
 of what Nutch actually is.
 We currently have (taken directly from our site)
 {code:xml}
 Apache Nutch is an open source web-search software project. Stemming from 
 Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a 
 crawler, a link-graph database and parsing support handled by Apache Tika for 
 HTML and an array of other document formats.
 Nutch can run on a single machine, but gains a lot of its strength from 
 running in a Hadoop cluster.
 The system can be enhanced (e.g. other document formats can be parsed) using a 
 highly flexible, easily extensible and thoroughly maintained plugin 
 infrastructure.
 {code}
 I suggest/propose something along the lines of
 {code:xml}
 Apache Nutch is an open source web-search software project. Stemming from 
 Apache Lucene, the community now develops and maintains two branches:
 * 1.x; description of 1.x here
 * 2.x; description of 2.x here
 Both branches add web-specifics, such as a crawler, a link-graph database and 
 parsing support handled by Apache Tika for HTML and an array of other document 
 formats.
 Nutch can run on a single machine, but gains a lot of its strength from 
 running in a Hadoop cluster.
 The system can be enhanced (e.g. other document formats can be parsed) using a 
 highly flexible, easily extensible and thoroughly maintained plugin 
 infrastructure.
 {code}
 Any thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1599) Obtain consensus on new description of Nutch

2013-07-03 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699362#comment-13699362
 ] 

Markus Jelsma commented on NUTCH-1599:
--

nice! thanks
