[jira] [Commented] (NUTCH-1599) Obtain consensus on new description of Nutch

Tejas Patil (JIRA) Wed, 03 Jul 2013 09:05:46 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699115#comment-13699115
 ]


Tejas Patil commented on NUTCH-1599:
------------------------------------

I agree with Julien: Nutch should be described as a web crawler. Markus took it 
to the next level by adding more technical detail :) So "Highly extensible and 
scalable web crawler software" it is!
                
> Obtain consensus on new description of Nutch
> --------------------------------------------
>
>                 Key: NUTCH-1599
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1599
>             Project: Nutch
>          Issue Type: Improvement
>          Components: documentation
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 2.3, 1.8
>
>
> As we seem to be sustaining pushes and maintenance (touch wood) of two 
> branches, I think it is about time we agreed on a more accurate description 
> of what Nutch actually is.
> We currently have (taken directly from our site)
> {code:xml}
> Apache Nutch is an open source web-search software project. Stemming from 
> Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a 
> crawler, a link-graph database and parsing support handled by Apache Tika for 
> HTML and an array of other document formats.
> Nutch can run on a single machine, but gains a lot of its strength from 
> running in a Hadoop cluster.
> The system can be enhanced (e.g. other document formats can be parsed) using a 
> highly flexible, easily extensible and thoroughly maintained plugin 
> infrastructure.
> {code}
> I suggest/propose something along the lines of
> {code:xml}
> Apache Nutch is an open source web-search software project. Stemming from 
> Apache Lucene, the community now develops and maintains two branches:
> * 1.x; description of 1.x here
> * 2.x; description of 2.x here
> Both branches add web-specifics, such as a crawler, a link-graph database and 
> parsing support handled by Apache Tika for HTML and an array of other document 
> formats.
> Nutch can run on a single machine, but gains a lot of its strength from 
> running in a Hadoop cluster.
> The system can be enhanced (e.g. other document formats can be parsed) using a 
> highly flexible, easily extensible and thoroughly maintained plugin 
> infrastructure.
> {code}
> Any thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
