[ 
https://issues.apache.org/jira/browse/NUTCH-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Lopez updated NUTCH-1821:
------------------------------

    Description: 
Hi all,

Some of us are using Amazon EMR to deploy and run Nutch, and from what I've been 
reading on the users mailing list there are two common issues people run into: 
first, EMR supports Hadoop only up to 1.0.3 (which is fairly old), and second, 
as of Nutch 1.8 the Crawl class has been deprecated and removed. 

The first issue is a problem when we try to deploy recent Nutch versions: the 
most recent version supported by EMR is 1.6. The second issue matters because 
EMR is handed a jar and a main class to do its job, and from 1.8 on the Crawl 
class has been removed.
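To make the constraint concrete, this is roughly how a Nutch job gets submitted 
to EMR as a custom-jar step (a hypothetical sketch using the old 
elastic-mapreduce CLI; the bucket name, job file, and crawl arguments are 
placeholders, not from our setup):

```shell
#!/bin/sh
# Hypothetical EMR custom-jar step: the cluster is handed a jar plus a main
# class, which is why EMR needs an invocable entry point like the pre-1.8
# Crawl class. RUN=echo (the default here) turns this into a dry run that
# only prints the command it would submit.
RUN="${RUN:-echo}"

submit_step() {
  $RUN elastic-mapreduce --create --name "nutch-crawl" \
    --jar s3://my-bucket/apache-nutch-1.6.job \
    --main-class org.apache.nutch.crawl.Crawl \
    --arg urls --arg -dir --arg crawl --arg -depth --arg 3 --arg -topN --arg 50000
}

submit_step
```

Once the Crawl class is gone there is no single main class to point that step 
at, which is what motivated the branch below.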

After some testing we completed a branch (based on Nutch 1.6 + Hadoop 1.0.3) 
that improves the old Crawl class so it scales. Since 1.6 is an old version, I 
wonder how we can contribute this back to those who need to use Elastic MapReduce.

The things we did are:

a) Add the number of fetchers as a parameter to the Crawl class.
    For some reason the generator was always defaulting to one fetch list (see: 
http://stackoverflow.com/questions/10264183/why-does-nutch-only-run-the-fetch-step-on-one-hadoop-node-when-the-cluster-has), 
creating just one fetch map task; with the new parameter we can adjust the 
number of map tasks to fit the cluster size.
b) Index documents on each crawl cycle instead of at the end.
    We had performance and memory issues when we tried to index all the 
documents after the whole crawl was done, so we moved the indexing step into 
the main crawl cycle.
c) Add an option to delete segments after their content is indexed into Solr. 
This saves HDFS space, since the EC2 instances we use don't have much disk.
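
To tie a), b) and c) together, one cycle of the modified loop looks roughly 
like this (a dry-run sketch in bin/nutch terms, not the actual patch; the 
paths, Solr URL, and segment timestamp are placeholders, and we assume the 
Generator's -numFetchers option):

```shell
#!/bin/sh
# Sketch of one crawl cycle with the three changes above:
#  (a) pass an explicit -numFetchers so generate emits one fetch list per slot,
#  (b) run solrindex inside the cycle instead of once after all cycles,
#  (c) delete the segment right after indexing to reclaim HDFS space.
# RUN=echo (the default) makes this a dry run that only prints the commands;
# NUTCH, SOLR_URL, paths, and the segment name are placeholders.
RUN="${RUN:-echo}"
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="crawl/crawldb"
SEGMENTS="crawl/segments"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
NUM_FETCHERS="${NUM_FETCHERS:-4}"   # (a) sized to the cluster, not left at 1

crawl_cycle() {
  seg="$SEGMENTS/$1"   # placeholder for the segment generate just created
  $RUN "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN 50000 -numFetchers "$NUM_FETCHERS"
  $RUN "$NUTCH" fetch "$seg"
  $RUN "$NUTCH" parse "$seg"
  $RUN "$NUTCH" updatedb "$CRAWLDB" "$seg"
  $RUN "$NUTCH" solrindex "$SOLR_URL" "$CRAWLDB" "$seg"   # (b) index each cycle
  $RUN hadoop fs -rmr "$seg"                              # (c) free the space
}

crawl_cycle 20140601120000
```

As written it only prints the commands; the real loop invokes them directly 
and repeats the cycle up to the configured depth.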

So far these changes have allowed us to scale Nutch out and make efficient use 
of Amazon EMR clusters. If you think there is value in these changes, we can 
submit a patch file.

Luis.



> Nutch Crawl class for EMR
> -------------------------
>
>                 Key: NUTCH-1821
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1821
>             Project: Nutch
>          Issue Type: Wish
>    Affects Versions: 1.6
>         Environment: Amazon EMR
>            Reporter: Luis Lopez
>              Labels: Amazon, Crawler, EMR, performance
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
