Re: [VOTE] Release Apache Nutch 2.3.1

Drulea, Sherban Mon, 05 Oct 2015 10:54:00 -0700

Hi Sebastian,

I tried multiple URLs in my seed.txt file. None of them result in the
nutch generator crawling any links.


Here’s my environment:
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
SOLR 4.6.0
Mongo version 3.0.2.
Nutch 2.3.1

―――――――――――――――

regex-urlfilter.txt:
―――――――――――――――
+.

―――――――――――――――
nutch-site.xml
―――――――――――――――
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>http.agent.name</name>
        <value>nutch Mongo Solr Crawler</value>
    </property>

    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.mongodb.store.MongoStore</value>
        <description>Default class for storing data</description>
    </property>
    
    <property>
        <name>plugin.includes</name>
        
        <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
        <description>Regular expression naming plugin directory names to include.</description>
   </property>
    
</configuration>


―――――――――――――――
gora.properties:
―――――――――――――――
############################
# MongoDBStore properties  #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=method_centers

―――――――――――――――
seed.txt
―――――――――――――――
http://punklawyer.com
http://mail-archives.apache.org/mod_mbox/nutch-user/
http://hbase.apache.org/index.html
http://wiki.apache.org/nutch/FrontPage
http://www.aintitcool.com/
―――――――――――――――

Here are the results of the crawl command
"./bin/crawl urls methods http://127.0.0.1:8983/solr/ 2":
Injecting seed URLs
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls
-crawlId methods
InjectorJob: starting at 2015-10-01 18:27:23
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the
Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and
filtering: 5
Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02
Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2
Generating batchId
Generating a new fetchlist
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
-crawlId methods -batchId 1443749246-29495
GeneratorJob: starting at 2015-10-01 18:27:26
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
Fetching : 
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D fetcher.timelimit.mins=180
1443749246-29495 -crawlId methods -threads 50
FetcherJob: starting at 2015-10-01 18:27:29
FetcherJob: batchId: 1443749246-29495
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1443760049865
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
...
-finishing thread FetcherThread49, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
...

-finishing thread FetcherThread48, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread49, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
in 0 queues
-activeThreads=0
FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12
Parsing : 
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
mapred.skip.attempts.to.start.skipping=2 -D
mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods
ParserJob: starting at 2015-10-01 18:27:43
ParserJob: resuming:  false
ParserJob: forced reparse:  false
ParserJob: batchId: 1443749246-29495
ParserJob: success
ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02
CrawlDB update for methods

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true 1443749246-29495 -crawlId methods
DbUpdaterJob: starting at 2015-10-01 18:27:46
DbUpdaterJob: batchId: 1443749246-29495
DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02
Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication


IndexingJob: done.
SOLR dedup -> http://127.0.0.1:8983/solr/
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true http://127.0.0.1:8983/solr/
Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2
Generating batchId
Generating a new fetchlist
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
-crawlId methods -batchId 1443749274-17203
GeneratorJob: starting at 2015-10-01 18:27:55
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

There are no errors, but also no data. What else can I debug?
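
One thing that can be checked directly from the output above is whether every job is working on the same batch id. In iteration 1 the generate command is passed -batchId 1443749246-29495, GeneratorJob reports batch id 1443749246-1282586680, and FetcherJob then reads batchId 1443749246-29495 and finds 0 records. Whether that mismatch is the actual cause is an assumption, not something the log confirms. A minimal Python sketch of the comparison follows; the helper name batch_ids and the sample string are illustrative only, and the patterns are taken from the log lines quoted above.

```python
import re

def batch_ids(log_text):
    """Collect the batch ids mentioned by each job in pasted crawl output."""
    # Collapse whitespace so that mail line wrapping does not split a match.
    text = " ".join(log_text.split())
    requested = re.findall(r"-batchId (\d+-\d+)", text)
    generated = re.findall(r"generated batch id: (\d+-\d+)", text)
    fetched = re.findall(r"FetcherJob: batchId: (\d+-\d+)", text)
    return requested, generated, fetched

# Lines copied from iteration 1 of the crawl output above.
sample = """
-crawlId methods -batchId 1443749246-29495
GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
FetcherJob: batchId: 1443749246-29495
QueueFeeder finished: total 0 records. Hit by time limit :0
"""

requested, generated, fetched = batch_ids(sample)
# Batch ids the fetcher was asked to process but the generator never reported.
mismatch = [b for b in fetched if b not in generated]
print("fetcher batch ids not reported by the generator:", mismatch)
```

If the list is non-empty, the fetcher may be looking for rows marked with a batch id that the generator never wrote, which would be consistent with "QueueFeeder finished: total 0 records".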

I see some warnings in my hadoop.log, but nothing glaring:

2015-10-01 18:19:29,430 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
2015-10-01 18:19:29,441 INFO  crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2015-10-01 18:19:29,441 INFO  crawl.AbstractFetchSchedule -
defaultInterval=2592000
2015-10-01 18:19:29,442 INFO  crawl.AbstractFetchSchedule -
maxInterval=7776000
2015-10-01 18:19:30,326 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2015-10-01 18:19:30,327 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2015-10-01 18:19:30,405 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2015-10-01 18:19:30,406 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
….
2015-10-01 18:27:23,838 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
2015-10-01 18:27:24,567 INFO  crawl.InjectorJob - InjectorJob: Using class
org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
2015-10-01 18:27:24,969 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2015-10-01 18:27:24,971 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2015-10-01 18:27:25,050 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2015-10-01 18:27:25,052 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.

2015-10-01 18:27:30,288 INFO  httpclient.Http - http.proxy.host = null
2015-10-01 18:27:30,288 INFO  httpclient.Http - http.proxy.port = 8080
2015-10-01 18:27:30,288 INFO  httpclient.Http - http.timeout = 10000
2015-10-01 18:27:30,288 INFO  httpclient.Http - http.content.limit = 65536
2015-10-01 18:27:30,288 INFO  httpclient.Http - http.agent = nutch Mongo
Solr Crawler/Nutch-2.4-SNAPSHOT
2015-10-01 18:27:30,288 INFO  httpclient.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2015-10-01 18:27:30,288 INFO  httpclient.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2015-10-01 18:27:30,292 INFO  httpclient.Http - http.proxy.host = null
2015-10-01 18:27:30,292 INFO  httpclient.Http - http.proxy.port = 8080
2015-10-01 18:27:30,292 INFO  httpclient.Http - http.timeout = 10000
2015-10-01 18:27:30,292 INFO  httpclient.Http - http.content.limit = 65536
2015-10-01 18:27:30,292 INFO  httpclient.Http - http.agent = nutch Mongo
Solr Crawler/Nutch-2.4-SNAPSHOT
2015-10-01 18:27:30,292 INFO  httpclient.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2015-10-01 18:27:30,292 INFO  httpclient.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

I’ve been trying this for 3 days with no luck. I want to use Nutch but may
be forced to use another program.

My best guess is that something is borked in my plugin.includes:

<property>
        <name>plugin.includes</name>
        
        <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
        <description>Regular expression naming plugin directory names to include.</description>
   </property>

Are these valid? Is there a more minimal set to try?
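
A quick way to sanity-check the expression itself is to run it against plugin directory names outside of Nutch. The sketch below is Python, not Nutch code. It assumes the value is applied as a whole-string match against each plugin directory name (hence re.fullmatch), and the candidate list is just a hand-typed sample rather than a listing of an actual plugins directory. It also shows what would happen if the line breaks seen in the pasted XML were really present inside the value rather than introduced by mail wrapping.

```python
import re

# The plugin.includes value from nutch-site.xml, written on one logical line.
PLUGIN_INCLUDES = (
    "protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|"
    "index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|"
    "scoring-opic|indexer-solr"
)

def selected(plugin_dirs, pattern=PLUGIN_INCLUDES):
    """Return the directory names that the pattern matches in full."""
    rx = re.compile(pattern)
    return [d for d in plugin_dirs if rx.fullmatch(d)]

# Hand-typed sample of directory names (an assumption, not a real listing).
candidates = [
    "protocol-http", "protocol-httpclient", "urlfilter-regex",
    "parse-html", "parse-tika", "index-basic", "index-anchor",
    "urlnormalizer-basic", "scoring-opic", "indexer-solr",
    "protocol-file", "index-more",
]

chosen = selected(candidates)
rejected = [d for d in candidates if d not in chosen]
print("included:", chosen)
print("excluded:", rejected)

# Same value, but with a literal newline where the pasted XML is wrapped.
wrapped = PLUGIN_INCLUDES.replace("index-(", "index-(\n")
print("index-basic with wrapped value:", selected(["index-basic"], wrapped))
```

Under these assumptions the expression is syntactically fine and selects the intended names, so the thing to verify in the real file is that the value sits on a single line.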

Cheers,
Sherban




On 10/4/15, 12:23 PM, "Sebastian Nagel" <wastl.na...@googlemail.com> wrote:

>Hi Sherban,
>
>> Right now it finds 0 URLs with no errors.
>
>Can you specify what's going wrong. It could
>be everything, even a configuration problem.
>What did you crawl? Using which storage back-end?
>
>Thanks,
>Sebastian
>
>
>On 10/02/2015 03:02 AM, Drulea, Sherban wrote:
>> Hi Lewis,
>> 
>> -1 until I verify nutch actually crawls. Right now it finds 0 URLs with
>> no errors.
>> 
>> 2.3.1 is an improvement over 2.3.0 which didn't work with Mongo at all.
>> 
>> Cheers,
>> Sherban
>> 
>> 
>> 
>> On 9/30/15, 5:35 PM, "Lewis John Mcgibbney" <lewis.mcgibb...@gmail.com>
>> wrote:
>> 
>>> Hi Folks,
>>> Is anyone else able to test and run the release candidate for 2.3.1?
>>> It would be great to get a release if we can get the VOTEs and the RC
>>> is suitable.
>>> Thanks in advance.
>>> Best
>>> Lewis
>>>
>>> On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney <
>>> lewis.mcgibb...@gmail.com> wrote:
>>>
>>>> Hi Folks,
>>>> It turns out the formatting for the original email below was terrible.
>>>> Sorry about that.
>>>> I've hopefully corrected formatting now. Please VOTE away!
>>>>
>>>> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney <
>>>> lewis.mcgibb...@gmail.com> wrote:
>>>>
>>>>> Hi user@ & dev@,
>>>>>
>>>>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>>>>>
>>>>> We addressed 32 issues in all, which can be seen in the release
>>>>> report
>>>>> http://s.apache.org/nutch_2.3.1
>>>>>
>>>>> The release candidate comprises the following components.
>>>>>
>>>>> * A staging repository [0] containing various Maven artifacts
>>>>> * A branch-2.3.1 of the 2.x code [1]
>>>>> * The tagged source upon which we are VOTE'ing [2]
>>>>> * Finally, the release artifacts [3] which i would encourage you to
>>>>> verify for signatures and test.
>>>>>
>>>>> You should use the following KEYS [4] file to verify the signatures
>>>>> of all release artifacts.
>>>>>
>>>>> Please VOTE as follows
>>>>>
>>>>> [ ] +1 Push the release, I am happy :)
>>>>> [ ] +/-0 I am not bothered either way
>>>>> [ ] -1 I am not happy with this release candidate (please state why)
>>>>>
>>>>> Firstly, thank you to everyone that contributed to Nutch. Secondly,
>>>>> thank you to everyone that VOTEs. It is appreciated.
>>>>>
>>>>> Thanks
>>>>> Lewis
>>>>> (on behalf of Nutch PMC)
>>>>>
>>>>> p.s. Here's my +1
>>>>>
>>>>> [0] https://repository.apache.org/content/repositories/orgapachenutch-1005
>>>>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>>>>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>>>>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>>>>> [4] http://www.apache.org/dist/nutch/KEYS
>>>>>
>>>>> --
>>>>> *Lewis*
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *Lewis*
>>>>
>>>
>>>
>>>
>>> -- 
>>> *Lewis*
>> 
>> 
>> 
>
