Hi Sebastian, I tried multiple URLs in my seed.txt file. None of them result in the nutch generator crawling any links.

Here’s my environment:

java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
SOLR 4.6.0
Mongo version 3.0.2
Nutch 2.3.1

―――――――――――――――
regex-urlfilter.txt:
―――――――――――――――
+.

―――――――――――――――
nutch-site.xml
―――――――――――――――
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch Mongo Solr Crawler</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.mongodb.store.MongoStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
<description>Regular expression naming plugin directory names to include.
</description>
</property>
</configuration>

―――――――――――――――
gora.properties:
―――――――――――――――
############################
# MongoDBStore properties  #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=method_centers

―――――――――――――――
seed.txt
―――――――――――――――
http://punklawyer.com
http://mail-archives.apache.org/mod_mbox/nutch-user/
http://hbase.apache.org/index.html
http://wiki.apache.org/nutch/FrontPage
http://www.aintitcool.com/
―――――――――――――――

Here are the results of the crawl command "./bin/crawl urls methods http://127.0.0.1:8983/solr/ 2":

Injecting seed URLs
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls -crawlId methods
InjectorJob: starting at 2015-10-01 18:27:23
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 5
Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02
Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2
Generating batchId
Generating a new fetchlist
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749246-29495
GeneratorJob: starting at 2015-10-01 18:27:26
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
Fetching :
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 1443749246-29495 -crawlId methods -threads 50
FetcherJob: starting at 2015-10-01 18:27:29
FetcherJob: batchId: 1443749246-29495
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1443760049865
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
...
-finishing thread FetcherThread49, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
...
-finishing thread FetcherThread48, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread49, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12
Parsing :
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods
ParserJob: starting at 2015-10-01 18:27:43
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1443749246-29495
ParserJob: success
ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02
CrawlDB update for methods
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1443749246-29495 -crawlId methods
DbUpdaterJob: starting at 2015-10-01 18:27:46
DbUpdaterJob: batchId: 1443749246-29495
DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02
Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
IndexingJob: done.
SOLR dedup -> http://127.0.0.1:8983/solr/
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://127.0.0.1:8983/solr/
Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2
Generating batchId
Generating a new fetchlist
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId methods -batchId 1443749274-17203
GeneratorJob: starting at 2015-10-01 18:27:55
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

There are no errors, but also no data. What else can I debug?

I see some warnings in my hadoop.log, but nothing glaring:

….
2015-10-01 18:19:29,430 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-10-01 18:19:29,441 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2015-10-01 18:19:29,441 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2015-10-01 18:19:29,442 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2015-10-01 18:19:30,326 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:19:30,327 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_local1900181322_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2015-10-01 18:19:30,405 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:19:30,406 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181322_0001/job_local1900181322_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
….
2015-10-01 18:27:23,838 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-10-01 18:27:24,567 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
2015-10-01 18:27:24,969 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:27:24,971 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_local1182157052_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2015-10-01 18:27:25,050 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-10-01 18:27:25,052 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157052_0001/job_local1182157052_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.host = null
2015-10-01 18:27:30,288 INFO httpclient.Http - http.proxy.port = 8080
2015-10-01 18:27:30,288 INFO httpclient.Http - http.timeout = 10000
2015-10-01 18:27:30,288 INFO httpclient.Http - http.content.limit = 65536
2015-10-01 18:27:30,288 INFO httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2015-10-01 18:27:30,288 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.host = null
2015-10-01 18:27:30,292 INFO httpclient.Http - http.proxy.port = 8080
2015-10-01 18:27:30,292 INFO httpclient.Http - http.timeout = 10000
2015-10-01 18:27:30,292 INFO httpclient.Http - http.content.limit = 65536
2015-10-01 18:27:30,292 INFO httpclient.Http - http.agent = nutch Mongo Solr Crawler/Nutch-2.4-SNAPSHOT
2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2015-10-01 18:27:30,292 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

I’ve been trying this for 3 days with no luck. I want to use Nutch but may be forced to use another program.

My best guess is that something is borked with my plugin.includes:

<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
<description>Regular expression naming plugin directory names to include.
</description>
</property>

Are these valid? Is there a more minimal set to try?

Cheers,
Sherban

On 10/4/15, 12:23 PM, "Sebastian Nagel" <wastl.na...@googlemail.com> wrote:

>Hi Sherban,
>
>> Right now it finds 0 URLs with no errors.
>
>Can you specify what's going wrong? It could
>be everything, even a configuration problem.
>What did you crawl? Using which storage back-end?
>
>Thanks,
>Sebastian
>
>
>On 10/02/2015 03:02 AM, Drulea, Sherban wrote:
>> Hi Lewis,
>>
>> -1 until I verify nutch actually crawls. Right now it finds 0 URLs
>> with no errors.
>>
>> 2.3.1 is an improvement over 2.3.0 which didn't work with Mongo at all.
>>
>> Cheers,
>> Sherban
>>
>>
>>
>> On 9/30/15, 5:35 PM, "Lewis John Mcgibbney" <lewis.mcgibb...@gmail.com>
>> wrote:
>>
>>> Hi Folks,
>>> Is anyone else able to test and run the release candidate for 2.3.1?
>>> It would be great to get a release if we can get the VOTE's and the RC
>>> is suitable.
>>> Thanks in advance.
>>> Best
>>> Lewis
>>>
>>> On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney <
>>> lewis.mcgibb...@gmail.com> wrote:
>>>
>>>> Hi Folks,
>>>> It turns out the formatting for the original email below was terrible.
>>>> Sorry about that.
>>>> I've hopefully corrected formatting now. Please VOTE away!
>>>>
>>>> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney <
>>>> lewis.mcgibb...@gmail.com> wrote:
>>>>
>>>>> Hi user@ & dev@,
>>>>>
>>>>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>>>>>
>>>>> We addressed 32 issues in all which can be seen at the release
>>>>> report
>>>>> http://s.apache.org/nutch_2.3.1
>>>>>
>>>>> The release candidate comprises the following components.
>>>>>
>>>>> * A staging repository [0] containing various Maven artifacts
>>>>> * A branch-2.3.1 of the 2.x code [1]
>>>>> * The tagged source upon which we are VOTE'ing [2]
>>>>> * Finally, the release artifacts [3] which I would encourage you to
>>>>> verify for signatures and test.
>>>>>
>>>>> You should use the following KEYS [4] file to verify the signatures
>>>>> of all release artifacts.
>>>>>
>>>>> Please VOTE as follows
>>>>>
>>>>> [ ] +1 Push the release, I am happy :)
>>>>> [ ] +/-0 I am not bothered either way
>>>>> [ ] -1 I am not happy with this release candidate (please state why)
>>>>>
>>>>> Firstly thank you to everyone that contributed to Nutch. Secondly,
>>>>> thank you to everyone that VOTE's. It is appreciated.
>>>>>
>>>>> Thanks
>>>>> Lewis
>>>>> (on behalf of Nutch PMC)
>>>>>
>>>>> p.s. Here's my +1
>>>>>
>>>>> [0] https://repository.apache.org/content/repositories/orgapachenutch-1005
>>>>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>>>>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>>>>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>>>>> [4] http://www.apache.org/dist/nutch/KEYS
>>>>>
>>>>> --
>>>>> *Lewis*
>>>>>
>>>>
>>>> --
>>>> *Lewis*
>>>>
>>>
>>> --
>>> *Lewis*
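
A side note on the crawl output near the top of this message: the generate step is invoked with -batchId 1443749246-29495, GeneratorJob then reports "generated batch id: 1443749246-1282586680 containing 5 URLs", and the fetcher, run with 1443749246-29495, reports "QueueFeeder finished: total 0 records". Whether that mismatch explains the empty fetch is an assumption, not something the logs confirm. The sketch below is only a helper for spotting such mismatches in pasted crawl output; the function name batch_id_mismatches is hypothetical, and the two line formats are copied from the output above.

```python
import re

# The two places a batch id appears around the generate step.
# Both formats are taken from the crawl output quoted in this message.
REQUESTED = re.compile(r"-batchId (\S+)")
GENERATED = re.compile(r"generated batch id: (\S+)")

# Two lines from iteration 1 of the crawl output (command shortened).
SAMPLE = (
    "nutch generate -topN 50000 -noNorm -noFilter -adddays 0 "
    "-crawlId methods -batchId 1443749246-29495\n"
    "GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs\n"
)


def batch_id_mismatches(log_text):
    """Return (requested, generated) pairs whose batch ids differ.

    Pairs the n-th "-batchId" argument with the n-th "generated batch id"
    report, assuming one generate call per crawl iteration.
    """
    requested = REQUESTED.findall(log_text)
    generated = GENERATED.findall(log_text)
    return [(r, g) for r, g in zip(requested, generated) if r != g]


if __name__ == "__main__":
    for req, gen in batch_id_mismatches(SAMPLE):
        print(f"requested {req} but GeneratorJob reported {gen}")
```

Feeding it the full console output instead of SAMPLE would check both iterations at once.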
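
On the "Are these valid?" question about plugin.includes: one way to sanity-check the value is to test it as a regular expression against the plugin directory names it is meant to select. The sketch below does that in Python, assuming Nutch matches each plugin id against the whole expression (that behaviour is an assumption here, as is the helper name unmatched). It also shows what a stray space after "index-(" would do, the kind of damage line wrapping can introduce when a config is pasted.

```python
import re

# plugin.includes as intended: one line, no whitespace inside the value.
PLUGIN_INCLUDES = (
    "protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|"
    "index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|"
    "scoring-opic|indexer-solr"
)

# Plugin directory names the value spells out.
PLUGIN_IDS = [
    "protocol-http", "protocol-httpclient", "urlfilter-regex",
    "parse-html", "parse-tika", "index-basic", "index-anchor",
    "urlnormalizer-pass", "urlnormalizer-regex", "urlnormalizer-basic",
    "scoring-opic", "indexer-solr",
]

# Same value with a space after "index-(", as a wrapped paste might contain.
WRAPPED = PLUGIN_INCLUDES.replace("index-(", "index-( ")


def unmatched(pattern, plugin_ids):
    """Return the plugin ids that the pattern does not match in full."""
    compiled = re.compile(pattern)
    return [p for p in plugin_ids if not compiled.fullmatch(p)]


if __name__ == "__main__":
    print("clean value misses:  ", unmatched(PLUGIN_INCLUDES, PLUGIN_IDS))
    print("wrapped value misses:", unmatched(WRAPPED, PLUGIN_IDS))
```

If the file on disk is a single unbroken line, every listed plugin name is selected; the check says nothing about whether those plugins are sufficient for a crawl.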