It seems the problem is with the generator: it doesn't produce any
links to crawl. Is there a way to debug why the generator isn't working?



On 10/1/15, 6:39 PM, "Drulea, Sherban" <sdru...@rand.org> wrote:

>Hi All,
>
>Thanks for pointing me to the 2.3.1 release. It works without error but
>doesn't crawl. I'm out of ideas why.
>
>Here's my environment:
>
>java version "1.8.0_60"
>
>Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
>
>Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
>
>SOLR 4.6.0
>Mongo version 3.0.2.
>Nutch 2.3.1
>
>My regex-urlfilter.txt:
>---------------
>+.
>---------------
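As a sanity check on the filter file above: regex-urlfilter applies the first rule whose pattern is found in the URL, with `+` accepting and `-` rejecting. That first-match-wins behavior can be simulated in a few lines of plain Python (a sketch of the semantics, not Nutch's actual implementation):

```python
import re

# Rules as they appear in regex-urlfilter.txt: '+' accepts, '-' rejects.
# The list below mirrors the single "+." rule from the file above.
RULES = [("+", re.compile(r"."))]

def accepts(url):
    # The first rule whose pattern is found anywhere in the URL decides.
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: URL is filtered out

print(accepts("http://punklawyer.com/"))  # "+." lets every non-empty URL through
```

With `+.` as the only rule, no URL should be rejected, which matches the injector report of 0 URLs rejected by filters.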
>
>nutch-site.xml
>---------------
><?xml version="1.0"?>
><?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
><!-- Put site-specific property overrides in this file. -->
>
><configuration>
>
>    <property>
>        <name>http.agent.name</name>
>        <value>nutch Mongo Solr Crawler</value>
>    </property>
>
>    <property>
>        <name>storage.data.store.class</name>
>        <value>org.apache.gora.mongodb.store.MongoStore</value>
>        <description>Default class for storing data</description>
>    </property>
>
>    <property>
>        <name>plugin.includes</name>
>        
><value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-
>(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr<
>/value>
>        <description>Regular expression naming plugin directory names to
>include. </description>
>   </property>
>
></configuration>
>
>---------------
>
>gora.properties:
>---------------
>############################
># MongoDBStore properties  #
>############################
>gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
>gora.mongodb.override_hadoop_configuration=false
>gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
>gora.mongodb.servers=localhost:27017
>gora.mongodb.db=method_centers
>---------------
>
>Seed.txt
>---------------
>http://punklawyer.com/
>http://mail-archives.apache.org/mod_mbox/nutch-user/
>http://hbase.apache.org/index.html
>http://wiki.apache.org/nutch/FrontPage
>http://www.aintitcool.com/
>---------------
>
>Here are the results of the crawl command "./bin/crawl urls methods
>http://127.0.0.1:8983/solr/ 2"
>
>Injecting seed URLs
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch inject urls
>-crawlId methods
>
>InjectorJob: starting at 2015-10-01 18:27:23
>
>InjectorJob: Injecting urlDir: urls
>
>InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the
>Gora storage class.
>
>InjectorJob: total number of urls rejected by filters: 0
>
>InjectorJob: total number of urls injected after normalization and
>filtering: 5
>
>Injector: finished at 2015-10-01 18:27:26, elapsed: 00:00:02
>
>Thu Oct 1 18:27:26 PDT 2015 : Iteration 1 of 2
>
>Generating batchId
>
>Generating a new fetchlist
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
>-crawlId methods -batchId 1443749246-29495
>
>GeneratorJob: starting at 2015-10-01 18:27:26
>
>GeneratorJob: Selecting best-scoring urls due for fetch.
>
>GeneratorJob: starting
>
>GeneratorJob: filtering: false
>
>GeneratorJob: normalizing: false
>
>GeneratorJob: topN: 50000
>
>GeneratorJob: finished at 2015-10-01 18:27:29, time elapsed: 00:00:02
>
>GeneratorJob: generated batch id: 1443749246-1282586680 containing 5 URLs
>
>Fetching :
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D fetcher.timelimit.mins=180
>1443749246-29495 -crawlId methods -threads 50
>
>FetcherJob: starting at 2015-10-01 18:27:29
>
>FetcherJob: batchId: 1443749246-29495
>
>FetcherJob: threads: 50
>
>FetcherJob: parsing: false
>
>FetcherJob: resuming: false
>
>FetcherJob : timelimit set for : 1443760049865
>
>Using queue mode : byHost
>
>Fetcher: threads: 50
>
>QueueFeeder finished: total 0 records. Hit by time limit :0
>
>-finishing thread FetcherThread0, activeThreads=0
>
>-finishing thread FetcherThread1, activeThreads=0
>
>[... FetcherThread2 through FetcherThread49 finish the same way, activeThreads=0 ...]
>
>Fetcher: throughput threshold: -1
>
>Fetcher: throughput threshold sequence: 5
>
>0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0
>URLs in 0 queues
>
>-activeThreads=0
>
>Using queue mode : byHost
>
>Fetcher: threads: 50
>
>QueueFeeder finished: total 0 records. Hit by time limit :0
>
>-finishing thread FetcherThread0, activeThreads=0
>
>-finishing thread FetcherThread1, activeThreads=0
>
>[... FetcherThread2 through FetcherThread49 finish the same way, activeThreads=0 ...]
>
>Fetcher: throughput threshold: -1
>
>Fetcher: throughput threshold sequence: 5
>
>0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0
>URLs in 0 queues
>
>-activeThreads=0
>
>FetcherJob: finished at 2015-10-01 18:27:42, time elapsed: 00:00:12
>
>Parsing :
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D
>mapred.skip.attempts.to.start.skipping=2 -D
>mapred.skip.map.max.skip.records=1 1443749246-29495 -crawlId methods
>
>ParserJob: starting at 2015-10-01 18:27:43
>
>ParserJob: resuming: false
>
>ParserJob: forced reparse: false
>
>ParserJob: batchId: 1443749246-29495
>
>ParserJob: success
>
>ParserJob: finished at 2015-10-01 18:27:45, time elapsed: 00:00:02
>
>CrawlDB update for methods
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true 1443749246-29495 -crawlId methods
>
>DbUpdaterJob: starting at 2015-10-01 18:27:46
>
>DbUpdaterJob: batchId: 1443749246-29495
>
>DbUpdaterJob: finished at 2015-10-01 18:27:48, time elapsed: 00:00:02
>
>Indexing methods on SOLR index -> http://127.0.0.1:8983/solr/
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D
>solr.server.url=http://127.0.0.1:8983/solr/ -all -crawlId methods
>
>IndexingJob: starting
>
>Active IndexWriters :
>
>SOLRIndexWriter
>
>solr.server.url : URL of the SOLR instance (mandatory)
>
>solr.commit.size : buffer size when sending to SOLR (default 1000)
>
>solr.mapping.file : name of the mapping file for fields (default
>solrindex-mapping.xml)
>
>solr.auth : use authentication (default false)
>
>solr.auth.username : username for authentication
>
>solr.auth.password : password for authentication
>
>
>
>IndexingJob: done.
>
>SOLR dedup -> http://127.0.0.1:8983/solr/
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true http://127.0.0.1:8983/solr/
>
>Thu Oct 1 18:27:54 PDT 2015 : Iteration 2 of 2
>
>Generating batchId
>
>Generating a new fetchlist
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch generate -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0
>-crawlId methods -batchId 1443749274-17203
>
>GeneratorJob: starting at 2015-10-01 18:27:55
>
>GeneratorJob: Selecting best-scoring urls due for fetch.
>
>GeneratorJob: starting
>
>GeneratorJob: filtering: false
>
>GeneratorJob: normalizing: false
>
>GeneratorJob: topN: 50000
>
>GeneratorJob: finished at 2015-10-01 18:27:57, time elapsed: 00:00:02
>
>GeneratorJob: generated batch id: 1443749275-2050785747 containing 0 URLs
>
>Generate returned 1 (no new segments created)
>
>Escaping loop: no more URLs to fetch now
>
>So no errors but also no data. What else can I debug?
>
>I see some warnings in my hadoop.log but nothing alarming ...
>
>2015-10-01 18:19:29,430 WARN  util.NativeCodeLoader - Unable to load
>native-hadoop library for your platform... using builtin-java classes
>where applicable
>
>2015-10-01 18:19:29,441 INFO  crawl.FetchScheduleFactory - Using
>FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>
>2015-10-01 18:19:29,441 INFO  crawl.AbstractFetchSchedule -
>defaultInterval=2592000
>
>2015-10-01 18:19:29,442 INFO  crawl.AbstractFetchSchedule -
>maxInterval=7776000
>
>2015-10-01 18:19:30,326 WARN  conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_loc
>al1900181322_0001/job.xml:an attempt to override final parameter:
>mapreduce.job.end-notification.max.retry.interval;  Ignoring.
>
>2015-10-01 18:19:30,327 WARN  conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1900181322/.staging/job_loc
>al1900181322_0001/job.xml:an attempt to override final parameter:
>mapreduce.job.end-notification.max.attempts;  Ignoring.
>
>2015-10-01 18:19:30,405 WARN  conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181
>322_0001/job_local1900181322_0001.xml:an attempt to override final
>parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
>
>2015-10-01 18:19:30,406 WARN  conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1900181
>322_0001/job_local1900181322_0001.xml:an attempt to override final
>parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
>
>...
>
>
>2015-10-01 18:27:23,838 WARN  util.NativeCodeLoader - Unable to load
>native-hadoop library for your platform... using builtin-java classes
>where applicable
>
>2015-10-01 18:27:24,567 INFO  crawl.InjectorJob - InjectorJob: Using
>class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
>
>2015-10-01 18:27:24,969 WARN  conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_loc
>al1182157052_0001/job.xml:an attempt to override final parameter:
>mapreduce.job.end-notification.max.retry.interval;  Ignoring.
>
>2015-10-01 18:27:24,971 WARN  conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1182157052/.staging/job_loc
>al1182157052_0001/job.xml:an attempt to override final parameter:
>mapreduce.job.end-notification.max.attempts;  Ignoring.
>
>2015-10-01 18:27:25,050 WARN  conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157
>052_0001/job_local1182157052_0001.xml:an attempt to override final
>parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
>
>2015-10-01 18:27:25,052 WARN  conf.Configuration -
>file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1182157
>052_0001/job_local1182157052_0001.xml:an attempt to override final
>parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
>
>
>2015-10-01 18:27:30,288 INFO  httpclient.Http - http.proxy.host = null
>
>2015-10-01 18:27:30,288 INFO  httpclient.Http - http.proxy.port = 8080
>
>2015-10-01 18:27:30,288 INFO  httpclient.Http - http.timeout = 10000
>
>2015-10-01 18:27:30,288 INFO  httpclient.Http - http.content.limit = 65536
>
>2015-10-01 18:27:30,288 INFO  httpclient.Http - http.agent = nutch Mongo
>Solr Crawler/Nutch-2.4-SNAPSHOT
>
>2015-10-01 18:27:30,288 INFO  httpclient.Http - http.accept.language =
>en-us,en-gb,en;q=0.7,*;q=0.3
>
>2015-10-01 18:27:30,288 INFO  httpclient.Http - http.accept =
>text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>
>2015-10-01 18:27:30,292 INFO  httpclient.Http - http.proxy.host = null
>
>2015-10-01 18:27:30,292 INFO  httpclient.Http - http.proxy.port = 8080
>
>2015-10-01 18:27:30,292 INFO  httpclient.Http - http.timeout = 10000
>
>2015-10-01 18:27:30,292 INFO  httpclient.Http - http.content.limit = 65536
>
>2015-10-01 18:27:30,292 INFO  httpclient.Http - http.agent = nutch Mongo
>Solr Crawler/Nutch-2.4-SNAPSHOT
>
>2015-10-01 18:27:30,292 INFO  httpclient.Http - http.accept.language =
>en-us,en-gb,en;q=0.7,*;q=0.3
>
>2015-10-01 18:27:30,292 INFO  httpclient.Http - http.accept =
>text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>
>I've been trying this for three days with no luck. I want to use Nutch but
>may be forced to use another program.
>
>My best guess is maybe something is borked with my plugin.includes:
>
><property>
>        <name>plugin.includes</name>
>        
><value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-
>(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr<
>/value>
>        <description>Regular expression naming plugin directory names to
>include. </description>
>   </property>
>
>Are these valid? Is there a more minimal set to try?
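For what it's worth, the pattern itself can be checked offline: Nutch matches each plugin directory name against `plugin.includes` as a whole-string regex (roughly Java's `Pattern.matches`). A quick Python approximation of that check (the list of names probed below is illustrative, not exhaustive):

```python
import re

# The plugin.includes value from nutch-site.xml, joined onto one line.
PLUGIN_INCLUDES = (
    r"protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|"
    r"index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|"
    r"scoring-opic|indexer-solr"
)

pattern = re.compile(PLUGIN_INCLUDES)

# fullmatch mimics Java's Pattern.matches (whole-string match).
for plugin_id in ["protocol-http", "urlfilter-regex", "parse-html",
                  "scoring-opic", "indexer-solr", "parse-js"]:
    included = pattern.fullmatch(plugin_id) is not None
    print(f"{plugin_id}: {'included' if included else 'excluded'}")
```

If a plugin you rely on prints "excluded", the regex is the problem; here the listed IDs all match, so the pattern looks syntactically fine.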
>
>Cheers,
>Sherban
>
>
