Hi Bayu,

there is an open issue with file: URLs, see
https://issues.apache.org/jira/browse/NUTCH-1483
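One quick way to see the kind of problem that issue is about: with only two slashes after `file:`, a generic URL parser reads the first path segment as a host name, so your seed `file://opt/searchengine/test/` becomes host "opt" plus path "/searchengine/test/". A small check with Python's standard library (outside Nutch, just to illustrate the parsing):

```python
from urllib.parse import urlparse

# Two slashes: "opt" is parsed as the host, and the path loses its first segment.
two = urlparse("file://opt/searchengine/test/")
print(two.netloc, two.path)    # opt /searchengine/test/

# Three slashes: the host part is empty and the full path is preserved.
three = urlparse("file:///opt/searchengine/test/")
print(three.netloc, three.path)    # empty host, /opt/searchengine/test/
```

Whether protocol-file trips over exactly this in your setup is what the JIRA issue should clarify, but a three-slash seed is cheap to try.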
Hope the information helps,
Sebastian

On 06/05/2014 06:38 AM, Bayu Widyasanyata wrote:
> Hi,
>
> I'm sure this is an "old" topic, but I still have no luck crawling with it.
> It's a little bit harder than crawling the web over HTTP :(
>
> These are the files I configured:
>
> (1) urls/seed.txt:
>
> file://opt/searchengine/test/
>
> The directory contains one file:
>
> -rw-r--r-- 1 bayu bayu 3272 Jun 5 10:02 Testdocumentsaja.pdf
>
> (2) regex-urlfilter.txt: allow the file: protocol and accept the path URL:
>
> -^(ftp|mailto):
> +^file://opt/searchengine/test
>
> (3) nutch-site.xml: enable protocol-file:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-(http|file)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
> The crawl script runs the common steps (inject - generate - fetch -
> parse - updatedb - solrindex - solrdedup).
> As the hadoop.log below shows, Nutch fetches the file: URL, but it
> never parses the file inside /opt/searchengine/test/.
>
> hadoop.log:
>
> 2014-06-05 10:33:33,274 INFO crawl.Injector - Injector: starting at 2014-06-05 10:33:33
> 2014-06-05 10:33:33,276 INFO crawl.Injector - Injector: crawlDb: /opt/searchengine/nutch/BWCrawl/crawldb
> 2014-06-05 10:33:33,276 INFO crawl.Injector - Injector: urlDir: /opt/searchengine/nutch/urls/seed.txt
> 2014-06-05 10:33:33,277 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2014-06-05 10:33:33,714 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-06-05 10:33:33,807 WARN snappy.LoadSnappy - Snappy native library not loaded
> 2014-06-05 10:33:34,717 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> 2014-06-05 10:33:35,127 INFO crawl.Injector - Injector: total number of urls rejected by filters: 0
> 2014-06-05 10:33:35,131 INFO crawl.Injector - Injector: total number of urls injected after normalization and filtering: 1
> 2014-06-05 10:33:35,132 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
> 2014-06-05 10:33:35,396 INFO crawl.Injector - Injector: overwrite: false
> 2014-06-05 10:33:35,397 INFO crawl.Injector - Injector: update: false
> 2014-06-05 10:33:36,357 INFO crawl.Injector - Injector: finished at 2014-06-05 10:33:36, elapsed: 00:00:03
> 2014-06-05 10:33:37,857 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-06-05 10:33:37,863 INFO crawl.Generator - Generator: starting at 2014-06-05 10:33:37
> 2014-06-05 10:33:37,863 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> 2014-06-05 10:33:37,864 INFO crawl.Generator - Generator: filtering: true
> 2014-06-05 10:33:37,865 INFO crawl.Generator - Generator: normalizing: true
> 2014-06-05 10:33:37,876 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
> 2014-06-05 10:33:38,915 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2014-06-05 10:33:38,916 INFO crawl.AbstractFetchSchedule - defaultInterval=129600
> 2014-06-05 10:33:38,917 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2014-06-05 10:33:38,929 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> 2014-06-05 10:33:39,006 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2014-06-05 10:33:39,007 INFO crawl.AbstractFetchSchedule - defaultInterval=129600
> 2014-06-05 10:33:39,007 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2014-06-05 10:33:39,015 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
> 2014-06-05 10:33:39,384 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
> 2014-06-05 10:33:40,386 INFO crawl.Generator - Generator: segment: /opt/searchengine/nutch/BWCrawl/segments/20140605103340
> 2014-06-05 10:33:40,593 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> 2014-06-05 10:33:41,540 INFO crawl.Generator - Generator: finished at 2014-06-05 10:33:41, elapsed: 00:00:03
> 2014-06-05 10:33:42,634 INFO fetcher.Fetcher - Fetcher: starting at 2014-06-05 10:33:42
> 2014-06-05 10:33:42,635 INFO fetcher.Fetcher - Fetcher: segment: /opt/searchengine/nutch/BWCrawl/segments/20140605103340
> 2014-06-05 10:33:43,056 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-06-05 10:33:43,719 INFO fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:43,720 INFO fetcher.Fetcher - Fetcher: threads: 10
> 2014-06-05 10:33:43,720 INFO fetcher.Fetcher - Fetcher: time-out divisor: 4
> 2014-06-05 10:33:43,739 INFO fetcher.Fetcher - QueueFeeder finished: total 1 records + hit by time limit :0
> 2014-06-05 10:33:44,102 INFO fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,103 INFO fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,104 INFO fetcher.Fetcher - fetching file://opt/searchengine/test/ (queue crawl delay=5000ms)
> 2014-06-05 10:33:44,106 INFO fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,107 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,111 INFO fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,111 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,118 INFO fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,120 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,121 INFO fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,122 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,122 INFO fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,127 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,129 INFO fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,130 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,131 INFO fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,132 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,133 INFO fetcher.Fetcher - Using queue mode : byHost
> 2014-06-05 10:33:44,146 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,149 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
> 2014-06-05 10:33:44,149 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
> 2014-06-05 10:33:44,150 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2014-06-05 10:33:44,423 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> 2014-06-05 10:33:45,151 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> 2014-06-05 10:33:45,153 INFO fetcher.Fetcher - -activeThreads=0
> 2014-06-05 10:33:45,497 INFO fetcher.Fetcher - Fetcher: finished at 2014-06-05 10:33:45, elapsed: 00:00:02
> 2014-06-05 10:33:46,660 INFO parse.ParseSegment - ParseSegment: starting at 2014-06-05 10:33:46
> 2014-06-05 10:33:46,661 INFO parse.ParseSegment - ParseSegment: segment: /opt/searchengine/nutch/BWCrawl/segments/20140605103340
> 2014-06-05 10:33:47,094 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-06-05 10:33:48,527 INFO parse.ParseSegment - ParseSegment: finished at 2014-06-05 10:33:48, elapsed: 00:00:01
> 2014-06-05 10:33:49,949 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-06-05 10:33:49,995 INFO crawl.CrawlDb - CrawlDb update: starting at 2014-06-05 10:33:49
> 2014-06-05 10:33:49,996 INFO crawl.CrawlDb - CrawlDb update: db: /opt/searchengine/nutch/BWCrawl/crawldb
> 2014-06-05 10:33:49,997 INFO crawl.CrawlDb - CrawlDb update: segments: [/opt/searchengine/nutch/BWCrawl/segments/20140605103340]
> 2014-06-05 10:33:50,002 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
> 2014-06-05 10:33:50,003 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
> 2014-06-05 10:33:50,003 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
> 2014-06-05 10:33:50,003 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
> 2014-06-05 10:33:50,006 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
> 2014-06-05 10:33:51,150 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> 2014-06-05 10:33:51,242 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> 2014-06-05 10:33:51,399 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2014-06-05 10:33:51,399 INFO crawl.AbstractFetchSchedule - defaultInterval=129600
> 2014-06-05 10:33:51,399 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2014-06-05 10:33:51,537 INFO crawl.CrawlDb - CrawlDb update: finished at 2014-06-05 10:33:51, elapsed: 00:00:01
> 2014-06-05 10:33:53,008 INFO indexer.IndexingJob - Indexer: starting at 2014-06-05 10:33:53
> 2014-06-05 10:33:53,024 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
> 2014-06-05 10:33:53,025 INFO indexer.IndexingJob - Indexer: URL filtering: false
> 2014-06-05 10:33:53,027 INFO indexer.IndexingJob - Indexer: URL normalizing: false
> 2014-06-05 10:33:53,373 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2014-06-05 10:33:53,385 INFO indexer.IndexingJob - Active IndexWriters :
> SOLRIndexWriter
>     solr.server.url : URL of the SOLR instance (mandatory)
>     solr.commit.size : buffer size when sending to SOLR (default 1000)
>     solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>     solr.auth : use authentication (default false)
>     solr.auth.username : use authentication (default false)
>     solr.auth : username for authentication
>     solr.auth.password : password for authentication
>
> 2014-06-05 10:33:53,396 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: /opt/searchengine/nutch/BWCrawl/crawldb
> 2014-06-05 10:33:53,396 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/opt/searchengine/nutch/BWCrawl/segments/20140605103340
> 2014-06-05 10:33:53,464 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-06-05 10:33:54,214 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2014-06-05 10:33:54,532 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: content dest: content
> 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: title dest: title
> 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: author dest: author
> 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: host dest: host
> 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: segment dest: segment
> 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: boost dest: boost
> 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: digest dest: digest
> 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
> 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: url dest: id
> 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: url dest: url
> 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: content dest: content
> 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: title dest: title
> 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: author dest: author
> 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: host dest: host
> 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: segment dest: segment
> 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: boost dest: boost
> 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: digest dest: digest
> 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
> 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: url dest: id
> 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: url dest: url
> 2014-06-05 10:33:55,063 INFO indexer.IndexingJob - Indexer: finished at 2014-06-05 10:33:55, elapsed: 00:00:02
>
> Result of nutch readdb:
>
> CrawlDb statistics start: BWCrawl/crawldb/
> Statistics for CrawlDb: BWCrawl/crawldb/
> TOTAL urls: 1
> retry 0: 1
> min score: 1.0
> avg score: 1.0
> max score: 1.0
> status 3 (db_gone): 1
> CrawlDb statistics: done
>
> These are some of the documents I've read:
>
> - http://wiki.apache.org/nutch/IntranetDocumentSearch
> - http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
> - http://lucene.472066.n3.nabble.com/Crawling-the-local-file-system-with-Nutch-Document-td607747.html
>
> System: Ubuntu 14.04, Nutch 1.8, Solr 4.8.0.
> I would really appreciate it if someone could share some hints or any
> "running-proof" references on this subject.
>
> Thank you.-
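In case it helps while the issue is open: a variant of your configuration that avoids the two-slash host parsing. This is an untested sketch; the paths come from your mail, and the three-slash form is my assumption about the fix:

```
# urls/seed.txt -- three slashes, so "opt" cannot be read as a host name
file:///opt/searchengine/test/

# regex-urlfilter.txt -- skip other protocols, accept the local tree
-^(ftp|mailto):
+^file:///opt/searchengine/test
```

Your plugin.includes value (with protocol-file and parse-tika) already looks sufficient to fetch and parse the PDF once the URL itself is accepted; the db_gone status in your readdb output suggests the single fetch failed rather than the parse step.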

