Hi Sebastian,

Thank you for the info. I'll try the workaround suggested in the comments.
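Concretely, what I plan to change, based on my reading of the issue comments (unverified on Nutch 1.8; the three-slash form is just standard file-URL syntax with an empty authority, not something the thread confirms as the fix):

```
# urls/seed.txt -- three slashes, so the path starts at /opt
file:///opt/searchengine/test/

# regex-urlfilter.txt -- keep the accept rule in sync with the seed form
-^(ftp|mailto):
+^file:///opt/searchengine/test
```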
On Fri, Jun 6, 2014 at 4:26 AM, Sebastian Nagel <[email protected]> wrote:

> Hi Bayu,
>
> there is an open issue with file: URLs, see
> https://issues.apache.org/jira/browse/NUTCH-1483
>
> Hope the information helps,
> Sebastian
>
> On 06/05/2014 06:38 AM, Bayu Widyasanyata wrote:
> > Hi,
> >
> > I'm sure this is an "old" topic, but I still have no luck crawling with it.
> > It's a bit harder than crawling the web over HTTP :(
> >
> > These are the relevant files I configured:
> >
> > (1) urls/seed.txt:
> >
> > file://opt/searchengine/test/
> >
> > The directory contains one file:
> >
> > -rw-r--r-- 1 bayu bayu 3272 Jun  5 10:02 Testdocumentsaja.pdf
> >
> > (2) regex-urlfilter.txt: allow the file: protocol and accept the path URL:
> >
> > -^(ftp|mailto):
> > +^file://opt/searchengine/test
> >
> > (3) nutch-site.xml: enable protocol-file:
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-(http|file)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >   <description>Regular expression naming plugin directory names to
> >   include. Any plugin not matching this expression is excluded.
> >   In any case you need at least include the nutch-extensionpoints plugin.
> >   By default Nutch includes crawling just HTML and plain text via HTTP,
> >   and basic indexing and search plugins. In order to use HTTPS please
> >   enable protocol-httpclient, but be aware of possible intermittent
> >   problems with the underlying commons-httpclient library.
> >   </description>
> > </property>
> >
> > The crawl script runs the common steps (inject - generate - fetch -
> > parse - updatedb - solrindex - solrdedup).
> > From the hadoop.log below, Nutch could fetch the file: protocol path,
> > but it never parsed the file inside /opt/searchengine/test/.
> >
> > hadoop.log:
> >
> > 2014-06-05 10:33:33,274 INFO crawl.Injector - Injector: starting at 2014-06-05 10:33:33
> > 2014-06-05 10:33:33,276 INFO crawl.Injector - Injector: crawlDb: /opt/searchengine/nutch/BWCrawl/crawldb
> > 2014-06-05 10:33:33,276 INFO crawl.Injector - Injector: urlDir: /opt/searchengine/nutch/urls/seed.txt
> > 2014-06-05 10:33:33,277 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> > 2014-06-05 10:33:33,714 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2014-06-05 10:33:33,807 WARN snappy.LoadSnappy - Snappy native library not loaded
> > 2014-06-05 10:33:34,717 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> > 2014-06-05 10:33:35,127 INFO crawl.Injector - Injector: total number of urls rejected by filters: 0
> > 2014-06-05 10:33:35,131 INFO crawl.Injector - Injector: total number of urls injected after normalization and filtering: 1
> > 2014-06-05 10:33:35,132 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
> > 2014-06-05 10:33:35,396 INFO crawl.Injector - Injector: overwrite: false
> > 2014-06-05 10:33:35,397 INFO crawl.Injector - Injector: update: false
> > 2014-06-05 10:33:36,357 INFO crawl.Injector - Injector: finished at 2014-06-05 10:33:36, elapsed: 00:00:03
> > 2014-06-05 10:33:37,857 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2014-06-05 10:33:37,863 INFO crawl.Generator - Generator: starting at 2014-06-05 10:33:37
> > 2014-06-05 10:33:37,863 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> > 2014-06-05 10:33:37,864 INFO crawl.Generator - Generator: filtering: true
> > 2014-06-05 10:33:37,865 INFO crawl.Generator - Generator: normalizing: true
> > 2014-06-05 10:33:37,876 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
> > 2014-06-05 10:33:38,915 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> > 2014-06-05 10:33:38,916 INFO crawl.AbstractFetchSchedule - defaultInterval=129600
> > 2014-06-05 10:33:38,917 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> > 2014-06-05 10:33:38,929 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> > 2014-06-05 10:33:39,006 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> > 2014-06-05 10:33:39,007 INFO crawl.AbstractFetchSchedule - defaultInterval=129600
> > 2014-06-05 10:33:39,007 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> > 2014-06-05 10:33:39,015 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
> > 2014-06-05 10:33:39,384 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
> > 2014-06-05 10:33:40,386 INFO crawl.Generator - Generator: segment: /opt/searchengine/nutch/BWCrawl/segments/20140605103340
> > 2014-06-05 10:33:40,593 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> > 2014-06-05 10:33:41,540 INFO crawl.Generator - Generator: finished at 2014-06-05 10:33:41, elapsed: 00:00:03
> > 2014-06-05 10:33:42,634 INFO fetcher.Fetcher - Fetcher: starting at 2014-06-05 10:33:42
> > 2014-06-05 10:33:42,635 INFO fetcher.Fetcher - Fetcher: segment: /opt/searchengine/nutch/BWCrawl/segments/20140605103340
> > 2014-06-05 10:33:43,056 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2014-06-05 10:33:43,719 INFO fetcher.Fetcher - Using queue mode : byHost
> > 2014-06-05 10:33:43,720 INFO fetcher.Fetcher - Fetcher: threads: 10
> > 2014-06-05 10:33:43,720 INFO fetcher.Fetcher - Fetcher: time-out divisor: 4
> > 2014-06-05 10:33:43,739 INFO fetcher.Fetcher - QueueFeeder finished: total 1 records + hit by time limit :0
> > 2014-06-05 10:33:44,102 INFO fetcher.Fetcher - Using queue mode : byHost
> > 2014-06-05 10:33:44,103 INFO fetcher.Fetcher - Using queue mode : byHost
> > 2014-06-05 10:33:44,104 INFO fetcher.Fetcher - fetching file://opt/searchengine/test/ (queue crawl delay=5000ms)
> > 2014-06-05 10:33:44,106 INFO fetcher.Fetcher - Using queue mode : byHost
> > 2014-06-05 10:33:44,107 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> > 2014-06-05 10:33:44,111 INFO fetcher.Fetcher - Using queue mode : byHost
> > 2014-06-05 10:33:44,111 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> > 2014-06-05 10:33:44,118 INFO fetcher.Fetcher - Using queue mode : byHost
> > 2014-06-05 10:33:44,120 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> > 2014-06-05 10:33:44,121 INFO fetcher.Fetcher - Using queue mode : byHost
> > 2014-06-05 10:33:44,122 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> > 2014-06-05 10:33:44,122 INFO fetcher.Fetcher - Using queue mode : byHost
> > 2014-06-05 10:33:44,127 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> > 2014-06-05 10:33:44,129 INFO fetcher.Fetcher - Using queue mode : byHost
> > 2014-06-05 10:33:44,130 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> > 2014-06-05 10:33:44,131 INFO fetcher.Fetcher - Using queue mode : byHost
> > 2014-06-05 10:33:44,132 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> > 2014-06-05 10:33:44,133 INFO fetcher.Fetcher - Using queue mode : byHost
> > 2014-06-05 10:33:44,146 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> > 2014-06-05 10:33:44,149 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
> > 2014-06-05 10:33:44,149 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
> > 2014-06-05 10:33:44,150 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> > 2014-06-05 10:33:44,423 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> > 2014-06-05 10:33:45,151 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > 2014-06-05 10:33:45,153 INFO fetcher.Fetcher - -activeThreads=0
> > 2014-06-05 10:33:45,497 INFO fetcher.Fetcher - Fetcher: finished at 2014-06-05 10:33:45, elapsed: 00:00:02
> > 2014-06-05 10:33:46,660 INFO parse.ParseSegment - ParseSegment: starting at 2014-06-05 10:33:46
> > 2014-06-05 10:33:46,661 INFO parse.ParseSegment - ParseSegment: segment: /opt/searchengine/nutch/BWCrawl/segments/20140605103340
> > 2014-06-05 10:33:47,094 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2014-06-05 10:33:48,527 INFO parse.ParseSegment - ParseSegment: finished at 2014-06-05 10:33:48, elapsed: 00:00:01
> > 2014-06-05 10:33:49,949 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2014-06-05 10:33:49,995 INFO crawl.CrawlDb - CrawlDb update: starting at 2014-06-05 10:33:49
> > 2014-06-05 10:33:49,996 INFO crawl.CrawlDb - CrawlDb update: db: /opt/searchengine/nutch/BWCrawl/crawldb
> > 2014-06-05 10:33:49,997 INFO crawl.CrawlDb - CrawlDb update: segments: [/opt/searchengine/nutch/BWCrawl/segments/20140605103340]
> > 2014-06-05 10:33:50,002 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
> > 2014-06-05 10:33:50,003 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: true
> > 2014-06-05 10:33:50,003 INFO crawl.CrawlDb - CrawlDb update: URL filtering: true
> > 2014-06-05 10:33:50,003 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
> > 2014-06-05 10:33:50,006 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
> > 2014-06-05 10:33:51,150 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> > 2014-06-05 10:33:51,242 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> > 2014-06-05 10:33:51,399 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> > 2014-06-05 10:33:51,399 INFO crawl.AbstractFetchSchedule - defaultInterval=129600
> > 2014-06-05 10:33:51,399 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> > 2014-06-05 10:33:51,537 INFO crawl.CrawlDb - CrawlDb update: finished at 2014-06-05 10:33:51, elapsed: 00:00:01
> > 2014-06-05 10:33:53,008 INFO indexer.IndexingJob - Indexer: starting at 2014-06-05 10:33:53
> > 2014-06-05 10:33:53,024 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
> > 2014-06-05 10:33:53,025 INFO indexer.IndexingJob - Indexer: URL filtering: false
> > 2014-06-05 10:33:53,027 INFO indexer.IndexingJob - Indexer: URL normalizing: false
> > 2014-06-05 10:33:53,373 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
> > 2014-06-05 10:33:53,385 INFO indexer.IndexingJob - Active IndexWriters :
> > SOLRIndexWriter
> >         solr.server.url : URL of the SOLR instance (mandatory)
> >         solr.commit.size : buffer size when sending to SOLR (default 1000)
> >         solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
> >         solr.auth : use authentication (default false)
> >         solr.auth.username : use authentication (default false)
> >         solr.auth : username for authentication
> >         solr.auth.password : password for authentication
> >
> > 2014-06-05 10:33:53,396 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: /opt/searchengine/nutch/BWCrawl/crawldb
> > 2014-06-05 10:33:53,396 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/opt/searchengine/nutch/BWCrawl/segments/20140605103340
> > 2014-06-05 10:33:53,464 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2014-06-05 10:33:54,214 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> > 2014-06-05 10:33:54,532 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
> > 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: content dest: content
> > 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: title dest: title
> > 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: author dest: author
> > 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: host dest: host
> > 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: segment dest: segment
> > 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: boost dest: boost
> > 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: digest dest: digest
> > 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
> > 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: url dest: id
> > 2014-06-05 10:33:54,589 INFO solr.SolrMappingReader - source: url dest: url
> > 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: content dest: content
> > 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: title dest: title
> > 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: author dest: author
> > 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: host dest: host
> > 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: segment dest: segment
> > 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: boost dest: boost
> > 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: digest dest: digest
> > 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
> > 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: url dest: id
> > 2014-06-05 10:33:54,941 INFO solr.SolrMappingReader - source: url dest: url
> > 2014-06-05 10:33:55,063 INFO indexer.IndexingJob - Indexer: finished at 2014-06-05 10:33:55, elapsed: 00:00:02
> >
> > Result of nutch readdb:
> >
> > CrawlDb statistics start: BWCrawl/crawldb/
> > Statistics for CrawlDb: BWCrawl/crawldb/
> > TOTAL urls:     1
> > retry 0:        1
> > min score:      1.0
> > avg score:      1.0
> > max score:      1.0
> > status 3 (db_gone):     1
> > CrawlDb statistics: done
> >
> > These are some of the documents I've already read:
> >
> > - http://wiki.apache.org/nutch/IntranetDocumentSearch
> > - http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
> > - http://lucene.472066.n3.nabble.com/Crawling-the-local-file-system-with-Nutch-Document-td607747.html
> >
> > System: Ubuntu 14.04, Nutch 1.8, Solr 4.8.0.
> > I would really appreciate it if someone could share some hints or any
> > "running-proof" references on this subject.
> >
> > Thank you.-
> >
> > -- wassalam, [bayu]
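For anyone who lands on this thread later: the "status 3 (db_gone)" above is consistent with the two-slash seed `file://opt/searchengine/test/` being parsed with "opt" as the URL authority (host) rather than as part of the path. A quick sanity check with a generic URL parser (standard Python, purely to illustrate; Nutch's own java.net.URL-based handling may differ in detail):

```python
# Illustration only (not from Nutch): an RFC 3986 parser reads whatever
# follows "//" up to the next "/" as the authority (host) component.
from urllib.parse import urlsplit

bad = urlsplit("file://opt/searchengine/test/")
print(bad.netloc, bad.path)    # "opt" becomes the host; the path loses /opt

good = urlsplit("file:///opt/searchengine/test/")
print(good.netloc, good.path)  # empty host; the full /opt/... path survives
```

So the directory the fetcher is actually asked to open is not the one on disk, which would explain a fetch that "succeeds" but leaves nothing to parse.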

