BTW, I tried the steps below with several Nutch and Solr versions and hit errors; right now I am using Nutch 1.7 and Solr 5.5.2 on Ubuntu, and I am trying to crawl a subfolder and everything under it. The subfolder contains a yearly subfolder for every year since 2005 (12 year subfolders), each year subfolder has 12 month subfolders, and each month subfolder has at least 30 day subfolders. I know there are more than 3,960 index.phtml files plus some other regular .html, .phtml and PDF files.
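I assume the right way to keep the crawl inside that subfolder is a URL filter in conf/regex-urlfilter.txt, so mine looks roughly like this (just a sketch -- www.example.com/clips stands in for my real host and path):

    # accept anything under the clips subfolder (Nutch uses the first matching rule)
    +^http://www\.example\.com/clips/
    # reject everything else
    -.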
Ok, so I start the crawl following the step-by-step instructions at
http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website

bin/nutch inject crawl/crawldb urls

Then I run the following cycle at least 7 times:

bin/nutch generate crawl/crawldb crawl/segments -topN 10000000 -Depth 100000
s7=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s7
bin/nutch parse $s7
bin/nutch updatedb crawl/crawldb $s7

Followed by:

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:9191/solr/clips crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20161004205432/ -filter -normalize

But it only finds 289 records (docs) when I look at the Solr page. It seems that it only sees clips/2016, clips/2015 and clips/2011.

-----------------------------------------

I also tried the all-in-one crawl command, but it FAILS:

bin/nutch crawl urls -solr http://localhost:9191/solr/clips -dir newcrawl -depth 3 -topN 3

Indexer: starting at 2016-10-14 18:53:55
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication

Indexer: finished at 2016-10-14 18:53:57, elapsed: 00:00:01
SolrDeleteDuplicates: starting at 2016-10-14 18:53:57
SolrDeleteDuplicates: Solr url: http://localhost:9191/solr/clips
*Exception in thread "main" java.io.IOException: Job failed!*
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
        at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:160)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

*How can I make it crawl the entire subfolder?*
*And what does that error mean?*

Thanks,
Néstor

--
Né§t☼r
*Authority gone to one's head is the greatest enemy of Truth*
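P.S. For completeness, the seven crawl rounds above amount to roughly this script (only a sketch: the paths and the Solr core name are the ones from my setup, and here I pass -dir crawl/segments/ to solrindex so every segment gets sent to Solr instead of only one -- I am not sure whether that difference matters):

    #!/bin/bash
    # run seven generate/fetch/parse/updatedb rounds
    for i in $(seq 1 7); do
        bin/nutch generate crawl/crawldb crawl/segments -topN 10000000
        s=$(ls -d crawl/segments/2* | tail -1)   # newest segment
        bin/nutch fetch "$s"
        bin/nutch parse "$s"
        bin/nutch updatedb crawl/crawldb "$s"
    done
    # build the linkdb and index all segments into Solr
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch solrindex http://localhost:9191/solr/clips crawl/crawldb/ \
        -linkdb crawl/linkdb/ -dir crawl/segments/ -filter -normalize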