I seem to have flushed this out. With the info you provided, I decided to
just start the crawl over. I removed everything under crawl/ and re-ran my
scripts, and apart from what seems to be a duplicate entry in Solr, which
dedup does not seem to find, it is now crawling only on 127.0.0.1 as it
should... Thanks again...
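On the leftover duplicate: the 1.x dedup job only marks documents whose
content signatures match exactly, so a copy indexed under a slightly
different URL slips past it. A minimal sketch of clearing such an entry by
hand — the core name collection1 and the example document URL are
assumptions, not from this thread:

  # Mark signature-identical URLs as duplicates in the CrawlDB;
  # they get deleted from Solr on the next cleaning/indexing pass.
  bin/nutch dedup crawl/crawldb

  # A duplicate under a different URL never groups by signature,
  # so delete it from Solr directly (hypothetical id shown).
  curl "http://127.0.0.1:8983/solr/collection1/update?commit=true" \
    -H "Content-Type: text/xml" \
    --data-binary '<delete><query>id:"http://127.0.0.1:8080/cocoon/some-dup-page"</query></delete>'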
See-ya
Mitch

On Thu, 2016-03-10 at 13:54 +0000, Mitch Baker wrote:
> Thanks Markus,
>
> I'm very new to nutch. If you can point me in a direction on how to do
> this it would be nice. In the meantime I'll start reading up on how I
> might accomplish this.
>
> See-ya
> Mitch
>
> On Thu, 2016-03-10 at 10:27 +0000, Markus Jelsma wrote:
> > Hi - if you have db.ignore.external.links set to true, and only inject
> > http://127.0.0.1:8080/cocoon into an empty CrawlDB, it should not try
> > other domains, so I am not sure what is really wrong; it should just
> > work. I assume you already have cpsc in your CrawlDB and aren't
> > filtering it out. You should run -filter (once) on the CrawlDB to
> > remove cpsc.
> >
> > Markus
> >
> > -----Original message-----
> > > From: Mitch Baker <mitch.ba...@iga.in.gov>
> > > Sent: Wednesday 9th March 2016 22:40
> > > To: user@nutch.apache.org
> > > Subject: Only fetch 127.0.0.1:8080/*
> > >
> > > I have a small setup to index some files on a local box:
> > >
> > > Solr 5
> > > Nutch 1.11
> > >
> > > I thought I had it configured to not try any URLs that are not local
> > > to the system, but it still seems to look for them:
> > >
> > > fetching http://www.cpsc.gov/Media/Documents/Regulations-Laws--Standards/Advisory-Opinions/Wheelchairs-145--/ (queue crawl delay=2000ms)
> > > fetching http://www.cpsc.gov/PageFiles/121846/fuclearance.pdf (queue crawl delay=2000ms)
> > > fetching http://www.cpsc.gov/Business--Manufacturing/Business-Education/Business-Guidance/Phthalates-Information/ (queue crawl delay=2000ms)
> > > -activeThreads=150, spinWaiting=148, fetchQueues.totalSize=2091, fetchQueues.getQueueCount=1
> > > fetching http://www.cpsc.gov/es/Research--Statistics/ (queue crawl delay=2000ms)
> > >
> > > The regex-urlfilter.txt:
> > >
> > > # Each non-comment, non-blank line contains a regular expression
> > > # prefixed by '+' or '-'. The first matching pattern in the file
> > > # determines whether a URL is included or ignored. If no pattern
> > > # matches, the URL is ignored.
> > >
> > > # skip file: ftp: and mailto: urls
> > > -^(file|ftp|mailto):
> > >
> > > # skip image and other suffixes we can't yet parse
> > > # for a more extensive coverage use the urlfilter-suffix plugin
> > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> > >
> > > # skip URLs containing certain characters as probable queries, etc.
> > > #-[?*!@=]
> > >
> > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > >
> > > # skip specific PDF files in the volumes directory
> > > -.*00(FRONT|INTRO)\.PDF.*
> > >
> > > # skip
> > > #-^(http|https)://www\.*$
> > > #-^(http|https)://blogs\.*$
> > > #-^(http|https)://store\.*$
> > > #-^(http|https)://.*\.google.com/.*$
> > > #-^(http|https)://nist.gov/.*$
> > >
> > > # accept anything else
> > > #+.
> > > +^http://127.0.0.1:8080/cocoon
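A sketch of the one-off CrawlDB cleanup Markus describes: mergedb with
-filter re-applies the URL filters and writes a filtered copy. The paths
assume the usual crawl/ layout, and the URLFilterChecker call is just a
convenient way to confirm the filters now reject cpsc.gov:

  # Confirm the configured filter chain rejects an external URL
  # (prints +URL for accepted, -URL for rejected).
  echo "http://www.cpsc.gov/es/Research--Statistics/" | \
    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

  # Write a filtered copy of the CrawlDB (dropping the cpsc.gov
  # entries the filters now reject), then swap it in.
  bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter
  mv crawl/crawldb crawl/crawldb_old
  mv crawl/crawldb_filtered crawl/crawldb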
> > > I have searched and tried several things, including nutch-site.xml:
> > >
> > > <configuration>
> > > <property>
> > > <name>http.agent.name</name>
> > > <value>nutch-solr-integration</value>
> > > </property>
> > > <property>
> > > <name>generate.max.per.host</name>
> > > <value>1000</value>
> > > </property>
> > > <property>
> > > <name>plugin.includes</name>
> > > <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > > </property>
> > > <property>
> > > <name>db.ignore.external.links</name>
> > > <value>true</value>
> > > <description>If true, outlinks leading from a page to external hosts or domain
> > > will be ignored. This is an effective way to limit the crawl to include
> > > only initially injected hosts, without creating complex URLFilters.
> > > See 'db.ignore.external.links.mode'.
> > > </description>
> > > </property>
> > > <property>
> > > <name>db.max.outlinks.per.page</name>
> > > <value>0</value>
> > > <description>The maximum number of outlinks that we'll process for a page.
> > > If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
> > > will be processed for a page; otherwise, all outlinks will be processed.
> > > </description>
> > > </property>
> > > <property>
> > > <name>fetcher.max.crawl.delay</name>
> > > <value>3</value>
> > > <description>
> > > If the Crawl-Delay in robots.txt is set to greater than this value (in
> > > seconds) then the fetcher will skip this page, generating an error report.
> > > If set to -1 the fetcher will never skip such pages and will wait the
> > > amount of time retrieved from robots.txt Crawl-Delay, however long that
> > > might be.
> > > </description>
> > > </property>
> > > <property>
> > > <name>fetcher.queue.mode</name>
> > > <value>byHost</value>
> > > <description>Determines how to put URLs into queues. Default value is 'byHost',
> > > also takes 'byDomain' or 'byIP'.
> > > </description>
> > > </property>
> > > <property>
> > > <name>fetcher.verbose</name>
> > > <value>false</value>
> > > <description>If true, fetcher will log more verbosely.</description>
> > > </property>
> > > </configuration>
> > >
> > > I inherited this and am not that well versed on nutch. Many hours of
> > > searching, and trying what I have found, but still no luck. Can't get
> > > it to just crawl the local system, http://127.0.0.1:8080/cocoon.
> > >
> > > Any help would be greatly appreciated.
> > >
> > > --
> > > Mitch Baker <mitch.ba...@iga.in.gov>
> > > LSA
>
> --
> Mitch Baker <mitch.ba...@iga.in.gov>
> LSA

--
Mitch Baker <mitch.ba...@iga.in.gov>
LSA
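Since "start the crawl over" is never spelled out above, here is roughly
what re-seeding an empty CrawlDB with only the local URL looks like with
the stock 1.x commands — a sketch assuming a urls/ seed directory and the
default crawl/ layout, both assumptions rather than details from the thread:

  # Seed with only the local URL.
  mkdir -p urls
  echo "http://127.0.0.1:8080/cocoon" > urls/seed.txt

  # Wipe the old state and inject into an empty CrawlDB; with
  # db.ignore.external.links=true, outlinks to other hosts are
  # dropped at update time, so only 127.0.0.1 gets fetched.
  rm -rf crawl
  bin/nutch inject crawl/crawldb urls

  # One generate/fetch/parse/update round (repeat as needed).
  bin/nutch generate crawl/crawldb crawl/segments
  segment=$(ls -d crawl/segments/* | tail -1)
  bin/nutch fetch "$segment"
  bin/nutch parse "$segment"
  bin/nutch updatedb crawl/crawldb "$segment"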