I seem to have flushed this out. With the info you provided, I decided to
just start the crawl over. I removed everything under crawl/ and re-ran my
scripts, and apart from what seems to be a duplicate entry in Solr, which
dedup does not seem to find, it is now crawling only on 127.0.0.1 as it
should... Thanks again...
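On the leftover duplicate: the 1.x dedup job only marks documents whose
content signatures match exactly, so a copy indexed under a slightly
different URL slips past it. A minimal sketch of clearing such an entry by
hand — the core name collection1 and the example document URL are
assumptions, not from this thread:

  # Mark signature-identical URLs as duplicates in the CrawlDB;
  # they get deleted from Solr on the next cleaning/indexing pass.
  bin/nutch dedup crawl/crawldb

  # A duplicate under a different URL never groups by signature,
  # so delete it from Solr directly (hypothetical id shown).
  curl "http://127.0.0.1:8983/solr/collection1/update?commit=true" \
    -H "Content-Type: text/xml" \
    --data-binary '<delete><query>id:"http://127.0.0.1:8080/cocoon/some-dup-page"</query></delete>'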
See-ya
Mitch

On Thu, 2016-03-10 at 13:54 +0000, Mitch Baker wrote:
> Thanks Markus,
>
> I'm very new to nutch. If you can point me in a direction on how to do
> this it would be nice. In the meantime I'll start reading up on how I
> might accomplish this.
>
> See-ya
> Mitch
>
> On Thu, 2016-03-10 at 10:27 +0000, Markus Jelsma wrote:
> > Hi - if you have db.ignore.external.links set to true, and only inject
> > http://127.0.0.1:8080/cocoon into an empty CrawlDB, it should not try
> > other domains, so I am not sure what is really wrong; it should just
> > work. I assume you already have cpsc in your CrawlDB and aren't
> > filtering it out. You should run -filter (once) on the CrawlDB to
> > remove cpsc.
> >
> > Markus
> >
> > -----Original message-----
> > > From: Mitch Baker <mitch.ba...@iga.in.gov>
> > > Sent: Wednesday 9th March 2016 22:40
> > > To: user@nutch.apache.org
> > > Subject: Only fetch 127.0.0.1:8080/*
> > >
> > > I have a small setup to index some files on a local box:
> > >
> > > Solr 5
> > > Nutch 1.11
> > >
> > > I thought I had it configured to not try any URLs that are not local
> > > to the system, but it still seems to look for them:
> > >
> > > fetching http://www.cpsc.gov/Media/Documents/Regulations-Laws--Standards/Advisory-Opinions/Wheelchairs-145--/ (queue crawl delay=2000ms)
> > > fetching http://www.cpsc.gov/PageFiles/121846/fuclearance.pdf (queue crawl delay=2000ms)
> > > fetching http://www.cpsc.gov/Business--Manufacturing/Business-Education/Business-Guidance/Phthalates-Information/ (queue crawl delay=2000ms)
> > > -activeThreads=150, spinWaiting=148, fetchQueues.totalSize=2091, fetchQueues.getQueueCount=1
> > > fetching http://www.cpsc.gov/es/Research--Statistics/ (queue crawl delay=2000ms)
> > >
> > > The regex-urlfilter.txt:
> > >
> > > # Each non-comment, non-blank line contains a regular expression
> > > # prefixed by '+' or '-'. The first matching pattern in the file
> > > # determines whether a URL is included or ignored. If no pattern
> > > # matches, the URL is ignored.
> > >
> > > # skip file: ftp: and mailto: urls
> > > -^(file|ftp|mailto):
> > >
> > > # skip image and other suffixes we can't yet parse
> > > # for a more extensive coverage use the urlfilter-suffix plugin
> > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> > >
> > > # skip URLs containing certain characters as probable queries, etc.
> > > #-[?*!@=]
> > >
> > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > >
> > > # skip specific PDF files in the volumes directory
> > > -.*00(FRONT|INTRO)\.PDF.*
> > >
> > > # skip
> > > #-^(http|https)://www\.*$
> > > #-^(http|https)://blogs\.*$
> > > #-^(http|https)://store\.*$
> > > #-^(http|https)://.*\.google.com/.*$
> > > #-^(http|https)://nist.gov/.*$
> > >
> > > # accept anything else
> > > #+.
> > > +^http://127.0.0.1:8080/cocoon
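A sketch of the one-off CrawlDB cleanup Markus describes: mergedb with
-filter re-applies the URL filters and writes a filtered copy. The paths
assume the usual crawl/ layout, and the URLFilterChecker call is just a
convenient way to confirm the filters now reject cpsc.gov:

  # Confirm the configured filter chain rejects an external URL
  # (prints +URL for accepted, -URL for rejected).
  echo "http://www.cpsc.gov/es/Research--Statistics/" | \
    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

  # Write a filtered copy of the CrawlDB (dropping the cpsc.gov
  # entries the filters now reject), then swap it in.
  bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter
  mv crawl/crawldb crawl/crawldb_old
  mv crawl/crawldb_filtered crawl/crawldb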
> > > I have searched and tried several things, including nutch-site.xml:
> > >
> > > <configuration>
> > > <property>
> > > <name>http.agent.name</name>
> > > <value>nutch-solr-integration</value>
> > > </property>
> > > <property>
> > > <name>generate.max.per.host</name>
> > > <value>1000</value>
> > > </property>
> > > <property>
> > > <name>plugin.includes</name>
> > > <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > > </property>
> > > <property>
> > > <name>db.ignore.external.links</name>
> > > <value>true</value>
> > > <description>If true, outlinks leading from a page to external hosts or domain
> > > will be ignored. This is an effective way to limit the crawl to include
> > > only initially injected hosts, without creating complex URLFilters.
> > > See 'db.ignore.external.links.mode'.
> > > </description>
> > > </property>
> > > <property>
> > > <name>db.max.outlinks.per.page</name>
> > > <value>0</value>
> > > <description>The maximum number of outlinks that we'll process for a page.
> > > If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
> > > will be processed for a page; otherwise, all outlinks will be processed.
> > > </description>
> > > </property>
> > > <property>
> > > <name>fetcher.max.crawl.delay</name>
> > > <value>3</value>
> > > <description>
> > > If the Crawl-Delay in robots.txt is set to greater than this value (in
> > > seconds) then the fetcher will skip this page, generating an error report.
> > > If set to -1 the fetcher will never skip such pages and will wait the
> > > amount of time retrieved from robots.txt Crawl-Delay, however long that
> > > might be.
> > > </description>
> > > </property>
> > > <property>
> > > <name>fetcher.queue.mode</name>
> > > <value>byHost</value>
> > > <description>Determines how to put URLs into queues. Default value is 'byHost',
> > > also takes 'byDomain' or 'byIP'.
> > > </description>
> > > </property>
> > > <property>
> > > <name>fetcher.verbose</name>
> > > <value>false</value>
> > > <description>If true, fetcher will log more verbosely.</description>
> > > </property>
> > > </configuration>
> > >
> > > I inherited this and am not that well versed on nutch. Many hours of
> > > searching, and trying what I have found, but still no luck. Can't get
> > > it to just crawl the local system, http://127.0.0.1:8080/cocoon.
> > >
> > > Any help would be greatly appreciated.
> > >
> > > --
> > > Mitch Baker <mitch.ba...@iga.in.gov>
> > > LSA
>
> --
> Mitch Baker <mitch.ba...@iga.in.gov>
> LSA

--
Mitch Baker <mitch.ba...@iga.in.gov>
LSA
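Since "start the crawl over" is never spelled out above, here is roughly
what re-seeding an empty CrawlDB with only the local URL looks like with
the stock 1.x commands — a sketch assuming a urls/ seed directory and the
default crawl/ layout, both assumptions rather than details from the thread:

  # Seed with only the local URL.
  mkdir -p urls
  echo "http://127.0.0.1:8080/cocoon" > urls/seed.txt

  # Wipe the old state and inject into an empty CrawlDB; with
  # db.ignore.external.links=true, outlinks to other hosts are
  # dropped at update time, so only 127.0.0.1 gets fetched.
  rm -rf crawl
  bin/nutch inject crawl/crawldb urls

  # One generate/fetch/parse/update round (repeat as needed).
  bin/nutch generate crawl/crawldb crawl/segments
  segment=$(ls -d crawl/segments/* | tail -1)
  bin/nutch fetch "$segment"
  bin/nutch parse "$segment"
  bin/nutch updatedb crawl/crawldb "$segment"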