Hi Mitch - running the updatedb command with the -filter option applies your 
corrected URL filters to the existing CrawlDB and removes the non-127.0.0.1 
entries. Changed URL filters only take effect for newly discovered URLs, not 
for records already in the CrawlDB. That is why you have to apply -filter to 
updatedb after changing the filter configuration.
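
For example, a minimal sketch (the crawl/crawldb and segment paths below are 
placeholders for whatever your crawl scripts actually use):

  # update the CrawlDB from the most recent segment and re-apply the URL
  # filters, dropping entries the filters now reject (e.g. the cpsc.gov ones)
  bin/nutch updatedb crawl/crawldb crawl/segments/20160310120000 -filter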

Regarding deduplication: Nutch's default is an MD5 hash over the raw HTTP 
body, but because the HTML changes on almost every request, the signatures 
never match. It is better to use a different signature implementation, e.g. 
the text profile signature. Do note that Nutch only does exact signature 
matching, so no fancy minhashing/simhashing is possible as of yet, although it 
is possible in Solr. We haven't had the time or the incentive to implement 
fuzzy signature matching in Nutch.
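
If you want to try the text profile signature, something along these lines in 
nutch-site.xml should do it (a sketch; double-check the property name against 
your conf/nutch-default.xml):

  <!-- switch the page signature from MD5 over the raw content to a
       tokenized text profile, so near-identical HTML can be detected -->
  <property>
    <name>db.signature.class</name>
    <value>org.apache.nutch.crawl.TextProfileSignature</value>
  </property>

Keep in mind that signatures are computed when pages are parsed, so records 
already in the CrawlDB keep their old MD5 signatures until they are 
re-fetched; after that, re-run the dedup job (bin/nutch dedup crawl/crawldb, 
assuming the same layout as above).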

Markus

 
 
-----Original message-----
> From:Mitch Baker <mitch.ba...@iga.in.gov>
> Sent: Thursday 10th March 2016 21:54
> To: user@nutch.apache.org
> Subject: Re: Only fetch 127.0.0.1:8080/*
> 
> I seem to have flushed this out.  With the info you provided, I decided
> to just start the crawl over.  I removed everything under crawl and re-ran
> my scripts, and aside from what seems to be a duplicate entry in Solr, which
> dedup does not seem to find, it is now crawling only on 127.0.0.1 as it
> should...  Thanks again...
> 
> See-ya
> Mitch
> 
> 
> 
> On Thu, 2016-03-10 at 13:54 +0000, Mitch Baker wrote:
> > Thanks Markus,
> > 
> > I'm very new to nutch.  If you can point me in a direction on how to do
> > this it would be nice.  In the mean time I'll start reading up on how I
> > might accomplish this.  
> > 
> > See-ya
> > Mitch
> > 
> > On Thu, 2016-03-10 at 10:27 +0000, Markus Jelsma wrote:
> > > Hi - if you have db.ignore.external.links set to true, and only inject 
> > > http://127.0.0.1:8080/cocoon into an empty CrawlDB, it should not try 
> > > other domains. So i am not sure what is really wrong, it should just 
> > > work. I assume you already have cpsc in your CrawlDB and don't filter it 
> > > out. You should use -filter (once) on the CrawlDB to remove cpsc.
> > > 
> > > Markus
> > > 
> > > -----Original message-----
> > > > From:Mitch Baker <mitch.ba...@iga.in.gov>
> > > > Sent: Wednesday 9th March 2016 22:40
> > > > To: user@nutch.apache.org
> > > > Subject: Only fetch 127.0.0.1:8080/*
> > > > 
> > > > I have a small setup to index some files on a local box:
> > > > 
> > > > Solr 5
> > > > Nutch 1.11
> > > > 
> > > > I thought I had it configured to not try any URLs that are not local to
> > > > the system, but it still seems to look for them:
> > > > 
> > > > fetching http://www.cpsc.gov/Media/Documents/Regulations-Laws--Standards/Advisory-Opinions/Wheelchairs-145--/ (queue crawl delay=2000ms)
> > > > fetching http://www.cpsc.gov/PageFiles/121846/fuclearance.pdf (queue crawl delay=2000ms)
> > > > fetching http://www.cpsc.gov/Business--Manufacturing/Business-Education/Business-Guidance/Phthalates-Information/ (queue crawl delay=2000ms)
> > > > -activeThreads=150, spinWaiting=148, fetchQueues.totalSize=2091, fetchQueues.getQueueCount=1
> > > > fetching http://www.cpsc.gov/es/Research--Statistics/ (queue crawl delay=2000ms)
> > > > 
> > > > The regex-urlfilter.txt:
> > > > 
> > > > # Each non-comment, non-blank line contains a regular expression
> > > > # prefixed by '+' or '-'.  The first matching pattern in the file
> > > > # determines whether a URL is included or ignored.  If no pattern
> > > > # matches, the URL is ignored.
> > > > 
> > > > # skip file: ftp: and mailto: urls
> > > > -^(file|ftp|mailto):
> > > > 
> > > > # skip image and other suffixes we can't yet parse
> > > > # for a more extensive coverage use the urlfilter-suffix plugin
> > > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> > > > 
> > > > # skip URLs containing certain characters as probable queries, etc.
> > > > #-[?*!@=]
> > > > 
> > > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > > > 
> > > > # skip specific PDF files in the volumes directory
> > > > -.*00(FRONT|INTRO)\.PDF.*
> > > > 
> > > > # skip
> > > > #-^(http|https)://www\.*$
> > > > #-^(http|https)://blogs\.*$
> > > > #-^(http|https)://store\.*$
> > > > #-^(http|https)://.*\.google.com/.*$
> > > > #-^(http|https)://nist.gov/.*$
> > > > 
> > > > # accept anything else
> > > > #+.
> > > > +^http://127.0.0.1:8080/cocoon
> > > > 
> > > > I have searched and tried several things, including nutch-site.xml:
> > > > 
> > > > <configuration>
> > > > <property>
> > > > <name>http.agent.name</name>
> > > > <value>nutch-solr-integration</value>
> > > > </property>
> > > > <property>
> > > > <name>generate.max.per.host</name>
> > > > <value>1000</value>
> > > > </property>
> > > > <property>
> > > > <name>plugin.includes</name>
> > > > <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > > > </property>
> > > > <property>
> > > >   <name>db.ignore.external.links</name>
> > > >   <value>true</value>
> > > >   <description>If true, outlinks leading from a page to external hosts or domain
> > > >   will be ignored. This is an effective way to limit the crawl to include
> > > >   only initially injected hosts, without creating complex URLFilters.
> > > >   See 'db.ignore.external.links.mode'.
> > > >   </description>
> > > > </property>
> > > > <property>
> > > >   <name>db.max.outlinks.per.page</name>
> > > >   <value>0</value>
> > > >   <description>The maximum number of outlinks that we'll process for a page.
> > > >   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
> > > >   will be processed for a page; otherwise, all outlinks will be processed.
> > > >   </description>
> > > > </property>
> > > > <property>
> > > >  <name>fetcher.max.crawl.delay</name>
> > > >  <value>3</value>
> > > >  <description>
> > > >  If the Crawl-Delay in robots.txt is set to greater than this value (in
> > > >  seconds) then the fetcher will skip this page, generating an error report.
> > > >  If set to -1 the fetcher will never skip such pages and will wait the
> > > >  amount of time retrieved from robots.txt Crawl-Delay, however long that
> > > >  might be.
> > > >  </description>
> > > > </property>
> > > > <property>
> > > >   <name>fetcher.queue.mode</name>
> > > >   <value>byHost</value>
> > > >   <description>Determines how to put URLs into queues. Default value is 'byHost',
> > > >   also takes 'byDomain' or 'byIP'.
> > > >   </description>
> > > > </property>
> > > > 
> > > > <property>
> > > >   <name>fetcher.verbose</name>
> > > >   <value>false</value>
> > > >   <description>If true, fetcher will log more verbosely.</description>
> > > > </property>
> > > > 
> > > > I inherited this and am not that well versed in Nutch.  Many hours of
> > > > searching and trying what I have found, but still no luck.  I can't get
> > > > it to just crawl the local system, http://127.0.0.1:8080/cocoon
> > > > 
> > > > Any help would be greatly appreciated.
> > > > 
> > > > -- 
> > > > Mitch Baker <mitch.ba...@iga.in.gov>
> > > > LSA
> > > > 
> > 
> > -- 
> > Mitch Baker <mitch.ba...@iga.in.gov>
> > LSA
> 
> -- 
> Mitch Baker <mitch.ba...@iga.in.gov>
> LSA
> 
