Re: nutch 2.x recrawl re-crawl

2013-01-13 Thread Bayu Widyasanyata
On Mon, Jan 14, 2013 at 6:45 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > > Markus, implemented an extension of the AdaptiveFetchSchedule [0] which > allows you to specify a configuration file [1] containing the mime-types > and their inc and dec rates based upon your preference.

Re: Not all parsed docs is indexed & inconsistent parsed docs.

2013-01-13 Thread Lewis John Mcgibbney
Please see below On Sat, Jan 12, 2013 at 8:48 PM, Bayu Widyasanyata wrote: > > That's tomcat port for Solr. > Should we activate the proxy setting? > Is it already activated in nutch-site.xml? No I do not think it should be activated unless you have a proxy running. > > > > But the strange is t

Re: nutch 2.x recrawl re-crawl

2013-01-13 Thread Lewis John Mcgibbney
Hi J, On Sun, Jan 13, 2013 at 2:14 AM, J. Gobel wrote: > > At the moment I am testing if this works. > Please keep us updated then. > > This is not > desirable as this means that ALL urls will be fetched daily. Typically if URLs are dynamically changing, you would want to maintain a webdb of

nutch 2.x recrawl re-crawl

2013-01-13 Thread J. Gobel
hi there, I am trying to figure out what the best method is to recrawl certain sites. I am crawling news-sites and they update their frontpage quite often, so I need to crawl their frontpage/index.php etc. often and have Nutch fetch the new links + content. I cannot find an answer to my question i

Re: nutch javascript capabilities

2013-01-13 Thread Lewis John Mcgibbney
This should be correct, yes. If you look at the plugin source you can see the patterns it uses to extract links. Also you can check what's in your crawldb using the readdb command Hth Lewis On Saturday, January 12, 2013, Michael Gang wrote: > Hi, > > So if there is a javascript which actually submit

Re: How segments is created?

2013-01-13 Thread Bayu Widyasanyata
On Sun, Jan 13, 2013 at 5:50 PM, Markus Jelsma wrote: > No, you can plugin another FetchSchedule that supports adjusting the > interval based on whether a record is modified. See the > AdaptiveFetchSchedule for an example. > Hi, Thanks for pointing into that subject since I'm new in nutch & solr
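
For readers following this thread, the idea behind a schedule that adjusts the interval "based on whether a record is modified" can be sketched roughly as below. This is only an illustration in Python, not Nutch's Java code: the function name `next_fetch_interval`, the parameter names, and the default rates and bounds are assumptions chosen for the example. The interval shrinks when a page was found modified and grows when it was not, then is clamped to a minimum and maximum.

```python
def next_fetch_interval(interval, modified,
                        inc_rate=0.4, dec_rate=0.2,
                        min_interval=60.0,
                        max_interval=365 * 24 * 3600.0):
    """Return the next re-fetch interval in seconds.

    Hypothetical sketch: all names and default values here are
    illustrative assumptions, not values read from a Nutch config.
    """
    if modified:
        # Page changed since the last fetch: come back sooner.
        interval *= (1.0 - dec_rate)
    else:
        # Page unchanged: back off and come back later.
        interval *= (1.0 + inc_rate)
    # Keep the result inside the configured bounds.
    return max(min_interval, min(interval, max_interval))


if __name__ == "__main__":
    one_day = 24 * 3600.0
    print(next_fetch_interval(one_day, modified=True))
    print(next_fetch_interval(one_day, modified=False))
```

With such a scheme a frequently changing front page drifts toward the minimum interval, while static pages drift toward the maximum, so not every URL has to be re-fetched daily.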

Re: Size limit for fetched pages

2013-01-13 Thread k4200
Hi Feng and Lewis, Thanks for your replies! I tried a few different settings and finally found out that increasing "http.content.limit" fixed the problem. Kaz 2013/1/13 Lewis John Mcgibbney : > Hi Kaz, > > On Sat, Jan 12, 2013 at 1:09 AM, k4200 wrote: > >> >> Here are the questions: >> 1. How t
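
As a rough illustration of why raising "http.content.limit" helps, the sketch below mimics a byte limit applied to downloaded content. It is a simplified Python model, not Nutch code: the function name `apply_content_limit` is hypothetical, and the rule that a negative limit disables truncation is stated here as an assumption of the sketch.

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Truncate downloaded content to at most `limit` bytes.

    Hypothetical helper: a negative limit is treated as
    "no truncation" (an assumption of this sketch).
    """
    if limit >= 0 and len(content) > limit:
        # Anything past the limit is dropped, so links and text
        # near the end of a large page would never be parsed.
        return content[:limit]
    return content


if __name__ == "__main__":
    page = b"x" * 100_000
    print(len(apply_content_limit(page, 65_536)))
    print(len(apply_content_limit(page, -1)))
```

A page larger than the limit is cut off before parsing, which is consistent with large pages appearing incomplete until the limit is increased.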

RE: How segments is created?

2013-01-13 Thread Markus Jelsma
-Original message- > From:Bayu Widyasanyata > Sent: Sun 13-Jan-2013 07:34 > To: user@nutch.apache.org > Subject: Re: How segments is created? > > On Sun, Jan 13, 2013 at 12:47 PM, Tejas Patil wrote: > > > > > Well, if you know that the front page is updated frequently, set > > "db.