Thanks a bunch, Markus. By the way, is there some book or material on Nutch that would help me understand it better? I come from an application development background and all the crawl-and-search stuff is *very* new to me :)
On Wed, Jan 26, 2011 at 9:48 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> These values come from the CrawlDB and have the following meanings:
>
> db_unfetched
> The number of URLs that are to be crawled when the next batch is started.
> This number is usually limited by the generate.max.per.host setting. So, if
> there are 5000 unfetched URLs and generate.max.per.host is set to 1000, the
> next batch will fetch only 1000. Note that the number of unfetched URLs will
> usually not end up at 5000 - 1000, because new URLs will have been
> discovered and added to the CrawlDB in the meantime.
>
> db_fetched
> These URLs have been fetched. Their next fetch is due after
> db.fetcher.interval. This is not always the case, though: the adaptive
> schedule algorithm can tune this interval depending on several settings.
> With these you can tune the interval based on whether a page has been
> modified or not.
>
> db_gone
> HTTP 404 Not Found
>
> db_redir_temp
> HTTP 307 Temporary Redirect
>
> db_redir_perm
> HTTP 301 Moved Permanently
>
> Code:
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup
>
> Configuration:
> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/nutch-default.xml?view=markup
>
> > Thanks Chris, Charan and Alex.
> >
> > I am looking into the crawl statistics now, and I see fields like
> > db_unfetched, db_fetched, db_gone, db_redir_temp and db_redir_perm. What
> > do they mean?
> >
> > Also, I see that db_unfetched is far higher than db_fetched. Does it
> > mean most of the pages were not crawled at all due to some issue?
> >
> > Thanks again for your time!
> >
> > On Tue, Jan 25, 2011 at 2:33 PM, charan kumar <charan.ku...@gmail.com> wrote:
> > > db.fetcher.interval: it means that URLs which were fetched in the last
> > > 30 days (the default) will not be fetched.
> > > Or: a URL is eligible for refetch only after 30 days since its last
> > > crawl.
> > >
> > > On Mon, Jan 24, 2011 at 9:23 PM, <alx...@aim.com> wrote:
> > > > How do I use Solr to index Nutch segments?
> > > > What is the meaning of db.fetcher.interval? Does this mean that if I
> > > > run the same crawl command within 30 days it will do nothing?
> > > >
> > > > Thanks.
> > > > Alex.
> > > >
> > > > -----Original Message-----
> > > > From: Charan K <charan.ku...@gmail.com>
> > > > To: user <user@nutch.apache.org>
> > > > Cc: user <user@nutch.apache.org>
> > > > Sent: Mon, Jan 24, 2011 8:24 pm
> > > > Subject: Re: Few questions from a newbie
> > > >
> > > > Refer to NutchBean.java for the third question. You can run it from
> > > > the command line to test the index.
> > > >
> > > > If you use Solr indexing, it is going to be much simpler; they have
> > > > a Solr Java client.
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Jan 24, 2011, at 8:07 PM, Amna Waqar <amna.waqar...@gmail.com> wrote:
> > > > > 1. To crawl just 5 to 6 websites, you can use either approach, but
> > > > > an intranet crawl gives you more control and speed.
> > > > > 2. After the first crawl, the recrawl interval for the same sites
> > > > > is 30 days by default (db.fetcher.interval); you can change it to
> > > > > suit your own needs.
> > > > > 3. I have no idea about the third question, because I am also a
> > > > > newbie.
> > > > > Best of luck with learning Nutch!
> > > > >
> > > > > On Mon, Jan 24, 2011 at 9:04 PM, .: Abhishek :. <ab1s...@gmail.com> wrote:
> > > > >> Hi all,
> > > > >>
> > > > >> I am very new to Nutch and Lucene as well.
> > > > >> I am having a few questions about Nutch. I know they are very
> > > > >> basic, but I could not get clear-cut answers out of googling. The
> > > > >> questions are:
> > > > >>
> > > > >> - If I have to crawl just 5-6 web sites or URLs, should I use an
> > > > >>   intranet crawl or a whole-web crawl?
> > > > >> - How do I set up recrawls for these same web sites after the
> > > > >>   first crawl?
> > > > >> - If I have to search the results via my own Java code, which jar
> > > > >>   files, APIs or samples should I be looking into?
> > > > >> - Is there a book on Nutch?
> > > > >>
> > > > >> Thanks a bunch for your patience. I appreciate your time.
> > > > >>
> > > > >> ./Abishek
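[Editor's note] For reference, the settings discussed in this thread are ordinary Nutch configuration properties: defaults live in conf/nutch-default.xml and overrides go in conf/nutch-site.xml. A minimal sketch follows; note that the thread says "db.fetcher.interval", while in Nutch 1.2's nutch-default.xml the 30-day default is, to my recollection, held by db.fetch.interval.default (in seconds), so verify the exact names against your own nutch-default.xml. The values here are illustrative, not recommendations.

```xml
<!-- Sketch of conf/nutch-site.xml overrides. Verify property names against
     your conf/nutch-default.xml; values are illustrative only. -->
<configuration>
  <!-- Cap on URLs per host selected into one generated fetch batch
       (-1 means unlimited), per Markus's explanation of db_unfetched. -->
  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
  </property>
  <!-- Default refetch interval in seconds (2592000 s = 30 days),
       the "db.fetcher.interval" the thread refers to. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value>
  </property>
  <!-- Swap in the adaptive schedule Markus mentions, which shortens or
       lengthens the interval depending on whether a page changes. -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
</configuration>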
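[Editor's note] Alex's question about indexing Nutch segments into Solr, and Abhishek's about recrawling, can both be sketched as one manual crawl cycle with the Nutch 1.x command-line tools. This is a sketch under assumptions: the directory layout and the Solr URL are placeholders, and the exact bin/nutch usage should be checked against your Nutch version's output of `bin/nutch` with no arguments.

```shell
#!/bin/sh
# Sketch of one crawl/recrawl cycle with Nutch 1.x, ending with Solr
# indexing. Paths and the Solr URL are placeholders.
CRAWLDB=crawl/crawldb
LINKDB=crawl/linkdb
SEGMENTS=crawl/segments
SOLR=http://localhost:8983/solr

# Select URLs that are due (past their fetch interval) into a new segment.
bin/nutch generate $CRAWLDB $SEGMENTS
SEGMENT=$SEGMENTS/$(ls -t $SEGMENTS | head -1)   # newest segment

bin/nutch fetch $SEGMENT                 # fetch the generated URLs
bin/nutch parse $SEGMENT                 # parse (skip if fetcher.parse=true)
bin/nutch updatedb $CRAWLDB $SEGMENT     # fold results back into the CrawlDB
bin/nutch invertlinks $LINKDB $SEGMENT   # update the link database

# Push crawl data into Solr (Alex's question above).
bin/nutch solrindex $SOLR $CRAWLDB $LINKDB $SEGMENT
```

Rerunning the script before any URL's interval has elapsed makes the generate step select nothing, which is the behavior Alex asked about: the crawl does not redo work, it simply finds no URLs due for refetch.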