Are you using the depth parameter with the crawl command or are you using the separate generate, fetch etc. commands?
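For reference, the two ways of driving the crawl look roughly like this — the directory names, depth and topN values are assumptions for illustration, not taken from this thread:

```shell
# One-shot: the crawl command runs the inject/generate/fetch/parse/update
# loop itself, following outlinks for the given number of rounds.
# With -depth 1 only the seed URLs are fetched, which would explain
# seeing nothing beyond greetingcard.html.
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

# Equivalent manual cycle: each pass fetches one generation of outlinks,
# so repeat the loop once per level of depth you want to reach.
bin/nutch inject crawl/crawldb urls
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=$(ls -d crawl/segments/* | tail -1)    # newest segment
  bin/nutch fetch "$segment"
  bin/nutch parse "$segment"
  bin/nutch updatedb crawl/crawldb "$segment"    # adds newly found outlinks
done
```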
What's $ nutch readdb <crawldb> -stats returning?

On Wednesday 09 February 2011 15:06:40 .: Abhishek :. wrote:
> Hi Markus,
>
> I am sorry for not being clear; I meant to say that...
>
> Suppose a URL, say www.somehost.com/gifts/greetingcard.html (which in turn
> contains links to a.html, b.html, c.html and d.html), is injected into
> seed.txt. After the whole process I was expecting a bunch of other pages
> crawled from this seed URL. However, at the end all I see is the content
> of only this page, www.somehost.com/gifts/greetingcard.html, and I do not
> see any of the other pages (a.html, b.html, c.html, d.html) crawled from it.
>
> The crawl covers only the URLs mentioned in seed.txt and does not proceed
> further from there, so I am a bit confused. Why is it not crawling the
> linked pages (a.html, b.html, c.html and d.html)? I get the feeling I am
> missing something that the author of the blog
> (http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
> everyone would know.
>
> Thanks,
> Abi
>
> On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > The parsed data is only sent to the Solr index if you tell a segment to
> > be indexed: solrindex <crawldb> <linkdb> <segment>
> >
> > If you did this only once after injecting and then ran the subsequent
> > fetch, parse, update, index sequence, then you, of course, only see
> > those URLs. If you don't index a segment after it has been parsed, you
> > need to do it later on.
> >
> > On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
> > > Hi all,
> > >
> > > I am a newbie to Nutch and Solr. Well, relatively much newer to Solr
> > > than Nutch :)
> > >
> > > I have been using Nutch for the past two weeks, and I wanted to know
> > > whether I can query or search my Nutch crawls on the fly (before they
> > > complete). I am asking this because the websites I am crawling are
> > > really huge and it takes around 3-4 days for a crawl to complete. I
> > > want to analyze some quick results while the Nutch crawler is still
> > > crawling the URLs. Someone suggested that Solr would make this
> > > possible.
> > >
> > > I followed the steps in
> > > http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this.
> > > By this process, I see only the injected URLs in the Solr search. I
> > > know I did something really foolish and the crawl never happened; I
> > > feel I am missing some information here. I think somewhere in the
> > > process a crawl should have happened and I missed it.
> > >
> > > Just wanted to see if someone could help me point out where I went
> > > wrong in the process. Forgive my foolishness and thanks for your
> > > patience.
> > >
> > > Cheers,
> > > Abi
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
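The stats check and the per-segment indexing discussed in this thread can be sketched as follows — the crawl directory layout and the Solr URL are assumptions for illustration:

```shell
# Inspect the crawldb: the TOTAL urls vs db_fetched counts show whether
# outlinks were ever discovered and fetched beyond the injected seeds.
bin/nutch readdb crawl/crawldb -stats

# Build the linkdb and push every parsed segment to Solr, not just the
# first one; segments indexed this way become searchable while later
# fetch rounds are still running.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
for segment in crawl/segments/*; do
  bin/nutch solrindex http://localhost:8983/solr \
    crawl/crawldb crawl/linkdb "$segment"
done
```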