Re: nutch 1.3 solrindex empty content field

Markus Jelsma Mon, 19 Sep 2011 07:04:24 -0700


On Monday 19 September 2011 15:58:35 lewis john mcgibbney wrote:
> Yes, what Markus has pointed out is the problem, I think, Jann. This means
> you need to change the stored and indexed values to true in the schema and
> then re-index your data.
> 
> Markus, out of interest, do you know the pros and cons of making this the
> default in the Nutch schema? For small indexes I wouldn't imagine there
> would be much difference, but for non-trivially sized indexes I imagine it
> would be a different story...


The index size grows by roughly a factor of 2.1.
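
To check a copied schema for this, here is a minimal Python sketch. It is not
part of Nutch or Solr: the sample fragment below is only assumed to resemble
the real conf/schema.xml, and the function name explicit_unstored_fields is
hypothetical. It lists the fields that are explicitly declared stored="false".

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only (an assumption, not copied from the real file).
# In practice, read the schema.xml from your Solr conf directory instead.
SAMPLE_SCHEMA = """<schema name="nutch" version="1.3">
  <fields>
    <field name="url" type="url" stored="true" indexed="true"/>
    <field name="content" type="text" stored="false" indexed="true"/>
  </fields>
</schema>"""


def explicit_unstored_fields(schema_xml):
    """Return the names of <field> elements explicitly marked stored="false".

    Fields without a stored attribute are skipped, because their effective
    value depends on the field type and cannot be decided from this element.
    """
    root = ET.fromstring(schema_xml)
    return [
        field.get("name")
        for field in root.iter("field")
        if field.get("stored", "").lower() == "false"
    ]


if __name__ == "__main__":
    for name in explicit_unstored_fields(SAMPLE_SCHEMA):
        print("not stored:", name)
```

Any field reported here is searchable if it is indexed, but its text is not
returned in results and cannot be highlighted.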
> 
> Any thoughts?
> 
> On Mon, Sep 19, 2011 at 2:54 PM, Markus Jelsma
> 
> <[email protected]> wrote:
> > Check line 79 of your Solr schema:
> > 
> > http://svn.apache.org/viewvc/nutch/branches/branch-1.3/conf/schema.xml?view=markup
> > 
> > Maybe we should configure the field to be stored in 1.4. I can imagine
> > this causes a lot of headaches for new users. Also highlighting will
> > never work with unstored fields.
> > 
> > On Monday 19 September 2011 11:02:17 Jann Forrer wrote:
> > > Hi
> > > 
> > > I tried to run Nutch 1.3 together with Solr 3.x according to
> > > http://wiki.apache.org/nutch/NutchTutorial.
> > > 
> > > That worked as described, but when I search the index using the Solr
> > > admin interface I always get an empty result.
> > > 
> > > http://localhost:8983/solr/admin/schema.jsp
> > > 
> > > Using the Schema Browser I see entries in different fields (e.g. the
> > > url field) but the content field is empty. I looked for similar
> > > problems on the mailing list but didn't find a solution.
> > > 
> > > Here is what I did:
> > > 
> > > 1.) ./bin/nutch crawl urls -dir crawl -depth 3 -topN 5
> > > 2.) Dumping the segment (./bin/nutch readseg -dump
> > > crawl/segments/20110916124747 test). The script also dumped the
> > > content of the web pages. All seems to be OK here.
> > 
> > > 3.) Copy the nutch schema.xml to the solr conf directory
> > > 4.) bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
> > > crawl/linkdb crawl/segments/*
> > > 5.) Then searching via http://localhost:8983/solr/admin/, but I didn't
> > > find any HTML content. However, if there was a PDF file to crawl, its
> > > content is found.
> > 
> > > BTW, using Nutch 1.1 and Solr 1.4.1 everything worked as expected. I
> > > could use these versions, but I am upgrading from an older Nutch version
> > > and it would be nice to use the newer version, where Nutch and Solr are
> > > better integrated.
> > > 
> > > Any ideas what might be wrong?
> > > 
> > > Jann
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
