Hi Kiks,

What kind of changes did you make to your schema when transferring to your
Solr instance?

You ask about the stored parsed text content. The default Nutch schema sets
the content field to stored=false, as it is not always necessary to store
all page content. Generally speaking, terms that occur in the title, meta,
etc. fields will be more valuable for searching across, especially once you
take storage into account. Hopefully you can change this behaviour simply by
setting stored=true for that field in your schema. However, Solr does not
take kindly to schema changes on an existing index, so you will need to
reindex your data into your Solr core afterwards.
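
As a rough sketch of the change, assuming your Solr schema.xml still carries
the content field more or less as shipped with Nutch (the type name "text"
below is an assumption, so keep whatever type your schema already declares):

```xml
<!-- schema.xml: store the parsed page text so it can be returned and highlighted -->
<!-- was stored="false"; type="text" is assumed, keep your existing type -->
<field name="content" type="text" stored="true" indexed="true"/>
```

After editing the schema, restart Solr and rerun the solrindex command from
your original mail. Highlighting on the page text should then be possible by
adding hl=true&hl.fl=content to the query.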

On Wed, Aug 3, 2011 at 7:31 AM, Kiks <kikstern...@gmail.com> wrote:

> This question was posted on solr list and not answered because nutch
> related...
>
>
> The indexed contents of 100 sites were imported to solr from nutch using:
>
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb
> crawl/segments/*
>
> now, a solr admin search for 'photography' includes these results:
>
>  <doc>
>    <float name="score">0.12570743</float>
>    <float name="boost">1.0440307</float>
>    <str name="digest">94d97f2806240d18d67cafe9c34f94e1</str>
>    <str name="id">http://www.galleryhopper.org/</str>
>    <str name="segment">...</str>
>    <str name="title">Gallery Hopper: Todd Walker's photography ephemera.
> Read, enjoy, share, discard.</str>
>    <date name="tstamp">...</date>
>    <str name="url">http://www.galleryhopper.org/</str>
>  </doc>
>
> but highlighting options are on the title field not page text.
>
> My question: Where is the stored parsetext content of the pages? What is the
> solr command to send it from nutch with url/id key? The information is
> contained in the crawl segments with solr id field matching nutch url.
>
> Thanks.
>



-- 
*Lewis*
