Ok,

My http.content.limit is 65536 (the default) and I am storing the content. I
assume so, at least, since that is what is in my nutch-default.xml and I did
not override those settings in my nutch-site.xml.

I can manipulate my output in Drupal using print substr($snippet, 0, 800).
My Solr is set up to accept <copyField source="body" dest="teaser"
maxChars="800"/>, as is my schema for Nutch.

So, I guess now I should run another nutch instance and see the results. If
I'm missing something obvious let me know. Thanks for your help. I really
appreciate your time. It's not going to waste. 

-----Original Message-----
From: Dogacan Guney [mailto:[email protected]] 
Sent: Wednesday, November 10, 2010 12:26 PM
To: [email protected]
Subject: Re: Nutch Body Length


On Nov 10, 2010, at 12:19 PM, Eric Martin wrote:

> I am using Solr 1.4.0 as my index, Nutch 1.2 as my crawler and Drupal 6.x as
> my interface. My objective is to increase my teaser/description in my search
> results.
> 
> 
> 
> My obstacles are:
> 
> 
> 
> 1.)    Does nutch pull the entire page when it crawls and store it? (If it
> does, then I can re-index crawled documents and get more description into my
> search results. That would be easy!)
> 
> 2.)    Does nutch truncate the page? If so, I can't find out where so I can
> modify it to get the character length I need.
> 
> 

You should look at http.content.limit. If a document is longer than the value
specified with that option, then nutch truncates the page. Also, make sure
you store "content" if you want to access it later.
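
For illustration, here is a minimal sketch of how both settings might be
overridden in conf/nutch-site.xml. It assumes the Nutch 1.x property names
http.content.limit and fetcher.store.content; the value 262144 is only a
placeholder, so check the descriptions in your own nutch-default.xml before
relying on it.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed override: maximum bytes kept per HTTP download.
       The default is 65536; a negative value such as -1 should
       disable truncation entirely. 262144 is an arbitrary example. -->
  <property>
    <name>http.content.limit</name>
    <value>262144</value>
  </property>

  <!-- Assumed override: keep the fetched content in the segment so
       it can be re-indexed later (true is normally the default). -->
  <property>
    <name>fetcher.store.content</name>
    <value>true</value>
  </property>
</configuration>
```

Pages fetched before the change stay truncated at the old limit, so they
would need to be re-fetched before re-indexing into Solr.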

> 
> I guess my biggest question is, does nutch pull and keep the entire crawled
> page? If so, I know to look to Solr configuration to get my desired search
> results.
> 
> Thanks
> 
> 
> 
> Eric
> 
> 
> 
