If you are planning to specify the crawl name on the command line, one 
approach would be to have your script use sed to modify nutch-site.xml 
before the crawl step begins.
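A minimal sketch of that approach (the file path, the `crawlname` field, and the wrapper-script argument are assumptions about your setup, not fixed Nutch conventions):

```shell
#!/bin/sh
# Sketch: rewrite the index.static value in nutch-site.xml before a crawl.
# Assumes the file already contains a line like:
#   <value>crawlname:placeholder</value>
CRAWL_NAME="${1:-test_crawl}"
CONF="${NUTCH_CONF:-conf/nutch-site.xml}"

# Swap whatever crawlname value is currently configured for this run's name.
sed -i "s|<value>crawlname:[^<]*</value>|<value>crawlname:${CRAWL_NAME}</value>|" "$CONF"

# Then kick off the crawl as usual, e.g.:
# bin/crawl urls/ crawl_dir http://localhost:8080/solr/collection1 2
```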

-----Original Message-----
From: Katrina Riehl [mailto:[email protected]] 
Sent: Wednesday, April 8, 2015 10:00 AM
To: [email protected]
Subject: Re: Adding field to Nutch / Solr

My understanding is that I can use the index-static plugin if the information 
doesn't change, i.e., remains static. However, there will be a different crawl_name 
every time I run a new crawl.  I'd like to take that new crawl name and add it 
into Solr somehow.

Is my understanding correct?  Or is there a way to override that field on a 
per-crawl basis?

Thanks

On Wed, Apr 8, 2015 at 9:54 AM, Iain Lopata <[email protected]> wrote:

> Katrina,
>
> If I am understanding you correctly, you could do this with the 
> index-static plugin which is configured with the following property:
>
> <property>
>   <name>index.static</name>
>   <value> fieldname:fieldcontent </value>
>   <description>
>   A simple plugin called at indexing that adds fields with static data.
>   You can specify a list of fieldname:fieldcontent per nutch job.
>   It can be useful when collections can't be created by urlpatterns,
>   like in subcollection, but on a job-basis.
>   </description>
> </property>
>
> Use crawlname as your fieldname and use a different config directory 
> for each of your crawls with an appropriate value for fieldcontent set in 
> each.
>
> Iain
>
> -----Original Message-----
> From: Katrina Riehl [mailto:[email protected]]
> Sent: Wednesday, April 8, 2015 9:41 AM
> To: [email protected]
> Subject: Re: Adding field to Nutch / Solr
>
> Right, I can create multiple collections no problem... but, what I'd 
> really love is to put them into the same collection, just adding a 
> field like "crawl_name" to the index.
>
> Any way I can do that?
>
> Thanks!
>
>
> On Wed, Apr 8, 2015 at 9:15 AM, Iain Lopata <[email protected]> wrote:
>
> > Katrina,
> >
> > When you specify the Solr instance as the third parameter to 
> > bin/crawl, try specifying the collection name in the path, e.g.
> > http://localhost:8080/solr/collection1
> >
> > Iain
> >
> > -----Original Message-----
> > From: Katrina Riehl [mailto:[email protected]]
> > Sent: Wednesday, April 8, 2015 8:51 AM
> > To: [email protected]
> > Subject: Adding field to Nutch / Solr
> >
> > Hello,
> >
> > I am new to using Nutch.  I'm developing an application that crawls 
> > websites, and then indexes information about those websites into a 
> > Solr instance.  The problem is, it's putting all the crawled 
> > documents into the same Solr collection.
> >
> > Is there a way for me to add a field specifying which crawl the 
> > index came from?  Is there a command line option I can add when I 
> > start the
> crawl?
> >
> > Thank you so much for your help.
> >
> > --
> > Katrina Riehl
> > Continuum Analytics
> > [email protected]
> >
> >
>
>
