Re: site-specific crawling policies

2012-11-17 Thread Joe Zhang
What if I want to index different metatags for different sites?



RE: site-specific crawling policies

2012-11-16 Thread Markus Jelsma
You can override some URL filter paths in nutch-site.xml or with command-line
options (tools) such as bin/nutch fetch -Durlfilter.regex.file=bla. You can
also set NUTCH_HOME and keep everything separate if you're running it locally.
On Hadoop you'll need separate job files.
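
As a rough sketch, the per-run override and a separate local install might look
like this (the segment path, filter-file name, and install path are
placeholders, not from this thread):

  # keep one filter file per site group under conf/, then override it per run
  bin/nutch fetch -Durlfilter.regex.file=regex-urlfilter-siteA.txt \
      crawl-siteA/segments/20121116000000

  # or, per the NUTCH_HOME suggestion, run a completely separate local install
  export NUTCH_HOME=/opt/nutch-siteA
  $NUTCH_HOME/bin/nutch fetch crawl-siteA/segments/20121116000000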
 


Re: site-specific crawling policies

2012-11-16 Thread Joe Zhang
That's easy to do. But what about the configuration files? The same
nutch-site.xml and urlfilter files will be read.



Re: site-specific crawling policies

2012-11-16 Thread Sourajit Basak
Group related sites together and use separate crawldb and segment
directories for each group.
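
A minimal sketch of that layout, assuming two hypothetical site groups (the
directory and seed names below are illustrative):

  # one crawl directory per group of related sites
  bin/nutch inject crawl-groupA/crawldb seeds/groupA/
  bin/nutch inject crawl-groupB/crawldb seeds/groupB/

  # each group then runs its own generate/fetch/parse/updatedb cycle, e.g.
  bin/nutch generate crawl-groupA/crawldb crawl-groupA/segments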



Re: site-specific crawling policies

2012-11-15 Thread Joe Zhang
So how exactly do I set up different nutch instances then?



Re: site-specific crawling policies

2012-11-15 Thread Lewis John Mcgibbney
Hi Joe,

In all honesty, this might sound slightly optimistic, and it may also depend
on the size and calibre of the different sites/domains, but if you are
attempting a depth-first, domain-specific crawl, then separate Nutch instances
may well be your friend...
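
One way to realise "separate instances" locally is simply to keep one unpacked
copy of Nutch per domain-specific crawl, each with its own conf/ and crawl
data; a sketch with made-up paths:

  /opt/nutch-siteA/
      conf/nutch-site.xml        # site A's plugins, metatag list, etc.
      conf/regex-urlfilter.txt   # only site A's URL patterns
      crawl/                     # site A's crawldb + segments
  /opt/nutch-siteB/
      conf/nutch-site.xml
      conf/regex-urlfilter.txt
      crawl/

Each copy is then run on its own, so the two crawls never share configuration.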

Lewis




Re: site-specific crawling policies

2012-11-15 Thread Joe Zhang
Well, these are all details. The bigger question is: how do I separate the
crawling policy of site A from that of site B?



Re: site-specific crawling policies

2012-11-15 Thread Sourajit Basak
You probably need to customize the parse-metatags plugin.

I think you should go ahead and include all possible metatags, and take care of
missing metatags in Solr.
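
On the Solr side, one way to handle missing metatags is just to declare the
metatag fields as ordinary optional fields, so documents from sites lacking a
tag simply omit that field. A sketch against a stock 2012-era schema.xml (the
field and type names are assumptions based on the usual metatag.* naming, not
something verified from this thread):

  <!-- schema.xml: optional fields, one per metatag of interest -->
  <field name="metatag.description" type="text_general" indexed="true" stored="true"/>
  <field name="metatag.keywords"    type="text_general" indexed="true" stored="true"/>
  <field name="metatag.author"      type="text_general" indexed="true" stored="true"/>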



Re: site-specific crawling policies

2012-11-14 Thread Joe Zhang
I understand conf/regex-urlfilter.txt; I can put domain names into the URL
patterns.

But what about meta tags? What if I want to parse out different meta tags
for different sites?



Re: site-specific crawling policies

2012-11-14 Thread Sourajit Basak
1) For parsing and indexing customized meta tags, enable and configure the
plugin "parse-metatags".

2) There are several URL filters, e.g. regex-based. For regex, the patterns
are specified via conf/regex-urlfilter.txt.
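
For example, enabling and configuring the plugin might look like the following
nutch-site.xml fragment (a sketch only; the exact plugin list and property
names should be checked against conf/nutch-default.xml for your version):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>metatags.names</name>
    <value>description,keywords</value>
  </property>
  <property>
    <name>index.parse.md</name>
    <value>metatag.description,metatag.keywords</value>
  </property>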



Re: site-specific crawling policies

2012-11-14 Thread Tejas Patil
While defining URL patterns, include the domain name so that you get
site/domain-specific rules. I don't know about configuring meta tags.
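
A hypothetical conf/regex-urlfilter.txt fragment with the domain names baked
into the patterns (the domains and paths are placeholders):

  # site A: only crawl its article section
  +^https?://([a-z0-9-]+\.)*site-a\.com/articles/
  # site B: crawl everything on the news host
  +^https?://news\.site-b\.org/
  # skip everything else
  -.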

Thanks,
Tejas


On Tue, Nov 13, 2012 at 11:34 PM, Joe Zhang  wrote:

> How do I enforce site-specific crawling policies, i.e., different URL
> patterns, meta tags, etc. for different websites to be crawled? I got the
> sense that multiple instances of Nutch are needed? Is that correct? If yes,
> how?
>