Markus, I tried it. The command line works great. But it doesn't seem to achieve the filtering effect even if I provide really tight patterns in the regex file. Any idea why?
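For reference, the invocation I used was along these lines — the filter file path, Solr URL, and crawl data paths below are placeholders for my local setup, and the -D option goes before the positional arguments, per the Hadoop generic-options page Lewis linked:

  bin/nutch solrindex -Durlfilter.regex.file=/path/to/index-urlfilter.txt \
      http://localhost:8983/solr/ <crawldb> <linkdb> <segment> ...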
On Sun, Nov 4, 2012 at 4:38 AM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

> http://hadoop.apache.org/docs/r1.0.3/commands_manual.html#Generic+Options
>
> hth
>
> On Sun, Nov 4, 2012 at 9:15 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > Just try it. With -D you can override Nutch and Hadoop configuration properties.
> >
> > -----Original message-----
> >> From: Joe Zhang <smartag...@gmail.com>
> >> Sent: Sun 04-Nov-2012 06:07
> >> To: user <user@nutch.apache.org>
> >> Subject: Re: URL filtering: crawling time vs. indexing time
> >>
> >> Markus, I don't see "-D" as a valid command parameter for solrindex.
> >>
> >> On Fri, Nov 2, 2012 at 11:37 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >>
> >> > Ah, I understand now.
> >> >
> >> > The indexer tool can filter as well in 1.5.1, and if you enable the regex filter and set a different regex configuration file when indexing vs. crawling, you should be good to go.
> >> >
> >> > You can override the default configuration file by setting urlfilter.regex.file and pointing it to the regex file you want to use for indexing. You can set it via nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
> >> >
> >> > Cheers
> >> >
> >> > -----Original message-----
> >> > > From: Joe Zhang <smartag...@gmail.com>
> >> > > Sent: Fri 02-Nov-2012 17:55
> >> > > To: user@nutch.apache.org
> >> > > Subject: Re: URL filtering: crawling time vs. indexing time
> >> > >
> >> > > I'm not sure I get it. Again, my problem is a very generic one:
> >> > >
> >> > > - The patterns in regex-urlfilter.txt, however exotic they are, control ***which URLs to visit***.
> >> > > - Generally speaking, the set of URLs to be indexed into Solr is only a ***subset*** of the above.
> >> > >
> >> > > We need a way to specify a crawling filter (which is regex-urlfilter.txt) separately from an indexing filter, I think.
> >> > >
> >> > > On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux <r...@teorem.fr> wrote:
> >> > >
> >> > > > You still have several possibilities here:
> >> > > > 1) Find a way to seed the crawl with the URLs containing the links to the leaf pages (sometimes this is possible with a simple loop).
> >> > > > 2) Create a regex for each step of the scenario leading to the leaf page, in order to limit the crawl to the necessary pages only. Use the $ sign at the end of your regex to limit the match of patterns like http://([a-z0-9]*\.)*mysite.com.
> >> > > >
> >> > > > On Nov 2, 2012, at 17:22, Joe Zhang <smartag...@gmail.com> wrote:
> >> > > >
> >> > > > > The problem is that,
> >> > > > >
> >> > > > > - if you write a regex such as +^http://([a-z0-9]*\.)*mysite.com, you'll end up indexing all the pages on the way, not just the leaf pages.
> >> > > > > - if you write a specific regex for http://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and you start crawling at mysite.com, you'll get zero results, as there is no match.
> >> > > > >
> >> > > > > On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >> > > > >
> >> > > > >> -----Original message-----
> >> > > > >>> From: Joe Zhang <smartag...@gmail.com>
> >> > > > >>> Sent: Fri 02-Nov-2012 10:04
> >> > > > >>> To: user@nutch.apache.org
> >> > > > >>> Subject: URL filtering: crawling time vs. indexing time
> >> > > > >>>
> >> > > > >>> I feel like this is a trivial question, but I just can't get my head around it.
> >> > > > >>>
> >> > > > >>> I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at the rudimentary level.
> >> > > > >>>
> >> > > > >>> If my understanding is correct, the regexes in nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which URLs to visit or not in the crawling process.
> >> > > > >>
> >> > > > >> Yes.
> >> > > > >>
> >> > > > >>> On the other hand, it doesn't seem artificial for us to only want certain pages to be indexed. I was hoping to write some regular expressions as well in some config file, but I just can't find the right place. My hunch tells me that such things should not require into-the-box (custom Java) coding. Can anybody help?
> >> > > > >>
> >> > > > >> What exactly do you want? To add your custom regular expressions? The regex-urlfilter.txt is the place to write them.
> >> > > > >>
> >> > > > >>> Again, the scenario is really rather generic. Let's say we want to crawl http://www.mysite.com. We can use the regex-urlfilter.txt to skip loops and unnecessary file types etc., but only expect to index pages with URLs like: http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.
> >> > > > >>
> >> > > > >> To do this you must simply make sure your regular expressions can do this.
> >> > > > >>
> >> > > > >>> Am I too naive to expect zero Java coding in this case?
> >> > > > >>
> >> > > > >> No, you can achieve almost all kinds of exotic filtering with just the URL filters and the regular expressions.
> >> > > > >>
> >> > > > >> Cheers
> >> > >
> >> >
> >>
> >
>
> --
> Lewis
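To make the two-file setup Markus describes concrete, here is a minimal sketch, assuming the mysite.com structure used as the example in this thread (hostnames and path patterns are placeholders, and it assumes the urlfilter-regex plugin is listed in plugin.includes). The crawl-time file stays broad so the intermediate, link-bearing pages still get fetched and followed; the index-time file, swapped in via -Durlfilter.regex.file, is tight and $-anchored (per Rémy's tip) so only the leaf pages reach Solr. In both files, rules are tried in order, the first match wins, + accepts, and - rejects.

conf/regex-urlfilter.txt (crawl time — broad, so links can be followed):

  # skip common non-content suffixes
  -\.(gif|jpg|png|css|js)$
  # accept anything on the site
  +^http://([a-z0-9]*\.)*mysite.com
  # reject everything else
  -.

index-urlfilter.txt (index time, passed via -Durlfilter.regex.file — leaf pages only):

  # accept only the leaf pages; note the terminating $ and escaped dots
  +^http://www\.mysite\.com/level1pattern/level2pattern/pagepattern\.html$
  # reject everything else
  -.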