Markus, could you advise? This seems the most promising approach, and I'm quite confident that my URL pattern file is correct.
On Sun, Nov 4, 2012 at 6:39 PM, Joe Zhang <smartag...@gmail.com> wrote:
> Markus, I tried it. The command line works great. But it doesn't seem to achieve the filtering effect even if I provide really tight patterns in the regex file. Any idea why?
>
> On Sun, Nov 4, 2012 at 4:38 AM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
>> http://hadoop.apache.org/docs/r1.0.3/commands_manual.html#Generic+Options
>>
>> hth
>>
>> On Sun, Nov 4, 2012 at 9:15 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>> Just try it. With -D you can override Nutch and Hadoop configuration properties.
>>>
>>> -----Original message-----
>>>> From: Joe Zhang <smartag...@gmail.com>
>>>> Sent: Sun 04-Nov-2012 06:07
>>>> To: user <user@nutch.apache.org>
>>>> Subject: Re: URL filtering: crawling time vs. indexing time
>>>>
>>>> Markus, I don't see "-D" as a valid command parameter for solrindex.
>>>>
>>>> On Fri, Nov 2, 2012 at 11:37 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>> Ah, I understand now.
>>>>>
>>>>> The indexer tool can filter as well in 1.5.1, and if you enable the regex filter and set a different regex configuration file when indexing vs. crawling, you should be good to go.
>>>>>
>>>>> You can override the default configuration file by setting urlfilter.regex.file and pointing it to the regex file you want to use for indexing. You can set it via: nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
>>>>>
>>>>> Cheers
>>>>>
>>>>> -----Original message-----
>>>>>> From: Joe Zhang <smartag...@gmail.com>
>>>>>> Sent: Fri 02-Nov-2012 17:55
>>>>>> To: user@nutch.apache.org
>>>>>> Subject: Re: URL filtering: crawling time vs. indexing time
>>>>>>
>>>>>> I'm not sure I get it.
>>>>>> Again, my problem is a very generic one:
>>>>>>
>>>>>> - The patterns in regex-urlfilter.txt, however exotic they are, control ***which URLs to visit***.
>>>>>> - Generally speaking, the set of URLs to be indexed into Solr is only a ***subset*** of the above.
>>>>>>
>>>>>> We need a way to specify a crawling filter (which is regex-urlfilter.txt) vs. an indexing filter, I think.
>>>>>>
>>>>>> On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux <r...@teorem.fr> wrote:
>>>>>>> You still have several possibilities here:
>>>>>>> 1) find a way to seed the crawl with the URLs containing the links to the leaf pages (sometimes it is possible with a simple loop)
>>>>>>> 2) create a regex for each step of the scenario leading to the leaf page, in order to limit the crawl to the necessary pages only. Use the $ sign at the end of your regexp to limit the match of a regexp like http://([a-z0-9]*\.)*mysite.com.
>>>>>>>
>>>>>>> On 2 Nov 2012, at 17:22, Joe Zhang <smartag...@gmail.com> wrote:
>>>>>>>> The problem is that,
>>>>>>>>
>>>>>>>> - if you write a regex such as +^http://([a-z0-9]*\.)*mysite.com, you'll end up indexing all the pages on the way, not just the leaf pages.
>>>>>>>> - if you write a specific regex for http://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and you start crawling at mysite.com, you'll get zero results, as there is no match.
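The mismatch Joe describes can be sketched outside Nutch. Below is a minimal Python approximation of the regex-urlfilter.txt semantics (+/- prefixed rules, first matching pattern decides, URLs matching no rule are rejected) — this is not Nutch's actual RegexURLFilter class, and the patterns are hypothetical — showing why a crawl-time file must stay broad enough to admit hub pages, while a separate index-time file can use a $-anchored pattern that accepts only the leaves:

```python
import re

def passes(rules, url):
    # Approximation of Nutch regex-urlfilter semantics: rules are
    # ('+', pattern) or ('-', pattern); the first pattern found anywhere
    # in the URL decides, and a URL matching no rule is rejected.
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == '+'
    return False

# Crawl-time rule must stay broad, or the crawler never reaches the leaves.
crawl_rules = [('+', r'^http://([a-z0-9]*\.)*mysite\.com')]

# Index-time rule can be tight: a $-anchored leaf pattern (hypothetical).
index_rules = [('+', r'^http://www\.mysite\.com/[^/]+/[^/]+/[^/]+\.html$')]

hub = 'http://www.mysite.com/level1pattern/'
leaf = 'http://www.mysite.com/level1pattern/level2pattern/pagepattern.html'

print(passes(crawl_rules, hub), passes(index_rules, hub))    # hub: crawled, not indexed
print(passes(crawl_rules, leaf), passes(index_rules, leaf))  # leaf: crawled and indexed
```

With two files split this way, the same crawl still traverses the hub pages down to the leaves, but only the $-anchored leaf URLs would survive the index-time filter.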
>>>>>>>>
>>>>>>>> On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>>>>>> -----Original message-----
>>>>>>>>>> From: Joe Zhang <smartag...@gmail.com>
>>>>>>>>>> Sent: Fri 02-Nov-2012 10:04
>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>> Subject: URL filtering: crawling time vs. indexing time
>>>>>>>>>>
>>>>>>>>>> I feel like this is a trivial question, but I just can't get my head around it.
>>>>>>>>>>
>>>>>>>>>> I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at the rudimentary level.
>>>>>>>>>>
>>>>>>>>>> If my understanding is correct, the regexes in nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which URLs to visit or not in the crawling process.
>>>>>>>>>
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>>>> On the other hand, it doesn't seem artificial for us to only want certain pages to be indexed. I was hoping to write some regular expressions as well in some config file, but I just can't find the right place. My hunch tells me that such things should not require into-the-box coding. Can anybody help?
>>>>>>>>>
>>>>>>>>> What exactly do you want? Add your custom regular expressions? The regex-urlfilter.txt is the place to write them to.
>>>>>>>>>
>>>>>>>>>> Again, the scenario is really rather generic. Let's say we want to crawl http://www.mysite.com.
>>>>>>>>>> We can use the regex-urlfilter.txt to skip loops and unnecessary file types etc., but only expect to index pages with URLs like http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.
>>>>>>>>>
>>>>>>>>> To do this you must simply make sure your regular expressions can do this.
>>>>>>>>>
>>>>>>>>>> Am I too naive to expect zero Java coding in this case?
>>>>>>>>>
>>>>>>>>> No, you can achieve almost all kinds of exotic filtering with just the URL filters and the regular expressions.
>>>>>>>>>
>>>>>>>>> Cheers
>>
>> --
>> Lewis
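Putting Markus's and Rémy's suggestions together for the mysite.com example, the index-time file passed via -Durlfilter.regex.file might look like the sketch below. The path and the pattern are hypothetical; the file uses the same one-rule-per-line +/- format as conf/regex-urlfilter.txt, with the trailing "-." rejecting everything no earlier rule accepted:

```
# /path/to/index-urlfilter.txt -- used only at indexing time via:
#   nutch solrindex -Durlfilter.regex.file=/path/to/index-urlfilter.txt http://solrurl/ ...
# Accept only the leaf pages; the $ anchor keeps hub pages out of Solr.
+^http://www\.mysite\.com/[^/]+/[^/]+/[^/]+\.html$
# Reject everything else.
-.
```

The broad crawl-time rules stay in conf/regex-urlfilter.txt untouched, so the crawler can still traverse the intermediate pages that lead to the leaves.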