Markus, could you advise? This seems the most promising approach, and I'm quite confident that my URL pattern file is correct.
On Sun, Nov 4, 2012 at 6:39 PM, Joe Zhang <smartag...@gmail.com> wrote:
> Markus, I tried it. The command line works great. But it doesn't seem to achieve the filtering effect even if I provide really tight patterns in the regex file. Any idea why?
>
> On Sun, Nov 4, 2012 at 4:38 AM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
>> http://hadoop.apache.org/docs/r1.0.3/commands_manual.html#Generic+Options
>>
>> hth
>>
>> On Sun, Nov 4, 2012 at 9:15 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>> Just try it. With -D you can override Nutch and Hadoop configuration properties.
>>>
>>> -----Original message-----
>>>> From: Joe Zhang <smartag...@gmail.com>
>>>> Sent: Sun 04-Nov-2012 06:07
>>>> To: user <user@nutch.apache.org>
>>>> Subject: Re: URL filtering: crawling time vs. indexing time
>>>>
>>>> Markus, I don't see "-D" as a valid command parameter for solrindex.
>>>>
>>>> On Fri, Nov 2, 2012 at 11:37 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>> Ah, I understand now.
>>>>>
>>>>> The indexer tool can filter as well in 1.5.1, and if you enable the regex filter and set a different regex configuration file when indexing vs. crawling, you should be good to go.
>>>>>
>>>>> You can override the default configuration file by setting urlfilter.regex.file and pointing it to the regex file you want to use for indexing. You can set it via: nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
>>>>>
>>>>> Cheers
>>>>>
>>>>> -----Original message-----
>>>>>> From: Joe Zhang <smartag...@gmail.com>
>>>>>> Sent: Fri 02-Nov-2012 17:55
>>>>>> To: user@nutch.apache.org
>>>>>> Subject: Re: URL filtering: crawling time vs. indexing time
>>>>>>
>>>>>> I'm not sure I get it.
>>>>>> Again, my problem is a very generic one:
>>>>>>
>>>>>> - The patterns in regex-urlfilter.txt, however exotic they are, control ***which URLs to visit***.
>>>>>> - Generally speaking, the set of URLs to be indexed into Solr is only a ***subset*** of the above.
>>>>>>
>>>>>> We need a way to specify a crawling filter (which is regex-urlfilter.txt) vs. an indexing filter, I think.
>>>>>>
>>>>>> On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux <r...@teorem.fr> wrote:
>>>>>>> You still have several possibilities here:
>>>>>>> 1) find a way to seed the crawl with the URLs containing the links to the leaf pages (sometimes it is possible with a simple loop)
>>>>>>> 2) create a regex for each step of the scenario leading to the leaf page, in order to limit the crawl to the necessary pages only. Use the $ sign at the end of your regexp to limit the match of a regexp like http://([a-z0-9]*\.)*mysite.com.
>>>>>>>
>>>>>>> On 2 Nov 2012, at 17:22, Joe Zhang <smartag...@gmail.com> wrote:
>>>>>>>> The problem is that,
>>>>>>>>
>>>>>>>> - if you write a regex such as +^http://([a-z0-9]*\.)*mysite.com, you'll end up indexing all the pages on the way, not just the leaf pages.
>>>>>>>> - if you write a specific regex for http://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and you start crawling at mysite.com, you'll get zero results, as there is no match.
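The mismatch Joe describes can be sketched outside Nutch. Below is a minimal Python approximation of the regex-urlfilter.txt semantics (+/- prefixed rules, first matching pattern decides, URLs matching no rule are rejected) — this is not Nutch's actual RegexURLFilter class, and the patterns are hypothetical — showing why a crawl-time file must stay broad enough to admit hub pages, while a separate index-time file can use a $-anchored pattern that accepts only the leaves:

```python
import re

def passes(rules, url):
    # Approximation of Nutch regex-urlfilter semantics: rules are
    # ('+', pattern) or ('-', pattern); the first pattern found anywhere
    # in the URL decides, and a URL matching no rule is rejected.
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == '+'
    return False

# Crawl-time rule must stay broad, or the crawler never reaches the leaves.
crawl_rules = [('+', r'^http://([a-z0-9]*\.)*mysite\.com')]

# Index-time rule can be tight: a $-anchored leaf pattern (hypothetical).
index_rules = [('+', r'^http://www\.mysite\.com/[^/]+/[^/]+/[^/]+\.html$')]

hub = 'http://www.mysite.com/level1pattern/'
leaf = 'http://www.mysite.com/level1pattern/level2pattern/pagepattern.html'

print(passes(crawl_rules, hub), passes(index_rules, hub))    # hub: crawled, not indexed
print(passes(crawl_rules, leaf), passes(index_rules, leaf))  # leaf: crawled and indexed
```

With two files split this way, the same crawl still traverses the hub pages down to the leaves, but only the $-anchored leaf URLs would survive the index-time filter.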
>>>>>>>>
>>>>>>>> On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>>>>>> -----Original message-----
>>>>>>>>>> From: Joe Zhang <smartag...@gmail.com>
>>>>>>>>>> Sent: Fri 02-Nov-2012 10:04
>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>> Subject: URL filtering: crawling time vs. indexing time
>>>>>>>>>>
>>>>>>>>>> I feel like this is a trivial question, but I just can't get my head around it.
>>>>>>>>>>
>>>>>>>>>> I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at the rudimentary level.
>>>>>>>>>>
>>>>>>>>>> If my understanding is correct, the regexes in nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which URLs to visit or not in the crawling process.
>>>>>>>>>
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>>>> On the other hand, it doesn't seem artificial for us to only want certain pages to be indexed. I was hoping to write some regular expressions as well in some config file, but I just can't find the right place. My hunch tells me that such things should not require into-the-box coding. Can anybody help?
>>>>>>>>>
>>>>>>>>> What exactly do you want? Add your custom regular expressions? The regex-urlfilter.txt is the place to write them to.
>>>>>>>>>
>>>>>>>>>> Again, the scenario is really rather generic. Let's say we want to crawl http://www.mysite.com.
>>>>>>>>>> We can use the regex-urlfilter.txt to skip loops and unnecessary file types etc., but only expect to index pages with URLs like http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.
>>>>>>>>>
>>>>>>>>> To do this you must simply make sure your regular expressions can do this.
>>>>>>>>>
>>>>>>>>>> Am I too naive to expect zero Java coding in this case?
>>>>>>>>>
>>>>>>>>> No, you can achieve almost all kinds of exotic filtering with just the URL filters and the regular expressions.
>>>>>>>>>
>>>>>>>>> Cheers
>>
>> --
>> Lewis
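Putting Markus's and Rémy's suggestions together for the mysite.com example, the index-time file passed via -Durlfilter.regex.file might look like the sketch below. The path and the pattern are hypothetical; the file uses the same one-rule-per-line +/- format as conf/regex-urlfilter.txt, with the trailing "-." rejecting everything no earlier rule accepted:

```
# /path/to/index-urlfilter.txt -- used only at indexing time via:
#   nutch solrindex -Durlfilter.regex.file=/path/to/index-urlfilter.txt http://solrurl/ ...
# Accept only the leaf pages; the $ anchor keeps hub pages out of Solr.
+^http://www\.mysite\.com/[^/]+/[^/]+/[^/]+\.html$
# Reject everything else.
-.
```

The broad crawl-time rules stay in conf/regex-urlfilter.txt untouched, so the crawler can still traverse the intermediate pages that lead to the leaves.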