Just try it. With -D you can override Nutch and Hadoop configuration properties.
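
As a concrete sketch (the file name, pattern, Solr URL and crawl paths below are
made up; adapt them to your own layout and Nutch version):

  # conf/regex-urlfilter-index.txt -- accepts only the leaf pages to index
  +^http://www\.mysite\.com/[^/]+/[^/]+/[^/]+\.html$

  bin/nutch solrindex -Durlfilter.regex.file=regex-urlfilter-index.txt \
      http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb \
      crawl/segments/*

The indexer goes through Hadoop's generic option parser, so the -D options have
to come before the positional arguments. The regex filter resolves its rules
file as a classpath resource, so dropping the file into Nutch's conf/ directory
is the safest bet.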
-----Original message-----
> From: Joe Zhang <smartag...@gmail.com>
> Sent: Sun 04-Nov-2012 06:07
> To: user <user@nutch.apache.org>
> Subject: Re: URL filtering: crawling time vs. indexing time
>
> Markus, I don't see "-D" as a valid command parameter for solrindex.
>
> On Fri, Nov 2, 2012 at 11:37 AM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
>
> > Ah, I understand now.
> >
> > The indexer tool can filter as well in 1.5.1. If you enable the regex
> > filter and set a different regex configuration file for indexing than
> > for crawling, you should be good to go.
> >
> > You can override the default configuration file by setting
> > urlfilter.regex.file and pointing it to the regex file you want to use
> > for indexing. You can set it via: nutch solrindex
> > -Durlfilter.regex.file=/path http://solrurl/ ...
> >
> > Cheers
> >
> > -----Original message-----
> > > From: Joe Zhang <smartag...@gmail.com>
> > > Sent: Fri 02-Nov-2012 17:55
> > > To: user@nutch.apache.org
> > > Subject: Re: URL filtering: crawling time vs. indexing time
> > >
> > > I'm not sure I get it. Again, my problem is a very generic one:
> > >
> > > - The patterns in regex-urlfilter.txt, however exotic they are,
> > > control ***which URLs to visit***.
> > > - Generally speaking, the set of URLs to be indexed into Solr is only
> > > a ***subset*** of the above.
> > >
> > > We need a way to specify a crawling filter (which is
> > > regex-urlfilter.txt) versus an indexing filter, I think.
> > >
> > > On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux <r...@teorem.fr> wrote:
> > >
> > > > You still have several possibilities here:
> > > > 1) find a way to seed the crawl with the URLs containing the links
> > > > to the leaf pages (sometimes it is possible with a simple loop)
> > > > 2) create a regex for each step of the scenario leading to the leaf
> > > > page, in order to limit the crawl to the necessary pages only. Use
> > > > the $ sign at the end of your regexp to limit the match of a regexp
> > > > like http://([a-z0-9]*\.)*mysite.com.
> > > >
> > > > On 2 Nov 2012, at 17:22, Joe Zhang <smartag...@gmail.com> wrote:
> > > >
> > > > > The problem is that,
> > > > >
> > > > > - if you write a regex such as +^http://([a-z0-9]*\.)*mysite.com,
> > > > > you'll end up indexing all the pages on the way, not just the
> > > > > leaf pages.
> > > > > - if you write a specific regex for
> > > > > http://www.mysite.com/level1pattern/level2pattern/pagepattern.html,
> > > > > and you start crawling at mysite.com, you'll get zero results, as
> > > > > there is no match.
> > > > >
> > > > > On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma
> > > > > <markus.jel...@openindex.io> wrote:
> > > > >
> > > > >> -----Original message-----
> > > > >>> From: Joe Zhang <smartag...@gmail.com>
> > > > >>> Sent: Fri 02-Nov-2012 10:04
> > > > >>> To: user@nutch.apache.org
> > > > >>> Subject: URL filtering: crawling time vs. indexing time
> > > > >>>
> > > > >>> I feel like this is a trivial question, but I just can't get my
> > > > >>> head around it.
> > > > >>>
> > > > >>> I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine
> > > > >>> at the rudimentary level.
> > > > >>>
> > > > >>> If my understanding is correct, the regexes in
> > > > >>> nutch/conf/regex-urlfilter.txt control the crawling behavior,
> > > > >>> i.e., which URLs to visit or not during the crawling process.
> > > > >>
> > > > >> Yes.
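> > > > >>
> > > > >> For illustration, that file is a top-down list of +/- rules where
> > > > >> the first matching pattern wins, and a URL that matches no rule at
> > > > >> all is rejected. A typical crawl-time setup looks something like
> > > > >> this (the patterns here are only examples, trimmed down):
> > > > >>
> > > > >>   # skip images and other binary file types
> > > > >>   -\.(gif|jpg|png|ico|css|js|zip|gz|exe)$
> > > > >>   # skip URLs with repeated path segments, i.e. crawler loops
> > > > >>   -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > > > >>   # follow everything else on the target site
> > > > >>   +^http://([a-z0-9]*\.)*mysite.com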
> > > > >>
> > > > >>> On the other hand, it doesn't seem artificial for us to only
> > > > >>> want certain pages to be indexed. I was hoping to write some
> > > > >>> regular expressions as well in some config file, but I just
> > > > >>> can't find the right place. My hunch tells me that such things
> > > > >>> should not require into-the-box coding. Can anybody help?
> > > > >>
> > > > >> What exactly do you want? To add your custom regular expressions?
> > > > >> The regex-urlfilter.txt is the place to write them.
> > > > >>
> > > > >>> Again, the scenario is really rather generic. Let's say we want
> > > > >>> to crawl http://www.mysite.com. We can use regex-urlfilter.txt
> > > > >>> to skip loops and unnecessary file types etc., but we only
> > > > >>> expect to index pages with URLs like:
> > > > >>> http://www.mysite.com/level1pattern/level2pattern/pagepattern.html
> > > > >>
> > > > >> To do this you must simply make sure your regular expressions can
> > > > >> do this.
> > > > >>
> > > > >>> Am I too naive to expect zero Java coding in this case?
> > > > >>
> > > > >> No, you can achieve almost all kinds of exotic filtering with
> > > > >> just the URL filters and the regular expressions.
> > > > >>
> > > > >> Cheers
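> > > > >>
> > > > >> P.S. As a rough idea of what plain rules can express: rejecting
> > > > >> print versions while accepting one subtree is just two lines
> > > > >> (patterns invented, reject rules must come first to win):
> > > > >>
> > > > >>   -print=true
> > > > >>   +^http://www\.mysite\.com/archive/.*\.html$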