Re: URL filtering: crawling time vs. indexing time
Markus, could you advise? This seems the most promising approach, and I'm quite confident that my URL pattern file is correct.

On Sun, Nov 4, 2012 at 6:39 PM, Joe Zhang smartag...@gmail.com wrote:

> Markus, I tried it. The command line works great, but it doesn't seem to
> achieve the filtering effect even if I provide really tight patterns in
> the regex file. Any idea why?
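When the patterns look right but nothing changes, a useful first step is to run candidate URLs through the filter chain directly rather than through a full index run. A rough sketch with the filter checker class that ships with Nutch 1.x; the -allCombined flag and the +/- output convention are from memory of that tool, so treat them as assumptions to verify, and the URL is just the thread's placeholder leaf page:

    # Pipe a URL through every enabled URL filter plugin; the tool is
    # expected to echo the URL back marked as accepted (+) or rejected (-).
    echo "http://www.mysite.com/level1pattern/level2pattern/page1.html" | \
        bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

If a URL that should be indexed comes back rejected here, the problem is in the regex file; if it comes back accepted, the problem is in how the property override reaches the indexing job.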
RE: URL filtering: crawling time vs. indexing time
Just try it. With -D you can override Nutch and Hadoop configuration properties.

-Original message-
From: Joe Zhang smartag...@gmail.com
Sent: Sun 04-Nov-2012 06:07
To: user@nutch.apache.org
Subject: Re: URL filtering: crawling time vs. indexing time

> Markus, I don't see -D as a valid command parameter for solrindex.
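For concreteness, a sketch of what a full invocation might look like. The Solr URL, the crawl/ directory layout, and the regex-urlfilter-index.txt file name are illustrative assumptions, and the exact positional-argument order of solrindex varies across 1.x versions, so check the tool's usage message:

    # -D sets the property for this run only and must precede the
    # command-specific arguments (paths and Solr URL are illustrative)
    bin/nutch solrindex -Durlfilter.regex.file=conf/regex-urlfilter-index.txt \
        http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb \
        crawl/segments/*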
Re: URL filtering: crawling time vs. indexing time
http://hadoop.apache.org/docs/r1.0.3/commands_manual.html#Generic+Options

hth

On Sun, Nov 4, 2012 at 9:15 AM, Markus Jelsma markus.jel...@openindex.io wrote:

> Just try it. With -D you can override Nutch and Hadoop configuration
> properties.

--
Lewis
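The practical point behind the generic-options link is ordering: -D is handled by Hadoop's GenericOptionsParser before the tool sees its own arguments, so it only takes effect when it comes first. Roughly (paths illustrative):

    # recognized: generic options precede the tool's arguments
    bin/nutch solrindex -Durlfilter.regex.file=/path/to/index-filter.txt http://solrurl/ ...

    # not recognized as a property override: -D after the positional arguments
    bin/nutch solrindex http://solrurl/ ... -Durlfilter.regex.file=/path/to/index-filter.txt

This also explains why -D does not appear in solrindex's own usage message: it belongs to Hadoop, not to the Nutch tool.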
RE: URL filtering: crawling time vs. indexing time
-Original message-
From: Joe Zhang smartag...@gmail.com
Sent: Fri 02-Nov-2012 10:04
To: user@nutch.apache.org
Subject: URL filtering: crawling time vs. indexing time

> I feel like this is a trivial question, but I just can't get my head
> around it. I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine
> at the rudimentary level. If my understanding is correct, the regexes in
> nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which
> URLs to visit or not in the crawling process.

Yes.

> On the other hand, it doesn't seem an artificial requirement to want only
> certain pages to be indexed. I was hoping to write some regular
> expressions for that as well in some config file, but I just can't find
> the right place. My hunch tells me that such things should not require
> coding inside Nutch. Can anybody help?

What exactly do you want? Add your custom regular expressions? The
regex-urlfilter.txt is the place to write them to.

> Again, the scenario is really rather generic. Let's say we want to crawl
> http://www.mysite.com. We can use regex-urlfilter.txt to skip loops and
> unnecessary file types etc., but we only expect to index pages with URLs
> like http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.

To do this you must simply make sure your regular expressions can do this.

> Am I too naive to expect zero Java coding in this case?

No, you can achieve almost all kinds of exotic filtering with just the URL
filters and the regular expressions.

Cheers
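For reference, the crawl-time file typically follows the stock conf/regex-urlfilter.txt that ships with Nutch; a trimmed sketch with the final rule adapted to the thread's placeholder site (the suffix list is shortened here for brevity):

    # skip file:, ftp:, and mailto: URLs
    -^(file|ftp|mailto):

    # skip image and other binary suffixes we don't want to fetch
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|css|zip|gz|exe)$

    # skip URLs containing certain characters, as these are probably CGI queries
    -[?*!@=]

    # skip URLs with a slash-delimited segment repeating 3+ times, to break loops
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # accept everything on mysite.com; a URL matching no rule is rejected
    +^http://([a-z0-9]*\.)*mysite.com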
Re: URL filtering: crawling time vs. indexing time
The problem is that:

- if you write a broad regex such as +^http://([a-z0-9]*\.)*mysite.com,
  you'll end up indexing all the pages on the way, not just the leaf pages;
- if you write a specific regex for
  http://www.mysite.com/level1pattern/level2pattern/pagepattern.html and you
  start crawling at mysite.com, you'll get zero results, as there is no match.

On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma markus.jel...@openindex.io wrote:

> To do this you must simply make sure your regular expressions can do this.
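To see why both options fail, it helps to line the two patterns up against concrete URLs. The leaf pattern below generalizes pagepattern.html to any .html file name; that generalization is an assumption:

    # Broad rule: an unanchored prefix match, so the home page, the section
    # pages, and the leaf pages all match -- everything fetched gets indexed.
    +^http://([a-z0-9]*\.)*mysite.com
    #   matches http://www.mysite.com/
    #   matches http://www.mysite.com/level1pattern/
    #   matches http://www.mysite.com/level1pattern/level2pattern/page1.html

    # Narrow rule: only leaf pages match, but the crawler then never fetches
    # the intermediate pages that link to them, so no leaves are discovered.
    +^http://www\.mysite\.com/level1pattern/level2pattern/[^/]+\.html$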
Re: URL filtering: crawling time vs. indexing time
You still have several possibilities here:

1) find a way to seed the crawl with the URLs containing the links to the
leaf pages (sometimes it is possible with a simple loop);

2) create a regex for each step of the scenario leading to the leaf page, in
order to limit the crawl to the necessary pages only. Use the $ sign at the
end of your regexp to limit the match of a regexp like
http://([a-z0-9]*\.)*mysite.com.

On 2 Nov 2012, at 17:22, Joe Zhang smartag...@gmail.com wrote:

> The problem is that, if you write a regex such as
> +^http://([a-z0-9]*\.)*mysite.com, you'll end up indexing all the pages on
> the way, not just the leaf pages.
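A sketch of Rémy's second option as a crawl-time regex-urlfilter.txt: one $-anchored rule per step of the path, so the crawler can walk down to the leaves without accepting the rest of the site. The level1pattern/level2pattern names are the thread's placeholders, and the optional trailing slash is an assumption about how the site writes its section URLs:

    # step pages, anchored with $ so nothing deeper matches
    +^http://www\.mysite\.com/?$
    +^http://www\.mysite\.com/level1pattern/?$
    +^http://www\.mysite\.com/level1pattern/level2pattern/?$

    # the leaf pages we actually want
    +^http://www\.mysite\.com/level1pattern/level2pattern/[^/]+\.html$

    # reject everything else
    -.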
RE: URL filtering: crawling time vs. indexing time
Ah, I understand now. The indexer tool can filter as well in 1.5.1, and if
you enable the regex filter and set a different regex configuration file for
indexing vs. crawling, you should be good to go. You can override the default
configuration file by setting urlfilter.regex.file and pointing it to the
regex file you want to use for indexing. You can set it via:

    nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...

Cheers

-Original message-
From: Joe Zhang smartag...@gmail.com
Sent: Fri 02-Nov-2012 17:55
To: user@nutch.apache.org
Subject: Re: URL filtering: crawling time vs. indexing time

> I'm not sure I get it. Again, my problem is a very generic one:
>
> - The patterns in regex-urlfilter.txt, however exotic they are, control
>   ***which URLs to visit***.
> - Generally speaking, the set of URLs to be indexed into Solr is only a
>   ***subset*** of the above.
>
> We need a way to specify a crawling filter (which is regex-urlfilter.txt)
> vs. an indexing filter, I think.
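Put together, that amounts to keeping two regex files: the permissive crawl-time conf/regex-urlfilter.txt, and a tighter file used only at index time via the property override. A sketch of what the index-time file might contain; the regex-urlfilter-index.txt name is an assumption (any path passed to urlfilter.regex.file works), and the leaf-page pattern generalizes the thread's pagepattern.html example:

    # conf/regex-urlfilter-index.txt -- applied only when indexing:
    # keep the leaf pages...
    +^http://www\.mysite\.com/level1pattern/level2pattern/[^/]+\.html$
    # ...and reject everything else
    -.

With urlfilter.regex.file left at its default during the crawl, fetching still reaches the intermediate pages, while the index-time run admits only the leaf URLs.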