Re: URL filtering: crawling time vs. indexing time

2012-11-17 Thread Joe Zhang
Markus, could you advise? This seems the most promising approach, and I'm
quite confident that my URL pattern file is correct.

On Sun, Nov 4, 2012 at 6:39 PM, Joe Zhang smartag...@gmail.com wrote:

 Markus, I tried it. The command line works great. But it doesn't seem to
 achieve the filtering effect even if I provide really tight patterns in the
 regex file. Any idea why?
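 A quick sanity check, assuming your build ships the URLFilterChecker utility
 and that urlfilter.regex.file points at the tight file in conf/nutch-site.xml,
 is to pipe a few URLs through the configured filters; each URL comes back
 prefixed with + (accepted) or - (rejected):

     # hypothetical URL; the checker reads URLs on stdin
     echo "http://www.mysite.com/some/hub/page/" | \
       bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined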


RE: URL filtering: crawling time vs. indexing time

2012-11-04 Thread Markus Jelsma
Just try it. With -D you can override Nutch and Hadoop configuration properties.



 
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Sun 04-Nov-2012 06:07
 To: user user@nutch.apache.org
 Subject: Re: URL filtering: crawling time vs. indexing time
 
 Markus, I don't see -D as a valid command parameter for solrindex.
 

Re: URL filtering: crawling time vs. indexing time

2012-11-04 Thread Lewis John Mcgibbney
http://hadoop.apache.org/docs/r1.0.3/commands_manual.html#Generic+Options

hth
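
In short: solrindex runs through Hadoop's ToolRunner, so generic -D overrides go
straight after the command name, before the tool's own arguments. A sketch (path
hypothetical), keeping the trailing arguments as in your usual invocation:

    bin/nutch solrindex -Durlfilter.regex.file=/path/to/index-filters.txt http://solrurl/ ...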

-- 
Lewis


RE: URL filtering: crawling time vs. indexing time

2012-11-02 Thread Markus Jelsma
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Fri 02-Nov-2012 10:04
 To: user@nutch.apache.org
 Subject: URL filtering: crawling time vs. indexing time
 
 I feel like this is a trivial question, but I just can't get my head
 around it.
 
 I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at the
 rudimentary level.
 
 If my understanding is correct, the regexes in
 nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which
 URLs to visit or not in the crawling process.

Yes.

 
 On the other hand, it seems natural to want only certain pages to be
 indexed. I was hoping to write some regular expressions as well
 in some config file, but I just can't find the right place. My hunch tells
 me that such things should not require any custom coding. Can anybody
 help?

What exactly do you want? Add your custom regular expressions? The 
regex-urlfilter.txt is the place to write them to.

 
 Again, the scenario is really rather generic. Let's say we want to crawl
 http://www.mysite.com. We can use the regex-urlfilter.txt to skip loops and
 unnecessary file types etc., but only expect to index pages with URLs like:
 http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.

To do this you must simply make sure your regular expressions can do this.

 
 Am I too naive to expect zero Java coding in this case?

No, you can achieve almost all kinds of exotic filtering with just the URL 
filters and the regular expressions.
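
For instance, the stock conf/regex-urlfilter.txt already handles the
skip-loops-and-file-types half, roughly like this (abridged; rules are tried
top to bottom and the first match decides):

    # skip file:, ftp: and mailto: URLs
    -^(file|ftp|mailto):
    # skip image and other suffixes we can't parse (abridged list)
    -\.(gif|jpg|png|ico|css|zip|gz|exe|mov)$
    # skip URLs containing characters that are probably queries, session ids etc.
    -[?*!@=]
    # skip a slash-delimited segment repeating 3+ times, to break loops
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/
    # accept anything else
    +.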

Cheers
 


Re: URL filtering: crawling time vs. indexing time

2012-11-02 Thread Joe Zhang
The problem is that:

- if you write a broad regex such as +^http://([a-z0-9]*\.)*mysite.com, you'll end
up indexing all the pages along the way, not just the leaf pages.
- if you write a specific regex for
http://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and you
start crawling at mysite.com, you'll get zero results, as there is no match.
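
Concretely, with hypothetical patterns, the two dead ends look like this:

    # option 1: broad. Crawls the whole site, but indexes every hub page too.
    +^http://([a-z0-9]*\.)*mysite.com
    # option 2: tight. Would index only the leaves, but the crawler never
    # reaches them, because the hub pages in between are filtered out as well.
    +^http://www\.mysite\.com/level1pattern/level2pattern/[^/]+\.html$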


Re: URL filtering: crawling time vs. indexing time

2012-11-02 Thread Rémy Amouroux
You still have several possibilities here:
1) find a way to seed the crawl with the URLs containing the links to the leaf
pages (sometimes this is possible with a simple loop)
2) create a regex for each step of the path to the leaf pages, in order to limit
the crawl to the necessary pages only. Use the $ sign at the end of your regexps
to limit how far a pattern like http://([a-z0-9]*\.)*mysite.com can match.
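
For instance (patterns hypothetical), one anchored rule per step keeps the crawl
on the path to the leaves instead of matching every URL on the site by prefix:

    # hub pages, one $-anchored rule per step
    +^http://www\.mysite\.com/$
    +^http://www\.mysite\.com/level1pattern/$
    +^http://www\.mysite\.com/level1pattern/level2pattern/$
    # the leaf pages themselves
    +^http://www\.mysite\.com/level1pattern/level2pattern/[^/]+\.html$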



RE: URL filtering: crawling time vs. indexing time

2012-11-02 Thread Markus Jelsma
Ah, I understand now.

The indexer tool can filter as well in 1.5.1, and if you enable the regex filter
and set a different regex configuration file for indexing vs. crawling, you
should be good to go.

You can override the default configuration file by setting urlfilter.regex.file
and pointing it to the regex file you want to use for indexing. You can set it via
nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
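
Put together (file contents and paths illustrative, and assuming urlfilter-regex
is listed in plugin.includes), the broad rules stay in conf/regex-urlfilter.txt
for the crawl while a tighter file goes to the indexing job only:

    # /path/to/regex-urlfilter-index.txt, read only at indexing time
    +^http://www\.mysite\.com/level1pattern/level2pattern/[^/]+\.html$
    # a URL matching no rule is rejected anyway; the deny-all is belt and braces
    -.

    bin/nutch solrindex -Durlfilter.regex.file=/path/to/regex-urlfilter-index.txt http://solrurl/ ...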

Cheers
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Fri 02-Nov-2012 17:55
 To: user@nutch.apache.org
 Subject: Re: URL filtering: crawling time vs. indexing time
 
 I'm not sure I get it. Again, my problem is a very generic one:
 
 - The patterns in regex-urlfilter.txt, however exotic they are,
 control ***which URLs to visit***.
 - Generally speaking, the set of URLs to be indexed into Solr is only a
 ***subset*** of the above.
 
 We need a way to specify a crawling filter (which is regex-urlfilter.txt) vs. an
 indexing filter, I think.
 