[jira] Updated: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-25 Thread Andrzej Bialecki (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-339?page=all ]

Andrzej Bialecki  updated NUTCH-339:


Attachment: patch4-fixed.txt

Sorry, the patch was incomplete - please try patch4-fixed.txt instead.

 Refactor nutch to allow fetcher improvements
 

 Key: NUTCH-339
 URL: http://issues.apache.org/jira/browse/NUTCH-339
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 0.8
 Environment: n/a
Reporter: Sami Siren
 Assigned To: Andrzej Bialecki 
 Fix For: 0.9.0

 Attachments: patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, 
 patch4-trunk.txt


 As I (and Stefan?) see it, there are two major areas where the current 
 fetcher could be improved (as in speed):
 1. The politeness code and how it is implemented is the biggest problem of 
 the current fetcher (together with robots.txt handling). Simple code 
 changes, like replacing it with a PriorityQueue-based solution, showed very 
 promising results in increased IO.
 2. Changing the fetcher to use non-blocking IO (this requires a great 
 amount of work, as we would need to implement the protocols from scratch 
 again).
 I would like to start working towards #1 by first refactoring the current 
 code (the plugins, actually) in the following way:
 1. Move robots.txt handling away from the (lib-http) plugin.
 Even though this is related only to HTTP, leaving it in lib-http does not 
 allow other kinds of scheduling strategies to be implemented (it is 
 hardcoded to fetch robots.txt from the same thread when requesting a page 
 from a site from which it hasn't yet tried to load robots.txt).
 2. Move the politeness code away from the (lib-http) plugin.
 It is really usable outside HTTP, and the current design also limits 
 changing the implementation (to a queue-based one).
 Where to move these? My suggestion is the Nutch core; does anybody see 
 problems with this?
 These refactoring activities are to be done in a way that none of the 
 current functionality is (at least deliberately) changed, leaving current 
 functionality as is and thus leaving room and possibility to build the next 
 generation fetcher(s) without destroying the old one at the same time.
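A rough sketch of the PriorityQueue-based politeness idea from #1 (class and method names here are illustrative, not Nutch code): each host carries the earliest time it may be fetched again, and the queue orders hosts by that time, so a fetcher thread can always pick the host that becomes fetchable soonest instead of blocking on one that is still inside its crawl-delay window.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative sketch of priority-queue-based politeness scheduling.
public class PoliteQueue {
    // One entry per host, ordered by the time its next fetch is allowed.
    static class HostSlot implements Comparable<HostSlot> {
        final String host;
        long nextFetchTime;  // epoch millis when the next fetch is allowed
        HostSlot(String host, long t) { this.host = host; this.nextFetchTime = t; }
        public int compareTo(HostSlot o) {
            return Long.compare(nextFetchTime, o.nextFetchTime);
        }
    }

    private final PriorityQueue<HostSlot> queue = new PriorityQueue<>();
    private final Map<String, HostSlot> byHost = new HashMap<>();
    private final long crawlDelayMs;

    public PoliteQueue(long crawlDelayMs) { this.crawlDelayMs = crawlDelayMs; }

    /** Register a host; its first fetch is allowed immediately. */
    public void addHost(String host, long now) {
        HostSlot s = new HostSlot(host, now);
        byHost.put(host, s);
        queue.add(s);
    }

    /** Return a host whose politeness window has elapsed, or null if none. */
    public String pollFetchable(long now) {
        HostSlot s = queue.peek();
        if (s == null || s.nextFetchTime > now) return null;
        queue.poll();
        s.nextFetchTime = now + crawlDelayMs;  // push the host back by the delay
        queue.add(s);
        return s.host;
    }
}
```

The point of the structure is that the cheapest host to fetch next is always at the head of the queue, which is exactly what the hardcoded per-thread robots.txt/politeness handling in lib-http prevents.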

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




RE: [jira] Created: (NUTCH-408) Plugin development documentation

2006-11-25 Thread Armel T. Nene
I agree with you that documentation is vital, not just for extending the
current version but also for any plugins and patches created. I have been
spending almost two weeks trying to adapt Nutch to my project, but I spend
more time reading code and trying to understand what it does before I can
even start to fix problems. Come on guys, documentation is good coding
practice; we can't read your minds to know exactly what you were trying to
achieve just by looking at the implementation code.

This is just some good constructive criticism.

:) Armel

-Original Message-
From: nutch.newbie (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: 25 November 2006 03:45
To: nutch-dev@lucene.apache.org
Subject: [jira] Created: (NUTCH-408) Plugin development documentation

Plugin development documentation


 Key: NUTCH-408
 URL: http://issues.apache.org/jira/browse/NUTCH-408
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8.1
 Environment: Linux Fedora
Reporter: nutch.newbie


Documentation is rare! But very vital for extending current (0.9) Nutch.
The current docs on the wiki for 0.7 plugin development were good, but they
don't apply to 0.9, and new developers who join directly at 0.9 find the 0.7
documentation insufficient. More practical plugin-writing documentation for
0.9 is desired, also explaining the plugin principles in practical terms,
i.e. extension points, libs, etc. Furthermore, it would be good to provide
some best-practice examples, e.g.:

check whether the lib you are planning to use is already in the lib folder;
maybe that version of the external lib is good enough for the plugin
development rather than using another version -- things like that.







Re: [jira] Created: (NUTCH-408) Plugin development documentation

2006-11-25 Thread Stefan Groschupf
Did you ever browse this: http://wiki.media-style.com/display/nutchDocu/Home

Nothing big, but it will give you some ideas, also about plugins.

On 25.11.2006, at 06:32, Armel T. Nene wrote:

I agree with you that documentation is vital, not just for extending the
current version but also for any plugins and patches created. I have been
spending almost two weeks trying to adapt Nutch to my project, but I spend
more time reading code and trying to understand what it does before I can
even start to fix problems. Come on guys, documentation is good coding
practice; we can't read your minds to know exactly what you were trying to
achieve just by looking at the implementation code.

This is just some good constructive criticism.

:) Armel









~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com





[jira] Commented: (NUTCH-408) Plugin development documentation

2006-11-25 Thread nutch.newbie (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-408?page=comments#action_12452610 ] 

nutch.newbie commented on NUTCH-408:


Yes, I have gone through the media-style documentation and it is a good 
start, and there is also some very good documentation in the Nutch wiki. My 
thinking was to compile the existing documentation in a coherent way so that 
you get the whole picture. I would like to give the writing a shot, but I 
cannot do this all by myself, as I lack background info, plus I haven't 
written any plugin myself. So if any of you would like to help me, I would 
like to do this job. Feel free to mail me directly if you are up for it. I 
will probably ask a lot of stupid questions, but we will have some 
development documentation at least :=) What do you say? Anyone up for this?

Regards 





[jira] Updated: (NUTCH-409) Add short circuit notion to filters to speedup mixed site/subsite crawling

2006-11-25 Thread Doug Cook (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-409?page=all ]

Doug Cook updated NUTCH-409:


Attachment: shortcircuit.patch

 Add short circuit notion to filters to speedup mixed site/subsite crawling
 

 Key: NUTCH-409
 URL: http://issues.apache.org/jira/browse/NUTCH-409
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
Reporter: Doug Cook
Priority: Minor
 Attachments: shortcircuit.patch


 In the case where one is crawling a mixture of sites and sub-sites, the 
 prefix matcher can match the sites quite quickly, but either the regex or 
 automaton filters are considerably slower matching the sub-sites. In the 
 current model of AND-ing all the filters together, the pattern-matching 
 filter will be run on every site that matches the prefix matcher -- even if 
 that entire site is to be crawled and there are no sub-site patterns. If only 
 a small portion of the sites actually need sub-site pattern matching, this is 
 much slower than it should be.
 I propose (and attach) a simple modification allowing considerable speedup 
 for this usage pattern. I define the notion of a "short circuit" match that 
 means "accept this URL and don't run any of the remaining filters in the 
 filter chain." 
 Though with this change any filter plugin can in theory return a 
 short-circuit match, I have only implemented the short-circuit match for 
 the PrefixURLFilter. The configuration file format is backwards-compatible; 
 short-circuit matches just have SHORTCIRCUIT: in front of them.
 One minor gotcha:
 * Because the shortcircuit matches will avoid running any later filters, all 
 of the site-independent filters need to be BEFORE the PrefixURLFilter in the 
 chain.
 I get my best performance using the following filter chain:
 1) The SuffixURLFilter to throw away anything with unwanted extensions
 2) The RegexURLFilter to do site-independent cleanup (ad removal, skipping 
 mailto:, bulletin-board pages, etc.)
 3) The PrefixURLFilter, with SHORTCIRCUIT: in front of every site name EXCEPT 
 the sites needing subsite matching
 4) The AutomatonURLFilter to match those sites needing subsite pattern 
 matching.
 I have tens of thousands of sites and an order of magnitude fewer subsites, 
 so skipping step #4 90% of the time speeds things up considerably (my reduce 
 time on a round of crawling is down from some 26 hours to less than 10).
 There are only two drawbacks to the implementation, and I think they're 
 pretty minor:
 1) Because I pass a special token (_PASS_) in the place of the URL to 
 implement the short circuit, if for some reason someone wanted to crawl a URL 
 named _PASS_, there would be problems. I find this highly unlikely, since 
 that's an invalid URL.
 2) The correct behavior of steps #3 and #4 above depends upon coordination of 
 the config files between the prefix and automaton filters, making an 
 opportunity for user screwup. I thought about creating a new kind of filter 
 which essentially combined prefix & automaton's behaviors, took one config 
 file, and internally handled the short-circuiting. But I think the approach I 
 took is simpler, cleaner, more flexible, and avoids creating yet another kind 
 of filter. Coordinating the config files is pretty easy (I generate them 
 programmatically).
 As this is my first contribution to Nutch I'm sure that there are things I've 
 missed, whether in coding style or desired patch format. I welcome any 
 feedback, suggestions, etc.
 Doug
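The short-circuit semantics described above can be sketched roughly as follows (the UrlFilter interface and chain class here are illustrative, not the actual Nutch URLFilter API; the _PASS_ token is the one mentioned in the patch description):

```java
import java.util.List;

// Illustrative sketch: filters are ANDed as usual, but a filter may return
// a special token meaning "accept this URL and skip the rest of the chain".
public class ShortCircuitChain {
    public static final String PASS = "_PASS_";  // special short-circuit token

    public interface UrlFilter {
        // Returns the (possibly rewritten) URL to keep, PASS to accept
        // immediately, or null to reject.
        String filter(String url);
    }

    private final List<UrlFilter> filters;
    public ShortCircuitChain(List<UrlFilter> filters) { this.filters = filters; }

    /** Returns the accepted URL, or null if some filter rejected it. */
    public String apply(String url) {
        String current = url;
        for (UrlFilter f : filters) {
            String out = f.filter(current);
            if (out == null) return null;          // rejected: stop the chain
            if (PASS.equals(out)) return current;  // short-circuit accept
            current = out;                         // normal match: continue
        }
        return current;
    }
}
```

This is why the "gotcha" above matters: any filter placed after the short-circuiting one never sees the short-circuited URLs, so site-independent filters must come earlier in the chain.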





Re: More fetcher speed increases

2006-11-25 Thread Doug Cook


Done. See http://issues.apache.org/jira/browse/NUTCH-409

This is my first Nutch contribution, so hopefully I've got it right ;-) Any
suggestions/questions/feedback welcome.

Hope this is useful to others.

D


scott green wrote:
 
 Hi Doug,
 
 Your idea about the PrefixURLFilter and AutomatonURLFilter combination
 sounds interesting. Could you please attach the patch to JIRA? Thanks
 
 - Scott
 
 On 11/17/06, Doug Cook [EMAIL PROTECTED] wrote:

 Hi, folks,

 I, too, was slowed down by reduce operations in fetch. Some benchmarking
 showed that in my case, the limiting operation was filtering (though a
 distant second was the time spent calculating Levenshtein distances,
 presumably part of the spellchecking that Sami just removed to speed
 things up, though I haven't looked at it yet).

 I've fixed the problem, and my reduce speed is better by about a factor
 of three. However, the fix is limited to certain usage patterns.

 In my case, I have tens of thousands of sites and subsites I'm crawling,
 and I'm using a combination of PrefixURLFilter + AutomatonURLFilter. I
 essentially use the prefix filter to limit to the set of sites, and then
 automaton to pattern-match within those sites. I only have subsite
 matches on < 10% of the sites, however, so I was clearly wasting a lot
 of time running the automaton patterns that didn't need it. And automaton,
 though much faster than RegexURLFilter, is still dog-slow with that many
 patterns.

 A simple fix was to extend the current "AND all the filters together"
 model to have the notion of a short-circuit match, which allows a filter
 to say "let this URL through and DON'T run the other filters" by returning
 a special token to URLFilters. Now I have a version of PrefixURLFilter
 that can return both normal matches and short-circuit matches, and only
 returns normal matches for those sites that need to run subsite patterns.
 It seems to work well, the overhead is negligible when not in use, and
 the speedup is massive for my usage pattern.

 I'd like to contribute it back, if people would find this useful (not
 that it's rocket science!).

 First, is there anyone out there besides me who would find this useful?

 Second, I've been thinking about the best way to handle PrefixURLFilter
 configuration. I can see a few options:

 1. Have two different config files, one for normal matches and one for
 short-circuit matches.
 2. Have one config file, with a syntax to say "make this pattern a
 short-circuit match", and make the default be a normal match, so it is
 backwards compatible with the current version.
 3. Make a new type of filter which internally combines Prefix and
 Automaton, takes one config file, and decides internally which patterns
 should generate automaton inputs vs. normal or short-circuit prefix
 matches.

 Approach #3 requires no changes to the URLFilter model, and makes it
 difficult to screw up by making config files which are inconsistent (e.g.
 forgetting to put in a prefix pattern for one of the automaton patterns).
 It is also the least flexible, requires the most code, and introduces yet
 another kind of filter.

 I tend to like the changed URLFilter model; it's more flexible, even if
 it requires a little more care in configuration (a simple Perl script, in
 my case, to generate the config files correctly and consistently). I'm
 leaning towards approach #2. I'm thinking something simple, syntax-wise,
 like putting SHORTCIRCUIT: before the patterns which should short-circuit.
 Any suggestions for a better syntax? Or reasons why I should consider a
 different approach?

 Doug

 --
 View this message in context:
 http://www.nabble.com/More-fetcher-speed-increases-tf2644170.html#a7381430
 Sent from the Nutch - Dev mailing list archive at Nabble.com.

 
 

-- 
View this message in context: 
http://www.nabble.com/More-fetcher-speed-increases-tf2644170.html#a7543634
Sent from the Nutch - Dev mailing list archive at Nabble.com.
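The option #2 syntax Doug describes might look like this in a prefix filter config file (the file name and site URLs below are made up for illustration; SHORTCIRCUIT: is his proposed marker):

```
# prefix-urlfilter.txt (illustrative)
# Plain prefixes behave as before: a normal match, later filters still run.
http://needs-subsite-patterns.example/

# Prefixes tagged SHORTCIRCUIT: accept the URL and skip the rest of the chain.
SHORTCIRCUIT:http://whole-site-a.example/
SHORTCIRCUIT:http://whole-site-b.example/
```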



[jira] Commented: (NUTCH-409) Add short circuit notion to filters to speedup mixed site/subsite crawling

2006-11-25 Thread Doug Cook (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-409?page=comments#action_12452617 ] 

Doug Cook commented on NUTCH-409:
-

I should also note that this approach is still not optimal (though it is faster 
for my usage pattern). I'm still running the site-independent regular 
expressions (ad removal, etc) on *every* URL; really, they should just be run 
on the URLs which belong to the set of sites I'm crawling. 

One could think of a slight extension to the change here, where each filter 
has a parameter: (A) "run me on all URLs which have passed the prior 
filters" or (B) "run me only on the non-short-circuit matches". This would 
allow us to put the RegexURLFilter *after* the PrefixURLFilter, and make it 
a type A (site-independent) filter, while the Automaton would be type B 
(site-dependent). Simple code-wise, but a little more complexity in 
configuration.

Or one could return to the notion of a "super filter" which takes one config 
file, internally combines these effects, and automatically optimizes the 
filtering. A little more ambitious code-wise, but ultimately easier to use.
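The type A / type B extension could be sketched like this (interface, method names, and the of() helper are illustrative, not Nutch's URLFilter API): after a short-circuit match, type A (site-independent) filters keep running and can still reject the URL, while type B (site-dependent) filters are skipped.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Illustrative sketch of a filter chain with site-independent (type A)
// and site-dependent (type B) filters.
public class TypedChain {
    public static final String PASS = "_PASS_";  // short-circuit token

    public interface TypedFilter {
        String filter(String url);  // rewritten URL, PASS, or null (reject)
        boolean isTypeA();          // type A = site-independent, always runs
    }

    /** Convenience factory: wrap a URL function with a type flag. */
    public static TypedFilter of(boolean typeA, UnaryOperator<String> fn) {
        return new TypedFilter() {
            public String filter(String url) { return fn.apply(url); }
            public boolean isTypeA() { return typeA; }
        };
    }

    private final List<TypedFilter> filters;
    public TypedChain(List<TypedFilter> filters) { this.filters = filters; }

    /** Accepted URL, or null if rejected. After a short-circuit match,
     *  only type A filters keep running (and may still reject). */
    public String apply(String url) {
        boolean shortCircuited = false;
        String current = url;
        for (TypedFilter f : filters) {
            if (shortCircuited && !f.isTypeA()) continue;  // skip type B
            String out = f.filter(current);
            if (out == null) return null;                  // rejected
            if (PASS.equals(out)) { shortCircuited = true; continue; }
            current = out;
        }
        return current;
    }
}
```

With this arrangement, a site-independent RegexURLFilter placed after the PrefixURLFilter would still clean up ad URLs on short-circuited sites, which the plain short-circuit model cannot do.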

At any rate, the attached change is pretty simple, and at least helpful for me, 
if not perfect; thought I would share it.

D
