Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Stefan Groschupf

Doug,

Instead I would suggest going a step further by adding a
(configurable) timeout mechanism and skipping bad records in reducing
in general.
Processing such big data and losing everything because of just one
bad record is very sad.


That's a good suggestion.  Ideally we could use Thread.interrupt(),  
but that won't stop a thread in a tight loop.  The only other  
option is thread.stop(), which isn't generally safe.  The safest  
thing to do is to restart the task in such a way that the bad entry  
is skipped.


Sounds like a lot of overhead, but I agree there is no other option.




As far as I know, Google's MapReduce skips bad records as well.


Yes, the paper says that, when a job fails, they can restart it,  
skipping the bad entry.  I don't think they skip without restarting  
the task.


In Hadoop I think this could correspond to removing the task that  
failed and replacing it with two tasks: one whose input split  
includes entries before the bad entry, and one whose input split  
includes those after.


It would be very nice if there were some way to reuse the
already processed records and just add a new task that processes the
records from the bad record + 1 to the end of the split.


But determining which entry failed is hard.  Unless we report every  
single entry processed to the TaskTracker (which would be too  
expensive for many map functions) it is hard to know exactly  
where things were when the process died.


Something that pops into my mind would be splitting the task until we
find the one record that fails. Of course this is expensive since we
may have to process many small tasks.
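
For illustration, the bisection idea could look roughly like this; the Split class and TaskRunner interface are made up for the sketch and are not Hadoop API:

/** Illustrative only: bisect a failed record range until a single bad record is isolated. */
public class BadRecordBisector {

  /** Stand-in for an input split: a half-open record range [start, end). */
  static class Split {
    final long start, end;
    Split(long start, long end) { this.start = start; this.end = end; }
    long size() { return end - start; }
  }

  /** Stand-in for running a (sub)task; returns false if the task fails or hangs. */
  interface TaskRunner {
    boolean run(Split split);
  }

  /** Returns the index of the failing record, or -1 if the whole split succeeds. */
  static long findBadRecord(Split split, TaskRunner runner) {
    if (runner.run(split)) {
      return -1;                 // this half processed fine
    }
    if (split.size() == 1) {
      return split.start;        // isolated the bad record
    }
    long mid = split.start + split.size() / 2;
    long bad = findBadRecord(new Split(split.start, mid), runner);
    return bad != -1 ? bad : findBadRecord(new Split(mid, split.end), runner);
  }
}

As the sketch shows, good records get reprocessed on every level of the bisection, which is where the expense comes from.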





We could instead include the number of entries processed in each  
status message, and the maximum count of entries before another  
status will be sent.

This sounds interesting. We would require some more metadata in the
reporter, but that is scheduled for Hadoop 0.2. In this change I
would also love to see the ability to put custom metadata into the report
(MapWritable?).
In combination with a public API that allows access to these task
reports, we could have a kind of lock manager as described in the Bigtable
talk.



  This way the task child can try to send, e.g., about one report  
per second to its parent TaskTracker, and adaptively determine how  
many entries between reports.  So, for the first report it can  
guess that it will process only 1 entry before the next report.   
Then it processes the first entry and can now estimate how many  
entries it can process in the next second, and reports this as the  
maximum number of entries before the next report.  Then it  
processes entries until either the reported max or one second is  
exceeded, and then makes its next status report. And so on.  If the  
child hangs, then one can identify the range of entries that it was  
in down to one second.  If each entry takes longer than one second  
to process then we'd know the exact entry.


Unfortunately, this would not work with the Nutch Fetcher, which  
processes entries in separate threads, not strictly ordered...


Well, it would work for all map and reduce tasks. MapRunnable
implementations can take care of bad records themselves, since there
we have full access to the record reader.
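
For illustration, the run() loop of such a MapRunnable could skip records whose processing throws; the interfaces below are simplified stand-ins, not the exact Hadoop/Nutch signatures:

/** Illustrative only: a map runner that skips bad records instead of failing the task. */
public class SkippingMapRunner<K, V> {

  /** Simplified stand-ins for the record reader and the map function. */
  interface RecordReader<K, V> { boolean next(Holder<K> key, Holder<V> value) throws java.io.IOException; }
  interface Mapper<K, V> { void map(K key, V value) throws Exception; }
  static class Holder<T> { T value; }

  public void run(RecordReader<K, V> reader, Mapper<K, V> mapper) throws java.io.IOException {
    Holder<K> key = new Holder<K>();
    Holder<V> value = new Holder<V>();
    long skipped = 0;
    while (reader.next(key, value)) {
      try {
        mapper.map(key.value, value.value);
      } catch (Exception e) {
        // Bad record: log it and move on instead of killing the whole task.
        skipped++;
        System.err.println("Skipping bad record " + key.value + ": " + e);
      }
    }
    System.err.println("Done, skipped " + skipped + " record(s).");
  }
}

Of course this only covers records that throw; a record that makes the map function hang still needs a timeout on top of it.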



Stefan 


Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Doug Cutting

Stefan Groschupf wrote:
Instead I would suggest going a step further by adding a (configurable)
timeout mechanism and skipping bad records in reducing in general.
Processing such big data and losing everything because of just one bad
record is very sad.


That's a good suggestion.  Ideally we could use Thread.interrupt(), but 
that won't stop a thread in a tight loop.  The only other option is 
thread.stop(), which isn't generally safe.  The safest thing to do is to 
restart the task in such a way that the bad entry is skipped.



As far as I know, Google's MapReduce skips bad records as well.


Yes, the paper says that, when a job fails, they can restart it, 
skipping the bad entry.  I don't think they skip without restarting the 
task.


In Hadoop I think this could correspond to removing the task that failed 
and replacing it with two tasks: one whose input split includes entries 
before the bad entry, and one whose input split includes those after. 
Or we could keep a list of bad entry indexes and send these along with 
the task.  I prefer splitting the task.
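
A rough sketch of that replacement step, assuming a split can be described by a record range; the Split class and names are invented for the example and are not the Hadoop API:

import java.util.ArrayList;
import java.util.List;

/** Illustrative only: replace a failed task's split with two splits around the bad entry. */
public class SplitAroundBadEntry {

  /** Stand-in for an input split covering records [start, end). */
  static class Split {
    final long start, end;
    Split(long start, long end) { this.start = start; this.end = end; }
    public String toString() { return "[" + start + "," + end + ")"; }
  }

  /** Returns the replacement splits, leaving out the single bad record. */
  static List<Split> splitAround(Split failed, long badRecord) {
    List<Split> replacements = new ArrayList<Split>();
    if (badRecord > failed.start) {
      replacements.add(new Split(failed.start, badRecord));     // entries before the bad one
    }
    if (badRecord + 1 < failed.end) {
      replacements.add(new Split(badRecord + 1, failed.end));   // entries after the bad one
    }
    return replacements;
  }

  public static void main(String[] args) {
    // A 1000-record split with a bad record at index 417 becomes [0,417) and [418,1000).
    System.out.println(splitAround(new Split(0, 1000), 417));
  }
}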


But determining which entry failed is hard.  Unless we report every 
single entry processed to the TaskTracker (which would be too expensive 
for many map functions) it is hard to know exactly where things were 
when the process died.


We could instead include the number of entries processed in each status 
message, and the maximum count of entries before another status will be 
sent.  This way the task child can try to send, e.g., about one report 
per second to its parent TaskTracker, and adaptively determine how many 
entries between reports.  So, for the first report it can guess that it 
will process only 1 entry before the next report.  Then it processes the 
first entry and can now estimate how many entries it can process in the 
next second, and reports this as the maximum number of entries before 
the next report.  Then it processes entries until either the reported 
max or one second is exceeded, and then makes its next status report. 
And so on.  If the child hangs, then one can identify the range of 
entries that it was in down to one second.  If each entry takes longer 
than one second to process then we'd know the exact entry.
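
A minimal sketch of that adaptive loop; Task and Tracker are placeholders for whatever the task child would actually call, not real Hadoop interfaces:

/** Illustrative only: adapt the number of entries between roughly one-second status reports. */
public class AdaptiveReporter {

  interface Task { void processEntry(long index); }                        // placeholder for the map call
  interface Tracker { void report(long processed, long maxBeforeNext); }   // placeholder for the status RPC

  static void run(Task task, Tracker tracker, long totalEntries) {
    final long intervalMs = 1000;   // aim for about one report per second
    long maxBeforeNext = 1;         // first guess: report again after a single entry
    long processed = 0;
    tracker.report(processed, maxBeforeNext);
    while (processed < totalEntries) {
      long batchStart = processed;
      long start = System.currentTimeMillis();
      // Process until either the promised max or roughly one second is exceeded.
      while (processed < totalEntries
          && processed - batchStart < maxBeforeNext
          && System.currentTimeMillis() - start < intervalMs) {
        task.processEntry(processed++);
      }
      long elapsed = Math.max(1, System.currentTimeMillis() - start);
      // Estimate how many entries fit into the next second and promise that in the report.
      maxBeforeNext = Math.max(1, (processed - batchStart) * intervalMs / elapsed);
      tracker.report(processed, maxBeforeNext);
    }
  }
}

If the child then hangs, the parent knows the bad entry lies between the last reported count and that count plus the promised maximum.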


Unfortunately, this would not work with the Nutch Fetcher, which 
processes entries in separate threads, not strictly ordered...


Doug





Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Stefan Groschupf
Besides that, we should maybe add a kind of timeout to the URL filter
in general.


I think this is overkill.  There is already a Hadoop task timeout.   
Is that not sufficient?


No! What happens is that the URL filter hangs and then the complete
task is timed out instead of just skipping this URL.
After 4 retries the complete job is killed and all fetched data are
lost, in my case 5 million URLs each time. :-(

This was the real cause of the problem described on hadoop-dev.

Instead I would suggest going a step further by adding a (configurable)
timeout mechanism and skipping bad records in reducing in general.
Processing such big data and losing everything because of just one bad
record is very sad.

As far as I know, Google's MapReduce skips bad records as well.

Stefan




Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Doug Cutting

Stefan Groschupf wrote:
Besides that, we should maybe add a kind of timeout to the URL filter in
general.


I think this is overkill.  There is already a Hadoop task timeout.  Is 
that not sufficient?


Doug


Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Andrzej Bialecki

Jérôme Charron wrote:

3. Add new plugins that use dk.brics.automaton.RegExp, using different
default regex file names.  Then folks can, if they choose, configure
things to use these faster regex libraries, but only if they're willing
to write the simpler regexes that it supports.  If, over time, we find
that the most useful regexes are easily converted, then we could switch
the default to this.



+1
I will do it this way.
Thanks Doug.

  


Yes, I prefer it this way too, then it's clear that it's different and 
should be treated differently.


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Jérôme Charron
> If it were easy to implement all java regex features in
> dk.brics.automaton.RegExp, then they probably would have.  Alternately,
> if they'd implemented all java regex features, it probably wouldn't be
> so fast.  So I worry that attempts to translate are doomed.  Better to
> accept the differences: if you want the speed, you must use restricted
> regexes.

That's right. It is a deterministic-automaton approach => more speed, but less
functionality.


> 3. Add new plugins that use dk.brics.automaton.RegExp, using different
> default regex file names.  Then folks can, if they choose, configure
> things to use these faster regex libraries, but only if they're willing
> to write the simpler regexes that it supports.  If, over time, we find
> that the most useful regexes are easily converted, then we could switch
> the default to this.

+1
I will do it this way.
Thanks Doug.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Jérôme Charron
> Besides that, we should maybe add a kind of timeout to the URL filter in
> general.
> Since it can happen that a user configures a regex for his Nutch setup
> that runs into the same problem we just ran into.
> Something like the patch attached below.
> Would you agree? I can create a proper patch and test it if we are
> interested in adding this as a fallback to the sources.

+1 as a short-term solution.
In the long term, I think we should try to reproduce it and analyze what
really happens.
(I will commit some minimal unit tests in the next few days).

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Stefan Groschupf
Besides that, we should maybe add a kind of timeout to the URL filter in
general.
Since it can happen that a user configures a regex for his Nutch setup
that runs into the same problem we just ran into.

Something like the patch attached below.
Would you agree? I can create a proper patch and test it if we are
interested in adding this as a fallback to the sources.

At least this would protect Nutch against bad user configurations. :-)



Index: src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java
===================================================================
--- src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java	(revision 383682)
+++ src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java	(working copy)

@@ -75,14 +75,25 @@
   public synchronized String filter(String url) {
     Iterator i=rules.iterator();
+    MatcherThread mt;
     while(i.hasNext()) {
-      Rule r=(Rule) i.next();
-      Matcher matcher = r.pattern.matcher(url);
-
-      if (matcher.find()) {
-        //System.out.println("Matched " + r.regex);
-        return r.sign ? url : null;
-      }
+      mt = new MatcherThread();
+      mt.rule = (Rule) i.next();
+      mt.url = url;
+      mt.start();
+      try {
+        synchronized (mt.monitor) {
+          if (!mt.done) {
+            mt.monitor.wait(1000);
+          }
+        }
+      } catch (InterruptedException e) {}
+      // Thread.stop() is unsafe in general, but it is the only way to abort a
+      // matcher that is stuck in a backtracking loop.
+      mt.stop();
+      if (mt.matched) {
+        return mt.rule.sign ? url : null;
+      }
     };
 
     return null;   // assume no go
@@ -87,6 +98,25 @@
 
     return null;   // assume no go
   }
+
+  /** Runs a single rule match so the caller can give up after a timeout. */
+  class MatcherThread extends Thread {
+    private Object monitor = new Object();
+    private String url;
+    private Rule rule;
+    private boolean matched = false;
+    private boolean done = false;
+    public void run() {
+      Matcher matcher = this.rule.pattern.matcher(url);
+      if (matcher.find()) {
+        this.matched = true;
+      }
+      synchronized (monitor) {
+        this.done = true;
+        this.monitor.notify();
+      }
+    }
+  }
   //
   // Format of configuration file is
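
A similar per-rule timeout could also be sketched with java.util.concurrent (Java 5+), which at least avoids Thread.stop(); this is only an illustrative alternative, not part of the patch above, and the class name is made up:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

/** Illustrative only: run a regex match with a timeout so one pathological pattern cannot hang the caller. */
public class TimedMatcher {

  // Cached pool so a stuck match does not block later calls.
  private final ExecutorService pool = Executors.newCachedThreadPool();

  /** Returns true on a match, false if there is no match or the match takes longer than timeoutMs. */
  public boolean find(final Pattern pattern, final String url, long timeoutMs) {
    Future<Boolean> result = pool.submit(new Callable<Boolean>() {
      public Boolean call() {
        return Boolean.valueOf(pattern.matcher(url).find());
      }
    });
    try {
      return result.get(timeoutMs, TimeUnit.MILLISECONDS).booleanValue();
    } catch (TimeoutException e) {
      // Interrupt the worker; note that a backtracking matcher ignores interrupts,
      // so the thread may keep spinning until the match eventually finishes.
      result.cancel(true);
      return false;
    } catch (Exception e) {
      return false;
    }
  }
}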


On 16.03.2006, at 18:10, Jérôme Charron wrote:

1. Keeps the well-known perl syntax for regexp (and then find a way to
"simulate" them with automaton "limited" syntax)?

My vote would be for option 1. It's less work for everyone
(except for the person incorporating the new library :)


That's my preferred solution too.
The first challenge is to see how to translate the regexps used in the default
regexp-urlfilter templates provided by Nutch.
For now, the only thing I don't see how to translate from perl syntax to
dk.brics.automaton syntax is this regexp:
-.*(/.+?)/.*?\1/.*?\1/.*
In fact, automaton doesn't support capturing groups (Anders Moeller has
confirmed).
We cannot remove this regexp from the urlfilter, but we cannot handle it with
automaton.
So, two solutions:
1. Keep java regexp ...
2. Switch to automaton and provide a java implementation of this regexp (it
is more a protection pattern than really a filter pattern, and it could
probably be hard-coded).

I'm waiting for your suggestions...

Regards

Jérôme

 --
http://motrech.free.fr/
http://www.frutch.org/


-
blog: http://www.find23.org
company: http://www.media-style.com




Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Doug Cutting

Jérôme Charron wrote:

So, two solutions:

1. Keep java regexp ...
2. Switch to automaton and provide a java implementation of this regexp (it
is more a protection pattern than really a filter pattern, and it could
probably be hard-coded).


If it were easy to implement all java regex features in 
dk.brics.automaton.RegExp, then they probably would have.  Alternately, 
if they'd implemented all java regex features, it probably wouldn't be 
so fast.  So I worry that attempts to translate are doomed.  Better to 
accept the differences: if you want the speed, you must use restricted 
regexes.


How about:

3. Add new plugins that use dk.brics.automaton.RegExp, using different 
default regex file names.  Then folks can, if they choose, configure 
things to use these faster regex libraries, but only if they're willing 
to write the simpler regexes that it supports.  If, over time, we find 
that the most useful regexes are easily converted, then we could switch 
the default to this.
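
For illustration, rule evaluation in such a plugin might compile each of the simpler expressions with dk.brics.automaton; the class below is only a sketch, not an actual Nutch plugin, and it assumes the library's basic RegExp/RunAutomaton API:

import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

/** Illustrative only: one +/- filter rule evaluated with dk.brics.automaton instead of java.util.regex. */
public class AutomatonRule {

  private final boolean sign;          // '+' accepts the url, '-' rejects it
  private final RunAutomaton runner;   // precompiled deterministic automaton

  public AutomatonRule(boolean sign, String regex) {
    this.sign = sign;
    // Only what an automaton can express: no backreferences, no capturing groups.
    this.runner = new RunAutomaton(new RegExp(regex).toAutomaton());
  }

  /** Returns TRUE/FALSE if the rule decides about the url, or null if the rule does not apply. */
  public Boolean filter(String url) {
    // run() tests whether the automaton accepts the whole string, so rules must be
    // written as full-string expressions (e.g. ".*\.(gif|GIF)" rather than "\.gif").
    return runner.run(url) ? Boolean.valueOf(sign) : null;
  }

  public static void main(String[] args) {
    AutomatonRule noGifs = new AutomatonRule(false, ".*\\.(gif|GIF)");
    System.out.println(noGifs.filter("http://example.com/logo.gif"));    // false => rejected
    System.out.println(noGifs.filter("http://example.com/index.html"));  // null  => try next rule
  }
}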


Doug


Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Ken Krugler

> > 1. Keeps the well-known perl syntax for regexp (and then find a way to
> > "simulate" them with automaton "limited" syntax)?
> My vote would be for option 1. It's less work for everyone
> (except for the person incorporating the new library :)

That's my preferred solution too.
The first challenge is to see how to translate the regexps used in the default
regexp-urlfilter templates provided by Nutch.
For now, the only thing I don't see how to translate from perl syntax to
dk.brics.automaton syntax is this regexp:
-.*(/.+?)/.*?\1/.*?\1/.*
In fact, automaton doesn't support capturing groups (Anders Moeller has
confirmed).
We cannot remove this regexp from the urlfilter, but we cannot handle it with
automaton.
So, two solutions:
1. Keep java regexp ...
2. Switch to automaton and provide a java implementation of this regexp (it
is more a protection pattern than really a filter pattern, and it could
probably be hard-coded).

I'm waiting for your suggestions...


I've pinged Terence Parr, the ANTLR author. I heard that the new version 
(ANTLR 3) has a fast FSM inside it. If so, somebody could write an 
ANTLR grammar to convert the Nutch regexes into another ANTLR grammar 
that, when processed by ANTLR, creates a URL parser/validator.


It's almost too easy... :)

Anyway, waiting to hear back from Ter.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


Re: Crawling Accuracy

2006-03-16 Thread Ken Krugler

I have posted this before on "nutch user", but since that time I
have done some additional testing and I feel that this has more to do
with developers.
I have about 450 seed sites (in the quality and environmental areas). I
used the crawl method (Nutch 0.7.1.x) up to depth 4, then used the
whole-web method up to depth 6, and for some sites (in this case not all)
up to depth 7.  I restricted the outlinks to 50, used the default crawl-urlfilter
(+^http://([a-z0-9]*\.)*NAMEOFSITE/, one for every site) and
got about 523,000 pages.  Doing some searches I noted that I only got a
few results for some terms. For instance "nureg", a document type used by the
Nuclear Regulatory Commission (NRC), yielded only a little more than 20
documents (there are more than 3,000 of them).  Then I tried
"site:www.nrc.gov http", and found only 82 pages.  This site has more
than 10,000 pages!  I tried site:www.epa.gov http and only got 2413
pages (also, this site has more than 10,000 pages). The results were
similar for other very large (and not dynamic) sites.
Experimenting further I crawled, using the crawl method, depth 7, only
some sites, one at a time.  For instance, http://www.nrc.gov/ with the
filter +^http://([a-z0-9]*\.)*nrc.gov/ (and -.), increasing
"http.max.delays" to 10 and "http.timeout" to 2, and the results
were very poor: searching for "http" returned only 58 results.
Searching for "nureg" I only found 13 results, but for "adobe" (which
should be blocked by the filter, though maybe not by the outlinks rule, I do
not know) I got 4.  Performing the same test on other sites, like
www.epa.gov, www.iaea.org and www.iso.org, the results were very similar: a
very, very small percentage of the site's pages was indexed.
So I am posting these results, which may constitute, in themselves, an issue
that maybe should be dealt with.  Maybe this is not a problem if you
try to index the whole web, I don't know, but for niche sites, like mine,
it seems to be.


I think you're probably running into the limited # of domains problem 
that many vertical crawlers encounter.


The default Nutch settings are for a maximum of one fetcher thread 
per domain. This is the safe setting for polite crawling, unless you 
enjoy getting blacklisted :)


So if you have only a few domains (e.g. just one for your test case 
of just nrc.gov), you're going to get a lot of retry timeout errors 
as threads "block" because another thread is already fetching a page 
from the same domain.


Which means that your effective throughput per domain is going to be 
limited to the rate at which individual pages can be downloaded, 
including the delay that your Nutch configuration specifies between 
each request.


If you assume a page takes 1 second to download (counting connection 
setup time), plus there's a 5 second delay between requests, you're 
getting 10 pages/minute from any given domain. If you have 10M 
domains, no problem, but if you only have a limited number of 
domains, you run into inefficiencies in how Nutch handles fetcher 
threads that will severely constrain your crawl performance.
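
The arithmetic behind that estimate, as a tiny sketch (the one-second fetch and five-second delay are just the assumptions from the paragraph above):

/** Illustrative only: rough per-domain throughput for polite, single-threaded fetching. */
public class CrawlThroughput {
  public static void main(String[] args) {
    double fetchSeconds = 1.0;   // assumed time to download one page
    double delaySeconds = 5.0;   // assumed politeness delay between requests
    double pagesPerMinute = 60.0 / (fetchSeconds + delaySeconds);   // = 10 pages/minute
    double hoursFor10k = 10000.0 / pagesPerMinute / 60.0;           // ~ 16.7 hours for a 10,000-page site
    System.out.println(pagesPerMinute + " pages/minute per domain, about "
        + hoursFor10k + " hours for a 10,000-page site");
  }
}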


We're in the middle of a project to improve throughput in this kind 
of environment, but haven't yet finished.


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Jérôme Charron
> >1. Keeps the well-known perl syntax for regexp (and then find a way to
> >"simulate" them with automaton "limited" syntax) ?
> My vote would be for option 1. It's less work for everyone
> (except for the person incorporating the new library :)


That's my preferred solution too.
The first challenge is to see how to translate the regexps used in the default
regexp-urlfilter templates provided by Nutch.
For now, the only thing I don't see how to translate from perl syntax to
dk.brics.automaton syntax is this regexp:
-.*(/.+?)/.*?\1/.*?\1/.*
In fact, automaton doesn't support capturing groups (Anders Moeller has
confirmed).
We cannot remove this regexp from the urlfilter, but we cannot handle it with
automaton.
So, two solutions:
1. Keep java regexp ...
2. Switch to automaton and provide a java implementation of this regexp (it
is more a protection pattern than really a filter pattern, and it could
probably be hard-coded).

I'm waiting for your suggestions...
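
For what it's worth, the hard-coded variant mentioned in option 2 could be sketched as a check for a repeating path segment; this is only an approximation of what the backreference expression guards against, not an exact translation:

import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a hard-coded stand-in for the -.*(/.+?)/.*?\1/.*?\1/.* protection rule. */
public class RepeatedSegmentFilter {

  /** Returns null (reject) if some path segment occurs at least three times, otherwise the url. */
  public static String filter(String url) {
    String path = url;
    int proto = url.indexOf("://");
    if (proto != -1) {
      int slash = url.indexOf('/', proto + 3);
      path = (slash == -1) ? "" : url.substring(slash);
    }
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String segment : path.split("/")) {
      if (segment.length() == 0) continue;
      Integer seen = counts.get(segment);
      int n = (seen == null ? 0 : seen.intValue()) + 1;
      if (n >= 3) return null;          // looks like a crawler trap, e.g. /foo/bar/foo/bar/foo/...
      counts.put(segment, Integer.valueOf(n));
    }
    return url;                         // accepted; let the remaining rules decide
  }

  public static void main(String[] args) {
    System.out.println(filter("http://host/foo/bar/foo/bar/foo/x.html"));  // null => rejected
    System.out.println(filter("http://host/a/b/c/d.html"));                // accepted
  }
}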

Regards

Jérôme

 --
http://motrech.free.fr/
http://www.frutch.org/


Crawling Accuracy

2006-03-16 Thread carmmello
I have posted this before on "nutch user", but since that time I
have done some additional testing and I feel that this has more to do
with developers.
I have about 450 seed sites (in the quality and environmental areas). I
used the crawl method (Nutch 0.7.1.x) up to depth 4, then used the
whole-web method up to depth 6, and for some sites (in this case not all)
up to depth 7.  I restricted the outlinks to 50, used the default crawl-urlfilter
(+^http://([a-z0-9]*\.)*NAMEOFSITE/, one for every site) and
got about 523,000 pages.  Doing some searches I noted that I only got a
few results for some terms. For instance "nureg", a document type used by the
Nuclear Regulatory Commission (NRC), yielded only a little more than 20
documents (there are more than 3,000 of them).  Then I tried
"site:www.nrc.gov http", and found only 82 pages.  This site has more
than 10,000 pages!  I tried site:www.epa.gov http and only got 2413
pages (also, this site has more than 10,000 pages). The results were
similar for other very large (and not dynamic) sites.
Experimenting further I crawled, using the crawl method, depth 7, only
some sites, one at a time.  For instance, http://www.nrc.gov/ with the
filter +^http://([a-z0-9]*\.)*nrc.gov/ (and -.), increasing
"http.max.delays" to 10 and "http.timeout" to 2, and the results
were very poor: searching for "http" returned only 58 results.
Searching for "nureg" I only found 13 results, but for "adobe" (which
should be blocked by the filter, though maybe not by the outlinks rule, I do
not know) I got 4.  Performing the same test on other sites, like
www.epa.gov, www.iaea.org and www.iso.org, the results were very similar: a
very, very small percentage of the site's pages was indexed.
So I am posting these results, which may constitute, in themselves, an issue
that maybe should be dealt with.  Maybe this is not a problem if you
try to index the whole web, I don't know, but for niche sites, like mine,
it seems to be.
Thanks




[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-03-16 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12370686 ] 

Stefan Groschupf commented on NUTCH-233:


Sorry, I don't have such a URL since it happens while reducing a fetch. Reducing 
provides no logging, and the map data will be deleted if the job fails because of a 
timeout. :(


> wrong regular expression hang reduce process for ever
> -
>
>  Key: NUTCH-233
>  URL: http://issues.apache.org/jira/browse/NUTCH-233
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Blocker
>  Fix For: 0.8-dev

>
> Looks like the expression ".*(/.+?)/.*?\1/.*?\1/" in regex-urlfilter.txt 
> wasn't compatible with java.util.regex, which is actually used in the regex url 
> filter. 
> Maybe it was overlooked when the regular expression package was 
> changed.
> The problem was that while reducing a fetch map output the reducer hung 
> forever, since the output format was applying the urlfilter to a url that caused 
> the hang.
> 060315 230823 task_r_3n4zga at 
> java.lang.Character.codePointAt(Character.java:2335)
> 060315 230823 task_r_3n4zga at 
> java.util.regex.Pattern$Dot.match(Pattern.java:4092)
> 060315 230823 task_r_3n4zga at 
> java.util.regex.Pattern$Curly.match1(Pattern.java:
> I changed the regular expression to ".*(/[^/]+)/[^/]+\1/[^/]+\1/" and now the 
> fetch job works. (Thanks to Grant and Chris B. for helping to find the new regex.)
> However, maybe people can review it and suggest improvements. The old 
> regex would match:
> "abcd/foo/bar/foo/bar/foo/" and so will the new one. But the 
> old regex would also match:
> "abcd/foo/bar/xyz/foo/bar/foo/", which the new regex will not match.




[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-03-16 Thread Jerome Charron (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12370685 ] 

Jerome Charron commented on NUTCH-233:
--

Stefan,

I have created a small unit test for urlfilter-regexp and I don't notice any 
incompatibility in java.util.regex with this regexp. Could you please provide 
the urls that cause the problem so that I can add them to my unit tests.
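
For reference, a quick standalone check of the old and new expressions against the two example 
URLs from the issue description (plain java.util.regex, not the unit test mentioned above) could 
look like this:

import java.util.regex.Pattern;

/** Illustrative only: compare the old and new NUTCH-233 expressions on the issue's examples. */
public class Nutch233RegexCheck {
  public static void main(String[] args) {
    Pattern oldRule = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    Pattern newRule = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    String strictlyRepeating = "abcd/foo/bar/foo/bar/foo/";
    String looselyRepeating  = "abcd/foo/bar/xyz/foo/bar/foo/";

    // Both expressions match the strictly repeating path...
    System.out.println(oldRule.matcher(strictlyRepeating).find());  // true
    System.out.println(newRule.matcher(strictlyRepeating).find());  // true
    // ...but only the old one matches when other segments sit between the repeats.
    System.out.println(oldRule.matcher(looselyRepeating).find());   // true
    System.out.println(newRule.matcher(looselyRepeating).find());   // false
  }
}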
Thanks

Jérôme

> wrong regular expression hang reduce process for ever
> -
>
>  Key: NUTCH-233
>  URL: http://issues.apache.org/jira/browse/NUTCH-233
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Blocker
>  Fix For: 0.8-dev

>
> Looks like the expression ".*(/.+?)/.*?\1/.*?\1/" in regex-urlfilter.txt 
> wasn't compatible with java.util.regex, which is actually used in the regex url 
> filter. 
> Maybe it was overlooked when the regular expression package was 
> changed.
> The problem was that while reducing a fetch map output the reducer hung 
> forever, since the output format was applying the urlfilter to a url that caused 
> the hang.
> 060315 230823 task_r_3n4zga at 
> java.lang.Character.codePointAt(Character.java:2335)
> 060315 230823 task_r_3n4zga at 
> java.util.regex.Pattern$Dot.match(Pattern.java:4092)
> 060315 230823 task_r_3n4zga at 
> java.util.regex.Pattern$Curly.match1(Pattern.java:
> I changed the regular expression to ".*(/[^/]+)/[^/]+\1/[^/]+\1/" and now the 
> fetch job works. (Thanks to Grant and Chris B. for helping to find the new regex.)
> However, maybe people can review it and suggest improvements. The old 
> regex would match:
> "abcd/foo/bar/foo/bar/foo/" and so will the new one. But the 
> old regex would also match:
> "abcd/foo/bar/xyz/foo/bar/foo/", which the new regex will not match.
