Hi Mike,
I removed one regex, as described in
http://issues.apache.org/jira/browse/NUTCH-233
I am not 100% sure that this is what eliminated the error, because several
other things changed at the same time: the seed list is different, I
updated to the latest Nutch trunk, and I am now doing an internal crawl on
my seeds. It is still worth a shot to change the regex as described there,
or to remove it completely if you don't need that kind of filtering.
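For anyone comparing the two forms of the rule, here is a minimal Java
sketch. Both patterns are assumptions about the shape of the rule discussed
in that issue, not copied from any particular regex-urlfilter.txt, and the
class and method names are hypothetical. Using find() is also an assumption
about how the filter applies a rule. The sketch only shows that the two
forms agree on a short looping URL; it does not measure the stall itself.

```java
import java.util.regex.Pattern;

public class LoopRegexSketch {
    // ASSUMPTION: shape of the original rule, without the leading '-' that
    // marks an exclusion in regex-urlfilter.txt. The lazy ".+?" and ".*?"
    // may span '/' characters, which leaves a lot of room for backtracking.
    static final Pattern ORIGINAL =
        Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    // ASSUMPTION: shape of the simplified rule. Each piece is limited to a
    // single path segment ("[^/]+"), so there are far fewer ways to split
    // a URL while trying to match.
    static final Pattern SIMPLIFIED =
        Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // True if the rule matches anywhere in the URL.
    static boolean looksLikeLoop(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String looping = "http://example.com/a/b/a/c/a/";
        String plain = "http://example.com/docs/index.html";

        System.out.println(looksLikeLoop(ORIGINAL, looping));   // true
        System.out.println(looksLikeLoop(SIMPLIFIED, looping)); // true
        System.out.println(looksLikeLoop(ORIGINAL, plain));     // false
        System.out.println(looksLikeLoop(SIMPLIFIED, plain));   // false
    }
}
```

To see whether a rule stalls on your own data, time looksLikeLoop on the
longest URLs from your segment rather than on short test strings.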
Regards,
Vishal.
-----Original Message-----
From: Mike Smith [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, October 18, 2006 4:50 AM
To: [email protected]; [EMAIL PROTECTED]
Subject: Re: Reduce Error during fetch
Hi Vishal,
I am experiencing the same problem. The job gets stuck in the reduce stage
and finally fails with a timeout. Did removing or simplifying the regex
solve the problem?
Thanks, Mike
On 9/11/06, Vishal Shah <[EMAIL PROTECTED]> wrote:
Hi Dennis,
Thanks for the reply. I can't avoid regex matching, because I have
some patterns in the hostname that can't be matched with either prefix
or suffix filters. However, I will try it your way with simpler regexes,
just to test your theory.
Regards,
-vishal.
-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]]
Sent: Friday, September 08, 2006 11:30 PM
To: [email protected]
Subject: Re: Reduce Error during fetch
You may be running into regex stalls during URL filtering. Try
removing the regex filter from the plugin.includes property in
nutch-site.xml. I was having similar problems before I switched to using
only the prefix and suffix filters, as below. I attached my prefix and
suffix URL filter files, which go in conf. I am only indexing http files,
so you may need to modify these. Hope this helps.
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(suffix|prefix)|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints
  plugin. By default Nutch includes crawling just HTML and plain text
  via HTTP, and basic indexing and search plugins.
  </description>
</property>
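
To illustrate why prefix and suffix checks cannot stall the way a
backtracking regex can, here is a minimal Java sketch of the idea. It is
not the code of the Nutch plugins: the class name, the rule lists, and the
case-insensitive suffix comparison are all assumptions made for the
example. It follows the convention of returning the URL to keep it and
null to drop it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PrefixSuffixFilterSketch {
    // HYPOTHETICAL rules; real rules would come from the prefix and
    // suffix filter files in conf.
    static final List<String> ALLOWED_PREFIXES = Arrays.asList("http://");
    static final List<String> REJECTED_SUFFIXES =
        Arrays.asList(".gif", ".jpg", ".zip");

    // Returns the URL unchanged if it passes both checks, or null to drop it.
    // Each check is a plain string comparison, so the cost grows linearly
    // with the URL length and there is no backtracking.
    static String filter(String url) {
        boolean prefixOk = false;
        for (String prefix : ALLOWED_PREFIXES) {
            if (url.startsWith(prefix)) {
                prefixOk = true;
                break;
            }
        }
        if (!prefixOk) {
            return null;
        }
        // ASSUMPTION: suffixes are compared case-insensitively.
        String lower = url.toLowerCase(Locale.ROOT);
        for (String suffix : REJECTED_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return null;
            }
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(filter("http://example.com/index.html")); // kept
        System.out.println(filter("ftp://example.com/file.txt"));    // null
        System.out.println(filter("http://example.com/logo.GIF"));   // null
    }
}
```

The trade-off is the one raised elsewhere in this thread: patterns in the
middle of a hostname cannot be expressed as a prefix or a suffix.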
Dennis
Vishal Shah wrote:
> Hi,
>
> I've been trying to get the Nutch fetcher to work for a couple of
> days, but it always hangs on one of the reduce processes, and the job
> is aborted. I am using numFetchers=24 during generate, and 24 map tasks
> and 6 reduce tasks during fetch, on a 3-machine cluster. The task that
> failed was tried at least 3 times before the job was aborted.
>
> I looked into the logs on one of the machines with the failed tasks,
> and I see these errors:
>
> 1) 2006-09-08 18:04:03,294 INFO mapred.TaskTracker -
> task_0003_r_000004_3: Task failed to report status for 608 seconds.
> Killing.
>
> 2)
> java.lang.IllegalStateException
>     at org.mortbay.jetty.servlet.ServletHttpResponse.getWriter(ServletHttpResponse.java:561)
>     at org.apache.jasper.runtime.JspWriterImpl.initOut(JspWriterImpl.java:122)
>     at org.apache.jasper.runtime.JspWriterImpl.flushBuffer(JspWriterImpl.java:115)
>     at org.apache.jasper.runtime.PageContextImpl.release(PageContextImpl.java:190)
>     at org.apache.jasper.runtime.JspFactoryImpl.internalReleasePageContext(JspFactoryImpl.java:115)
>     at org.apache.jasper.runtime.JspFactoryImpl.releasePageContext(JspFactoryImpl.java:75)
>     at org.apache.hadoop.mapred.getMapOutput_jsp._jspService(getMapOutput_jsp.java:100)
>     at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>     at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
>     at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
>     at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
>     at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
>     at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
>     at org.mortbay.http.HttpServer.service(HttpServer.java:954)
>     at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
>     at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
>     at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
>     at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
>     at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
>     at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
>
> Any idea where the problem is, and how to rectify it?
>
> Regards,
>
> -vishal.
>
>
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general