Sami,
Thanks for resolving this serious issue. I just updated my code from trunk
and plan to test fetch speed, but there is a runtime error related to the
switch from UTF8 to Text. Since the error comes from Hadoop, how do I fix
it?

java.lang.ClassCastException: org.apache.hadoop.io.UTF8
   at org.apache.nutch.crawl.Generator$Selector.map(Generator.java:108)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:213)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:105)

Thanks,
AJ


On 11/13/06, Sami Siren (JIRA) <[EMAIL PROTECTED]> wrote:

     [ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]

Sami Siren resolved NUTCH-395.
------------------------------

    Fix Version/s: 0.9.0
       Resolution: Fixed

Applied to trunk with some additional whitespace changes.

> Increase fetching speed
> -----------------------
>
>                 Key: NUTCH-395
>                 URL: http://issues.apache.org/jira/browse/NUTCH-395
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>             Fix For: 0.9.0
>
>         Attachments: nutch-0.8-performance.txt, NUTCH-395-trunk-metadata-only-2.patch, NUTCH-395-trunk-metadata-only.patch
>
>
> There has been some discussion on the Nutch mailing lists about the fetcher
> being slow; this patch tries to address that. The patch is just a quick hack
> and needs some cleaning up. It currently applies to the 0.8 branch, not
> trunk, and it has not been tested at large scale. What does it change?
> Metadata - the original metadata uses spellchecking; the new version does
> not (a decorator is provided that can do it, and it should perhaps be used
> where HTTP headers are handled, but in most cases the functionality is
> not required).
> Reading/writing various data structures - the patch tries to do I/O more
> efficiently; see the patch for details.
> Initial benchmark:
> A small benchmark was done to measure the performance of the changes with a
> script that basically does the following:
> -inject a list of urls into a fresh crawldb
> -create fetchlist (10k urls pointing to local filesystem)
> -fetch
> -updatedb
> original code from 0.8-branch:
> real    10m51.907s
> user    10m9.914s
> sys     0m21.285s
> after applying the patch:
> real    4m15.313s
> user    3m42.598s
> sys     0m18.485s
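
To put the two timings above side by side, here is a small sketch that converts the `time` output into seconds and computes the before/after ratio for each row. The class and method names (`BenchmarkSpeedup`, `parseSeconds`, `speedup`) are made up for illustration and are not part of Nutch or the patch; the parser assumes the `<minutes>m<seconds>s` layout shown in the benchmark and nothing else.

```java
public class BenchmarkSpeedup {

    // Hypothetical helper: parse a duration such as "10m51.907s"
    // (the layout printed by the shell's `time`) into seconds.
    public static double parseSeconds(String t) {
        int m = t.indexOf('m');
        double minutes = Double.parseDouble(t.substring(0, m));
        // Drop the trailing 's' before parsing the seconds part.
        double seconds = Double.parseDouble(t.substring(m + 1, t.length() - 1));
        return minutes * 60.0 + seconds;
    }

    // Ratio of the original run time to the patched run time.
    public static double speedup(String before, String after) {
        return parseSeconds(before) / parseSeconds(after);
    }

    public static void main(String[] args) {
        // Values copied from the benchmark in the issue description.
        String[] labels = {"real", "user", "sys"};
        String[] before = {"10m51.907s", "10m9.914s", "0m21.285s"};
        String[] after  = {"4m15.313s", "3m42.598s", "0m18.485s"};

        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%-4s %8.3fs -> %8.3fs  (%.2fx)%n",
                    labels[i],
                    parseSeconds(before[i]),
                    parseSeconds(after[i]),
                    speedup(before[i], after[i]));
        }
    }
}
```

The wall-clock ("real") ratio comes out at roughly 2.5x. That figure only describes this 10k-URL local-filesystem run and says nothing about fetching over a network.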

--
AJ Chen, PhD
http://web2express.org
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
