Re: Crawl and parse exceptions
Unfortunately, the logs have since been overwritten by Nutch, so I can't check them, but I am pretty sure those are actually the messages from the tasktracker log on the remote machine. If I am remembering correctly, all that was shown on the master was a short exception saying the child failed, or something like that. I wish I could be more help, but as I said, when the jobtracker/tasktrackers were stopped and started, they overwrote the log.

-Matt Zytaruk

Doug Cutting wrote:

Matt Zytaruk wrote:
> Exception in thread "main" java.io.IOException: Not a file:
> /user/nutch/segments/20060107130328/parse_data/part-0/data
>         at org.apache.nutch.ipc.Client.call(Client.java:294)

This is an error returned from an RPC call. There should be more details about this in a slave log, e.g., a better stack trace, some context, etc. What do you see there?

> We also got this for awhile (seems like the mapred/system dir is never
> being created for some reason):
> java.io.IOException: Cannot open filename
> /nutch-data/nutch/tmp/nutch/mapred/system/submit_euiwjv/job.xml
>         at org.apache.nutch.ipc.Client.call(Client.java:294)

Again, it would be interesting to see what happened on the other end of this RPC call. Please look in the remote log.

Doug
Re: Crawl and parse exceptions
Just a followup: I figured out the 3rd exception below (Exception in thread "main" java.io.IOException: No input directories specified in: NutchConf..), so no worries there, but the others are still issues.

Matt Zytaruk wrote:

I've been having a lot of trouble lately with the newest Nutch src. Both my crawls and parses are failing. (For our fetches we crawl and parse at the same time with just the default Nutch config, just to get the outlinks and update the crawldb; then later on, after the fetch, we do another parse with custom parse filters.) Here are the exceptions.

This exception happens sometimes when crawling (on the linkdb part of the crawl):

Exception in thread "main" java.io.IOException: Not a file: /user/nutch/segments/20060107130328/parse_data/part-0/data
        at org.apache.nutch.ipc.Client.call(Client.java:294)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy1.submitJob(Unknown Source)
        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:131)

We also got this for awhile (it seems like the mapred/system dir is never being created, for some reason):

java.io.IOException: Cannot open filename /nutch-data/nutch/tmp/nutch/mapred/system/submit_euiwjv/job.xml
        at org.apache.nutch.ipc.Client.call(Client.java:294)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy1.open(Unknown Source)
        at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.openInfo(NDFSClient.java:256)
        at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.<init>(NDFSClient.java:242)
        at org.apache.nutch.ndfs.NDFSClient.open(NDFSClient.java:79)
        at org.apache.nutch.fs.NDFSFileSystem.openRaw(NDFSFileSystem.java:66)
        at org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
        at org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:221)
        at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
        at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:149)
        at org.apache.nutch.fs.NDFSFileSystem.copyToLocalFile(NDFSFileSystem.java:221)
        at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:346)
        at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:332)
        at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:232)
        at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:286)
        at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:651)

Then, on parsing, we got this within 10 seconds of the parse starting:

060109 093759 task_m_ltgpnj Error running child
060109 093759 task_m_ltgpnj java.lang.RuntimeException: java.io.EOFException
060109 093759 task_m_ltgpnj   at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:57)
060109 093759 task_m_ltgpnj   at org.apache.nutch.protocol.Content.getContent(Content.java:124)
060109 093759 task_m_ltgpnj   at org.apache.nutch.crawl.MD5Signature.calculate(MD5Signature.java:33)
060109 093759 task_m_ltgpnj   at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:62)
060109 093759 task_m_ltgpnj   at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:52)
060109 093759 task_m_ltgpnj   at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
060109 093759 task_m_ltgpnj   at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)
060109 093759 task_m_ltgpnj Caused by: java.io.EOFException
060109 093759 task_m_ltgpnj   at java.io.DataInputStream.readFully(DataInputStream.java:268)
060109 093759 task_m_ltgpnj   at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060109 093759 task_m_ltgpnj   at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060109 093759 task_m_ltgpnj   at org.apache.nutch.io.UTF8.readChars(UTF8.java:212)
060109 093759 task_m_ltgpnj   at org.apache.nutch.io.UTF8.readString(UTF8.java:204)
060109 093759 task_m_ltgpnj   at org.apache.nutch.protocol.ContentProperties.readFields(ContentProperties.java:169)
060109 093759 task_m_ltgpnj   at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:81)
060109 093759 task_m_ltgpnj   at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:54)
060109 093759 task_m_ltgpnj   ... 6 more
060109 093802 task_m_txrnu3 done; removing files.
060109 093802 Server connection on port 50050 from 127.0.0.2: exiting
060109 093805 task_m_ltgpnj done; removing files.
060109 093805 Lost connection to JobTracker [crawler-d-03.internal.wavefire.ca/127.0.0.2:8050]. ex=java.lang.NullPointerException Retrying...

On a different segment, we got this instead:

Exception in thread "main" java.io.IOException: No input directories specified in: NutchConf..
        at org.apache.nutch.ipc.Client.call(Client.java:294)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy0.submitJob(Unknown Source)
        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:95)
        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:113)

(I think you usually get this error when you don't put the right filenames in the arguments, but that is definitely not the case here.)

These are all tasks on segments which worked fine before we changed src code (we had been working with the src from about the beginning of December previously). It's also not a permissions issue, as it all worked fine previously. The only things that have changed are the updated code and the number of map/reduce tasks in the config. (Side note: what is the best number of tasks to use for each? We have a set of 2 machines that works together to crawl, and a set of 3 machines that works together to parse/index.) Any help would be much appreciated, as otherwise I am doomed. Thanks ahead of time.

-Matt Zytaruk
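On the side question above about how many map and reduce tasks to configure: the comments shipped in nutch-default.xml on the mapred branch suggest setting mapred.map.tasks to a prime several times greater than the number of available hosts, and mapred.reduce.tasks to a prime close to the number of hosts. Below is a hedged illustration for a two-to-three machine setup; the values are only a starting point, not tested advice:

    <!-- nutch-site.xml: illustrative task counts for a handful of nodes.
         Following the nutch-default.xml guidance: maps = a prime several
         times the host count, reduces = a prime close to the host count. -->
    <nutch-conf>
      <property>
        <name>mapred.map.tasks</name>
        <value>11</value>
      </property>
      <property>
        <name>mapred.reduce.tasks</name>
        <value>3</value>
      </property>
    </nutch-conf>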
Re: Class Cast exception
Worked perfectly. Thanks.

-Matt Zytaruk

Andrzej Bialecki wrote:

Hi,

I attached the patch. Please test.

Index: ParseData.java
===================================================================
--- ParseData.java      (revision 365563)
+++ ParseData.java      (working copy)
@@ -31,7 +31,7 @@
 public final class ParseData extends VersionedWritable {
   public static final String DIR_NAME = "parse_data";
 
-  private final static byte VERSION = 2;
+  private final static byte VERSION = 3;
 
   private String title;
   private Outlink[] outlinks;
@@ -96,10 +96,15 @@
       Outlink.skip(in);
     }
 
-    int propertyCount = in.readInt();             // read metadata
-    metadata = new ContentProperties();
-    for (int i = 0; i < propertyCount; i++) {
-      metadata.put(UTF8.readString(in), UTF8.readString(in));
+    if (version < 3) {
+      int propertyCount = in.readInt();           // read metadata
+      metadata = new ContentProperties();
+      for (int i = 0; i < propertyCount; i++) {
+        metadata.put(UTF8.readString(in), UTF8.readString(in));
+      }
+    } else {
+      metadata = new ContentProperties();
+      metadata.readFields(in);
     }
   }
 
@@ -113,14 +118,7 @@
     for (int i = 0; i < outlinks.length; i++) {
       outlinks[i].write(out);
     }
-
-    out.writeInt(metadata.size());                // write metadata
-    Iterator i = metadata.entrySet().iterator();
-    while (i.hasNext()) {
-      Map.Entry e = (Map.Entry)i.next();
-      UTF8.writeString(out, (String)e.getKey());
-      UTF8.writeString(out, (String)e.getValue());
-    }
+    metadata.write(out);
   }
 
   public static ParseData read(DataInput in) throws IOException {
Re: Class Cast exception
So will this throw an exception on older segments? Or will it just not get the correct metadata? I have a lot of older segments I still need to use.

Thanks for your help.

-Matt Zytaruk

Andrzej Bialecki wrote:

Matt Zytaruk wrote:
> Here you go.
>
> java.lang.ClassCastException: java.util.ArrayList
>         at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
>         at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
>         at org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57)
>         at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:168)
>         at org.apache.nutch.mapred.MapTask$1.collect(MapTask.java:78)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:229)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:123)

Congratulations! You are the first person to actually use (and suffer from) the multiple values in ContentProperties... ;-)

It turns out that ParseData.write() uses its own method for writing out metadata, instead of using ContentProperties.write(). That works well if you only have single values (they are stored as Strings), but if there are multiple values they are stored in ArrayLists, which ParseData accesses directly by virtue of using metadata.entrySet().iterator().

The fix is easy: please replace the following lines in ParseData.write():

    out.writeInt(metadata.size());                // write metadata
    Iterator i = metadata.entrySet().iterator();
    while (i.hasNext()) {
      Map.Entry e = (Map.Entry)i.next();
      UTF8.writeString(out, (String)e.getKey());
      UTF8.writeString(out, (String)e.getValue());
    }

with this:

    metadata.write(out);

and do the same for reading the metadata field; replace this in ParseData.readFields():

    int propertyCount = in.readInt();             // read metadata
    metadata = new ContentProperties();
    for (int i = 0; i < propertyCount; i++) {
      metadata.put(UTF8.readString(in), UTF8.readString(in));
    }

with this:

    metadata = new ContentProperties();
    metadata.readFields(in);

Compile, deploy, test, report... :-)

Please note that this changes the on-disk segment format, so you won't be able to read the old segments with the new code. You may want to bump ParseData.VERSION and leave this code in to handle older versions.
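For readers following along, the failure mode Andrzej describes can be reproduced outside Nutch. The sketch below is hypothetical (the class and keys are made up, it is not the actual ContentProperties code): it mimics a properties map that stores a single value as a String but promotes repeated values to an ArrayList, and then iterates the entry set casting every value to String, which is exactly where the java.lang.ClassCastException: java.util.ArrayList comes from.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    // Hypothetical stand-in for ContentProperties: the first value for a
    // key is kept as a plain String; adding a second value promotes the
    // entry to an ArrayList of Strings.
    public class MultiValueCastDemo {

      @SuppressWarnings("unchecked")
      static void put(Map map, String key, String value) {
        Object old = map.get(key);
        if (old == null) {
          map.put(key, value);             // first value: plain String
        } else if (old instanceof List) {
          ((List) old).add(value);         // third and later values
        } else {
          List values = new ArrayList();   // second value: promote to list
          values.add(old);
          values.add(value);
          map.put(key, values);
        }
      }

      public static void main(String[] args) {
        Map metadata = new HashMap();
        put(metadata, "Content-Type", "text/html");
        put(metadata, "Set-Cookie", "a=1");
        put(metadata, "Set-Cookie", "b=2"); // "Set-Cookie" is now an ArrayList

        // This mirrors the old ParseData.write() loop: it casts every value
        // to String, so the multi-valued entry throws
        // java.lang.ClassCastException: java.util.ArrayList.
        Iterator i = metadata.entrySet().iterator();
        while (i.hasNext()) {
          Map.Entry e = (Map.Entry) i.next();
          String value = (String) e.getValue(); // throws on "Set-Cookie"
          System.out.println(e.getKey() + " = " + value);
        }
      }
    }

Delegating serialization to the properties class itself, as Andrzej's patch does, avoids the cast entirely, because the class knows how to write its own multi-valued entries.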
Re: Class Cast exception
Here you go.

java.lang.ClassCastException: java.util.ArrayList
        at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
        at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
        at org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57)
        at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:168)
        at org.apache.nutch.mapred.MapTask$1.collect(MapTask.java:78)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:229)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:123)

Andrzej Bialecki wrote:

Matt Zytaruk wrote:
> The newest src (as of this morning) of trunk is occasionally giving
> ClassCastExceptions when doing a crawl with parsing (and by occasionally
> I mean this was the only page, out of the small list I crawled, that it
> happened on). This is with nothing changed from the defaults, on a
> server running SuSE Linux. Here is a sample of the logging:
>
> 060106 111516 Parsing [http://easily.co.uk/] with [EMAIL PROTECTED]
> 060106 111516 Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 060106 111516 fetch of http://easily.co.uk/ failed with: java.lang.ClassCastException: java.util.ArrayList
>
> -Matt Zytaruk

Could you please add a call to printStackTrace() in that catch{} statement, so that we know where the exception is thrown?
Class Cast exception
The newest src (as of this morning) of trunk is occasionally giving ClassCastExceptions when doing a crawl with parsing (and by occasionally I mean this was the only page, out of the small list I crawled, that it happened on). This is with nothing changed from the defaults, on a server running SuSE Linux. Here is a sample of the logging:

060106 111516 Parsing [http://easily.co.uk/] with [EMAIL PROTECTED]
060106 111516 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060106 111516 fetch of http://easily.co.uk/ failed with: java.lang.ClassCastException: java.util.ArrayList

-Matt Zytaruk
Site Query Filter Bug?
Hello there,

I've been trying to get domain-specific searches working using the Site Query Filter, and it seems there is a bug somewhere, as it just won't return any results. I changed the code in BasicIndexingFilter to make the site field stored instead of just indexed, and suddenly searches using the site query filter work fine. Are you supposed to be able to search on un-stored fields? If not, why is the site query filter plugin enabled by default when the indexer doesn't store the site field information? This is with the 0.8-dev version.

Thanks for your help.

-Matt Zytaruk
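For context on the stored-vs-indexed question: in Lucene, which Nutch indexing is built on, "stored" and "indexed" are independent properties of a field, and searching only requires the field to be indexed, since queries run against the inverted index rather than the stored values. The sketch below is a hedged illustration, not the actual BasicIndexingFilter code: the class name and helper methods are made up, and it assumes the Lucene 1.9-era Field API (older releases used a boolean-based Field constructor instead of the Store/Index constants).

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Hypothetical sketch of the two indexing choices being discussed.
    public class SiteFieldSketch {

      public static Document withUnstoredSite(String site) {
        Document doc = new Document();
        // Indexed but not stored: the term goes into the inverted index,
        // so queries against "site" should still match; you just cannot
        // read the value back out of a search hit.
        doc.add(new Field("site", site, Field.Store.NO, Field.Index.UN_TOKENIZED));
        return doc;
      }

      public static Document withStoredSite(String site) {
        Document doc = new Document();
        // Stored and indexed: same searchability, plus the raw value is
        // retrievable from results (this is what the poster switched to).
        doc.add(new Field("site", site, Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
      }
    }

If un-stored fields really aren't matching, the problem is more likely in how the site query filter builds or filters the query than in Lucene's stored/indexed handling, which would be consistent with the bug suspicion above.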
Fetcher Speed Issues
Hi there,

I just started working on a search engine based on the Nutch project, but we are finding that the fetcher crawls extremely slowly. I've seen posts about people maxing out their 5Mb lines with the fetcher, but we can't seem to get any more than about 20k/s, or 1.5 pages/second, which isn't even a smidgen of our capacity, even with -threads set to 200. This is using the mapred branch, on FreeBSD 4. Are there any settings we might be missing that would cause this slowdown? Or are there certain network configurations that could be causing this?

Also, is the -numFetchers option in 'nutch generate' broken in the mapred branch? It worked fine in 0.7, but doesn't seem to do anything in 0.8-dev.

Thanks a lot for your help.

Matt Zytaruk
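On the settings question: most of the throughput knobs in 0.8-dev live in the fetcher.* properties, which can be overridden in nutch-site.xml. The sketch below is a hedged example with illustrative values, not tested advice. In particular, the politeness settings (fetcher.server.delay and fetcher.threads.per.host) frequently cap throughput well below line speed when the fetch list is concentrated on a few hosts, which can produce the low pages-per-second pattern described above regardless of the -threads setting.

    <!-- nutch-site.xml: illustrative overrides, values are only examples -->
    <nutch-conf>
      <property>
        <name>fetcher.threads.fetch</name>
        <value>200</value>
        <description>Total fetcher threads per fetch task.</description>
      </property>
      <property>
        <name>fetcher.threads.per.host</name>
        <value>1</value>
        <description>Max concurrent connections to any one host; raising
        this only helps when the fetch list is spread across few hosts,
        and it trades away politeness.</description>
      </property>
      <property>
        <name>fetcher.server.delay</name>
        <value>5.0</value>
        <description>Seconds to wait between successive requests to the
        same host; with few distinct hosts, this (not bandwidth) is often
        the real bottleneck.</description>
      </property>
    </nutch-conf>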