Re: Crawl and parse exceptions

2006-01-11 Thread Matt Zytaruk
Unfortunately, the logs have since been overwritten by Nutch so I can't check them, but I am pretty sure those are actually the messages from the task tracker log on the remote machine. If I am remembering correctly, all that was shown on the master was a short exception saying the child failed, or something like that. I wish I could be of more help, but as I said, when the jobtracker/tasktrackers were stopped and restarted they overwrote the log.


-Matt Zytaruk

Doug Cutting wrote:


Matt Zytaruk wrote:

Exception in thread "main" java.io.IOException: Not a file: 
/user/nutch/segments/20060107130328/parse_data/part-0/data

   at org.apache.nutch.ipc.Client.call(Client.java:294)



This is an error returned from an RPC call.  There should be more 
details about this in a slave log, e.g., a better stack trace, some 
context, etc.  What do you see there?


We also got this for a while (it seems the mapred/system dir is never being created for some reason):
java.io.IOException: Cannot open filename 
/nutch-data/nutch/tmp/nutch/mapred/system/submit_euiwjv/job.xml

  at org.apache.nutch.ipc.Client.call(Client.java:294)



Again, it would be interesting to see what happened on the other end 
of this RPC call.  Please look in the remote log.


Doug






Re: Crawl and parse exceptions

2006-01-09 Thread Matt Zytaruk
Just a follow-up: I figured out the 3rd exception below (Exception in thread "main" java.io.IOException: No input directories specified in: NutchConf...), so no worries there, but the others are still issues.


Matt Zytaruk wrote:

I've been having a lot of trouble lately with the newest Nutch source. Both my crawls and parses are failing. (For our fetches we crawl and parse at the same time with just the default Nutch config, just to get the outlinks and update the crawldb; then later, after the fetch, we do another parse with custom parse filters.) The exceptions are below.


This exception happens sometimes when crawling (on the linkdb part of 
the crawl):


Exception in thread "main" java.io.IOException: Not a file: 
/user/nutch/segments/20060107130328/parse_data/part-0/data

   at org.apache.nutch.ipc.Client.call(Client.java:294)
   at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
   at $Proxy1.submitJob(Unknown Source)
   at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
   at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
   at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:131)

We also got this for a while (it seems the mapred/system dir is never being created for some reason):
java.io.IOException: Cannot open filename /nutch-data/nutch/tmp/nutch/mapred/system/submit_euiwjv/job.xml

  at org.apache.nutch.ipc.Client.call(Client.java:294)
  at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
  at $Proxy1.open(Unknown Source)
  at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.openInfo(NDFSClient.java:256)
  at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.<init>(NDFSClient.java:242)
  at org.apache.nutch.ndfs.NDFSClient.open(NDFSClient.java:79)
  at org.apache.nutch.fs.NDFSFileSystem.openRaw(NDFSFileSystem.java:66)
  at org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
  at org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:221)
  at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
  at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:149)
  at org.apache.nutch.fs.NDFSFileSystem.copyToLocalFile(NDFSFileSystem.java:221)
  at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:346)
  at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:332)
  at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:232)
  at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:286)
  at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:651)

Then, on parsing, we got this within 10 seconds of the parse starting:

060109 093759 task_m_ltgpnj  Error running child
060109 093759 task_m_ltgpnj java.lang.RuntimeException: java.io.EOFException
060109 093759 task_m_ltgpnj at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:57)
060109 093759 task_m_ltgpnj at org.apache.nutch.protocol.Content.getContent(Content.java:124)
060109 093759 task_m_ltgpnj at org.apache.nutch.crawl.MD5Signature.calculate(MD5Signature.java:33)
060109 093759 task_m_ltgpnj at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:62)
060109 093759 task_m_ltgpnj at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:52)
060109 093759 task_m_ltgpnj at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
060109 093759 task_m_ltgpnj at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)
060109 093759 task_m_ltgpnj Caused by: java.io.EOFException
060109 093759 task_m_ltgpnj at java.io.DataInputStream.readFully(DataInputStream.java:268)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.UTF8.readChars(UTF8.java:212)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.UTF8.readString(UTF8.java:204)
060109 093759 task_m_ltgpnj at org.apache.nutch.protocol.ContentProperties.readFields(ContentProperties.java:169)
060109 093759 task_m_ltgpnj at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:81)
060109 093759 task_m_ltgpnj at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:54)
060109 093759 task_m_ltgpnj ... 6 more
060109 093802 task_m_txrnu3 done; removing files.
060109 093802 Server connection on port 50050 from 127.0.0.2: exiting
060109 093805 task_m_ltgpnj done; removing files.
060109 093805 Lost connection to JobTracker [crawler-d-03.internal.wavefire.ca/127.0.0.2:8050]. ex=java.lang.NullPointerException  Retrying...


On a different segment we got this instead:
Exception in thread "main" java.io.IOException: No input directories specified in: NutchConf...

   at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
   at $Proxy0.submitJob(Unknown Source)
   at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
   at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
   at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:95)
   at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:113)

(I think you usually get this error when you don't put the right filenames in the arguments, but that is definitely not the case here.)



These are all tasks on segments which worked fine before we changed source code (we had been working with the source from about the beginning of December previously). It's also not a permissions issue, as it all worked fine before. The only things that have changed are the updated code and the number of map/reduce tasks in the config. (Side note: what is the best number of tasks to use for each? We have a set of 2 machines that work together to crawl, and a set of 3 machines that work together to parse/index.)
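For what it's worth, the task counts are set in nutch-site.xml; below is a hedged sketch, with property names recalled from the mapred branch configuration and values that are purely illustrative (a common rule of thumb is a small multiple of the number of task trackers):

    <!-- Sketch only: verify property names against the mapred default config. -->
    <property>
      <name>mapred.map.tasks</name>
      <value>6</value>
      <description>Default number of map tasks per job.</description>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>3</value>
      <description>Default number of reduce tasks per job.</description>
    </property>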


Any help would be much appreciated, as otherwise I am doomed. Thanks ahead of time.


-Matt Zytaruk




Re: Class Cast exception

2006-01-06 Thread Matt Zytaruk

Worked perfectly. Thanks

-Matt Zytaruk

Andrzej Bialecki wrote:


Hi,

I attached the patch. Please test.



Index: ParseData.java
===================================================================
--- ParseData.java  (revision 365563)
+++ ParseData.java  (working copy)
@@ -31,7 +31,7 @@
 public final class ParseData extends VersionedWritable {
   public static final String DIR_NAME = "parse_data";
 
-  private final static byte VERSION = 2;
+  private final static byte VERSION = 3;
 
   private String title;
   private Outlink[] outlinks;
@@ -96,10 +96,15 @@
       Outlink.skip(in);
     }
 
-    int propertyCount = in.readInt(); // read metadata
-    metadata = new ContentProperties();
-    for (int i = 0; i < propertyCount; i++) {
-      metadata.put(UTF8.readString(in), UTF8.readString(in));
+    if (version < 3) {
+      int propertyCount = in.readInt(); // read metadata
+      metadata = new ContentProperties();
+      for (int i = 0; i < propertyCount; i++) {
+        metadata.put(UTF8.readString(in), UTF8.readString(in));
+      }
+    } else {
+      metadata = new ContentProperties();
+      metadata.readFields(in);
     }
 
   }
@@ -113,14 +118,7 @@
     for (int i = 0; i < outlinks.length; i++) {
       outlinks[i].write(out);
     }
-
-    out.writeInt(metadata.size());               // write metadata
-    Iterator i = metadata.entrySet().iterator();
-    while (i.hasNext()) {
-      Map.Entry e = (Map.Entry)i.next();
-      UTF8.writeString(out, (String)e.getKey());
-      UTF8.writeString(out, (String)e.getValue());
-    }
+    metadata.write(out);
   }
 
   public static ParseData read(DataInput in) throws IOException {





Re: Class Cast exception

2006-01-06 Thread Matt Zytaruk
So will this throw an exception on older segments, or will it just not get the correct metadata? I have a lot of older segments I still need to use.

Thanks for your help.

-Matt Zytaruk

Andrzej Bialecki wrote:


Matt Zytaruk wrote:


Here you go.

java.lang.ClassCastException: java.util.ArrayList
   at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
   at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
   at org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57)
   at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:168)
   at org.apache.nutch.mapred.MapTask$1.collect(MapTask.java:78)
   at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:229)
   at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:123)




Congratulations! You are the first person to actually use (and suffer 
from) the multiple values in ContentProperties... ;-)


It turns out that ParseData.write() uses its own method for writing out metadata instead of using ContentProperties.write(). It works well if you only have single values (they are stored as Strings), but if there are multiple values for a key they are stored in ArrayLists, which ParseData accesses directly by virtue of using metadata.entrySet().iterator().
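To make the failure mode concrete, here is a minimal, self-contained sketch of why that pattern breaks; it uses a plain java.util map standing in for ContentProperties, so the class and keys here are hypothetical, not the actual Nutch code:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    public class MultiValueCastDemo {
      public static void main(String[] args) {
        // Stand-in for ContentProperties: a single value is kept as a String,
        // while a repeated key is promoted to a list of values.
        Map<String, Object> metadata = new HashMap<String, Object>();
        metadata.put("Content-Type", "text/html");

        List<String> cookies = new ArrayList<String>();
        cookies.add("a=1");
        cookies.add("b=2");
        metadata.put("Set-Cookie", cookies);      // multi-valued entry

        // ParseData.write()-style iteration: assumes every value is a String.
        Iterator<Map.Entry<String, Object>> it = metadata.entrySet().iterator();
        while (it.hasNext()) {
          Map.Entry<String, Object> e = it.next();
          // Throws java.lang.ClassCastException: java.util.ArrayList
          // when it reaches the multi-valued entry.
          String value = (String) e.getValue();
          System.out.println(e.getKey() + " = " + value);
        }
      }
    }

Delegating to ContentProperties.write()/readFields(), as in the fix below, keeps the multi-value handling in one place.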


The fix is easy: please replace the following lines in ParseData.write():

   out.writeInt(metadata.size());               // write metadata
   Iterator i = metadata.entrySet().iterator();
   while (i.hasNext()) {
     Map.Entry e = (Map.Entry)i.next();
     UTF8.writeString(out, (String)e.getKey());
     UTF8.writeString(out, (String)e.getValue());
   }

with this:

   metadata.write(out);

and the same for reading the metadata field; replace this in ParseData.readFields():


   int propertyCount = in.readInt(); // read metadata
   metadata = new ContentProperties();
   for (int i = 0; i < propertyCount; i++) {
     metadata.put(UTF8.readString(in), UTF8.readString(in));
   }

with this:

   metadata = new ContentProperties();
   metadata.readFields(in);

Compile, deploy, test, report ... :-) Please note that this changes the on-disk segment format, so you won't be able to read the old segments with the new code. You may want to bump the ParseData.VERSION, and leave this code to handle older versions...






Re: Class Cast exception

2006-01-06 Thread Matt Zytaruk

Here you go.

java.lang.ClassCastException: java.util.ArrayList
   at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
   at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
   at org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57)
   at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:168)
   at org.apache.nutch.mapred.MapTask$1.collect(MapTask.java:78)
   at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:229)
   at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:123)


Andrzej Bialecki wrote:


Matt Zytaruk wrote:

The newest trunk source (as of this morning) is occasionally giving ClassCastExceptions when doing a crawl with parsing (and by occasionally I mean this was the only page out of the small list I crawled that it happened on). This is with nothing changed from the defaults, on a server running SuSE Linux. Here is a sample of the logging:


060106 111516 Parsing [http://easily.co.uk/] with 
[EMAIL PROTECTED]

060106 111516 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060106 111516 fetch of http://easily.co.uk/ failed with: 
java.lang.ClassCastException: java.util.ArrayList


-Matt Zytaruk




Could you please add a call to printStackTrace() in that catch{} 
statement, so that we know where the exception is thrown?






Class Cast exception

2006-01-06 Thread Matt Zytaruk
The newest trunk source (as of this morning) is occasionally giving ClassCastExceptions when doing a crawl with parsing (and by occasionally I mean this was the only page out of the small list I crawled that it happened on). This is with nothing changed from the defaults, on a server running SuSE Linux. Here is a sample of the logging:


060106 111516 Parsing [http://easily.co.uk/] with 
[EMAIL PROTECTED]

060106 111516 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060106 111516 fetch of http://easily.co.uk/ failed with: 
java.lang.ClassCastException: java.util.ArrayList


-Matt Zytaruk



Site Query Filter Bug?

2005-11-07 Thread Matt Zytaruk

Hello there,

I've been trying to get domain-specific searches working using the Site Query Filter, and it seems there is a bug somewhere, as it just won't return any results. I changed the code in the BasicIndexingFilter to make the site field stored instead of just indexed, and suddenly searches using the Site Query Filter work fine. Are you supposed to be able to search on un-stored fields? If not, why is the Site Query Filter plugin enabled by default if the indexer doesn't store the site field information?
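For illustration, a rough sketch of that kind of change, assuming a Lucene Field constructor with Store/Index constants (older Lucene versions use boolean store/index/token flags instead, so the exact call in BasicIndexingFilter may differ):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class SiteFieldSketch {
      // Hypothetical helper: add the page's host to the index as the "site" field.
      public static void addSiteField(Document doc, String host) {
        // Indexed but not stored (roughly what the default filter does):
        // doc.add(new Field("site", host, Field.Store.NO, Field.Index.UN_TOKENIZED));

        // Stored and indexed, untokenized so a site:example.com query matches exactly:
        doc.add(new Field("site", host, Field.Store.YES, Field.Index.UN_TOKENIZED));
      }
    }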


This is with the 0.8-dev version. Thanks for your help.

-Matt Zytaruk


Fetcher Speed Issues

2005-10-06 Thread Matt Zytaruk
Hi there, I just started working on a search engine based on the Nutch project, but we are finding that the fetcher is crawling extremely slowly. I've seen posts about people maxing out their 5 Mb lines with the fetcher, but we can't seem to get any more than about 20 KB/s or 1.5 pages/second, which isn't even a smidgen of our capacity, even with -threads set to 200. This is using the mapred branch, on FreeBSD 4.


Are there any settings we might be missing that would cause this slowdown, or are there certain network configurations that could be causing it?
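For reference, a hedged sketch of the nutch-site.xml overrides that usually bound fetch throughput; the property names are recalled from nutch-default.xml of roughly this era and should be verified there, since the per-host politeness delay, rather than raw thread count, is typically what caps a crawl over a small set of hosts:

    <!-- Sketch only: verify these names and defaults against nutch-default.xml. -->
    <property>
      <name>fetcher.threads.fetch</name>
      <value>200</value>
      <description>Total fetcher threads to run per fetch task.</description>
    </property>
    <property>
      <name>fetcher.server.delay</name>
      <value>1.0</value>
      <description>Seconds to wait between successive requests to the same host.</description>
    </property>
    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>
      <description>Maximum concurrent requests to any single host.</description>
    </property>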


Also, is the -numFetchers option in 'nutch generate' broken in the mapred branch? It worked fine in 0.7, but doesn't seem to do anything in 0.8-dev.


Thanks a lot for your help.

Matt Zytaruk


Fetch Speed Issues

2005-10-05 Thread Matt Zytaruk
Hi there, I just started working on a search engine based on the Nutch project, but we are finding that the fetcher is crawling extremely slowly. I've seen posts about people maxing out their 5 Mb lines with the fetcher, but we can't seem to get any more than about 20 KB/s or 1.5 pages/second, which isn't even a smidgen of our capacity, even with -threads set to 200. This is using the mapred branch, by the way.


Are there any settings we might be missing that would cause this slowdown, or are there certain network configurations that could be causing it?


Also, is the -numFetchers option in 'nutch generate' broken in the mapred branch? It worked fine in 0.7, but doesn't seem to do anything in 0.8-dev.


Thanks a lot for your help.

Matt Zytaruk