Re: [Nutch-cvs] svn commit: r397320 - /lucene/nutch/trunk/src/plugin/parse-oo/plugin.xml
> > parse-oo plugin manifest is valid with plugin.dtd
>
> Oops, I didn't catch that... Thanks!

No problem, Andrzej. It is just a cosmetic change, since the plugin.xml files are not validated at runtime (adding that validation is on my todo list), and the contentType and pathSuffix parameters are more or less deprecated.

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
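As an aside, runtime validation of a manifest against plugin.dtd (the todo mentioned above) could look roughly like the sketch below. It assumes the manifest declares a DOCTYPE pointing at plugin.dtd; the class is purely illustrative, not Nutch's actual plugin loader code.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.xml.sax.ErrorHandler;
    import org.xml.sax.SAXException;
    import org.xml.sax.SAXParseException;

    // Hypothetical validator (not part of Nutch): parses a plugin.xml with
    // DTD validation on and fails loudly on any violation of plugin.dtd.
    public class PluginManifestValidator {
      public static void validate(File pluginXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);  // honor the DOCTYPE declared in the manifest
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
          public void warning(SAXParseException e) {
            System.err.println("warning: " + e.getMessage());
          }
          public void error(SAXParseException e) throws SAXException {
            throw e;  // validation error: reject the manifest
          }
          public void fatalError(SAXParseException e) throws SAXException {
            throw e;  // malformed XML: reject the manifest
          }
        });
        builder.parse(pluginXml);  // throws if plugin.xml violates plugin.dtd
      }
    }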
RE: exception
We updated Hadoop from the trunk branch. But now we get new errors.

On the tasktracker side:

[skipped]
java.io.IOException: timed out waiting for response
        at org.apache.hadoop.ipc.Client.call(Client.java:305)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:149)
        at org.apache.hadoop.mapred.$Proxy0.pollForTaskWithClosedJob(Unknown Source)
        at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:310)
        at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:374)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:813)
060427 062708 Client connection to 10.0.0.10:9001 caught: java.lang.RuntimeException: java.lang.ClassNotFoundException:
java.lang.RuntimeException: java.lang.ClassNotFoundException:
        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:152)
        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:139)
        at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:186)
        at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:60)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:170)
060427 062708 Client connection to 10.0.0.10:9001: closing

On the jobtracker side:

[skipped]
060427 061713 Server handler 3 on 9001 caught: java.lang.IllegalArgumentException: Argument is not an array
java.lang.IllegalArgumentException: Argument is not an array
        at java.lang.reflect.Array.getLength(Native Method)
        at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:92)
        at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:250)
[skipped]

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 27, 2006 12:48 AM
To: nutch-dev@lucene.apache.org
Subject: Re: exception
Importance: High

This is a Hadoop DFS error. It could mean that you don't have any datanodes running, or that all your datanodes are full. Or, it could be a bug in DFS. You might try a recent nightly build of Hadoop to see if it works any better.

Doug

Anton Potehin wrote:
> What does an error of the following type mean:
>
> java.rmi.RemoteException: java.io.IOException: Cannot obtain additional block for file /user/root/crawl/indexes/index/_0.prx
Re: Content-Type inconsistency?
> Are you mainly concerned with charset in Content-Type?

Not specifically. But while looking at this content-type inconsistency, I noticed that there is also some possible trouble with the charset in the content-type.

> Currently, what happens when Content-Type exists in both the HTTP layer
> and in a META tag (if the content is HTML)?

We cannot use the one in the meta tags: to extract it, we would first need to know to use the HTML parser. Only the HTTP header is used. It is then checked/guessed using the mime-type repository (a mime-type database that maps each mime type to its associated file extensions and, optionally, some magic bytes).

> How does Nutch guess Content-Type, and when does it need to do that?

See my response above.

> Is there a situation where the guessed content-type differs from the
> content-type in the metadata?

From the one in the headers: yes (mainly when the server is badly configured).

Here is an easy way to reproduce what I mean by content-type inconsistency:

1. Perform a crawl of the following URL: http://jerome.charron.free.fr/nutch/fake.zip (fake.zip is a fake zip file; in fact it is an HTML one).
2. While crawling, you can see that the content-type returned by the server is application/zip.
3. But you can see that Nutch correctly guesses the content-type as text/html (it uses the HtmlParser).
4. At this step, all is OK.
5. Then start your Tomcat and try the following search: zip
6. You can see the fake.zip file in the results. Click on "details"; if the index-more plugin was activated, you can see that the stored content-type is application/zip and not text/html.

What I suggest is simply to use the content-type that Nutch used to select the parser, instead of the one returned by the server.

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
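To make the magic-bytes part concrete, here is a minimal toy sketch of that kind of check. It is not Nutch's actual mime-type repository API, and real sniffers are fuzzier (HTML may start with a doctype or whitespace, for instance); the two signatures shown are real file magics.

    // Toy magic-byte sniffer (illustration only, not Nutch code): compare the
    // first bytes of the fetched content against known signatures, and fall
    // back to the server-declared Content-Type when nothing matches.
    public class MagicSniffer {
      private static final byte[] ZIP_MAGIC  = { 0x50, 0x4B, 0x03, 0x04 }; // "PK\3\4"
      private static final byte[] HTML_MAGIC = { '<', 'h', 't', 'm', 'l' };

      public static String guess(byte[] content, String headerType) {
        if (startsWith(content, ZIP_MAGIC))  return "application/zip";
        if (startsWith(content, HTML_MAGIC)) return "text/html";
        return headerType;
      }

      private static boolean startsWith(byte[] content, byte[] magic) {
        if (content.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
          if (content[i] != magic[i]) return false;
        }
        return true;
      }
    }

For the fake.zip example above, the first bytes of the content match the HTML signature rather than the ZIP one, so this kind of check yields text/html despite the server's application/zip header.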
Re: exception
[EMAIL PROTECTED] wrote:
> We updated Hadoop from the trunk branch. But now we get new errors:

Oops. Looks like I introduced a bug yesterday. Let me fix it...

Sorry,
Doug
TRUNK IllegalArgumentException: Argument is not an array (WAS: Re: exception)
I'm getting the same as Anton below, trying to launch a new job with the latest from TRUNK.

The logic in ObjectWritable#readObject seems a little off. On the way in we test for a null instance. If it is null, we set it to NullWritable. Next we test declaredClass to see if it's an array. We then try to do an Array.getLength on instance -- which we've above set to NullWritable. It looks like we should test instance to see if it's NullWritable before we do the Array.getLength (or do the instance null check later).

Hope the above helps,
St.Ack

[EMAIL PROTECTED] wrote:
> We updated Hadoop from the trunk branch. But now we get new errors:
> [tasktracker and jobtracker stack traces quoted in full in the earlier message]
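To spell that out, here is a small compilable paraphrase of the flaw (not the literal Hadoop source; the stand-in object plays the NullWritable role, and the method name mirrors the jobtracker trace):

    import java.lang.reflect.Array;

    public class ObjectWritableFlawSketch {
      // Stand-in for the NullWritable instance substituted for a null value.
      static final Object NULL_STAND_IN = new Object();

      static void writeObject(Object instance, Class<?> declaredClass) {
        if (instance == null) {
          instance = NULL_STAND_IN;      // null replaced *before* the array check...
        }
        if (declaredClass.isArray()) {
          // ...so a null array reaches Array.getLength as the stand-in object,
          // which is not an array: IllegalArgumentException, as in the trace.
          int length = Array.getLength(instance);
          System.out.println("array length: " + length);
        }
      }

      public static void main(String[] args) {
        writeObject(null, int[].class);  // reproduces "Argument is not an array"
      }
    }

The suggested fix is exactly the reordering described above: either check for the null stand-in before calling Array.getLength, or do the null substitution after the array branch.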
Re: TRUNK IllegalArgumentException: Argument is not an array (WAS: Re: exception)
I just fixed this. Sorry for the inconvenience!

Doug

Michael Stack wrote:
> I'm getting the same as Anton below, trying to launch a new job with the
> latest from TRUNK. The logic in ObjectWritable#readObject seems a little
> off. [analysis and quoted stack traces snipped; see the preceding messages]
Re: Content-Type inconsistency?
> I'm not sure if that is the right thing. If the site administrator did a
> poor job and a wrong media type is advertised, it's the site's problem,
> and Nutch shouldn't be fixing it, in my opinion. Those sites would not
> work properly with browsers anyway, and Nutch doesn't need to work
> properly either, except that it should protect itself from crashing. I
> tried to visit your fake.zip page with IE and Firefox, and both
> faithfully trusted the media type as advertised by the server and asked
> me if I wanted to open it with WinZip or save it; there was no option to
> open it as HTML. Why should Nutch treat it as HTML?

Simply because it is an HTML file. With a strange name, of course, but it is an HTML file. My example is a kind of caricature, but some more realistic cases would be an HTML file served with a text/plain content-type, or with a text/xml one.

Finally, it is good news that Nutch seems to be more intelligent about content-type guessing than Firefox or IE, no?

Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Content-Type inconsistency?
Jérôme Charron wrote:
> Finally, it is good news that Nutch seems to be more intelligent about
> content-type guessing than Firefox or IE, no?

I'm not so sure. When crawling Apache we had trouble with this feature. Some HTML files that had an XML header, and that the server identified as text/html, Nutch decided to treat as XML, not HTML. We had to turn off the guessing of content types to index Apache correctly. I think we shouldn't aim to guess things any more than a browser does. If browsers require standards compliance, then our lives will be simpler.

Doug
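(For anyone who needs to do the same: in builds of that era, magic-based detection could reportedly be switched off with a configuration override along these lines in conf/nutch-site.xml. The property name here is an assumption; check nutch-default.xml for the exact name in your version.)

    <!-- Property name assumed from nutch-default.xml; verify in your build. -->
    <property>
      <name>mime.type.magic</name>
      <value>false</value>
      <description>Disable magic-byte content-type resolution.</description>
    </property>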
[jira] Created: (NUTCH-256) Cannot open filename ....index.done.crc
Cannot open filename ....index.done.crc
---------------------------------------

         Key: NUTCH-256
         URL: http://issues.apache.org/jira/browse/NUTCH-256
     Project: Nutch
        Type: Bug
  Components: indexer
    Versions: 0.8-dev
    Reporter: [EMAIL PROTECTED]
    Priority: Minor

Trying to copy indices out of DFS, I always get:

[bregeon] workspace ./hadoop/bin/hadoop dfs -get outputs .
060427 160317 parsing file:/home/stack/workspace/hadoop-local-conf/hadoop-default.xml
060427 160317 parsing file:/home/stack/workspace/hadoop-local-conf/hadoop-site.xml
060427 160318 No FS indicated, using default:localhost:9001
060427 160318 Client connection to 127.0.0.1:9001: starting
060427 160318 Problem opening checksum file: /user/stack/outputs/indexes/part-0/index.done. Ignoring with exception org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename /user/stack/outputs/indexes/part-0/.index.done.crc
        at org.apache.hadoop.dfs.NameNode.open(NameNode.java:130)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:589)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:240)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:218)
[jira] Updated: (NUTCH-256) Cannot open filename ....index.done.crc
     [ http://issues.apache.org/jira/browse/NUTCH-256?page=all ]

[EMAIL PROTECTED] updated NUTCH-256:
------------------------------------

    Attachment: index.done.crc.patch

Ensure creation of the companion index.done .crc file.

> Cannot open filename ....index.done.crc
> ---------------------------------------
>
>          Key: NUTCH-256
>          URL: http://issues.apache.org/jira/browse/NUTCH-256
>      Project: Nutch
>         Type: Bug
>   Components: indexer
>     Versions: 0.8-dev
>     Reporter: [EMAIL PROTECTED]
>     Priority: Minor
> Attachments: index.done.crc.patch
>
> [issue description and stack trace as in the message above]
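The idea behind the patch (the attachment is the authority; this is only a sketch against the 0.x-era Hadoop API, with a made-up helper class) is to create the index.done marker through the checksummed FileSystem so that its companion .crc file gets written as well:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper illustrating the patch's intent, not the patch
    // itself: touching the marker via FileSystem.create lets the checksummed
    // filesystem emit the companion .index.done.crc alongside index.done.
    public class DoneMarker {
      public static void touchDone(FileSystem fs, Path indexDir) throws IOException {
        Path done = new Path(indexDir, "index.done");
        fs.create(done).close();  // zero-length marker plus its .crc companion
      }
    }

With the marker written this way, `hadoop dfs -get` finds the checksum file it expects instead of logging the "Cannot open filename ....index.done.crc" exception.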