Re: [Nutch-cvs] svn commit: r397320 - /lucene/nutch/trunk/src/plugin/parse-oo/plugin.xml

2006-04-27 Thread Jérôme Charron
  parse-oo plugin manifest is valid with plugin.dtd
 Oops, I didn't catch that... Thanks!

No problem, Andrzej.
It is just a cosmetic change, since the plugin.xml files are not validated at
runtime (it is on my todo list), and the contentType and pathSuffix parameters
are more or less deprecated.
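
For the record, a runtime check of a manifest against plugin.dtd could look
roughly like the sketch below. It uses the standard JAXP validating parser and
assumes the manifest declares a DOCTYPE pointing at plugin.dtd; the class name
and error handling are purely illustrative, not the actual Nutch plugin loader:

  import java.io.File;
  import javax.xml.parsers.DocumentBuilder;
  import javax.xml.parsers.DocumentBuilderFactory;
  import org.xml.sax.ErrorHandler;
  import org.xml.sax.SAXParseException;

  // Illustrative only: validate a plugin.xml against the DTD it declares.
  public class ValidateManifest {
    public static void main(String[] args) throws Exception {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setValidating(true);                 // enable DTD validation
      DocumentBuilder builder = factory.newDocumentBuilder();
      builder.setErrorHandler(new ErrorHandler() { // surface validation problems
        public void warning(SAXParseException e)    { System.err.println("WARN:  " + e.getMessage()); }
        public void error(SAXParseException e)      { System.err.println("ERROR: " + e.getMessage()); }
        public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
      });
      builder.parse(new File(args[0]));            // e.g. src/plugin/parse-oo/plugin.xml
      System.out.println("parsed " + args[0]);
    }
  }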

Jérôme


--
http://motrech.free.fr/
http://www.frutch.org/


RE: exception

2006-04-27 Thread anton
We updated Hadoop from the trunk branch, but now we get new errors:

On the tasktracker side:
skipped
java.io.IOException: timed out waiting for response
    at org.apache.hadoop.ipc.Client.call(Client.java:305)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:149)
    at org.apache.hadoop.mapred.$Proxy0.pollForTaskWithClosedJob(Unknown Source)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:310)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:374)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:813)
060427 062708 Client connection to 10.0.0.10:9001 caught: java.lang.RuntimeException: java.lang.ClassNotFoundException:
java.lang.RuntimeException: java.lang.ClassNotFoundException:
    at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:152)
    at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:139)
    at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:186)
    at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:60)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:170)
060427 062708 Client connection to 10.0.0.10:9001: closing


On the jobtracker side:
skipped
060427 061713 Server handler 3 on 9001 caught: java.lang.IllegalArgumentException: Argument is not an array
java.lang.IllegalArgumentException: Argument is not an array
    at java.lang.reflect.Array.getLength(Native Method)
    at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:92)
    at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:250)
skipped

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 27, 2006 12:48 AM
To: nutch-dev@lucene.apache.org
Subject: Re: exception
Importance: High

This is a Hadoop DFS error.  It could mean that you don't have any 
datanodes running, or that all your datanodes are full.  Or, it could be 
a bug in dfs.  You might try a recent nightly build of Hadoop to see if 
it works any better.

Doug

Anton Potehin wrote:
 What does an error of the following type mean:

 java.rmi.RemoteException: java.io.IOException: Cannot obtain additional
 block for file /user/root/crawl/indexes/index/_0.prx




Re: Content-Type inconsistency?

2006-04-27 Thread Jérôme Charron
 Are you mainly concerned with charset in Content-Type?

Not specifically.
But while looking at these content-type inconsistencies, I noticed that there
are some possible troubles with the charset in the content-type.


 Currently, what happens when Content-Type exists in both the HTTP layer and in
 the META tag (if the content is HTML)?

We cannot use the one in the META tags: to extract it, we would first need to
know to use the HTML parser.
Only the HTTP header is used.
It is then checked/guessed using the mime-type repository (a mime-type
database that maps each mime-type to its associated file extensions and,
optionally, some magic bytes).
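
As a rough illustration of the kind of lookup such a repository performs (the
table entries, magic prefixes, and method names below are hypothetical, not
Nutch's actual mime-type code):

  import java.util.Arrays;
  import java.util.HashMap;
  import java.util.Map;

  // Hypothetical sketch of extension + magic-byte content-type guessing.
  public class MimeGuessSketch {

    private static final Map<String, String> BY_EXTENSION = new HashMap<String, String>();
    static {
      BY_EXTENSION.put("html", "text/html");
      BY_EXTENSION.put("zip", "application/zip");
    }

    private static final byte[] ZIP_MAGIC  = { 'P', 'K', 0x03, 0x04 };
    private static final byte[] HTML_MAGIC = "<htm".getBytes();

    static String guess(String url, byte[] content, String headerType) {
      // 1. magic bytes describe what the content actually is
      if (startsWith(content, ZIP_MAGIC))  return "application/zip";
      if (startsWith(content, HTML_MAGIC)) return "text/html";
      // 2. fall back to the file extension found in the URL
      int dot = url.lastIndexOf('.');
      if (dot >= 0) {
        String byExt = BY_EXTENSION.get(url.substring(dot + 1).toLowerCase());
        if (byExt != null) return byExt;
      }
      // 3. otherwise trust the HTTP Content-Type header
      return headerType;
    }

    private static boolean startsWith(byte[] content, byte[] magic) {
      return content.length >= magic.length
          && Arrays.equals(Arrays.copyOf(content, magic.length), magic);
    }

    public static void main(String[] args) {
      byte[] fakeZip = "<html><body>not really a zip</body></html>".getBytes();
      // prints text/html although the server said application/zip
      System.out.println(guess("http://example.com/fake.zip", fakeZip, "application/zip"));
    }
  }

In the fake.zip example below, the magic-byte step disagrees with the server
header, which is exactly where the inconsistency shows up.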

 How does Nutch guess Content-Type, and when does it need to do that?

See my response above


 Is there a situation where the guessed content-type differs from the
 content-type in the metadata?

From the one in the headers: yes (mainly when the server is badly configured).


Here is an easy way to reproduce what I mean by content-type inconsistency:
1. Perform a crawl of the following URL:
http://jerome.charron.free.fr/nutch/fake.zip
(fake.zip is a fake zip file; in fact it is an HTML one)
2. While crawling, you can see that the content-type returned by the server
is application/zip (you can also check this header with the small sketch after this list)
3. But you can see that Nutch correctly guesses the content-type as text/html
(it uses the HtmlParser)
4. At this step, all is OK.
5. Then start your Tomcat and try the following search: zip
6. You can see the fake.zip file in the results. Click on details; if the
index-more plugin was activated, then you can see that the stored
content-type is application/zip and not text/html.
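
For step 2, a tiny standalone way to dump the Content-Type the server
advertises (plain java.net API, nothing Nutch-specific; the class name is just
for illustration):

  import java.net.HttpURLConnection;
  import java.net.URL;

  // Print the server-advertised Content-Type for a URL (step 2 above).
  public class ShowContentType {
    public static void main(String[] args) throws Exception {
      URL url = new URL(args.length > 0 ? args[0]
          : "http://jerome.charron.free.fr/nutch/fake.zip");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setRequestMethod("HEAD");               // the headers are enough here
      System.out.println(conn.getResponseCode() + " " + url);
      System.out.println("Content-Type: " + conn.getContentType());
      conn.disconnect();
    }
  }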

What I suggest is simply to use the content-type that Nutch determined when
choosing the parser, instead of the one returned by the server.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: exception

2006-04-27 Thread Doug Cutting

[EMAIL PROTECTED] wrote:

We updated Hadoop from the trunk branch, but now we get new errors:


Oops.  Looks like I introduced a bug yesterday.  Let me fix it...

Sorry,

Doug


TRUNK IllegalArgumentException: Argument is not an array (WAS: Re: exception)

2006-04-27 Thread Michael Stack
I'm getting the same error as Anton (below) trying to launch a new job with the
latest from TRUNK.


Logic in ObjectWritable#readObject seems a little off.  On the way in
we test for a null instance; if null, we set it to NullWritable.


Next we test declaredClass to see if it's an array.  We then try to do an
Array.getLength on instance -- which we've set above to NullWritable.


Looks like we should test instance to see if it's NullWritable before we
do the Array.getLength (or do the instance null check later).
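
For what it's worth, here is a minimal standalone sketch of the ordering
problem described above; the placeholder object and method names are
hypothetical, not the actual Hadoop ObjectWritable code:

  import java.lang.reflect.Array;

  // Hypothetical illustration of the check-ordering bug: substituting a
  // placeholder for a null instance *before* the array branch means
  // Array.getLength() is later handed a non-array and throws
  // IllegalArgumentException ("Argument is not an array").
  public class OrderingSketch {

    private static final Object NULL_PLACEHOLDER = new Object(); // stands in for NullWritable

    // Buggy order: replace null first, then treat declaredClass as an array.
    static int buggy(Object instance, Class<?> declaredClass) {
      if (instance == null) {
        instance = NULL_PLACEHOLDER;
      }
      if (declaredClass.isArray()) {
        return Array.getLength(instance);   // throws: the placeholder is not an array
      }
      return -1;
    }

    // Fixed order: decide on the array branch before falling back to the placeholder.
    static int fixed(Object instance, Class<?> declaredClass) {
      if (declaredClass.isArray() && instance != null) {
        return Array.getLength(instance);
      }
      if (instance == null) {
        instance = NULL_PLACEHOLDER;        // null handled after the array check
      }
      return -1;
    }

    public static void main(String[] args) {
      System.out.println(fixed(null, String[].class));  // -1, no exception
      System.out.println(buggy(null, String[].class));  // throws IllegalArgumentException
    }
  }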


Hope above helps,
St.Ack



[EMAIL PROTECTED] wrote:

We updated Hadoop from the trunk branch, but now we get new errors:

On the tasktracker side:
skipped
java.io.IOException: timed out waiting for response
    at org.apache.hadoop.ipc.Client.call(Client.java:305)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:149)
    at org.apache.hadoop.mapred.$Proxy0.pollForTaskWithClosedJob(Unknown Source)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:310)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:374)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:813)
060427 062708 Client connection to 10.0.0.10:9001 caught: java.lang.RuntimeException: java.lang.ClassNotFoundException:
java.lang.RuntimeException: java.lang.ClassNotFoundException:
    at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:152)
    at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:139)
    at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:186)
    at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:60)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:170)
060427 062708 Client connection to 10.0.0.10:9001: closing

On the jobtracker side:
skipped
060427 061713 Server handler 3 on 9001 caught: java.lang.IllegalArgumentException: Argument is not an array
java.lang.IllegalArgumentException: Argument is not an array
    at java.lang.reflect.Array.getLength(Native Method)
    at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:92)
    at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:250)
skipped


Re: TRUNK IllegalArgumentException: Argument is not an array (WAS: Re: exception)

2006-04-27 Thread Doug Cutting

I just fixed this.  Sorry for the inconvenience!

Doug



Re: Content-Type inconsistency?

2006-04-27 Thread Jérôme Charron
 I'm not sure if that is the right thing.
 If the site administrator did a poor job and a wrong media type is
 advertised, it's the site's problem and Nutch shouldn't be fixing it, in my
 opinion.  Those sites would not work properly with browsers anyway, and
 Nutch doesn't need to work properly with them either, except that it should
 protect itself from crashing.  I tried to visit your fake.zip page with IE
 and Firefox, and both faithfully trusted the media type as advertised by the
 server, and asked me if I wanted to open it with WinZip or save it; there
 was no option to open it as HTML.
 Why should Nutch treat it as HTML?

Simply because it is an HTML file: with a strange name, of course, but it is
an HTML file.
My example is a kind of caricature, but a more realistic case would be an
HTML file served with a text/plain or a text/xml content-type.
Finally, it is good news that Nutch seems to be more intelligent about
content-type guessing than Firefox or IE, no?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Content-Type inconsistency?

2006-04-27 Thread Doug Cutting

Jérôme Charron wrote:

Finally, it is good news that Nutch seems to be more intelligent about
content-type guessing than Firefox or IE, no?


I'm not so sure.  When crawling Apache we had trouble with this feature.
Some HTML files that had an XML header and that the server identified as
text/html, Nutch decided to treat as XML, not HTML (an XHTML page typically
begins with an <?xml ...?> declaration, which matches the XML magic pattern
even though the document is really HTML).  We had to turn off the guessing of
content types to index Apache correctly.  I think we shouldn't aim to guess
things any more than a browser does.  If browsers require standards
compliance, then our lives will be simpler.


Doug


[jira] Created: (NUTCH-256) Cannot open filename ....index.done.crc

2006-04-27 Thread [EMAIL PROTECTED] (JIRA)
Cannot open filename index.done.crc
---

 Key: NUTCH-256
 URL: http://issues.apache.org/jira/browse/NUTCH-256
 Project: Nutch
Type: Bug

  Components: indexer  
Versions: 0.8-dev
Reporter: [EMAIL PROTECTED]
Priority: Minor


Trying to copy indices out of DFS I always get:

[bregeon] workspace  ./hadoop/bin/hadoop dfs -get outputs .
060427 160317 parsing file:/home/stack/workspace/hadoop-local-conf/hadoop-default.xml
060427 160317 parsing file:/home/stack/workspace/hadoop-local-conf/hadoop-site.xml
060427 160318 No FS indicated, using default:localhost:9001
060427 160318 Client connection to 127.0.0.1:9001: starting
060427 160318 Problem opening checksum file: /user/stack/outputs/indexes/part-0/index.done.  Ignoring with exception
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename /user/stack/outputs/indexes/part-0/.index.done.crc
    at org.apache.hadoop.dfs.NameNode.open(NameNode.java:130)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:589)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:240)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:218)



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-256) Cannot open filename ....index.done.crc

2006-04-27 Thread [EMAIL PROTECTED] (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-256?page=all ]

[EMAIL PROTECTED] updated NUTCH-256:


Attachment: index.done.crc.patch

Ensure creation of the companion .crc file for index.done

 Cannot open filename index.done.crc
 ---

  Key: NUTCH-256
  URL: http://issues.apache.org/jira/browse/NUTCH-256
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: [EMAIL PROTECTED]
 Priority: Minor
  Attachments: index.done.crc.patch

