Re: Error in 0.8 regex-urlfilter.txt
Thanks, committed. Otis

----- Original Message -----
From: Matthew Holt [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org
Sent: Wednesday, August 9, 2006 9:51:19 AM
Subject: Error in 0.8 regex-urlfilter.txt

I was doing a search and noticed that a 'png' file had been indexed. I checked crawl-urlfilter.txt, and it had the following line preventing png files from being indexed:

-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

I then looked at regex-urlfilter.txt; the line was similar, but lacked the 'png' extension, so the recrawl was indexing the png files. The original regex-urlfilter.txt line is below:

-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

It needs to be modified in trunk to match the line from crawl-urlfilter.txt.

Matt
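As a quick sanity check that the corrected suffix pattern really excludes png URLs, here is a standalone sketch exercising the regex line quoted above (the example URLs are made up for illustration):

```java
import java.util.regex.Pattern;

public class PngFilterCheck {
    public static void main(String[] args) {
        // Suffix pattern from crawl-urlfilter.txt, with png included:
        Pattern skip = Pattern.compile(
            "\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
            + "|gz|rpm|tgz|mov|MOV|exe|png)$");
        // A URL ending in .png now matches, so the filter excludes it:
        System.out.println(skip.matcher("http://example.com/logo.png").find());
        // Ordinary pages do not match and are kept:
        System.out.println(skip.matcher("http://example.com/index.html").find());
    }
}
```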
Re: Error with Hadoop-0.4.0
Gal Nitzan wrote:
> To get the same behavior, just try to inject into a new crawldb that doesn't exist. The reason many don't see it is that the crawldb already exists in their environment.

True, I was injecting into an existing crawldb.

--
Sami Siren
Re: Error with Hadoop-0.4.0
Doug Cutting wrote:
> Jérôme Charron wrote:
>> In my environment, the crawl command terminates with the following error:
>>
>> 2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
>> Exception in thread main java.io.IOException: Input directory /localpathcrawl/crawldb/current in local is invalid.
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>>         at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>
> Hadoop 0.4.0 by default requires all input directories to exist, where previous releases did not. So we need to either create an empty current directory or change the InputFormat used in CrawlDb.createJob() to one that overrides InputFormat.areValidInputDirectories(). The former is probably easier. I've attached a patch. Does this fix things for folks?

The patch works for me.

--
Sami Siren
Re: Error with Hadoop-0.4.0
Sami Siren wrote:
> Patch works for me.

OK. I just committed it. Thanks!

Doug
Re: Error with Hadoop-0.4.0
Stefan Groschupf wrote:
> We tried your suggested fix:
>> changing Injector to mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir))

I suspect that this is not the right solution - have you actually tested that the resulting db contains all entries from the input dirs?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
RE: Error with Hadoop-0.4.0
To get the same behavior, just try to inject into a new crawldb that doesn't exist. The reason many don't see it is that the crawldb already exists in their environment.

-----Original Message-----
From: Sami Siren [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 06, 2006 7:23 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Error with Hadoop-0.4.0

Jérôme Charron wrote:
> Hi,
> I encountered some problems with the Nutch trunk version. In fact they seem to be related to the Hadoop-0.4.0 and JDK 1.5 changes (more precisely, since HADOOP-129 and the replacement of File by Path).
> Does somebody have the same error?

I am not seeing this (just ran inject on a single-machine (Linux) configuration, local fs, without problems).

--
Sami Siren
Re: Error with Hadoop-0.4.0
Jérôme Charron wrote:
> In my environment, the crawl command terminates with the following error:
>
> 2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
> Exception in thread main java.io.IOException: Input directory /localpathcrawl/crawldb/current in local is invalid.
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

Hadoop 0.4.0 by default requires all input directories to exist, where previous releases did not. So we need to either create an empty current directory or change the InputFormat used in CrawlDb.createJob() to one that overrides InputFormat.areValidInputDirectories(). The former is probably easier. I've attached a patch. Does this fix things for folks?

Doug

Index: src/java/org/apache/nutch/crawl/CrawlDb.java
===================================================================
--- src/java/org/apache/nutch/crawl/CrawlDb.java	(revision 417882)
+++ src/java/org/apache/nutch/crawl/CrawlDb.java	(working copy)
@@ -65,7 +65,8 @@
     if (LOG.isInfoEnabled()) { LOG.info("CrawlDb update: done"); }
   }
 
-  public static JobConf createJob(Configuration config, Path crawlDb) {
+  public static JobConf createJob(Configuration config, Path crawlDb)
+    throws IOException {
     Path newCrawlDb =
       new Path(crawlDb,
                Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
@@ -73,7 +74,11 @@
     JobConf job = new NutchJob(config);
     job.setJobName("crawldb " + crawlDb);
-    job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
+
+    Path current = new Path(crawlDb, CrawlDatum.DB_DIR_NAME);
+    if (FileSystem.get(job).exists(current)) {
+      job.addInputPath(current);
+    }
     job.setInputFormat(SequenceFileInputFormat.class);
     job.setInputKeyClass(UTF8.class);
     job.setInputValueClass(CrawlDatum.class);
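The essence of the patch is an existence check before registering the input directory. Below is a minimal standalone sketch of that guard, using java.io.File as a stand-in for Hadoop's FileSystem and Path so it runs without Hadoop on the classpath (in the real code the directory name "current" is CrawlDatum.DB_DIR_NAME and the check is FileSystem.get(job).exists(current)):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class InputDirGuard {
    // Mirrors the patched CrawlDb.createJob(): only add the crawldb's
    // "current" directory as a job input if it actually exists, since
    // Hadoop 0.4.0 rejects jobs whose input directories are missing.
    static List<String> inputPaths(String crawlDb) {
        List<String> inputs = new ArrayList<String>();
        File current = new File(crawlDb, "current");
        if (current.exists()) {        // the guard added by the patch
            inputs.add(current.getPath());
        }
        return inputs;
    }

    public static void main(String[] args) {
        // Injecting into a brand-new crawldb: there is no "current"
        // directory yet, so the job simply gets no input paths
        // instead of failing with "Input directory ... is invalid".
        System.out.println(inputPaths("/tmp/fresh-crawldb"));
    }
}
```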
Re: Error with Hadoop-0.4.0
Hi Jérôme,
I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug. We should fix that.

Stefan

On 06.07.2006, at 08:54, Jérôme Charron wrote:
> Hi,
> I encountered some problems with the Nutch trunk version. In fact they seem to be related to the Hadoop-0.4.0 and JDK 1.5 changes (more precisely, since HADOOP-129 and the replacement of File by Path).
> In my environment, the crawl command terminates with the following error:
>
> 2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
> Exception in thread main java.io.IOException: Input directory /localpathcrawl/crawldb/current in local is invalid.
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>
> By looking at the Nutch code, and simply changing line 145 of Injector to mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)), everything works fine.
> By taking a closer look at the CrawlDb code, I finally don't understand why the following line is in the createJob method:
> job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
> Out of curiosity, if a Hadoop guru can explain why there is such a regression...
> Does somebody have the same error?
>
> Regards
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
Re: Error with Hadoop-0.4.0
> I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug.

Thanks for this feedback Stefan.

> We should fix that.

What I suggest is simply to remove line 75 in the createJob method of CrawlDb:
setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
In fact, this method is only used by Injector.inject() and CrawlDb.update(), and the input path set in createJob is needed by neither Injector.inject() nor CrawlDb.update(). If there is no objection, I will commit this change tomorrow.

Regards
Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: Error with Hadoop-0.4.0
We tried your suggested fix (changing Injector to mergeJob.setInputPath(tempDir) instead of mergeJob.addInputPath(tempDir)), and it worked without any problem. Thanks for catching that, this saved us a lot of time.

Stefan

On 07.07.2006, at 16:08, Jérôme Charron wrote:
>> I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug.
>
> Thanks for this feedback Stefan.
>
>> We should fix that.
>
> What I suggest is simply to remove line 75 in the createJob method of CrawlDb:
> setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
> In fact, this method is only used by Injector.inject() and CrawlDb.update(), and the input path set in createJob is needed by neither Injector.inject() nor CrawlDb.update(). If there is no objection, I will commit this change tomorrow.
>
> Regards
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
Re: Error with Hadoop-0.4.0
Jérôme Charron wrote:
> Hi,
> I encountered some problems with the Nutch trunk version. In fact they seem to be related to the Hadoop-0.4.0 and JDK 1.5 changes (more precisely, since HADOOP-129 and the replacement of File by Path).
> Does somebody have the same error?

I am not seeing this (just ran inject on a single-machine (Linux) configuration, local fs, without problems).

--
Sami Siren
RE: error
A new plugin was added to the code base. You need to add a new entry, summary-basic or summary-lucene, to the plugin.includes property in tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml.

HTH

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Monday, May 22, 2006 11:39 AM
To: nutch-dev@lucene.apache.org
Subject: error

I updated all plugins... And now I get errors in the Tomcat log:

May 22, 2006 3:28:50 AM org.apache.nutch.plugin.PluginRepository init
SEVERE: org.apache.nutch.plugin.PluginRuntimeException: Plugin (summary-basic), extension point: org.apache.nutch.searcher.Summarizer does not exist.

How can I fix this problem?
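For reference, a minimal sketch of what the nutch-site.xml property could look like with summary-basic added. Only the summary-basic entry comes from the advice above; the other plugin names in the value are illustrative placeholders, so keep whatever your installation already lists:

```xml
<!-- nutch-site.xml: include a summarizer plugin so the
     org.apache.nutch.searcher.Summarizer extension point resolves.
     Everything before summary-basic below is an illustrative
     placeholder; preserve your existing plugin list. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic</value>
</property>
```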