Re: Error in 0.8 regex-urlfilter.txt
Thanks, committed. Otis

----- Original Message -----
From: Matthew Holt [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org
Sent: Wednesday, August 9, 2006 9:51:19 AM
Subject: Error in 0.8 regex-urlfilter.txt

I was doing a search and noticed that a 'png' file had been indexed. I checked crawl-urlfilter.txt, and it had the following line preventing png files from being indexed:

-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

I then looked at regex-urlfilter.txt; the line was similar, but lacked the 'png' extension, so the recrawl was indexing the png files. The original regex-urlfilter.txt line is below:

-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

It needs to be modified in trunk to match the line from crawl-urlfilter.txt.

Matt
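As a quick sanity check that the corrected suffix pattern really excludes png URLs, here is a standalone sketch exercising the regex line quoted above (the example URLs are made up for illustration):

```java
import java.util.regex.Pattern;

public class PngFilterCheck {
    public static void main(String[] args) {
        // Suffix pattern from crawl-urlfilter.txt, with png included:
        Pattern skip = Pattern.compile(
            "\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
            + "|gz|rpm|tgz|mov|MOV|exe|png)$");
        // A URL ending in .png now matches, so the filter excludes it:
        System.out.println(skip.matcher("http://example.com/logo.png").find());
        // Ordinary pages do not match and are kept:
        System.out.println(skip.matcher("http://example.com/index.html").find());
    }
}
```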
Re: Error with Hadoop-0.4.0
Gal Nitzan wrote:
> To get the same behavior, just try to inject into a new crawldb that doesn't exist. The reason many don't see it is that the crawldb already exists in their environment.

True, I was injecting into an existing crawldb.

--
Sami Siren
Re: Error with Hadoop-0.4.0
Doug Cutting wrote:
> Jérôme Charron wrote:
>> In my environment, the crawl command terminates with the following error:
>>
>> 2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
>> Exception in thread main java.io.IOException: Input directory /localpathcrawl/crawldb/current in local is invalid.
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>>         at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>
> Hadoop 0.4.0 by default requires all input directories to exist, where previous releases did not. So we need to either create an empty current directory or change the InputFormat used in CrawlDb.createJob() to one that overrides InputFormat.areValidInputDirectories(). The former is probably easier. I've attached a patch. Does this fix things for folks?

The patch works for me.

--
Sami Siren
Re: Error with Hadoop-0.4.0
Sami Siren wrote:
> Patch works for me.

OK. I just committed it. Thanks!

Doug
Re: Error with Hadoop-0.4.0
Stefan Groschupf wrote:
> We tried your suggested fix:
>> changing Injector to mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir))

I suspect that this is not the right solution - have you actually tested that the resulting db contains all entries from the input dirs?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
RE: Error with Hadoop-0.4.0
To get the same behavior, just try to inject into a new crawldb that doesn't exist. The reason many don't see it is that the crawldb already exists in their environment.

-----Original Message-----
From: Sami Siren [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 06, 2006 7:23 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Error with Hadoop-0.4.0

Jérôme Charron wrote:
> Hi,
> I encountered some problems with the Nutch trunk version. In fact they seem to be related to the Hadoop-0.4.0 and JDK 1.5 changes (more precisely, since HADOOP-129 and the replacement of File by Path).
> Does somebody have the same error?

I am not seeing this (just ran inject on a single-machine (Linux) configuration, local fs, without problems).

--
Sami Siren
Re: Error with Hadoop-0.4.0
Jérôme Charron wrote:
> In my environment, the crawl command terminates with the following error:
>
> 2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
> Exception in thread main java.io.IOException: Input directory /localpathcrawl/crawldb/current in local is invalid.
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

Hadoop 0.4.0 by default requires all input directories to exist, where previous releases did not. So we need to either create an empty current directory or change the InputFormat used in CrawlDb.createJob() to one that overrides InputFormat.areValidInputDirectories(). The former is probably easier. I've attached a patch. Does this fix things for folks?

Doug

Index: src/java/org/apache/nutch/crawl/CrawlDb.java
===================================================================
--- src/java/org/apache/nutch/crawl/CrawlDb.java	(revision 417882)
+++ src/java/org/apache/nutch/crawl/CrawlDb.java	(working copy)
@@ -65,7 +65,8 @@
     if (LOG.isInfoEnabled()) { LOG.info("CrawlDb update: done"); }
   }
 
-  public static JobConf createJob(Configuration config, Path crawlDb) {
+  public static JobConf createJob(Configuration config, Path crawlDb)
+    throws IOException {
     Path newCrawlDb =
       new Path(crawlDb,
                Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
@@ -73,7 +74,11 @@
     JobConf job = new NutchJob(config);
     job.setJobName("crawldb " + crawlDb);
-    job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
+
+    Path current = new Path(crawlDb, CrawlDatum.DB_DIR_NAME);
+    if (FileSystem.get(job).exists(current)) {
+      job.addInputPath(current);
+    }
     job.setInputFormat(SequenceFileInputFormat.class);
     job.setInputKeyClass(UTF8.class);
     job.setInputValueClass(CrawlDatum.class);
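The essence of the patch is an existence check before registering the input directory. Below is a minimal standalone sketch of that guard, using java.io.File as a stand-in for Hadoop's FileSystem and Path so it runs without Hadoop on the classpath (in the real code the directory name "current" is CrawlDatum.DB_DIR_NAME and the check is FileSystem.get(job).exists(current)):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class InputDirGuard {
    // Mirrors the patched CrawlDb.createJob(): only add the crawldb's
    // "current" directory as a job input if it actually exists, since
    // Hadoop 0.4.0 rejects jobs whose input directories are missing.
    static List<String> inputPaths(String crawlDb) {
        List<String> inputs = new ArrayList<String>();
        File current = new File(crawlDb, "current");
        if (current.exists()) {        // the guard added by the patch
            inputs.add(current.getPath());
        }
        return inputs;
    }

    public static void main(String[] args) {
        // Injecting into a brand-new crawldb: there is no "current"
        // directory yet, so the job simply gets no input paths
        // instead of failing with "Input directory ... is invalid".
        System.out.println(inputPaths("/tmp/fresh-crawldb"));
    }
}
```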
Re: Error with Hadoop-0.4.0
Hi Jérôme,
I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug. We should fix that.

Stefan

On 06.07.2006, at 08:54, Jérôme Charron wrote:
> Hi,
> I encountered some problems with the Nutch trunk version. In fact they seem to be related to the Hadoop-0.4.0 and JDK 1.5 changes (more precisely, since HADOOP-129 and the replacement of File by Path).
> In my environment, the crawl command terminates with the following error:
>
> 2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
> Exception in thread main java.io.IOException: Input directory /localpathcrawl/crawldb/current in local is invalid.
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>
> By looking at the Nutch code, and simply changing line 145 of Injector to mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)), everything works fine.
> By taking a closer look at the CrawlDb code, I finally don't understand why the following line is in the createJob method:
> job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
> Out of curiosity, if a Hadoop guru can explain why there is such a regression...
> Does somebody have the same error?
>
> Regards
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
Re: Error with Hadoop-0.4.0
> I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug.

Thanks for this feedback Stefan.

> We should fix that.

What I suggest is simply to remove line 75 in the createJob method of CrawlDb:
setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
In fact, this method is only used by Injector.inject() and CrawlDb.update(), and the input path set in createJob is needed by neither Injector.inject() nor CrawlDb.update(). If there is no objection, I will commit this change tomorrow.

Regards
Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: Error with Hadoop-0.4.0
We tried your suggested fix (changing Injector to mergeJob.setInputPath(tempDir) instead of mergeJob.addInputPath(tempDir)), and it worked without any problem. Thanks for catching that, this saved us a lot of time.

Stefan

On 07.07.2006, at 16:08, Jérôme Charron wrote:
>> I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug.
>
> Thanks for this feedback Stefan.
>
>> We should fix that.
>
> What I suggest is simply to remove line 75 in the createJob method of CrawlDb:
> setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
> In fact, this method is only used by Injector.inject() and CrawlDb.update(), and the input path set in createJob is needed by neither Injector.inject() nor CrawlDb.update(). If there is no objection, I will commit this change tomorrow.
>
> Regards
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
Re: Error with Hadoop-0.4.0
Jérôme Charron wrote:
> Hi,
> I encountered some problems with the Nutch trunk version. In fact they seem to be related to the Hadoop-0.4.0 and JDK 1.5 changes (more precisely, since HADOOP-129 and the replacement of File by Path).
> Does somebody have the same error?

I am not seeing this (just ran inject on a single-machine (Linux) configuration, local fs, without problems).

--
Sami Siren
RE: error
A new plugin was added to the code base. You need to add a new entry, summary-basic or summary-lucene, to the plugin.includes property in tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml.

HTH

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Monday, May 22, 2006 11:39 AM
To: nutch-dev@lucene.apache.org
Subject: error

I updated all plugins... And now I get errors in the Tomcat log:

May 22, 2006 3:28:50 AM org.apache.nutch.plugin.PluginRepository init
SEVERE: org.apache.nutch.plugin.PluginRuntimeException: Plugin (summary-basic), extension point: org.apache.nutch.searcher.Summarizer does not exist.

How can I fix this problem?
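For reference, a minimal sketch of what the nutch-site.xml property could look like with summary-basic added. Only the summary-basic entry comes from the advice above; the other plugin names in the value are illustrative placeholders, so keep whatever your installation already lists:

```xml
<!-- nutch-site.xml: include a summarizer plugin so the
     org.apache.nutch.searcher.Summarizer extension point resolves.
     Everything before summary-basic below is an illustrative
     placeholder; preserve your existing plugin list. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic</value>
</property>
```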