[jira] Reopened: (NUTCH-309) Uses commons logging Code Guards

2006-07-07 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-309?page=all ]
 
Doug Cutting reopened NUTCH-309:



I am re-opening this issue, as the guards were added in far too many places.
Jerome, can you please fix these so that guards are only added when (a) the log
level is DEBUG or TRACE, (b) the logging occurs in performance-critical code,
and (c) the logged string is not constant?

 Uses commons logging Code Guards
 

  Key: NUTCH-309
  URL: http://issues.apache.org/jira/browse/NUTCH-309
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Jerome Charron
 Assignee: Jerome Charron
 Priority: Minor
  Fix For: 0.8-dev


 Code guards are typically used to guard code that only needs to execute in 
 support of logging, that otherwise introduces undesirable runtime overhead in 
 the general case (logging disabled). Examples are multiple parameters, or 
 expressions (e.g. string + " more") for parameters. Use the guard methods of 
 the form log.is<Priority>() to verify that logging should be performed, 
 before incurring the overhead of the logging method call. Yes, the logging 
 methods will perform the same check, but only after resolving parameters.
 (description extracted from 
 http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)
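As a minimal, self-contained sketch of the guard idiom (the Log class below is a tiny stand-in for the commons-logging Log interface, so the example runs without the library):

```java
// Demonstrates why a guard avoids building expensive log messages
// when the level is disabled. "Log" is NOT commons-logging; it only
// mimics the isDebugEnabled()/debug() pair described above.
public class GuardDemo {
    static class Log {
        private final boolean debugEnabled;
        Log(boolean debugEnabled) { this.debugEnabled = debugEnabled; }
        boolean isDebugEnabled() { return debugEnabled; }
        void debug(Object msg) { /* would write msg somewhere */ }
    }

    // Counts how many times the expensive message is actually built.
    static int messagesBuilt = 0;

    static String expensiveMessage(String url) {
        messagesBuilt++;
        return "fetching " + url + " at " + System.currentTimeMillis();
    }

    public static void main(String[] args) {
        Log log = new Log(false);  // DEBUG disabled, the common case

        // Unguarded: the argument is evaluated even though nothing is logged.
        log.debug(expensiveMessage("http://example.com/"));

        // Guarded: the concatenation is skipped entirely when DEBUG is off.
        if (log.isDebugEnabled()) {
            log.debug(expensiveMessage("http://example.com/"));
        }

        System.out.println(messagesBuilt); // prints 1: the guard saved one build
    }
}
```

This is exactly the trade-off Doug describes: the guard only pays off when the message is non-constant and the call sits on a hot path.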

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-309) Uses commons logging Code Guards

2006-07-07 Thread Jerome Charron (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-309?page=comments#action_12419670 ] 

Jerome Charron commented on NUTCH-309:
--

As already discussed, it makes perfect sense, and I have planned to work on 
this issue.
Another minor change I would like to make is to replace log4j.properties with 
log4j.xml: log4j.xml provides more functionality and flexibility, especially 
filters, which provide a way to log to different appenders depending on the 
log level (for instance, I use this to log all levels to a file and only the 
WARN and ERROR levels to the console).
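A sketch of the kind of log4j.xml setup Jerome describes, using a LevelRangeFilter so that only WARN and above reach the console while everything goes to the file (file names and patterns here are illustrative, not Nutch's actual configuration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <!-- All levels go to the log file. -->
  <appender name="FILE" class="org.apache.log4j.FileAppender">
    <param name="File" value="logs/nutch.log"/>
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d %-5p %c - %m%n"/>
    </layout>
  </appender>
  <!-- Only WARN and above reach the console, via a LevelRangeFilter. -->
  <appender name="CONSOLE" class="org.apache.log4j.ConsoleAppender">
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%-5p %c - %m%n"/>
    </layout>
    <filter class="org.apache.log4j.varia.LevelRangeFilter">
      <param name="LevelMin" value="WARN"/>
    </filter>
  </appender>
  <root>
    <priority value="DEBUG"/>
    <appender-ref ref="FILE"/>
    <appender-ref ref="CONSOLE"/>
  </root>
</log4j:configuration>
```

This per-appender filtering is the capability log4j.properties lacks, which is the motivation for the switch.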
 




[jira] Commented: (NUTCH-300) Clustering API improvements

2006-07-07 Thread nutch.newbie (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-300?page=comments#action_12419675 ] 

nutch.newbie commented on NUTCH-300:


Has anyone tried this patch? Anyone from the Carrot team? Is it compatible with 
the current version?

 Clustering API improvements
 ---

  Key: NUTCH-300
  URL: http://issues.apache.org/jira/browse/NUTCH-300
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
 Priority: Minor
  Attachments: patch.txt

 This patch adds support for retrieving original document scores (from 
 NutchBean), as well as cluster-level relevance scores (from Clusterer). Both 
 methods may improve visual representation of the clusters, where individual 
 items may be visually differentiated depending on their query relevance and 
 cluster relevance. A modified cluster.jsp illustrates this feature.




[jira] Commented: (NUTCH-300) Clustering API improvements

2006-07-07 Thread Dawid Weiss (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-300?page=comments#action_12419708 ] 

Dawid Weiss commented on NUTCH-300:
---

Hi. I just took a look at it -- I don't see anything wrong with the code, and 
Andrzej has used Carrot2 before. We're in the middle of major refactorings to 
simplify things within Carrot2 -- the internals won't change much, but we are 
dropping obsolete APIs etc. The new web application has a shiny new user 
interface (at the moment XSLT-filtered from XML, so not suitable for huge user 
loads, but very convenient to work with for customizations). Stay tuned.




Re: Error with Hadoop-0.4.0

2006-07-07 Thread Stefan Groschupf

Hi Jérôme,

I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.
We should fix that.

Stefan

On 06.07.2006, at 08:54, Jérôme Charron wrote:


Hi,

I encountered some problems with the Nutch trunk version.
In fact it seems to be related to the changes introduced by Hadoop-0.4.0
and JDK 1.5 (more precisely, since HADOOP-129 and the replacement of File
by Path).

In my environment, the crawl command terminates with the following error:

2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273))
- Input directory /localpath/crawl/crawldb/current in local is invalid.

Exception in thread "main" java.io.IOException: Input directory
/localpath/crawl/crawldb/current in local is invalid.
   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
   at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

By looking at the Nutch code, and simply changing line 145 of Injector to
mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)),
everything works fine. Taking a closer look at the CrawlDb code, I finally
don't understand why the following line is in the createJob method:

job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));

Out of curiosity, perhaps a Hadoop guru can explain why there is such a
regression...

Does somebody have the same error?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
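The append-versus-replace distinction behind the one-line fix can be sketched with a toy stand-in for the job configuration (FakeJobConf is illustrative only, not the real Hadoop JobConf):

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of why setInputPath() fixes the inject error:
// addInputPath() appends to the job's input list, so the stale
// crawldb path registered earlier in createJob() is kept, and the
// job tries to read a directory that does not exist yet on a fresh
// crawl. setInputPath() replaces the list instead.
public class InputPathDemo {
    static class FakeJobConf {
        final List<String> inputPaths = new ArrayList<String>();
        void addInputPath(String p) {      // appends to the list
            inputPaths.add(p);
        }
        void setInputPath(String p) {      // replaces the list
            inputPaths.clear();
            inputPaths.add(p);
        }
    }

    public static void main(String[] args) {
        FakeJobConf job = new FakeJobConf();
        // createJob() has already registered the crawldb directory:
        job.addInputPath("crawl/crawldb/current");

        // addInputPath(tempDir) keeps the stale crawldb path around:
        job.addInputPath("tempDir");
        System.out.println(job.inputPaths); // [crawl/crawldb/current, tempDir]

        // setInputPath(tempDir) replaces it, which is what inject() needs:
        job.setInputPath("tempDir");
        System.out.println(job.inputPaths); // [tempDir]
    }
}
```

This also explains Jérôme's other observation: once the input path line is removed from createJob, nothing stale is left for addInputPath to preserve.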




Number of pages different to Indexed documents

2006-07-07 Thread Lourival Júnior

Hi all!

I have a little doubt. My WebDB currently contains 779 pages with 899
links. When I use the segread command it also reports 779 pages in one
segment. However, when I make a search, or when I use the Luke tool, the
maximum number of documents is 437. I've looked at the recrawl logs, and while
the script is fetching pages, some of them produce the message:

... failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater:
Exceeded http.max.delays: retry later.

I think this happens because of some network problem: the fetcher tries to
fetch a page but does not succeed. Because of this, when the segment is
indexed, only the successfully fetched pages appear in the results. That is a
problem for me.

Could someone explain what I should do to refetch these pages and increase
my web search results? Should I change the http.max.delays and
fetcher.server.delay properties in nutch-default.xml?
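Both properties named above do exist in nutch-default.xml; the usual practice is to override them in conf/nutch-site.xml rather than edit the defaults. The values below are only illustrative, not recommendations:

```xml
<!-- conf/nutch-site.xml overrides (values are illustrative) -->
<property>
  <name>http.max.delays</name>
  <value>100</value>
  <!-- How many times the fetcher will wait for a busy host before
       giving up on the page with "retry later". -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>2.0</value>
  <!-- Seconds the fetcher waits between successive requests to the
       same server. -->
</property>
```

Raising http.max.delays makes "Exceeded http.max.delays" failures less likely at the cost of slower fetches; the skipped pages are then retried on the next fetch cycle.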

Regards,

--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]


RE: 0.8 release

2006-07-07 Thread Teruhiko Kurosaka
May I suggest someone take a look at NUTCH-266 before releasing 0.8?
The Nutch build as of half a month ago was not working for me and another
person.

-kuro 

 -Original Message-
 From: Stefan Groschupf [mailto:[EMAIL PROTECTED] 
 Sent: 2006-7-05 11:53
 To: nutch-dev@lucene.apache.org
 Subject: Re: 0.8 release
 
 +1, but I really would love to see NUTCH-293 as part of Nutch 0.8,  
 since this is all about being more polite.
 Thanks.
 Stefan
 
 On 05.07.2006, at 03:46, Doug Cutting wrote:
 
  +1
 
  Piotr Kosiorowski wrote:
  +1.
  P.
  Andrzej Bialecki wrote:
  Sami Siren wrote:
  How would folks feel about releasing 0.8 now? There have been  
  quite a lot of improvements/new features  
  since the 0.7 series, and I strongly feel that we should push the  
  first 0.8 series release (alpha/beta)  
  out the door now. It would IMO lower the barrier for first-timers  
  to try the 0.8 series, and that would  
  give us more feedback about the overall quality.
 
  Definitely +1. Let's do some testing, however, after the upgrade  
  to hadoop 0.3.2 - hadoop had many, many changes, so we just need  
  to make sure it's stable when used with Nutch ...
 
  We should also check JIRA and apply any trivial fixes before the  
  release.
 
 
  If there is a consensus about this I can volunteer to be the RM.
 
  That would be great, thanks!
 
 
 
 


Re: Error with Hadoop-0.4.0

2006-07-07 Thread Jérôme Charron

I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.


Thanks for this feedback Stefan.



We should fix that.


What I suggest is simply to remove line 75 in the createJob method of
CrawlDb:
setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
In fact, this method is only used by Injector.inject() and CrawlDb.update(),
and the input path set in createJob is needed by neither Injector.inject()
nor CrawlDb.update().

If no objection, I will commit this change tomorrow.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Error with Hadoop-0.4.0

2006-07-07 Thread Stefan Groschupf

We tried your suggested fix:

changing line 145 of Injector to mergeJob.setInputPath(tempDir)
(instead of mergeJob.addInputPath(tempDir))

and it worked without any problem.

Thanks for catching that; this saved us a lot of time.
Stefan

On 07.07.2006, at 16:08, Jérôme Charron wrote:


I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.


Thanks for this feedback Stefan.



We should fix that.


What I suggest is simply to remove line 75 in the createJob method of
CrawlDb:
setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
In fact, this method is only used by Injector.inject() and CrawlDb.update(),
and the input path set in createJob is needed by neither Injector.inject()
nor CrawlDb.update().

If no objection, I will commit this change tomorrow.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/