Re: PMD integration

2006-04-07 Thread Dawid Weiss


Hi Piotr,

 that right now it is checking only main code (without plugins?).

Yes, that's correct -- I forgot to mention that. PMD target is hooked up 
with tests and stops the build if something fails. I thought the core 
code should be this strict; for plugins we can have more relaxed rules 
(in another target or even in the same one).


That's again up to you guys.

Dawid

P.S. Tom Copeland has already fixed the bug I mentioned in the patch. 
Quite impressive bugfix turnaround, isn't it. :)




Piotr Kosiorowski wrote:

P.


Dawid Weiss wrote:


All right, I thought I'd give it a go since I have a few spare minutes. 
JIRA is off, so I made the patches available here --


http://ophelia.cs.put.poznan.pl/~dweiss/nutch/

pmd.patch is the build file patch and libraries (binaries are in a 
separate zip file pmd-ext.zip).


pmd-fixes.patch fixes the current core code to go through pmd 
smoothly. I removed obvious unused code, but left FIXME comments where 
I wasn't sure if the removal can cause side effects (in these places 
PMD warnings are suppressed with NOPMD comments).


I also discovered a bug in PMD... eh... nothing's perfect.

https://sourceforge.net/tracker/?func=detail&atid=479921&aid=1465574&group_id=56262



D.


Piotr Kosiorowski wrote:
+1 - I offer my help - we can coordinate it and I can do part of the work. I
will also try to commit your patches quickly.
Piotr

On 4/6/06, Dawid Weiss [EMAIL PROTECTED] wrote:



Other options (raised on the Hadoop list) are Checkstyle:

PMD seems to be the best choice for an Apache project and they all seem
to perform at a similar level.


Anything that generates a lot of false positives is bad: it either
causes us to skip analysis of lots of files, or ignore the warnings.
Skipping the JavaCC-generated classes is reasonable, but I'm wary of
skipping much else.

I thought a bit about this. The warnings PMD reports may actually make sense to
fix. Take a look at maxDoc here:

class LuceneQueryOptimizer {

   private static class LimitExceeded extends RuntimeException {
     private int maxDoc;
     public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
   }
...

maxDoc is accessed from LuceneQueryOptimizer which requires a synthetic
accessor in LimitExceeded. It also may look confusing because you
declare a field private to a class, but use it from the outside...
changing declarations to something like this:

class LuceneQueryOptimizer {

   private static class LimitExceeded extends RuntimeException {
     final int maxDoc;
     public LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
   }
...

removes the warning and also seems to make more sense (note that the package
scope of maxDoc doesn't really expose it much more than before because
the entire class is private).
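
A minimal, self-contained sketch of the second variant, for anyone who wants to try it outside Nutch. The class name LimitDemo and the helper collectUpTo are hypothetical and exist only for illustration; only LimitExceeded and maxDoc come from the snippet above. Note that on JDK 11 and later, nest-based access control means javac no longer generates a synthetic accessor for the private-field variant, so the PMD warning mainly matters for older compilers.

```java
class LimitDemo {

    // Mirrors the LimitExceeded class from the thread: the field is
    // package-private and final, so the enclosing class reads it directly.
    private static class LimitExceeded extends RuntimeException {
        final int maxDoc;
        LimitExceeded(int maxDoc) { this.maxDoc = maxDoc; }
    }

    // Hypothetical helper: walks document ids and aborts once the limit
    // is reached, returning the id at which it stopped.
    static int collectUpTo(int limit, int total) {
        try {
            for (int doc = 0; doc < total; doc++) {
                if (doc >= limit) {
                    throw new LimitExceeded(doc);
                }
            }
            return total;
        } catch (LimitExceeded e) {
            return e.maxDoc; // direct field read from the outer class
        }
    }

    public static void main(String[] args) {
        System.out.println(collectUpTo(3, 10)); // prints 3
        System.out.println(collectUpTo(5, 2));  // prints 2
    }
}
```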

So... if you agree to fix the existing warnings as shown above (there
aren't that many), then integrating PMD with a set of sensible rules may
help detect bad smells in the future (I couldn't resist -- it really
is called that in software engineering :). I only used the dead code
detection ruleset for now; other rulesets can be checked and we will see
whether they help or quite the contrary.

If developers agree to the above I'll create a patch together with what
needs to be fixed to cleanly compile. Otherwise I see little sense in
integrating PMD.

D.












Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-07 Thread Dawid Weiss



My feeling was simply that the closer we are to Nutch-1.0, the more we need
some QA metrics (for us and for Nutch users). No?


I absolutely agree Jérôme, really. It's just that developers usually 
tend to hook up dozens of QA plugins and never look at what they output 
(that's the usual scenario with Maven-built projects that I observed). 
What I think we need is a QA _person_ rather than just tools. But I'm 
always a bit skeptical, don't take it personally ;)


D.


Re: PMD integration

2006-04-07 Thread Jérôme Charron
  that right now it is checking only main code (without plugins?).
 Yes, that's correct -- I forgot to mention that. PMD target is hooked up
 with tests and stops the build if something fails. I thought the core
 code should be this strict; for plugins we can have more relaxed rules

-1
Since plugins provide a lot of Nutch's functionality (without any plugin,
Nutch provides no service), I think that plugin code should be held to the
same standard as the core code.

Thanks

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-07 Thread Jérôme Charron
  My feeling was simply that the closer we are to Nutch-1.0, the more we need
  some QA metrics (for us and for Nutch users). No?
 I absolutely agree Jérôme, really. It's just that developers usually
 tend to hook up dozens of QA plugins and never look at what they output
 (that's the usual scenario with Maven-built projects that I observed).

Yes, that's right...;-)

What I think we need is a QA _person_ rather than just tools. But I'm
 always a bit skeptical, don't take it personally ;)

I absolutely agree, Dawid. But I don't think Nutch has enough human
resources to have a QA person.
I will try to integrate a code coverage tool and see whether it gives us
good indicators of where unit-test effort is needed.

Jérôme



Re: PMD integration

2006-04-07 Thread Piotr Kosiorowski
I do agree with Jérôme - plugins should be checked too.
I would like to integrate PMD for core and plugins over the weekend, based on
Dawid's work - I will make it a totally separate target (so tests do not
depend on it).
The goal is to let other developers play with PMD easily without
affecting the build.
I would also like to look at the possibility of generating cross-referenced HTML
code from the Nutch sources, as it looks like PMD can use it and violation
reports would be much easier to read.
P.






Re: PMD integration

2006-04-07 Thread Jérôme Charron
 I will make it totally separate target (so test do not
 depend on it).

+1


 The goal is to allow other developers to play with pmd easily but at the
 same time I do not want the build to be affected.

+1


 I would like also to look at possibility to generate crossreferenced HTML
 code from Nutch sources as it looks like pmd can use it and violation
 reports would be much easier to read.

+1

Thanks Piotr (and Dawid too of course)

Jérôme



CrawlDbReducer - selecting data for DB update

2006-04-07 Thread Andrzej Bialecki

Hi,

The more I look at CrawlDbReducer the less I like the method it uses to 
select the most recent records.


This selection is primarily made in the while() loop in 
CrawlDbReducer:45. My main objection is that selecting the highest 
value (meaning most recent) relies on the fact that values of status 
codes in CrawlDatum are ordered according to their meaning, and they are 
treated as a sort of state machine. However, adding new states is very 
difficult, if they should have values lower than STATUS_FETCH_GONE, as 
it leads to breaking backwards-compatibility with older segment data. 
Adding status codes with higher values may also break things here, 
because a CrawlDatum with the highest code would not be necessarily the 
most recent.


I encountered this problem first when adding the signature framework, 
fortunately there was one unused value (0) at that time, so I could add 
CrawlDatum.STATUS_SIGNATURE without breaking the assumptions in 
CrawlDbReducer.


However, now things become more difficult:

* we need another status code for pages newly discovered as a 
result of redirection (see the thread on Meta-refresh). If we add this 
status as e.g. STATUS_FETCH_REDIRECT = 8, then the logic in 
CrawlDbReducer will break.


* we need something to mark pages as being on a fetchlist, to be 
updated soon (this is to support multiple parallel 
generate/fetch/update cycles). A new status code would do fine for this 
purpose (although we need an expiry timer for that too). Arguably, we 
could use the same trick that we used in 0.7 (moving next fetch time 1 
week into the future), but I'm not sure yet how it would play with the 
adaptive fetch patches, which manipulate this value too...


I could use a hack in the meantime: status values are for now all below 
128, we could use the upper nibble for these additional flags, and mask 
them out with 0x0f.
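
A minimal sketch of the nibble idea, not the CrawlDatum implementation. It assumes the base status codes fit in the low four bits (the message says codes are below 128, which on its own only guarantees a free top bit, so this is an extra assumption). The class name StatusFlags and both flag constants are hypothetical.

```java
class StatusFlags {

    // Assumption: base status lives in the low nibble, flags in the high nibble.
    static final int STATUS_MASK = 0x0f;

    // Hypothetical flags, named after the two needs listed above.
    static final int FLAG_ON_FETCHLIST = 0x10;
    static final int FLAG_FROM_REDIRECT = 0x20;

    // Sets a flag bit without touching the base status.
    static byte withFlag(byte status, int flag) {
        return (byte) (status | flag);
    }

    // Masks the flags out so existing comparisons see only the base status.
    static byte baseStatus(byte status) {
        return (byte) (status & STATUS_MASK);
    }

    static boolean hasFlag(byte status, int flag) {
        return (status & flag) != 0;
    }

    public static void main(String[] args) {
        byte stored = withFlag((byte) 5, FLAG_ON_FETCHLIST);
        System.out.println(stored);                             // 21
        System.out.println(baseStatus(stored));                 // 5
        System.out.println(hasFlag(stored, FLAG_ON_FETCHLIST)); // true
    }
}
```

Any code that compares statuses would have to call something like baseStatus() first; otherwise a flagged record would look like a higher status than it is.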


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: PMD integration

2006-04-07 Thread Dawid Weiss



I do agree with Jérôme - plugins should be checked too.


This basically means modifying the fileset in the pmd task. Shouldn't be 
too difficult to include all plugin sources with a single include 
statement.



I will make it totally separate target (so test do not
depend on it).


That was actually Doug's idea (and I agree with it) to stop the build 
file if PMD complains about something. It's similar to testing -- if 
your tests fail, the entire build file fails.



The goal is to allow other developers to play with pmd easily but at the
same time I do not want the build to be affected.


Maybe in the initial phase, but I'd strongly recommend integrating it in 
the main build. PMD is quite fast and doesn't add much delay to the process.



I would like also to look at possibility to generate crossreferenced HTML
code from Nutch sources as it looks like pmd can use it and violation
reports would be much easier to read.


Yes, it's possible but I didn't play with it. I can't do it today, but 
maybe during the weekend. If you're faster than me, Piotr, let me know 
so that I don't waste the time. Thanks.


D.


Re: PMD integration

2006-04-07 Thread Piotr Kosiorowski


  I will make it totally separate target (so test do not
  depend on it).

 That was actually Doug's idea (and I agree with it) to stop the build
 file if PMD complains about something. It's similar to testing -- if
 your tests fail, the entire build file fails.

I totally agree with it - but I want to switch it on for others to
play with first, and make it obligatory once we agree on the
rules we want to use.
Piotr


Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Doug Cutting

Chris Mattmann wrote:

+1 for a release sooner rather than later.


I think this is a good plan.  There's no reason we can't do another 
release in a month.  If it is back-compatible we can call it 0.8.x and 
if it's incompatible we can call it 0.9.0.


I'm going to make a Hadoop 0.1.1 release today that can be included in 
Nutch 0.8.0.  (With Hadoop we're going to aim for monthly releases, with 
potential bugfix releases between when serious bugs are found.  The big 
bug in Hadoop 0.1.0 is http://issues.apache.org/jira/browse/HADOOP-117.)


So we could aim for a Nutch 0.8.0 release sometime next week.  Does that 
work for folks?


Piotr, would you like to make this release, or should I?

Doug


Re: CrawlDbReducer - selecting data for DB update

2006-04-07 Thread Doug Cutting

Andrzej Bialecki wrote:
This selection is primarily made in the while() loop in 
CrawlDbReducer:45. My main objection is that selecting the highest 
value (meaning most recent) relies on the fact that values of status 
codes in CrawlDatum are ordered according to their meaning, and they are 
treated as a sort of state machine.


Yes, that was the design, that status codes are also priorities.

However, adding new states is very 
difficult, if they should have values lower than STATUS_FETCH_GONE, as 
it leads to breaking backwards-compatibility with older segment data. 


We can use CrawlDatum.VERSION to insert new status codes 
back-compatibly.  Perhaps we should change the codes from [0, 1, 2, ...] 
to [0, 10, 20, 30, ...] so that we can more easily introduce new values?  
To update status codes from older versions we simply multiply by 10.


Would something like that work?

Or we could have a separate table mapping status codes to priority.
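
Both suggestions can be sketched in a few lines of Java. This is an illustration under stated assumptions, not Nutch code: the class StatusPriority, the priority numbers, and the value 7 for STATUS_FETCH_GONE are hypothetical; only the value 0 for STATUS_SIGNATURE and the proposed 8 for STATUS_FETCH_REDIRECT come from the thread.

```java
import java.util.HashMap;
import java.util.Map;

class StatusPriority {

    static final byte STATUS_SIGNATURE = 0;      // value mentioned in the thread
    static final byte STATUS_FETCH_GONE = 7;     // hypothetical value
    static final byte STATUS_FETCH_REDIRECT = 8; // value proposed in the thread

    // Separate table: priority is decoupled from the numeric status code,
    // so a new code can be given any rank without renumbering old ones.
    private static final Map<Byte, Integer> PRIORITY = new HashMap<>();
    static {
        PRIORITY.put(STATUS_SIGNATURE, 0);
        PRIORITY.put(STATUS_FETCH_REDIRECT, 10);
        PRIORITY.put(STATUS_FETCH_GONE, 20);
    }

    static int priorityOf(byte status) {
        Integer p = PRIORITY.get(status);
        if (p == null) {
            throw new IllegalArgumentException("unknown status " + status);
        }
        return p;
    }

    // Picks the status with the highest priority, not the highest code.
    static byte select(byte[] statuses) {
        byte best = statuses[0];
        for (byte s : statuses) {
            if (priorityOf(s) > priorityOf(best)) {
                best = s;
            }
        }
        return best;
    }

    // The other suggestion: spread old codes out by a factor of 10 on read.
    static int upgradeLegacyCode(int oldCode) {
        return oldCode * 10;
    }

    public static void main(String[] args) {
        System.out.println(select(new byte[] {8, 7, 0})); // 7, although 8 is numerically larger
        System.out.println(upgradeLegacyCode(3));         // 30
    }
}
```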

Doug


Re: PMD integration

2006-04-07 Thread Doug Cutting

Piotr Kosiorowski wrote:

I will make it totally separate target (so test do not
depend on it).


That was actually Doug's idea (and I agree with it) to stop the build
file if PMD complains about something. It's similar to testing -- if
your tests fail, the entire build file fails.


I totally agree with it - but I want to switch it on for others to
play first, and when we agree on
rules we want to use make it obligatory.


So we start out committing it as an independent target, and then add it 
to the test target?  Is that the plan?  If so, +1.


Doug


Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Chris Mattmann
+1



__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Piotr Kosiorowski

Doug Cutting wrote:


Piotr, would you like to make this release, or should I?

I would prefer you would do it this time - I am not sure if I can find 
some time next week. I would like to do some things before release though:

1) Commit clustering patch from Dawid (I took it over from Andrzej).
2) Commit pmd stuff as optional for this release. We will make it 
required later.
3) Review tutorial - I saw some posts on user list with claims about 
errors so I would like to check it before release.
4) It would be good to go through JIRA issues before - but I am not sure 
if I will manage it.

Any comments?

Regards
Piotr


Re: Patch to remove Nutch formating from logs

2006-04-07 Thread Piotr Kosiorowski

Hello Christopher,
I personally do not like combining logging with severe error handling, 
but it has been a feature of Nutch for some time and I do not think
it causes infinite loops in normal installations. Changing it while we are 
preparing to release a new version is not a good idea in my opinion.

But I will be happy if we change the way it is handled in the future.
So for now -1.
Piotr


Christopher Burkey wrote:
Did anyone get this email? Can a committer acknowledge this has been 
received?


We have been having problems with infinite loops caused by Nutch. My 
theory is that the problem is related to using the log API to track 
severe errors.  This patch is only a few lines of code and should be 
easy to insert. Please let me know if it has been received and what the 
feedback is.




Christopher Burkey wrote:

Hello,

   Here is a patch to change org.apache.nutch.util.LogFormatter to not 
insert itself as the default handler for the system.


   I have been using Nutch for a year and have been waiting for a 
version that I can embed into OpenEdit. The problem has been that 
Nutch inserts itself as the formatter for the Java log system and that 
interferes with OpenEdit logging.
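
For readers unfamiliar with java.util.logging, the general idea of keeping a formatter local can be sketched as follows. This is not the patch below: the class ScopedLogging, the nested DateMessageFormatter, and the logger name are hypothetical. The sketch attaches a formatter to one handler on one named logger and leaves the handlers of the root logger untouched.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Formatter;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class ScopedLogging {

    // Hypothetical formatter: prints only the level and the message.
    static class DateMessageFormatter extends Formatter {
        @Override
        public String format(LogRecord record) {
            return record.getLevel() + " " + formatMessage(record) + "\n";
        }
    }

    // Installs the formatter on the given handler and attaches that handler
    // to a single named logger; no global handler is modified.
    static Logger scopedLogger(String name, Handler handler) {
        Logger logger = Logger.getLogger(name);
        handler.setFormatter(new DateMessageFormatter());
        logger.addHandler(handler);
        // Keep records from also reaching the root handlers.
        logger.setUseParentHandlers(false);
        return logger;
    }

    public static void main(String[] args) {
        Logger log = scopedLogger("demo.scoped", new ConsoleHandler());
        log.info("formatted only for this logger");
    }
}
```

An embedding application keeps whatever formatters it configured on the root logger, since only the handler passed in is changed.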





diff -Naur ../java/org/apache/nutch/util/LogFormatter.java java/org/apache/nutch/util/LogFormatter.java
--- ../java/org/apache/nutch/util/LogFormatter.java 2006-03-31 13:40:50.0 -0500
+++ java/org/apache/nutch/util/LogFormatter.java 2006-04-05 16:27:59.0 -0400
@@ -16,13 +16,23 @@
 
 package org.apache.nutch.util;
 
-import java.util.logging.*;

-import java.io.*;
-import java.text.*;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.PrintStream;
+import java.io.PrintWriter;
+import java.io.StringWriter;
+import java.text.FieldPosition;
+import java.text.SimpleDateFormat;
 import java.util.Date;
-
-/** Prints just the date and the log message. */
-
+import java.util.logging.Formatter;
+import java.util.logging.Level;
+import java.util.logging.LogRecord;
+import java.util.logging.Logger;
+
+/** Prints just the date and the log message.
+ *  This was also used to stop processing as nutch crawls a web site
+ *  [EMAIL PROTECTED] changed this code to use a LogWrapper class to catch severe errors
+ * */
 public class LogFormatter extends Formatter {
   private static final String FORMAT = "yyMMdd HHmmss";
   private static final String NEWLINE = System.getProperty("line.separator");

@@ -35,20 +45,27 @@
   private static boolean showTime = true;
   private static boolean showThreadIDs = false;
 
+  protected static LogFormatter sharedformatter =  new LogFormatter();
+  protected static SevereLogHandler sharedhandler =  new SevereLogHandler(sharedformatter);
+
+  /*
   // install when this class is loaded
   static {
 Handler[] handlers = LogFormatter.getLogger().getHandlers();
 for (int i = 0; i < handlers.length; i++) {
-  handlers[i].setFormatter(new LogFormatter());
+  handlers[i].setFormatter(sharedformatter);
   handlers[i].setLevel(Level.FINEST);
 }
   }
-
+  */
   /** Gets a logger and, as a side effect, installs this as the default
* formatter. */
   public static Logger getLogger(String name) {
 // just referencing this class installs it
-return Logger.getLogger(name);
+Logger logr = Logger.getLogger(name);
+logr.addHandler(sharedhandler);
+   
+return logr;

   }
  /** When true, time is logged with each entry. */
@@ -60,7 +77,10 @@
   public static void setShowThreadIDs(boolean showThreadIDs) {
 LogFormatter.showThreadIDs = showThreadIDs;
   }
-
+  public void setLoggedSevere( boolean inSevere )
+  {
+  loggedSevere = inSevere;
+  }
   /**
* Format the given LogRecord.
* @param record the log record to be formatted.
diff -Naur ../java/org/apache/nutch/util/SevereLogHandler.java java/org/apache/nutch/util/SevereLogHandler.java
--- ../java/org/apache/nutch/util/SevereLogHandler.java 1969-12-31 19:00:00.0 -0500
+++ java/org/apache/nutch/util/SevereLogHandler.java 2006-04-05 16:29:20.0 -0400
@@ -0,0 +1,46 @@
+/*
+ * Created on Apr 5, 2006
+ */
+package org.apache.nutch.util;
+
+import java.util.logging.Handler;
+import java.util.logging.Level;
+import java.util.logging.LogRecord;
+
+public class SevereLogHandler extends Handler
+{
+protected LogFormatter fieldNutchFormatter;
+   
+public SevereLogHandler(LogFormatter inFormatter)

+{
+setNutchFormatter(inFormatter);
+}
+   
+protected LogFormatter getNutchFormatter()

+{
+return fieldNutchFormatter;
+}
+
+protected void setNutchFormatter(LogFormatter inNutchFormatter)
+{
+fieldNutchFormatter = inNutchFormatter;
+}
+
+public void publish(LogRecord inRecord)
+{
+if ( inRecord.getLevel().intValue() == Level.SEVERE.intValue())
+{
+

Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Andrzej Bialecki

Doug Cutting wrote:

Chris Mattmann wrote:

+1 for a release sooner rather than later.


I think this is a good plan.  There's no reason we can't do another 
release in a month.  If it is back-compatible we can call it 0.8.x 
and if it's incompatible we can call it 0.9.0.


I'm going to make a Hadoop 0.1.1 release today that can be included in 
Nutch 0.8.0.  (With Hadoop we're going to aim for monthly releases, 
with potential bugfix releases between when serious bugs are found.  
The big bug in Hadoop 0.1.0 is 
http://issues.apache.org/jira/browse/HADOOP-117.)


So we could aim for a Nutch 0.8.0 release sometime next week.  Does 
that work for folks?


Do you guys have any additional insights / suggestions whether NUTCH-240 
and/or NUTCH-61 should be included in this release?






Re: [Proposal] New Lucene sub-project

2006-04-07 Thread Rida Benjelloun
Hi Jérôme,

I find your idea very interesting, and I would be interested in contributing to
the Parse Plugins Framework. I have developed a similar one using Lucene. The
project name is Lius.

If you are interested please let me know.



On 4/7/06, Jérôme Charron [EMAIL PROTECTED] wrote:

 Hi all,

 While chatting with Chris Mattmann, it became evident to us that there
 is a need for a new sub-project within Lucene.

 For now, Lucene's sub-projects used in Nutch are:
 1. Lucene-java - The basis for search technology
 2. Hadoop - The distributed computing platform
 3. Nutch - The search engine that relies on Lucene and Hadoop.

 Since Nutch contains some value-added pieces of code that focus on content
 analysis, we think it would be a good idea to split them out of Nutch into a
 new sub-project based on content analysis and manipulation. The components
 we have identified are:

 1. MimeType Repository
 2. Language Identifier
 3. Content Signature (MD5Signature / TextProfileSignature / ...)
 (4. Generic Meta Data Infrastructure)
 (5. Charset Detector)
 (6. Parse Plugins Framework)

 The idea is to expose these pieces of code as a standalone lib, since we
 are convinced they could be useful in many projects other than Nutch.
 The benefit would be code that is more widely used / tested /
 contributed to.
 If this proposal is accepted, we have a candidate name for this new
 project:
 Tika (comes from my son  ;-) )

 Any comment is welcome.

 Jérôme




Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Chris Mattmann
Hi Andrzej,


On 4/7/06 12:18 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Do you guys have any additional insights / suggestions whether NUTCH-240
 and/or NUTCH-61 should be included in this release?

Looking at the JIRA popular issues pane for Nutch
(http://issues.apache.org/jira/browse/NUTCH?report=com.atlassian.jira.plugin.system.project:popularissues-panel),
I note that NUTCH-61 is the most
popular issue right now with 7 votes. Additionally, NUTCH-240 shares the 3rd
most votes (4) with NUTCH-134. So, all in all, there are 4 issues with >= 4
votes in JIRA. Of those 4 issues, 3 have attached patches in
JIRA. Would it be safe to say that the committers should focus on committing
NUTCH-61, NUTCH-240, and NUTCH-48, since these 3 issues all have attached
patch files, and then freeze for the 0.8.0 release? As for my own
opinion, I recently downloaded and reviewed NUTCH-61, and really like the
patch. +1 on my end. I haven't tried out NUTCH-240 yet, but it seems to be a
logical extension point for Nutch to be able to plug in different scoring
components. So, +1 from me.

Cheers,
  Chris






Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Jérôme Charron
 Do you guys have any additional insights / suggestions whether NUTCH-240
 and/or NUTCH-61 should be included in this release?

NUTCH-240: I really like the idea, but for now I agree that its API is
still ugly. I would like to help in the next weeks...
So for me it should not be included in the 0.8 release...

Regards

Jérôme




[jira] Created: (NUTCH-245) XML Schemas for xml configuration files in conf directory

2006-04-07 Thread Chris A. Mattmann (JIRA)
XML Schemas for xml configuration files in conf directory
-

 Key: NUTCH-245
 URL: http://issues.apache.org/jira/browse/NUTCH-245
 Project: Nutch
Type: New Feature

  Components: fetcher, indexer, ndfs, searcher, web gui  
Versions: 0.7.2, 0.7.1, 0.7, 0.6, 0.8-dev
 Environment: Power PC Dual Processor 2.0 Ghz, Mac OS X 10.4, although 
improvement is independent of environment
Reporter: Chris A. Mattmann
 Assigned to: Chris A. Mattmann 
Priority: Minor


Currently, the plugin.xml file does not have a DTD or XML Schema associated 
with it, and most people just look at an existing plugin's plugin.xml file 
to determine which elements are allowed. There should be an explicit 
plugin DTD file that describes the plugin.xml file. I'll look at the code and 
attach a plugin.dtd file for the Nutch conf directory later today. This way, 
people can use the DTD file to generate plugin.xml files automatically (using 
tools such as XMLSpy) and then validate them. I'm also going to post 
another issue about extending the ant target that builds the 
Nutch website so that it copies the existing DTD files 
in $NUTCH_HOME/conf to the Nutch website ROOT. That way, we could 
reference the DTD file in all the XML instance files with something 
like <!DOCTYPE system "http://lucene.apache.org/nutch/dtd/parse-plugins.dtd"> 
within parse-plugins.xml, and similarly for nutch-site.xml or 
mime-types.xml.




Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Andrzej Bialecki

Chris Mattmann wrote:

opinion, I recently downloaded and reviewed NUTCH-61, and really like the
patch. +1 on my end. I haven't tried out NUTCH-240 yet, but it seems to be a
logical extension point for Nutch to be able to plug in different scoring
components. So, +1 from me.
  


Thanks for looking at this.

NUTCH-240: the API has some warts, it would be nice to clean up the 
passScore* methods before committing it - but this may involve changing 
too much code that is not strictly related to this patch.


NUTCH-61: I can commit this, it's been lightly tested on a dozen or so 
cycles of a small sample of urls. However, for some settings I've seen 
cases when AdaptiveFetchPolicy would go haywire and increase 
fetchInterval to infinity or to zero. So, this is really about whether 
people want to be blessed with this patch whether they need it or not, 
and weed out bugs as we go, or perhaps continue waiting for some 
volunteers to test it on a larger scale / more cycles.






Re: PMD integration

2006-04-07 Thread Piotr Kosiorowski

Committed.
One can run the PMD checks with 'ant pmd'. It produces an HTML 
report file in the build directory. It covers core Nutch and plugins.
Currently it uses only the unusedcode ruleset, but one can uncomment 
other rulesets in build.xml (or add others according to the PMD 
documentation).


I would like to add cross-referenced source so the report is easier to read 
in the near future.

I have two additional questions for developers:
1) Should we check test sources with PMD?
2) We have oro 2.0.7 in our dependencies (I think for urlfilter and similar 
things). PMD requires oro 2.0.8. Do you think we can upgrade (as far 
as I know 2.0.7 and 2.0.8 should be compatible)? We would then have only one 
oro jar.


So happy PMD-ing,
Piotr










web ui improvement

2006-04-07 Thread Sami Siren
I have recently been working on refactoring the web GUI to be more 
extensible and manageable by replacing the spaghetti JSP UI with a UI 
layer built with Struts and Tiles. This will make it much easier 
to provide, for example, a plugin (extension) that just changes the 
layout of the search, or new UI functionality (like the "did you 
mean" feature) in the form of a plugin, etc.


I know there are people who think that a plain XML interface is good 
enough for all, but I would like to give this new architecture a try.


Regarding the discussion on another thread about the required 
functionality for the 0.8 release, my opinion is to postpone any new 
UI functionality (for example NUTCH-48) until the new architecture is 
in place (if it is decided to be applied).

If people are interested in seeing this work in progress I will prepare 
some sort of demo next week (there is still some work to do before it 
reaches the form of a patch).


--
 Sami Siren




Re: web ui improvement

2006-04-07 Thread Doug Cutting

Sami Siren wrote:
I know there are people who think that a plain xml interface is good 
enough for all but I would like to give this new architecture a try.


I think this would be a great addition.  The XML has a lot of uses, but 
we should include a good native, extensible, skinnable search UI.  +1


As part of the required functionality of the 0.8 release discussion on 
some other thread my opinion is to postpone any new ui functionality
(for example NUTCH-48) until the new architecture is in place


I would not veto someone testing & committing NUTCH-48.  We should avoid 
investing too much effort into this if it will soon be obsolete.  But if 
a small effort will give folks "did you mean" support in the interim, 
that's not a bad thing.  Of course, folks can always apply this patch 
themselves...


Doug


Re: web ui improvement

2006-04-07 Thread Sami Siren


As part of the required functionality of the 0.8 release discussion 
on some other thread my opinion is to postpone any new ui functionality
(for example NUTCH-48) until the new architecture is in place



I would not veto someone testing & committing NUTCH-48.  We should 
avoid investing too much effort into this if it will soon be 
obsolete.  But if a small effort will give folks "did you mean" 
support in the interim, that's not a bad thing.  Of course, folks can 
always apply this patch themselves...


Agreed, perhaps I meant to say that I will not apply it ;)

--
Sami Siren