Re: duplicate libs

2006-02-16 Thread Jérôme Charron
Thanks Dawid (and everybody else),

In fact, after taking a closer look at this "problem", it seems it is not
really a problem... (sorry)
In the end, I don't think it is a good idea to try to derive the dependencies
from the plugin.xml file:
the plugin.xml file describes the runtime dependencies, whereas the
build.xml file describes the compile-time dependencies...
So the final solution will be simply to add something like this to every
plugin's build.xml file:

  


  

I have tested it and it works. So now I will update all plugins before
committing.

Regards

Jérôme


--
http://motrech.free.fr/
http://www.frutch.org/


Re: duplicate libs

2006-02-16 Thread Dawid Weiss


I just wanted to say we've gone through such problems already in Carrot2 
-- many modules depend on each other, some of them have custom build 
steps. A pure ANT solution is likely to be quite ugly... But back to the 
point: you can test for existence of a plugin-specific build file and 
execute it if it exists. This will probably require mutable properties 
and conditionals available in ant-contrib (there are pure-ant 
workarounds, but they're not too pretty in the build file).


If you need anything Ant-related, I have a lot of experience with this
tool; feel free to ask by private e-mail or through the list (I check
the list less frequently, though).


D.

Doug Cutting wrote:

> Jérôme Charron wrote:
> > Finally, the more I look at the Ant code for plugins, the more I think
> > we must redesign it.
> > In the current Ant scripts, each plugin is an Ant project, so there is
> > no way to define Ant dependencies between plugins.
> > (=> if you compile a plugin A that depends on another one (B), you must
> > manually compile B before compiling A => we lose one of the major Ant
> > benefits)
> > I suggest defining each plugin as a target, so that we can define
> > something like:
> > depend="lib-http,lib-commons-httpclient">
>
> This sounds good.  Note that the plugin build.xml may contain some
> plugin-specific commands, like copying test files to the build
> directory, downloading third-party libraries, etc.  How will these be
> accommodated in your scheme?  It seems odd to include these in the
> plugin.xml, since they're really build-specific...
>
> Doug


Re: duplicate libs

2006-02-16 Thread Doug Cutting

Jérôme Charron wrote:

> Finally, the more I look at the Ant code for plugins, the more I think we
> must redesign it.
> In the current Ant scripts, each plugin is an Ant project, so there is no
> way to define Ant dependencies between plugins.
> (=> if you compile a plugin A that depends on another one (B), you must
> manually compile B before compiling A => we lose one of the major Ant
> benefits)
> I suggest defining each plugin as a target, so that we can define something
> like:



This sounds good.  Note that the plugin build.xml may contain some
plugin-specific commands, like copying test files to the build
directory, downloading third-party libraries, etc.  How will these be
accommodated in your scheme?  It seems odd to include these in the
plugin.xml, since they're really build-specific...


Doug


Re: duplicate libs

2006-02-16 Thread Jérôme Charron
> Sounds very good! I may have missed it, but are you able to extract the
> dependencies from the plugin.xml without hacking Ant?

Yes, by using the xmlproperty task: it defines a property for each path
found in the xml document
( http://ant.apache.org/manual/CoreTasks/xmlproperty.html )

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: duplicate libs

2006-02-16 Thread Stefan Groschupf
Sounds very good! I may have missed it, but are you able to extract the
dependencies from the plugin.xml without hacking Ant?
Maybe you can use an XPath expression to extract these values, but this is
just an idea...

Cheers,
Stefan



On 16.02.2006 at 10:54, Jérôme Charron wrote:


> > Yes, there is an easier way. Implement a custom task to which you'll
> > pass a path to plugin.xml and a name for a path.
>
> Finally, the more I look at the Ant code for plugins, the more I think
> we must redesign it.
> In the current Ant scripts, each plugin is an Ant project, so there is
> no way to define Ant dependencies between plugins.
> (=> if you compile a plugin A that depends on another one (B), you must
> manually compile B before compiling A => we lose one of the major Ant
> benefits)
> I suggest defining each plugin as a target, so that we can define
> something like:
>
>
> and then automatically extract the dependencies from plugin.xml with
> something like:
>
>
>
> Any comment is welcome.
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/


---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: duplicate libs

2006-02-16 Thread Jérôme Charron
> Yes, there is an easier way. Implement a custom task to which you'll
> pass a path to plugin.xml and a name for a path.

Finally, the more I look at the Ant code for plugins, the more I think we must
redesign it.
In the current Ant scripts, each plugin is an Ant project, so there is no way
to define Ant dependencies between plugins.
(=> if you compile a plugin A that depends on another one (B), you must
manually compile B before compiling A => we lose one of the major Ant
benefits)
I suggest defining each plugin as a target, so that we can define something
like:


and then automatically extract the dependencies from plugin.xml with something
like:





Any comment is welcome.
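
To illustrate the ordering problem described above (B must be compiled before A), here is a minimal sketch in Java of a dependency-first build order. It is not taken from the Nutch build; the class name PluginBuildOrder and the plugin name "plugin-a" are hypothetical, and only the two lib-* names come from this thread. Declaring dependencies between targets gives Ant the information to do this kind of ordering itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: orders plugins so that dependencies are built first. */
final class PluginBuildOrder {

    /** deps maps a plugin name to the plugins it depends on. */
    static List<String> order(Map<String, List<String>> deps) {
        List<String> result = new ArrayList<>();
        Set<String> done = new LinkedHashSet<>();
        Set<String> inProgress = new LinkedHashSet<>();
        for (String plugin : deps.keySet()) {
            visit(plugin, deps, done, inProgress, result);
        }
        return result;
    }

    private static void visit(String plugin, Map<String, List<String>> deps,
                              Set<String> done, Set<String> inProgress,
                              List<String> result) {
        if (done.contains(plugin)) {
            return; // already scheduled, build it only once
        }
        if (!inProgress.add(plugin)) {
            throw new IllegalStateException("Circular dependency involving " + plugin);
        }
        // Schedule every dependency before the plugin itself.
        for (String dep : deps.getOrDefault(plugin, Collections.<String>emptyList())) {
            visit(dep, deps, done, inProgress, result);
        }
        inProgress.remove(plugin);
        done.add(plugin);
        result.add(plugin);
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        // "plugin-a" is a made-up name; the lib-* names appear in the thread.
        deps.put("plugin-a", Arrays.asList("lib-http", "lib-commons-httpclient"));
        deps.put("lib-http", Collections.<String>emptyList());
        deps.put("lib-commons-httpclient", Collections.<String>emptyList());
        System.out.println(order(deps));
        // prints [lib-http, lib-commons-httpclient, plugin-a]
    }
}
```

With one Ant project per plugin, nothing performs this ordering across plugins, which is why B currently has to be compiled by hand before A.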

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


RE: duplicate libs

2006-02-15 Thread Fuad Efendi
>maybe you will find this interesting as well:
>http://maven.apache.org/using/multiproject.html

I'd rather suggest supporting Apache HttpClient; a huge amount of unnecessary
code could easily be removed from Nutch. We don't need to calculate the "actual
URL" after a redirect; GetMethod does it all for us.

Using HTTP HEAD can improve performance, and there is much more. Google uses
the HEAD method, as I noticed from logs.

What about the NekoHTML parser? The getTextHelper method seems very strange;
Java 5 does it all (DOM Level 3). A new Parser plugin could be based on
http://htmlparser.sourceforge.net, and again we could remove the buggy
getOutlinks().

I have experience with Maven and CruiseControl. All of Maven's stuff
(checkstyle, javadoc, xdoc, developer activity report, etc.) could be run
via Ant. Not a first priority...



Re: duplicate libs

2006-02-15 Thread Stefan Groschupf

Hi,

I understand that and many people have the same point of view.

> Maven seems to be a really good software project management tool.
> But for now, I don't plan to migrate to Maven...
> (I don't have enough knowledge about it, so I don't have a good
> overview of it).

Maybe at some point in the future we can add an alternative build based on
Maven while still keeping the Ant-based build, and then we can see whether
Maven fits all our needs.


Stefan 


Re: duplicate libs

2006-02-15 Thread Jérôme Charron
> maybe you will find this interesting as well:
> http://maven.apache.org/using/multiproject.html

Thanks Stefan.
Maven seems to be a really good software project management tool.
But for now, I don't plan to migrate to Maven...
(I don't have enough knowledge about it, so I don't have a good overview
of it).

Regards

Jérôme


Re: duplicate libs

2006-02-15 Thread Stefan Groschupf

Hi Jérôme,
maybe you will find this interesting as well:
http://maven.apache.org/using/multiproject.html

Greetings,
Stefan 

Re: duplicate libs

2006-02-15 Thread Jérôme Charron
> Yes, there is an easier way. Implement a custom task to which you'll
> pass a path to plugin.xml and a name for a path. The task (Java code)
> will create a named (id)  object which can be subsequently used in
> ant with .
>
> This requires a custom ant task, but as you mentioned foreach is also a
> separate library, so I don't see a huge disadvantage.
>
> Carrot2 codebase contains similar functionality in carrot2-ant-extensions
> module, although it should be trivial to implement it from scratch.

Thanks Dawid for all this information.
I really prefer the approach you propose.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: duplicate libs

2006-02-14 Thread Dawid Weiss


Yes, there is an easier way. Implement a custom task to which you'll 
pass a path to plugin.xml and a name for a path. The task (Java code) 
will create a named (id)  object which can be subsequently used in 
ant with .


This requires a custom ant task, but as you mentioned foreach is also a 
separate library, so I don't see a huge disadvantage.


The Carrot2 codebase contains similar functionality in the
carrot2-ant-extensions module, although it should be trivial to implement
it from scratch.
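
As a rough illustration of the parsing part such a custom task would need, here is a minimal sketch in plain Java using only the JDK's DOM parser. It is not the Carrot2 code and it leaves out the Ant task wrapper entirely. The class name PluginDependencies is hypothetical, and the plugin.xml layout (a requires element containing import elements with a plugin attribute) is an assumption inferred from the ${plugin.requires.import(plugin)} property name quoted below.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: the parsing core a custom task might delegate to. */
final class PluginDependencies {

    /** Returns the plugin ids found under requires/import, in document order. */
    static List<String> fromXml(String pluginXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(
                    pluginXml.getBytes(StandardCharsets.UTF_8)));

            List<String> ids = new ArrayList<>();
            NodeList requires = doc.getElementsByTagName("requires");
            for (int i = 0; i < requires.getLength(); i++) {
                NodeList imports =
                        ((Element) requires.item(i)).getElementsByTagName("import");
                for (int j = 0; j < imports.getLength(); j++) {
                    // Assumed attribute name: "plugin".
                    String id = ((Element) imports.item(j)).getAttribute("plugin");
                    if (!id.isEmpty()) {
                        ids.add(id);
                    }
                }
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse plugin.xml content", e);
        }
    }

    public static void main(String[] args) {
        // Made-up descriptor; the two import ids are the ones named in the thread.
        String xml = "<plugin id=\"parse-example\">"
                + "<requires>"
                + "<import plugin=\"nutch-extensionpoints\"/>"
                + "<import plugin=\"lib-jakarta-poi\"/>"
                + "</requires>"
                + "</plugin>";
        System.out.println(String.join(",", fromXml(xml)));
        // prints nutch-extensionpoints,lib-jakarta-poi
    }
}
```

A real task would still have to turn these ids into classpath entries and register them with the Ant project; that part depends on the build layout and is not sketched here.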


D.

Jérôme Charron wrote:

> Is there any Ant guru on the nutch-dev list?
> Since the number of plugins in Nutch is increasing and dependencies are
> used more and more, I would like to add to build-plugin.xml the ability
> to dynamically add to the classpath the dependencies defined in a
> plugin.xml file (this avoids declaring dependencies twice in a plugin: once
> in the plugin's build.xml and once in the plugin's plugin.xml).
> So if someone has any idea how to achieve this...
> The only way I have found is to load the plugin.xml file using the
> xmlproperty Ant task, then access the ${plugin.requires.import(plugin)}
> property, which contains all the plugin's dependencies separated by commas
> (i.e. nutch-extensionpoints,lib-jakarta-poi,...).
> The idea is then to use the foreach Ant task to iterate over these values
> and then build a fileset
> (the foreach task requires importing ant-contrib into Ant).
>
> Is there an easier way to implement this?

Thanks

Jérôme



Re: duplicate libs

2006-02-14 Thread Jérôme Charron
Is there any Ant guru on the nutch-dev list?
Since the number of plugins in Nutch is increasing and dependencies are
used more and more, I would like to add to build-plugin.xml the ability
to dynamically add to the classpath the dependencies defined in a
plugin.xml file (this avoids declaring dependencies twice in a plugin: once
in the plugin's build.xml and once in the plugin's plugin.xml).
So if someone has any idea how to achieve this...
The only way I have found is to load the plugin.xml file using the
xmlproperty Ant task, then access the ${plugin.requires.import(plugin)}
property, which contains all the plugin's dependencies separated by commas
(i.e. nutch-extensionpoints,lib-jakarta-poi,...).
The idea is then to use the foreach Ant task to iterate over these values
and then build a fileset
(the foreach task requires importing ant-contrib into Ant).

Is there an easier way to implement this?

Thanks

Jérôme


Re: duplicate libs

2006-02-14 Thread Doug Cutting

Jérôme Charron wrote:

> Isn't it already reported in http://issues.apache.org/jira/browse/NUTCH-196?


Yes, you're right.


> I have still provided a patch for a log4j lib.
> If there is no objection, I will commit it and go ahead for
> * lib-commons-httpclient
> * lib-nekohtml


+1

Thanks!

Doug


Re: duplicate libs

2006-02-14 Thread Michael Nebel

Hi,

when you consolidate the libs, perhaps you can add a version of xalan.
This seems to be needed by the OpenSearchServlet. But I'm not entirely
sure that it isn't caused by a broken Tomcat installation of mine. Can
someone please verify my observation?


Regards

Michael

--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


Re: duplicate libs

2006-02-14 Thread Dawid Weiss



> > log4j-1.2.11.jar    src/plugin/clustering-carrot2/lib
> > log4j-1.2.6.jar     src/plugin/parse-rss/lib
> > log4j-1.2.9.jar     src/plugin/parse-pdf/lib
> >
> > nekohtml-0.9.2.jar  src/plugin/clustering-carrot2/lib
> > nekohtml-0.9.4.jar  src/plugin/parse-html/lib
>
> The differences here AFAIK are purely accidental, and I believe we can
> just keep the latest releases.


I'll adjust to whichever version is in the repository -- these two have 
stable APIs anyway.


D.


Re: duplicate libs

2006-02-14 Thread Jérôme Charron
> > There are a number of duplicated libs in the plugins, namely:

Isn't it already reported in http://issues.apache.org/jira/browse/NUTCH-196?
I have still provided a patch for a log4j lib.
If there is no objection, I will commit it and go ahead for
* lib-commons-httpclient
* lib-nekohtml

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


RE: duplicate libs

2006-02-13 Thread Chris Mattmann
Hi Andrzej,


> > commons-httpclient-3.0-beta1.jar  src/plugin/parse-rss/lib
> > commons-httpclient-3.0.jar        src/plugin/protocol-httpclient/lib
> 
> Not sure what was the reason to use the beta1, perhaps no reason except
> that it was the latest available at the moment...

Yup, I think that was exactly the reason in the case of parse-rss...

> 
> >
> > log4j-1.2.11.jar                  src/plugin/clustering-carrot2/lib
> > log4j-1.2.6.jar                   src/plugin/parse-rss/lib
> > log4j-1.2.9.jar                   src/plugin/parse-pdf/lib
> >
> > nekohtml-0.9.2.jar                src/plugin/clustering-carrot2/lib
> > nekohtml-0.9.4.jar                src/plugin/parse-html/lib
> 
> The differences here AFAIK are purely accidental, and I believe we can
> just keep the latest releases.

Agreed.

> 
> >
> > xerces-2_6_2.jar                  lib
> > xercesImpl.jar                    src/plugin/parse-rss/lib
> 
> Not sure about these ones, but Xerces APIs are pretty stable, so I'd
> risk removing xercesImpl.jar .

I think that xercesImpl.jar contains classes that are required by parse-rss
to function. I haven't investigated in a while, but don't xerces-2_6_2.jar
and xercesImpl.jar contain different classes?

> 
> >
> > Are there any known reasons to keep multiple versions of things, or
> > should we move these each into their own plugin that can be shared?
> 
> The latter is what I advocated for log4j and various xml-related high
> level API libs (jdom, dom4j, jaxen).

+1

Cheers,
 Chris

> 
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com




RE: duplicate libs

2006-02-13 Thread Fuad Efendi
Oops... Sorry!

Last update: 2005-12-07 14:05:13
(I am tired as usual)

>>HttpClient v.3.0 is updated daily, and recent update Feb-13-2006 fixes some
>>threading issues...

- not true.



RE: duplicate libs

2006-02-13 Thread Fuad Efendi
BTW,
HttpClient v.3.0 is updated daily, and the recent update of Feb-13-2006 fixes
some threading issues... We could also refactor something in the plugin (I
wish)... Using Spring Framework I was easily able to decouple all HttpPlugin
configuration parameters...

>commons-httpclient-3.0-beta1.jar  src/plugin/parse-rss/lib
>commons-httpclient-3.0.jar        src/plugin/protocol-httpclient/lib



Re: duplicate libs

2006-02-13 Thread Andrzej Bialecki

Doug Cutting wrote:

> There are a number of duplicated libs in the plugins, namely:
>
> commons-httpclient-3.0-beta1.jar  src/plugin/parse-rss/lib
> commons-httpclient-3.0.jar        src/plugin/protocol-httpclient/lib

Not sure what the reason was for using beta1; perhaps none, except
that it was the latest available at the time...

> log4j-1.2.11.jar                  src/plugin/clustering-carrot2/lib
> log4j-1.2.6.jar                   src/plugin/parse-rss/lib
> log4j-1.2.9.jar                   src/plugin/parse-pdf/lib
>
> nekohtml-0.9.2.jar                src/plugin/clustering-carrot2/lib
> nekohtml-0.9.4.jar                src/plugin/parse-html/lib

The differences here AFAIK are purely accidental, and I believe we can
just keep the latest releases.

> xerces-2_6_2.jar                  lib
> xercesImpl.jar                    src/plugin/parse-rss/lib

Not sure about these, but the Xerces APIs are pretty stable, so I'd
risk removing xercesImpl.jar.

> Are there any known reasons to keep multiple versions of things, or
> should we move these each into their own plugin that can be shared?

The latter is what I advocated for log4j and various XML-related
high-level API libs (jdom, dom4j, jaxen).


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: duplicate libs

2006-02-13 Thread Chris Mattmann
Hey Doug,

  I think that, at least in the case of parse-rss, parse-pdf, and the Nutch
core, there's probably some utility in having lib-xxx plugins (or at least
putting these jars in $NUTCH_HOME/lib) for:

commons-httpclient
log4j
xerces

Then, protocol-httpclient, parse-pdf and the rest of the nutch core classes
could all reference these libraries. I'm working on NUTCH-140 right now, but
if there is need for this, I can create an issue in JIRA and then work on it
as well...

Cheers,
  Chris



On 2/13/06 3:26 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> There are a number of duplicated libs in the plugins, namely:
> 
> commons-httpclient-3.0-beta1.jar  src/plugin/parse-rss/lib
> commons-httpclient-3.0.jar        src/plugin/protocol-httpclient/lib
> 
> log4j-1.2.11.jar                  src/plugin/clustering-carrot2/lib
> log4j-1.2.6.jar                   src/plugin/parse-rss/lib
> log4j-1.2.9.jar                   src/plugin/parse-pdf/lib
> 
> nekohtml-0.9.2.jar                src/plugin/clustering-carrot2/lib
> nekohtml-0.9.4.jar                src/plugin/parse-html/lib
> 
> xerces-2_6_2.jar                  lib
> xercesImpl.jar                    src/plugin/parse-rss/lib
> 
> Are there any known reasons to keep multiple versions of things, or
> should we move these each into their own plugin that can be shared?
> 
> Doug

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




duplicate libs

2006-02-13 Thread Doug Cutting

There are a number of duplicated libs in the plugins, namely:

commons-httpclient-3.0-beta1.jar  src/plugin/parse-rss/lib
commons-httpclient-3.0.jar        src/plugin/protocol-httpclient/lib

log4j-1.2.11.jar                  src/plugin/clustering-carrot2/lib
log4j-1.2.6.jar                   src/plugin/parse-rss/lib
log4j-1.2.9.jar                   src/plugin/parse-pdf/lib

nekohtml-0.9.2.jar                src/plugin/clustering-carrot2/lib
nekohtml-0.9.4.jar                src/plugin/parse-html/lib

xerces-2_6_2.jar                  lib
xercesImpl.jar                    src/plugin/parse-rss/lib

Are there any known reasons to keep multiple versions of things, or 
should we move these each into their own plugin that can be shared?
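
For anyone who wants to repeat this kind of check later, here is a minimal sketch in Java of how a list like the one above could be derived from jar file names. It is only an illustration, not something that exists in the Nutch tree: the class name DuplicateLibs is hypothetical, and the "name-version.jar" naming pattern is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reports libraries present in more than one version. */
final class DuplicateLibs {

    // Assumed naming scheme: "<name>-<version>.jar", version starting with a digit.
    private static final Pattern VERSIONED_JAR =
            Pattern.compile("^(.+?)-(\\d.*)\\.jar$");

    /** Maps each library name to its versions, keeping only names seen twice or more. */
    static Map<String, List<String>> findDuplicates(List<String> jarNames) {
        Map<String, List<String>> versionsByLib = new TreeMap<>();
        for (String jar : jarNames) {
            Matcher m = VERSIONED_JAR.matcher(jar);
            if (m.matches()) {
                versionsByLib
                        .computeIfAbsent(m.group(1), k -> new ArrayList<>())
                        .add(m.group(2));
            }
            // Jars without a version in their name are skipped.
        }
        versionsByLib.values().removeIf(versions -> versions.size() < 2);
        return versionsByLib;
    }

    public static void main(String[] args) {
        List<String> jars = Arrays.asList(
                "commons-httpclient-3.0-beta1.jar", "commons-httpclient-3.0.jar",
                "log4j-1.2.11.jar", "log4j-1.2.6.jar", "log4j-1.2.9.jar",
                "nekohtml-0.9.2.jar", "nekohtml-0.9.4.jar",
                "xerces-2_6_2.jar", "xercesImpl.jar");
        System.out.println(findDuplicates(jars));
    }
}
```

Note that a name-based check like this cannot pair xerces-2_6_2.jar with xercesImpl.jar, because the second file name carries no version; that case still needs a manual look.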


Doug