Re: duplicate libs
Thanks Dawid (and everybody else),

But in fact, on taking a closer look at this "problem", it seems it is not really a problem... (sorry)

Finally, I don't think it is a good idea to try to derive the dependencies from the plugin.xml file: the plugin.xml file describes the runtime dependencies, whereas the build.xml file describes the compile dependencies...

So the final solution will be simply to declare the compile dependencies in every plugin's build.xml file.

I have tested it... it works. So now I will update all plugins before committing.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
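A minimal sketch of how compile dependencies might be declared in a plugin's build.xml. It assumes a shared build-plugin.xml that runs a `deps-jar` target before compiling and adds a `plugin.deps` path to the compile classpath. The target name, the path id, the `nutch.root` property and the choice of plugins are all assumptions made for illustration, not something stated in the message.

```xml
<project name="protocol-httpclient" default="jar">

  <!-- Assumption: shared targets (compile, jar, ...) live in ../build-plugin.xml -->
  <import file="../build-plugin.xml"/>

  <!-- Assumption: the shared build calls "deps-jar" before compiling,
       so the plugin we depend on is built first -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-commons-httpclient"/>
  </target>

  <!-- Assumption: the shared build adds the path "plugin.deps"
       to the javac classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-commons-httpclient/*.jar"/>
    </fileset>
  </path>

</project>
```

With a layout like this, plugin.xml keeps describing what is needed at runtime, while build.xml lists only what is needed to compile.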
Re: duplicate libs
I just wanted to say we've gone through such problems already in Carrot2 -- many modules depend on each other, and some of them have custom build steps. A pure ANT solution is likely to be quite ugly...

But back to the point: you can test for the existence of a plugin-specific build file and execute it if it exists. This will probably require mutable properties and conditionals, available in ant-contrib (there are pure-ant workarounds, but they're not too pretty in the build file).

If you need anything ant-related, I have a lot of experience with this tool; feel free to ask on my private e-mail or through the list (I check the list less frequently, though).

D.

Doug Cutting wrote:
> Jérôme Charron wrote:
> > Finally, the more I look at the ant code for plugins, the more I think we must redesign it.
> > In the current ant scripts, each plugin is an ant project, so there is no way to define ant dependencies between plugins.
> > (=> if you compile a plugin A that depends on another one (B), you must manually compile B before compiling A => we lose one of the major ant benefits)
> > I suggest defining each plugin as a target, so that we can define something like:
> > depend="lib-http,lib-commons-httpclient">
>
> This sounds good. Note that the plugin build.xml may contain some plugin-specific commands, like copying test files to the build directory, downloading third party libraries, etc. How will these be accommodated in your scheme? It seems odd to include these in the plugin.xml, since they're really build-specific...
>
> Doug
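The "test for a plugin-specific build file and execute it if it exists" idea could be sketched with ant-contrib's `<if>` task roughly as follows. This assumes the ant-contrib jar is on Ant's classpath. The `plugin.dir` property, the `jar` target name and the fallback to a shared build-plugin.xml are hypothetical.

```xml
<project name="plugins" default="build-plugin" basedir=".">

  <!-- Assumes ant-contrib is available to Ant (for example in ANT_HOME/lib) -->
  <taskdef resource="net/sf/antcontrib/antcontrib.properties"/>

  <!-- plugin.dir is a hypothetical property naming the plugin's directory -->
  <target name="build-plugin">
    <if>
      <available file="${plugin.dir}/build.xml"/>
      <then>
        <!-- A plugin-specific build file exists: run it -->
        <ant dir="${plugin.dir}" antfile="build.xml" target="jar" inheritAll="false"/>
      </then>
      <else>
        <!-- No specific build file: fall back to the shared default build -->
        <ant dir="${plugin.dir}" antfile="${basedir}/build-plugin.xml" target="jar" inheritAll="false"/>
      </else>
    </if>
  </target>

</project>
```

It would be invoked with something like `ant -Dplugin.dir=parse-rss build-plugin`.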
Re: duplicate libs
Jérôme Charron wrote:
> Finally, the more I look at the ant code for plugins, the more I think we must redesign it.
> In the current ant scripts, each plugin is an ant project, so there is no way to define ant dependencies between plugins.
> (=> if you compile a plugin A that depends on another one (B), you must manually compile B before compiling A => we lose one of the major ant benefits)
> I suggest defining each plugin as a target, so that we can define something like:

This sounds good. Note that the plugin build.xml may contain some plugin-specific commands, like copying test files to the build directory, downloading third party libraries, etc. How will these be accommodated in your scheme? It seems odd to include these in the plugin.xml, since they're really build-specific...

Doug
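A rough sketch of the "each plugin as a target" proposal, under the assumption that each plugin keeps its own build.xml for plugin-specific steps and the top-level file only encodes the ordering. The dependent plugin name (`protocol-httpclient`) and the `jar` target are assumptions. Note that the Ant attribute is spelled `depends`.

```xml
<project name="plugins" default="protocol-httpclient" basedir=".">

  <target name="lib-http">
    <ant dir="lib-http" target="jar" inheritAll="false"/>
  </target>

  <target name="lib-commons-httpclient">
    <ant dir="lib-commons-httpclient" target="jar" inheritAll="false"/>
  </target>

  <!-- Hypothetical dependent plugin: Ant builds both libraries first -->
  <target name="protocol-httpclient" depends="lib-http,lib-commons-httpclient">
    <ant dir="protocol-httpclient" target="jar" inheritAll="false"/>
  </target>

</project>
```

Delegating to each plugin's own build file would be one way to keep copying test files, downloading libraries and similar steps out of plugin.xml.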
Re: duplicate libs
> Sounds very good! I may have missed it, but are you able to extract the
> dependencies from the plugin.xml without hacking ant?

Yes, by using the xmlproperty task: it defines a property for each path found in the xml document ( http://ant.apache.org/manual/CoreTasks/xmlproperty.html )

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
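As a small illustration of the xmlproperty approach, the following sketch loads a plugin.xml and prints the dependency list. It assumes a descriptor whose dependencies are written as `<import plugin="..."/>` elements inside `<requires>` under a `<plugin>` root, which matches the `plugin.requires.import(plugin)` property named elsewhere in this thread. The project and target names are arbitrary.

```xml
<project name="show-plugin-deps" default="show" basedir=".">

  <target name="show">
    <!-- Each element/attribute path in plugin.xml becomes a property;
         attributes appear in parentheses, e.g. plugin.requires.import(plugin) -->
    <xmlproperty file="${basedir}/plugin.xml"/>

    <!-- Repeated <import> elements are collected into one comma-separated value -->
    <echo message="Dependencies: ${plugin.requires.import(plugin)}"/>
  </target>

</project>
```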
Re: duplicate libs
Sounds very good! I may have missed it, but are you able to extract the dependencies from the plugin.xml without hacking ant?
Maybe you can use an XPath to extract these values, but this is just an idea...

Cheers,
Stefan

On 16.02.2006 at 10:54, Jérôme Charron wrote:
> > Yes, there is an easier way. Implement a custom task to which you'll
> > pass a path to plugin.xml and a name for a path.
>
> Finally, the more I look at the ant code for plugins, the more I think we must redesign it.
> In the current ant scripts, each plugin is an ant project, so there is no way to define ant dependencies between plugins.
> (=> if you compile a plugin A that depends on another one (B), you must manually compile B before compiling A => we lose one of the major ant benefits)
> I suggest defining each plugin as a target, so that we can define something like:
> and then automatically extract dependencies from plugin.xml with something like:
>
> Any comment is welcome.
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/

---
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
Re: duplicate libs
> Yes, there is an easier way. Implement a custom task to which you'll
> pass a path to plugin.xml and a name for a path.

Finally, the more I look at the ant code for plugins, the more I think we must redesign it.
In the current ant scripts, each plugin is an ant project, so there is no way to define ant dependencies between plugins.
(=> if you compile a plugin A that depends on another one (B), you must manually compile B before compiling A => we lose one of the major ant benefits)

I suggest defining each plugin as a target, so that we can define something like:
and then automatically extract dependencies from plugin.xml with something like:

Any comment is welcome.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
RE: duplicate libs
> you may find this interesting as well:
> http://maven.apache.org/using/multiproject.html

I'd rather suggest supporting Apache HttpClient; a huge amount of unnecessary code could easily be removed from Nutch. We don't need to calculate the "actual URL" after redirecting; GetMethod does it all for us. Using HTTP HEAD can improve performance, and there is much more. Google uses the HEAD method, as I noticed from logs.

What about the NekoHTML parser? The getTextHelper method seems very strange; Java 5 does it all (DOM level 3). A new Parser plugin could be based on http://htmlparser.sourceforge.net - and again we could remove the buggy getOutlinks().

I have experience with Maven and CruiseControl. All of Maven's stuff (checkstyle, javadoc, xdoc, developer's activity report, etc.) could be run via ANT. Not a first priority...
Re: duplicate libs
Hi,

I understand that, and many people have the same point of view.

> Maven seems to be a really good project software management tool.
> But for now, I don't plan to migrate to maven...
> (I don't have enough knowledge about it and so I don't have a good overview of it).

Maybe at some point in the future we can add an alternative build based on maven while still keeping the ant-based build, and then we can see whether maven fits all needs.

Stefan
Re: duplicate libs
> you may find this interesting as well:
> http://maven.apache.org/using/multiproject.html

Thanks Stefan.
Maven seems to be a really good project software management tool.
But for now, I don't plan to migrate to maven...
(I don't have enough knowledge about it and so I don't have a good overview of it).

Regards

Jérôme
Re: duplicate libs
Hi Jérôme,

you may find this interesting as well:
http://maven.apache.org/using/multiproject.html

Greetings,
Stefan
Re: duplicate libs
> Yes, there is an easier way. Implement a custom task to which you'll
> pass a path to plugin.xml and a name for a path. The task (Java code)
> will create a named (id) object which can subsequently be referenced
> in ant by that id.
>
> This requires a custom ant task, but as you mentioned foreach is also a
> separate library, so I don't see a huge disadvantage.
>
> Carrot2 codebase contains similar functionality in carrot2-ant-extensions
> module, although it should be trivial to implement it from scratch.

Thanks Dawid for all this information.
I really prefer the way you propose.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: duplicate libs
Yes, there is an easier way. Implement a custom task to which you'll pass a path to plugin.xml and a name for a path. The task (Java code) will create a named (id) object which can subsequently be referenced in ant by that id.

This requires a custom ant task, but as you mentioned foreach is also a separate library, so I don't see a huge disadvantage.

Carrot2 codebase contains similar functionality in the carrot2-ant-extensions module, although it should be trivial to implement it from scratch.

D.

Jérôme Charron wrote:
> Is there any ant guru on the nutch-dev list?
>
> Since the number of plugins in nutch is increasing and dependencies are used more and more, I would like to add to build-plugin.xml the capability to dynamically add to the classpath the dependencies defined in a plugin.xml file (this avoids declaring dependencies twice in a plugin: once in the plugin's build.xml and once in the plugin's plugin.xml).
>
> So if someone has any idea on how to achieve such behavior...
>
> The only way I saw was to load the plugin.xml file using the ant xmlproperty task, then access the ${plugin.requires.import(plugin)} property, which contains all the plugin's dependencies separated by a comma (i.e. nutch-extensionpoints,lib-jakarta-poi,...)
> The idea is then to use the ant foreach task to iterate over these values and then build a fileset (the foreach task requires importing antcontrib into ant).
>
> Is there an easier way to implement this?
>
> Thanks
>
> Jérôme
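From the build file's side, the custom-task idea might look roughly like the sketch below. Everything specific here is hypothetical: the task name `plugindeps`, its class `org.example.ant.PluginDepsTask`, its `pluginxml` and `pathid` attributes, and the directory properties are invented for illustration and do not refer to an existing task in Nutch, Carrot2 or Ant. Only `<taskdef>`, `<javac>` and `<classpath refid="..."/>` are standard Ant.

```xml
<project name="some-plugin" default="compile" basedir=".">

  <!-- Hypothetical custom task; class name and location are placeholders -->
  <taskdef name="plugindeps"
           classname="org.example.ant.PluginDepsTask"
           classpath="${ant.tasks.dir}"/>

  <target name="compile">
    <!-- Hypothetical attributes: read plugin.xml and register a path
         under the id "plugin.deps" -->
    <plugindeps pluginxml="${basedir}/plugin.xml" pathid="plugin.deps"/>

    <!-- Standard Ant: use the registered path by reference -->
    <javac srcdir="${src.dir}" destdir="${build.classes}">
      <classpath refid="plugin.deps"/>
    </javac>
  </target>

</project>
```

The attraction of this design is that the dependency list stays in plugin.xml only, and the build file never repeats it.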
Re: duplicate libs
Is there any ant guru on the nutch-dev list?

Since the number of plugins in nutch is increasing and dependencies are used more and more, I would like to add to build-plugin.xml the capability to dynamically add to the classpath the dependencies defined in a plugin.xml file (this avoids declaring dependencies twice in a plugin: once in the plugin's build.xml and once in the plugin's plugin.xml).

So if someone has any idea on how to achieve such behavior...

The only way I saw was to load the plugin.xml file using the ant xmlproperty task, then access the ${plugin.requires.import(plugin)} property, which contains all the plugin's dependencies separated by a comma (i.e. nutch-extensionpoints,lib-jakarta-poi,...)
The idea is then to use the ant foreach task to iterate over these values and then build a fileset (the foreach task requires importing antcontrib into ant).

Is there an easier way to implement this?

Thanks

Jérôme
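The xmlproperty plus foreach combination described above could be sketched like this, assuming ant-contrib is on Ant's classpath. The `show-dep` target and the `build/${dep}` directory layout are hypothetical. The sketch only echoes each dependency rather than building a fileset.

```xml
<project name="iterate-plugin-deps" default="list-deps" basedir=".">

  <!-- Assumes ant-contrib is available to Ant (for example in ANT_HOME/lib) -->
  <taskdef resource="net/sf/antcontrib/antcontrib.properties"/>

  <target name="list-deps">
    <xmlproperty file="${basedir}/plugin.xml"/>
    <!-- Calls "show-dep" once per comma-separated entry, passing it as ${dep} -->
    <foreach list="${plugin.requires.import(plugin)}" delimiter=","
             param="dep" target="show-dep"/>
  </target>

  <target name="show-dep">
    <!-- Hypothetical layout: one build directory per plugin -->
    <echo message="Would add build/${dep}/*.jar to the classpath"/>
  </target>

</project>
```

One caveat, as far as I understand the task: foreach runs its target much like antcall does, so properties or paths defined inside `show-dep` are not visible to the caller, which makes accumulating a classpath this way awkward.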
Re: duplicate libs
Jérôme Charron wrote:
> Isn't it already reported in http://issues.apache.org/jira/browse/NUTCH-196?

Yes, you're right.

> I have still provided a patch for a log4j lib.
> If there is no objection, I will commit it and go ahead for
> * lib-commons-httpclient
> * lib-nekohtml

+1

Thanks!

Doug
Re: duplicate libs
Hi,

when you consolidate the libs, perhaps you can add a version of xalan. This seems to be needed by the OpenSearchServlet. But I'm not entirely sure that it isn't a broken tomcat installation of mine. Can someone please verify my observation?

Regards

Michael

--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/
Re: duplicate libs
> > log4j-1.2.11.jar     src/plugin/clustering-carrot2/lib
> > log4j-1.2.6.jar      src/plugin/parse-rss/lib
> > log4j-1.2.9.jar      src/plugin/parse-pdf/lib
> >
> > nekohtml-0.9.2.jar   src/plugin/clustering-carrot2/lib
> > nekohtml-0.9.4.jar   src/plugin/parse-html/lib
>
> The differences here AFAIK are purely accidental, and I believe we can
> just keep the latest releases.

I'll adjust to whichever version is in the repository -- these two have stable APIs anyway.

D.
Re: duplicate libs
> > There are a number of duplicated libs in the plugins, namely:

Isn't it already reported in http://issues.apache.org/jira/browse/NUTCH-196?
I have still provided a patch for a log4j lib.
If there is no objection, I will commit it and go ahead for
* lib-commons-httpclient
* lib-nekohtml

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
RE: duplicate libs
Hi Andrzej,

> > commons-httpclient-3.0-beta1.jar   src/plugin/parse-rss/lib
> > commons-httpclient-3.0.jar         src/plugin/protocol-httpclient/lib
>
> Not sure what was the reason to use the beta1, perhaps no reason except
> that it was the latest available at the moment...

Yup, I think that was exactly the reason in the case of parse-rss...

> > log4j-1.2.11.jar     src/plugin/clustering-carrot2/lib
> > log4j-1.2.6.jar      src/plugin/parse-rss/lib
> > log4j-1.2.9.jar      src/plugin/parse-pdf/lib
> >
> > nekohtml-0.9.2.jar   src/plugin/clustering-carrot2/lib
> > nekohtml-0.9.4.jar   src/plugin/parse-html/lib
>
> The differences here AFAIK are purely accidental, and I believe we can
> just keep the latest releases.

Agreed.

> > xerces-2_6_2.jar   lib
> > xercesImpl.jar     src/plugin/parse-rss/lib
>
> Not sure about these ones, but Xerces APIs are pretty stable, so I'd
> risk removing xercesImpl.jar.

I think that xercesImpl.jar contains classes that are required by parse-rss to function. I haven't investigated in a while, but don't xerces-2_6_2.jar and xercesImpl.jar contain different classes?

> > Are there any known reasons to keep multiple versions of things, or
> > should we move these each into their own plugin that can be shared?
>
> The latter is what I advocated for log4j and various xml-related high
> level API libs (jdom, dom4j, jaxen).

+1

Cheers,
Chris

> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
RE: duplicate libs
Oops... Sorry! Last update: 2005-12-07 14:05:13
(I am tired as usual)

>> HttpClient v.3.0 is updated daily, and recent update Feb-13-2006 fixes some
>> threading issues...

- not true.
RE: duplicate libs
BTW, HttpClient v.3.0 is updated daily, and the recent update of Feb-13-2006 fixes some threading issues...

We could also refactor something in the plugin (I wish)... Using the Spring Framework I was able to easily decouple all HttpPlugin configuration parameters...

> commons-httpclient-3.0-beta1.jar   src/plugin/parse-rss/lib
> commons-httpclient-3.0.jar         src/plugin/protocol-httpclient/lib
Re: duplicate libs
Doug Cutting wrote:
> There are a number of duplicated libs in the plugins, namely:
>
> commons-httpclient-3.0-beta1.jar   src/plugin/parse-rss/lib
> commons-httpclient-3.0.jar         src/plugin/protocol-httpclient/lib

Not sure what was the reason to use the beta1, perhaps no reason except that it was the latest available at the moment...

> log4j-1.2.11.jar     src/plugin/clustering-carrot2/lib
> log4j-1.2.6.jar      src/plugin/parse-rss/lib
> log4j-1.2.9.jar      src/plugin/parse-pdf/lib
>
> nekohtml-0.9.2.jar   src/plugin/clustering-carrot2/lib
> nekohtml-0.9.4.jar   src/plugin/parse-html/lib

The differences here AFAIK are purely accidental, and I believe we can just keep the latest releases.

> xerces-2_6_2.jar   lib
> xercesImpl.jar     src/plugin/parse-rss/lib

Not sure about these ones, but Xerces APIs are pretty stable, so I'd risk removing xercesImpl.jar.

> Are there any known reasons to keep multiple versions of things, or
> should we move these each into their own plugin that can be shared?

The latter is what I advocated for log4j and various xml-related high level API libs (jdom, dom4j, jaxen).

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: duplicate libs
Hey Doug,

I think that, at least in the case of parse-rss, parse-pdf, and the nutch core, there's probably some utility in having lib-xxx plugins (or at least putting these jars in $NUTCH_HOME/lib) for:

commons-httpclient
log4j
xerces

Then protocol-httpclient, parse-pdf and the rest of the nutch core classes could all reference these libraries. I'm working on NUTCH-140 right now, but if there is a need for this, I can create an issue in JIRA and then work on it as well...

Cheers,
Chris

On 2/13/06 3:26 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> There are a number of duplicated libs in the plugins, namely:
>
> commons-httpclient-3.0-beta1.jar   src/plugin/parse-rss/lib
> commons-httpclient-3.0.jar         src/plugin/protocol-httpclient/lib
>
> log4j-1.2.11.jar     src/plugin/clustering-carrot2/lib
> log4j-1.2.6.jar      src/plugin/parse-rss/lib
> log4j-1.2.9.jar      src/plugin/parse-pdf/lib
>
> nekohtml-0.9.2.jar   src/plugin/clustering-carrot2/lib
> nekohtml-0.9.4.jar   src/plugin/parse-html/lib
>
> xerces-2_6_2.jar   lib
> xercesImpl.jar     src/plugin/parse-rss/lib
>
> Are there any known reasons to keep multiple versions of things, or
> should we move these each into their own plugin that can be shared?
>
> Doug

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
_
Jet Propulsion Laboratory   Pasadena, CA
Office: 171-266B            Mailstop: 171-246
___
Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
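A shared lib-xxx plugin of the kind suggested here might be described by a pair of plugin.xml fragments like the ones below. The `requires`/`import` elements follow the `plugin.requires.import(plugin)` path mentioned elsewhere in this thread. The `runtime`/`library`/`export` elements, the attribute values and the choice of log4j-1.2.11.jar reflect my understanding of the plugin descriptor format and should be read as assumptions.

```xml
<!-- Hypothetical src/plugin/lib-log4j/plugin.xml: a plugin that only ships a jar -->
<plugin id="lib-log4j" name="Log4j" version="1.0.0" provider-name="org.apache.log4j">
  <runtime>
    <library name="log4j-1.2.11.jar">
      <!-- Make every class in the jar visible to plugins that import this one -->
      <export name="*"/>
    </library>
  </runtime>
</plugin>
```

```xml
<!-- Hypothetical fragment of a consuming plugin's plugin.xml (e.g. parse-pdf) -->
<requires>
  <import plugin="nutch-extensionpoints"/>
  <import plugin="lib-log4j"/>
</requires>
```

With such a layout, parse-rss, parse-pdf and clustering-carrot2 would each drop their private log4j copy and import the shared plugin instead.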
duplicate libs
There are a number of duplicated libs in the plugins, namely:

commons-httpclient-3.0-beta1.jar   src/plugin/parse-rss/lib
commons-httpclient-3.0.jar         src/plugin/protocol-httpclient/lib

log4j-1.2.11.jar     src/plugin/clustering-carrot2/lib
log4j-1.2.6.jar      src/plugin/parse-rss/lib
log4j-1.2.9.jar      src/plugin/parse-pdf/lib

nekohtml-0.9.2.jar   src/plugin/clustering-carrot2/lib
nekohtml-0.9.4.jar   src/plugin/parse-html/lib

xerces-2_6_2.jar   lib
xercesImpl.jar     src/plugin/parse-rss/lib

Are there any known reasons to keep multiple versions of things, or should we move these each into their own plugin that can be shared?

Doug