On a related note, I've also released a project that I developed for my book and for presentations that I have been giving on Ant, XDoclet, and JUnit. This project is a documentation search engine with a web (Struts) interface. It uses Lucene and the Ant task I mentioned already to index a directory full of HTML and text files. The sample data provided is Ant's documentation.

It's available as version 0.3 (currently, but always grab the latest that's there) at http://www.ehatchersolutions.com/downloads/

I have not documented it well yet, but I plan to do so over the next couple of weeks.

To get it running you need:

- Ant 1.5.1 (1.5 is not sufficient)
- JUnit 3.8 or up (3.8.1 is the latest)
- j2ee.jar - I don't provide this in the download for size (and legal?) reasons.

Build it this way:

ant -Dj2ee.jar=/path/to/my/j2ee.jar

Or, if you run it without the -D switch, it will tell you where to place j2ee.jar by default. If you have J2EE_HOME set, it will pick that up automatically and use it appropriately.

Deploy the WAR in a web container, or the EAR in JBoss. Navigate to:

http://localhost:8080/ant-sample/

and search for your favorite Ant tasks or Ant-related information.

Let me know if you experience any issues with it, or have comments.

Erik

Erik Hatcher wrote:
Look in the Lucene sandbox in CVS. I contributed an Ant task that indexes HTML documents. It uses JTidy under the covers to parse HTML into title and body content, and it could be extended to pull other information such as <meta> keywords.
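
For illustration only, here is a minimal sketch of the title/body split described above. It is not the code of the sandbox Ant task: it swaps JTidy for the Swing HTML parser that ships with the JDK (the one mentioned in the quoted message below), so that it runs without extra jars. The class and field names (TitleBodyExtractor, Parsed) are hypothetical, and the two resulting strings are what one would presumably hand to Lucene as separate fields.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: splits an HTML page into title text and body text.
class TitleBodyExtractor {

    // Simple holder for the two pieces of text (names are assumptions).
    static final class Parsed {
        final String title;
        final String body;

        Parsed(String title, String body) {
            this.title = title;
            this.body = body;
        }
    }

    static Parsed parse(String html) {
        final StringBuilder title = new StringBuilder();
        final StringBuilder body = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inTitle;
            private boolean inStyle;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TITLE) inTitle = true;
                if (tag == HTML.Tag.STYLE) inStyle = true;
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TITLE) inTitle = false;
                if (tag == HTML.Tag.STYLE) inStyle = false;
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (inStyle) return; // stylesheet text is not content
                StringBuilder target = inTitle ? title : body;
                target.append(data).append(' ');
            }
        };

        try {
            // 'true' tells the parser to ignore charset declarations in <meta> tags.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Parsed(normalize(title), normalize(body));
    }

    // Collapse runs of whitespace so the text is ready for tokenizing.
    private static String normalize(StringBuilder text) {
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        Parsed p = parse("<html><head><title>Java Task</title></head>"
                + "<body><p>Executes a Java class.</p></body></html>");
        System.out.println("title: " + p.title);
        System.out.println("body:  " + p.body);
    }
}
```

Pulling <meta> keywords would presumably mean overriding handleSimpleTag as well and reading the tag's attributes; that part is left out here.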

Erik


Leo Galambos wrote:

So, I have tried this with Lucene:
1) original JavaCC LL(k) HTML parser
2) SWING's HTML parser

In the case of (1), I could process about 300K of HTML documents. In the case of (2), more than 400K.

But I cannot process the complete collection (5M) and finish my hard stress
tests of Lucene.

Is there anyone who has an HTML parser that really works with Lucene? :) If
you think that you have one, please let me know. I wanted to try Neko, but it looks complicated and I do not want to affect the results by using a "robust" parser.

THX

-g-


--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>



