Hi,

On 5/3/07, Rida Benjelloun <[EMAIL PROTECTED]> wrote:
> Lius is currently under the Apache License. If people are interested in it, we can use it as a starting point for the development of Tika.
I think that would be great. We discussed at ApacheCon that selecting a single existing codebase as the starting point would be the quickest way to bootstrap our efforts, and Lius and the Nutch parsers are probably the best candidates for this. The only downside is that it might cause trouble later on when we want to refactor things to be more general.

For Lius the main problem is tight integration with Lucene. For example, the lius.index.Indexer class imports the following:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

Optimally the Tika toolkit should have no compile-time dependencies on Lucene. Do you think it would be feasible to refactor the Lius classes to avoid the Lucene dependencies?
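To make the question concrete, here is a minimal sketch of what a Lucene-free extraction contract could look like. All the names in it (ExtractionSketch, ParsedContent, ContentParser, PlainTextParser) are hypothetical and made up for illustration; none of them come from Lius or from any existing Tika code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractionSketch {

    /** Hypothetical result type: extracted text plus named metadata, no Lucene classes. */
    public static final class ParsedContent {
        private final String text;
        private final Map<String, String> metadata;

        public ParsedContent(String text, Map<String, String> metadata) {
            this.text = text;
            // Defensive copy so callers cannot modify the result afterwards.
            this.metadata = Collections.unmodifiableMap(
                    new LinkedHashMap<String, String>(metadata));
        }

        public String getText() { return text; }

        public Map<String, String> getMetadata() { return metadata; }
    }

    /** Hypothetical parser contract that depends only on the JDK. */
    public interface ContentParser {
        ParsedContent parse(InputStream in) throws IOException;
    }

    /** Trivial example parser: reads the whole stream as UTF-8 plain text. */
    public static final class PlainTextParser implements ContentParser {
        public ParsedContent parse(InputStream in) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            Map<String, String> metadata = new LinkedHashMap<String, String>();
            metadata.put("content-type", "text/plain");
            return new ParsedContent(buffer.toString("UTF-8"), metadata);
        }
    }

    public static void main(String[] args) throws IOException {
        ContentParser parser = new PlainTextParser();
        ParsedContent content = parser.parse(
                new ByteArrayInputStream("hello tika".getBytes("UTF-8")));
        System.out.println(content.getText());
        System.out.println(content.getMetadata());
    }
}
```

With something along these lines, an indexing application could map a ParsedContent into a Lucene Document in its own adapter class, so only that adapter would need Lucene on the compile-time classpath.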
> Structured text: Lius uses JDOM, XPath and namespaces for the extraction of structured content.
Could you describe this in more detail? What does the XML content model look like? I could just look at the source, but it's more productive if we discuss the design on the mailing list.
> SAX could be more powerful but does not offer XPath for the extraction of content.
It's possible to transform a SAX stream into a DOM tree for easy XPath access, so I don't think we lose any functionality by choosing SAX over a DOM model. In fact, it is even possible to evaluate XPath expressions against a live SAX stream; you just won't get full DOM nodes as the results.
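As a rough illustration of the first point, the standard JAXP classes can turn SAX input into a DOM tree with an identity transform and then evaluate an XPath expression against it. The class name SaxToDomXPath, the helper method names and the sample XML are invented for this sketch; the javax.xml calls are the stock JDK ones.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class SaxToDomXPath {

    /** Pipes SAX events from the input through an identity transform into a DOM tree. */
    public static Node toDom(InputSource input) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(new SAXSource(input), result);
        return result.getNode();
    }

    /** Builds the DOM tree from SAX input and returns the string value of the XPath match. */
    public static String evaluate(String xml, String expression) throws Exception {
        Node dom = toDom(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, dom);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><title>Hello</title><body>Some text</body></doc>";
        System.out.println(evaluate(xml, "/doc/title")); // prints "Hello"
    }
}
```

A parser that emits SAX events could feed a DOMResult in the same way whenever a caller wants XPath access, while callers that only need streaming would skip the DOM step entirely.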
> If you have questions about Lius, do not hesitate to contact me. The source code is available at: http://sourceforge.net/projects/lius/
Why do you have the class files instead of the Java source files in the Lius svn?

BR,

Jukka Zitting
