I am new to PyLucene but have worked with Java Lucene 1.4.3. How can I index the following types of files using PyLucene: Word, PDF, Excel, XSL, XML, and OpenOffice files? Is there any support for third-party libraries in PyLucene as well?
(For Java Lucene, third-party libraries are available.)
As part of porting the "Lucene in Action" samples and test cases, I got support working for some non-plain-text formats with PyLucene:
- html, via the HTMLParser module in Python
- xml, via the xml.sax parser module in Python
- pdf, via the pdftotext and pdfinfo programs available from the xpdf
  package at http://www.foolabs.com/xpdf
- msword, via the antiword program available from the antiword package at
  http://www.winfield.demon.nl

For examples on how to use these with PyLucene, please refer to the
samples/LuceneInAction/FileIndexer.py sample and the
samples/LuceneInAction/lia/handlingtypes code tree.
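
To give a rough idea of the text-extraction step only, here is a minimal sketch using the Python standard library. It is not the code from the samples: all function and class names below are made up for illustration, it uses the Python 3 module name html.parser (the HTMLParser module in Python 2), and it assumes pdftotext and antiword are installed and on the PATH. Feeding the extracted text into a PyLucene document is left out on purpose, since that part depends on the PyLucene version in use.

```python
import subprocess
import xml.sax
from html.parser import HTMLParser  # named "HTMLParser" in Python 2


class _HTMLText(HTMLParser):
    """Hypothetical helper: collects character data found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(markup):
    """Return the text content of an HTML string, whitespace-normalized.

    Note: script and style contents are not filtered out in this sketch.
    """
    parser = _HTMLText()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


class _XMLText(xml.sax.ContentHandler):
    """Hypothetical helper: collects character data from SAX events."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def startElement(self, name, attrs):
        self.chunks.append(" ")  # keep text of adjacent elements apart

    def endElement(self, name):
        self.chunks.append(" ")

    def characters(self, content):
        self.chunks.append(content)


def xml_to_text(markup):
    """Return the text content of an XML string, whitespace-normalized."""
    handler = _XMLText()
    xml.sax.parseString(markup.encode("utf-8"), handler)
    return " ".join("".join(handler.chunks).split())


def pdf_command(path):
    """Command line for pdftotext; the trailing "-" sends text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", path, "-"]


def msword_command(path):
    """Command line for antiword, which prints the text to stdout."""
    return ["antiword", path]


def run_extractor(command):
    """Run an external extractor and return its output as text.

    Assumption: the output is decoded as UTF-8; antiword's actual
    output encoding depends on its configuration and the locale.
    """
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

With the programs installed, something like run_extractor(pdf_command("report.pdf")) would return the text to index, where report.pdf is a placeholder file name. Shelling out to pdftotext and antiword keeps the Python side small, at the cost of starting one process per file.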
Andi..
_______________________________________________
pylucene-dev mailing list
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
