http://www.linux-magazin.de/Artikel/ausgabe/2004/04/swish/swish.html
I don't know German, but with Google Translate you can make out some of what it says about configuring swish-e to index OpenOffice files:
http://www.google.com/translate?u=http%3A%2F%2Fwww.linux-magazin.de%2FArtikel%2Fausgabe%2F2004%2F04%2Fswish%2Fswish.html&langpair=de%7Cen&hl=en&ie=UTF8
==============================
Scanning OpenOffice documents
OpenOffice stores its files as Zip archives, in which the actual contents are always held in the XML file "content.xml". Scanning these documents takes somewhat more effort. First, the filter has to apply to every kind of OpenOffice file, and therefore to several different suffixes.
The "IndexContents" directive in listing 5 (line 3) assigns text documents, spreadsheets and presentations to the XML parser. The "FileFilterMatch" instruction is somewhat trickier. It selects the file types with the regular expression "/\.(sxw|sxc|sxi)$/i" and assigns them the unzip program, including the call parameters "-p \"%p\" content.xml". unzip thus extracts the file "content.xml" and writes it to standard output.
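What the filter step does can be sketched in Python, as a rough illustration only. The function names and the sample archive below are made up for this example and are not part of swish-e; the sketch just mirrors the regular expression from the directive and the effect of "unzip -p <file> content.xml".

```python
import io
import re
import zipfile

# Same pattern as in the FileFilterMatch directive: .sxw, .sxc or .sxi, case-insensitive.
OOO_SUFFIX = re.compile(r"\.(sxw|sxc|sxi)$", re.IGNORECASE)


def is_openoffice_file(path):
    """Return True if the file name would be selected by the filter regex."""
    return OOO_SUFFIX.search(path) is not None


def read_content_xml(source):
    """Return the raw bytes of content.xml, like `unzip -p source content.xml`.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.read("content.xml")


def make_sample_document():
    """Build a tiny in-memory archive that stands in for a real .sxw file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", "<doc><p>Hello</p></doc>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(is_openoffice_file("report.SXW"))
    print(read_content_xml(make_sample_document()).decode("utf-8"))
```

In the real setup swish-e runs unzip itself and reads the XML from its standard output; the Python version is only meant to show which files are picked up and what gets handed to the XML parser.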
Listing 5: Filter for OpenOffice
01 # Open Office
02 FileFilterMatch "/usr/bin/unzip" "-p \"%p\" content.xml" /\.(sxw|sxc|sxi)$/i
03 IndexContents XML* .sxw .sxc .sxi
04 StoreDescription XML* <text:p>
The "StoreDescription" line is a peculiarity here. This directive is actually meant for adding short description texts to the index, which Swish-e displays in an extended search. Among other things, it names the tag that contains the description. The length of the description can optionally be limited as well. In principle this has nothing to do with the normal indexing of an XML document. Practice shows, however, that Swish-e indexes OpenOffice documents correctly only if this option is set. Otherwise the parser often stops too early and leaves a large part of the text unindexed.
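To see what the <text:p> tag refers to, here is a small Python sketch that pulls the paragraph text out of a content.xml-like document. It does not reproduce how Swish-e parses the file. The sample XML, its placeholder namespace URIs ("urn:example:...") and the function names are assumptions made for this example, and matching on the local name "p" alone is a simplification.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for content.xml; the namespace URIs are placeholders.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:office="urn:example:office" xmlns:text="urn:example:text">
  <office:body>
    <text:p>First paragraph.</text:p>
    <text:p>Second <text:span>paragraph</text:span>.</text:p>
  </office:body>
</office:document-content>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def paragraph_texts(xml_bytes):
    """Return the text of every element whose local name is 'p', in document order."""
    root = ET.fromstring(xml_bytes)
    return [
        "".join(element.itertext())
        for element in root.iter()
        if isinstance(element.tag, str) and local_name(element.tag) == "p"
    ]


if __name__ == "__main__":
    for text in paragraph_texts(SAMPLE):
        print(text)
```

Text nested inside child elements of a paragraph is joined into that paragraph, so this gives a quick way to check how much body text a given document actually carries in its paragraph elements.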
==============================
The advantage of swish-e is that you can use it to index everything on your network: old MS Office files, PDFs, emails, HTML files, images, etc. (read the article).
--------------------------------------------------------------------------- This list is sponsored by Conectiva S.A. Visit http://www.conectiva.com.br
Archive: http://bazar2.conectiva.com.br/mailman/listinfo/linux-br List usage rules: http://linux-br.conectiva.com.br FAQ: http://www.zago.eti.br/menu.html