Re: [Ferret-talk] Parsers for input to index?

Stuart Sierra Wed, 25 Apr 2007 11:24:09 -0700

Hello Dick, and all (first post),

Here are some more that I use:


HTML to text: Vilistextum
http://bhaak.dyndns.org/vilistextum/
also lynx:
http://lynx.browser.org/

PDF to text: pdftotext, from Xpdf
http://www.foolabs.com/xpdf/

WordPerfect to text: wpd2text, from libwpd
http://libwpd.sourceforge.net/

Converting other text encodings: iconv
http://www.gnu.org/software/libiconv/
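
To feed output from the tools above into an indexer, one option is to pick a converter by file extension and capture what it writes to stdout. The following Ruby is only a minimal sketch under assumptions: the CONVERTERS table and the names converter_command, to_utf8_command and extract_text are hypothetical helpers invented for illustration, the tools are assumed to be installed and on the PATH, and Vilistextum is left out because it takes an explicit output file argument.

```ruby
require 'open3'

# Hypothetical mapping from file extension to an external converter.
# Each lambda returns an argv array, so no shell quoting is involved.
CONVERTERS = {
  '.pdf'  => ->(path) { ['pdftotext', path, '-'] },           # "-" sends the text to stdout
  '.html' => ->(path) { ['lynx', '-dump', '-nolist', path] }, # rendered page, no link list
  '.htm'  => ->(path) { ['lynx', '-dump', '-nolist', path] },
  '.wpd'  => ->(path) { ['wpd2text', path] }                  # writes the text to stdout
}.freeze

# Returns the argv array for the file's converter, or nil if the
# extension is not in the table.
def converter_command(path)
  builder = CONVERTERS[File.extname(path).downcase]
  builder && builder.call(path)
end

# Builds an iconv command that re-encodes a text file to UTF-8.
# The source encoding has to be supplied by the caller.
def to_utf8_command(path, from_encoding)
  ['iconv', '-f', from_encoding, '-t', 'UTF-8', path]
end

# Runs the converter and returns its stdout as a String.
# Raises ArgumentError for unknown extensions and RuntimeError when the
# tool exits with a non-zero status.
def extract_text(path)
  cmd = converter_command(path)
  raise ArgumentError, "no converter for #{path}" unless cmd

  out, status = Open3.capture2(*cmd)
  raise "#{cmd.first} failed on #{path}" unless status.success?

  out
end
```

The string returned by extract_text could then be stored in whatever content field the index uses. Formats not covered here (RTF, Word, Excel) would need their own entries in the table.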

-Stuart Sierra


John Leach wrote:
> You may need to turn to some external tools.
> 
> Something similar was discussed before, and some tools were suggested.
> 
> See: http://www.ruby-forum.com/topic/103374
> 
> On Wed, 2007-04-25 at 19:14 +0200, Dick Monahan wrote:
>> The documents we want to index come in many formats;  e.g., HTML, PDF,
>> RTF, Word, Excel, etc., etc., etc.  I've been searching to find parsers
>> that will translate each of these formats to indexable text, but have
>> had little success.  Any help will be appreciated.

_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
