I haven't tested it yet, but have a look in the manual... ;-)

---- cut http://www.aspseek.org/man/aspseek.conf.5.html ----

External converters

index(1) can deal with document types other than text/plain and text/html. It does so with the help of external programs or scripts that convert from some format to text/plain (or text/html), so you are able to index .ps, .pdf etc.

Converter from/type to/type command line

Specifies that for converting documents with MIME-type from/type to MIME-type to/type, the command specified by command line will be used. Argument from/type can be any type returned by a Web server. Argument to/type can be either text/plain or text/html. In the command line you usually specify the program or script to run, together with its options. The program is expected to read from stdin and write the converted document to stdout.

If your program can't deal with stdin/stdout streams, you should use the $in and $out strings in the command line, and they will be substituted with two file names in the /tmp directory. index(1) will create files with unique names, write the downloaded document to the first file (referenced as $in), run the program, read the second file (referenced as $out) into memory, and then delete both files.

You can also use $url in the command line; it will be substituted with the actual URL of the downloaded document. You can use it in your own scripts to distinguish between different document variations, or to be able to write one script for many different MIME-types.

Please note that index(1) relies on the Content-Type header returned by a Web server. Some Web servers are misconfigured and give wrong info (for example, return header Content-Type: audio/x-pn-realaudio-plugin for .rpm files).

Examples:

Converter application/postscript text/plain ps2ascii
# ps2ascii can't deal with PDF files from stdin
Converter application/pdf text/plain ps2ascii $in $out

---- cut ----

Best regards,
Markus Rietzler

* <rietzler_software/> | http://www.rietzler-software.de
* Wuppertal-Navigator | http://www.wuppertal-navigator.de
* eMail: [EMAIL PROTECTED]

Neue Nordstrasse 43
42105 Wuppertal
Phone: 0700.RIETZLER (0700.7438 9537)
       0202.420830
Fax: 0202.242 24 66

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Diego Montalvo
Sent: Thursday, 21 February 2002 19:53
To: [EMAIL PROTECTED]
Subject: Re: AW: [aseek-users] ASPSeek - PDF / RTF

Kir or Markus,

What are the proper steps to get this type of functionality? I must first download the converters; then how do I configure ASPSeek for the external converters?

Diego

--- Kir Kolyshkin <[EMAIL PROTECTED]> wrote:
> ASPSeek will also present a "text version" of beer.pdf to be viewed
> (in the place where the "cached" link usually is), much as Google does,
> so you can see the result of the conversion. Excerpts are also supported.
>
> [EMAIL PROTECTED] wrote:
> >
> > no no,
> >
> > the external converter is started from aspseek during the index process when aspseek finds a pdf file.
> > so in your case:
> >
> > when aspseek indexes www.crazy.com and finds beer.pdf, it starts the converter. the converter reads the pdf document and converts it to text or html. now aspseek indexes this export.
> >
> > now your users can also search in pdf documents. so when "beer" is in beer.pdf, aspseek will list the link to beer.pdf as a result and even display a short extract. your users can now click on the link and acrobat reader opens to display the pdf file.
> >
> > so external converter means a helper program for aspseek to index pdf documents.
> >
> > Markus Rietzler
> > * kommunikation & online service
> > * RZF NRW
> > * Tel: 0211.4572-130
> >
> > -----Original Message-----
> > From: Diego Montalvo [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, 21 February 2002 16:55
> > To: [EMAIL PROTECTED]
> > Subject: Re: [aseek-users] ASPSeek - PDF / RTF
> >
> > Kir,
> >
> > I am somewhat confused, so ASPSeek will crawl and index .PDF and such files, but will not present them as .html? Therefore I need an external converter?
> >
> > Or does an external converter first convert, then I run ASPSeek?
> >
> > example: I want to index "www.crazy.com/beer.pdf". I simply use ASPSeek to retrieve words from "beer.pdf", but then I must use an external program to view it in html?
> >
> > do you have a link to such a search engine using ASPSeek with external converters?
> >
> > Diego
> >
> > --- Kir Kolyshkin <[EMAIL PROTECTED]> wrote:
> > > Diego Montalvo wrote:
> > > >
> > > > Hello,
> > > >
> > > > In the ASPSeek manual pages there is a mention that ASPSeek understands PDF and RTF formats with the help of an external program. What program is that? I would like to embed it into ASPSeek.
> > >
> > > There's no need to embed. The manual talks about External Converters, described in
> > > http://www.aspseek.org/man/aspseek.conf.5.html#lbAM
> > > So as long as you have a program that can convert, say, pdf to html, you can index pdf documents with aspseek.
> > >
> > > A good ps to text (or html) converter is here:
> > > http://www.nzdl.org/html/prescript.html
> > > There are also links to other such tools.
> > >
> > > As for converters from rtf or doc format, I know of
> > > word2x: http://word2x.alcom.co.uk/
> > > antiword: http://www.winfield.demon.nl/index.html
> > > unrtf: http://www.geocities.com/tuorfa/unrtf.html
> > > --
> > > [EMAIL PROTECTED] http://kir.vtx.ru/ ICQ 7551596
> > > Phone +7 903 6722750
> > > Hi, I'm a signature virus: copy me to your .signature to help me spread!
> > > --
> >
> > __________________________________________________
> > Do You Yahoo!?
> > Yahoo! Sports - Coverage of the 2002 Olympic Games
> > http://sports.yahoo.com
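
The $in, $out and $url substitutions from the manual excerpt can be tried with a small wrapper script. What follows is only a sketch under assumptions, not something from the ASPSeek distribution: the script name aseek-convert.py, its install path, and the choice to dispatch on the URL suffix are all made up for illustration. It also assumes that ps2ascii and antiword are installed and that antiword prints plain text to stdout when given a file name. It would be wired up with lines like these in aspseek.conf (hypothetical path):

    Converter application/pdf text/plain /usr/local/bin/aseek-convert.py $in $out $url
    Converter application/msword text/plain /usr/local/bin/aseek-convert.py $in $out $url

```python
#!/usr/bin/env python3
"""Hypothetical wrapper for ASPSeek's Converter directive (illustration only)."""
import subprocess
import sys
from urllib.parse import urlparse


def build_command(in_path, out_path, url):
    """Pick a conversion tool from the URL suffix.

    Returns (argv, writes_stdout), or None when no tool is known.
    writes_stdout is True when the tool prints the text to stdout
    instead of writing out_path itself.
    """
    path = urlparse(url).path.lower()
    if path.endswith((".ps", ".pdf")):
        # Same form as the manual's example: ps2ascii $in $out
        return ["ps2ascii", in_path, out_path], False
    if path.endswith(".doc"):
        # Assumption: antiword writes plain text to stdout.
        return ["antiword", in_path], True
    return None


def main(argv):
    """Expects argv = [script, $in, $out, $url]; returns an exit code."""
    if len(argv) != 4:
        print("usage: aseek-convert.py IN OUT URL", file=sys.stderr)
        return 2
    in_path, out_path, url = argv[1:]
    choice = build_command(in_path, out_path, url)
    if choice is None:
        print(f"no converter known for {url}", file=sys.stderr)
        return 1
    command, writes_stdout = choice
    if writes_stdout:
        # Capture the tool's stdout into the file index(1) will read back.
        with open(out_path, "wb") as out_file:
            return subprocess.run(command, stdout=out_file).returncode
    return subprocess.run(command).returncode


# Run only when called with arguments, as index(1) would call it.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Because one script serves several Converter lines, the $url argument is what tells it which tool to run. As the manual notes, the Content-Type sent by a server may be wrong, and a URL suffix is no guarantee either, so treat the dispatch as a heuristic.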
