Re: AW: AW: AW: AW: [aseek-users] external converters, pdf files

Kir Kolyshkin Thu, 12 Sep 2002 03:05:48 -0700

The attached patch should fix it. Can you try it out?

[EMAIL PROTECTED] wrote:
> 
> mh,
> with adding charset there is no exec-line in the index-log, so no
> pdf-conversion at all.
> tried to
> 
>         Converter application/pdf text/html; charset=iso8859-1 /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> and
>         Converter application/pdf "text/html; charset=iso8859-1" /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> 
> without charset option pdf's are indexed...
> 
> regards
> 
> Markus Rietzler
> * <rietzler_software/>
> * RZF NRW
> * Tel: 0211.4572-130
> 
> -----Original Message-----
> From: Kir Kolyshkin [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, 12 September 2002 11:26
> To: [EMAIL PROTECTED]
> Subject: Re: AW: AW: AW: [aseek-users] external converters, pdf files
> 
> Try to add ";charset=iso8859-1" (substitute with charset you need)
> after "text/html" in Converter line.
> 
> Sorry, seems it is missing from man page; I will add it.
> 
> So, it will look like:
> 
> Converter application/pdf text/html; charset=iso8859-5 /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> Please report here if it works or not :)
> 
> [EMAIL PROTECTED] wrote:
> >
> > that trick worked (half).
> > the pdf-file is being indexed but i can't search for words with eg. umlauts.
> > in the excerpt i see "?" on the places where umlauts (ä, ö, ü) should be.
> > so the charset of the document is wrong. any ideas
> >
> > regards
> >
> > Markus Rietzler
> > * <rietzler_software/>
> > * RZF NRW
> > * Tel: 0211.4572-130
> >
> > -----Original Message-----
> > From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, 12 September 2002 02:48
> > To: [EMAIL PROTECTED]
> > Subject: Re: AW: AW: [aseek-users] external converters, pdf files
> >
> > While trying out pdftohtml for myself (yeah, it's nicer than plain text
> > and provides a title too), I figured it out.
> > pdftohtml will add an ".html" extension to the output file; and hence
> > index won't use it nor delete it afterwards. The solution I use is simply:
> >
> > Converter application/pdf            text/html       /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> >
> > And so far it seems to be working...
> >
> > Cheers,
> >
> > [EMAIL PROTECTED] wrote:
> >
> > >in aspseek.conf i have
> > >
> > >       Converter application/pdf text/html /users/aspseek/sbin/pdftohtml -i -noframes $in $out.html
> > >
> > >and this is what index says, looks good for me
> > >
> > >www@I011-32:/users/aspseek/sbin> ./index -a -m -u http://..../test.pdf
> > >Loading configuration from /users/aspseek/etc/db.conf
> > >Loading configuration from /users/aspseek/etc/ucharset.conf
> > >Loading configuration from /users/aspseek/etc/stopwords.conf
> > >Loading configuration from /users/aspseek/etc/server.url
> > >Loading configuration from /users/aspseek/etc/allow.url
> > >Loading configuration from /users/aspseek/etc/aspseek.conf
> > >Adding URL: http://..../test.pdf
> > >exec /users/aspseek/sbin/pdftohtml -i -noframes /tmp/asiLa7HF2 /tmp/asoOpsGUU.html
> > >Page-1
> > >Page-2
> > >Page-3
> > >Page-4
> > >Page-5
> > >Page-6
> > >Page-7
> > >Page-8
> > >Saving real-time database ... done.
> > >Saving delta files [..................................................]
> > >done.
> > >Deleting 'deleted' records from urlword[s] ... done. (0 records deleted)
> > >Saving real-time ... done
> > >Saving redirects ... done
> > >Splitting href delta file ... done
> > >Saving href delta files ... done
> > >Saving direct href delta files ... done
> > >Calculating ranks  [................................................] done.
> > >Saving lastmods ... done
> > >Generating word site ... done
> > >Generating subset http://..../% ... done (193 URLs)
> > >index process finished.
> > >
> > >btw: those two tempfiles are not deleted in /tmp, maybe another bug
> > >/tmp/asoOpsGUU.html is a html-export of the pdf file. so text is recognized
> > >and exported correctly, but when i search for one of the words from this file
> > >i get no results...
> > >
> > >urlword-table says
> > >
> > >*************************** 1. row ***************************
> > >         url_id: 100
> > >        site_id: 1
> > >        deleted: 0
> > >            url: http://.../versorgungsreform.pdf
> > >next_index_time: 1031828953
> > >         status: 200
> > >            crc: d41d8cd98f00b204e9800998ecf8427e
> > >  last_modified: Wed, 11 Sep 2002 02:00:11 GMT
> > >           etag: "1cc11a-cb26-3d7ea3ab"
> > >last_index_time: 1031742553
> > >       referrer: 23
> > >            tag: 0
> > >           hops: 3
> > >          redir: 0
> > >         origin: 0
> > >1 row in set (0.00 sec)
> > >
> > >regards
> > >
> > >Markus Rietzler
> > >* <rietzler_software/>
> > >* RZF NRW
> > >* Tel: 0211.4572-130
> > >
> > >
> > >
> > >-----Original Message-----
> > >From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
> > >Sent: Wednesday, 11 September 2002 10:48
> > >To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> > >Subject: Re: AW: [aseek-users] external converters, pdf files
> > >
> > >Hi,
> > >What does your 'converter' line in aspseek.conf look like?
> > >Also try running index -a -m -u "%.pdf" and see what the output is
> > >(perhaps an error message is displayed).
> > >
> > >Cheers,
> > >
> > >[EMAIL PROTECTED] wrote:
> > >
> > >
> > >
> > >>nono,
> > >>these are "plain" pdf files, mostly converted from winword. so there is a
> > >>lot of text. when i use pdf2text or pdftohtml and look in the result, i get
> > >>all the words/text from the pdf file. so something different happens here...
> > >
> > >
> > >>regards
> > >>
> > >>Markus Rietzler
> > >>* <rietzler_software/>
> > >>* RZF NRW
> > >>* Tel: 0211.4572-130
> > >>
> > >>
> > >>
> > >>-----Original Message-----
> > >>From: Gregory Kozlovsky [mailto:[EMAIL PROTECTED]]
> > >>Sent: Wednesday, 11 September 2002 10:07
> > >>To: '[EMAIL PROTECTED]'
> > >>Subject: RE: [aseek-users] external converters, pdf files
> > >>
> > >>Sometimes, what appears to be text in .pdf files is actually scanned images
> > >>that cannot be indexed. Check for it.
> > >>
> > >>   Gregory Kozlovsky
> > >>
> > >>-----Original Message-----
> > >>From: [EMAIL PROTECTED]
> > >>[mailto:[EMAIL PROTECTED]]
> > >>Sent: Wednesday, 11 September 2002 09:59
> > >>To: [EMAIL PROTECTED]
> > >>Subject: [aseek-users] external converters, pdf files
> > >>
> > >>
> > >>hi,
> > >>i am trying to set up aspseek with external converter support. i installed
> > >>pdftohtml, indexing works fine, pdf files seem to be processed, i can find
> > >>the urls to the pdf files in urlword table even with status code 200. but
> > >>when i do a search with words from the pdf-files i get no result, pdf files
> > >>were not listed in the results...
> > >>
> > >>any idea?
> > >>
> > >>thanxs
> > >>
> > >>regards
> > >>
> > >>Markus Rietzler
> > >>* <rietzler_software/>
> > >>* RZF NRW
> > >>* Tel: 0211.4572-130
> > >>
> > >>
> > >>
> > >>-----Original Message-----
> > >>From: Charlie Farinella [mailto:[EMAIL PROTECTED]]
> > >>Sent: Tuesday, 10 September 2002 23:35
> > >>To: [EMAIL PROTECTED]
> > >>Subject: [aseek-users] selective removal of urls
> > >>
> > >>Is there a way to selectively remove a url from our database after it
> > >>has been indexed?  We would like to remove porn sites from a family
> > >>friendly database.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >
> > >
> > >
> > >
> 
> -- [EMAIL PROTECTED]  ICQ7551596  [EMAIL PROTECTED] --
>    Guinness a Day Keeps a Doctor Away (people's wisdom)


-- [EMAIL PROTECTED]  ICQ7551596  [EMAIL PROTECTED] --
   Guinness a Day Keeps a Doctor Away (people's wisdom)

Index: src/config.cpp
===================================================================
RCS file: /home/cvs/aspseek/src/config.cpp,v
retrieving revision 1.51
diff -u -r1.51 config.cpp
--- src/config.cpp      19 Jul 2002 15:02:10 -0000      1.51
+++ src/config.cpp      12 Sep 2002 10:28:47 -0000
@@ -574,7 +574,15 @@
                                if ((charset = strstr(cmd, "charset=")) != NULL)
                                {
                                        charset += 8;
-                                       cmd = GetToken(NULL, "\n\r", &lt);
+                                       cmd = strchr(charset + 1, ' ');
+                                       if (cmd != NULL)
+                                       {
+                                               *cmd = '\0'; cmd++;
+                                               while ((*cmd == ' ') ||
+                                                       (*cmd == '\t')) cmd++;
+                                       }
+                                       else
+                                               cmd = NULL;
                                }
                                else
                                {
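
For anyone reading the hunk without the rest of config.cpp at hand, here is a
standalone sketch of the same idea: take the charset name up to the first
blank, then treat whatever follows as the converter command. It is not the
aspseek code. ConverterTail and SplitCharsetAndCommand are names made up for
this illustration, and it assumes the input is the text that follows the
target MIME type on a Converter line.

```cpp
#include <cstring>
#include <string>

// Hypothetical names for illustration only; nothing here is aspseek API.
struct ConverterTail {
    std::string charset;  // empty when no "charset=" was present
    std::string command;  // empty when nothing follows the charset
};

// 'rest' is assumed to be the text after the target MIME type on a
// Converter line, e.g. "charset=iso8859-1 /usr/bin/pdftohtml $in >$out".
inline ConverterTail SplitCharsetAndCommand(const std::string& rest) {
    ConverterTail out;
    const char* p = rest.c_str();
    const char* cs = std::strstr(p, "charset=");
    if (cs != nullptr) {
        cs += std::strlen("charset=");
        // The charset name runs up to the first blank or end of string.
        const char* end = cs;
        while (*end != '\0' && *end != ' ' && *end != '\t') ++end;
        out.charset.assign(cs, end);
        p = end;
    }
    // Skip blanks before the command; an empty result means "no command".
    while (*p == ' ' || *p == '\t') ++p;
    out.command = p;
    return out;
}
```

Calling it with "charset=iso8859-1 /usr/bin/pdftohtml -i -noframes -stdout $in >$out"
should give charset "iso8859-1" and the pdftohtml command line as the command.
A line that ends right after the charset yields an empty command, which the
caller can treat as a configuration error.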
