Author: elsner
Email: [EMAIL PROTECTED]
Message:
I could not get my pdftotext parser going ...

$ /fs1/sw/xpdf/pdftotext 
  # works from the command line, executable by world.

#My indexer.conf:
---

DBAddr          mysql://xyz@localhost/mnogosearch/

#DBMode single
#VarDir /usr/local/mnogosearch/var

LocalCharset Phrase yes
#CrossWords no

#StopwordFile stopwords.txt
StopwordTable stopword

MinWordLength 1
MaxWordLength 32

MaxDocSize 10000000

HTTPHeader User-Agent: MnoGoSearch_RZ_UOS
#HTTPHeader Accept-Language: de, en
#HTTPHeader From: [EMAIL PROTECTED]

# ServerTable server

DeleteNoServer yes

# Exclude cgi-bin and non-parsed-headers using "string" match:
Disallow /cgi-bin/* */nph-*

# Exclude anything with '?' sign in URL. Note that '?' sign has a 
# special meaning in "string" match, so we have to use "regex" match here:
Disallow Regex  \?

# Exclude Apache directory list in different sort order using "string" match:
Disallow *D=A *D=D *M=A *M=D *N=A *N=D *S=A *S=D

CheckOnly *.pl   *.cgi 
CheckOnly *.b    *.sh   *.md5
CheckOnly *.arj  *.tar  *.zip  *.tgz  *.gz
CheckOnly *.lha  *.lzh  *.rar  *.zoo  *.tar*.Z
CheckOnly *.gif  *.jpg  *.jpeg *.bmp  *.tiff 
CheckOnly *.vdo  *.mpeg *.mpe  *.mpg  *.avi  *.movie
CheckOnly *.mid  *.mp3  *.rm   *.ram  *.wav  *.aiff
CheckOnly *.vrml *.wrl  *.png
CheckOnly *.exe  *.cab  *.dll  *.bin  *.class
CheckOnly *.tex  *.texi *.xls  *.doc  *.texinfo
CheckOnly *.ai   *.eps  *.ppt  *.hqx
CheckOnly *.cpt  *.bms  *.oda  *.tcl
CheckOnly *.rpm  *.m3u  *.qt   *.mov
CheckOnly *.map  *.aif  *.sit  *.sea
# CheckOnly *.rtf  *.pdf  *.cdf  *.ps

UseRemoteContentType no

AddType text/plain      *.txt
AddType text/plain  *.js *.java
AddType text/plain  *.h *.c *.cpp

AddType text/html       *.html *.htm
AddType text/html   *.cfm *.cfml

AddType image/x-xpixmap *.xpm
AddType image/x-xbitmap *.xbm
AddType image/gif       *.gif

AddType application/pdf *.pdf
AddType application/unknown *.*

Mime "application/pdf; charset=iso-8859-1"  "text/plain"                  
"/fs1/sw/xpdf/pdftotext $1 $2"

Period 7d

Robots yes

Clones yes

BodyWeight 2
#CrossWeight 32
TitleWeight 4
KeywordWeight 8
DescWeight 16
#UrlWeight 0
#UrlHostWeight 0
#UrlPathWeight 0
#UrlFileWeight 0

DeleteBad yes
Index yes

Follow site

CharSet iso-8859-1

# Server        http://localhost/
Server  http://rz-intern.rz.uni-osnabrueck.de/

---

When I run indexer with -v 5, the PDF file is indexed
like all the other HTML/text files, with no special treatment ...

Anything wrong/missing?

Frank


Reply: <http://www.mnogosearch.org/board/message.php?id=3021>
