According to Ted Stresen-Reuter:
> 1.
> On our intranet we have some PDF files that were made in Adobe Acrobat. The
> files contain hyperlinks to other files. My guess is that the pdf2html (or
> is it pdf2text) converter doesn't know how to follow links. Does anyone know
> of a product that does, or am I relegated to listing each PDF individually if
> I want it to be indexed?

The usual external parser scripts make use of pdftotext, which comes with
the xpdf package.  It only extracts plain text from the PDF documents.
I've been meaning to try pdftohtml (http://pdftohtml.sourceforge.net/)
but haven't yet had the chance.  I don't know if it will extract hypertext
links from the PDFs, but it's worth a try.  If you do try it, please let
us know how it goes.  You may want to retrofit this tool into doc2html.pl,
so you get all the wrapper script handling of arguments and such.

If pdftohtml doesn't do it, I've found the following trick seems to
find links in PDFs, but without the description text, so you could try
working this into an external converter script:

  strings file.pdf | sed -n 's|^/URI (\(.*\)).*|<link href="\1">|p'
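
As a rough sketch of how that trick might be wrapped for reuse, here is a
variation that uses grep -a -o instead of strings, so it also catches /URI
entries that don't start a line. The function name pdf_uri_links is made up
for illustration, and emitting <a> elements rather than <link> is my
assumption about what the HTML parser will follow. It only sees links that
are stored uncompressed in the PDF, and a URL containing a ")" would be cut
short.

```shell
#!/bin/sh
# Hypothetical helper (not part of ht://Dig or xpdf): print one HTML anchor
# for each "/URI (...)" entry found in the PDF file given as $1.
pdf_uri_links() {
    # -a makes grep treat the binary PDF as text; -o prints only the match.
    grep -a -o '/URI *([^)]*)' "$1" |
        # Keep the text between the parentheses and wrap it in an anchor.
        sed 's|^/URI *(\([^)]*\))$|<a href="\1">\1</a>|'
}
```

You could call it as "pdf_uri_links file.pdf" from an external converter
script and append its output to whatever pdftotext produces.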

> 2.
> Our intranet is sprinkled with links back to the firm directory. For
> example, on each department's home page is a list of the staff that works in
> that department and a link back to each person's profile in the firm
> directory. Likewise, when viewing an individual's profile in the firm
> directory, you see a list of other members of the same department with links
> to their individual profiles as well. When I conduct a search on
> 'technology', I expect to see the Information Technology Home Page listed
> first (it is the title of the page, has Information Technology in the
> description and keywords and has an h1 tag at what is essentially the start
> of the page), and yet it appears at the end of the list with only one star.
> Each individual, however, is listed at the start of the list and with 5
> stars. Is this because there are far more pages that point to each
> individual's profile than there are that point to the Information Technology
> Home Page and if so, what do the developers of htdig recommend changing so
> that the home page comes up first?

Try lowering the value of the backlink_factor attribute (see
http://www.htdig.org/attrs.html#backlink_factor) to see if that helps
(no need to reindex).  Also, if the word "technology" appears in the
link description text for the links to any of the individuals' pages,
that will greatly boost their score if description_factor is still at
the default value.  If that's the case, you can lower this factor too
(you'll have to reindex if using the 3.1.x series), or change the
descriptions and reindex.
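
For example, the relevant lines in your htdig.conf might look something
like the fragment below. The values are purely illustrative and not
recommendations. Setting a factor to 0 is just a quick way to check
whether it is what pushes the profiles to the top; once you know, pick an
intermediate value that suits your site.

```
# htdig.conf fragment (illustrative values only)

# Weight given to the number of pages linking to a document.
# Per the advice above, changing this should not require reindexing.
backlink_factor: 0

# Weight given to words in the link text pointing at a document.
# In the 3.1.x series you need to reindex after changing this.
description_factor: 0
```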

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
