According to Stefan Nehlsen:
> On Thu, Sep 06, 2001 at 04:06:24PM -0500, Gilles Detillieux wrote:
> > According to Stefan Nehlsen:
> > 
> > However, since 3.1.4 was released, the use of external parsers isn't
> > usually recommended, as external converters do a better job.
> 
> When I started to play with htdig (a year ago?), parse_doc.pl was the
> one to use. I stuck with it for a long time and started to fix
> bugs (German umlauts) to make it work the way it should.

Well, 3.1.4 was released in December 1999, but I've gotten more emphatic
in the past year about recommending external converters over external
parsers.  This was because of a surprising number of "bug reports" and
requests for help that directly resulted from limitations in parse_doc.pl.

> We have quite a lot of large pdf-files here and I wanted to put
> links to pseudo-anchors (#page=<n> style) into the excerpt. Changing
> parse_doc.pl didn't seem to be fun, so I started to rewrite it.
> 
> Another problem is that most of our pdf-files don't contain nice
> title information. I use the perl script to generate titles from
> URLs that are structured in some parts of our content, and to merge
> externally stored titles for another part.

Yes, these are all good additions, but they could go into an external
converter just about as easily as into an external parser.  The #page=n
anchors could be output as <a name="page=n"></a> tags, and titles,
however your script obtains them, can go between <title> and
</title> tags in the external converter's HTML output.  If anything,
the external converter scripts tend to be simpler, because the parsing
bits are left out.
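For illustration, here's a minimal sketch of the HTML such a converter could emit.  (The function name and sample inputs are invented; a real script would pull the per-page text from something like pdftotext, and get the title from the URL or an external table, as described above.)

```python
#!/usr/bin/env python3
# Hypothetical sketch of an external converter's output stage: emit
# HTML with a <title> and per-page <a name="page=N"> markers, so that
# excerpt anchors can point at url#page=N pseudo-anchors.
import html

def convert_to_html(pages, title):
    """pages: list of plain-text page strings.  Returns an HTML
    document with one named anchor per page."""
    out = ["<html><head><title>%s</title></head><body>" % html.escape(title)]
    for n, text in enumerate(pages, start=1):
        out.append('<a name="page=%d"></a>' % n)       # pseudo-anchor target
        out.append("<p>%s</p>" % html.escape(text))
    out.append("</body></html>")
    return "\n".join(out)

if __name__ == "__main__":
    print(convert_to_html(["first page text", "second page text"],
                          "Example title"))
```

The point is just that everything an external parser can record (titles, anchors, excerpt text) has a natural home in the converter's HTML output, and htdig's internal parser then handles word splitting consistently.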

> It is still htdig 3.1.5 and still alpha but you may have a look at it:
> 
>       http://www.parlanet.de/_index_parlanet.html
> 
> ( Stay on the German version (others don't really exist) and search
>   for "castor". )
> 
> The design is not really good because it is using frames, so I had
> to use a PHP wrapper and make a small change to htsearch.  (patch is
> attached -- please ignore it :-)

Actually, I think it's a good patch!  We've had requests before for some
way of putting target attributes in the anchor links, and your approach
seems like a clean one.  I'll post it, and consider it for the next release.

> Biggest problem now is performance - I've got to get bigger hardware.
> 
> Will version 3.2.x be faster?

Well, as they say, your mileage may vary.  Technically, there are a few
improvements in the new database structuring that should speed things up.
However, the word database also tends to be larger, and so there seems to
be a slowdown of some searches.  As a whole, things should be somewhat
faster as long as you don't rely on the new and as-yet unoptimized phrase
searching feature.  With 3.2 as with 3.1, though, you need good hardware
to support a large search index.

> > See
> > 
> >   ftp://ftp.htdig.org/pub/htdig/contrib/parsers/doc2html.tar.gz
> > 
> > for the latest and fanciest incarnation of these.
> 
> I found that it was doing too many things I didn't need. Why should I
> use the same script for different types when htdig is able to
> choose the right one?

It depends on what you're indexing.  Sure, for PDFs, they are usually tagged
unambiguously by the server, so htdig can pick the right converter/parser.
The trick is the .doc files, which may be WP, Word, RTF, or something else,
so having one script that looks at both the "magic number" at the start of
the document as well as the server's returned Content-Type header can be a
real benefit.
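A toy sketch of that idea (not the actual doc2html logic; the magic-number table here is abbreviated and the function name is invented):

```python
#!/usr/bin/env python3
# Illustrative only: guess a ".doc" file's real type from its leading
# bytes, falling back to the server's Content-Type header if no magic
# number matches.

MAGIC = {
    b"\xd0\xcf\x11\xe0": "msword",       # OLE2 container (MS Word etc.)
    b"\xffWPC": "wordperfect",           # WordPerfect
    b"{\\rtf": "rtf",                    # RTF files begin with "{\rtf"
    b"%PDF": "pdf",
}

def sniff(first_bytes, content_type):
    """Return a best-guess document type for the given leading bytes."""
    for magic, kind in MAGIC.items():
        if first_bytes.startswith(magic):
            return kind
    return content_type  # no magic match: trust the server's header

print(sniff(b"{\\rtf1\\ansi ...", "application/msword"))  # prints "rtf"
```

A single dispatcher script like this can then hand the file to the right converter, which is exactly what a per-extension mapping in htdig can't do when one extension hides several formats.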

> I was thinking about embedding a perl interpreter into htdig
> when I was reading it. :-)
> 
> ( Maybe I should put some comments into my program. )
> 
> >  The big problem
> > with external parsers is they don't parse words consistently in the
> > manner that the internal parsers do, and they don't respond to changes 
> > in the config file.  E.g., if you drop minimum_word_length from 3 to
> > 2, you still won't get 2-letter words from the external parser because
> > of the hardcoded 3 in there.  It also won't look at valid_punctuation,
> > extra_word_characters, or any other attribute that controls parsing.
> 
> ok -- this is true -- but this was not really my problem.
> 
> htdig is really great, it is working quite well, and every time I look
> at it I find new features to try out.

Glad to hear it.

Here's a repost of your patch, for the benefit of the mailing list.
Apply in 3.1.5's main source directory using "patch -p0 < this-message".

--- htcommon/defaults.cc.org    Wed Aug 29 11:05:37 2001
+++ htcommon/defaults.cc        Wed Aug 29 11:07:12 2001
@@ -31,6 +31,7 @@
     {"allow_in_form",                  ""},
     {"allow_numbers",                  "false"},
     {"allow_virtual_hosts",            "true"},
+    {"anchor_target",                  ""},
     {"authorization",                  ""}, 
     {"backlink_factor",                 "1000"},
    {"bad_extensions",                 ".wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi"},
--- htsearch/Display.cc-ORG     Tue Aug 21 14:40:57 2001
+++ htsearch/Display.cc Wed Aug 29 11:29:45 2001
@@ -1199,6 +1199,7 @@
 {
     static char                *start_highlight = config["start_highlight"];
     static char                *end_highlight = config["end_highlight"];
+    static char                *anchor_target = config["anchor_target"];
     static String      result;
     int                        pos;
     int                        which, length;
@@ -1211,8 +1212,12 @@
        result.append(str, pos);
        ww = (WeightWord *) (*searchWords)[which];
        result << start_highlight;
-       if (first && fanchor)
-           result << "<a href=\"" << urlanchor << "\">";
+       if (first && fanchor) {
+           result << "<a ";
+           if ( *anchor_target ) 
+               result << "target=\"" << anchor_target << "\" ";
+           result << "href=\"" << urlanchor << "\">";
+       }
        result.append(str + pos, length);
        if (first && fanchor)
            result << "</a>";
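
Once the patch is applied, using it should just be a matter of a config line like the following ("content" here is a made-up frame name; use whatever your frameset calls the frame the results should load into):

```
# Hypothetical example: make excerpt anchor links open in the
# frame named "content" instead of the search-results frame.
anchor_target: content
```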


-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html