Hello all,
I am the author of the code that came under such severe criticism by Thomas, so
I feel compelled to answer.
When I started this effort, no satisfactory CHM to PDF conversion utility was
available on Linux. There were a few, yes, of which chm2pdf had the best
concept in my opinion, but they were all very limited in what they could
accomplish. The code you criticise was created in an effort to push things to
a more usable state.
As always, one cannot satisfy all...
The standard scenario I had in mind when I started was...uhmm...a user like
myself: you could take the script, change the CHM2PDF_TEMP_WORK_DIR and
CHM2PDF_TEMP_ORIG_DIR variables at the top to something useful and acceptable,
then run it and (most of the time) be happy! You can read about the whole story
here:
http://www.karakas-online.de/forum/viewtopic.php?t=10275
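For illustration, the change at the top of the script might look like this
(the paths here are made up - pick whatever suits your system):

    # Hypothetical values - adapt the paths to your own system.
    # If /tmp is not suitable, point these somewhere under your home:
    import os
    CHM2PDF_TEMP_WORK_DIR = os.path.expanduser('~/.chm2pdf/work')
    CHM2PDF_TEMP_ORIG_DIR = os.path.expanduser('~/.chm2pdf/orig')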
It was NOT my intention to spend 3 months pondering security issues arising
from using the tmp dir or whatever else. My understanding was that if tmp is
not suitable, the user should change it to some directory in his home.
It was also NOT a priority to make things as efficient as possible. We have to
see if things work acceptably this way *first*. If they do, THEN we can
optimize. I will not start thinking about how to optimize a search-and-replace
procedure that takes a few minutes (at most) to run on a modern computer
before I have feedback from the stakeholders (the community, in this case)
that what I search for and replace are the right things in the first place!
A good example is that of anchors, which Thomas thinks should be kept because
of all the API documentation that uses them, while I, on the other hand, found
it better to ignore them, as using them would too often lead to unusable
results. Obviously we have here two sets of CHM files that behave differently
with regard to anchors: a great deal of API docs that (I am inclined to
believe) use anchors correctly, and a great deal of other CHMs that don't.
I would *never* throw away anchors without a reason. So if I did it, it was
because I had to. Now, if the community thinks we should keep them, I am open
to suggestions on what to do when we keep them and get one of those CHMs that
don't use them correctly, or cause trouble. To put it more vividly: I was *fed
up* with seeing CHMs that would NOT make it through the conversion, due to the
anchor thing!
So we have to reach an agreement on what is right first, before we continue
with how to implement the right thing efficiently.
However, since I have a personal interest in data structures and algorithms
myself (see my papers in
http://www.karakas-online.de/myWork/computer_graphics.html), let me outline the
general problem of this search-and-replace part of chm2pdf:
We are given a set of filenames. Each file contains a certain number of links
to other files in the set. We don't know in advance how many, or what the mean
value is. *Every* filename could be mentioned, even more than once (maybe two,
three, or even 10 times), inside a given file. We want to correct all those
links to equivalents that we know in advance, i.e. we know that we have to
change link A to link A', link B to link B' and so on, for known A, A', B, B'
etc.
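To make the setting concrete, here is a hypothetical sketch in Python (not the
actual data structure used in chm2pdf) of the known corrections:

    # Hypothetical sketch: the table of known corrections.
    # Keys are the links as they appear inside the files (A, B, ...),
    # values are their known equivalents (A', B', ...).
    corrections = {
        'intro.html': 'ch01s01.html',   # change link A to link A'
        'usage.html': 'ch01s02.html',   # change link B to link B'
        # ... one entry for each of the n filenames
    }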
Suppose you have n filenames (i.e. n distinct files) and each file contains,
on average, m links to other files. We have to do nm corrections - more or
less, in some statistical sense. Now, if the distinct files represent distinct
pages - which is not always the case; usually they represent distinct
subsections (in the DocBook sense) or some other *structural* entity, not a
presentational one like a page - then we have to do a number of corrections
*of the order* of pm, where p is the number of pages and again m is the mean
number of references (links) to other files from any given file in the
collection. For a concrete feel: with, say, n = 1000 files and m = 10 links
per file on average, that is about nm = 10,000 corrections.
How do you solve this efficiently? The *least* number of operations you have
to perform is of the order of pm - or, more precisely, it is *on average
exactly* (not just *of the order of*) nm. chm2pdf goes through each file
(that's the n in nm), and in order to find the m changes per file (m always
understood in the mean sense here), it goes through the list of all *possible*
changes, searches for each one and does a replace if it finds a match. Now
this is indeed of the order of n*n, because all *possible* changes are in an
array of length n. With the example numbers above, that is n*n = 1,000,000
searches to achieve roughly 10,000 corrections.
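In simplified Python, the current approach looks roughly like this (a sketch,
not the literal chm2pdf code):

    # Sketch of the current approach: for each of the n files, try every
    # one of the n possible corrections - hence n*n substring searches.
    filenames = list(corrections)              # the n distinct files in the set
    for filename in filenames:                 # outer loop: n files
        with open(filename) as f:
            text = f.read()
        for old, new in corrections.items():   # inner loop: n possible changes
            text = text.replace(old, new)      # search, replace on a match
        with open(filename, 'w') as f:
            f.write(text)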
So we are talking about a reduction from nn to nm, where m could also be large
(even of the order of n itself, let's not forget this!). Actually, it boils
down to finding the right m occurrences of the filenames - in each file! We
could, for example, first search for all links inside a file and then, for
every link we find, check whether it is in the list of filenames and do the
replace part. But you still have to match against all *possible* filenames,
i.e. you still have to do n comparisons inside the loop, for every one of the
n files. That's again nn (n squared) operations!
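Continuing the sketch above, that alternative would look like this (the
link-extraction regex is a deliberate simplification):

    import re

    link_pattern = re.compile(r'href="([^"]+)"')   # simplified link extraction

    for filename in filenames:                     # n files
        with open(filename) as f:
            text = f.read()
        for link in link_pattern.findall(text):    # about m links per file
            if link in filenames:                  # plain list: a linear scan,
                                                   # i.e. up to n comparisons
                text = text.replace(link, corrections[link])
        with open(filename, 'w') as f:
            f.write(text)

(Note that the n comparisons come from the *list*: if the filenames were kept
in a hash-based structure such as a Python dict or set, each membership test
would be constant time on average, bringing the total down to roughly nm. But
that is exactly the kind of optimization to discuss *after* we agree on what
to search and replace.)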