Bug#502702: chm2pdf is nowhere close to releasable

2008-11-25 Thread Chris Karakas
Hello all,

I am the author of the code that came under such severe criticism by Thomas, so 
I feel compelled to answer.

When I  started this effort, no CHM to PDF conversion utility was available in 
Linux. There were a few, yes, of which chm2pdf had the best concept in my 
opinion, but they were all very limited to what they could accomplish. The code 
you criticise was created in an effort to push things to a more usable state. 
As always, one cannot satisfy all...

The standard scenario I had in mind when I started was...uhmm...a user like 
myself: you could take the script, change the CHM2PDF_TEMP_WORK_DIR and 
CHM2PDF_TEMP_ORIG_DIR variables at the top to something useful and acceptable, 
then run it and (most of the time) be happy! You can read about the whole story 
here:

http://www.karakas-online.de/forum/viewtopic.php?t=10275

It was NOT my intention to spend 3 months pondering about security issues 
arising from using the tmp dir or whatever else. My understanding was that if 
tmp is not suitable, the user should change it to some directory in his home.

It was also NOT a priority to make things as efficient as possible. We have to 
see if things work this way acceptably *first*. If they do, THEN we can 
optimize. I will not start thinking about how to optimize a search-and-replace 
procedure that takes a few minutes (at most) to run on modern computers before 
I have feedback from the stakeholders (the community in this case) that 
what I search and replace are the right things in the first place!

A good example is that of anchors, that Thomas finds should be kept because of 
all the API documentation that uses it, while, on the other hand, I found that 
it was better to ignore, as using them would too often lead to unusable 
results. Obviously we have here two sets of CHM files that behave differently 
with regard to anchors: a great deal of API docs that (I am inclined to 
believe) use anchors correctly and another great deal of other CHMs that don't.

I would *never* throw away anchors without a reason. So if I did it it was 
because I had to. Now if the community thinks we should keep them, I am open to 
suggestions on what to do if we keep them and we get one of those CHMs that 
don't use it correctly, or cause trouble. To put it more vividly: I was *fed 
up* seeing CHMs that would NOT work through the conversion, due to the anchor 
thing!

So we have to reach an agreement on what is right first, before we continue 
with how to implement the right thing efficiently.

However, since I have a personal interest in data structures and algorithms 
myself (see my papers in 
http://www.karakas-online.de/myWork/computer_graphics.html), let me outline the 
general problem of this search-and-replace part of chm2pdf: 

We are given a set of filenames. Each file contains a certain amount of links 
to other files in the set. We don't know in advance how many, or what the mean 
value is. *Every* filename could be mentioned, even more than once (maybe two, 
three, or even 10 times), inside a given file. We want to correct all those 
links to some equivalents that we know in advance, i.e. we know that we have to 
change link A to link A', link B to link B' and so on, for known A, A', B, B' 
etc.

Suppose you have n filenames (i.e. n distinct files) and each file contains in 
the mean m links to other files. We have to do nm corrections - more or less, 
in some statistical sense. Now if the distinct files represent distinct pages 
(which is not always the case, usually they represent distinct subsections (in 
the DocBook sense) or some other *structural* (not presentational, as pages 
are) entity), then we have to do a number of corrections *of the order* of pm - 
where p is the number of pages and again m is the mean number of references 
(links) to other files from any given file in the collection.

How do you solve this efficiently? The *least* number of actions you have to 
take is of the order of pm, or, more accurately, is *in the mean, exactly* (not 
*order of*) nm. chm2pdf goes through each file (that's n in nm) and in order to 
find the m changes per file (m always understood in the mean sense here), it 
goes though the list of all *possible* changes, searches for them and does a 
replace if it finds a match. Now this is indeed of the order of n*n, because 
all *possible* changes are in an array of length n.

So we are talking about a reduction from nn to nm, where m could also be large 
(even of the order of n itself, let's not forget this!). Actually, it boils 
down to finding the right m occurrences of the filenames - in each file! We 
could, for example, first search for all links inside a file, then, for every 
link we find, check if it is in the list of filenames and do the replace 
part. But you still have to do a match against all *possible* filenames, i.e. 
you still have to do n comparisons (inside the loop for every one of the n 
filesnames). That's again nn (n squared) operations!


Bug#502702: chm2pdf is nowhere close to releasable

2008-10-19 Thread Thomas Viehmann
X-Debbugs-RC: [EMAIL PROTECTED]
Package: chm2pdf
Severity: grave
Justification: is between unusable and dangerous for non-trivial chm files
Version: 0.9-2

Hi,

I've been looking at chm2pdf's RC bug.
There are so much more problems that releasing lenny seems totally out of 
question for
quality reasons.
Some items in addition to the tempfile RC bug (only the ones that have turned 
up in the
attempt to get it to convert some chm to pdf, I have not specifically looked at 
anything):
- the calls to subprograms using os.system are not using proper
  escaping, this is a security hazard (Raphael's patch proposed to fix the
  rm -f, the the rest is just as bad),
- the code looses perfectly good data in a way that cannot be turned off (note 
that the
  typical chm files I found that related to free software were api docs which 
are loosing
  a big deal without links to the right place on the page in per-module lists of
  functions):
  # Replace links of the form somefile.html#894 with somefile0206.html
  # The following will match anchors like 'a href=temp0206.html#894' and 
will store the
  # 'temp0206.html' in backreference 1.
  # The replace string will then replace it with 'a href=temp0206.html', 
i.e. it will
  # take away the '#894' part.
  # This is because the numbers after the '#' are often wrong or non-existent. 
It is
  # better to link to an existing chapter than to a non-existent part of an 
existing
  # chapter.
- The implementation is inacceptably inefficient: The convert_to_pdf function's 
uses
  the following (match_strings and replace_garbled_strings are of length = 
number of
  pages) and this in in a loop over all pages. This is completely bogus (in 
terms of
  what they lengthily explain to try to achieve) and unacceptably inefficient, 
this can
  readily be implemented in linear time without effort.

 # Substitutions in 1st pass: we replace the original filenames with their
 # corresponding garbled equivalents.
for match_string in  match_strings:
replace_string = 
replace_garbled_strings[match_strings.index(match_string)]
page = re.sub(match_string, replace_string, page)
 # Substitutuions in the 2nd pass: we replace the garbled filenames with 
the correct
 # ones.
for match_string in  replace_garbled_strings:
replace_string = 
replace_strings[replace_garbled_strings.index(match_string)]
page = re.sub(match_string, replace_string, page)

chm2pdf never has been in a Debian release and it should not be before it gets 
better.
Please remove it from lenny.

Kind regards

T.
-- 
Thomas Viehmann, http://thomas.viehmann.net/



-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]