On 24-Jan-01 Dave Sherohman wrote:
> BTW, anyone know what's up with pstotext? I ran a PS doc through it
> last night and there were a lot of extra spa ces in the outpu t,
> including many in mid-word. Is this preventable?

[Gurus: please read the speculative bit at the end]

No. And there's nothing going wrong with pstotext in this respect either.

A typical reason would be that the software which created the PostScript file did some "kerning", i.e. moving letters of a word (usually) closer together (e.g. in "Wombat" the "ombat" would be moved slightly left so that the "o" was slightly over-hung by the "W"). When this happens, the sequence of characters in a word is broken at that point, and a PostScript "motion" command is interpolated, so that in the PS code it is no longer a contiguous sequence of characters.

Unless your PStoWhatever is clever enough to reconstruct the intended word from the fragments, it will do the dumb default thing of treating separate sequences as separate words. And it would have to be pretty clever, since the spacing between words (in "filled" text) may be done by exactly the same mechanism as kerning.

The following, for instance, is from a PS file containing the sentence "The Wombat is a small animal.":

    .318(The W)12.318 F .318 (ombat is a)-1.92 F 3.802 (small animal.)72 244.8 R

Since the break-up is present in the PS file to start with, it is not due to pstotext in the first place. Only if pstotext was supposed to be capable of realising that "Wombat" was the intended result of "W" followed by "ombat", while "a" followed by "small" should be left alone, should you suspect a flaw in pstotext.

As you apparently realise, you cannot expect to do better than a very crude extraction of textual content from a PS file; a PS file is a computer program for placing marks on a page, and the fact that some of these marks are represented by characters is pretty incidental.

[For gurus] Nevertheless, I suspect that a relatively straightforward algorithm could be created for this job, assuming (for present purposes) that only the standard printable ASCII characters are needed.

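As a baseline for comparison, the "dumb default" described above can be sketched in a few lines of Python. This is only an illustration run on the quoted fragment, not a description of how pstotext itself works; the function names are made up, and the regular expression ignores PostScript string escapes and nested parentheses.

```python
import re

# The PS fragment quoted above, copied verbatim.
PS_LINE = ".318(The W)12.318 F .318 (ombat is a)-1.92 F 3.802 (small animal.)72 244.8 R"

def string_fragments(ps):
    """Return the contents of each (...) string in the PS text.

    Simplification: no handling of backslash escapes or nested parentheses.
    """
    return re.findall(r"\(([^()]*)\)", ps)

def naive_text(ps):
    """Treat every separate string as a separate word (the dumb default)."""
    return " ".join(string_fragments(ps))

print(string_fragments(PS_LINE))  # ['The W', 'ombat is a', 'small animal.']
print(naive_text(PS_LINE))        # The W ombat is a small animal.
```

The stray space in "W ombat" is exactly the kind of mid-word break Dave reported.
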
When a construct like "(The W)" is encountered, this is interpreted as an instruction to render the string "The W" on the display device. Each character (including the space) in the string is in fact a pointer to a position in the font definition which causes the PS interpreter to look up the primitive PS drawing commands which will create the shape of the printed character.

It strikes me as eminently possible to construct a program which would act like a PS interpreter in all respects _except_ that the drawing commands evoked by (e.g.) the character "W" would be replaced by simple emission of the ASCII code for "W" to the standard output.

Questions of motion between characters could be handled by the following kind of thing (where "Motion" means the displacement between where the PS file asks for a character to be printed, and where it would have been printed if it had immediately followed the previously printed character):

1. If the Motion is a small Motion (kerning) ignore it.
2. If the Motion is (approximately) a positive space, emit a space. Similarly for (approximately) 2 or more spaces.
3. If the Motion is (approximately) a negative space (overprinting) emit a "backspace".
4. If a Motion is (approximately) a positive or negative line-space (superscript, subscript), emit the corresponding positive or negative line feed.
5. If a Motion is a combination of backspace & upwards (accent above) emit the appropriate thing.

Etc.

Now: does anyone know a program which works like that?

(The advantage would be that the sort of thing that Dave Sherohman wants to do would be straightforward and should come out right, while many of the computer-program-like things which PostScript can do -- like loops and conditional branching -- would also be done as they should be; also definitions within the file (which can be "macros" that print out as blocks of text) would work too.)

Best wishes to all,
Ted.

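For concreteness, rules 1 to 3 above might be sketched in Python as follows. This is a sketch under stated assumptions, not an existing program: the names `emit_for_motion` and `reassemble`, the space width of 3.0, and the Motion values in the example are all invented for illustration (they are not derived from the numbers in the PS fragment quoted earlier), and the vertical rules 4 and 5 are left out.

```python
def emit_for_motion(dx, space_width):
    """Return the text to emit for a horizontal Motion dx.

    dx and space_width are in the same (arbitrary) units.
    Rule 1: a Motion that rounds to zero spaces is kerning -> emit nothing.
    Rule 2: about n positive spaces -> emit n spaces.
    Rule 3: about n negative spaces (overprinting) -> emit n backspaces.
    """
    n = round(dx / space_width)  # Motion measured in whole spaces
    if n > 0:
        return " " * n
    if n < 0:
        return "\b" * (-n)
    return ""

def reassemble(pieces, space_width=3.0):
    """Join (dx, text) pieces, where dx is the Motion preceding each text.

    The default space_width is a hypothetical value; a real tool would
    have to take it from the font in use.
    """
    return "".join(emit_for_motion(dx, space_width) + text for dx, text in pieces)

# Hypothetical Motions: a small kern before "ombat", about one space before "small".
pieces = [(0.0, "The W"), (-0.4, "ombat is a"), (3.1, "small animal.")]
print(reassemble(pieces))  # The Wombat is a small animal.
```

The hard part is not this classification but obtaining the Motion in the first place, which needs the character widths and hence something that behaves like a PS interpreter, as described above.
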
--------------------------------------------------------------------
Topical Thought: It is better to arrive, than to travel hopefully.
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 284 7749
Date: 24-Jan-01 Time: 17:44:01
------------------------------ XFMail ------------------------------