On Thu, Jan 20, 2022 at 08:26:40PM +0100, Steinar H. Gunderson wrote: > On Mon, Jan 17, 2022 at 10:46:29PM +0000, Colin Watson wrote: > > test_manfile (which despite the name is not a test function) calls > > find_name with file!="-" and encoding=NULL; that causes find_name to > > call get_page_encoding, which always returns something non-NULL > > ("ISO-8859-1" for English pages), and then call add_manconv from that to > > UTF-8. > > I think there's a potential bug here; from attaching gdb and breaking on > iconv_open, it seems there's a lot of encoding from UTF-8 to UTF-8, which > should be no-ops (except that it might do some additional well-formedness > checking). Is that intentional?
One annoying thing about man pages is that they didn't originally have an especially well-defined encoding. I've put work in to try to impose one, but the best I was ever able to do was to say that pages for a given language must be either in UTF-8 or whatever the typical legacy encoding for that language is. manconv implements this by doing trial decoding from UTF-8 to the target encoding (which is always UTF-8 in mandb(8), and typically but not always UTF-8 in man(1)); if that succeeds then it's done, but if it fails then it has to go back and try again with a different source encoding. This is not necessarily reliable in general, but it's unreliable in the correct direction (any errors can be fixed by converting the page to UTF-8), and in practice detection errors seem rare on the sorts of text that show up in man pages. So the current behaviour isn't a bug as such, but there's definitely room for optimization here: when operating in-process, and in the common case where the target encoding is UTF-8, the UTF-8 to UTF-8 trial decoding path could be changed to just do a read-only "is this UTF-8" test rather than effectively copying everything to a new buffer via iconv. I don't know how much faster that would be, though it seems likely to be an improvement. It would be some more code, so I didn't implement it in the first pass at this, but it would be worth a second pass to see how much it helps. I'll see if I can make time for this, though I think a reasonable priority for me is to finish working on your existing MR comments first and get this ready to land. > Apart from that, I've given your patch a quick run, and it seems to cut out > nearly all of the unneeded overhead. So what we're potentially left with > without doing strange things like multithreading or simplifying the lexer, > is: > > - Get rid of the unneeded conversions (~15–20% overhead, it seems). > - Launch in the background, potentially. > > Does that make sense? In any case, what we have here is already a huge > improvement. I think so, and I should be able to get the first pass done this weekend. Thanks for helping with this! -- Colin Watson (he/him) [cjwat...@debian.org]