On Thu, Jan 20, 2022 at 08:26:40PM +0100, Steinar H. Gunderson wrote:
> On Mon, Jan 17, 2022 at 10:46:29PM +0000, Colin Watson wrote:
> > test_manfile (which despite the name is not a test function) calls
> > find_name with file!="-" and encoding=NULL; that causes find_name to
> > call get_page_encoding, which always returns something non-NULL
> > ("ISO-8859-1" for English pages), and then call add_manconv from that to
> > UTF-8.
> 
> I think there's a potential bug here; from attaching gdb and breaking on
> iconv_open, it seems there's a lot of encoding from UTF-8 to UTF-8, which
> should be no-ops (except that it might do some additional well-formedness
> checking). Is that intentional?

One annoying thing about man pages is that they didn't originally have
an especially well-defined encoding.  I've put work in to try to impose
one, but the best I was ever able to do was to say that pages for a
given language must be either in UTF-8 or whatever the typical legacy
encoding for that language is.  manconv implements this by doing trial
decoding from UTF-8 to the target encoding (which is always UTF-8 in
mandb(8), and typically but not always UTF-8 in man(1)); if that
succeeds then it's done, but if it fails then it has to go back and try
again with a different source encoding.  This is not necessarily
reliable in general, but it's unreliable in the correct direction (any
errors can be fixed by converting the page to UTF-8), and in practice
detection errors seem rare on the sorts of text that show up in man
pages.

So the current behaviour isn't a bug as such, but there's definitely
room for optimization here: when operating in-process, and in the common
case where the target encoding is UTF-8, the UTF-8 to UTF-8 trial
decoding path could be changed to just do a read-only "is this UTF-8"
test rather than effectively copying everything to a new buffer via
iconv.  I don't know how much faster that would be, though it seems
likely to be an improvement.  It would be some more code, so I didn't
implement it in the first pass at this, but it would be worth a second
pass to see how much it helps.

I'll see if I can make time for this, though I think a reasonable
priority for me is to finish working on your existing MR comments first
and get this ready to land.

> Apart from that, I've given your patch a quick run, and it seems to cut out
> nearly all of the unneeded overhead. So what we're potentially left with
> without doing strange things like multithreading or simplifying the lexer,
> is:
> 
>  - Get rid of the unneeded conversions (~15–20% overhead, it seems).
>  - Launch in the background, potentially.
> 
> Does that make sense? In any case, what we have here is already a huge
> improvement.

I think so, and I should be able to get the first pass done this
weekend.  Thanks for helping with this!

-- 
Colin Watson (he/him)                              [cjwat...@debian.org]

Reply via email to