On 14/06/2024 21:04, Ihor Radchenko wrote:
Morgan Willcock writes:

i.e. Inserting "https://domain/test-" into the buffer will create a
clickable link for "https://domain/test".

I improved the heuristics we use to detect plain links.
Fixed, on main.
https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=73da6beb5

+++ b/etc/ORG-NEWS
[...]
+*** Trailing =-= is now allowed in plain links

After a look into

7dcb1afb6 2021-03-24 21:27:24 +0800 Ihor Radchenko: Improve org-link-plain-re

I suspect that it worked prior to v9.5. Without a unit test, it may be accidentally broken again.
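
A minimal sketch of such a regression test, written with plain ERT rather than Org's own test helpers. The test name is hypothetical, example.org stands in for the "domain" placeholder from the report, and the expectation assumes an Org version that contains the fix:

```elisp
(require 'ert)
(require 'org)
(require 'org-element)

;; Hypothetical test name; not part of Org's test suite.
(ert-deftest my-test/plain-link-trailing-hyphen ()
  "A plain link should keep its trailing hyphen."
  (with-temp-buffer
    (org-mode)
    (insert "See https://example.org/test- for details.")
    (goto-char (point-min))
    ;; Leave point inside the link so the parser returns the link object.
    (search-forward "https")
    (let ((link (org-element-context)))
      (should (eq 'link (org-element-type link)))
      (should (equal "https://example.org/test-"
                     (org-element-property :raw-link link))))))
```

It can be run interactively with M-x ert, or in batch mode with ert-run-tests-batch-and-exit.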

+: https://domain/test-

example.org, example.net, and example.com are domains reserved for use in examples: <https://www.iana.org/assignments/special-use-domain-names/special-use-domain-names.xhtml>

                    (or (regexp "[^[:punct:] \t\n]")

I have realized that some Org regexps use the [:punct:] *regexp class* while others use the punctuation *syntax class* (see the LaTeX math regexp). I am not sure whether the discrepancy is intentional.
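
To illustrate how the two can disagree, here is a small sketch (the function name is hypothetical). For ASCII characters the [:punct:] class does not depend on the syntax table, while \s. depends on whatever table is in effect. In the standard syntax table "-" normally has symbol syntax, not punctuation syntax:

```elisp
(defun my/punct-class-vs-syntax (string)
  "Return three `string-match-p' results for STRING.
They are: the [:punct:] regexp class, the punctuation syntax class
under the standard syntax table, and the punctuation syntax class
under a table in which the hyphen has punctuation syntax."
  (let ((table (make-syntax-table)))
    ;; Give ?- punctuation syntax in the private table only.
    (modify-syntax-entry ?- "." table)
    (list
     (string-match-p "[[:punct:]]" string)
     (with-syntax-table (standard-syntax-table)
       (string-match-p "\\s." string))
     (with-syntax-table table
       (string-match-p "\\s." string)))))

;; (my/punct-class-vs-syntax "-")  ; => (0 nil 0) with an unmodified
;;                                 ;    standard syntax table
```

So a regexp written with \s. may behave differently depending on the major mode of the buffer in which it is matched.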

I have noticed that the following change

09ced6d2c 2024-02-03 15:15:46 +0100 Ihor Radchenko: org-link-plain-re: Improve regexp heuristics

causes the input

    (link http://example.org/a<b)

to be exported as

    <p>
(link <a href="http://example.org/a%3Cb)">http://example.org/a%3Cb)</a></p>

I expect that ")" should not be parsed as part of the link. Balanced brackets are tricky with regexps (and it is not possible to match arbitrarily nested ones).
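
One possible post-processing step, outside of the regexp itself, is to drop trailing closing parentheses that have no opening counterpart inside the candidate. This is only a sketch with a hypothetical helper name, not what Org does:

```elisp
(require 'cl-lib)

(defun my/trim-unbalanced-close-paren (url)
  "Return URL without trailing \")\" characters that lack a matching \"(\"."
  ;; Remove one trailing paren at a time while closers outnumber openers.
  (while (and (string-suffix-p ")" url)
              (> (cl-count ?\) url) (cl-count ?\( url)))
    (setq url (substring url 0 -1)))
  url)

;; (my/trim-unbalanced-close-paren "http://example.org/a<b)")
;;   => "http://example.org/a<b"
;; (my/trim-unbalanced-close-paren "http://example.org/wiki/Lisp_(language)")
;;   => "http://example.org/wiki/Lisp_(language)"
```

Counting handles only the trailing position; it does not verify that the parentheses are properly nested.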

Perhaps "[^[:punct:] \t\n]" is too strict with respect to spaces. It does not allow the recommended workaround with a zero-width space:

(org-export-string-as
 "http://example.org\N{ZERO WIDTH SPACE}[fn::footnote]" 'html 'body)
"<p>
<a href=\"http://example.org​[fn::footnote]\">http://example.org​[fn::footnote]</a></p>
"

Actually, some kind of non-breaking space would be better in such cases:

(org-export-string-as
 "http://example.org\N{NO-BREAK SPACE}[fn::footnote]" 'html 'body)
"<p>
<a href=\"http://example.org [fn::footnote]\">http://example.org [fn::footnote]</a></p>
"

I would consider using [:space:] or \s- instead.

As to the original bug report: while reading it, I noticed that Thunderbird includes the dash in the recognized link for

  "https://domain/test-"

I decided to look into its implementation and, to my surprise, found: ``punctation chars and "-" at the end are stipped off.'' I then realized that double quotes, along with angle brackets, are treated as a recommended way to mark URLs in plain text. Thunderbird does not consider a dash to be part of the link in e.g. http://example.org/t- This might be an attempt to reserve the possibility of reassembling URLs wrapped across several lines with added hyphenation marks, but that has not been implemented (RFC 2396, Appendix E, warns about accidentally added hyphens).

https://www.bucksch.org/1/projects/mozilla/16507/
https://searchfox.org/mozilla-central/source/netwerk/streamconv/converters/mozTXTToHTMLConv.cpp#line-243
mozTXTToHTMLConv::FindURLEnd
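
The quoted rule for unquoted URLs can be approximated in Emacs Lisp as follows. This is a rough sketch only: the helper name is hypothetical and the character set is my assumption, not the one used by Thunderbird.

```elisp
(defun my/strip-trailing-punctuation (url)
  "Return URL without a trailing run of common punctuation or hyphens."
  ;; The leading \"-\" in the bracket expression is a literal hyphen;
  ;; \\' anchors the match at the end of the string.
  (replace-regexp-in-string "[-.,;:!?]+\\'" "" url))

;; (my/strip-trailing-punctuation "http://example.org/t-")
;;   => "http://example.org/t"
```

A hyphen in the middle of the URL is left alone, since only the run at the very end is removed.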

The implementation is tricky, and I have not noticed anything that could be reused to improve the heuristics for Org. Nowadays it is probably better to inspect the autolinking code of GitHub/GitLab or of widely used Python packages.
