Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-26 Thread G. Branden Robinson
At 2023-04-26T19:33:48+0200, Oliver Corff wrote:
> On 26/04/2023 15:16, G. Branden Robinson wrote:
> > Be sure you review my earlier messages to Oliver in detail.  The
> > hyphenation code isn't "broken", it's simply limited to the C/C++
> > char type for character code points and hyphenation codes (which are
> > not "the same thing as" character code points, but do correspond to
> > them).
> 
> I am not familiar with modern incarnations of C/C++. Is there really
> no char data type that is Unicode-compliant?

There is.  But "Unicode" is a _family_ of standards.  There are multiple
ways to encode Unicode characters, and those ways are good for different
things.

Unicode is a 20.1-bit character encoding.[0]  For practical purposes,
this rounds to 32 bits.  So if you use a 32-bit arithmetic type to store
a Unicode character, you'll be fine.  The type `char32_t` has been
around since ISO C11 and ISO C++11 and is arguably the best fit for this
purpose, since `int` is not _guaranteed_ to be 32 bits wide.[1]

A long, long time ago, people noticed that in real-world texts, the code
points used by Unicode strings were not, and were not ever expected to
be, anywhere near uniformly distributed within the code space.  That
fact on top of the baked-in use of only 20.1 bits of a 32-bit type can
make use of the latter more wasteful than a lot of people can tolerate.

In fact, for much of the text encountered on the Internet--outside of
East Asia--the Unicode code points encountered in character strings are
extremely heavily weighted toward the left side of the distribution--
specifically, to the first 128 code points, also known as ISO 646 or
"ASCII".

Along came Unix co-creator Ken Thompson, over 20 years after his first
major contribution to software engineering.  Thompson was a man whose
brain took to Huffman coding like a duck to water.  Thus was born UTF-8
(which isn't a Huffman code precisely but has properties reminiscent of
one), where your ASCII code points would be expressible in one byte, and
then the higher the code point you needed to encode, the longer the
multi-byte sequence you required.  Since the Unicode Consortium had
allocated commonly used symbols and alphabetic scripts toward the lower
code points in the first place, this meant that even where you needed
more than one byte to encode a code point, with UTF-8 you might not need
more than two.  And as a matter of fact, under UTF-8, every character in
every code block up to and including NKo is expressible using up to two
bytes.[2]
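
To make the length rule concrete, here is a minimal sketch of an encoder
in C.  It is not code from groff or from any library; the name
utf8_encode is made up for illustration, and it assumes a C11 <uchar.h>
that provides char32_t.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/*
 * Encode code point cp as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded.
 * utf8_encode is a hypothetical name, not a groff or libunistring API.
 */
size_t utf8_encode(char32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                    /* ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* up to U+07FF (end of NKo): two */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                 /* rest of the BMP: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* supplementary planes: four */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode code space */
}
```

For example, U+0432 (Cyrillic small letter ve) comes out as the two
bytes 0xD0 0xB2, and U+07FF, the last code point of the NKo block, still
fits in two.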

So UTF-8 is pretty great at not being wasteful, but it does have
downsides.  It is more expensive to process than traditional byte-wide
strings.  It has state.  When you see a byte with its high bit set, you
know that it begins or continues a UTF-8 sequence.  Not all such
sequences are valid.  You have to decide what to do if a multibyte UTF-8
sequence is truncated.  If you _write_ UTF-8, you have to know how to do
so; the mapping from an ISO 10646-1 20.1-bit code point to a UTF-8
sequence is not trivial.
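
The reading side can be sketched in C as well.  This is only an
illustration of the decisions listed above (invalid lead bytes, bad
continuation bytes, overlong forms, truncated input), not groff's code;
utf8_decode and the utf8_status values are hypothetical names, and a
C11 <uchar.h> is assumed.

```c
#include <stddef.h>
#include <uchar.h>

enum utf8_status { UTF8_OK, UTF8_INVALID, UTF8_TRUNCATED };

/*
 * Decode one code point from the first n bytes of s.
 * On UTF8_OK, *cp receives the code point and *len its byte length.
 * (Hypothetical helper for illustration only.)
 */
enum utf8_status utf8_decode(const unsigned char *s, size_t n,
                             char32_t *cp, size_t *len)
{
    size_t need, i;
    char32_t c, min;

    if (n == 0)
        return UTF8_TRUNCATED;
    if (s[0] < 0x80) {                  /* high bit clear: plain ASCII */
        *cp = s[0];
        *len = 1;
        return UTF8_OK;
    } else if ((s[0] & 0xE0) == 0xC0) { /* 110xxxxx: 2-byte sequence */
        need = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) { /* 1110xxxx: 3-byte sequence */
        need = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) { /* 11110xxx: 4-byte sequence */
        need = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return UTF8_INVALID;            /* stray continuation or 0xF8..0xFF */
    }
    for (i = 1; i < need; i++) {
        if (i == n)
            return UTF8_TRUNCATED;      /* input ended mid-sequence */
        if ((s[i] & 0xC0) != 0x80)
            return UTF8_INVALID;        /* expected a 10xxxxxx byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return UTF8_INVALID;
    *cp = c;
    *len = need;
    return UTF8_OK;
}
```

What a caller does with UTF8_INVALID or UTF8_TRUNCATED (substitute
U+FFFD, wait for more input, or give up) is exactly the policy decision
mentioned above; the sketch only reports which case occurred.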

groff, like AT&T nroff (and roff(1) before it?), doesn't handle a large
quantity of character strings at one time, relative to the overall size
of its typical inputs.  It follows the Unix filter model and does not
absorb an entire input before processing it.  It accumulates input lines
until it is time to emit an output line (either to the output stream or
to a diversion), then flushes the output line and moves on.

It would probably be a good idea to represent Unicode strings internally
using char32_t as a base type anyway, but groff's design under the Unix
filter model described above makes the choice less dramatic in terms of
increased space consumption than it would otherwise be.  The formatter
is not even instrumented to measure the output node lists it builds.  If
this were done, we could tell exactly what the cost of moving from
`char` to `char32_t` is.  And if I'm the person who ends up doing this
work, maybe I will collect such measurements.

"Unicode-compliant" is not a precise enough term to mean much of
anything.  Unicode is a family of standards: a core, of which most
people have heard, and a pile of "standard annexes" which can be really
important in various domains of practical engineering around this
encoding system.[3]

There exist many resources to assist us with converting UTF-8 to and
from 32-bit code points.  For instance, there is GNU libunistring.  I've
studied the problems it identifies with "the char32_t approach"[4] and I
don't think they apply to groff.

Maybe gnulib, which we already use, has some facilities as well.  I'm
not sure--it's huge and I've only recently started familiarizing myself
with its manual.

I've been pointedly ignoring two other encodings of Unicode strings:
UTF-16LE and UTF-16BE.  They are terrible but we can't completely avoid
them; Microsoft and Adobe are deeply wedded to them and while we can
largely ignore Microsoft, Adobe managed to contaminate the international
standard f

Re: Warn on semantic newlines

2023-04-26 Thread Bjarni Ingi Gislason
On Fri, Jun 10, 2022 at 11:52:30AM +0200, Alejandro Colomar wrote:
> Hi, Ingo and Branden!
> 
> As far as I know, there's currently no tool that warns on "foo. bar" in
> filled text.  Not `mandoc -Tlint`, and not `groff -ww`, and not `groff
> -rCHECKSTYLE=999`.  I know that CHECKSTYLE is not designed in a way that
> could catch this easily, but maybe -ww or -Tlint could.  Do you think you
> could add some semantic newlines warnings so that writers could realize by
> themselves that their text could be improved?
> 
> The tool could have a secondary warning, not so important, for "foo, bar".
> 
> Also, as far as I know, neither of -ww nor -Tlint have something equivalent
> to -Wno-switch (or -Wno-error=switch), which could be nice to silence (or
> make non-fatal) some warnings on purpose.  Do you think that could be
> implemented in groff(1) or mandoc(1)?
> 
[...]

  "groff" is not the right tool for such things, but "grep" is.

  The attachment contains a shell script that tests various cases of
defects in man pages.

  It can run just one test, a few of them, or all of them.

  For example, create a file with

foo. bar
foo.  bar
foo.  Bar
foo. Bar

  or more examples

and run 

 all 

  Later you can use the reported test numbers to just run those tests.

  The script can (still) produce a lot of false positives.

#!/bin/bash
# Input
# 1) one number, one or more files
# 2) "all", one or more files

# In $SEDLIB: "groff.comment.sed", "groff.TH.sed", "groff.hyphen-minus.sed",
# "chk_manuals", "strings_gt"
#
# "chk_manuals" uses: "in_out_put.sh", "mandoc", "groff.lint", and
# "roff.singleword.sed" 

# Environment variable: MANWIDTH (see man(1)) with 'm' unit
#
# Instead of "test-groff" (in the git repository)

#set -x
set -f
SEDLIB=$HOME/bin

# Check arguments

case "$1" in
  all|[1-9]|[1-9][0-9]) :
  ;;
  *) echo 'First argument is not a number or the word "all", but "'"$1"'"' >&2
     exit 1
  ;;
esac

Cmd_name=${0##*/}
#echo $Cmd_name
total=0
declare -a command do_what patch_explain

#n+1 1

total=$((total + 1))
#regexp[${total}]=" -e ' $'"
do_what[${total}]='Remove space at end of lines.'
patch_explain[${total}]=\
'Remove space characters at the end of lines.

Use "git apply ... --whitespace=fix" to fix extra space issues, or use
global configuration "core.whitespace".'
command[${total}]="grep -n -e ' \$'"

#n+1 2
#set -f

while false; do
total=$((total + 1))
#regexp[${total}]=b
#eval echo \"\${regexp[${total}]}\"
do_what[${total}]='Fix warnings from test-groff.'
#eval echo \"\${do_what[${total}]}\"
patch_explain[${total}]=\
"Enable and fix warnings from 'test-groff'."
#eval echo \"\${patch_explain[${total}]}\"; exit
command[${total}]='chk_manuals'
done


#n+1 3

total=$((total + 1))
#regexp[${total}]=" -e '\`-'"
do_what[${total}]='Change \`- to '"'"'"\-".'
patch_explain[${total}]=\
'Change \`- (prints as a grave and a hyphen) to '"'"'\-.
\` (grave) is sometimes used as a left single quote (in ASCII text),
but is easily confused with the start of a command substitution in the
shell.'
command[${total}]="grep -n -e '\`-'"

#n+1 4



while false; do
total=$((total + 1))

#regexp[${total}]="-e '\\\`'"
#eval echo \"\${regexp[${total}]}\"
do_what[${total}]="Change \\\` or '\`' (grave) to ', if it is a quotation mark."
#eval echo \"\${do_what[${total}]}\"
patch_explain[${total}]=\
'Change \` or ` (grave) to '"'"', if it is a quotation mark.'
#eval echo \"\${patch_explain[${total}]}\"; exit
command[${total}]="sed -e \"/^[.'] *.*\\*[^\\]/d\" \$file | \
grep -n --label=\${file} --exclude=\${file##\*/} -e '^[\\]\`' \
  -e '[ (][\\]\`' -e '^\`' -e '[ (]\`' -"

done
#n+1 4

total=$((total + 1))
#regexp[${total}]=b
#eval echo \"\${regexp[${total}]}\"
do_what[${total}]="Change \\' (acute) to \\[aq] (single straight quote)
or ', if used as a single quote.
Change \\' (acute) to a ', if used as an apostrophe."
#eval echo do_what, single straight quote: \"\${do_what[${total}]}\";
patch_explain[${total}]=\
"Change \\' (acute) to \\[aq], if used as a quote.
Change \\' (acute) to ', if used as an apostrophe."
#eval echo \"single straight quote:\" \"\${patch_explain[${total}]}\";
command[${total}]="grep -n -e ''\"'\"" 
#echo without " eval", change "acute": "\"\${command[${total}]}\""
#eval echo change " acute: "\"\${command[${total}]}\""
#eval echo command[$total] er \""\${command[$total]}"\"

#n+1 5

#total=$((total + 1))
#regexp[${total}]=b
#eval echo \"\${regexp[${total}]}\"
#do_what[${total}]='Add .tr '"'"'\\[aq] at the top of the file.'
#eval echo \"\${do_what[${total}]}\"
#patch_explain[${total}]=\
#'Add command .tr '"'"'\[aq] at the top of the source file to use only
#straight single quotes (nondirectional ones)[1].
#[1] man-pages(7) [package "manpages"'
#eval echo \"\${patch_explain[${total}]}\"; exit
# Not in comment or command lines
#command[${total}]="sed -e \"/^[.']/d\" \$file | \
#  grep -q  -e \" '\" -e \"' \" -e \"'$\" " 
#  grep -q -n --label=\$file --exclude=\${file##\*/} -e \" '\" \
#-e \"' \" 

Re: proctological linter warnings on groff's man pages (was: mdoc(7): CHECKSTYLE)

2023-04-26 Thread Alejandro Colomar


On 4/26/23 11:44, Alejandro Colomar wrote:
> Hi Branden,
> 
> On 4/26/23 11:06, G. Branden Robinson wrote:
>> Hi Alex,
>>
>> At 2023-04-24T19:11:58+0200, Alex Colomar wrote:
>>>> At 2023-04-23T16:17:06+0200, Alejandro Colomar wrote:
>>>>> I got some errors from mdoc(7), which were probably due to the
>>>>> LANDMINE
>>>>> .
>>>>> Why is that file problematic with mdoc(7)?
>> [...]
>>> 
>>
>> It's not obvious to me why that macro file would cause any problems.
> 
> You could try if you're curious; since I already removed groff's
> pages from my build system, I'd have to repeat the setup.  If you
> want me to do it, I can.  Otherwise, I already removed that file
> from the Linux man-pages, so we can just ignore this, if you're not
> curious enough.
> 

I can't reproduce it.  It might be due to running 1.22.4 (but that
being from the 23rd of April, something doesn't really fit in my head).

-- 

GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5




Re: mdoc(7): CHECKSTYLE

2023-04-26 Thread Alejandro Colomar
Hi Branden,

On 4/24/23 19:11, Alex Colomar wrote:
> $ make check -k
> TROFF .tmp/man/man1/gpinyin.1.cat.set
> troff: man1/gpinyin.1:316: warning: macro 'AD' not defined
> make: *** [share/mk/build/catman.mk:47: .tmp/man/man1/gpinyin.1.cat.set] 
> Error 1
> make: *** Deleting file '.tmp/man/man1/gpinyin.1.cat.set'
> TROFF .tmp/man/man5/groff_font.5.cat.set
> troff: man5/groff_font.5:813: warning: macro 'AD' not defined
> make: *** [share/mk/build/catman.mk:47: 
> .tmp/man/man5/groff_font.5.cat.set] Error 1
> make: *** Deleting file '.tmp/man/man5/groff_font.5.cat.set'
> TROFF .tmp/man/man7/groff_char.7.cat.set
> troff: man7/groff_char.7:1582: warning: can't find special character 'bs'
> troff: man7/groff_char.7:1808: warning: can't find special character 
> 'radicalex'
> troff: man7/groff_char.7:1810: warning: can't find special character 
> 'sqrtex'
> make: *** [share/mk/build/catman.mk:47: 
> .tmp/man/man7/groff_char.7.cat.set] Error 1
> make: *** Deleting file '.tmp/man/man7/groff_char.7.cat.set'
> TROFF .tmp/man/man7/groff_mdoc.7.cat.set
> mdoc warning: .St: Unknown standard abbreviation '-susv1' (#2540)
>Please refer to the groff_mdoc(7) manpage for a
>list of available standard abbreviations.
> mdoc warning: .St: Unknown standard abbreviation '-susv4' (#2546)
>Please refer to the groff_mdoc(7) manpage for a
>list of available standard abbreviations.
> make: *** [share/mk/build/catman.mk:53: 
> .tmp/man/man7/groff_mdoc.7.cat.set] Error 1
> make: *** Deleting file '.tmp/man/man7/groff_mdoc.7.cat.set'
> make: Target 'check' not remade because of errors.

You were right; I ran 1.22.4.  1.23.0 seems to like your pages :)

$ make check V=1
make: Nothing to be done for 'check'.


Cheers,
Alex

-- 

GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5




Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-26 Thread Oliver Corff

Hi Robin and Branden,

On 26/04/2023 15:16, G. Branden Robinson wrote:

> At 2023-04-26T15:16:55+0300, Robin Haberkorn wrote:
> > For future texts I therefore wanted to return to Groff (where we also
> > have the excellent MOM macros). Not being able to hyphenate UTF-8
> > Cyrillic text is a major limitation for me. I might get away with
> > converting it to KOI8 first, but could I still mix in Unicode
> > characters this way (as they are considered special characters by
> > Groff)?


I have needs similar to yours in processing UTF-8 Cyrillic text (mostly
not Russian, though).

Mixing two different encodings in one document is generally not a very
feasible idea; bytes from the single-byte encoding are typically
displayed as a generic placeholder. Open, for instance, any KOI8-R encoded
document in a UTF-8 terminal; you get either something that looks like
two-letter combinations or question marks all over the KOI8-R part(s) of
the document. While a machine could, in theory, deal with such a matter,
it is simply a nuisance for a human editor/author to have to work with
such input.


> Be sure you review my earlier messages to Oliver in detail.  The
> hyphenation code isn't "broken", it's simply limited to the C/C++ char
> type for character code points and hyphenation codes (which are not "the
> same thing as" character code points, but do correspond to them).


I am not familiar with modern incarnations of C/C++. Is there really no
char data type that is Unicode-compliant?

Best regards,

Oliver.


--

Dr. Oliver Corff
Wittelsbacherstr. 5A
10707 Berlin
GERMANY
Tel.: +49-30-85727260
mailto:oliver.co...@email.de




Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-26 Thread G. Branden Robinson
At 2023-04-26T15:16:55+0300, Robin Haberkorn wrote:
> For future texts I therefore wanted to return to Groff (where we also
> have the excellent MOM macros). Not being able to hyphenate UTF-8
> Cyrillic text is a major limitation for me. I might get away with
> converting it to KOI8 first, but could I still mix in Unicode
> characters this way (as they are considered special characters by
> Groff)?

Yes.  Special characters are written in ASCII, so there's no problem
there.  You could even mix KOI8-R Russian with Unicode Russian in the
form \[u0432]...just don't expect the latter to hyphenate correctly.

> Perhaps I will have a look at the hyphenation code and try to fix it.
> Hacking the typesetter is always a perfect distraction from the work
> you are supposed to do instead. ;-)

Be sure you review my earlier messages to Oliver in detail.  The
hyphenation code isn't "broken", it's simply limited to the C/C++ char
type for character code points and hyphenation codes (which are not "the
same thing as" character code points, but do correspond to them).

Regards,
Branden




Re: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian

2023-04-26 Thread Robin Haberkorn

25.04.23 19:51, G. Branden Robinson wrote:

> While I'm pontificating I'll opine that I'm not a huge fan of C++ as
> a language, but I have found with groff that, given discipline, and
> by maintaining a clear view of its roots in C (_also_ not my
> favorite language--but one alienating, enemy-making rant at a time),
> and not picking up every f***ing new feature that gets shoved into
> the language as soon as (or before) it's standardized, it _can_ be
> managed.  But I also think that the C++ templating facility was, in
> implementation, one of the worst features ever developed for any
> programming language.


I would largely agree with that. The only acceptable C++ is the one close to C,
especially if you do indeed interface with C APIs. But even then it remains
broken by design with its classes in headers, forcing you to expose every type
belonging to your class to everybody. What's the benefit of C++, especially when
refraining from namespaces? Deeply nested class hierarchies? You really
shouldn't have those anyway. IMHO you can get much clearer and better-isolated
code (smaller headers anyway) with properly written, idiomatic plain C. It's
the lesser of two evils. The preprocessor is one of those things I am also not
happy with, although I found that C++ often pushes you to metaprogramming for
only marginally improved type safety compared to plain C solutions that do not
use the preprocessor. As a side effect you get bloated binaries that thrash your
cache hierarchy. On the other hand, the C preprocessor could be made much more
useful for metaprogramming with a few simple extensions...
Not long ago I migrated SciTECO from "C-like" C++ to plain C, and I am not
looking back!




Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-26 Thread Robin Haberkorn

Hello!

I can confirm that Neatroff (and Heirloom Troff) works well for typesetting 
Russian texts including hyphenation.
BUT, I found them unsuitable for complex scientific texts as their ms macros are 
buggy and tbl is somewhat limited. Regarding Neatroff, I found that its 
hyperlinking capabilities are extremely limited.


For future texts I therefore wanted to return to Groff (where we also have the 
excellent MOM macros). Not being able to hyphenate UTF-8 Cyrillic text is a 
major limitation for me. I might get away with converting it to KOI8 first, but 
could I still mix in Unicode characters this way (as they are considered special 
characters by Groff)?


Perhaps I will have a look at the hyphenation code and try to fix it. Hacking 
the typesetter is always a perfect distraction from the work you are supposed to 
do instead. ;-)


Yours sincerely,
Robin

26.04.23 14:10, Ralph Corderoy wrote:

> Hi Oliver,
>
> Are you aware there are other troff implementations than GNU's groff?
> Neatroff is one.  Ali Gholami Rudi wrote it because he wanted better
> Unicode support for foreign languages, including right-to-left text.
> He seems very much of your mould in needs.
>
> A good summary of its features is http://litcave.rudi.ir/neatroff.pdf
> I see UTF-8 hyphenation files mentioned.
> There's also whole-paragraph formatting and lots of other delights.
> Rudi's http://litcave.rudi.ir has a Typesetting section past the initial
> list of recent changes to his software.
>
> Feel free to continue discussing neatroff here along with general troff
> questions.





Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-26 Thread Oliver Corff

Hi Ralph,

I could not resist the temptation to procrastinate from my current work
and had a look at neatroff.

Really neat!

Out-of-the-box, my test file russ.ms and TeX utf8 hyphenation patterns
taken straight from my TeX installation produced the attached very
satisfying result.

Best regards,

Oliver.


On 26/04/2023 13:10, Ralph Corderoy wrote:

> Hi Oliver,
>
> Are you aware there are other troff implementations than GNU's groff?
> Neatroff is one.  Ali Gholami Rudi wrote it because he wanted better
> Unicode support for foreign languages, including right-to-left text.
> He seems very much of your mould in needs.
>
> A good summary of its features is http://litcave.rudi.ir/neatroff.pdf
> I see UTF-8 hyphenation files mentioned.
> There's also whole-paragraph formatting and lots of other delights.
> Rudi's http://litcave.rudi.ir has a Typesetting section past the initial
> list of recent changes to his software.
>
> Feel free to continue discussing neatroff here along with general troff
> questions.


--
Dr. Oliver Corff
Wittelsbacherstr. 5A
10707 Berlin
GERMANY
Tel.: +49-30-85727260
mailto:oliver.co...@email.de



russ.ps
Description: PostScript document
.hpf hyph-ru.tex
.TL
A Test of Russian
.AB
This little test is supposed to typeset Russian.
I searched for a few terribly long Russian words
and set everything in two-column mode as to 
challenge hyphenation.
.AE
.2C
.SH
Longest Russian Words
.LP
Превысокомногорассмотрительствующий Водогрязеторфопарафинолечение
Cельскохозяйственно-машиностроительный
Рентгеноэлектрокардиографического Частнопредпринимательского
Переосвидетельствующимися
Субстанционализирующимися
Превысокомногорассмотрительствующий Водогрязеторфопарафинолечение
Cельскохозяйственно-машиностроительный
Рентгеноэлектрокардиографического Частнопредпринимательского
Переосвидетельствующимися
Субстанционализирующимися
Превысокомногорассмотрительствующий Водогрязеторфопарафинолечение
Cельскохозяйственно-машиностроительный
Рентгеноэлектрокардиографического Частнопредпринимательского
Переосвидетельствующимися
Субстанционализирующимися
.SH
A Russian Test.
.LP
В начале 1980-х годов компания AT&T, которой принадлежала Bell Labs, осознала ценность Unix и начала создание коммерческой версии операционной системы. Эта версия, поступившая в продажу в 1982 году, носила название UNIX System III и была основана на седьмой версии системы.

Однако компания не могла напрямую начать развитие Unix как коммерческого продукта из-за запрета, наложенного правительством США в 1956 году. Министерство юстиции вынудило AT&T подписать соглашение, запрещавшее компании заниматься деятельностью, не связанной с телефонными и телеграфными сетями и оборудованием. Для того, чтобы всё-таки иметь возможность перевести Unix в ранг коммерческих продуктов, компания передала исходный код операционной системы некоторым высшим учебным заведениям, лицензировав код под очень либеральными условиями. В декабре 1973 года одним из первых исходные коды получил университет Беркли[11].

С 1978 года начинает свою историю BSD Unix, созданный в университете Беркли. Его первая версия была основана на шестой редакции. В 1979 выпущена новая версия, названная 3BSD, основанная на седьмой редакции. BSD поддерживал такие полезные свойства, как виртуальную память и замещение страниц по требованию. Автором BSD был Билл Джой.

Важной причиной раскола Unix стала реализация в 1980 году стека пр

Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-26 Thread Oliver Corff

Hi Ralph,

Thank you very much for mentioning neatroff. In principle, I am aware
that there are other implementations, all with their particular unique
features, but I never dived into anything other than groff so far (also
due to the fruitful and friendly exchange on this mailing list), and
neatroff was known to me by name only.

I'll have a look at neatroff during the weekend.

I also noticed heirloom troff (and their font support) but so far
haven't managed to build it from source. Their system layout has some
peculiarities.

Best regards,

Oliver.


On 26/04/2023 13:10, Ralph Corderoy wrote:

> Hi Oliver,
>
> Are you aware there are other troff implementations than GNU's groff?
> Neatroff is one.  Ali Gholami Rudi wrote it because he wanted better
> Unicode support for foreign languages, including right-to-left text.
> He seems very much of your mould in needs.
>
> A good summary of its features is http://litcave.rudi.ir/neatroff.pdf
> I see UTF-8 hyphenation files mentioned.
> There's also whole-paragraph formatting and lots of other delights.
> Rudi's http://litcave.rudi.ir has a Typesetting section past the initial
> list of recent changes to his software.
>
> Feel free to continue discussing neatroff here along with general troff
> questions.


--
Dr. Oliver Corff
mailto:oliver.co...@email.de




neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-26 Thread Ralph Corderoy
Hi Oliver,

Are you aware there are other troff implementations than GNU's groff?
Neatroff is one.  Ali Gholami Rudi wrote it because he wanted better
Unicode support for foreign languages, including right-to-left text.
He seems very much of your mould in needs.

A good summary of its features is http://litcave.rudi.ir/neatroff.pdf
I see UTF-8 hyphenation files mentioned.
There's also whole-paragraph formatting and lots of other delights.
Rudi's http://litcave.rudi.ir has a Typesetting section past the initial
list of recent changes to his software.

Feel free to continue discussing neatroff here along with general troff
questions.

-- 
Cheers, Ralph.



Re: proctological linter warnings on groff's man pages (was: mdoc(7): CHECKSTYLE)

2023-04-26 Thread Alejandro Colomar
Hi Branden,

On 4/26/23 11:06, G. Branden Robinson wrote:
> Hi Alex,
> 
> At 2023-04-24T19:11:58+0200, Alex Colomar wrote:
>>> At 2023-04-23T16:17:06+0200, Alejandro Colomar wrote:
>>>> I got some errors from mdoc(7), which were probably due to the
>>>> LANDMINE
>>>> .
>>>> Why is that file problematic with mdoc(7)?
> [...]
>> 
> 
> It's not obvious to me why that macro file would cause any problems.

You could try if you're curious; since I already removed groff's
pages from my build system, I'd have to repeat the setup.  If you
want me to do it, I can.  Otherwise, I already removed that file
from the Linux man-pages, so we can just ignore this, if you're not
curious enough.

> 
>> $ make check -k
>> troff: man1/gpinyin.1:316: warning: macro 'AD' not defined
>> troff: man5/groff_font.5:813: warning: macro 'AD' not defined
> 
> These are expected if you format groff 1.23.0 man pages with groff
> 1.22.4 (or older).  I had to resort to some unpleasantness in those 2
> pages.

But I didn't.  I have groff 1.23.0-some-commit-around-rc3 on all of
my machines.  Maybe I installed it incorrectly?


I don't remember if I ran this from my desktop or laptop.  Here's
what my desktop reports:

$ groff --version
GNU groff version 1.23.0.rc3.44-a241d
Copyright (C) 2022 Free Software Foundation, Inc.
GNU groff comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of groff and its subprograms
under the terms of the GNU General Public License.
For more information about these matters, see the file
named COPYING.

called subprograms:

GNU grops (groff) version 1.23.0.rc3.44-a241d
GNU troff (groff) version 1.23.0.rc3.44-a241d


Maybe I did something wrong and ran 1.22.4 somehow.  Do you want me
to try again with 1.23.0?

> 
>> troff: man7/groff_char.7:1582: warning: can't find special character 'bs'
>> troff: man7/groff_char.7:1808: warning: can't find special character 
>> 'radicalex'
>> troff: man7/groff_char.7:1810: warning: can't find special character 'sqrtex'
> 
> These are documented in the groff PROBLEMS file.  They are harmless.
> 
> https://git.savannah.gnu.org/cgit/groff.git/tree/PROBLEMS?id=dbd2b2007280f5125227307ff6962c0948366aef#n985
> 
> The solution is to design and ship a small font to solve some glyph
> coverage problems.  I think we talked about that on this list last
> summer.

But that seems to be about PS, doesn't it?  My run was with -Tutf8.

> 
>> mdoc warning: .St: Unknown standard abbreviation '-susv1' (#2540)
>>   Please refer to the groff_mdoc(7) manpage for a
>>   list of available standard abbreviations.
>> mdoc warning: .St: Unknown standard abbreviation '-susv4' (#2546)
>>   Please refer to the groff_mdoc(7) manpage for a
>>   list of available standard abbreviations.
> 
> https://savannah.gnu.org/bugs/?55789

Hmm, this is fixed, so it seems I was using 1.22.4 accidentally.

> 
>> $ make lint-man-tbl -k
>> LINT (tbl comment)   .tmp/man/man1/chem.1.lint-man.tbl.touch
>> man1/chem.1:1: missing '\" t' comment:
>> .TH \%chem 1 "24 April 2023" "groff 1.23.0.rc4.19-96b92"
>> make: *** [share/mk/lint/man/man.mk:42:
>> .tmp/man/man1/chem.1.lint-man.tbl.touch] Error 1
>> LINT (tbl comment)   .tmp/man/man1/groff.1.lint-man.tbl.touch
>> man1/groff.1:1: missing '\" t' comment:
>> .TH groff 1 "24 April 2023" "groff 1.23.0.rc4.19-96b92"
>> make: *** [share/mk/lint/man/man.mk:42:
>> .tmp/man/man1/groff.1.lint-man.tbl.touch] Error 1
>> LINT (tbl comment)   .tmp/man/man7/groff_www.7.lint-man.tbl.touch
>> man7/groff_www.7:1: missing '\" t' comment:
>> .TH groff_www 7 "24 April 2023" "groff 1.23.0.rc4.19-96b92"
>> make: *** [share/mk/lint/man/man.mk:42:
>> .tmp/man/man7/groff_www.7.lint-man.tbl.touch] Error 1
>> make: Target 'lint-man-tbl' not remade because of errors.
> 
> I thought Colin Watson withdrew interpretation of this type of comment
> from man-db man, and mandoc(1) doesn't support it either.
> 
> If I'm right, that would leave it without any consumers in Free Software
> man pagers (troffs themselves, AFAIK, have never done anything with it).

I received a report that lintian still uses it.  Maybe Jakub or Marcos
can confirm.

> 
>> And here go the ones from mandoc(1), which is more picky:
>>
>> $ make lint-man-mandoc -k |& grep -v -e 'WARNINGS: invalid escape sequence'
>> -e 'UNSUPP: unsupported roff request' -e 'WARNING: invalid escape sequence'
>> LINT (mandoc)   .tmp/man/man1/addftinfo.1.lint-man.mandoc.touch
>> mandoc: man1/addftinfo.1:1:17: WARNING: cannot parse date, using it
>> verbatim: TH 24 April 2023
> 
> I reject mandoc's attempt at date format enforcement.  While,
> personally, I prefer ISO 8601 format, our man pages' dates come from our
> mdate.pl script.  That's the place to change it, if anyon

proctological linter warnings on groff's man pages (was: mdoc(7): CHECKSTYLE)

2023-04-26 Thread G. Branden Robinson
Hi Alex,

At 2023-04-24T19:11:58+0200, Alex Colomar wrote:
> > At 2023-04-23T16:17:06+0200, Alejandro Colomar wrote:
> > > I got some errors from mdoc(7), which were probably due to the
> > > LANDMINE.
> > > Why is that file problematic with mdoc(7)?
[...]
> 

It's not obvious to me why that macro file would cause any problems.

> $ make check -k
> troff: man1/gpinyin.1:316: warning: macro 'AD' not defined
> troff: man5/groff_font.5:813: warning: macro 'AD' not defined

These are expected if you format groff 1.23.0 man pages with groff
1.22.4 (or older).  I had to resort to some unpleasantness in those 2
pages.

> troff: man7/groff_char.7:1582: warning: can't find special character 'bs'
> troff: man7/groff_char.7:1808: warning: can't find special character 'radicalex'
> troff: man7/groff_char.7:1810: warning: can't find special character 'sqrtex'

These are documented in the groff PROBLEMS file.  They are harmless.

https://git.savannah.gnu.org/cgit/groff.git/tree/PROBLEMS?id=dbd2b2007280f5125227307ff6962c0948366aef#n985

The solution is to design and ship a small font to solve some glyph
coverage problems.  I think we talked about that on this list last
summer.

> mdoc warning: .St: Unknown standard abbreviation '-susv1' (#2540)
>   Please refer to the groff_mdoc(7) manpage for a
>   list of available standard abbreviations.
> mdoc warning: .St: Unknown standard abbreviation '-susv4' (#2546)
>   Please refer to the groff_mdoc(7) manpage for a
>   list of available standard abbreviations.

https://savannah.gnu.org/bugs/?55789

> $ make lint-man-tbl -k
> LINT (tbl comment) .tmp/man/man1/chem.1.lint-man.tbl.touch
> man1/chem.1:1: missing '\" t' comment:
> .TH \%chem 1 "24 April 2023" "groff 1.23.0.rc4.19-96b92"
> make: *** [share/mk/lint/man/man.mk:42:
> .tmp/man/man1/chem.1.lint-man.tbl.touch] Error 1
> LINT (tbl comment) .tmp/man/man1/groff.1.lint-man.tbl.touch
> man1/groff.1:1: missing '\" t' comment:
> .TH groff 1 "24 April 2023" "groff 1.23.0.rc4.19-96b92"
> make: *** [share/mk/lint/man/man.mk:42:
> .tmp/man/man1/groff.1.lint-man.tbl.touch] Error 1
> LINT (tbl comment) .tmp/man/man7/groff_www.7.lint-man.tbl.touch
> man7/groff_www.7:1: missing '\" t' comment:
> .TH groff_www 7 "24 April 2023" "groff 1.23.0.rc4.19-96b92"
> make: *** [share/mk/lint/man/man.mk:42:
> .tmp/man/man7/groff_www.7.lint-man.tbl.touch] Error 1
> make: Target 'lint-man-tbl' not remade because of errors.

I thought Colin Watson withdrew interpretation of this type of comment
from man-db man, and mandoc(1) doesn't support it either.

If I'm right, that would leave it without any consumers in Free Software
man pagers (troffs themselves, AFAIK, have never done anything with it).
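
For readers unfamiliar with the comment in question: by convention, a man
page that needs tbl(1) starts with the line '\" t so that a pager can tell
which preprocessors to run.  The following is only a rough sketch of what
such a lint check might look for; it is not the code behind the
lint-man-tbl target, and both the function name and the assumption that
only pages containing .TS are checked are hypothetical.

```python
def missing_tbl_hint(page_text: str) -> bool:
    """Return True if a page uses tbl tables but line 1 lacks the 't' hint.

    Assumption: a page "uses tbl" if some line starts with ".TS".
    """
    lines = page_text.splitlines()
    if not any(line.startswith(".TS") for line in lines):
        return False  # no tables, so no hint is needed
    first = lines[0] if lines else ""
    prefix = "'\\\" "  # the four characters: quote, backslash, double quote, space
    if not first.startswith(prefix):
        return True
    # The letters after the prefix name preprocessors; 't' stands for tbl.
    return "t" not in first[len(prefix):].strip()


if __name__ == "__main__":
    flagged = ".TH groff 1 \"24 April 2023\"\n.TS\nl l.\na\tb\n.TE\n"
    print(missing_tbl_hint(flagged))
```
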

> And here go the ones from mandoc(1), which is more picky:
> 
> $ make lint-man-mandoc -k |& grep -v -e 'WARNINGS: invalid escape sequence'
> -e 'UNSUPP: unsupported roff request' -e 'WARNING: invalid escape sequence'
> LINT (mandoc) .tmp/man/man1/addftinfo.1.lint-man.mandoc.touch
> mandoc: man1/addftinfo.1:1:17: WARNING: cannot parse date, using it
> verbatim: TH 24 April 2023

I reject mandoc's attempt at date format enforcement.  While,
personally, I prefer ISO 8601 format, our man pages' dates come from our
mdate.pl script.  That's the place to change it, if anyone means to.
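
To illustrate the difference between the two date styles under discussion,
here is a minimal sketch that turns a date of the form quoted in the
warning ("24 April 2023") into ISO 8601.  It is written in Python purely
for illustration; mdate.pl is a Perl script and this is not its logic.
The function name is made up, and the sketch assumes English month names.

```python
from datetime import datetime


def to_iso8601(man_date: str) -> str:
    """Convert a 'day Month year' date string to 'YYYY-MM-DD'."""
    # %B matches a full month name; in Python's default C locale that
    # means English names such as "April".
    return datetime.strptime(man_date, "%d %B %Y").date().isoformat()


if __name__ == "__main__":
    print(to_iso8601("24 April 2023"))
```
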

> mandoc: man1/gperl.1:290:2: WARNING: skipping paragraph macro: PP empty
> mandoc: man1/gperl.1:462:2: WARNING: skipping paragraph macro: PP after SH

This is a Bernd Warken page that I have not thoroughly reviewed.

The worst of those is glilypond(1).

> make: *** [share/mk/lint/man/man.mk:28:
> .tmp/man/man1/gperl.1.lint-man.mandoc.touch] Error 1
> LINT (mandoc) .tmp/man/man1/gpinyin.1.lint-man.mandoc.touch
> mandoc: man1/gpinyin.1:255:23: WARNING: undefined string, using "": a-
> mandoc: man1/gpinyin.1:257:33: WARNING: undefined string, using "": a<
> mandoc: man1/gpinyin.1:263:2: ERROR: skipping insecure request: lf
> mandoc: man1/gpinyin.1:316:5: WARNING: undefined string, using "": AD

See the comments in the page.  I had to resort to some tricks.

> mandoc: man1/gropdf.1:738:2: WARNING: skipping paragraph macro: PP empty
> mandoc: man1/gropdf.1:774:2: WARNING: skipping paragraph macro: PP empty
> mandoc: man1/gropdf.1:796:2: WARNING: skipping paragraph macro: PP empty
> mandoc: man1/gropdf.1:1075:2: WARNING: skipping paragraph macro: PP empty
> mandoc: man1/gropdf.1:1075:2: WARNING: skipping paragraph macro: PP empty
> mandoc: man1/gropdf.1:1080:2: WARNING: skipping paragraph macro: PP empty
> mandoc: man1/gropdf.1:1091:2: WARNING: skipping paragraph macro: PP empty
> mandoc: man1/gropdf.1:1103:2: WARNING: skipping paragraph macro: PP empty
> mandoc: man1/gropdf.1:1134:2: WARNI

Re: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian

2023-04-26 Thread Oliver Corff

Hi Branden,

I'll take the route you suggest, i.e. install a 1.23.0 version where
I'll place the macros; but I'll have to postpone this until Saturday ---
so no earlier feedback possible.

Best regards,

Oliver.


On 26/04/2023 10:18, G. Branden Robinson wrote:

Hi Oliver,

At 2023-04-26T09:19:41+0200, Oliver Corff wrote:

thank you very much for sharing your insight regarding groff
internals.

I wish they were deeper!  There is still plenty I have to learn.


I tried your demonstration, replacing the text file with my own file
(utf8-encoded Cyrillic), and I did not succeed in reproducing your
results.

I copied all Russian-related macros (ru.tmac, hyphen.ru and
koi8-ru.tmac) into my ../current/tmac directory (production system is
still 1.22.4), and running groff results in unusable output.

No, I wouldn't expect this to work.


The headline "Abstract" gets translated into Russian, but is displayed
in non-utf8 format. All utf8-text is ok. If I omit the -k option then
utf8-encoded text is unusable as well, but this is no surprise.

As noted in my previous mail, if you want hyphenation to work with
Russian, neither UTF-8 input (processed by preconv(1)) nor Unicode code
points from the Cyrillic code block in their groff special character
escape form, like \[u0400], can be used.


Do I miss something from post-1.23.0 that enables the described magic?

Yes.  I refactored localization handling extensively to enable the
current approach.  As noted earlier in my compliment on your demo
document, I wanted to make it easy to change localizations an arbitrary
number of times within a document.

I worked on this stuff a while back.  In about January 2021 I made an
attempt, some of which I had to revert, and re-landed the work in its
current form around July of that year.  More work specifically on
hyphenation followed in early 2022.

Some relevant commit IDs, not including the much more recent Spanish and
Russian localization work (which slotted right in as I had hoped) are:

a86d9251ed05cec18f6279a9e613449ae7aa7315
a60784b82a5c53caff5443fc036b8d13f4084a32
7eb25c45b5ec67f1037abcc670793b734584987c
7c31d53f83888d88262075875b6ba5463dcfa5c5
2a36cf12b865be4c1df1c27139b1c58798cafb60
920fff1cf59d38bacd9b1b99b3d1ce3ce4e1e13f

I don't recall having to change anything in the formatter to enable
this work, so in principle you could replace an entire tmac directory
from a groff 1.22.4 installation with one from 1.23.0 (RC), but I can't
claim that as a supported configuration.  It's probably better just to
build and install groff 1.23.0.rc4, and _then_ add in the Russian
localization files.  If you're comfortable setting up chroots or virtual
machines, you might prefer to evaluate things that way.

Regards,
Branden


--
Dr. Oliver Corff
Wittelsbacherstr. 5A
10707 Berlin
GERMANY
Tel.: +49-30-85727260
mailto:oliver.co...@email.de




Re: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian

2023-04-26 Thread G. Branden Robinson
Hi Oliver,

At 2023-04-26T09:19:41+0200, Oliver Corff wrote:
> thank you very much for sharing your insight regarding groff
> internals.

I wish they were deeper!  There is still plenty I have to learn.

> I tried your demonstration, replacing the text file with my own file
> (utf8-encoded Cyrillic), and I did not succeed in reproducing your
> results.
> 
> I copied all Russian-related macros (ru.tmac, hyphen.ru and
> koi8-ru.tmac) into my ../current/tmac directory (production system is
> still 1.22.4), and running groff results in unusable output.

No, I wouldn't expect this to work.

> The headline "Abstract" gets translated into Russian, but is displayed
> in non-utf8 format. All utf8-text is ok. If I omit the -k option then
> utf8-encoded text is unusable as well, but this is no surprise.

As noted in my previous mail, if you want hyphenation to work with
Russian, neither UTF-8 input (processed by preconv(1)) nor Unicode code
points from the Cyrillic code block in their groff special character
escape form, like \[u0400], can be used.

> Do I miss something from post-1.23.0 that enables the described magic?

Yes.  I refactored localization handling extensively to enable the
current approach.  As noted earlier in my compliment on your demo
document, I wanted to make it easy to change localizations an arbitrary
number of times within a document.

I worked on this stuff a while back.  In about January 2021 I made an
attempt, some of which I had to revert, and re-landed the work in its
current form around July of that year.  More work specifically on
hyphenation followed in early 2022.

Some relevant commit IDs, not including the much more recent Spanish and
Russian localization work (which slotted right in as I had hoped) are:

a86d9251ed05cec18f6279a9e613449ae7aa7315
a60784b82a5c53caff5443fc036b8d13f4084a32
7eb25c45b5ec67f1037abcc670793b734584987c
7c31d53f83888d88262075875b6ba5463dcfa5c5
2a36cf12b865be4c1df1c27139b1c58798cafb60
920fff1cf59d38bacd9b1b99b3d1ce3ce4e1e13f

I don't recall having to change anything in the formatter to enable
this work, so in principle you could replace an entire tmac directory
from a groff 1.22.4 installation with one from 1.23.0 (RC), but I can't
claim that as a supported configuration.  It's probably better just to
build and install groff 1.23.0.rc4, and _then_ add in the Russian
localization files.  If you're comfortable setting up chroots or virtual
machines, you might prefer to evaluate things that way.

Regards,
Branden




Re: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian

2023-04-26 Thread Oliver Corff

Hi Branden,

thank you very much for sharing your insight regarding groff internals.

I tried your demonstration, replacing the text file with my own file 
(utf8-encoded Cyrillic), and I did not succeed in reproducing your results.


I copied all Russian-related macros (ru.tmac, hyphen.ru and 
koi8-ru.tmac) into my ../current/tmac directory (production system is 
still 1.22.4), and running groff results in unusable output.


The headline "Abstract" gets translated into Russian, but is displayed 
in non-utf8 format. All utf8-text is ok. If I omit the -k option then 
utf8-encoded text is unusable as well, but this is no surprise.


Do I miss something from post-1.23.0 that enables the described magic? 
Or is there a flaw in my own approaches and processes?


Best regards,

Oliver.


On 26/04/2023 06:42, G. Branden Robinson wrote:

Hi Oliver,

At 2023-04-25T20:02:00+0200, Oliver Corff wrote:

Yes, KOI8-R has the Cyrillic uppercase in 0xE0..0xFF, lowercase in
0xC0..0xDF; in the control code area, there are no letters in the
human sense of the word. I had a look at the current groff
documentation referenced by your footnote, and I imagine that
KOI8-R-encoded Cyrillic text will be processed seamlessly (that was
the basic assumption behind my recent and only temporary suggestion to
process Greek in ISO encoding), yet my input is \[u04xx]-style Unicode
Cyrillic.

Right.  I don't think we can support that at present.
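
The KOI8-R layout described above can be checked with a few lines of
Python, whose standard library ships a "koi8_r" codec.  This is only an
illustrative sketch, not anything groff does; the variable and function
names are made up.  Note that the letters ё/Ё are not part of the two
32-letter blocks and sit elsewhere in the encoding.

```python
# The 32 basic lowercase letters а..я (U+0430..U+044F) and the
# 32 basic uppercase letters А..Я (U+0410..U+042F).
lower = "".join(chr(cp) for cp in range(0x0430, 0x0450))
upper = "".join(chr(cp) for cp in range(0x0410, 0x0430))

lower_bytes = lower.encode("koi8_r")
upper_bytes = upper.encode("koi8_r")


def byte_range(data: bytes) -> tuple:
    """Return the smallest and largest byte value in data."""
    return min(data), max(data)


if __name__ == "__main__":
    print("lowercase: 0x%02X..0x%02X" % byte_range(lower_bytes))
    print("uppercase: 0x%02X..0x%02X" % byte_range(upper_bytes))
    # KOI8-R is not in alphabetical order: the first slot holds 'ю'.
    print("0xC0 is", bytes([0xC0]).decode("koi8_r"))
```
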


Somehow Cyrillic input in utf8, made readable by preconv(1), should
match the letter code positions in KOI8-R, otherwise pattern matching
for hyphenation would fail.

For Unicode-encoded Cyrillic input, I think you're going to need to
convert the input to KOI8-R first with iconv.
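
For anyone without iconv at hand, the same UTF-8 to KOI8-R conversion can
be sketched in Python using its standard codecs.  The function name and
the file names in the comment are hypothetical, and this is an
illustration of the conversion step only, not a tested groff workflow.

```python
def utf8_to_koi8r(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as KOI8-R.

    Raises UnicodeEncodeError if the text contains a character that
    KOI8-R cannot represent.
    """
    return data.decode("utf-8").encode("koi8_r")


if __name__ == "__main__":
    # In practice one would read and write files in binary mode, e.g.
    #   open("russ-koi8r.ms", "wb").write(
    #       utf8_to_koi8r(open("russ.ms", "rb").read()))
    sample = "Все люди".encode("utf-8")
    converted = utf8_to_koi8r(sample)
    print(len(sample), "bytes in UTF-8,", len(converted), "bytes in KOI8-R")
```

Each Cyrillic letter takes two bytes in UTF-8 but a single byte in KOI8-R,
which is what lets it fit the 8-bit representation discussed in this
thread.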


How is Unicode Cyrillic text in groff internally represented? When
dumping gtroff output to the console, I see u04xx codepoints. In my
naive understanding I assume it would be the same internally.

At 2023-04-25T16:25:49+0200, Oliver Corff wrote:

Since groff internally seems to work with Unicode code positions, the
question is: in which format should the hyphenation patterns be
presented to groff? As-is, that is as utf8 text, or in \[u04xx] form?
That does not seem to work either, according to my last experiment.

I didn't squarely address this question of yours earlier, which might
have helped.  Sorry about that.

There are a couple of answers to that depending on what stage of
processing we're talking about, but the earlier one is of more interest.

groff internally represents characters as bytes.  8-bit bytes.  That's
all we have.

We support Unicode code points the same way we represent everything else
that isn't ASCII--with "special characters".  \(hy, \[coproduct],
\[u0400] and so on.
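
As a rough picture of the \[uXXXX] form mentioned above, the following
sketch rewrites every non-ASCII character of a string as a groff Unicode
special character escape.  It only approximates what one sees after
preconv(1) has run; it is not preconv's implementation, and the function
name is made up.

```python
def to_groff_escapes(text: str) -> str:
    """Replace each non-ASCII character with a \\[uXXXX] escape."""
    out = []
    for ch in text:
        cp = ord(ch)
        # ASCII passes through unchanged; everything else becomes an
        # escape with at least four uppercase hexadecimal digits.
        out.append(ch if cp < 0x80 else "\\[u%04X]" % cp)
    return "".join(out)


if __name__ == "__main__":
    print(to_groff_escapes("Все люди"))
```
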


I tried the KOI8-R-encoded hyphenation file in my little russ.ms
document, but no hyphenation was introduced. I set the .hy register
etc., but nothing happened: no hyphenation. That's also why I put
these monster words with 30-odd characters into the file and forced
everything to be in two-column mode, in order to make the
line-breaking as challenging as possible.

Hmm.  Did you load the Russian localization file, as suggested by the
documentation?

Here's an exhibit I've prepared.

$ file ATTIC/udhr-ru-koi8r.ms
ATTIC/udhr-ru-koi8r.ms: troff or preprocessor input, ISO-8859 text
$ iconv -f koi8-r -t utf8 ATTIC/udhr-ru-koi8r.ms
.nr LL 28n
.LP
Все люди рождаются свободными и равными в своем достоинстве и правах.
Они наделены разумом и совестью и должны поступать в отношении друг
друга в духе братства.
.LP
Каждый человек должен обладать всеми правами и всеми свободами,
провозглашенными настоящей Декларацией, без какого бы то ни было
различия, как-то в отношении расы, цвета кожи, пола, языка, религии,
политических или иных убеждений, национального или социального
происхождения, имущественного, сословного или иного положения.
.LP
Кроме того, не должно проводиться никакого различия на основе
политического, правового или международного статуса страны или
территории, к которой человек принадлежит, независимо от того, является
ли эта территория независимой, подопечной, несамоуправляющейся или
как-либо иначе ограниченной в своем суверенитете.
.LP
Каждый человек имеет право на жизнь, на свободу и на личную
неприкосновенность.
.LP
Никто не должен содержаться в рабстве или в подневольном состоянии;
рабство и работорговля запрещаются во всех их видах.
.LP
Никто не должен подвергаться пыткам или жестоким, бесчеловечным или
унижающим его достоинство обращению и наказанию.
.LP
Каждый человек, где бы он ни находился, имеет право на признание его
$ ./build/test-groff -ms -mru -Tutf8 ATTIC/udhr-ru-koi8r.ms




Все люди рождаются свободны‐
ми  и равными в своем досто‐
инстве и правах. Они наделе‐
ны  разумом  и  совестью   и
должны поступать в отношении
друг друга в духе братства.

Каждый  человек должен обла‐
дать всеми правами  и  всеми
свободами,  провозглашенными
настоящей  Декларацией,  без
какого  бы то ни было разли‐
чия,  как‐то   в   отношении