Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-05-01 Thread Oliver Corff

Hi Branden,

On 30/04/2023 15:35, G. Branden Robinson wrote:

At 2023-04-29T21:38:53-0500, Dave Kemper wrote:

On 4/29/23, Oliver Corff  wrote:

Would it be a feasible option to use UTF-8 throughout the inner
workings of a future groff,

I'm going to phrase this more confrontationally than it needs to be just
to make a point about software design:

No need for apologies. We are discussing principles of work here.

It's none of your business what data type groff uses for characters in
its _inner workings_.

Of course I mean that purely from the software-architectural
perspective.  There is no reason for anyone except groff's developers to
care what primitive data type groff uses for this purpose as long as it
behaves correctly and is performant.  The whole point of encapsulation
is to keep other software modules from having to worry about this sort
of thing.

If it hisses like a utf8-duck, quacks like a utf8-duck and croaks like a
utf8-duck, it is a utf8-duck.

In another sense, it's totally your business and you can look at the
implementation at any time--it's Free Software.  But other software,
including parts of groff that are not GNU troff, the formatter, should
keep its dirty nose out, and expect to be excluded through
language-imposed visibility restrictions (or the impermeable wall of the
Unix process structure).

We absolutely want good UTF-8 support at the _edges_ of the system.  We
want to change GNU troff to cheerfully and correctly interpret UTF-8
input.  And we want output drivers that target devices using UTF-8 as a
character encoding to reliably produce it.

But that's all.

Consider my perspective to be a projection from a known surface to an
unknown core.



This is the topic of http://savannah.gnu.org/bugs/?40720


Only recently, I started to discover the treasure trove of information
to be unearthed from Savannah (it took me quite a while to grasp its
significance).

[...]


A rough sketch of my plan is this:

1.  Ensure that the groff string class is well-encapsulated.
2.  Change the internal type, and constructors and output functions
 only, to perform this transformation on this new type.
3.  Verify that nothing broke.  (If I did 1 and 2 correctly, nothing
 will.)
4.  Remap the code points we're squatting on.  Haven't decided yet
 whether to map them to illegal Unicode code points or to the Unicode
 Private Use Area.  With a char32_t we have all the room in the
 world.
5.  Drop code page 1047 support, per recent discussions with Mike Fulton
 of IBM on this list.
6.  Start not merely accepting, but _assuming_ UTF-8 input, because we
 won't misinterpret C1 controls anymore.

If that doesn't sound like enough work--at some point in the above, each
and every preprocessor has to be checked to ensure it isn't screwing up
the input before it gets to the formatter.

Whenever I can assist with test cases and data files, including for
preprocessors like tbl, please let me know.

I don't see getting rid of preconv(1) in the near term.  It will remain
useful, particularly if I add the couple of small features I had in mind
for it.  It may continue to play a role in getting input into the
correct Unicode Normalization Form (D).  It might make sense to leave
that business out of the formatter proper.

Regards,
Branden


Best regards,

Oliver.

--
Dr. Oliver Corff
mailto:oliver.co...@email.de




Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-30 Thread G. Branden Robinson
At 2023-04-29T21:38:53-0500, Dave Kemper wrote:
> On 4/29/23, Oliver Corff  wrote:
> > Would it be a feasible option to use UTF-8 throughout the inner
> > workings of a future groff,

I'm going to phrase this more confrontationally than it needs to be just
to make a point about software design:

It's none of your business what data type groff uses for characters in
its _inner workings_.

Of course I mean that purely from the software-architectural
perspective.  There is no reason for anyone except groff's developers to
care what primitive data type groff uses for this purpose as long as it
behaves correctly and is performant.  The whole point of encapsulation
is to keep other software modules from having to worry about this sort
of thing.

In another sense, it's totally your business and you can look at the
implementation at any time--it's Free Software.  But other software,
including parts of groff that are not GNU troff, the formatter, should
keep its dirty nose out, and expect to be excluded through
language-imposed visibility restrictions (or the impermeable wall of the
Unix process structure).

We absolutely want good UTF-8 support at the _edges_ of the system.  We
want to change GNU troff to cheerfully and correctly interpret UTF-8
input.  And we want output drivers that target devices using UTF-8 as a
character encoding to reliably produce it.

But that's all.

> This is the topic of http://savannah.gnu.org/bugs/?40720
[...]
> But in my opinion, the discussion is somewhat academic given the scope
> of the task and the number of current groff developers familiar with
> core parts of the code.

My idea for the initial scope is small.  I'm not convinced that
the groff string class is sealed as tightly as it should be.  So when I
take a second crack at changing its internal data type (my first was 2
years or so ago), I need to review it carefully.

From what I've seen, the main point of interface we're concerned with is
its `contents` member function, which does in fact return a pointer to a
narrow character.

Possibly that needs to be renamed `as_c_string`, and existing uses of
`contents` audited to verify that they really do need a C string there,
or if they wouldn't work just as well dealing with something else.
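
To make the shape of that concrete, here is a rough sketch (hypothetical
names, C++11 spelling for brevity, declarations only--not groff's actual
class):

    #include <cstddef>
    #include <string>

    class gstring {
    public:
        explicit gstring(const char *utf8);  // decodes UTF-8 coming in
        std::size_t length() const { return data_.size(); }
        char32_t at(std::size_t i) const { return data_.at(i); }
        // The audited successor to `contents`: callers that genuinely
        // need a narrow C string (e.g. the diagnostic functions) have
        // to ask for the conversion by name.
        std::string as_c_string() const;     // re-encodes going out
    private:
        std::basic_string<char32_t> data_;   // internal type: nobody's
                                             // business outside the class
    };

Nothing outside the class can observe the element type, so changing it
stays invisible to the rest of the formatter.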

Our diagnostic message functions (`fatal`, `error`, `warning`, `debug`
and friends) _do_ expect C strings.  I don't see that changing, since
their next stop is the standard error stream.

As part of this I also need to look over the ISO C++98 string class and
see how much sense it makes just to make groff's string class a
basic_string<char32_t>.[1]

A rough sketch of my plan is this:

1.  Ensure that the groff string class is well-encapsulated.
2.  Change the internal type, and constructors and output functions
only, to perform this transformation on this new type.
3.  Verify that nothing broke.  (If I did 1 and 2 correctly, nothing
will.)
4.  Remap the code points we're squatting on (see the sketch after this
list).  Haven't decided yet whether to map them to illegal Unicode code
points or to the Unicode Private Use Area.  With a char32_t we have all
the room in the world.
5.  Drop code page 1047 support, per recent discussions with Mike Fulton
of IBM on this list.
6.  Start not merely accepting, but _assuming_ UTF-8 input, because we
won't misinterpret C1 controls anymore.
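
As a sketch of item 4 (illustrative values only; which code points are
actually squatted on would come out of the audit), the Private Use Area
option could look like this:

    // Supplementary Private Use Area-A starts at U+F0000; no UTF-8
    // input can collide with code points parked up there.
    const char32_t INTERNAL_BASE = 0xF0000;

    // Move an internal special value (today stowed in C1 territory,
    // 0x80..0x9F) out of the way of real input...
    inline char32_t remap_internal(unsigned char c) {
        return INTERNAL_BASE + c;
    }

    // ...and recognize it again later.
    inline bool is_internal(char32_t cp) {
        return INTERNAL_BASE <= cp && cp < INTERNAL_BASE + 0x100;
    }

The other option, remapping to outright invalid code points above
0x10FFFF, works the same way with a different base.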

If that doesn't sound like enough work--at some point in the above, each
and every preprocessor has to be checked to ensure it isn't screwing up
the input before it gets to the formatter.

I don't see getting rid of preconv(1) in the near term.  It will remain
useful, particularly if I add the couple of small features I had in mind
for it.  It may continue to play a role in getting input into the
correct Unicode Normalization Form (D).  It might make sense to leave
that business out of the formatter proper.
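
For illustration, the Normalization Form D step preconv might keep
owning, sketched here with GNU libunistring's u8_normalize() (an
assumption about tooling, not a decision; link with -lunistring):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <uninorm.h>   /* u8_normalize(), UNINORM_NFD */

    int main(void)
    {
        /* Precomposed U+0439 (Cyrillic short i, NFC); NFD splits it
           into U+0438 plus combining breve U+0306. */
        const char *nfc = "\xd0\xb9";
        size_t n = 0;
        /* With a null result buffer, the result is malloc()ed. */
        uint8_t *nfd = u8_normalize(UNINORM_NFD, (const uint8_t *)nfc,
                                    strlen(nfc), NULL, &n);
        if (nfd == NULL)
            return EXIT_FAILURE;
        for (size_t i = 0; i < n; i++)
            printf("%02x ", nfd[i]);   /* d0 b8 cc 86 */
        putchar('\n');
        free(nfd);
        return EXIT_SUCCESS;
    }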

Regards,
Branden

[1] std::u32string is C++11, and thus not available according to the
portability horizon we have.  But we can make our own
basic_string<char32_t> with C++98 facilities and gnulib's 'inttypes'
module.  Hooray, templates!  ;-)




Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-29 Thread Dave Kemper
On 4/26/23, G. Branden Robinson  wrote:
> It would probably be a good idea to represent Unicode strings internally
> using char32_t as a base type anyway, but groff's design under the Unix
> filter model described above makes the choice less dramatic in terms of
> increased space consumption than it would otherwise be.

But to keep scalability in mind, this design shouldn't be assumed to
be immutable.  Implementing the Knuth-Plass (or some other)
paragraph-at-once algorithm would greatly expand the amount of input
groff has to remember at once, and a theoretical future
chapter-at-once algorithm (to, for example, optimize page layouts to
eliminate widows) vastly expands it beyond that.

It's possible memory is too cheap to worry about even the worst case,
where groff 4.38 has to hold an entire document in memory (maybe to
finally allow it to put the table of contents up front without page
reordering), but it's a question worth considering before making
changes to groff's fundamental data type.



Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-29 Thread Dave Kemper
On 4/29/23, Oliver Corff  wrote:
> Would it be a feasible option to use UTF-8 throughout the inner workings
> of a future groff,

This is the topic of http://savannah.gnu.org/bugs/?40720 (though most
of the interesting discussion has taken place in
http://savannah.gnu.org/bugs/?58796).  But in my opinion, the
discussion is somewhat academic given the scope of the task and the
number of current groff developers familiar with core parts of the
code.



Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-29 Thread Oliver Corff

Hi Branden,

On 27/04/2023 05:07, G. Branden Robinson wrote:

At 2023-04-26T19:33:48+0200, Oliver Corff wrote:

I am not familiar with modern incarnations of C/C++. Is there really
no char data type that is Unicode-compliant?



There is.  But "Unicode" is a _family_ of standards.  There are multiple
ways to encode Unicode characters, and those ways are good for different
things.

I was intentionally vague.

Along came Unix creator, Ken Thompson, over 20 years after his first
major contribution to software engineering.  Thompson was a man whose
brain took to Huffman coding like a duck to water.  Thus was born UTF-8
(which isn't a Huffman code precisely, but has properties reminiscent of
one), where your ASCII code points would be expressible in one byte, and
then the higher the code point you needed to encode, the longer the
multi-byte sequence you required.  Since the Unicode Consortium had
allocated commonly used symbols and alphabetic scripts toward the lower
code points in the first place, this meant that even where you needed
more than one byte to encode a code point, with UTF-8 you might not need
more than two.  And as a matter of fact, under UTF-8, every character in
every code block up to and including NKo is expressible using up to two
bytes.[2]


I like the Huffman code analogy! The situation is not as clear-cut for
CJK texts; there are massive peaks of frequency at a few dozen or a few
hundred characters (both in Chinese and Japanese), but due to the
arrangement of characters these biases are not visible from the
character tables --- the distribution is more even, not leaning so much
to the left. In general (i.e., in the BMP, the Basic Multilingual
Plane), CJK characters need three octets, which is a 50% penalty over
traditional CJK encodings (where the user was limited to using *either*
simplified Chinese, *or* traditional Chinese, *or* Japanese, but not a
mixture of everything). On today's systems, this does not really slow
down work, and if a text file for a whole book increases from 700 kB to
a little over 1 MB, it doesn't really change anything from a user's
perspective.

"Unicode-compliant" is not a precise enough term to mean much of
anything.  Unicode is a family of standards: the core, of which most
people have heard, and a pile of "standard annexes" which can be really
important in various domains of practical engineering around this
encoding system.[3]

Regards,
Branden


For the matter of this exchange: I never really left the letter P when
learning programming languages. I started with Pascal when Borland
Pascal on DOS machines was all the rage, and from there jumped directly
to Perl the moment I familiarized myself with X11 workstations at our
university, owing to its wonderfully elliptical style (I am a linguist
by training, and many of the Perl language constructs just came alive in
my brain the very instant I used them for the first time). Later I
started learning Prolog, but never made it to Py (and anything that
follows).

For all (read: my) practical purposes, Perl reads and stores all
characters in UTF-8 (to be honest: I am not at all aware of the *exact*
internal data storage model of Perl), and I can process strings
containing a wild mix of characters (CJK, Cyrillic, and other character
sets) without ever running into problems. I never have to process
individual bytes within characters or track encoding state as long as
the file handles are declared as :utf8. Even data files of dozens of MB,
containing tens or hundreds of thousands of text lines, are processed
without any noticeable penalty or trade-off in time.

Perl is written in C (as far as I know), so it probably uses the C libraries.

Would it be a feasible option to use UTF-8 throughout the inner workings
of a future groff, and translate UTF-8 to UTF-16 if and only if there is
the absolute need to do so? You mentioned the PDF bookmarks as a
critical case.

Best regards,

Oliver.

--
Dr. Oliver Corff
Mail: oliver.co...@email.de




Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-27 Thread Alejandro Colomar
Hi Branden,

On 4/27/23 05:07, G. Branden Robinson wrote:
> [0] If you're like me, the idea of a "20.1-bit" quantity sounds weird.
> You can't encode a tenth of a bit in a single logic gate, or one
> position in a machine register.  The key is to think in terms of
> information theory, not digital logic.  Unicode has decided that its
> range of valid code points is zero to 0x10FFFF.  That's 1114111
> decimal.  That number (plus one for code point 0) is the number of
> distinct characters encodable in Unicode.  The base 2 logarithm of
> that is...
> 
> $ python3 -c "import math; print(math.log(1114112, 2))"
> 20.087462841250343

You don't need python3 for that:

$ echo 'l(1114112) / l(2)' | bc -l
20.08746284125033940845

You might notice there's a difference in the decimals.  bc(1) is the
more accurate, according to Wolfram Alpha.  (My physical calculator
doesn't have enough precision for comparison.)  All digits provided by
bc(1) are correct, while python3 is printing more than it's capable
of.  I guess python3 is using a 'double', which usually has around 15
decimal digits of precision.  bc(1), on the contrary, does
arbitrary-precision decimal arithmetic (with -l, the default scale is
20 digits), which is how it can print so many correct digits.

Of course, bc(1) is way smaller:

$ ls $(which python3.11) -lh
-rwxr-xr-x 1 root root 6.6M Mar 13 13:18 /usr/bin/python3.11
$ ls $(which bc) -lh
-rwxr-xr-x 1 root root 95K Sep  5  2021 /usr/bin/bc

And of course it's faster:

$ time echo 'l(1114112) / l(2)' | bc -l
20.08746284125033940845

real    0m0.003s
user    0m0.005s
sys     0m0.000s
$ time python3 -c "import math; print(math.log(1114112, 2))"
20.087462841250343

real    0m0.015s
user    0m0.011s
sys     0m0.004s


Cheers,
Alex
-- 

GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5




Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-26 Thread G. Branden Robinson
At 2023-04-26T19:33:48+0200, Oliver Corff wrote:
> On 26/04/2023 15:16, G. Branden Robinson wrote:
> > Be sure you review my earlier messages to Oliver in detail.  The
> > hyphenation code isn't "broken", it's simply limited to the C/C++
> > char type for character code points and hyphenation codes (which are
> > not "the same thing as" character code points, but do correspond to
> > them).
> 
> I am not familiar with modern incarnations of C/C++. Is there really
> no char data type that is Unicode-compliant?

There is.  But "Unicode" is a _family_ of standards.  There are multiple
ways to encode Unicode characters, and those ways are good for different
things.

Unicode is a 20.1-bit character encoding.[0]  For practical purposes,
this rounds to 32 bits.  So if you use a 32-bit arithmetic type to store
a Unicode character, you'll be fine.  The type `char32_t` has been
around since ISO C11 and ISO C++11 and is arguably the best fit for this
purpose, since `int` is not _guaranteed_ to be 32 bits wide.[1]
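
A minimal C++11 illustration (the letter chosen is arbitrary):

    #include <stdio.h>

    int main(void)
    {
        // char32_t holds any Unicode scalar value, 0..0x10FFFF, with
        // bits to spare; 'int' might be as narrow as 16 bits.
        char32_t ve = U'\u0432';           // CYRILLIC SMALL LETTER VE
        printf("U+%04X\n", (unsigned)ve);  // prints U+0432
        return 0;
    }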

A long, long time ago, people noticed that in real-world texts, the code
points used by Unicode strings were not, and were not ever expected to
be, anywhere near uniformly distributed within the code space.  That
fact on top of the baked-in use of only 20.1 bits of a 32-bit type can
make use of the latter more wasteful than a lot of people can tolerate.

In fact, for much of the text encountered on the Internet--outside of
East Asia--the Unicode code points encountered in character strings are
extremely heavily weighted toward the left side of the distribution--
specifically, to the first 128 code points, also known as ISO 646 or
"ASCII".

Along came Unix creator, Ken Thompson, over 20 years after his first
major contribution to software engineering.  Thompson was a man whose
brain took to Huffman coding like a duck to water.  Thus was born UTF-8
(which isn't a Huffman code precisely, but has properties reminiscent of
one), where your ASCII code points would be expressible in one byte, and
then the higher the code point you needed to encode, the longer the
multi-byte sequence you required.  Since the Unicode Consortium had
allocated commonly used symbols and alphabetic scripts toward the lower
code points in the first place, this meant that even where you needed
more than one byte to encode a code point, with UTF-8 you might not need
more than two.  And as a matter of fact, under UTF-8, every character in
every code block up to and including NKo is expressible using up to two
bytes.[2]

So UTF-8 is pretty great at not being wasteful, but it does have
downsides.  It is more expensive to process than traditional byte-wide
strings.  It has state.  When you see a byte with its high bit set, you
know that it begins or continues a UTF-8 sequence.  Not all such
sequences are valid.  You have to decide what to do if a multibyte UTF-8
sequence is truncated.  If you _write_ UTF-8, you have to know how to do
so; the mapping from an ISO 10646-1 20.1-bit code point to a UTF-8
sequence is not trivial.
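
To show what "not trivial" means, here is that mapping transcribed
directly from the encoding rules (a sketch, not groff code):

    #include <string>

    // Append the UTF-8 encoding of cp to out; returns false for
    // non-scalar values (surrogates, anything above 0x10FFFF).
    bool utf8_encode(char32_t cp, std::string &out)
    {
        if (cp > 0x10FFFF || (0xD800 <= cp && cp <= 0xDFFF))
            return false;
        if (cp < 0x80) {                  // 1 byte: ASCII
            out += char(cp);
        } else if (cp < 0x800) {          // 2 bytes: through NKo (U+07FF)
            out += char(0xC0 | (cp >> 6));
            out += char(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {        // 3 bytes: rest of the BMP
            out += char(0xE0 | (cp >> 12));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        } else {                          // 4 bytes: planes 1-16
            out += char(0xF0 | (cp >> 18));
            out += char(0x80 | ((cp >> 12) & 0x3F));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        }
        return true;
    }

Decoding is the fussier direction; that is where truncated and overlong
sequences have to be caught.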

groff, like AT&T nroff (and roff(1) before it?), doesn't handle a large
quantity of character strings at one time, relative to the overall size
of its typical inputs.  It follows the Unix filter model and does not
absorb an entire input before processing it.  It accumulates input lines
until it is time to emit an output line (either to the output stream or
to a diversion), then flushes the output line and moves on.

It would probably be a good idea to represent Unicode strings internally
using char32_t as a base type anyway, but groff's design under the Unix
filter model described above makes the choice less dramatic in terms of
increased space consumption than it would otherwise be.  The formatter
is not even instrumented to measure the output node lists it builds.  If
this were done, we could tell exactly what the cost of moving from
`char` to `char32_t` is.  And if I'm the person who ends up doing this
work, maybe I will collect such measurements.

"Unicode-compliant" is not a precise enough term to mean much of
anything.  Unicode is a family of standards: the core, of which most
people have heard, and a pile of "standard annexes" which can be really
important in various domains of practical engineering around this
encoding system.[3]

There exist many resources to assist us with converting UTF-8 to and
from 32-bit code points.  For instance, there is GNU libunistring.  I've
studied the problems it identifies with "the char32_t approach"[4] and I
don't think they apply to groff.
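
A sketch of what using it might look like (assuming its u8_to_u32()
interface; link with -lunistring):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistr.h>   /* u8_to_u32() */

    int main(void)
    {
        const char *utf8 = "\xd0\xbf\xd1\x80";  /* "пр" in UTF-8 */
        size_t n = 0;
        /* With a null result buffer, the result is malloc()ed;
           invalid UTF-8 comes back as NULL. */
        uint32_t *u32 = u8_to_u32((const uint8_t *)utf8, strlen(utf8),
                                  NULL, &n);
        if (u32 == NULL)
            return EXIT_FAILURE;
        for (size_t i = 0; i < n; i++)
            printf("U+%04X\n", (unsigned)u32[i]);  /* U+043F, U+0440 */
        free(u32);
        return EXIT_SUCCESS;
    }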

Maybe gnulib, which we already use, has some facilities as well.  I'm
not sure--it's huge and I've only recently started familiarizing myself
with its manual.

I've been pointedly ignoring two other encodings of Unicode strings:
UTF-16LE and UTF-16BE.  They are terrible but we can't completely avoid
them; Microsoft and Adobe are deeply wedded to them and while we can
largely ignore Microsoft, Adobe managed to contaminate the international
standard f

Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-26 Thread Oliver Corff

Hi Robin and Branden,

On 26/04/2023 15:16, G. Branden Robinson wrote:

At 2023-04-26T15:16:55+0300, Robin Haberkorn wrote:

For future texts I therefore wanted to return to Groff (where we also
have the excellent MOM macros). Not being able to hyphenate UTF-8
Cyrillic text is a major limitation for me. I might get away with
converting it to KOI8 first, but could I still mix in Unicode
characters this way (as they are considered special characters by
Groff)?


I have similar needs to yours in processing UTF-8 Cyrillic text (mostly
not Russian, though).

Mixing two different encodings in one document is generally not a
feasible idea; typically, stray single-byte values are displayed as a
single generic placeholder. Open, for instance, any KOI8-R encoded
document in a UTF-8 terminal; you either get something that looks like
two-letter combinations or question marks all over the KOI8-R part(s) of
the document. While a machine could, in theory, deal with such a matter,
it is simply a nuisance for a human editor/author to have to work with
such input.


Be sure you review my earlier messages to Oliver in detail.  The
hyphenation code isn't "broken", it's simply limited to the C/C++ char
type for character code points and hyphenation codes (which are not "the
same thing as" character code points, but do correspond to them).


I am not familiar with modern incarnations of C/C++. Is there really no
char data type that is Unicode-compliant?

Best regards,

Oliver.


--

Dr. Oliver Corff
Wittelsbacherstr. 5A
10707 Berlin
GERMANY
Tel.: +49-30-85727260
mailto:oliver.co...@email.de




Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-26 Thread G. Branden Robinson
At 2023-04-26T15:16:55+0300, Robin Haberkorn wrote:
> For future texts I therefore wanted to return to Groff (where we also
> have the excellent MOM macros). Not being able to hyphenate UTF-8
> Cyrillic text is a major limitation for me. I might get away with
> converting it to KOI8 first, but could I still mix in Unicode
> characters this way (as they are considered special characters by
> Groff)?

Yes.  Special characters are written in ASCII, so there's no problem
there.  You could even mix KOI8-R Russian with Unicode Russian in the
form \[u0432]...just don't expect the latter to hyphenate correctly.

> Perhaps I will have a look at the hyphenation code and try to fix it.
> Hacking the typesetter is always a perfect distraction from the work
> you are supposed to do instead. ;-)

Be sure you review my earlier messages to Oliver in detail.  The
hyphenation code isn't "broken", it's simply limited to the C/C++ char
type for character code points and hyphenation codes (which are not "the
same thing as" character code points, but do correspond to them).
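
To illustrate the correspondence (a hypothetical sketch, not groff's
actual representation): a hyphenation code names a class of letters the
patterns treat alike, so case variants share one code, and it is the
width of this type that currently locks Cyrillic out:

    #include <map>

    typedef char32_t hyphenation_code;   // today effectively 'char'
    std::map<char32_t, hyphenation_code> hcode;

    void init_russian_hcodes(void)
    {
        // Fold А..Я (U+0410..U+042F) onto а..я (U+0430..U+044F) so
        // patterns match either case; Ё/ё (U+0401/U+0451) would need
        // an entry of their own.
        for (char32_t c = 0x0410; c <= 0x042F; c++)
            hcode[c] = c + 0x20;
        for (char32_t c = 0x0430; c <= 0x044F; c++)
            hcode[c] = c;
    }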

Regards,
Branden




Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-26 Thread Robin Haberkorn

Hello!

I can confirm that Neatroff (and Heirloom Troff) works well for typesetting 
Russian texts including hyphenation.
BUT, I found them unsuitable for complex scientific texts as their ms macros are 
buggy and tbl is somewhat limited. Regarding Neatroff, I found that its 
hyperlinking capabilities are extremely limited.


For future texts I therefore wanted to return to Groff (where we also have the 
excellent MOM macros). Not being able to hyphenate UTF-8 Cyrillic text is a 
major limitation for me. I might get away with converting it to KOI8 first, but 
could I still mix in Unicode characters this way (as they are considered special 
characters by Groff)?


Perhaps I will have a look at the hyphenation code and try to fix it. Hacking 
the typesetter is always a perfect distraction from the work you are supposed to 
do instead. ;-)


Yours sincerely,
Robin

On 26.04.23 14:10, Ralph Corderoy wrote:

Hi Oliver,

Are you aware there are other troff implementations than GNU's groff?
Neatroff is one.  Ali Gholami Rudi wrote it because he wanted better
Unicode support for foreign languages, including right-to-left text.
He seems very much of your mould in his needs.

A good summary of its features is http://litcave.rudi.ir/neatroff.pdf;
I see UTF-8 hyphenation files mentioned.
There's also whole-paragraph formatting and lots of other delights.
Rudi's http://litcave.rudi.ir has a Typesetting section past the initial
list of recent changes to his software.

Feel free to continue discussing neatroff here along with general troff
questions.





Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-26 Thread Oliver Corff

Hi Ralph,

I could not resist the temptation to procrastinate from my current work
and had a look at neatroff.

Really neat!

Out-of-the-box, my test file russ.ms and TeX utf8 hyphenation patterns
taken straight from my TeX installation produced the attached very
satisfying result.

Best regards,

Oliver.


On 26/04/2023 13:10, Ralph Corderoy wrote:

Hi Oliver,

Are you aware there are other troff implementations than GNU's groff?
Neatroff is one.  Ali Gholami Rudi wrote it because he wanted better
Unicode support for foreign languages, including right-to-left text.
He seems very much of your mould in his needs.

A good summary of its features is http://litcave.rudi.ir/neatroff.pdf;
I see UTF-8 hyphenation files mentioned.
There's also whole-paragraph formatting and lots of other delights.
Rudi's http://litcave.rudi.ir has a Typesetting section past the initial
list of recent changes to his software.

Feel free to continue discussing neatroff here along with general troff
questions.


--
Dr. Oliver Corff
Wittelsbacherstr. 5A
10707 Berlin
GERMANY
Tel.: +49-30-85727260
mailto:oliver.co...@email.de



russ.ps
Description: PostScript document
.hpf hyph-ru.tex
.TL
A Test of Russian
.AB
This little test is supposed to typeset Russian.
I searched for a few terribly long Russian words
and set everything in two-column mode so as to
challenge hyphenation.
.AE
.2C
.SH
Longest Russian Words
.LP
Превысокомногорассмотрительствующий Водогрязеторфопарафинолечение
Cельскохозяйственно-машиностроительный
Рентгеноэлектрокардиографического Частнопредпринимательского
Переосвидетельствующимися
Субстанционализирующимися
Превысокомногорассмотрительствующий Водогрязеторфопарафинолечение
Cельскохозяйственно-машиностроительный
Рентгеноэлектрокардиографического Частнопредпринимательского
Переосвидетельствующимися
Субстанционализирующимися
Превысокомногорассмотрительствующий Водогрязеторфопарафинолечение
Cельскохозяйственно-машиностроительный
Рентгеноэлектрокардиографического Частнопредпринимательского
Переосвидетельствующимися
Субстанционализирующимися
.SH
A Russian Test.
.LP
В начале 1980-х годов компания AT&T, которой принадлежала Bell Labs, осознала ценность Unix и начала создание коммерческой версии операционной системы. Эта версия, поступившая в продажу в 1982 году, носила название UNIX System III и была основана на седьмой версии системы.

Однако компания не могла напрямую начать развитие Unix как коммерческого продукта из-за запрета, наложенного правительством США в 1956 году. Министерство юстиции вынудило AT&T подписать соглашение, запрещавшее компании заниматься деятельностью, не связанной с телефонными и телеграфными сетями и оборудованием. Для того, чтобы всё-таки иметь возможность перевести Unix в ранг коммерческих продуктов, компания передала исходный код операционной системы некоторым высшим учебным заведениям, лицензировав код под очень либеральными условиями. В декабре 1973 года одним из первых исходные коды получил университет Беркли[11].

С 1978 года начинает свою историю BSD Unix, созданный в университете Беркли. Его первая версия была основана на шестой редакции. В 1979 выпущена новая версия, названная 3BSD, основанная на седьмой редакции. BSD поддерживал такие полезные свойства, как виртуальную память и замещение страниц по требованию. Автором BSD был Билл Джой.

Важной причиной раскола Unix стала реализация в 1980 году стека пр

Re: neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-26 Thread Oliver Corff

Hi Ralph,

thank you very much for mentioning neatroff. In principle, I am aware
that there are other implementations, all with their particular unique
features, but so far I have never dived into anything other than groff
(partly owing to the fruitful and friendly exchange on this mailing
list), and neatroff was known to me by name only.

I'll have a look at neatroff during the weekend.

I also noticed heirloom troff (and its font support) but so far haven't
managed to build it from source. Its system layout has some
peculiarities.

Best regards,

Oliver.


On 26/04/2023 13:10, Ralph Corderoy wrote:

Hi Oliver,

Are you aware there are other troff implementations than GNU's groff?
Neatroff is one.  Ali Gholami Rudi wrote it because he wanted better
Unicode support for foreign languages, including right-to-left text.
He seems very much of your mould in his needs.

A good summary of its features is http://litcave.rudi.ir/neatroff.pdf;
I see UTF-8 hyphenation files mentioned.
There's also whole-paragraph formatting and lots of other delights.
Rudi's http://litcave.rudi.ir has a Typesetting section past the initial
list of recent changes to his software.

Feel free to continue discussing neatroff here along with general troff
questions.


--
Dr. Oliver Corff
mailto:oliver.co...@email.de




neatroff for Russian. (Was: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian)

2023-04-26 Thread Ralph Corderoy
Hi Oliver,

Are you aware there are other troff implementations than GNU's groff?
Neatroff is one.  Ali Gholami Rudi wrote it because he wanted better
Unicode support for foreign languages, including right-to-left text.
He seems very much of your mould in his needs.

A good summary of its features is http://litcave.rudi.ir/neatroff.pdf;
I see UTF-8 hyphenation files mentioned.
There's also whole-paragraph formatting and lots of other delights.
Rudi's http://litcave.rudi.ir has a Typesetting section past the initial
list of recent changes to his software.

Feel free to continue discussing neatroff here along with general troff
questions.

-- 
Cheers, Ralph.