Re: [tex-hyphen] Hyphenation in Albanian

2021-02-14 Thread Claudio Beccari

Dear Arthur, dear Mojca
Attached you find a zip file named AlbanianHyphenation.zip.

This is the result of my efforts with the substantial help of Sabina 
Koliqi, MA, originally an Albanian graduate in Albanian Literature and 
later an Italian professor with a degree in Education.
I do not know the Albanian language, but it is Dr Koliqi's mother 
tongue and the subject of her university studies; I know how to build 
hyphenation patterns. We combined our skills, and the above .zip file 
contains our results; in particular, the hyph-sq.tex file contains the 
UTF-8 encoded patterns, with a preamble modeled on the other pattern 
files distributed with TeX Live.


We looked for a hyphenated Albanian word list, but we could not find 
one. Dr Koliqi extracted a word list from a couple of chapters of an 
Albanian book and tried to create a hyphenated Albanian word list. 
Then I took up the challenge, but I was unsuccessful with the patgen 
program that is distributed with the TeX system; its documentation is 
very scarce and refers to the Omega program. As a result we abandoned 
the patgen solution and moved to another approach that I find very 
effective, even if it requires a lot of "elbow grease".


The approach is based on LuaLaTeX and its ability to load a pattern 
file on the fly and to hyphenate a list of words given as simple text. 
This is provided by the package testhyphens.sty and its checkhyphens 
environment. As you can see from the zipped file, the source 
abanian-test-lualatex-2.tex also loads the multicol.sty package in 
order to typeset the result in four columns. Of course the setting can 
be changed to one column, and the result may then be used as a 
dictionary if patgen is to be used to generate a different pattern set 
without any elbow grease. My previous experience with other languages 
taught me that elbow grease spent by a sufficiently well educated 
person produces better results than patgen. Of course this statement is 
not valid for certain languages, English in the first place, because 
patterns are based on spelling and not on pronunciation. In both main 
varieties of English, British and US, there are errors that cannot be 
corrected because some homographs are pronounced differently depending 
on whether they are nouns or verbs: for example "the record" and "I 
record"; "the analyses" and "he analyses".


Therefore we started with a basic list of a dozen patterns (the 
single-letter patterns with implied 0 values on both sides were 
omitted, and only the Albanian digraphs were considered). After each 
LuaLaTeX run Dr Koliqi would mark the wrong hyphenation points on the 
printed list; I would modify the pattern list; and we would iterate 
until all words were correctly hyphenated. Not very professional, you 
might think, but very effective.
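
The correct-and-rerun loop described above can be mimicked outside TeX. 
What follows is a minimal Python sketch, not the testhyphens/LuaLaTeX 
setup actually used: it applies Liang-style patterns to each word and 
lists every word whose result differs from a hand-checked reference. 
The function names, the toy patterns and the toy word list are all 
assumptions made for illustration; none of them is taken from 
hyph-sq.tex, and the toy strings are not claimed to be Albanian words.

```python
import re

def parse_pattern(pattern):
    """Split a Liang pattern such as 's4h' into its letters and inter-letter values."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)
        else:
            pos += 1
    return letters, values

def hyphenate(word, patterns, left_min=1, right_min=1):
    """Insert '-' wherever the highest competing pattern value is odd."""
    padded = "." + word.lower() + "."      # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)
    for letters, values in map(parse_pattern, patterns):
        start = padded.find(letters)
        while start != -1:
            for k, value in enumerate(values):
                scores[start + k] = max(scores[start + k], value)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for i in range(1, len(word)):
        allowed = i >= left_min and len(word) - i >= right_min
        if allowed and scores[i + 1] % 2 == 1:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)

def wrong_words(reference, patterns):
    """Return (expected, obtained) pairs for words the patterns get wrong."""
    errors = []
    for expected in reference:
        obtained = hyphenate(expected.replace("-", ""), patterns)
        if obtained != expected:
            errors.append((expected, obtained))
    return errors

# Toy data: hypothetical patterns and made-up strings, for illustration only.
toy_patterns = ["a1", "e1", "i1", "o1", "u1", "s4h", "d4h"]
toy_reference = ["ba-sha", "ko-dhe", "tan-ko"]

print(wrong_words(toy_reference, toy_patterns))              # [('tan-ko', 'ta-nko')]
print(wrong_words(toy_reference, toy_patterns + ["a2n1k"]))  # []
```

The second call shows one iteration of the loop: a single added 
pattern removes the remaining error, and the run ends when the list of 
wrong words is empty.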


Albanian hyphenation is peculiar. Albanians say their alphabet is made 
up of more than 30 letters; while working with Dr Koliqi I found out 
that Albanian has no separate word for "letter" in the sense implied by 
any computer encoding, from ASCII to UTF-8. Therefore "sh", "dh", "zh", 
and similar digraphs are called by the same name as "a", "b", "c", and 
so on. Eventually we reached a mutual understanding, and we could 
proceed pretty rapidly.
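
The difference between an encoded character and an Albanian letter can 
be made concrete with a short Python sketch. It is only an illustration: 
the function name split_letters is hypothetical, and the digraph list is 
an assumption based on the standard 36-letter alphabet (this thread 
names only sh, dh and zh). The split is greedy, so it would be wrong 
for any word in which two adjacent single letters merely look like a 
digraph.

```python
# Assumed list: the nine digraphs of the standard 36-letter Albanian alphabet.
DIGRAPHS = ("dh", "gj", "ll", "nj", "rr", "sh", "th", "xh", "zh")

def split_letters(word):
    """Greedily split a word into alphabet letters, counting each digraph as one."""
    letters = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair.lower() in DIGRAPHS:
            letters.append(pair)
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters

print(split_letters("shqip"))                       # ['sh', 'q', 'i', 'p']
print(len("shqip"), len(split_letters("shqip")))    # 5 4
```

The word "shqip" has five encoded characters but only four letters in 
the Albanian sense, which is why a pattern set must never allow a break 
inside a digraph.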


We worked on an initial set of a little more than 2600 words; then we 
reduced the set to the one now contained in the LuaLaTeX source file. 
Unlike patgen, the pattern set we built does not merely minimize the 
probability of hyphenation errors: on our word list the number of 
wrongly hyphenated words is zero.


Note: the LuaLaTeX source file sets both the left and right hyphenmin 
values to 1; in practice the language description file should set both 
to 2. I always build pattern sets with the value 1, because I imagine 
that in some rare cases of narrow-column typesetting, correct 
justification may be achieved with this not very professional 
typographical setting.
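
The effect of the two hyphenmin values can be illustrated with a small 
Python sketch. It is an illustration under stated assumptions, not part 
of the distributed files: the function name apply_hyphenmin is 
hypothetical, and the input is a word whose syllable breaks are already 
marked, here the Italian "idea" discussed elsewhere in this thread.

```python
def apply_hyphenmin(syllabified, left_min, right_min):
    """Keep only breaks with at least left_min characters before them
    and at least right_min characters after them."""
    syllables = syllabified.split("-")
    word = "".join(syllables)
    result, consumed = [], 0
    for syllable in syllables[:-1]:
        consumed += len(syllable)
        result.append(syllable)
        if consumed >= left_min and len(word) - consumed >= right_min:
            result.append("-")
    result.append(syllables[-1])
    return "".join(result)

print(apply_hyphenmin("i-de-a", 1, 1))   # i-de-a
print(apply_hyphenmin("i-de-a", 2, 2))   # idea
```

With both values at 1 every grammatical break survives, which is what 
one wants while testing patterns; with both at 2 the one-letter 
fragments are suppressed.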


But the word set we worked on is limited, and it is possible that when 
Albanian users apply this pattern set to their own documents, some more 
patterns or a list of hyphenation exceptions might become necessary. I 
might be available to modify the patterns for a short while, but at my 
age I am not going to live forever; therefore the Albanian TeX 
community should take over.


All the best

Claudio


Re: [tex-hyphen] Hyphenation in Albanian

2020-06-16 Thread Arthur Reutenauer
Dear Claudio,

On Mon, Jun 15, 2020 at 11:57:33PM +0200, Claudio Beccari wrote:
> I can certainly ask the student to allow distributing her thesis, but I
> believe it will not be of great utility, because, as I said, the thesis is
> in Italian, with very few stretches in Albanian, where the needed rare
> hyphen points were set by hand.

  I think the list of hyphenated words would be very useful, so if she’s
ready to publish that, it would be really great.

Best,

Arthur


Re: [tex-hyphen] Hyphenation in Albanian

2020-06-16 Thread Arthur Reutenauer
Joan,

  I just created https://github.com/hyphenation/albanian as an empty
repository and will add you as a contributor.  Is your GitHub user name
iGianni?

Arthur


Re: [tex-hyphen] Hyphenation in Albanian

2020-06-15 Thread Claudio Beccari

Dear Arthur,
I can certainly ask the student to allow distributing her thesis, but I 
believe it will not be of great utility, because, as I said, the thesis 
is in Italian, with very few stretches in Albanian, where the needed 
rare hyphen points were set by hand.


All the best
Claudio


Re: [tex-hyphen] Hyphenation in Albanian

2020-06-15 Thread Joan Jani




I would like to thank Claudio, Mojca and Arthur for their replies.

I apologize, but I had not been subscribed to the mailing list, so I 
did not receive Claudio's email yesterday.


Now everything is OK, and I will certainly continue to work on this 
issue in the coming days.


My to-do list for the coming days is:

    1. Find a detailed grammatical theory of hyphenation in Albanian. 
Since I am not a linguist, I have to ask for help from a friend of 
mine, who is an expert in this field.


    2. Translate the rules into English and put the document in the 
public domain using GitHub.


    3. Read the documentation which Claudio, Mojca and Arthur recommend.

    4. Create some patterns and test whether they work correctly.

I hope that this will be the beginning of adding something that later

My personal email is igi...@hotmail.com.

I would like to thank you all again for your warm welcome.

Kind regards.

Joan Jani
On 15/6/20 11:05 a.m., Mojca Miklavec wrote:

Hi,

Off-list.

Claudio Beccari already wrote a good answer.

We don't really have a team actively working on creating new patterns
for new languages, but there are a bunch of experts (Claudio being
among them). We are mostly collecting existing patterns and making
sure that they stay in consistent shape. So by far the best way to get
the patterns working would be to try to create them yourself, or find
someone to help you. This may include people on the list, but you need
to provide some faithful sources, grammar rules, dictionaries etc.

There are two orthogonal ways to achieve the goal:
- assemble a list of hyphenated words from a dictionary and run patgen
(or one of its rewrites, we can help you with that)
- come up with a set of clear rules for hyphenation (like: always
hyphenate after letter 'a', never hyphenate between these letter
pairs, ...) and write hyphenation patterns manually
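
The second option can be sketched in a few lines of Python. This is 
only an illustration: the function name patterns_from_rules and the 
choice of the values 1 and 4 are assumptions, relying on the usual 
convention that an odd value allows a break and a higher even value 
forbids it.

```python
def patterns_from_rules(break_after=(), never_between=()):
    """Turn two kinds of plain-language rules into Liang pattern strings.

    break_after:   letters after which a break is allowed -> 'a1'
    never_between: letter pairs that must stay together   -> 's4h'
    """
    patterns = [f"{letter}1" for letter in break_after]
    patterns += [f"{first}4{second}" for first, second in never_between]
    return patterns

print(patterns_from_rules(break_after="a", never_between=["sh", "dh"]))
# ['a1', 's4h', 'd4h']
```

Real rule sets need context-dependent patterns as well, so a generator 
like this only gives a starting list to refine by hand.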

I would suggest that you read
 https://tug.org/docs/liang/liang-thesis.pdf

Mojca

PS: Please don't expect an answer to a private mail, I've been
struggling recently to find time to answer emails. But you can
continue the discussion on the list, you just need to provide more
information, try to understand how hyphenation patterns work (read the
above or maybe find some BachoTeX talks from Arthur Reutenauer).

On Sat, 13 Jun 2020 at 19:22, Joan Jani  wrote:

Hello to all,

I have been a LaTeX user for more than 15 years (I wrote my first lab 
report in LaTeX back in 2004).


Since there are no hyphenation patterns for Albanian, I always get the 
message below:


-- No hyphenation patterns were preloaded for (babel) the language 
'Albanian' into the format.


I want to participate in your group and help create these 
hyphenation patterns.


Kind regards,

Joan Jani


Re: [tex-hyphen] Hyphenation in Albanian

2020-06-15 Thread Arthur Reutenauer
Hi Claudio,

On Sun, Jun 14, 2020 at 12:05:19AM +0200, Claudio Beccari wrote:
> Recently I assisted an Albanian student getting her degree in Italy, who
> wrote her thesis in Italian, but with many stretches of text in Albanian;
> these parts were hyphenated by hand, because she could not use LaTeX, but
> the final printing was done from a LaTeX-generated PDF file; the supervisors
> were very happy to see a well typeset thesis, which in the humanities
> apparently is pretty uncommon.

  If that thesis is available somewhere, it would be very useful to be
able to look at it :-)

Best,

Arthur


Re: [tex-hyphen] Hyphenation in Albanian

2020-06-14 Thread Javier Bezos

On 14/06/2020 at 0:05, Claudio Beccari wrote:
> Apparently the language handler polyglossia has a module for Albanian; by 
> contrast, the Babel documentation does not list your language among the 
> supported ones, but the file albanian.ldf does exist; as far as hyphenation 
> is concerned, this file explicitly falls back to the English patterns.


Actually, both polyglossia and babel attempt to use the Albanian
hyphenation patterns, but since they don't exist, both packages
fall back to English. Note that patterns aren't specific to either
package. Once the patterns have been created and the configuration
files updated, both packages should work.

Javier



Re: [tex-hyphen] Hyphenation in Albanian

2020-06-13 Thread Claudio Beccari
Apparently the language handler polyglossia has a module for Albanian; 
by contrast, the Babel documentation does not list your language among 
the supported ones, but the file albanian.ldf does exist; as far as 
hyphenation is concerned, this file explicitly falls back to the English 
patterns.


The support offered by polyglossia is specified in gloss-albanian.ldf. 
This file sets \hyphennames={albanian}, but I assume that Albanian 
patterns are actually missing for polyglossia as well.



Very good. From the very little I know of Albanian, I'd say you are in 
a very good position to create suitable patterns yourself. Your 
language, as far as I can tell, is pretty much phonetic, and the 
grammar rules probably reflect this. Start from the grammar rules of 
syllabification, which is different from hyphenation: hyphenation must 
obey grammar, but also typography, for example the minimum number of 
characters before the first and after the last hyphen point. An example 
in Italian: the word "idea" can be syllabified according to grammar as 
"i-de-a", but in typography it cannot be hyphenated at all, because in 
Italian typography you never leave a one-letter word fragment at the 
end or at the start of a line.
Remember that your language has many sounds, each one generally 
rendered with a single letter, possibly with diacritics, and very few 
digraphs, such as sh, dh, and similar ones. The patterns must include 
at least one pattern for every letter; moreover you have to take care 
of indivisible consonant clusters and indivisible vowel clusters. I 
suggest that you work with a grammar at hand; while you create or 
modify your patterns, keep a file of words containing those patterns. 
As you proceed you might need to check the correctness of your 
patterns. At the moment it would be confusing, but in due time write to 
me and I will show you, if still possible, the setup I used to test the 
patterns I wrote for several languages, mostly of Latin origin, without 
the complex setup that is needed by the team.


Remember to encode your pattern file in UTF-8. That makes it much easier 
for you to read and for the subsequent processing needed by the team.


Recently I assisted an Albanian student getting her degree in Italy, who 
wrote her thesis in Italian, but with many stretches of text in Albanian; 
these parts were hyphenated by hand, because she could not use LaTeX, 
but the final printing was done from a LaTeX-generated PDF file; the 
supervisors were very happy to see a well typeset thesis, which in the 
humanities apparently is pretty uncommon.


All the best
Claudio

On 13/06/2020 19:04, Joan Jani wrote:


Hello to all,

I have been a LaTeX user for more than 15 years (I wrote my first lab 
report in LaTeX back in 2004).


Since there are no hyphenation patterns for Albanian, I always get the 
message below:


-- No hyphenation patterns were preloaded for (babel) the language 
'Albanian' into the format.


I want to participate in your group and help create these hyphenation 
patterns.


Kind regards,

Joan Jani