Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-25 Thread Dominik Wujastyk
Thanks. I've signed up.



On 25 November 2010 11:35, Arthur Reutenauer <
arthur.reutena...@normalesup.org> wrote:

> > Should we have a separate list for this sort of thing?
>
>   There is the tex-hyphen list (http://tug.org/mailman/listinfo/tex-hyphen);
> this kind of discussion is certainly welcome there.
>
>Arthur


--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-25 Thread Arthur Reutenauer
> Should we have a separate list for this sort of thing?

  There is the tex-hyphen list (http://tug.org/mailman/listinfo/tex-hyphen);
this kind of discussion is certainly welcome there.

Arthur




Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-25 Thread Dominik Wujastyk
Sanskrit
ka-rman -> kar-man

Should we have a separate list for this sort of thing?

Dominik




Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-23 Thread Mojca Miklavec
On Tue, Nov 23, 2010 at 01:00, Manuel B. wrote:
>
>>> If Indic scripts hyphenate in the same way in all the languages that
>>> use the script
>
>>I've seen no evidence to let me think that they do, but I'm happy
>>about any input.
>
> Hmm... I think this discussion could be brought to an end more quickly
> by falsification: we need an example of two Indian languages with
> different hyphenation rules in the same script.

We don't really need to elaborate any further unless somebody wants to
typeset in a language that is not supported yet.

The author of hyphenation patterns says:

On Mon, Nov 22, 2010 at 17:11, Santhosh Thottingal wrote:
>
> As far as I know, for Indian languages, it is true that languages
> using the same script have the same hyphenation patterns. So there should
> not be a difference between Sanskrit and Hindi (Devanagari script) or
> Assamese and Bengali (Bengali script).
>
> And across Indian scripts, the basic rules are almost the same, but not
> all of them. Tamil has major differences from Malayalam, for example.

I would rather not try to be too clever and do modifications on my
own. At the moment there are at most two languages with the same
patterns, even though there are probably more of them that are not yet
supported by Polyglossia.

I would say: once we get requests to support another dozen
languages written in the same script, we may start thinking about
using per-script patterns to reduce the number of preloaded languages.

Mojca





Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-22 Thread Manuel B.
>But first of all the question: what would be the biggest benefit? New 
>languages?

My idea was that the biggest benefit of a single hyphenation file for
several Indic scripts could be that it is easier to maintain. Only one
file has to be updated if a change in the patterns is necessary, not
many. But I'm ready to admit that this view of things might be naive.

I think Arthur has a good point in saying that it is probably not
worth the effort to merge the hyphenation files into one.

And I didn't know that there is a correspondence to the OOo
hyphenation files. In that case I absolutely agree that this
correspondence should be preserved, despite the duplication of
identical data.

>> If Indic scripts hyphenate in the same way in all the languages that
>> use the script

>I've seen no evidence to let me think that they do, but I'm happy
>about any input.

Hmm... I think this discussion could be brought to an end more quickly
by falsification: we need an example of two Indian languages with
different hyphenation rules in the same script.

Cheers,
Manuel

2010/11/22 BPJ:
> 2010-11-22 18:24, Dominik Wujastyk wrote:
>>
>> Those who write both transliterated Hindi and Sanskrit in the
>> same publication will be glad of the ISO standard, I suppose.
>
> You have the problem in transliterated Hindi on its own, since
> both graphemes occur there.  In fact they are in complementary
> distribution, and in a way which would be easy to automate,
> but being different graphemes they should be transliterated
> differently.  Retransliteration shouldn't require linguistic
> analysis.
>
>> Typical standards work: the result of a committee that has a
>> certain limited logic to it, but does not pay enough attention to
>> usage amongst professional groups, and consequently leaves
>> nobody actually happy.
>
> Agreed.  I'm definitely not a friend of standards for
> standards' sake, but that applies to century-old standards
> founded by people not considering modern languages too!
> Of course you _can_ use different transliterations for Sanskrit and Hindi,
> but IMHO transliteration should be by script and not
> by language. But let's be thankful nobody came up with d̤ for ड़
> since IPA uses d̤ for ध!
>
> /bpj





Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-22 Thread Santhosh Thottingal
On Mon, Nov 22, 2010 at 8:05 PM, Arthur Reutenauer wrote:
>> If Indic scripts hyphenate in the same way in all the languages that
>> use the script
>
>  I've seen no evidence to let me think that they do, but I'm happy
> about any input.  Santhosh, since you obviously used Yves' hyphenation
> patterns for Sanskrit as a basis for your files, can you tell us a bit
> more about that?  I'm curious in particular about the rule "do not break
> before a final consonant", which you stripped.


Hi all,
As far as I know, for Indian languages, it is true that languages
using the same script have the same hyphenation patterns. So there should
not be a difference between Sanskrit and Hindi (Devanagari script) or
Assamese and Bengali (Bengali script).

And across Indian scripts, the basic rules are almost the same, but not
all of them. Tamil has major differences from Malayalam, for example.

Arthur,
"do not break before a final consonant or cluster" is not valid as far
as I know. At least for my mother tongue, Malayalam, I am sure that
this rule does not exist. For other languages I relied on input
from my friends, and have not come across this rule so far. Even
so, the rule is often applied in effect when applications use the
"minimum characters after break" setting that many of them provide.

There is one thing to note when discussing a single
pattern file for all Indic scripts. The patterns are used by many
applications other than TeX, and it is reasonable for them to rely on
the system locale, the detected script or a user-supplied language code
to find out which hyphenation rules to use. So it is a
reasonable use case that a user searches for the hyphen-ml_IN package in a
distro if they want to use Malayalam hyphenation in OpenOffice. In most
popular GNU/Linux distros, there is a metapackage for language
support. For example, language-support-ml installs everything required for
Malayalam. For the maintainers of this package, it is easy to link
it to a particular language hyphenation package.  So I don't see much
benefit in merging all of them.

I think we can compare this with how Indic fonts are packaged and
maintained in Linux distros. Debian used to have a ttf-indic-fonts
package. Now we have that as a metapackage with dependencies on
ttf-malayalam-fonts, ttf-tamil-fonts, ttf-hindi-fonts etc., and it makes
the maintainers' and bug reporters' tasks easier.

PS: The git repo I maintain for Indic hyphenation
patterns (http://git.savannah.gnu.org/cgit/smc/hyphenation.git) is the
upstream repo for Fedora, OpenOffice etc.

Thanks
Santhosh Thottingal
http://thottingal.in





Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-22 Thread BPJ

2010-11-22 18:24, Dominik Wujastyk wrote:

> Those who write both transliterated Hindi and Sanskrit in the
> same publication will be glad of the ISO standard, I suppose.

You have the problem in transliterated Hindi on its own, since
both graphemes occur there.  In fact they are in complementary
distribution, and in a way which would be easy to automate,
but being different graphemes they should be transliterated
differently.  Retransliteration shouldn't require linguistic
analysis.

> Typical standards work: the result of a committee that has a
> certain limited logic to it, but does not pay enough attention to
> usage amongst professional groups, and consequently leaves
> nobody actually happy.

Agreed.  I'm definitely not a friend of standards for
standards' sake, but that applies to century-old standards
founded by people not considering modern languages too!
Of course you _can_ use different transliterations for Sanskrit
and Hindi, but IMHO transliteration should be by script and not
by language. But let's be thankful nobody came up with d̤ for ड़
since IPA uses d̤ for ध!

/bpj




Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-22 Thread Dominik Wujastyk
Sanskritists have been using ṛ (r-underdot) for over a century.
Promulgating a new standard that changes this usage to r-undercircle is far
from being an obvious choice, in my view.  But we're irrevocably lumbered
with it now.  :-(  Though I note that most Sanskritists pay no attention to
the ISO standard, and continue with IAST, which has been standard in
professional journals and book publications since the nineteenth century.

Of course Hindi flap and Sanskrit vocalic-r have to be distinguished, but
the long-established uniform usage of Sanskritists, present in literally
thousands of publications, should have been given greater weight.

Most Sanskritists view m-overdot (for anusvāra) as obsolete usage, weakly
referential to the Nāgarī orthography, and now strongly deprecated.  Again,
it isn't used in any professional publications, and hasn't been for a
hundred years or more.

Those who write both transliterated Hindi and Sanskrit in the same
publication will be glad of the ISO standard, I suppose.

Typical standards work: the result of a committee that has a certain limited
logic to it, but does not pay enough attention to usage amongst professional
groups, and consequently leaves nobody actually happy.

Dominik

On 22 November 2010 18:03, BPJ wrote:

> 2010-11-21 10:22, Manuel B. wrote:
>
>> 1) I saw that all diacritics used for IAST appear in the pattern,
>> while some of them (for example ṛ and ṝ) are marked as "non standard
>> transliteration". That is OK, insofar as IAST is not a standard in the
>> official sense. But IAST is most commonly used and the "standard"
>> transliteration of vocalic r in IAST is ṛ, not r̥.
>
> The problem is that for Hindi and other modern
> Indic languages ṛ is used for the retroflex flap
> -- ḍ with underdot in Nagari -- modeled on the
> Urdu letter for that sound.  In a strict
> transliteration you need a way to distinguish
> between the two, and between ri and r̥.  Since
> Indo-Europeanists have been using r̥ for over a
> century that's obviously the best choice.
>
> /bpj




Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-22 Thread BPJ

2010-11-21 10:22, Manuel B. wrote:

> 1) I saw that all diacritics used for IAST appear in the pattern,
> while some of them (for example ṛ and ṝ) are marked as "non standard
> transliteration". That is OK, insofar as IAST is not a standard in the
> official sense. But IAST is most commonly used and the "standard"
> transliteration of vocalic r in IAST is ṛ, not r̥.

The problem is that for Hindi and other modern
Indic languages ṛ is used for the retroflex flap
-- ḍ with underdot in Nagari -- modeled on the
Urdu letter for that sound.  In a strict
transliteration you need a way to distinguish
between the two, and between ri and r̥.  Since
Indo-Europeanists have been using r̥ for over a
century that's obviously the best choice.

/bpj




Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-22 Thread Yves Codet

On 22 Nov 2010, at 14:23, Arthur Reutenauer wrote:

>> Debatable, I'm not sure :) Gustibus et coloribus non est disputandum. 
>> Personally I don't mind breaks such as a-rhasi.
> 
>  Well, it's not only a matter of taste: in that case, it looked
> incorrect to Dominik, to the point that he thought something was wrong
> with his installation; which is somewhat problematic.

I'll correct that. Please remember those patterns for transliteration are only 
tentative and the last message Dominik sent shows there's still a lot of work 
to do.

>> I know many prefer ar-hasi, but there are some books where you would find 
>> a-rhasi. On page 189 of Gray's edition of Vāsavadattā (Delhi, 1962), for 
>> instance, I can see: ...nirmu-kta..., ...ku-ṭṭimam.
> 
>  As the author of the pattern file, it's obviously up to you to decide
> which to choose if both solutions are used in books.
> 
>> So, for a start, I did exactly what Arthur described, I chose the easy way. 
>> But I can add rules allowing a break after the first consonant of a 
>> consonant cluster. If there are rules such as:
>> a1
>> ...
>> r3h
>> you should get ar-hasi rather than a-rhasi without having to modify 
>> hyphenmins.
> 
>  The one thing one shouldn't do would be to allow both options at the
> same time.  *That* would be bad taste :-)  But if you're happy with
> switching, I'm all for it.

Would this be better taste? :)

.a2
a1
...
r1h
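
A minimal sketch of how these three rules would read in a pattern file,
assuming the usual digit semantics of TeX patterns (odd digits permit a
break, even digits forbid one, and the highest digit at a position wins).
The rules elided above are left out, and \patterns is only accepted when a
format is built in ini mode, so this is not something to paste into a
document:

```tex
% Sketch only: fragment of a pattern list, to be read by iniTeX.
\patterns{
.a2  % even digit: no break after a word-initial a
a1   % odd digit: a break is allowed after any other a
r1h  % odd digit: a break is allowed between r and h
}
% With only these three rules, arhasi gets the inter-letter values
% a2r1ha1si, so the candidate breaks are ar-ha-si (still subject to
% the hyphenmins in force).
```

The leading dot anchors a rule to the start of a word, which is why .a2
outranks a1 only for an initial a.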

Best wishes,

Yves










Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-22 Thread Arthur Reutenauer
> If Indic scripts hyphenate in the same way in all the languages that
> use the script

  I've seen no evidence to let me think that they do, but I'm happy
about any input.  Santhosh, since you obviously used Yves' hyphenation
patterns for Sanskrit as a basis for your files, can you tell us a bit
more about that?  I'm curious in particular about the rule "do not break
before a final consonant", which you stripped.

Arthur




Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-22 Thread Arthur Reutenauer
Hello,

> I'll also add the missing characters, ṁ, ẖ, ḫ and the sign for anudātta 
> (I think that's all, as far as Sanskrit is concerned).

  I'll wait for your update :-)

> Arthur and Mojca are better qualified than I to answer those questions. What 
> comes to mind is that such a "total" hyphenation file might rapidly become 
> difficult to maintain, all the more so as it would require several 
> maintainers.

  That's indeed another point, maybe even more important.  As Mojca
mentioned, all the patterns for Modern Indic scripts come from
OpenOffice, and are in fact written by a single person.  I believe it is
really better if we can keep them in sync with OpenOffice, and reflect
modifications that would be made there; and they need to have different
files anyway.

> Besides, some languages might require special rules, exceptions for instance, 
> which could be unwanted in another language using the same script.

  Absolutely.

> Arthur and Mojca, what do you think?

  It's a waste of time.

Arthur




Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-22 Thread Arthur Reutenauer
> 2) That might be a stupid question, but aren't hyphenation patterns
> for most abugida scripts more or less the same?

  Yes, more or less.  If you check the actual files you'll see that
there are some differences between languages that use the same script.
There's not much you can do with that, since TeX can only read one list
of patterns per language.  It's in particular not possible, from within
a TeX document, to create a modified hyphenation trie by deleting or
inserting from an existing trie.  You need a different language.  (And
you need to load the patterns in ini mode anyway.)
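
What can be changed from within a document is the exception list rather
than the patterns. A minimal sketch, assuming the Polyglossia setup used
elsewhere in this thread (the font is only an example) and that a single
word form needs a different break:

```latex
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setdefaultlanguage{sanskrit}
\newfontfamily\sanskritfont{Charis SIL}

\begin{document}
% \hyphenation adds exceptions for the language that is current here;
% it overrides the patterns for exactly this word form.
\hyphenation{ar-hasi}

arhasi
\end{document}
```

This only covers the word forms that are listed; the pattern trie itself
stays as it was loaded into the format.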

  You could also imagine having a master file for each Indic script
that would contain the patterns that are needed for all the languages
written using that script, and a separate file with additional patterns
for each individual language; but that seems hardly worth the effort,
for the reason below.

> Lots
> of hyphenation patterns have to be duplicated, if they are ordered by
> language. While one could have a hyphen-indic.tex instead.

  You will need a separate file for Sanskrit anyway, since it can be
written in many different scripts, and there is not yet a mechanism to
switch patterns when switching scripts (it's tied to a language).
Hence, you're left with the modern Indic languages.  Among those for
which we have patterns, there happen to be only two pairs that are
written in the same script: Hindi and Marathi (in Devanagari), and
Bengali and Assamese (in Bengali); both of which contain fewer than
100 patterns.  It does not seem worth the trouble (although the files in
each pair are actually exactly identical, so that we could have the same
file, thereby saving almost 4 kilobytes in TeX distributions; but I
wouldn't know how to name the two common files anyway...)

  In fact, since the pattern files we have for the different Indic
languages basically list all the Unicode characters relevant for their
script, plus a few consonant clusters, they all contain about 100
patterns and take up less than 2 kilobytes; apart of course from
Sanskrit, for which we have patterns in half a dozen Indic scripts, plus
transliteration in Latin (~800 patterns, < 10kb).  Compare that with the
three different files for German (reformed spelling, old spelling, old
spelling in Switzerland) that each have 14000+ patterns and weigh almost
100kb; Norwegian (27000 patterns, ~200kb); and finally Hungarian (>6
patterns, >500kb); and you'll see why I'm not eager to develop a
complicated scheme in order to share information between hyphenation
patterns that are "more or less" the same.

Arthur




Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-22 Thread Arthur Reutenauer
> Debatable, I'm not sure :) Gustibus et coloribus non est disputandum. 
> Personally I don't mind breaks such as a-rhasi.

  Well, it's not only a matter of taste: in that case, it looked
incorrect to Dominik, to the point that he thought something was wrong
with his installation; which is somewhat problematic.

> I know many prefer ar-hasi, but there are some books where you would find 
> a-rhasi. On page 189 of Gray's edition of Vāsavadattā (Delhi, 1962), for 
> instance, I can see: ...nirmu-kta..., ...ku-ṭṭimam.

  As the author of the pattern file, it's obviously up to you to decide
which to choose if both solutions are used in books.

> So, for a start, I did exactly what Arthur described, I chose the easy way. 
> But I can add rules allowing a break after the first consonant of a consonant 
> cluster. If there are rules such as:
> a1
> ...
> r3h
> you should get ar-hasi rather than a-rhasi without having to modify 
> hyphenmins.

  The one thing one shouldn't do would be to allow both options at the
same time.  *That* would be bad taste :-)  But if you're happy with
switching, I'm all for it.

Arthur




Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-22 Thread Dominik Wujastyk
On 21 November 2010 10:12, Yves Codet wrote:

> Debatable, I'm not sure :) Gustibus et coloribus non est disputandum.
> Personally I don't mind breaks such as a-rhasi. I know many prefer ar-hasi,
> but there are some books where you would find a-rhasi. On page 189 of Gray's
> edition of Vāsavadattā (Delhi, 1962), for instance, I can see:
> ...nirmu-kta..., ...ku-ṭṭimam.
>
> So, for a start, I did exactly what Arthur described, I chose the easy way.
> But I can add rules allowing a break after the first consonant of a
> consonant cluster. If there are rules such as:
> a1
> ...
> r3h
> you should get ar-hasi rather than a-rhasi without having to modify
> hyphenmins.
>
>
I cannot think of cases where a line-final single-letter hyphenation like
a-rhasi would look good.  Even examples with alpha-privative, like a-bheda,
- which are at least etymologically justified - don't look good.

The trouble here is that of good precedent.  We need some roman-script
Sanskrit with lots of hyphens that has been typeset by knowledgeable
typesetters and looks beautiful.  I don't think that exists, or at least,
it's not known to me.  The biggest romanised corpus I can think of
immediately is the Pali Text Society volumes, but of course that's Pali not
Sanskrit.  And I don't know how good the hyphenation is.

I would expect the Clay Sanskrit Library to have good hyphenation; again
it's hard to tell, and I don't have all vols. to hand.  But
Dezső's *Much Ado About Religion* has a pṛ-cchāmaḥ (p.110), which is
pretty ugly, I think, though not impossibly so.  The cardinal sin of
hyphenating a digraph aspirated consonant (budd-ha) is avoided, as far as
I can see.  I don't have the prose *Daśakumāracarita* which, being prose,
should offer more hyphenation cases than verse works.

I think we're breaking new ground here, and I think it may take a while for
a nice set of hyphenation patterns to settle down.  The guidelines surely
must include consideration of:

   1. etymology - word breaks within compounds (sārva-bhaumas)
   2. etymology - prefix, suffix, infix breaks within words (bhav-a-ti
   bud-dha adhi-kṛtam)
   3. euphony - lines shouldn't begin with non-existent initials like rh or
   mh- (a-rhasi).  (Okay, since Pingree's CESS A4, we know there's an author
   Mhālugi, but how many other words begin with mh-?)
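
Until patterns capture such preferences, individual breaks can be marked
by hand. A minimal sketch using TeX's discretionary hyphen \-, with two of
the words from the list above; the preamble is assumed to be the
Polyglossia setup used elsewhere in this thread, and the font is only an
example:

```latex
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setdefaultlanguage{sanskrit}
\newfontfamily\sanskritfont{Charis SIL}

\begin{document}
% \- marks a permitted break; TeX then does not apply the
% patterns to find other break points in that word.
sārva\-bhaumas adhi\-kṛtam
\end{document}
```

This scales poorly to long texts, but it is a way to try out how an
etymological break looks on the page.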

Dominik




Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-22 Thread Dominik Wujastyk
It works.  Thanks!  I tried \sanskritfont yesterday myself, and it didn't
work, but my file was pretty cluttered by that time and who knows what else
was in the way.

Dominik



On 21 November 2010 13:42, Yves Codet wrote:

>
> On 21 Nov 2010, at 10:12, Yves Codet wrote:
>
> > Dominik, I think you can write \sanskritfont, can’t you?
>
> I just tried this:
>
> 
> \documentclass{article}
> \usepackage{fontspec}
> \usepackage{polyglossia}
> \setdefaultlanguage{sanskrit}
> \newfontfamily\sanskritfont{Charis SIL}
> \textwidth=0.5cm
>
> \begin{document}
>
> \noindent
> manum ekāgram āsīnam abhigamya maharṣayaḥ |
>
> \end{document}
> 
>
> It worked for me, with Polyglossia v1.2.0a.
>
> Best wishes,
>
> Yves




Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-22 Thread Mojca Miklavec
On Sun, Nov 21, 2010 at 22:34, Yves Codet wrote:
>
> On 21 Nov 2010, at 10:22, Manuel B. wrote:
>
>> But I don't know how far one can go here. While IAST is meant
>> exclusively for Sanskrit transliteration (I know that it's used for
>> Pali also, but in a slightly different way), ISO 15919 contains far
>> more diacritics than are needed for the transliteration of Sanskrit.
>> It's rather meant as a transliteration of many or most Indian
>> languages. Should it be duplicated then in every hyphenation pattern
>> of every language in question?
>>
>> 2) That might be a stupid question, but aren't hyphenation patterns
>> for most abugida scripts more or less the same? That means the
>> hyphenation is script dependent rather than language dependent. Lots
>> of hyphenation patterns have to be duplicated, if they are ordered by
>> language. While one could have a hyphen-indic.tex instead.
>
> Arthur and Mojca are better qualified than I to answer those questions. What 
> comes to mind is that such a "total" hyphenation file might rapidly become 
> difficult to maintain, all the more so as it would require several 
> maintainers. Besides, some languages might require special rules, exceptions 
> for instance, which could be unwanted in another language using the same 
> script.
>
> Arthur and Mojca, what do you think?

Hello,

Exactly at this point we are discussing whether we should use
one-pattern-per-language or one-pattern-per-script for Ethiopic script
that has been requested recently on the XeTeX mailing list, but for
Ethiopic scripts we have made the first version of patterns by
ourselves, so at least I know exactly what is there (which is not the
case for Indic scripts).

In the case of Indic scripts, all I did was fetch the pattern files from
OpenOffice and repackage them for use in TeX.

There might be a reason for language-dependent ordering in OpenOffice
since it applies patterns based on language. Having a single file for
patterns in OOo would mean duplicating that same file ten times, I
guess. In TeX one can reuse the same file for multiple languages more
easily.

From my perspective we are the coordinators & collectors of
hyphenation patterns. We are not specialists for every language that
is being maintained in our repository which means that we still need
someone to create the patterns for the language he/she masters.

If Indic scripts hyphenate in the same way in all the languages that
use the script, then in principle I have nothing against having a
single file that would cover them all, but only if that really brings
some benefit and in that case probably somebody else should do it.
Does anyone require a language that is not present in the repository, but
would be covered by "generic Indic script" hyphenation rules?

If (for example) the author of OpenOffice files would prepare and
maintain the file and thus guarantee compatible behaviour with OOo,
that would be the best option. But first of all the question: what
would be the biggest benefit? New languages?

The rest of the thread was about Sanskrit.

Mojca

PS: if any other language specialist could offer some more answers
about Ethiopic scripts, feel free to reply to me and Arthur off-list.





Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-21 Thread Yves Codet
Hello.

On 21 Nov 2010, at 10:22, Manuel B. wrote:

> While I was checking hyphen-sa.tex, I wondered two things (which are
> irrelevant to Dominik's problem):
>
> 1) I saw that all diacritics used for IAST appear in the pattern,
> while some of them (for example ṛ and ṝ) are marked as "non standard
> transliteration". That is OK, insofar as IAST is not a standard in the
> official sense. But IAST is most commonly used and the "standard"
> transliteration of vocalic r in IAST is ṛ, not r̥.
>
> The latter belongs to the international standard transliteration of
> Indic scripts, defined as ISO 15919. So if ISO 15919 has to be taken
> into account for the Sanskrit hyphenation patterns, it should be done
> completely. Which means that, for example, ṁ should also be added,
> and ṃ marked as "non standard transliteration", and so on.

I agree with you on both points.

The comments you mention were merely notes to myself (what we call in French a 
"pense-bête" :), but since they can be read by other people they should be 
clearer, and I'll use IAST or ISO 15919 instead of "non standard" and 
(implicitly) "standard".

I'll also add the missing characters, ṁ, ẖ, ḫ and the sign for anudātta (I 
think that's all, as far as Sanskrit is concerned).

> But I don't know how far one can go here. While IAST is meant
> exclusively for Sanskrit transliteration (I know that it's used for
> Pali also, but in a slightly different way), ISO 15919 contains far
> more diacritics than are needed for the transliteration of Sanskrit.
> It's rather meant as a transliteration of many or most Indian
> languages. Should it be duplicated then in every hyphenation pattern
> of every language in question?
>
> 2) That might be a stupid question, but aren't hyphenation patterns
> for most abugida scripts more or less the same? That means the
> hyphenation is script dependent rather than language dependent. Lots
> of hyphenation patterns have to be duplicated, if they are ordered by
> language. While one could have a hyphen-indic.tex instead.

Arthur and Mojca are better qualified than I to answer those questions. What 
comes to mind is that such a "total" hyphenation file might rapidly become 
difficult to maintain, all the more so as it would require several maintainers. 
Besides, some languages might require special rules, exceptions for instance, 
which could be unwanted in another language using the same script.

Arthur and Mojca, what do you think?

Regards,

Yves







Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-21 Thread Dominik Wujastyk
That's extremely helpful!  Thank you, Arthur.

I've upped the first argument of hyphenmins to 2, which helps a lot for
romanisation, but may make the Nagari breaks more difficult.  I suppose it's
not reasonable to assume that hyphenation parameters will be the same across
different scripts.
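
For reference, a minimal sketch of setting the two limits directly with
the TeX primitives \lefthyphenmin and \righthyphenmin. The values are only
examples, the preamble is assumed to follow the Polyglossia setup used
elsewhere in this thread, and Polyglossia may well reset both values
whenever the language is switched:

```latex
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setdefaultlanguage{sanskrit}
\newfontfamily\sanskritfont{Charis SIL}

\begin{document}
% Require at least 2 letters before the first break in a word
% and at least 3 after the last one (example values).
\lefthyphenmin=2
\righthyphenmin=3

manum ekāgram āsīnam abhigamya maharṣayaḥ
\end{document}
```

With \lefthyphenmin at 2, a break such as a-rhasi is ruled out whatever
the patterns say.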

Best,
Dominik


On 20 November 2010 22:12, Arthur Reutenauer <
arthur.reutena...@normalesup.org> wrote:

> > I'm really not sure what I'm getting as a result. It looks as if it's
> > roman script being hyphenated as if it were Devanagari. The initial a-
> > of several words, like arhasi, gets separated (a-rhasi), which might
> > just about look okay in Nagari, but not in romanisation. Am I actually
> > getting the right thing
>
>   You're indeed getting what the patterns say.  From what I read in
> hyph-sa.tex, the patterns allow breaks after any vowel (but not inside
> diphthongs), and forbid them before final consonants or consonant
> clusters; and that's about it.  It's certainly a debatable choice, but
> it does seem like the patterns really aim at mimicking the way (say)
> Sanskrit written using Devanagari is hyphenated.  You would have to take
> this up with Yves.
>
> > Why do I have to pretend that this is Devanagari (\devanagarifont)?
>
>   This is by design in polyglossia (see gloss-sanskrit.ldf).  You would
> have to take this up with François.  (And I'm the one responsible for
> integrating hyph-sa.tex into hyph-utf8.  Why does it seem like there is
> a French mafia around Sanskrit support in XeTeX? ;-)
>
>Arthur




Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-21 Thread Yves Codet
Hello.

Le 20 nov. 2010 à 22:12, Arthur Reutenauer a écrit :

>> I'm really not sure what I'm getting as a result. It looks as if it's roman
>> script being hyphenated as if it were Devanagari. The initial a- of several
>> words, like arhasi, gets separated (a-rhasi), which might just about look
>> okay in Nagari, but not in romanisation. Am I actually getting the right
>> thing
> 
>  You're indeed getting what the patterns say.  From what I read in
> hyph-sa.tex, the patterns allow breaks after any vowel (but not inside
> diphthongs), and forbid them before final consonants or consonant
> clusters; and that's about it.  It's certainly a debatable choice, but
> it does seem like the patterns really aim at mimicking the way (say)
> Sanskrit written using Devanagari is hyphenated.  You would have to take
> this up with Yves.

Debatable, I'm not sure :) De gustibus et coloribus non est disputandum.
Personally I don't mind breaks such as a-rhasi. I know many prefer ar-hasi, but 
there are some books where you would find a-rhasi. On page 189 of Gray's 
edition of Vāsavadattā (Delhi, 1962), for instance, I can see: 
...nirmu-kta..., ...ku-ṭṭimam.

So, for a start, I did exactly what Arthur described: I chose the easy way. But
I can add rules allowing a break after the first consonant of a consonant 
cluster. If there are rules such as:
a1
...
r3h
you should get ar-hasi rather than a-rhasi without having to modify hyphenmins.
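
Until such rules are in the pattern file, individual words can be handled with TeX's \hyphenation exception list. A minimal sketch, assuming Sanskrit is the current language when the exception is declared (here it is the default language, so the declaration is placed after \begin{document}). The font is only an example:

```latex
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setdefaultlanguage{sanskrit}
\newfontfamily\sanskritfont{Charis SIL}% example font, an assumption
\textwidth=1cm % very narrow column, only to force many line breaks

\begin{document}

% Exceptions are stored for the language that is current at this point
% and take precedence over the patterns for this exact word form only.
\hyphenation{ar-hasi}

\noindent
arhasi arhasi arhasi

\end{document}
```

Inflected or compounded forms would each need their own entry, so this is a stopgap rather than a substitute for pattern rules such as r3h.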

>> Why do I have to pretend that this is Devanagari (\devanagarifont)?
> 
>  This is by design in polyglossia (see gloss-sanskrit.ldf).  You would
> have to take this up with François.  (And I'm the one responsible for
> integrating hyph-sa.tex into hyph-utf8.  Why does it seem like there is
> a French mafia around Sanskrit support in XeTeX? ;-)

:) 

Dominik, I think you can write \sanskritfont, can’t you?

Best wishes,

Yves







Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-21 Thread Yves Codet

On 21 Nov. 2010 at 10:12, Yves Codet wrote:

> Dominik, I think you can write \sanskritfont, can’t you?

I just tried this:


\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setdefaultlanguage{sanskrit}
\newfontfamily\sanskritfont{Charis SIL}
\textwidth=0.5cm

\begin{document}

\noindent
manum ekāgram āsīnam abhigamya maharṣayaḥ |

\end{document}


It worked for me, with Polyglossia v1.2.0a.

Best wishes,

Yves







Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-21 Thread Manuel B.
I'm glad to hear that there is finally some implementation of roman
transliteration in the Sanskrit hyphenation patterns. Keep up the good
work!

While I was checking hyph-sa.tex, I wondered about two things (which are
irrelevant to Dominik's problem):

1) I saw that all the diacritics used for IAST appear in the patterns,
while some of them (for example ṛ and ṝ) are marked as "non standard
transliteration". That is OK insofar as IAST is not a standard in the
official sense. But IAST is the most commonly used scheme, and the
"standard" transliteration of vocalic r in IAST is ṛ, not r̥.

The latter belongs to the international standard transliteration of
Indic scripts, defined as ISO 15919. So if ISO 15919 is to be taken
into account for the Sanskrit hyphenation patterns, it should be done
completely. That means, for example, that ṁ should also be added
and ṃ marked as "non standard transliteration", and so on.

But I don't know how far one can go here. While IAST is meant
exclusively for Sanskrit transliteration (I know that it's used for
Pali as well, but in a slightly different way), ISO 15919 contains far
more diacritics than are needed for the transliteration of Sanskrit.
It is rather meant as a transliteration of many or most Indian
languages. Should it then be duplicated in the hyphenation patterns
of every language in question?

2) This might be a stupid question, but aren't hyphenation patterns
for most abugida scripts more or less the same? That would mean that
hyphenation is script-dependent rather than language-dependent. Lots
of hyphenation patterns have to be duplicated if they are organised by
language, whereas one could have a single hyphen-indic.tex instead.

Have a nice weekend!
Manuel

2010/11/21 Dominik Wujastyk:
> That's extremely helpful!  Thank you, Arthur.
>
> I've upped the first argument of hyphenmins to 2, which helps a lot for
> romanisation, but may make the Nagari breaks more difficult.  I suppose it's
> not reasonable to assume that hyphenation parameters will be the same across
> different scripts.
>
> Best,
> Dominik
>
>
> On 20 November 2010 22:12, Arthur Reutenauer wrote:
>>
>> > I'm really not sure what I'm getting as a result. It looks as if it's
>> > roman script being hyphenated as if it were Devanagari. The initial a-
>> > of several words, like arhasi, gets separated (a-rhasi), which might
>> > just about look okay in Nagari, but not in romanisation. Am I actually
>> > getting the right thing
>>
>>  You're indeed getting what the patterns say.  From what I read in
>> hyph-sa.tex, the patterns allow breaks after any vowel (but not inside
>> diphthongs), and forbid them before final consonants or consonant
>> clusters; and that's about it.  It's certainly a debatable choice, but
>> it does seem like the patterns really aim at mimicking the way (say)
>> Sanskrit written using Devanagari is hyphenated.  You would have to take
>> this up with Yves.
>>
>> > Why do I have to pretend that this is Devanagari (\devanagarifont)?
>>
>>  This is by design in polyglossia (see gloss-sanskrit.ldf).  You would
>> have to take this up with François.  (And I'm the one responsible for
>> integrating hyph-sa.tex into hyph-utf8.  Why does it seem like there is
>> a French mafia around Sanskrit support in XeTeX? ;-)
>>
>>        Arthur
>>
>>
>
>
>
>
>





Re: [XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-20 Thread Arthur Reutenauer
> I'm really not sure what I'm getting as a result. It looks as if it's roman
> script being hyphenated as if it were Devanagari. The initial a- of several
> words, like arhasi, gets separated (a-rhasi), which might just about look
> okay in Nagari, but not in romanisation. Am I actually getting the right
> thing

  You're indeed getting what the patterns say.  From what I read in
hyph-sa.tex, the patterns allow breaks after any vowel (but not inside
diphthongs), and forbid them before final consonants or consonant
clusters; and that's about it.  It's certainly a debatable choice, but
it does seem like the patterns really aim at mimicking the way (say)
Sanskrit written using Devanagari is hyphenated.  You would have to take
this up with Yves.

> Why do I have to pretend that this is Devanagari (\devanagarifont)?

  This is by design in polyglossia (see gloss-sanskrit.ldf).  You would
have to take this up with François.  (And I'm the one responsible for
integrating hyph-sa.tex into hyph-utf8.  Why does it seem like there is
a French mafia around Sanskrit support in XeTeX? ;-)

Arthur




[XeTeX] Hyphenated, transliterated Sanskrit.

2010-11-20 Thread Dominik Wujastyk
I've been banging my head against this for a while today, without resolving
things.  I see that the UTF8 hyph-sa.tex file contains the rules for
hyphenating Sanskrit in several scripts, including Roman (Latin?).  The way
this should work, I believe, is that as long as I flag my words as being in
Sanskrit, then they'll get appropriately hyphenated whichever of these
scripts I use.

But I can't find a way to get Polyglossia to accept Sanskrit written in
Roman script.

If I say

 \setotherlanguage{sanskrit}

\newfontfamily\devanagarifont[Script=Latin]{Gentium Basic}


or even

\newfontfamily\devanagarifont{Gentium Basic}


and then


\setlength\textwidth{1cm} % or whatever, to get lots of hyphenating.

\textsanskrit{manum ekāgram āsīnam abhigamya maharṣayaḥ |}


I'm really not sure what I'm getting as a result. It looks as if it's roman
script being hyphenated as if it were Devanagari. The initial a- of several
words, like arhasi, gets separated (a-rhasi), which might just about look
okay in Nagari, but not in romanisation. Am I actually getting the right
thing, but I just need to crank up the first argument of hyphenmins? (This
does seem to improve things.)


Why do I have to pretend that this is Devanagari (\devanagarifont)? I would
like to be able to define \sanskritfont and then define the script, e.g.,


\newfontfamily\sanskritfont[Script=Devanagari|Latin|Malayalam|Bengali]{Code2000}


I think I'll stop. You can see I'm in a terrible muddle.


Has someone got this running already? Yves? Jonathan?


Best,

Dominik

