Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-11 Thread Keith J. Schultz
Hi all, Philip,

Let me recap the situation here:

The original post from Scott stated he had a problem going from his wiki
to PDF via Xe(La)TeX! His problem involved texts with mixed directionality.

I did not express myself very well and should have said that in Unicode one
can identify characters with RTL directionality and language as such.
Sorry if this misunderstanding caused too much noise; I only realized it
last night.

I was always under the impression that the main emphasis was on RTL languages.
Philip, Vietnamese is LTR to my knowledge, and I did not state that Unicode can
identify all languages.

The fact remains that one can identify the directionality of code points. How 
they have to be processed
is another matter.
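As an illustration of the point about code points: the Unicode Character Database assigns every code point a bidirectional class, and Python's standard unicodedata module exposes it. The following is a minimal sketch, not part of the original message. The function names are made up for this example, and it only classifies characters; it does not run the bidi algorithm itself.

```python
import unicodedata


def strong_direction(ch):
    """Return 'RTL', 'LTR', or None for a character with no strong direction."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in ("R", "AL"):   # Hebrew-like and Arabic-like letters
        return "RTL"
    if bidi_class == "L":           # Latin, Cyrillic, CJK, ...
        return "LTR"
    return None                     # digits, spaces, punctuation, ...


def direction_summary(text):
    """Count the characters of text by strong direction."""
    counts = {"LTR": 0, "RTL": 0, "other": 0}
    for ch in text:
        counts[strong_direction(ch) or "other"] += 1
    return counts


# "sang" followed by a space and three Hebrew letters (written as escapes)
print(direction_summary("sang \u05d1\u05d9\u05ea"))
# {'LTR': 4, 'RTL': 3, 'other': 1}
```

The space is counted as "other" because its class is WS (whitespace), which has no direction of its own. Deciding how such characters are displayed is the job of the bidi algorithm, not of the per-character property.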

The bidi algorithm is in the standard, which helps in getting the directional
display right.

The fact remains that he can identify his problem cases and handle them
appropriately.

regards
Keith.




--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-10 Thread Khaled Hosny
On Wed, Dec 11, 2013 at 08:36:53AM +1100, Andrew Cunningham wrote:
> More to the point, which libraries is XeTeX using for bidi support, and how
> up-to-date are they?

The issue is not the libraries (we use ICU, which implemented the latest
BiDi algorithm changes in its last release), but the fact that XeTeX
does not use it in a way that applies the BiDi algorithm to the paragraph as
a whole. It can be done; I have done it in LuaTeX using Lua code and the
hooks LuaTeX provides into the internals of TeX, but doing it by modifying
(Xe)TeX’s WEB code is not for the faint of heart (AKA me; the last
time I dared to do something like that, people kept hitting bugs for the
third year in a row).

Regards,
Khaled




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-10 Thread Khaled Hosny
On Wed, Dec 11, 2013 at 04:36:31AM +0200, Khaled Hosny wrote:
> On Tue, Dec 10, 2013 at 11:11:27AM -0500, C. Scott Ananian wrote:
> > On Tue, Dec 10, 2013 at 6:09 AM, Zdenek Wagner  
> > wrote:
> > > 2013/12/10 Keith J. Schultz :
> > >> I will repeat I do not know Vietnamese so I can not give you
> > [...]
> > >> Now, if "sang" is true Vietnamese and not a latinized form stand 
> > >> corrected! Though I have
> > [...]
> > > Yes, it is true Vietnamese word. I do not know Vietnamese, I could
> > 
> > https://www.google.com/search?q=sang+site%3Avi.wikipedia.org
> > 
> > ..which is indeed the issue I am attempting to deal with (trying to
> > put the discussion back on track) -- a bunch of user authored content
> > which looks correct to a native speaker when using the unicode bidi
> > algorithm (implemented in the browser).  Language tags are only
> > applied sporadically when needed to correct some obvious issue --
> > although the future Visual Editor project at wikimedia hopes to make
> > language tagging a more integrated part of the editing process.
> > 
> > Language tagging uses the HTML  standard.
> >  Directionality tagging uses  and  where necessary.  But
> > again, the point of the bidi algorithm is to avoid the necessity of
> > manual tagging in many cases.
> > 
> > Ultimately, Wikipedia's goal is to allow the largest number of
> > individual authors the ability to create encyclopedic content in their
> > language as easily as possible.  Our greatest challenge is the "as
> > easily as possible" part.  We can't impose language tagging as a
> > barrier to entry, when it is not necessary for the author's text to be
> > readable and useful to the public.
> 
> There is a big difference between (barely) readable text and
> typographically correct text. If your goal is only the former, this
> language tagging can be skipped (and you can forget about hyphenation,
> too, except for the main document language which is, hopefully, already
> known).
> 
> This leaves you with the BiDi algorithm, for which there exist many
> implementations that you might be able to use while processing your text
> before generating TeX files. There even exists a TeX pre-processor that
> can apply the BiDi algorithm to TeX documents, which you might be able to
> use or adapt (I never used it myself, and it was written for e-TeX, but
> XeTeX's RTL model is essentially the same, so it should work in theory).

http://biditex.sourceforge.net/

Regards,
Khaled




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-10 Thread Khaled Hosny
On Tue, Dec 10, 2013 at 11:11:27AM -0500, C. Scott Ananian wrote:
> On Tue, Dec 10, 2013 at 6:09 AM, Zdenek Wagner  
> wrote:
> > 2013/12/10 Keith J. Schultz :
> >> I will repeat I do not know Vietnamese so I can not give you
> [...]
> >> Now, if "sang" is true Vietnamese and not a latinized form stand 
> >> corrected! Though I have
> [...]
> > Yes, it is true Vietnamese word. I do not know Vietnamese, I could
> 
> https://www.google.com/search?q=sang+site%3Avi.wikipedia.org
> 
> ..which is indeed the issue I am attempting to deal with (trying to
> put the discussion back on track) -- a bunch of user authored content
> which looks correct to a native speaker when using the unicode bidi
> algorithm (implemented in the browser).  Language tags are only
> applied sporadically when needed to correct some obvious issue --
> although the future Visual Editor project at wikimedia hopes to make
> language tagging a more integrated part of the editing process.
> 
> Language tagging uses the HTML  standard.
>  Directionality tagging uses  and  where necessary.  But
> again, the point of the bidi algorithm is to avoid the necessity of
> manual tagging in many cases.
> 
> Ultimately, Wikipedia's goal is to allow the largest number of
> individual authors the ability to create encyclopedic content in their
> language as easily as possible.  Our greatest challenge is the "as
> easily as possible" part.  We can't impose language tagging as a
> barrier to entry, when it is not necessary for the author's text to be
> readable and useful to the public.

There is a big difference between (barely) readable text and
typographically correct text. If your goal is only the former, this
language tagging can be skipped (and you can forget about hyphenation,
too, except for the main document language which is, hopefully, already
known).

This leaves you with the BiDi algorithm, for which there exist many
implementations that you might be able to use while processing your text
before generating TeX files. There even exists a TeX pre-processor that
can apply the BiDi algorithm to TeX documents, which you might be able to
use or adapt (I never used it myself, and it was written for e-TeX, but
XeTeX's RTL model is essentially the same, so it should work in theory).

Regards,
Khaled




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-10 Thread Andrew Cunningham
On 11/12/2013 5:27 AM, "C. Scott Ananian"  wrote:
>
> ..which is indeed the issue I am attempting to deal with (trying to
> put the discussion back on track) -- a bunch of user authored content
> which looks correct to a native speaker when using the unicode bidi
> algorithm (implemented in the browser).  Language tags are only
> applied sporadically when needed to correct some obvious issue --
> although the future Visual Editor project at wikimedia hopes to make
> language tagging a more integrated part of the editing process.
>

There are a number of problems with current web browser implementations of
bidi support.

If you are going to compare bidi support in XeTeX with web browsers, it
would be better to compare it with the proposed changes in HTML5 and CSS3,
which will bring it more in line with key changes in Unicode 6.3 bidi
support.

> Language tagging uses the HTML  standard.
>  Directionality tagging uses  and  where necessary.

In most cases   and  should not be needed.

But
> again, the point of the bidi algorithm is to avoid the necessity of
> manual tagging in many cases.

This is the idea in the proposed changes in HTML5.

More to the point, which libraries is XeTeX using for bidi support, and how
up-to-date are they?




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-10 Thread C. Scott Ananian
On Tue, Dec 10, 2013 at 6:09 AM, Zdenek Wagner  wrote:
> 2013/12/10 Keith J. Schultz :
>> I will repeat I do not know Vietnamese so I can not give you
[...]
>> Now, if "sang" is true Vietnamese and not a latinized form stand corrected! 
>> Though I have
[...]
> Yes, it is true Vietnamese word. I do not know Vietnamese, I could

https://www.google.com/search?q=sang+site%3Avi.wikipedia.org

..which is indeed the issue I am attempting to deal with (trying to
put the discussion back on track) -- a bunch of user authored content
which looks correct to a native speaker when using the unicode bidi
algorithm (implemented in the browser).  Language tags are only
applied sporadically when needed to correct some obvious issue --
although the future Visual Editor project at wikimedia hopes to make
language tagging a more integrated part of the editing process.

Language tagging uses the HTML  standard.
 Directionality tagging uses  and  where necessary.  But
again, the point of the bidi algorithm is to avoid the necessity of
manual tagging in many cases.

Ultimately, Wikipedia's goal is to allow the largest number of
individual authors the ability to create encyclopedic content in their
language as easily as possible.  Our greatest challenge is the "as
easily as possible" part.  We can't impose language tagging as a
barrier to entry, when it is not necessary for the author's text to be
readable and useful to the public.  We can encourage it in order to
obtain good hyphenation of embedded texts, but in our case that must
be an optional enhancement, not a requirement in order for the text to
be read.  (Which is why if we did do automated language guessing, it
would likely be primarily to *disable* hyphenation when we detect an
embedded text whose language differs from the one currently selected.
That is the safe option; we'll sacrifice some beauty but preserve the
legibility of the text -- which is our foremost concern.  We can't use
automated language guessing to second-guess the unicode bidi
algorithm, because the text *as it appears in the browser* is the text
which has been proof-read by our editors, and must be considered
canonically correct.)
  --scott

-- 
 ( http://cscott.net/ )




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-10 Thread C. Scott Ananian
But it does automatically add italic correction, rather than requiring
this to be specified each time.
 --scott

-- 
 ( http://cscott.net/ )




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-10 Thread Zdenek Wagner
2013/12/10 Keith J. Schultz :
> Hi Philip,
>
> I will repeat: I do not know Vietnamese, so I cannot give you
> the UTF-8 sequence for it. All I can say is that in UTF-8 the singular
> letters will be encoded as multi-byte sequences, whereas the English
> letters will be just one byte.
>
It has no relation to English, it is just because these characters
have codepoints less than 128. In Czech some characters will be
encoded as one byte, some as two bytes. The character "s" may appear
in English, German, Czech, Hungarian, Spanish and many other
languages. You have not answered Phillip's question what is the utf-8
sequence to distinguish English "s" from Czech "s", from Vietnamese
"s", from Hungarian "s" etc.
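The point can be checked directly. The following Python sketch is an illustration added to the thread, not something from the original message. The helper name is hypothetical, and the sample words other than "sang" and "strom" were chosen only for the example. It shows that the number of UTF-8 bytes depends on the code point alone, so the bytes of "sang" are the same whatever language the word is meant to be in.

```python
def utf8_lengths(word):
    """Return (character, code point, UTF-8 byte count) for each character."""
    return [(ch, f"U+{ord(ch):04X}", len(ch.encode("utf-8"))) for ch in word]


# "sang" and "strom" are pure ASCII; the Czech and Vietnamese samples
# contain accented letters with code points above 127.
for word in ("sang", "strom", "p\u0159\u00edli\u0161", "ti\u1ebfng"):
    print(word, utf8_lengths(word))

# The byte sequence carries no language information at all:
print("sang".encode("utf-8"))  # b'sang'
```

Code points below 128 take one byte, code points up to U+07FF take two, and the rest of the Basic Multilingual Plane takes three, which is why the Czech letters above need two bytes and the Vietnamese "ế" (U+1EBF) needs three.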

> Now, I also mentioned that differentiating Western languages poses a
> different matter!
> "sang" in English and "sang" in German and Austrian cannot be singularly
> differentiated as to which language it belongs to! All Latin
> characters/letters.
> Now, if "sang" is true Vietnamese and not a latinized form, I stand
> corrected! Though I have a feeling it is latinized! If we are talking of
> the phonetic representation, then an analysis at the text level, beyond
> the level of the single word, is required.
>
Yes, it is a true Vietnamese word. I do not know Vietnamese, I could
only verify it with Google Translate, but I know that Vietnamese uses
the Latin alphabet with accents. And of course, some words do not have
accents. It is the same in Czech: we also use accented characters but
many words do not have them. And for instance, strom in Czech has a
different meaning than Strom in German.

> It has been mentioned by others that there seems to be a lack of
> multilingual UTF-8 editors (input methods) and, on the other side, that
> Xe(La)TeX lacks a proper implementation of the Unicode standard.
>
Unicode is not a typographic standard and programs from the TeX world
deal with typography. If you want to achieve typographically good
output, you have to use language-specific rules, i.e. the languages must
be properly tagged. Once you tag the language, it will appear right in
the Xe(La)TeX output. If you are interested in Unicode only and not in
typography, why do you wish to use a typographic tool?

I can explain it another way. If you wish to connect two pieces of
wood, you can use either a nail or a screw. If you use a screw, you
must first make a hole and then screw the pieces together. However, if
you do not like to make a hole and want to use a hammer only, why do you
bother with a screw instead of using a nail?

> It is not the standard that is the problem, but the implementation of
> input and the implementation of the output method.
>
> True enough, Unicode is by far not finished and is still evolving, with
> all the caveats involved. Yet the problem here does not arise from the
> Unicode standard or UTF-8 encoding/decoding being inadequate, but from
> its implementation.
> The culprit is not UTF-8!
>
>
>
> On 09.12.2013 at 23:51, Philip Taylor wrote:
>
>>
>>
>> Keith J. Schultz wrote:
>>> Hi Philip,
>>>
>>> 1) I do not know Vietnamese!
>>>
>>> 2) If I did, using the proper BMP would give me the answer,
>>> as "sang" would be a sequence of singular octets, and Vietnamese
>>> would use multi-byte sequences!
>>>
>>> Case closed! Like I mentioned, there are often ways used to reduce the
>>> length of the multi-byte sequences. In that case one has to know the
>>> process used to get the proper Unicode character code!
>>
>> It is not necessary to "know" a language in order to be able to
>> algorithmically determine in which language a particular stretch
>> of text is written, if such algorithmic determination is possible.
>> I do not "know" Hebrew, but even I know that "בית דין‎" is Hebrew
>> and that "你好" is not.  What I do not know (and what I challenge
>> you to tell us) is whether "sang" is English or Vietnamese.
>>
>> You wrote :  "for efficiency reasons, utf-8 strings are not properly
>> encoded and programs assume a particular language, to save space."
>>
>> I invited you to tell us (the XeTeX list members, that is) what
>> would be a "properly encoded utf-8 string" for the sequence
>> "sang" which would enable a computer algorithm to determine
>> whether that string was "sang" (Vietnamese) or "sang" (English).
>>
>> I am still hoping that you will be able to tell us what that
>> properly encoded utf-8 string is, rather than just metaphorically
>> waving your arms in the air while throwing around phrases such as
>> "proper BMP", "singular octets" and "multi-byte sequences".
>>
>> Philip Taylor
>>
>>
>>
>
>
>
>



-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-10 Thread Keith J. Schultz
Hi Philip,

I will repeat: I do not know Vietnamese, so I cannot give you
the UTF-8 sequence for it. All I can say is that in UTF-8 the singular
letters will be encoded as multi-byte sequences, whereas the English
letters will be just one byte.

Now, I also mentioned that differentiating Western languages poses a
different matter!
"sang" in English and "sang" in German and Austrian cannot be singularly
differentiated as to which language it belongs to! All Latin
characters/letters.
Now, if "sang" is true Vietnamese and not a latinized form, I stand
corrected! Though I have a feeling it is latinized! If we are talking of
the phonetic representation, then an analysis at the text level, beyond
the level of the single word, is required.

It has been mentioned by others that there seems to be a lack of
multilingual UTF-8 editors (input methods) and, on the other side, that
Xe(La)TeX lacks a proper implementation of the Unicode standard.

It is not the standard that is the problem, but the implementation of
input and the implementation of the output method.

True enough, Unicode is by far not finished and is still evolving, with
all the caveats involved. Yet the problem here does not arise from the
Unicode standard or UTF-8 encoding/decoding being inadequate, but from
its implementation.
The culprit is not UTF-8!



On 09.12.2013 at 23:51, Philip Taylor wrote:

> 
> 
> Keith J. Schultz wrote:
>> Hi Philip,
>> 
>> 1) I do not know Vietnamese!
>> 
>> 2) If I did, using the proper BMP would give me the answer,
>> as "sang" would be a sequence of singular octets, and Vietnamese
>> would use multi-byte sequences!
>> 
>> Case closed! Like I mentioned, there are often ways used to reduce the
>> length of the multi-byte sequences. In that case one has to know the
>> process used to get the proper Unicode character code!
> 
> It is not necessary to "know" a language in order to be able to
> algorithmically determine in which language a particular stretch
> of text is written, if such algorithmic determination is possible.
> I do not "know" Hebrew, but even I know that "בית דין‎" is Hebrew
> and that "你好" is not.  What I do not know (and what I challenge
> you to tell us) is whether "sang" is English or Vietnamese.
> 
> You wrote :  "for efficiency reasons, utf-8 strings are not properly
> encoded and programs assume a particular language, to save space."
> 
> I invited you to tell us (the XeTeX list members, that is) what
> would be a "properly encoded utf-8 string" for the sequence
> "sang" which would enable a computer algorithm to determine
> whether that string was "sang" (Vietnamese) or "sang" (English).
> 
> I am still hoping that you will be able to tell us what that
> properly encoded utf-8 string is, rather than just metaphorically
> waving your arms in the air while throwing around phrases such as
> "proper BMP", "singular octets" and "multi-byte sequences".
> 
> Philip Taylor
> 
> 
> 






Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread Philip Taylor


Keith J. Schultz wrote:
> Hi Philip,
> 
> 1) I do not know Vietnamese!
> 
> 2) If I did, using the proper BMP would give me the answer,
> as "sang" would be a sequence of singular octets, and Vietnamese
> would use multi-byte sequences!
> 
> Case closed! Like I mentioned, there are often ways used to reduce the
> length of the multi-byte sequences. In that case one has to know the
> process used to get the proper Unicode character code!

It is not necessary to "know" a language in order to be able to
algorithmically determine in which language a particular stretch
of text is written, if such algorithmic determination is possible.
I do not "know" Hebrew, but even I know that "בית דין‎" is Hebrew
and that "你好" is not.  What I do not know (and what I challenge
you to tell us) is whether "sang" is English or Vietnamese.

You wrote :  "for efficiency reasons, utf-8 strings are not properly
encoded and programs assume a particular language, to save space."

I invited you to tell us (the XeTeX list members, that is) what
would be a "properly encoded utf-8 string" for the sequence
"sang" which would enable a computer algorithm to determine
whether that string was "sang" (Vietnamese) or "sang" (English).

I am still hoping that you will be able to tell us what that
properly encoded utf-8 string is, rather than just metaphorically
waving your arms in the air while throwing around phrases such as
"proper BMP", "singular octets" and "multi-byte sequences".

Philip Taylor







Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread mskala
On Tue, 10 Dec 2013, Khaled Hosny wrote:
> Now you beat Keith in Who Wrote The Most Nonessential Text In This
> Thread contest.

Well, it's always nice to be a winner.
-- 
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread Khaled Hosny
On Mon, Dec 09, 2013 at 01:40:21PM -0600, msk...@ansuz.sooke.bc.ca wrote:
> On Mon, 9 Dec 2013, C. Scott Ananian wrote:
> > feeding the output to xelatex.  That work won't help others who find
> > themselves in a similar situation (or document authors who would
> > prefer not to have to explicitly annotate every LTR embedding), but it
> 
> The software also doesn't automatically determine which words should be
> set in italics, even though this policy is inconvenient for authors who
> prefer not to have to explicitly annotate it every time.

Now you beat Keith in Who Wrote The Most Nonessential Text In This
Thread contest.




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread Khaled Hosny
On Mon, Dec 09, 2013 at 09:32:05AM -0600, msk...@ansuz.sooke.bc.ca wrote:
> On Mon, 9 Dec 2013, Khaled Hosny wrote:
> > >U+E0001 U+E0065 U+E006E U+0073 U+0061 U+006E U+0067
> >
> > And it is a kind of tagging, so beyond the scope of identifying the
> > language of *untagged* text (which is the claim that spurred all this
> > discussion).
> 
> The claim was "A properly encoded utf-8 string should contain everything
> you need!".

You are reading too much into this statement. The original claim was
that you don’t need to tag a Unicode string to be able to identify its
language, which is not the case, and your method is just a (deprecated)
form of tagging, so it does not prove the original claim.
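To make the quoted code point sequence concrete: U+E0001 is the LANGUAGE TAG character, and the tag characters U+E0020 to U+E007E that follow it mirror ASCII when 0xE0000 is subtracted. The Python sketch below is an illustration added to the thread, with a hypothetical function name. It decodes such a prefix and shows that the sequence above is simply the tag "en" followed by the plain letters of "sang".

```python
LANGUAGE_TAG = 0xE0001                 # U+E0001 LANGUAGE TAG (deprecated)
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E  # tag characters mirroring ASCII


def split_language_tag(text):
    """Return (language, rest); language is None if text has no leading tag."""
    if not text or ord(text[0]) != LANGUAGE_TAG:
        return None, text
    i = 1
    # Consume the run of tag characters that spells the language identifier.
    while i < len(text) and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
        i += 1
    language = "".join(chr(ord(c) - 0xE0000) for c in text[1:i])
    return language, text[i:]


# The sequence quoted above, built from its code points
tagged = "".join(chr(cp) for cp in
                 (0xE0001, 0xE0065, 0xE006E, 0x73, 0x61, 0x6E, 0x67))
print(split_language_tag(tagged))   # ('en', 'sang')
print(split_language_tag("sang"))   # (None, 'sang')
```

The second call shows the other half of the argument: without the explicit prefix, the string carries no language information, so somebody still has to add the tag.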

> If you forbid using Unicode tag characters, then you're
> saying "It is impossible to encode language in Unicode when you're not
> allowed to use the features designed for that purpose," which is not
> an interesting statement.

I’m not forbidding anything, but the grand OP’s issue was that he cannot
manually tag the text, and I don’t see how changing the form of tagging
solves anything, since one still needs to do it manually.
 
> Yes, of course some kind of tagging is needed.  Keith seems to think that
> the tagging will magically come from "proper" UTF-8, and of course he's
> wrong.  I think language tagging would be possible in pure Unicode, as the
> string above demonstrates, but that's not a good way to do it.  The really
> original question had to do with RTL versus LTR detection, not language
> detection, and that's a different issue.

We are not even limited to plain text, since we are dealing with
a Wikipedia article here, which is tagged text, so what form of tagging
to use is not even an issue. The tagging itself is the issue.

> Unicode specifies a way to detect RTL versus LTR, such that in many cases
> it doesn't require tagging.

Right, and the grand OP was advised to use that, and it is very
reliable, but it solves only half the issue, since it does not help with
the language tagging that is needed for other things like hyphenation
patterns or using different typographic conventions, different fonts and
so on for different languages, which IMO is a requirement for any
typesetting job for anything but the most trivial of texts.

> Unicode's way of doing it may or may not be a good one, but we cannot
> reasonably pretend that it doesn't exist.  The Unicode bidi algorithm
> does exist.  XeTeX does not implement the Unicode bidi algorithm.

No one claimed that in the whole thread, so I’m not sure what you are
trying to disprove here.

> The interesting remaining question is whether XeTeX should implement
> it.  I tend to think not - because if we implement it, people will
> blame us for its failings.  It'd also be a lot of work, break
> compatibility with the rest of the TeX world, STILL require tagging in
> many cases, and so on.

To the contrary, I think XeTeX should, but it is not a trivial job and
so is unlikely to be done.

Regards,
Khaled




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread Axel E. Retif

On 12/09/2013 10:15 AM, Zdenek Wagner wrote:



A bit off topic: do you know a good Linux text editor with a properly
implemented bidi algorithm so that I could type multilingual texts?
Even the combination of Urdu and TeX macros is a pain; it is not easy
to type
\textbf{میں نے
\today\
  کو سب کچھ کیا۔}


I use Emacs (24.3.1) for my work. I can input Hebrew words (though I 
don't know Hebrew, just the alphabet) ---Emacs immediately recognizes 
I'm typing Hebrew and shifts to RTL and back again, at least in the two 
major modes I use: AUCTeX and Org.



Best

Axel





Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread mskala
On Mon, 9 Dec 2013, C. Scott Ananian wrote:
> feeding the output to xelatex.  That work won't help others who find
> themselves in a similar situation (or document authors who would
> prefer not to have to explicitly annotate every LTR embedding), but it

The software also doesn't automatically determine which words should be
set in italics, even though this policy is inconvenient for authors who
prefer not to have to explicitly annotate it every time.
-- 
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread C. Scott Ananian
In my particular case, I have citations in (for example) the Arabic
Wikipedia, which cite references on English or Turkish webpages (to
cite the example of the arwiki article on 'Istanbul').  The original
author of the article did not explicitly mark the language of the
reference, because the Unicode bidirectional algorithm did a perfect
job of rendering the cited page title LTR in an otherwise RTL context.
When I translate this to XeLaTeX, the entire citation is garbled
because, although XeLaTeX/polyglossia does render the individual words
LTR (using directionality implied from the Unicode code block), the
individual words are laid out RTL and the punctuation is a mess,
because XeLaTeX does not implement the bidi algorithm's mechanism for
inferring the directionality of 'weak' and 'soft' characters.   (The
original citations also don't necessarily add  tags where
necessary, but that appears to be an easily fixed fault of the
citation template.)

My understanding from this discussion is that I should implement the
unicode bidi algorithm myself in my article preprocessor, to
explicitly annotate the directionality of soft characters before
feeding the output to xelatex.  That work won't help others who find
themselves in a similar situation (or document authors who would
prefer not to have to explicitly annotate every LTR embedding), but it
should be a reasonable solution to my particular problem.
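A rough idea of what such a preprocessing step could look like is sketched below in Python. This is an illustration under loud assumptions, not the preprocessor discussed in the thread and not a conforming implementation of the Unicode bidi algorithm. It assumes the paragraph direction is RTL, it only approximates the rule that neutral characters between two strong LTR characters become LTR, and the wrapper macro (`\LR{...}`, as provided by the bidi package) is a parameter because the right markup depends on the document setup. The function name is hypothetical.

```python
import unicodedata

STRONG_RTL = {"R", "AL"}


def wrap_ltr_runs(text, open_macro="\\LR{", close_macro="}"):
    """Wrap LTR runs of an RTL paragraph in an explicit LTR macro.

    A run starts at a strong LTR character and extends to the last strong
    LTR character that can be reached without crossing a strong RTL one,
    so spaces and punctuation inside the run are carried along with it.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        if unicodedata.bidirectional(text[i]) != "L":
            out.append(text[i])      # RTL, neutral, or weak: copy unchanged
            i += 1
            continue
        end = i                      # index of the last strong LTR char seen
        j = i + 1
        while j < n and unicodedata.bidirectional(text[j]) not in STRONG_RTL:
            if unicodedata.bidirectional(text[j]) == "L":
                end = j
            j += 1
        out.append(open_macro + text[i:end + 1] + close_macro)
        i = end + 1
    return "".join(out)


# Arabic word, colon, an English title, "!", then another Arabic word
sample = "\u0645\u0631\u062c\u0639: Hello, World! \u0646\u0635"
print(wrap_ltr_runs(sample))
```

In the sample, the comma and space inside "Hello, World" end up inside the wrapper, while the trailing "!" stays outside and follows the RTL paragraph direction. Digits, explicit directional marks, brackets, and TeX special characters are not handled here; a real preprocessor would need the full algorithm or an existing implementation of it.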
 --scott


On Mon, Dec 9, 2013 at 10:32 AM,   wrote:
> On Mon, 9 Dec 2013, Khaled Hosny wrote:
>> >U+E0001 U+E0065 U+E006E U+0073 U+0061 U+006E U+0067
>>
>> And it is a kind of tagging, so beyond the scope of identifying the
>> language of *untagged* text (which is the claim that spurred all this
>> discussion).
>
> The claim was "A properly encoded utf-8 string should contain everything
> you need!".  If you forbid using Unicode tag characters, then you're
> saying "It is impossible to encode language in Unicode when you're not
> allowed to use the features designed for that purpose," which is not
> an interesting statement.
>
> Yes, of course some kind of tagging is needed.  Keith seems to think that
> the tagging will magically come from "proper" UTF-8, and of course he's
> wrong.  I think language tagging would be possible in pure Unicode, as the
> string above demonstrates, but that's not a good way to do it.  The really
> original question had to do with RTL versus LTR detection, not language
> detection, and that's a different issue.
>
> Unicode specifies a way to detect RTL versus LTR, such that in many cases
> it doesn't require tagging.  Unicode's way of doing it may or may not be a
> good one, but we cannot reasonably pretend that it doesn't exist.  The
> Unicode bidi algorithm does exist.  XeTeX does not implement the Unicode
> bidi algorithm.  The interesting remaining question is whether XeTeX
> should implement it.  I tend to think not - because if we implement it,
> people will blame us for its failings.  It'd also be a lot of work, break
> compatibility with the rest of the TeX world, STILL require tagging in
> many cases, and so on.
>
> --
> Matthew Skala
> msk...@ansuz.sooke.bc.ca People before principles.
> http://ansuz.sooke.bc.ca/
>
>



-- 
 ( http://cscott.net/ )




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread maxwell

On 2013-12-09 11:15, Zdenek Wagner wrote:

A bit off topic: do you know a good Linux text editor with a properly
implemented bidi algorithm so that I could type multilingual texts?


Yudit (http://www.yudit.org/) claims to be that.  I have not used it.

   Mike Maxwell




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread mskala
On Mon, 9 Dec 2013, Zdenek Wagner wrote:
> A bit off topic: do you know a good Linux text editor with a properly
> implemented bidi algorithm so that I could type multilingual texts?

No, I don't really do any work with RTL languages myself.  Wikipedia's
comparison list at http://en.wikipedia.org/wiki/Comparison_of_text_editors
mentions several that claim bidirectional text support, but I can't speak
to whether the ones listed there are any good at it.
-- 
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread Zdenek Wagner
2013/12/9  :
> On Mon, 9 Dec 2013, Khaled Hosny wrote:
>> >U+E0001 U+E0065 U+E006E U+0073 U+0061 U+006E U+0067
>>
>> And it is a kind of tagging, so beyond the scope of identifying the
>> language of *untagged* text (which is the claim that spurred all this
>> discussion).
>
> The claim was "A properly encoded utf-8 string should contain everything
> you need!".  If you forbid using Unicode tag characters, then you're
> saying "It is impossible to encode language in Unicode when you're not
> allowed to use the features designed for that purpose," which is not
> an interesting statement.
>
> Yes, of course some kind of tagging is needed.  Keith seems to think that
> the tagging will magically come from "proper" UTF-8, and of course he's
> wrong.  I think language tagging would be possible in pure Unicode, as the
> string above demonstrates, but that's not a good way to do it.  The really
> original question had to do with RTL versus LTR detection, not language
> detection, and that's a different issue.
>
> Unicode specifies a way to detect RTL versus LTR, such that in many cases
> it doesn't require tagging.  Unicode's way of doing it may or may not be a
> good one, but we cannot reasonably pretend that it doesn't exist.  The
> Unicode bidi algorithm does exist.  XeTeX does not implement the Unicode
> bidi algorithm.  The interesting remaining question is whether XeTeX
> should implement it.  I tend to think not - because if we implement it,
> people will blame us for its failings.  It'd also be a lot of work, break
> compatibility with the rest of the TeX world, STILL require tagging in
> many cases, and so on.
>
A bit off topic, do you know a good Linux text editor with a properly
implemented bidi algorithm so that I could type multilingual texts?
Even the combination of Urdu and TeX macros is a pain; it is not easy
to type
\textbf{میں نے
\today\
 کو سب کچھ کیا۔}
I am not able to type it on a single line; gedit, kate and even gmail
and facebook get confused and create garbage if I mix LTR and RTL
scripts. I can only use a commercial XML editor that allows me to
combine text in a Latin script with texts in Hindi and Urdu.




-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread mskala
On Mon, 9 Dec 2013, Khaled Hosny wrote:
> >U+E0001 U+E0065 U+E006E U+0073 U+0061 U+006E U+0067
>
> And it is a kind of tagging, so beyond the scope of identifying the
> language of *untagged* text (which is the claim that spurred all this
> discussion).

The claim was "A properly encoded utf-8 string should contain everything
you need!".  If you forbid using Unicode tag characters, then you're
saying "It is impossible to encode language in Unicode when you're not
allowed to use the features designed for that purpose," which is not
an interesting statement.

Yes, of course some kind of tagging is needed.  Keith seems to think that
the tagging will magically come from "proper" UTF-8, and of course he's
wrong.  I think language tagging would be possible in pure Unicode, as the
string above demonstrates, but that's not a good way to do it.  The really
original question had to do with RTL versus LTR detection, not language
detection, and that's a different issue.

Unicode specifies a way to detect RTL versus LTR, such that in many cases
it doesn't require tagging.  Unicode's way of doing it may or may not be a
good one, but we cannot reasonably pretend that it doesn't exist.  The
Unicode bidi algorithm does exist.  XeTeX does not implement the Unicode
bidi algorithm.  The interesting remaining question is whether XeTeX
should implement it.  I tend to think not - because if we implement it,
people will blame us for its failings.  It'd also be a lot of work, break
compatibility with the rest of the TeX world, STILL require tagging in
many cases, and so on.
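
To make the "detect RTL versus LTR without tagging" point concrete: every
code point carries a Bidi_Class property, and the first strong character is
what the bidi algorithm falls back on for a paragraph's base direction when
nothing else specifies it. Below is a minimal Python sketch of only that
step, using the standard unicodedata module. The function names are made up
for illustration, and explicit embeddings and isolates are ignored, so this
is nowhere near the full algorithm.

```python
import unicodedata

def strong_direction(ch):
    """Return "LTR" or "RTL" for a strong character, None for anything else."""
    cls = unicodedata.bidirectional(ch)  # Bidi_Class, e.g. "L", "R", "AL", "EN", "WS"
    if cls == "L":
        return "LTR"
    if cls in ("R", "AL"):
        return "RTL"
    return None  # weak or neutral: digits, spaces, punctuation, ...

def first_strong_direction(text):
    """Direction of the first strong character in text, or None if there is none."""
    for ch in text:
        direction = strong_direction(ch)
        if direction is not None:
            return direction
    return None

print(first_strong_direction("sang"))
print(first_strong_direction("\u0645\u06cc\u06ba \u0646\u06d2"))  # Urdu letters
```

Note that this says nothing about language: it only answers the direction
question, and only for the base direction of a run of text.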

-- 
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread Khaled Hosny
On Mon, Dec 09, 2013 at 08:16:03AM -0600, msk...@ansuz.sooke.bc.ca wrote:
> On Mon, 9 Dec 2013, Philip Taylor wrote:
> > Keith -- could you possibly supply an example of
> > "a properly encoded utf-8 string" from which it
> > can be unambiguously determined whether the string
> > "sang" is an English word (the past tense of "sing")
> 
> I'll probably regret pointing this out, and the characters involved have
> been deprecated since Unicode 5, but:
> 
>U+E0001 U+E0065 U+E006E U+0073 U+0061 U+006E U+0067

And it is a kind of tagging, so beyond the scope of identifying the
language of *untagged* text (which is the claim that spurred all this
discussion).

Regards,
Khaled




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread mskala
On Mon, 9 Dec 2013, Philip Taylor wrote:
> Keith -- could you possibly supply an example of
> "a properly encoded utf-8 string" from which it
> can be unambiguously determined whether the string
> "sang" is an English word (the past tense of "sing")

I'll probably regret pointing this out, and the characters involved have
been deprecated since Unicode 5, but:

   U+E0001 U+E0065 U+E006E U+0073 U+0061 U+006E U+0067

or in UTF-8 bytes:

   f3 a0 80 81 f3 a0 81 a5 f3 a0 81 ae 73 61 6e 67

The Web form you mentioned sanitizes away the special characters.  I don't
think that's unique to "tags" - it seems to also block everything outside
the Basic Multilingual Plane.  Bad form for something claiming to be an
authoritative analyser of Unicode strings.
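
For anyone who wants to check the byte sequence above without a Web form,
here is a small Python sketch. The helper name utf8_hex is only for
illustration; the code points are the ones listed in this message.

```python
# Code points from the message: LANGUAGE TAG, TAG LATIN SMALL LETTER E,
# TAG LATIN SMALL LETTER N, followed by the plain letters "sang".
code_points = [0xE0001, 0xE0065, 0xE006E, 0x0073, 0x0061, 0x006E, 0x0067]

def utf8_hex(cps):
    """Encode a list of code points as UTF-8 and return space-separated hex bytes."""
    text = "".join(chr(cp) for cp in cps)
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))

print(utf8_hex(code_points))
```

The three tag characters lie outside the Basic Multilingual Plane, so each
takes four bytes in UTF-8, while the four ASCII letters take one byte each.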
-- 
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread Khaled Hosny
On Mon, Dec 09, 2013 at 01:28:46PM +0100, Keith J. Schultz wrote:
> Hi Khaled,
> 
> I would agree with you if the text was not encoded in unicode!
> A properly encoded utf-8 string should contain everything you need!

No, it doesn’t; otherwise please prove me wrong and tell me how you can
programmatically identify the language of this paragraph using Unicode
properties.

> Unfortunately, for efficiency reasons, utf-8 strings are not properly
> encoded and programs assume a particular language, to save space.
> In multi-language environments methods are used for efficiency to make
> sure the system uses the correct language! 
>
> It is not the fault of utf-8, but the way it is implemented.  

Encoding has nothing to do with language identification; you can always
convert text to Unicode prior to processing it.

> As far as the methods you point to, they are for identifying texts of unknown
> origin and possibly of unknown encoding, or an encoding that has not already
> identified the language.

If the language of the text is already known (i.e. properly tagged
text), we don’t need to identify it.

> On 09.12.2013 at 10:38, Khaled Hosny wrote:
> 
> > On Mon, Dec 09, 2013 at 09:22:10AM +0100, Keith J. Schultz wrote:
> >> Hi Khaled,
> >> 
> >> your question can not be serious!
> > 
> > No, it is.
> > 
> >> It is pretty much in the standard! 
> > 
> > No.
> > 
> >> True enough that for most western languages american, english, spanish,
> >> german, austrian, etc. this is somewhat difficult. Yet, these are not 
> >> causing the problems.
> > 
> > You can’t identify the language of a Unicode string just by examining
> > the Unicode properties for the characters in that string, simply because
> > such Unicode property does not exist. Language identification involves
> > quite some statistical analysis[1]. You can identify scripts using
> > Unicode properties quite reliably, though.
> > 
> > 1. 
> > https://en.wikipedia.org/wiki/Language_identification#Statistical_approaches
> > 
> > Regards,
> > Khaled
> [snip, snip]

> 
> 





Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread Zdenek Wagner
2013/12/9 Philip Taylor :
> Keith -- could you possibly supply an example of
> "a properly encoded utf-8 string" from which it
> can be unambiguously determined whether the string
> "sang" is an English word (the past tense of "sing")
> or a Vietnamese word meaning "to", "posh" or "knowingly"

And it may be a Danish noun meaning "a song" or a past tense from
Danish verb "synge".

> in English ?  Could you also paste that string into
> Richard Ishida's Unicode String Analyser :
>
> http://rishida.net/tools/analysestring/
>
> and let us know what information it returns ?
>
> Philip Taylor
> 
>
> Keith J. Schultz wrote:
>
>> Unfortunately, for efficiency reasons, utf-8 strings are not properly
>> encoded and programs assume a particular language, to save space.
> --
> Windows 8 ? Just say "no".
>
>



-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread Philip Taylor
Keith -- could you possibly supply an example of
"a properly encoded utf-8 string" from which it
can be unambiguously determined whether the string
"sang" is an English word (the past tense of "sing")
or a Vietnamese word meaning "to", "posh" or "knowingly"
in English ?  Could you also paste that string into
Richard Ishida's Unicode String Analyser :

http://rishida.net/tools/analysestring/

and let us know what information it returns ?

Philip Taylor


Keith J. Schultz wrote:

> Unfortunately, for efficiency reasons, utf-8 strings are not properly
> encoded and programs assume a particular language, to save space.
-- 
Windows 8 ? Just say "no".




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread Keith J. Schultz
Hi Khaled,

I would agree with you if the text was not encoded in unicode!
A properly encoded utf-8 string should contain everything you need!
Unfortunately, for efficiency reasons, utf-8 strings are not properly
encoded and programs assume a particular language, to save space.
In multi-language environments methods are used for efficiency to make
sure the system uses the correct language! 

It is not the fault of utf-8, but the way it is implemented.  

As far as the methods you point to, they are for identifying texts of unknown
origin and possibly of unknown encoding, or an encoding that has not already
identified the language.

On 09.12.2013 at 10:38, Khaled Hosny wrote:

> On Mon, Dec 09, 2013 at 09:22:10AM +0100, Keith J. Schultz wrote:
>> Hi Khaled,
>> 
>> your question can not be serious!
> 
> No, it is.
> 
>> It is pretty much in the standard! 
> 
> No.
> 
>> True enough that for most western languages american, english, spanish,
>> german, austrian, etc. this is somewhat difficult. Yet, these are not 
>> causing the problems.
> 
> You can’t identify the language of a Unicode string just by examining
> the Unicode properties for the characters in that string, simply because
> such Unicode property does not exist. Language identification involves
> quite some statistical analysis[1]. You can identify scripts using
> Unicode properties quite reliably, though.
> 
> 1. 
> https://en.wikipedia.org/wiki/Language_identification#Statistical_approaches
> 
> Regards,
> Khaled
[snip, snip]



Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread Khaled Hosny
On Mon, Dec 09, 2013 at 09:22:10AM +0100, Keith J. Schultz wrote:
> Hi Khaled,
> 
> your question can not be serious!

No, it is.

> It is pretty much in the standard! 

No.

> True enough that for most western languages american, english, spanish,
> german, austrian, etc. this is somewhat difficult. Yet, these are not causing 
> the problems.

You can’t identify the language of a Unicode string just by examining
the Unicode properties for the characters in that string, simply because
no such Unicode property exists. Language identification involves
quite some statistical analysis[1]. You can identify scripts using
Unicode properties quite reliably, though.

1. https://en.wikipedia.org/wiki/Language_identification#Statistical_approaches
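
As a quick illustration of the point, here is a Python sketch that dumps what
the standard unicodedata module knows about each character of "sang". The
helper name describe is made up for this example. The stdlib module does not
expose the Script property directly, so the character name is used here as a
rough stand-in, which is an assumption of this sketch and not how a real
script detector should work.

```python
import unicodedata

def describe(text):
    """Return (code point, general category, character name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

for row in describe("sang"):
    print(row)
```

Every row says "Latin small letter" and nothing more. The output is the same
whether the word was meant as English, Vietnamese or Danish, because the
properties belong to the characters and not to the language of the text.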

Regards,
Khaled

> regards
>   Keith.
> 
> On 05.12.2013 at 09:46, Khaled Hosny wrote:
> 
> > On Thu, Dec 05, 2013 at 09:41:04AM +0100, Keith J. Schultz wrote:
> >> Hi Scott,
> >> 
> >> We are talking Unicode here right! What is there to guess? 
> > 
> > And how do you, using Unicode, tell in what language is this line
> > written?
> > 
> 
> 
> 
> 




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-09 Thread Keith J. Schultz
Hi Khaled,

your question can not be serious!

It is pretty much in the standard! 

True enough that for most Western languages (American, English, Spanish,
German, Austrian, etc.) this is somewhat difficult. Yet these are not causing
the problems.

regards
Keith.

On 05.12.2013 at 09:46, Khaled Hosny wrote:

> On Thu, Dec 05, 2013 at 09:41:04AM +0100, Keith J. Schultz wrote:
>> Hi Scott,
>> 
>> We are talking Unicode here right! What is there to guess? 
> 
> And how do you, using Unicode, tell in what language is this line
> written?
> 






Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-07 Thread Khaled Hosny
On Sat, Dec 07, 2013 at 03:15:53PM +0100, Zdenek Wagner wrote:
> 2013/12/7 Khaled Hosny :
> > Not at all! I even designed a companion Latin font, so that readers of
> > Latin script can enjoy the same quality and polish of FreeSerif that
> > Arabic script readers enjoy:
> >
> > http://www.khaledhosny.org/files/tmp/freeserif.html
> >
> Do I understand well that your name in Arabic is خالد حسني

Yes.




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-07 Thread Zdenek Wagner
2013/12/7 Khaled Hosny :
> Not at all! I even designed a companion Latin font, so that readers of
> Latin script can enjoy the same quality and polish of FreeSerif that
> Arabic script readers enjoy:
>
> http://www.khaledhosny.org/files/tmp/freeserif.html
>
Do I understand correctly that your name in Arabic is خالد حسني? (I do not
know Arabic; I am just guessing from the calligraphy because I know حسن from
Hindi and Urdu and the dictionary says that the word is of Arabic
origin.)

> Please use both and don’t let Arabic readers have all the joy.
>
> Regards,
> Khaled
>
> On Sat, Dec 07, 2013 at 01:28:31AM +0100, Dominik Wujastyk wrote:
>> I'm sensing, I think, that you don't like that font, Khaled?
>>
>> Dominik :-)
>>
>>
>> 2013/12/5 Khaled Hosny :
>> >
>>
>>
>> > >> > Please, please, please, never ever use GNU free font for Arabic; it is
>> > >> > the most hideous, crappy and useless un-Arabic font ever created, my
>> > >> > blood boils every time I see it in use.
>> > >> >
>> > >> Could you summarize what is wrong and report it?
>> > >
>> > > All of it, the Arabic range is utter crap.
>> > >
>> > >> Steve White will
>> > >> certainly fix it (unless it is better to replace the whole Arabic
>> > >> block).
>> > >
>> > > I did, and even offered to work on replacement, but the offer was turned
>> > > down.
>> > >
>> > >> I see problems with dochachmee he, Urdu words as بھآرت and ؔٹھیک are
>> > >> not displayed properly.
>> > >
>> > > There is no point in looking at the microlevel, the whole thing is
>> > > worthless garbage and should be tossed in the nearest trash bin. Whoever
>> > > designed it has absolutely no idea about Arabic and its design, I take
>> > > it personally and find this garbage an insult to the Arabic script. Show
>> > > a text typeset with it to an Urdu speaker and he is likely to vomit in
>> > > disgust.
>> > >
>> >
>> >
>
>>
>>



-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-07 Thread Khaled Hosny
Not at all! I even designed a companion Latin font, so that readers of
Latin script can enjoy the same quality and polish of FreeSerif that
Arabic script readers enjoy:

http://www.khaledhosny.org/files/tmp/freeserif.html

Please use both and don’t let Arabic readers have all the joy.

Regards,
Khaled

On Sat, Dec 07, 2013 at 01:28:31AM +0100, Dominik Wujastyk wrote:
> I'm sensing, I think, that you don't like that font, Khaled?
> 
> Dominik :-)
> 
> 
> 2013/12/5 Khaled Hosny :
> >
> 
> 
> > >> > Please, please, please, never ever use GNU free font for Arabic; it is
> > >> > the most hideous, crappy and useless un-Arabic font ever created, my
> > >> > blood boils every time I see it in use.
> > >> >
> > >> Could you summarize what is wrong and report it?
> > >
> > > All of it, the Arabic range is utter crap.
> > >
> > >> Steve White will
> > >> certainly fix it (unless it is better to replace the whole Arabic
> > >> block).
> > >
> > > I did, and even offered to work on replacement, but the offer was turned
> > > down.
> > >
> > >> I see problems with dochachmee he, Urdu words as بھآرت and ؔٹھیک are
> > >> not displayed properly.
> > >
> > > There is no point in looking at the microlevel, the whole thing is
> > > worthless garbage and should be tossed in the nearest trash bin. Whoever
> > > designed it has absolutely no idea about Arabic and its design, I take
> > > it personally and find this garbage an insult to the Arabic script. Show
> > > a text typeset with it to an Urdu speaker and he is likely to vomit in
> > > disgust.
> > >
> >
> >

> 
> 





Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-07 Thread Georg Duffner

On 07.12.2013 at 01:28, Dominik Wujastyk wrote:
> I'm sensing, I think, that you don't like that font, Khaled?
>
> Dominik :-)

He’s not alone and the Arabic is not the only problem...

Georg

> 2013/12/5 Khaled Hosny :
> > >> > Please, please, please, never ever use GNU free font for Arabic; it is
> > >> > the most hideous, crappy and useless un-Arabic font ever created, my
> > >> > blood boils every time I see it in use.
> > >> >
> > >> Could you summarize what is wrong and report it?
> > >
> > > All of it, the Arabic range is utter crap.
> > >
> > >> Steve White will
> > >> certainly fix it (unless it is better to replace the whole Arabic
> > >> block).
> > >
> > > I did, and even offered to work on replacement, but the offer was turned
> > > down.
> > >
> > >> I see problems with dochachmee he, Urdu words as بھآرت and ؔٹھیک are
> > >> not displayed properly.
> > >
> > > There is no point in looking at the microlevel, the whole thing is
> > > worthless garbage and should be tossed in the nearest trash bin. Whoever
> > > designed it has absolutely no idea about Arabic and its design, I take
> > > it personally and find this garbage an insult to the Arabic script. Show
> > > a text typeset with it to an Urdu speaker and he is likely to vomit in
> > > disgust.




--
EB Garamond: http://www.georgduffner.at/ebgaramond




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-06 Thread Dominik Wujastyk
I'm sensing, I think, that you don't like that font, Khaled?

Dominik :-)


2013/12/5 Khaled Hosny :
>


> >> > Please, please, please, never ever use GNU free font for Arabic; it is
> >> > the most hideous, crappy and useless un-Arabic font ever created, my
> >> > blood boils every time I see it in use.
> >> >
> >> Could you summarize what is wrong and report it?
> >
> > All of it, the Arabic range is utter crap.
> >
> >> Steve White will
> >> certainly fix it (unless it is better to replace the whole Arabic
> >> block).
> >
> > I did, and even offered to work on replacement, but the offer was turned
> > down.
> >
> >> I see problems with dochachmee he, Urdu words as بھآرت and ؔٹھیک are
> >> not displayed properly.
> >
> > There is no point in looking at the microlevel, the whole thing is
> > worthless garbage and should be tossed in the nearest trash bin. Whoever
> > designed it has absolutely no idea about Arabic and its design, I take
> > it personally and find this garbage an insult to the Arabic script. Show
> > a text typeset with it to an Urdu speaker and he is likely to vomit in
> > disgust.
> >
>
>




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-05 Thread Jonathan Kew

On 5/12/13 12:48, C. Scott Ananian wrote:
> Can anyone point me to docs on XeT--TeX?  A Google the other day failed
> to turn up anything useful.


(TeX--XeT, not XeT--TeX.)

This is part of e-TeX; see the e-TeX manual[1], section 4.1.

HTH,

JK

[1] http://tug.ctan.org/systems/e-tex/v2/doc/etex_man.pdf
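
For a converter that has to emit the TeX--XeT direction primitives (\beginL
and \endL here) around left-to-right material inside a right-to-left
paragraph, the following Python sketch shows one possible shape. It is a
heuristic under stated assumptions, not the Unicode bidi algorithm: the
function name wrap_ltr_runs is hypothetical, a run is taken to start and end
on a strong-L character and to stop before the next strong RTL character,
digits next to a run are left outside it, and TeX special characters are not
escaped. The document is also assumed to enable TeX--XeT (\TeXXeTstate=1)
and to already be in an RTL context.

```python
import unicodedata

STRONG_RTL = {"R", "AL"}  # Bidi_Class values for strong right-to-left characters

def wrap_ltr_runs(text: str) -> str:
    """Wrap left-to-right runs of text in {\\beginL ... \\endL} groups."""
    out = []
    i, n = 0, len(text)
    while i < n:
        if unicodedata.bidirectional(text[i]) != "L":
            out.append(text[i])  # RTL, weak or neutral: copy unchanged
            i += 1
            continue
        # Scan ahead to the next strong RTL character, remembering the last
        # strong-L character seen so trailing spaces stay outside the run.
        last_l = i
        j = i
        while j < n and unicodedata.bidirectional(text[j]) not in STRONG_RTL:
            if unicodedata.bidirectional(text[j]) == "L":
                last_l = j
            j += 1
        out.append("{\\beginL " + text[i:last_l + 1] + "\\endL}")
        i = last_l + 1
    return "".join(out)

arabic = "\u0645\u0631\u062c\u0639"  # an Arabic word, written with escapes
print(wrap_ltr_runs(arabic + " Hello World " + arabic))
```

Keeping the spaces between Latin words inside one run is what prevents the
individual words of an English title from being laid out in RTL order.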


> Also: polyglossia appears to be doing some amount of LTR/RTL
> directionality switching based on the character block.  Can anyone offer
> advice on how to avoid fighting with that, if I'm implementing my own
> bidi algorithm?
>
> Finally: any advice on using CJK languages with polyglossia?  Embedded
> CJK is quite common.  Should I be writing gloss-ja etc files to set the
> right directionality and font and get the appropriate CJK support
> packages loaded?
>    --scott
>
> On Dec 5, 2013 5:42 AM, "Jonathan Kew" <jfkth...@googlemail.com> wrote:
>
>> On 4/12/13 13:24, C. Scott Ananian wrote:
>>
>>> The goal is to match the Unicode bidi algorithm, because that is how the
>>> web page displays and thus how the original author saw the text as they
>>> wrote.
>>
>> This would be a nice enhancement, but would require a significant
>> amount of work (or in other words, it's not likely to get
>> implemented quickly, if at all).
>>
>> Currently, typesetting bidi text with xetex requires correct use of
>> the TeX--XeT bidi commands (\beginR, \endR, \beginL, \endL) to mark
>> up the text direction. These could be used directly, or via
>> higher-level markup that's tagging script and language, but you
>> definitely need them to be present in some way.
>>
>> Sorry, that's not what you want to hear, but it's how things are. At
>> this point, I think the most practical way forward in your situation
>> is probably to implement this as part of whatever tool is taking the
>> wikipedia content and converting it to (Xe)LaTeX markup - that tool
>> could inspect the content of each element it's processing, and add
>> any necessary direction controls for XeTeX.
>>
>> JK
>>
>>> Guessing the proper language tag to use is likely infeasible;
>>> note that the example given contains titles in Turkish as well as
>>> English.  The safest option is probably to treat embedded LTR text in an
>>> RTL context as 'exotic' and not to attempt hyphenation.
>>>
>>> I've heard it said that LuaTeX has "better bidi support".  What does
>>> that mean, exactly? Should I be considering switching?
>>> --scott
>>>
>>> On Dec 4, 2013 4:08 AM, "Keith J. Schultz" <schul...@uni-trier.de> wrote:
>>>
>>> Hi Scott,
>>>
>>> On 03.12.2013 at 19:42, C. Scott Ananian <csc...@cscott.net> wrote:
>>>
>>>  > But in the XeLaTeX/polyglossia/bidi output, the "soft space" weak
>>>  > directionality of the Unicode BiDi algorithm doesn't seem to be
>>>  > honored (or implemented?) and so the English article titles appear
>>>  > with the individual words in RTL order, which is a mess.  Manually
>>>  > tagging the language of the article title is probably the Right thing,
>>>  > but infeasible for the entire wikipedia.
>>>  Well, without proper tagging you can not expect any system to
>>>  work properly or as expected!
>>>  For most entries a simple script should do the trick to add the
>>>  language tags to the article titles.
>>>
>>> Hope this helps
>>>  regards
>>>  Keith.







Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-05 Thread Alan Munn

On Dec 5, 2013, at 7:48 AM, C. Scott Ananian  wrote:

> Can anyone point me to docs on XeT--TeX?  A Google the other day failed to 
> turn up anything useful.

On your TeX system, texdoc xetex gives the main documentation. But the bidi 
documentation, polyglossia documentation and fontspec documentation will also 
be useful.
> 
> Also: polyglossia appears to be doing some amount of LTR/RTL directionality 
> switching based on the character block.  Can anyone offer advice on how to 
> avoid fighting with that, if I'm implementing my own bidi algorithm?

I think polyglossia does switching only based on language. For RTL languages, 
it relies on the bidi package.
> 
> Finally: any advice on using CJK languages with polyglossia?  Embedded CJK is 
> quite common.  Should I be writing gloss-ja etc files to set the right 
> directionality and font and get the appropriate CJK support packages loaded?

There is a separate xeCJK package.  I don’t know how well they all work 
together.

Alan

>   --scott
> 
> On Dec 5, 2013 5:42 AM, "Jonathan Kew"  wrote:
> On 4/12/13 13:24, C. Scott Ananian wrote:
> The goal is to match the Unicode bidi algorithm, because that is how the
> web page displays and thus how the original author saw the text as they
> wrote.
> 
> This would be a nice enhancement, but would require a significant amount of 
> work (or in other words, it's not likely to get implemented quickly, if at 
> all).
> 
> Currently, typesetting bidi text with xetex requires correct use of the 
> TeX--XeT bidi commands (\beginR, \endR, \beginL, \endL) to mark up the text 
> direction. These could be used directly, or via higher-level markup that's 
> tagging script and language, but you definitely need them to be present in 
> some way.
> 
> Sorry, that's not what you want to hear, but it's how things are. At this 
> point, I think the most practical way forward in your situation is probably 
> to implement this as part of whatever tool is taking the wikipedia content 
> and converting it to (Xe)LaTeX markup - that tool could inspect the content 
> of each element it's processing, and add any necessary direction controls for 
> XeTeX.
> 
> JK
> 
> Guessing the proper language tag to use is likely infeasible;
> note that the example given contains titles in Turkish as well as
> English.  The safest option is probably to treat embedded LTR text in an
> RTL context as 'exotic' and not to attempt hyphenation.
> 
> I've heard it said that LuaTeX has "better bidi support".  What does
> that mean, exactly? Should I be considering switching?
>--scott
> 
> On Dec 4, 2013 4:08 AM, "Keith J. Schultz" wrote:
> 
> Hi Scott,
> 
> On 03.12.2013 at 19:42, C. Scott Ananian wrote:
> 
>  >
>  > But in the XeLaTeX/polyglossia/bidi output, the "soft space" weak
>  > directionality of the Unicode BiDi algorithm doesn't seem to be
>  > honored (or implemented?) and so the English article titles appear
>  > with the individual words in RTL order, which is a mess.  Manually
>  > tagging the language of the article title is probably the Right
> thing,
>  > but infeasible for the entire wikipedia.
>  Well, without proper tagging you can not expect any system to
>  work properly or as expected!
>  For most entries a simple script should do the trick to add the
>  language tags to the article titles.
> 
> Hope this helps
>  regards
>  Keith.
> 
> 

-- 
Alan Munn
am...@gmx.com









Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-05 Thread C. Scott Ananian
Can anyone point me to docs on XeT--TeX?  A Google the other day failed to
turn up anything useful.

Also: polyglossia appears to be doing some amount of LTR/RTL directionality
switching based on the character block.  Can anyone offer advice on how to
avoid fighting with that, if I'm implementing my own bidi algorithm?

Finally: any advice on using CJK languages with polyglossia?  Embedded CJK
is quite common.  Should I be writing gloss-ja etc files to set the right
directionality and font and get the appropriate CJK support packages loaded?
  --scott
On Dec 5, 2013 5:42 AM, "Jonathan Kew"  wrote:

> On 4/12/13 13:24, C. Scott Ananian wrote:
>
>> The goal is to match the Unicode bidi algorithm, because that is how the
>> web page displays and thus how the original author saw the text as they
>> wrote.
>>
>
> This would be a nice enhancement, but would require a significant amount
> of work (or in other words, it's not likely to get implemented quickly, if
> at all).
>
> Currently, typesetting bidi text with xetex requires correct use of the
> TeX--XeT bidi commands (\beginR, \endR, \beginL, \endL) to mark up the text
> direction. These could be used directly, or via higher-level markup that's
> tagging script and language, but you definitely need them to be present in
> some way.
>
> Sorry, that's not what you want to hear, but it's how things are. At this
> point, I think the most practical way forward in your situation is probably
> to implement this as part of whatever tool is taking the wikipedia content
> and converting it to (Xe)LaTeX markup - that tool could inspect the content
> of each element it's processing, and add any necessary direction controls
> for XeTeX.
>
> JK
>
>> Guessing the proper language tag to use is likely infeasible;
>> note that the example given contains titles in Turkish as well as
>> English.  The safest option is probably to treat embedded LTR text in an
>> RTL context as 'exotic' and not to attempt hyphenation.
>>
>> I've heard it said that LuaTeX has "better bidi support".  What does
>> that mean, exactly? Should I be considering switching?
>>   --scott
>>
>> On Dec 4, 2013 4:08 AM, "Keith J. Schultz" wrote:
>>
>>> Hi Scott,
>>>
>>> On 03.12.2013 at 19:42, C. Scott Ananian wrote:
>>>
>>>> But in the XeLaTeX/polyglossia/bidi output, the "soft space" weak
>>>> directionality of the Unicode BiDi algorithm doesn't seem to be
>>>> honored (or implemented?) and so the English article titles appear
>>>> with the individual words in RTL order, which is a mess.  Manually
>>>> tagging the language of the article title is probably the Right thing,
>>>> but infeasible for the entire wikipedia.
>>> Well, without proper tagging you cannot expect any system to
>>> work properly or as expected!
>>> For most entries a simple script should do the trick to add the
>>> language tags to the article titles.
>>>
>>> Hope this helps
>>> regards
>>> Keith.


Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-05 Thread Zdenek Wagner
2013/12/5 Khaled Hosny :
> On Thu, Dec 05, 2013 at 12:29:40PM +0100, Zdenek Wagner wrote:
>> 2013/12/5 Khaled Hosny :
>> > On Wed, Dec 04, 2013 at 12:31:58AM -0500, C. Scott Ananian wrote:
>> >> On Tue, Dec 3, 2013 at 5:33 PM, Khaled Hosny wrote:
>> >> > On Tue, Dec 03, 2013 at 01:42:21PM -0500, C. Scott Ananian wrote:
>> >> >> Does XeLaTeX implement the Unicode BiDi algorithm?
>> >> >
>> >> > Short answer: no.
>> >> >
>> >> > I think sample documents (minimal working example) are needed for any
>> >> > useful suggestion.
>> >>
>> >>
>> >> Attached are the first 23 references from
>> >> https://ar.wikipedia.org/wiki/%D8%A5%D8%B3%D8%B7%D9%86%D8%A8%D9%88%D9%84#.D9.85.D8.B5.D8.A7.D8.AF.D8.B1
>> >> (the Arabic wikipedia article on Istanbul), as generated by my XeLaTeX
>> >> formatter.
>> >>
>> >> Things to notice:
>> >> 1) Unicode BiDi algorithm at work in web version in places like
>> >> citation [1], "Statistics of the 2010 Turkey census".  XeLaTeX renders
>> >> this backwards.
>> >
>> > You need to explicitly markup LTR and RTL text, e.g. using polyglossia
>> > or bidi. You need this to enable hyphenation as well, citations 12 and
>> > 13 have very bad spacing because no hyphenation was enabled, for
>> > example. I guess the tool that generates the TeX file will have to do
>> > that.
>> >
>> >> 2) Broken italic for arabic in GNU freefont in citation [2].
>> >> (straightforward to fix)
>> >
>> > Please, please, please, never ever use GNU free font for Arabic; it is
>> > the most hideous, crappy and useless un-Arabic font ever created, my
>> > blood boils every time I see it in use.
>> >
>> Could you summarize what is wrong and report it?
>
> All of it, the Arabic range is utter crap.
>
>> Steve White will
>> certainly fix it (unless it is better to replace the whole Arabic
>> block).
>
> I did, and even offered to work on replacement, but the offer was turned
> down.
>
>> I see problems with do chashmee he: Urdu words such as بھآرت and ؔٹھیک are
>> not displayed properly.
>
> There is no point in looking at the microlevel, the whole thing is
> worthless garbage and should be tossed in the nearest trash bin. Whoever
> designed it has absolutely no idea about Arabic and its design, I take
> it personally and find this garbage an insult to the Arabic script. Show
> a text typeset with it to an Urdu speaker and he is likely to vomit in
> disgust.
>
It is a pity to learn this. Although I have a book of Arabic calligraphy
collected by a native Arabic calligrapher, I know neither Arabic nor
Urdu. I can sometimes read parts of Urdu texts, though, and I can
recognize that some characters are not connected correctly. This
font is useful to me in the text editor when I type multilingual text
containing Czech, English, Hindi and Urdu. If I do not use FreeSerif,
the editor automatically selects some Arabic font which lacks the Urdu
characters. Of course, in XeTeX I use the Nafees fonts for Urdu.

> Regards,
> Khaled



-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-05 Thread Khaled Hosny
On Thu, Dec 05, 2013 at 12:29:40PM +0100, Zdenek Wagner wrote:
> 2013/12/5 Khaled Hosny :
> > On Wed, Dec 04, 2013 at 12:31:58AM -0500, C. Scott Ananian wrote:
> >> On Tue, Dec 3, 2013 at 5:33 PM, Khaled Hosny  wrote:
> >> > On Tue, Dec 03, 2013 at 01:42:21PM -0500, C. Scott Ananian wrote:
> >> >> Does XeLaTeX implement the Unicode BiDi algorithm?
> >> >
> >> > Short answer: no.
> >> >
> >> > I think sample documents (minimal working example) are needed for any
> >> > useful suggestion.
> >>
> >>
> >> Attached are the first 23 references from
> >> https://ar.wikipedia.org/wiki/%D8%A5%D8%B3%D8%B7%D9%86%D8%A8%D9%88%D9%84#.D9.85.D8.B5.D8.A7.D8.AF.D8.B1
> >> (the Arabic wikipedia article on Istanbul), as generated by my XeLaTeX
> >> formatter.
> >>
> >> Things to notice:
> >> 1) Unicode BiDi algorithm at work in web version in places like
> >> citation [1], "Statistics of the 2010 Turkey census".  XeLaTeX renders
> >> this backwards.
> >
> > You need to explicitly markup LTR and RTL text, e.g. using polyglossia
> > or bidi. You need this to enable hyphenation as well, citations 12 and
> > 13 have very bad spacing because no hyphenation was enabled, for
> > example. I guess the tool that generates the TeX file will have to do
> > that.
> >
> >> 2) Broken italic for arabic in GNU freefont in citation [2].
> >> (straightforward to fix)
> >
> > Please, please, please, never ever use GNU free font for Arabic; it is
> > the most hideous, crappy and useless un-Arabic font ever created, my
> > blood boils every time I see it in use.
> >
> Could you summarize what is wrong and report it?

All of it, the Arabic range is utter crap.

> Steve White will
> certainly fix it (unless it is better to replace the whole Arabic
> block).

I did, and even offered to work on replacement, but the offer was turned
down.

> I see problems with do chashmee he: Urdu words such as بھآرت and ؔٹھیک are
> not displayed properly.

There is no point in looking at the microlevel, the whole thing is
worthless garbage and should be tossed in the nearest trash bin. Whoever
designed it has absolutely no idea about Arabic and its design, I take
it personally and find this garbage an insult to the Arabic script. Show
a text typeset with it to an Urdu speaker and he is likely to vomit in
disgust.

Regards,
Khaled




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-05 Thread Zdenek Wagner
2013/12/5 Khaled Hosny :
> On Wed, Dec 04, 2013 at 12:31:58AM -0500, C. Scott Ananian wrote:
>> On Tue, Dec 3, 2013 at 5:33 PM, Khaled Hosny  wrote:
>> > On Tue, Dec 03, 2013 at 01:42:21PM -0500, C. Scott Ananian wrote:
>> >> Does XeLaTeX implement the Unicode BiDi algorithm?
>> >
>> > Short answer: no.
>> >
>> > I think sample documents (minimal working example) are needed for any
>> > useful suggestion.
>>
>>
>> Attached are the first 23 references from
>> https://ar.wikipedia.org/wiki/%D8%A5%D8%B3%D8%B7%D9%86%D8%A8%D9%88%D9%84#.D9.85.D8.B5.D8.A7.D8.AF.D8.B1
>> (the Arabic wikipedia article on Istanbul), as generated by my XeLaTeX
>> formatter.
>>
>> Things to notice:
>> 1) Unicode BiDi algorithm at work in web version in places like
>> citation [1], "Statistics of the 2010 Turkey census".  XeLaTeX renders
>> this backwards.
>
> You need to explicitly markup LTR and RTL text, e.g. using polyglossia
> or bidi. You need this to enable hyphenation as well, citations 12 and
> 13 have very bad spacing because no hyphenation was enabled, for
> example. I guess the tool that generates the TeX file will have to do
> that.
>
>> 2) Broken italic for arabic in GNU freefont in citation [2].
>> (straightforward to fix)
>
> Please, please, please, never ever use GNU free font for Arabic; it is
> the most hideous, crappy and useless un-Arabic font ever created, my
> blood boils every time I see it in use.
>
Could you summarize what is wrong and report it? Steve White will
certainly fix it (unless it is better to replace the whole Arabic
block). I see problems with do chashmee he: Urdu words such as بھآرت and
ؔٹھیک are not displayed properly. I plan to report it and scan samples
from the Urdu-Hindi dictionary. (Two years ago Devanagari in FreeFont
was terrible, but with my reports and testing Steve created beautiful
fonts.)

>> 3) Arabic comma instead of English comma in citation [23]. (in both
>> web and XeLaTeX output)
>
> Bad input, the input has an Arabic comma, no tricks are done here.
>
> Regards,
> Khaled



-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-05 Thread Jonathan Kew

On 4/12/13 13:24, C. Scott Ananian wrote:

> The goal is to match the Unicode bidi algorithm, because that is how the
> web page displays and thus how the original author saw the text as they
> wrote.


This would be a nice enhancement, but would require a significant amount 
of work (or in other words, it's not likely to get implemented quickly, 
if at all).


Currently, typesetting bidi text with xetex requires correct use of the 
TeX--XeT bidi commands (\beginR, \endR, \beginL, \endL) to mark up the 
text direction. These could be used directly, or via higher-level markup 
that's tagging script and language, but you definitely need them to be 
present in some way.


Sorry, that's not what you want to hear, but it's how things are. At 
this point, I think the most practical way forward in your situation is 
probably to implement this as part of whatever tool is taking the 
wikipedia content and converting it to (Xe)LaTeX markup - that tool 
could inspect the content of each element it's processing, and add any 
necessary direction controls for XeTeX.
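
A minimal sketch of such a converter step, in Python, assuming the
paragraph direction is RTL and that it is enough to wrap each run of
left-to-right characters in \beginL ... \endL. The function names are
hypothetical. This is a simplification, not the full Unicode bidi
algorithm (UAX #9): it looks only at strong character classes, so for
example a number at the very end of a Latin run is left outside the run.

```python
import unicodedata

BEGIN_L = r"\beginL "  # TeX--XeT command named in the message above
END_L = r"\endL "


def strong_direction(ch):
    """Return 'L', 'R', or None (weak or neutral) for one character."""
    cls = unicodedata.bidirectional(ch)
    if cls == "L":
        return "L"
    if cls in ("R", "AL"):
        return "R"
    return None


def mark_ltr_runs(text):
    """Wrap each left-to-right run of an RTL paragraph in BEGIN_L ... END_L.

    A run starts at a strong-L character and ends at the last strong-L
    character seen before the next strong-R one, so the spaces, digits and
    punctuation between Latin words stay inside the run.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        if strong_direction(text[i]) != "L":
            out.append(text[i])
            i += 1
            continue
        end = i  # index of the last strong-L character in this run
        j = i
        while j < n and strong_direction(text[j]) != "R":
            if strong_direction(text[j]) == "L":
                end = j
            j += 1
        out.append(BEGIN_L + text[i:end + 1] + END_L)
        i = end + 1
    return "".join(out)
```

For an Arabic citation that embeds "Statistics of the 2010 Turkey
census", the whole English title ends up in a single \beginL ... \endL
group, while the surrounding Arabic text is passed through untouched.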


JK


> Guessing the proper language tag to use is likely infeasible;
> note that the example given contains titles in Turkish as well as
> English.  The safest option is probably to treat embedded LTR text in an
> RTL context as 'exotic' and not to attempt hyphenation.
>
> I've heard it said that LuaTeX has "better bidi support".  What does
> that mean, exactly? Should I be considering switching?
>   --scott
>
> On Dec 4, 2013 4:08 AM, "Keith J. Schultz" <schul...@uni-trier.de> wrote:
>
>> Hi Scott,
>>
>> On 03.12.2013 at 19:42, C. Scott Ananian <csc...@cscott.net> wrote:
>>
>>> But in the XeLaTeX/polyglossia/bidi output, the "soft space" weak
>>> directionality of the Unicode BiDi algorithm doesn't seem to be
>>> honored (or implemented?) and so the English article titles appear
>>> with the individual words in RTL order, which is a mess.  Manually
>>> tagging the language of the article title is probably the Right thing,
>>> but infeasible for the entire wikipedia.
>> Well, without proper tagging you cannot expect any system to
>> work properly or as expected!
>> For most entries a simple script should do the trick to add the
>> language tags to the article titles.
>>
>> Hope this helps
>> regards
>> Keith.




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-05 Thread Khaled Hosny
On Thu, Dec 05, 2013 at 09:41:04AM +0100, Keith J. Schultz wrote:
> Hi Scott,
> 
> We are talking Unicode here right! What is there to guess? 

And how do you, using Unicode, tell what language this line is
written in?
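
A small Python sketch of the point, assuming only the standard
unicodedata module. The helper scripts_used is hypothetical, and it
approximates the script from the first word of each character name
rather than from the real Unicode Script property. A Turkish title and
an English title both come out as Latin, so the code points alone do
not say which hyphenation patterns to load.

```python
import unicodedata


def scripts_used(text):
    """Rough set of scripts in text, taken from the first word of each
    letter's Unicode character name (a heuristic, not the Script property)."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


turkish = "\u0130stanbul n\u00fcfusu"  # Turkish words, Latin script
english = "Istanbul census"
arabic = "\u0625\u0633\u0637\u0646\u0628\u0648\u0644"  # Arabic-script word

print(scripts_used(turkish))  # {'LATIN'}
print(scripts_used(english))  # {'LATIN'}
print(scripts_used(arabic))   # {'ARABIC'}
```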

Regards,
Khaled




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-05 Thread Keith J. Schultz
Hi Scott,

We are talking about Unicode here, right? What is there to guess?

Then there is always the possibility of having the text tagged when it is
written by the original author. Of course, that works only if you can
control the author's input tools.

Lua(La)TeX has another great feature: you have a complete programming
language that you can use to manipulate data/text before it is processed
by TeX, or even after it has been processed by TeX.
This gives easier ways of manipulating and processing text than TeX itself has.

regards
Keith.

On 04.12.2013 at 14:24, C. Scott Ananian wrote:

> The goal is to match the Unicode bidi algorithm, because that is how the web 
> page displays and thus how the original author saw the text as they wrote.  
> Guessing the proper language tag to use is likely infeasible; note that the 
> example given contains titles in Turkish as well as English.  The safest 
> option is probably to treat embedded LTR text in an RTL context as 'exotic' 
> and not to attempt hyphenation.
> 
> I've heard it said that LuaTeX has "better bidi support".  What does that 
> mean, exactly? Should I be considering switching?
>   --scott
> 
> On Dec 4, 2013 4:08 AM, "Keith J. Schultz"  wrote:
> Hi Scott,
> 
> On 03.12.2013 at 19:42, C. Scott Ananian wrote:
> 
> >
> > But in the XeLaTeX/polyglossia/bidi output, the "soft space" weak
> > directionality of the Unicode BiDi algorithm doesn't seem to be
> > honored (or implemented?) and so the English article titles appear
> > with the individual words in RTL order, which is a mess.  Manually
> > tagging the language of the article title is probably the Right thing,
> > but infeasible for the entire wikipedia.
> Well, without proper tagging you can not expect any system to
> work properly or as expected!
> For most entries a simple script should do the trick to add the
> language tags to the article titles.
> 
> Hope this helps
> regards
> Keith.


Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-04 Thread Khaled Hosny
On Wed, Dec 04, 2013 at 11:50:05PM +0100, Zdenek Wagner wrote:
> 2013/12/4 C. Scott Ananian :
> > 3) Arabic comma instead of English comma in citation [23]. (in both
> > web and XeLaTeX output)
> >
> The engine cannot recognize the context if the language is not tagged,
> the comma will always be displayed using the default language.

The engine does nothing special with the comma, it is simply a different
character and was input as such.
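
A short Python check of this, assuming only the standard unicodedata
module. The helper names are hypothetical. It shows that the two commas
are distinct code points, and how a converter could locate the Arabic
one in its input.

```python
import unicodedata

ARABIC_COMMA = "\u060c"


def describe(ch):
    """Return (code point, Unicode name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch))


def arabic_comma_positions(text):
    """Indices at which the text contains U+060C rather than U+002C."""
    return [i for i, ch in enumerate(text) if ch == ARABIC_COMMA]


print(describe(","))           # ('U+002C', 'COMMA')
print(describe(ARABIC_COMMA))  # ('U+060C', 'ARABIC COMMA')
```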

Regards,
Khaled




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-04 Thread Khaled Hosny
On Wed, Dec 04, 2013 at 12:31:58AM -0500, C. Scott Ananian wrote:
> On Tue, Dec 3, 2013 at 5:33 PM, Khaled Hosny  wrote:
> > On Tue, Dec 03, 2013 at 01:42:21PM -0500, C. Scott Ananian wrote:
> >> Does XeLaTeX implement the Unicode BiDi algorithm?
> >
> > Short answer: no.
> >
> > I think sample documents (minimal working example) are needed for any
> > useful suggestion.
> 
> 
> Attached are the first 23 references from
> https://ar.wikipedia.org/wiki/%D8%A5%D8%B3%D8%B7%D9%86%D8%A8%D9%88%D9%84#.D9.85.D8.B5.D8.A7.D8.AF.D8.B1
> (the Arabic wikipedia article on Istanbul), as generated by my XeLaTeX
> formatter.
> 
> Things to notice:
> 1) Unicode BiDi algorithm at work in web version in places like
> citation [1], "Statistics of the 2010 Turkey census".  XeLaTeX renders
> this backwards.

You need to explicitly mark up LTR and RTL text, e.g. using polyglossia
or bidi. You need this to enable hyphenation as well; citations 12 and
13, for example, have very bad spacing because no hyphenation was
enabled. I guess the tool that generates the TeX file will have to do
that.

> 2) Broken italic for arabic in GNU freefont in citation [2].
> (straightforward to fix)

Please, please, please, never ever use GNU free font for Arabic; it is
the most hideous, crappy and useless un-Arabic font ever created, my
blood boils every time I see it in use.

> 3) Arabic comma instead of English comma in citation [23]. (in both
> web and XeLaTeX output)

Bad input, the input has an Arabic comma, no tricks are done here.

Regards,
Khaled




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-04 Thread Andrew Cunningham
Well, the first step is implementing the bidi algorithm, including its
changes in Unicode 6.3, and providing ways of using it, especially being
able to take advantage of bidi isolation.
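
For reference, a minimal Python sketch of what isolation looks like at
the character level, assuming a Python whose unicodedata tables include
Unicode 6.3 (3.4 or later). The helper isolate is hypothetical. It only
wraps a string in the isolate controls added in 6.3; it makes no claim
that XeTeX acts on them.

```python
import unicodedata

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text):
    """Wrap text so that a renderer implementing the 6.3 algorithm
    resolves its direction independently of the surrounding text."""
    return FSI + text + PDI


title = isolate("Statistics of the 2010 Turkey census")
print([unicodedata.name(ch) for ch in (title[0], title[-1])])
# ['FIRST STRONG ISOLATE', 'POP DIRECTIONAL ISOLATE']
```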

Andrew


On 4 December 2013 20:07, Keith J. Schultz  wrote:

> Hi Scott,
>
> On 03.12.2013 at 19:42, C. Scott Ananian wrote:
>
> >
> > But in the XeLaTeX/polyglossia/bidi output, the "soft space" weak
> > directionality of the Unicode BiDi algorithm doesn't seem to be
> > honored (or implemented?) and so the English article titles appear
> > with the individual words in RTL order, which is a mess.  Manually
> > tagging the language of the article title is probably the Right thing,
> > but infeasible for the entire wikipedia.
> Well, without proper tagging you can not expect any system to
> work properly or as expected!
> For most entries a simple script should do the trick to add the
> language tags to the article titles.
>
> Hope this helps
> regards
> Keith.



-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-04 Thread Zdenek Wagner
2013/12/4 C. Scott Ananian :
> The goal is to match the Unicode bidi algorithm, because that is how the web
> page displays and thus how the original author saw the text as they wrote.
> Guessing the proper language tag to use is likely infeasible; note that the
> example given contains titles in Turkish as well as English.  The safest
> option is probably to treat embedded LTR text in an RTL context as 'exotic'
> and not to attempt hyphenation.
>
> I've heard it said that LuaTeX has "better bidi support".  What does that
> mean, exactly? Should I be considering switching?
>   --scott
>
LuaTeX offers various features for the Arabic script, but support for
Indic scripts is missing. If you wish to typeset texts from the Hindi
Wikipedia, LuaTeX cannot be used.

> On Dec 4, 2013 4:08 AM, "Keith J. Schultz"  wrote:
>>
>> Hi Scott,
>>
>> On 03.12.2013 at 19:42, C. Scott Ananian wrote:
>>
>> >
>> > But in the XeLaTeX/polyglossia/bidi output, the "soft space" weak
>> > directionality of the Unicode BiDi algorithm doesn't seem to be
>> > honored (or implemented?) and so the English article titles appear
>> > with the individual words in RTL order, which is a mess.  Manually
>> > tagging the language of the article title is probably the Right thing,
>> > but infeasible for the entire wikipedia.
>> Well, without proper tagging you can not expect any system to
>> work properly or as expected!
>> For most entries a simple script should do the trick to add the
>> language tags to the article titles.
>>
>> Hope this helps
>> regards
>> Keith.



-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-04 Thread Zdenek Wagner
2013/12/4 C. Scott Ananian :
> On Tue, Dec 3, 2013 at 5:33 PM, Khaled Hosny  wrote:
>> On Tue, Dec 03, 2013 at 01:42:21PM -0500, C. Scott Ananian wrote:
>>> Does XeLaTeX implement the Unicode BiDi algorithm?
>>
>> Short answer: no.
>>
>> I think sample documents (minimal working example) are needed for any
>> useful suggestion.
>
>
> Attached are the first 23 references from
> https://ar.wikipedia.org/wiki/%D8%A5%D8%B3%D8%B7%D9%86%D8%A8%D9%88%D9%84#.D9.85.D8.B5.D8.A7.D8.AF.D8.B1
> (the Arabic wikipedia article on Istanbul), as generated by my XeLaTeX
> formatter.
>
> Things to notice:
> 1) Unicode BiDi algorithm at work in web version in places like
> citation [1], "Statistics of the 2010 Turkey census".  XeLaTeX renders
> this backwards.
> 2) Broken italic for arabic in GNU freefont in citation [2].
> (straightforward to fix)

GNU FreeSerif does not contain Arabic in the italic shape; only the
regular and bold shapes are supported.

> 3) Arabic comma instead of English comma in citation [23]. (in both
> web and XeLaTeX output)
>
The engine cannot recognize the context if the language is not tagged;
the comma will always be displayed using the default language.

> Item #1 is the one I'd really appreciate suggestions for fixing.
>   --scott
>
> --
>  ( http://cscott.net/ )
>



-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-04 Thread C. Scott Ananian
The goal is to match the Unicode bidi algorithm, because that is how the
web page displays and thus how the original author saw the text as they
wrote.  Guessing the proper language tag to use is likely infeasible; note
that the example given contains titles in Turkish as well as English.  The
safest option is probably to treat embedded LTR text in an RTL context as
'exotic' and not to attempt hyphenation.

I've heard it said that LuaTeX has "better bidi support".  What does that
mean, exactly? Should I be considering switching?
  --scott
On Dec 4, 2013 4:08 AM, "Keith J. Schultz"  wrote:

> Hi Scott,
>
> On 03.12.2013 at 19:42, C. Scott Ananian wrote:
>
> >
> > But in the XeLaTeX/polyglossia/bidi output, the "soft space" weak
> > directionality of the Unicode BiDi algorithm doesn't seem to be
> > honored (or implemented?) and so the English article titles appear
> > with the individual words in RTL order, which is a mess.  Manually
> > tagging the language of the article title is probably the Right thing,
> > but infeasible for the entire wikipedia.
> Well, without proper tagging you can not expect any system to
> work properly or as expected!
> For most entries a simple script should do the trick to add the
> language tags to the article titles.
>
> Hope this helps
> regards
> Keith.


Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-04 Thread C. Scott Ananian
On Tue, Dec 3, 2013 at 5:33 PM, Khaled Hosny  wrote:
> On Tue, Dec 03, 2013 at 01:42:21PM -0500, C. Scott Ananian wrote:
>> Does XeLaTeX implement the Unicode BiDi algorithm?
>
> Short answer: no.
>
> I think sample documents (minimal working example) are needed for any
> useful suggestion.


Attached are the first 23 references from
https://ar.wikipedia.org/wiki/%D8%A5%D8%B3%D8%B7%D9%86%D8%A8%D9%88%D9%84#.D9.85.D8.B5.D8.A7.D8.AF.D8.B1
(the Arabic wikipedia article on Istanbul), as generated by my XeLaTeX
formatter.

Things to notice:
1) Unicode BiDi algorithm at work in web version in places like
citation [1], "Statistics of the 2010 Turkey census".  XeLaTeX renders
this backwards.
2) Broken italic for arabic in GNU freefont in citation [2].
(straightforward to fix)
3) Arabic comma instead of English comma in citation [23]. (in both
web and XeLaTeX output)

Item #1 is the one I'd really appreciate suggestions for fixing.
  --scott

-- 
 ( http://cscott.net/ )


arabic-sm.tex
Description: TeX document




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-04 Thread Keith J. Schultz
Hi Scott,

On 03.12.2013 at 19:42, C. Scott Ananian wrote:

> 
> But in the XeLaTeX/polyglossia/bidi output, the "soft space" weak
> directionality of the Unicode BiDi algorithm doesn't seem to be
> honored (or implemented?) and so the English article titles appear
> with the individual words in RTL order, which is a mess.  Manually
> tagging the language of the article title is probably the Right thing,
> but infeasible for the entire wikipedia.
Well, without proper tagging you cannot expect any system to
work properly or as expected!
For most entries a simple script should do the trick of adding the
language tags to the article titles.

Hope this helps
regards
Keith.




Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-03 Thread Zdenek Wagner
2013/12/3 Khaled Hosny :
> On Tue, Dec 03, 2013 at 01:42:21PM -0500, C. Scott Ananian wrote:
>> Does XeLaTeX implement the Unicode BiDi algorithm?
>
> Short answer: no.
>
> Long answer: XeTeX, more or less, breaks words at spaces or other
> non-character material (spaces in TeX are converted to the so called
> glue, so are not handled as characters) and applies the Unicode BiDi
> algorithm to each word separately, which effectively means it is just
> used to determine the direction of the individual word.
>
>> If so, why isn't it working (I can provide a TeX sample)?  If not,
>> does anyone have any suggestions for workarounds -- other than
>> implementing the BiDi algorithm myself and adding explicit \RL and \LR
>> commands?
>
> I think sample documents (minimal working example) are needed for any
> useful suggestion.
>
I do not want to give any suggestion without analysing sample
documents, but you should remember that without proper language tagging
XeTeX will not use the correct hyphenation patterns (\patterns).

> Regards,
> Khaled



-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] xetex and the unicode bidirectional algorithm.

2013-12-03 Thread Khaled Hosny
On Tue, Dec 03, 2013 at 01:42:21PM -0500, C. Scott Ananian wrote:
> Does XeLaTeX implement the Unicode BiDi algorithm?

Short answer: no.

Long answer: XeTeX, more or less, breaks words at spaces or other
non-character material (spaces in TeX are converted to the so called
glue, so are not handled as characters) and applies the Unicode BiDi
algorithm to each word separately, which effectively means it is just
used to determine the direction of the individual word.
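
The effect can be illustrated with a toy model in Python. This is not
XeTeX's actual code; the function names are hypothetical, and words are
treated as indivisible units laid out in an RTL paragraph. Placing each
word on its own reverses the word sequence, whereas paragraph-level
resolution keeps a run of consecutive LTR words in reading order. The
model assumes that a word with no strong character, such as a number,
joins the run before it.

```python
import unicodedata


def word_direction(word):
    """Direction of the first strong character in word, or None."""
    for ch in word:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "L"
        if cls in ("R", "AL"):
            return "R"
    return None


def visual_per_word(words):
    """Left-to-right visual order when every word is placed separately
    in an RTL paragraph: the word sequence is simply reversed."""
    return list(reversed(words))


def visual_grouped(words):
    """Left-to-right visual order when consecutive LTR words form one
    run that keeps its internal order."""
    runs = []
    prev = "R"  # paragraph direction
    for w in words:
        d = word_direction(w) or prev  # digits etc. join the run before them
        if runs and runs[-1][0] == d == "L":
            runs[-1][1].append(w)
        else:
            runs.append((d, [w]))
        prev = d
    visual = []
    for _, run in reversed(runs):
        visual.extend(run)
    return visual


words = ["\u0645\u0635\u062f\u0631", "Statistics", "of", "the",
         "2010", "Turkey", "census"]
print(visual_per_word(words)[:6])
# ['census', 'Turkey', '2010', 'the', 'of', 'Statistics']
print(visual_grouped(words)[:6])
# ['Statistics', 'of', 'the', '2010', 'Turkey', 'census']
```

The first result matches the symptom reported in this thread, with the
English words of a title appearing in RTL order.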

> If so, why isn't it working (I can provide a TeX sample)?  If not,
> does anyone have any suggestions for workarounds -- other than
> implementing the BiDi algorithm myself and adding explicit \RL and \LR
> commands?

I think sample documents (minimal working example) are needed for any
useful suggestion.

Regards,
Khaled

