Re: [XeTeX] New feature REQUEST for xetex

2016-02-23 Thread ShreeDevi Kumar
I am attaching a sample PDF and its OCRed text, produced with Tesseract OCR
(https://github.com/tesseract-ocr/tesseract).

The resulting PDF allows search as well as copy and paste of Devanagari
Unicode text.

The PDF is rendered using the original image, but the OCRed text is
available as a text layer, making it a searchable PDF. I do not think it uses
'ActualText' but I could be wrong. It allows searching for letters or partial
words, but the highlight is only in the ballpark, not always on the exact letter.

(Please note that search may not find the original text as displayed in the
PDF, because OCR is not accurate for Devanagari.)





ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Feb 23, 2016 at 4:22 PM, Jonathan Kew  wrote:

> On 23/2/16 10:37, Zdenek Wagner wrote:
>
>> Hi Jonathan,
>>
>> how do you put the ActualText to PDF? Is it per syllable, or per word?
>>
>
> Per word.
>
>> We have a commercial OCR software that can convert scanned PDF to pages
>> with selectable texts. I have not examined it thoroughly but it seems to
>> me that it analyzes the scanned image, splits it to subimages "per word"
>> and attaches ActualText to each word. In such a way it is impossible to
>> select just a group of characters, the smallest entity that can be
>> copied & pasted (or searched for) is a word. It might fix the
>> highlighting problem but I am just guessing.
>>
>
> I don't think so. Even single-syllable words like भी don't highlight well
> in the example.
>
> (FWIW, it is possible to search for a substring within a word, and Acrobat
> finds it OK, but it can't accurately highlight what's been found; you get
> the same (inaccurate) highlighting of the word regardless of what substring
> within it was searched.)
>
> Setting ActualText per syllable would make finer-grained copy/paste
> possible (currently, entire words are always copied), but would be
> significantly more complex to implement (as well as adding to the PDF file
> bloat). I think the per-word version should be a useful start, at least.

Re: [XeTeX] New feature REQUEST for xetex

2016-02-23 Thread Andrew Cunningham
They don't solve it.

On Tuesday, 23 February 2016, Mojca Miklavec 
wrote:
> Just curious: how does Word or InDesign solve such problems (once a
> PDF gets generated)?
>
> Mojca
>
>
> --
> Subscriptions, Archive, and List information, etc.:
>   http://tug.org/mailman/listinfo/xetex
>

-- 
Andrew Cunningham
lang.supp...@gmail.com


--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] New feature REQUEST for xetex

2016-02-23 Thread Jonathan Kew

On 23/2/16 10:37, Zdenek Wagner wrote:

> Hi Jonathan,
>
> how do you put the ActualText to PDF? Is it per syllable, or per word?

Per word.

> We have a commercial OCR software that can convert scanned PDF to pages
> with selectable texts. I have not examined it thoroughly but it seems to
> me that it analyzes the scanned image, splits it to subimages "per word"
> and attaches ActualText to each word. In such a way it is impossible to
> select just a group of characters, the smallest entity that can be
> copied & pasted (or searched for) is a word. It might fix the
> highlighting problem but I am just guessing.

I don't think so. Even single-syllable words like भी don't highlight
well in the example.

(FWIW, it is possible to search for a substring within a word, and
Acrobat finds it OK, but it can't accurately highlight what's been
found; you get the same (inaccurate) highlighting of the word regardless
of what substring within it was searched.)

Setting ActualText per syllable would make finer-grained copy/paste
possible (currently, entire words are always copied), but would be
significantly more complex to implement (as well as adding to the PDF
file bloat). I think the per-word version should be a useful start, at
least.







Re: [XeTeX] New feature REQUEST for xetex

2016-02-23 Thread Zdenek Wagner
Hi Jonathan,

how do you put the ActualText into the PDF? Is it per syllable, or per word? We
have commercial OCR software that can convert scanned PDFs to pages with
selectable text. I have not examined it thoroughly, but it seems to me that
it analyzes the scanned image, splits it into subimages "per word" and
attaches ActualText to each word. That way it is impossible to select
just a group of characters; the smallest entity that can be copied & pasted
(or searched for) is a word. It might fix the highlighting problem but I am
just guessing.


Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz



Re: [XeTeX] New feature REQUEST for xetex

2016-02-23 Thread ShreeDevi Kumar
Wow! This is wonderful, Jonathan.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com



Re: [XeTeX] New feature REQUEST for xetex

2016-02-23 Thread Jonathan Kew

On 23/2/16 02:54, Andrew Cunningham wrote:

> It would probably more than double; I was under the impression that
> ActualText was a tag attribute, so extensive tagging would be needed,
> and actual text added to the tags.

The ActualText tagging is highly compressible, so in practice the
increase in overall PDF size is not all that great.

> But the question is how to practically make use of ActualText if there
> is a visible text layer.
>
> PDF/UA for instance leaves the question deliberately ambiguous.
> ActualText is the way to make the content accessible, but developers
> creating tools for PDF do not actually have to process the ActualText.
>
> So to index and search PDF files you need to build a discovery system
> utilising tools that allow you to specify the use of ActualText in
> preference to a visible text layer.

Acrobat Reader uses it, if present, so that Copy/Paste from the PDF
results in the correct Unicode text (more or less), and Find behaves as
expected.

Other PDF readers (such as Apple's Preview) may well ignore the
ActualText tagging, in which case it doesn't help. I don't know whether
tools like Evince or Okular handle it.



I'm attaching two sample PDFs with a simple chunk of Hindi text (from 
the Unicode web site). The first, dev-old.pdf, is what XeTeX currently 
generates (using the "Annapurna SIL" OpenType font). In general, 
Copy/Paste and text search don't work very well -- a few characters may 
be OK, but others are junk.


The second sample, dev-actualtext.pdf, was generated with an 
experimental new \XeTeXgenerateactualtext feature, which automatically 
"tags" each word with an ActualText representation.
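
For readers unfamiliar with the mechanism, here is a rough sketch of what
per-word ActualText tagging can look like in a PDF content stream: the glyphs
of a word are wrapped in a marked-content sequence whose property list carries
the Unicode text. The helper name `actualtext_span` and the glyph operators
passed to it are made up for illustration, and this is not necessarily how the
experimental \XeTeXgenerateactualtext code writes its output.

```python
import codecs


def actualtext_span(word: str, glyph_ops: str) -> str:
    """Wrap already-shaped glyph operators in a /Span carrying ActualText.

    The text is written as a PDF hex string in UTF-16BE with a leading
    byte-order mark, one of the ways PDF encodes Unicode text strings.
    """
    payload = codecs.BOM_UTF16_BE + word.encode("utf-16-be")
    hex_text = payload.hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{glyph_ops}\nEMC"


# "<01A2> Tj" is a hypothetical stand-in for whatever glyph IDs the
# shaper produced for the word; a real font would use its own IDs.
word = "\u092d\u0940"  # the word भी (bha + ii-matra)
print(actualtext_span(word, "<01A2> Tj"))
```

A reader that honours ActualText can return the tagged string for the whole
span instead of resolving the glyphs one by one, which would explain why
copying and highlighting operate on entire words.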


Some points to note:

- The file size is 24662 bytes, while dev-old was 22875 bytes (1787 bytes
more, roughly 8%). Not too bad. Of course, a lot of that is the embedded
font data; with longer documents that have lots of text but only a few
fonts, the relative difference would presumably be somewhat greater.


- Copy/Paste and Search work pretty well in Acrobat Reader. Not in 
Preview.app.


- Highlighting of selected text (in Acrobat Reader) is somewhat broken, 
apparently due to the ActualText tagging (it looks better in dev-old). 
This may be fixable by tweaking exactly how the tagging is written into 
the PDF; I haven't investigated it further.



No guarantees at this point as to whether/when this feature will 
actually be available. It was just a quick attempt to hack something up, 
to see how promising the results might be...


JK



dev-old.pdf
Description: Adobe PDF document


dev-actualtext.pdf
Description: Adobe PDF document




Re: [XeTeX] New feature REQUEST for xetex

2016-02-23 Thread Mojca Miklavec
Just curious: how does Word or InDesign solve such problems (once a
PDF gets generated)?

Mojca




Re: [XeTeX] New feature REQUEST for xetex

2016-02-23 Thread Zdenek Wagner
Hi all,

Several years ago I did some texts with pdflatex and the devnag package
(XeTeX did not exist at that time); it is still here:
http://icebearsoft.euweb.cz/dvngpdf/

The situation in the Indic scripts is much more complex and cannot be
solved by a ToUnicode map. Half-consonants can be mapped to a consonant
followed by a virama. Conjuncts such as ksha can be mapped to ka + virama + sha.
The problem is with reordering. I will give examples in Hindi only because
I do not know other Indic languages.

Take the word kitaab (= किताब, meaning a book). The correct character order
is ka + i-matra + ta + aa-matra + ba, but in the visual representation the
glyphs are ordered as i-matra + ka + ta + aa-matra + ba. You cannot blindly
move the i-matra behind the following consonant. The word shakti (= शक्ति,
force) is sha + ka + virama + ta + i-matra in character order but visually sha +
i-matra + {kta-conjunct | half-ka + ta}, where the second form is usually
preferred in present-day Hindi. Even stranger reorderings exist: marzii is
ma + ra + virama + za + ii-matra in character order but visually ma + za +
ii-matra + hook-repha.
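
The kitaab and shakti cases can be checked with a few lines of Python. This is
only a sketch: `naive_visual_order` is a made-up helper that applies the
"blind" rule of moving each short i-matra in front of the single consonant
before it, to show where that rule works and where it does not. It is not how
a shaping engine actually reorders.

```python
import unicodedata

I_MATRA = "\u093f"  # DEVANAGARI VOWEL SIGN I, drawn to the left of its consonant


def is_consonant(ch: str) -> bool:
    # Devanagari consonants KA..HA occupy U+0915..U+0939.
    return "\u0915" <= ch <= "\u0939"


def naive_visual_order(text: str) -> str:
    """Move each short i-matra in front of the one consonant preceding it.

    Deliberately simplistic: it knows nothing about conjuncts,
    half-forms or repha.
    """
    chars = list(text)
    for i in range(1, len(chars)):
        if chars[i] == I_MATRA and is_consonant(chars[i - 1]):
            chars[i - 1], chars[i] = chars[i], chars[i - 1]
    return "".join(chars)


kitaab = "\u0915\u093f\u0924\u093e\u092c"  # ka, i-matra, ta, aa-matra, ba
shakti = "\u0936\u0915\u094d\u0924\u093f"  # sha, ka, virama, ta, i-matra

# Character (logical) order of kitaab, one code point per line.
for ch in kitaab:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Works for kitaab: the i-matra ends up in front of ka.
print([f"U+{ord(c):04X}" for c in naive_visual_order(kitaab)])

# Fails for shakti: the i-matra lands in front of ta only, whereas
# visually it precedes the whole k+ta cluster.
print([f"U+{ord(c):04X}" for c in naive_visual_order(shakti)])
```

Going in the other direction, from glyph order back to character order, has
the same difficulty: the position of the i-matra alone does not say how many
following glyphs belong to its syllable.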

The case of two-part vowels in some scripts is difficult too. You have
generally the following scheme:

vowel-part-1 + consonant-group or conjunct + vowel-part-2

Both parts exist as separate glyphs mapped to other characters, so you
must know whether a glyph represents a character on its own or whether two glyphs
compose a two-part vowel.

These are not things that could be solved by simple ToUnicode maps. On the
other hand, it is not necessary to attach ActualText to every word, but certainly
to a great many words.


Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz



Re: [XeTeX] New feature REQUEST for xetex

2016-02-22 Thread Andrew Cunningham
Simon,

On 23 February 2016 at 14:12, Simon Cozens  wrote:

> On 23/02/2016 13:54, Andrew Cunningham wrote:
> > PDF/UA for instance leaves the question deliberately ambiguous.
> > ActualText is the way to make the content accessible, but developers
> > creating tools for PDF do not actually have to process the ActualText.
>
> Yeah. (Sorry to keep banging the drum but) I've just done some tests
> with SILE, which includes some support for tagged/accessible PDFs. Even
> when the ActualText includes the correct Devanagari, I am still seeing
> the same problems with cut-and-paste. I'm not sure what needs to be done
> to get it right.
>
>
In terms of SILE ... supporting generation of other formats like XPS as an
alternative to PDF is probably the only way forward for complex script
languages.

If SILE is tagging the PDFs and adding ActualText attributes, then it is
doing everything it should be doing. The problems are with the PDF
specification itself: what it was originally designed to be (a pre-print
format based on the PostScript language) and the limitations placed on it
by the developers of the spec.

Andrew




Re: [XeTeX] New feature REQUEST for xetex

2016-02-22 Thread Andrew Cunningham
On 23 February 2016 at 14:58, ShreeDevi Kumar  wrote:

>> It is not only the problem of copy&paste, you will not be able to use
> the search dialog in Acrobat. For instance, you will not be able to find
> किताब.
>
> Yes, you are right. Search does not work for unicode fonts for complex
> scripts in the current pdfs.
>
> Hence the request ...
>
It isn't a XeTeX/XeLaTeX issue ... it is an inherent limitation of the PDF
file format.

The best XeTeX could do is try to:

1) provide ways of controlling whether fonts are subsetted, and allow
editing or customisation of ToUnicode mappings, on one hand; and

2) add some tagging algorithms and add ActualText attributes.

But this isn't going to get you very far without creating a whole
information ecosystem for the PDFs.

Alternatively, it could add support for an alternative format that
doesn't suffer the same problems as PDF documents.

Andrew

-- 
Andrew Cunningham
lang.supp...@gmail.com




Re: [XeTeX] New feature REQUEST for xetex

2016-02-22 Thread Andrew Cunningham
PDF text is essentially a sequence of glyphs, and uses the ToUnicode
mappings to resolve them to Unicode codepoints.

For OpenType fonts, it will apply to any glyphs that are not default glyphs
assigned specific codepoints, true ligatures or variation selectors, so in
theory for complex scripts it could include many if not most glyphs in a
font, depending on how sophisticated the typography and font design is.
The reality is that you get better performance in PDFs using "dumb", simple fonts.

PDF accessibility has two stages:

1) The first stage (best supported) is the ToUnicode mapping: essentially, text
in a PDF is just a sequence of glyphs, and it is the ToUnicode mapping that
resolves them to real Unicode codepoints.

But the ToUnicode mapping can only map one glyph to one codepoint or one
glyph to a sequence of codepoints (for ligatures and variation selectors).
The documentation on PDFs seems to spend a lot of time discussing the ins
and outs of this in reference to CID fonts.

For OpenType fonts, I assume that the cmap table is the basis of the
ToUnicode mapping. In OpenType fonts not all glyphs will have mappings to
Unicode codepoints.

Likewise, PDFs are the end result of the rendering process; PDF tools
cannot handle the reordering and certain types of substitution that result in the
final rendered string.

2) The second stage in accessible PDFs is the use of ActualText ... but
customised dedicated tools are needed.

As indicated, cutting and pasting operations will not work, since they
operate on the text layer, not the ActualText, and I suspect that is
unlikely to change. Adobe's APIs for screen readers, by contrast, will use the
ActualText layer.
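
To make the first-stage limitation concrete, here is a small simulation of a
reader that relies on ToUnicode alone. The glyph IDs and the `TO_UNICODE`
table are invented for the example (a real font assigns its own IDs); the
point is only that a per-glyph lookup applied in drawing order cannot restore
the character order.

```python
# Hypothetical glyph IDs mapped to the codepoints they stand for.
TO_UNICODE = {
    10: "\u0915",  # ka
    11: "\u093f",  # short i-matra
    12: "\u0924",  # ta
    13: "\u093e",  # aa-matra
    14: "\u092c",  # ba
}


def extract_text(glyph_ids):
    """Resolve glyphs one by one, in drawing order, via the mapping only."""
    return "".join(TO_UNICODE[g] for g in glyph_ids)


logical = "\u0915\u093f\u0924\u093e\u092c"  # the word kitaab as typed
drawn = [11, 10, 12, 13, 14]                # the i-matra is drawn before ka

extracted = extract_text(drawn)
print(extracted == logical)   # False: the i-matra now precedes ka
print(logical in extracted)   # False: a search for the typed word misses
```

Every individual glyph maps to the right codepoint here, yet both copy/paste
and search fail, because the mapping has no way to express reordering.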

If cutting and pasting is an important use case for you, then PDF is the
wrong file format for you. PDFs are a pre-print format, not an archival
format, despite all the rhetoric about PDF/A, PDF/UA, etc. Or, more
precisely, PDF is only ever going to be an archival format for a certain set
of languages in certain scripts with non-OpenType fonts, or for documents that
avoid using certain OpenType features.

Andrew




Re: [XeTeX] New feature REQUEST for xetex

2016-02-22 Thread ShreeDevi Kumar
>> the problem is caused just by a few characters, especially the short
>> i-matra. It might be more difficult in other Indic scripts containing
>> two-part vowels.

It is more extensive and applies to all/most glyphs used for conjuncts in
addition to the short i-matra. It also applies to other Indic scripts as
well as other complex scripts.

The example below shows how the conjuncts get copied and displayed as square
boxes. It is also font dependent.

नमऽे ुगेदूसाजु ंगाराः क ु ुराः वाः । अनािरराः ससाः िशवाा
भजािधपाीक ु ृताा भवि ॥ १॥

>> It might be useful to use ActualText only for selected words.

That might work for a predominantly English text with some Devanagari, but
not for full Devanagari texts.

>> It is not only the problem of copy&paste, you will not be able to use
>> the search dialog in Acrobat. For instance, you will not be able to find
>> किताब.

Yes, you are right. Search does not work for Unicode fonts for complex
scripts in the current PDFs.

Hence the request ...

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
Hi all,

the problem is caused just by a few characters, especially the short
i-matra. It might be more difficult in other Indic scripts containing
two-part vowels. The reason is that visually they appear in a different
order than they should appear in Unicode representation. It can be solved
by using ActualText. If all words were entered this way, the size of the
PDF will double. It might be useful to use ActualText only for selected
words.

It is not only the problem of copy&paste, you will not be able to use the
search dialog in Acrobat. For instance, you will not be able to find किताब.



Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz

2016-02-22 14:38 GMT+01:00 ShreeDevi Kumar :

> Hi Jonathan,
>
> I am using xetex/xelatex for typesetting of devanagari texts.
> eg. http://sanskritdocuments.org/doc_devii/gangAShTakamkAlidAsa.pdf
> http://sanskritdocuments.org/doc_devii/gangAShTakamkAlidAsa.html?lang=sa
> (HTML TEXT version of the same)
>
> However, when the Devanagari text is copied from the PDF, it does not
> display correctly - which is the case with complex scripts with most PDF
> creators (as far as I know).
>
> eg.
> ॥ गङ्गाष्टकं कालिदासकृतम् ॥
> is displayed as
> ॥ गाकं कािलदासकृतम ॥
>
> Is it possible to add a feature to xetex to support search and copy of
> complex script text in scripts such as devanagari?
>
> It would really be great to have this "coming soon to a XeTeX near
> you"... :-)
>
> Thanks.
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
>
> On Thu, Feb 18, 2016 at 4:28 PM,
> ​​
> Jonathan Kew  wrote:
>
>> This is a pretty specialized feature, likely to be interest only to a
>> small minority of users. But for those it concerns, here's something that
>> is
>> ​​
>> "coming soon to a XeTeX near you"...
>>
>>
>> I've recently implemented a new feature, controlled by the integer
>> parameter \XeTeXinterwordspaceshaping. This will be available in the TL'16
>> release, if all goes well.
>>
>> This feature is relevant only when using OpenType/Graphite/AAT fonts, not
>> legacy .tfm-based fonts.
>>
>> When \XeTeXinterwordspaceshaping is greater than 0, XeTeX will attempt to
>> support fonts where the width of inter-word spaces may vary contextually,
>> depending on the preceding and following text. This is needed by fonts such
>> as SIL's Awami Nastaliq (in development) where words are expected to kern
>> together across spaces.
>>
>> The default behavior of xetex is to measure each word in isolation, and
>> simply string together a sequence of such word and space (glue) nodes to
>> form the horizontal list that is then line-broken to form a paragraph.
>> Normally, when inter-word spaces do not depend on the adjacent words, this
>> works fine; but in Awami the width of inter-word spaces may vary
>> drastically, even becoming negative in some cases.
>>
>> Setting \XeTeXinterwordspaceshaping=1 tells xetex to measure such spaces
>> "in context" and take account of the contextually-modified widths during
>> line breaking. This greatly improves the typeset result with such a font.
>> Each word is still shaped and rendered individually, but line-breaking and
>> word spacing respects the inter-word kerning.
>>
>> A further complication occurs when not only the width of the space but
>> also the glyphs of the adjacent words themselves may be subject to
>> contextual changes. An example of this would be a font that has OpenType
>> ligature rules that apply to multiple-word sequences; e.g. a symbol font
>> that ligates the text "credit card" to render a credit-card icon. Another
>> example is the word-final swash forms in Hoefler Italic, which are intended
>> to be used at end-of-line but NOT before word spaces within the line.
>>
>> These cases are addressed with 

Re: [XeTeX] New feature REQUEST for xetex

2016-02-22 Thread Simon Cozens
On 23/02/2016 13:54, Andrew Cunningham wrote:
> PDF/UA for instance leaves the question deliberately ambiguous.
> ActualText is the way to make the content accessible, but developers
> creating tools for PDF do not actually have to process the ActualText.

Yeah. (Sorry to keep banging the drum but) I've just done some tests
with SILE, which includes some support for tagged/accessible PDFs. Even
when the ActualText includes the correct Devanagari, I am still seeing
the same problems with cut-and-paste. I'm not sure what needs to be done
to get it right.


--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] New feature REQUEST for xetex

2016-02-22 Thread Andrew Cunningham
It would probably more than double. I was under the impression that
ActualText was a tag attribute, so extensive tagging would be needed, with
the actual text added to the tags.
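
As far as I understand the PDF specification, ActualText can also be given
in the property list of a /Span marked-content sequence in the page content
stream, so a full structure tree may not be strictly required. Below is a
minimal Python sketch of what such a sequence looks like. The function name
actualtext_span and the glyph IDs in the usage line are hypothetical
placeholders, not output from XeTeX or any other tool.

```python
def actualtext_span(text: str, content_ops: str) -> str:
    """Wrap PDF content-stream operators in a /Span that carries ActualText."""
    # A PDF text string may be UTF-16BE, prefixed with the byte order mark FE FF,
    # written here as a hexadecimal string between angle brackets.
    hex_string = "FEFF" + text.encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_string}> >> BDC\n{content_ops}\nEMC"


# Logical (Unicode) order of the word "kitab": KA, VOWEL SIGN I, TA, VOWEL SIGN AA, BA.
kitab = "\u0915\u093f\u0924\u093e\u092c"

# The glyph IDs inside <...> Tj are made up for illustration only.
print(actualtext_span(kitab, "<0012003400560078> Tj"))
```

Every word wrapped this way carries its text twice (once as glyphs, once as
ActualText), which is where the size estimate discussed here comes from.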

But the question is how to practically make use of ActualText if there is a
visible text layer.

PDF/UA for instance leaves the question deliberately ambiguous. ActualText
is the way to make the content accessible, but developers creating tools
for PDF do not actually have to process the ActualText.

So to index and search PDF files you need to build a discovery system
utilising tools that allow you to specify the use of ActualText in
preference to a visible text layer.

Andrew

Re: [XeTeX] New feature REQUEST for xetex

2016-02-22 Thread Zdenek Wagner
Hi all,

the problem is caused just by a few characters, especially the short
i-matra. It might be more difficult in other Indic scripts containing
two-part vowels. The reason is that visually they appear in a different
order than they should appear in Unicode representation. It can be solved
by using ActualText. If all words were entered this way, the size of the
PDF would double. It might be useful to use ActualText only for selected
words.

It is not only a problem of copy & paste; you would also be unable to use
the search dialog in Acrobat. For instance, you would not be able to find किताब.
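
The reordering can be illustrated with a small Python sketch. It is a
deliberately naive model, not what any PDF viewer actually does: it only
swaps the short i-matra with the single character before it, and ignores
conjuncts and two-part vowels. The function name to_visual_order is made up
for this illustration.

```python
import unicodedata

I_MATRA = "\u093f"  # DEVANAGARI VOWEL SIGN I (the short i-matra)


def to_visual_order(text: str) -> str:
    """Naively move every short i-matra in front of the character before it."""
    chars = list(text)
    for i in range(1, len(chars)):
        if chars[i] == I_MATRA:
            chars[i - 1], chars[i] = chars[i], chars[i - 1]
    return "".join(chars)


# "kitab" in Unicode (logical) order: KA, VOWEL SIGN I, TA, VOWEL SIGN AA, BA.
logical = "\u0915\u093f\u0924\u093e\u092c"
visual = to_visual_order(logical)

# List the code points in the order a glyph-by-glyph extraction would see them.
for ch in visual:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# A search for the logical string no longer matches the extracted text.
print(logical in visual)  # False
```

Applied to कालिदास, the same swap yields कािलदास, which is the garbled form
seen in the original report.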



Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz


Re: [XeTeX] New feature REQUEST for xetex

2016-02-22 Thread ShreeDevi Kumar
Hi Jonathan,

I am using xetex/xelatex for typesetting of devanagari texts.
eg. http://sanskritdocuments.org/doc_devii/gangAShTakamkAlidAsa.pdf
http://sanskritdocuments.org/doc_devii/gangAShTakamkAlidAsa.html?lang=sa
(HTML TEXT version of the same)

However, when the devanagari text is copied from the PDF, it does not
display correctly, which is the case for complex scripts with most PDF
creators (as far as I know).

e.g.
॥ गङ्गाष्टकं कालिदासकृतम् ॥
is displayed as
॥ गाकं कािलदासकृतम ॥

Is it possible to add a feature to xetex to support search and copy of
complex script text in scripts such as devanagari?

It would really be great to have this "coming soon to a XeTeX near
you"... :-)

Thanks.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Thu, Feb 18, 2016 at 4:28 PM, Jonathan Kew wrote:

> This is a pretty specialized feature, likely to be of interest only to a
> small minority of users. But for those it concerns, here's something that
> is "coming soon to a XeTeX near you"...
>
>
> I've recently implemented a new feature, controlled by the integer
> parameter \XeTeXinterwordspaceshaping. This will be available in the TL'16
> release, if all goes well.
>
> This feature is relevant only when using OpenType/Graphite/AAT fonts, not
> legacy .tfm-based fonts.
>
> When \XeTeXinterwordspaceshaping is greater than 0, XeTeX will attempt to
> support fonts where the width of inter-word spaces may vary contextually,
> depending on the preceding and following text. This is needed by fonts such
> as SIL's Awami Nastaliq (in development) where words are expected to kern
> together across spaces.
>
> The default behavior of xetex is to measure each word in isolation, and
> simply string together a sequence of such word and space (glue) nodes to
> form the horizontal list that is then line-broken to form a paragraph.
> Normally, when inter-word spaces do not depend on the adjacent words, this
> works fine; but in Awami the width of inter-word spaces may vary
> drastically, even becoming negative in some cases.
>
> Setting \XeTeXinterwordspaceshaping=1 tells xetex to measure such spaces
> "in context" and take account of the contextually-modified widths during
> line breaking. This greatly improves the typeset result with such a font.
> Each word is still shaped and rendered individually, but line-breaking and
> word spacing respect the inter-word kerning.
>
> A further complication occurs when not only the width of the space but
> also the glyphs of the adjacent words themselves may be subject to
> contextual changes. An example of this would be a font that has OpenType
> ligature rules that apply to multiple-word sequences; e.g. a symbol font
> that ligates the text "credit card" to render a credit-card icon. Another
> example is the word-final swash forms in Hoefler Italic, which are intended
> to be used at end-of-line but NOT before word spaces within the line.
>
> These cases are addressed with \XeTeXinterwordspaceshaping=2. With this
> value, not only are inter-word spaces measured in context, but also each
> run of text (words and intervening spaces) in a single font will be
> re-shaped as a unit at \shipout time. This allows full shaping (contextual
> swashes, ligatures, etc) to take effect across inter-word spaces.
>
> Currently, this feature is implemented only in the "contextual-space"
> branch of the code at sourceforge; anyone interested in testing it will
> need to check out and build the code from there. After some time, if no
> major problems show up, I expect to merge it to the master branch, and then
> to the TeXLive source tree.
>
> Feedback welcome.
>
> JK
>
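
The difference between the default behaviour and
\XeTeXinterwordspaceshaping=1 described in the quoted announcement can be
illustrated with a toy model in Python. This is only a sketch and not how
XeTeX is implemented: the word widths, the default space width and the
contextual space table are all invented numbers, and line_width is a
hypothetical helper.

```python
# Invented advance widths for two "words" and for the inter-word space.
WORD_WIDTH = {"ab": 10.0, "cd": 12.0}
DEFAULT_SPACE = 3.0

# Invented contextual widths: the space between "ab" and "cd" kerns to a
# negative value, as the announcement says can happen in Awami Nastaliq.
CONTEXT_SPACE = {("ab", "cd"): -1.0}


def line_width(words: list[str], shaping: int) -> float:
    """Sum word and space widths; shaping > 0 measures spaces in context."""
    total = sum(WORD_WIDTH[w] for w in words)
    for left, right in zip(words, words[1:]):
        if shaping > 0:
            # Space width depends on the neighbouring words.
            total += CONTEXT_SPACE.get((left, right), DEFAULT_SPACE)
        else:
            # Default: every word is measured in isolation, space is fixed.
            total += DEFAULT_SPACE
    return total


print(line_width(["ab", "cd"], shaping=0))  # 25.0
print(line_width(["ab", "cd"], shaping=1))  # 21.0
```

The line breaker sees different widths in the two cases, which is why
measuring spaces in context changes where lines can break.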

