[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2021-08-31 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407489#comment-17407489
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 8/31/21, 3:52 PM:
---

This would have to be implemented in the source code. See Language.java in 
fontbox; it describes what needs to be done to support a new language 
(implement a new GsubWorker). Currently there is only 
GsubWorkerForBengali.java. You would need to understand what Palash Ray has 
done and why. I assume you'd need to know about Bengali and Malayalam glyphs, 
i.e. how the substitutions are done. Maybe it's a similar principle, maybe it 
isn't. Nobody on the team does, AFAIK. You also need to be able to build from 
source. The current implementation is incomplete: the visual output is fine, 
but the text extraction is wrong. You're welcome if you want to try!
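
For readers unfamiliar with what a GSUB-style worker does, here is a minimal, self-contained sketch of ligature substitution on glyph IDs. It is not the PDFBox GsubWorker API: the class name LigatureSketch, its methods, and the glyph IDs are all hypothetical, and a real worker would take its rules from the font's GSUB table and apply them in a script-specific feature order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of fontbox.
public class LigatureSketch {
    // Rule table: a sequence of glyph IDs maps to one replacement glyph ID.
    private final Map<List<Integer>, Integer> rules = new HashMap<>();
    private int longestRule = 1;

    public void addRule(List<Integer> sequence, int replacement) {
        rules.put(new ArrayList<>(sequence), replacement);
        longestRule = Math.max(longestRule, sequence.size());
    }

    // Scans left to right and always tries the longest matching sequence first.
    public List<Integer> apply(List<Integer> glyphIds) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < glyphIds.size()) {
            boolean matched = false;
            int maxLen = Math.min(longestRule, glyphIds.size() - i);
            for (int len = maxLen; len >= 2; len--) {
                Integer replacement = rules.get(glyphIds.subList(i, i + len));
                if (replacement != null) {
                    out.add(replacement);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // No rule starts here, so the glyph is copied unchanged.
                out.add(glyphIds.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        LigatureSketch sketch = new LigatureSketch();
        sketch.addRule(Arrays.asList(10, 20), 99); // made-up glyph IDs
        System.out.println(sketch.apply(Arrays.asList(5, 10, 20, 7))); // [5, 99, 7]
    }
}
```

Note that the substituted glyph (99 above) no longer corresponds to a single input character, which is one reason text extraction needs separate care.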


was (Author: tilman):
This would have to be implemented in the source code. See Language.java in 
fontbox; it describes what needs to be done to support a new language 
(implement a new GsubWorker). Currently there is only 
GsubWorkerForBengali.java. You would need to understand what Palash Ray has 
done and why. I assume you'd need to know about Bengali and Malayalam glyphs, 
i.e. how the substitutions are done. Maybe it's a similar principle, maybe it 
isn't. Nobody on the team does, AFAIK. You also need to be able to build from 
source. The current implementation is incomplete: the visual output is fine, 
but the text extraction is wrong. You're welcome if you want to try

> Enable PDF creation with Indian languages, by reading and utilizing the GSUB 
> table
> --
>
> Key: PDFBOX-4189
> URL: https://issues.apache.org/jira/browse/PDFBOX-4189
> Project: PDFBox
>  Issue Type: New Feature
>  Components: FontBox, PDModel
>Reporter: Palash Ray
>Priority: Major
> Attachments: Bengali-text-after.pdf, Bengali-text-before.pdf, 
> BengaliPdfGenerationHelloWorld.java, bengali-example.pdf, 
> bengali-example2.pdf, bengali-example3.pdf, bengali-word-lohit-bad.pdf, 
> bengali-word-lohit-good.pdf, committed.patch, pdf-output.png, screenshot.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Implemented proper rendering of Indian languages, which need extensive Glyph 
> substitution. The GSUB table has been read and used effectively to replace 
> some compound words with their respective Glyphs. All tests are passing. I 
> have tested this for the Bengali font. Please review these changes and let me 
> know if it makes sense to incorporate these.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2021-08-30 Thread Kishore Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407073#comment-17407073
 ] 

Kishore Kumar edited comment on PDFBOX-4189 at 8/31/21, 5:45 AM:
-

Team,

I am not getting PDFBox to render Malayalam (one of the Indic scripts) text 
properly. If ligature substitution works, then this should work. I am using 
version 3.0.0-RC1.

String text = "വകുപ്പ്‌ 1 മനുഷ്യരെല്ലാവരും തുല്യാവകാശങ്ങളോടും";
PDDocument doc = new PDDocument();
PDFont font = PDType0Font.load(doc, new File("/Users/kishore/Downloads/ML-NILA01_NewLipi.ttf"));

PDPage page = new PDPage();
doc.addPage(page);

PDPageContentStream contentStream = new PDPageContentStream(doc, page);
contentStream.beginText();
contentStream.newLineAtOffset(25, 700);
contentStream.setFont(font, 12);
contentStream.showText(text);
contentStream.endText();
contentStream.close();

 

Do we have support for GSUB tables now? Am I doing anything wrong here? 
Please advise.

The output I get is:

  !pdf-output.png!

versus the input text:

വകുപ്പ്‌ 1 മനുഷ്യരെല്ലാവരും തുല്യാവകാശങ്ങളോടും

GSUB substitution is not happening here.
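
To illustrate why substitution matters for this text, the following sketch lists the code points of one cluster and counts the Malayalam virama (U+0D4D). In Malayalam, a consonant followed by a virama and another consonant is normally displayed as a conjunct, which the font has to supply through glyph substitution. The class name ViramaSketch is hypothetical, and the sketch says nothing about which substitutions a particular font actually contains.

```java
// Hypothetical helper; uses only the Java standard library.
public class ViramaSketch {
    static final int MALAYALAM_VIRAMA = 0x0D4D;

    // Counts virama code points, each of which may trigger conjunct formation.
    public static int countViramas(String text) {
        return (int) text.codePoints().filter(cp -> cp == MALAYALAM_VIRAMA).count();
    }

    public static void main(String[] args) {
        // PA + VIRAMA + PA, the conjunct that occurs in the first word of the sample text.
        String ppa = "\u0D2A\u0D4D\u0D2A";
        ppa.codePoints().forEach(cp ->
                System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));
        System.out.println("viramas: " + countViramas(ppa));
    }
}
```

If such sequences are written to the page as three separate glyphs, the result looks like the unshaped output described above.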

 


was (Author: kishorekollam):
Team,

I am not getting PDFBox to render Malayalam (one of the Indic scripts) text 
properly. If ligature substitution works, then this should work. I am using 
version 3.0.0-RC1.

String text = "വകുപ്പ്‌ 1 മനുഷ്യരെല്ലാവരും തുല്യാവകാശങ്ങളോടും";
PDDocument doc = new PDDocument();
PDFont font = PDType0Font.load(doc, new File("/Users/kishore/Downloads/ML-NILA01_NewLipi.ttf"));

PDPage page = new PDPage();
doc.addPage(page);

PDPageContentStream contentStream = new PDPageContentStream(doc, page);
contentStream.beginText();
contentStream.newLineAtOffset(25, 700);
contentStream.setFont(font, 12);
contentStream.showText(text);
contentStream.endText();
contentStream.close();

Do we have support for GSUB tables now? Am I doing anything wrong here? 
Please advise.

 

 




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2021-08-30 Thread Kishore Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407073#comment-17407073
 ] 

Kishore Kumar edited comment on PDFBOX-4189 at 8/31/21, 5:36 AM:
-

Team,

I am not getting PDFBox to render Malayalam (one of the Indic scripts) text 
properly. If ligature substitution works, then this should work. I am using 
version 3.0.0-RC1.

String text = "വകുപ്പ്‌ 1 മനുഷ്യരെല്ലാവരും തുല്യാവകാശങ്ങളോടും";
PDDocument doc = new PDDocument();
PDFont font = PDType0Font.load(doc, new File("/Users/kishore/Downloads/ML-NILA01_NewLipi.ttf"));

PDPage page = new PDPage();
doc.addPage(page);

PDPageContentStream contentStream = new PDPageContentStream(doc, page);
contentStream.beginText();
contentStream.newLineAtOffset(25, 700);
contentStream.setFont(font, 12);
contentStream.showText(text);
contentStream.endText();
contentStream.close();

Do we have support for GSUB tables now? Am I doing anything wrong here? 
Please advise.

 

 


was (Author: kishorekollam):
Team,

I am not getting PDFBox to render Malayalam (one of the Indic scripts) text 
properly. If ligature substitution works, then this should work. I am using 
version 3.0.0-RC1.

String text = "വകുപ്പ്‌ 1 മനുഷ്യരെല്ലാവരും തുല്യാവകാശങ്ങളോടും";
PDDocument doc = new PDDocument();
PDFont font = PDType0Font.load(doc, new File("/Users/kishore/Downloads/ML-NILA01_NewLipi.ttf"));

PDPage page = new PDPage();
doc.addPage(page);

PDPageContentStream contentStream = new PDPageContentStream(doc, page);
contentStream.beginText();
contentStream.newLineAtOffset(25, 700);
contentStream.setFont(font, 12);
contentStream.showText(text);
contentStream.endText();
contentStream.close();

 

Am I doing anything wrong here? Please help.

 

 




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2021-08-30 Thread Kishore Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407073#comment-17407073
 ] 

Kishore Kumar edited comment on PDFBOX-4189 at 8/31/21, 5:35 AM:
-

Team,

I am not getting PDFBox to render Malayalam (one of the Indic scripts) text 
properly. If ligature substitution works, then this should work. I am using 
version 3.0.0-RC1.

String text = "വകുപ്പ്‌ 1 മനുഷ്യരെല്ലാവരും തുല്യാവകാശങ്ങളോടും";
PDDocument doc = new PDDocument();
PDFont font = PDType0Font.load(doc, new File("/Users/kishore/Downloads/ML-NILA01_NewLipi.ttf"));

PDPage page = new PDPage();
doc.addPage(page);

PDPageContentStream contentStream = new PDPageContentStream(doc, page);
contentStream.beginText();
contentStream.newLineAtOffset(25, 700);
contentStream.setFont(font, 12);
contentStream.showText(text);
contentStream.endText();
contentStream.close();

 

Am I doing anything wrong here? Please help.

 

 


was (Author: kishorekollam):
Team,

I am not getting PDFBox to render Malayalam (one of the Indic scripts) text 
properly. If ligature substitution works, then this should work. I am using 
version 3.0.0-RC1.

PDDocument doc = new PDDocument();
PDFont font = PDType0Font.load(doc, new File("/Users/kishore/Downloads/ML-NILA01_NewLipi.ttf"));

PDPage page = new PDPage();
doc.addPage(page);

PDPageContentStream contentStream = new PDPageContentStream(doc, page);
contentStream.beginText();
contentStream.newLineAtOffset(25, 700);
contentStream.setFont(font, 12);
contentStream.showText(text);
contentStream.endText();
contentStream.close();

 

Am I doing anything wrong here? Please help.

 

 




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2019-04-27 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827647#comment-16827647
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/27/19 4:02 PM:
--

It has been a year, so I wanted to see what's going on, and I concentrated on 
the first word ( আমি ).

example2 has an incorrect visual glyph sequence but correct text extraction.

example3 has a correct visual glyph sequence but incorrect text extraction.

The "scythe" ি  (= "BENGALI VOWEL SIGN I") is painted to the left of the 
consonant it is "influencing", but when typed in an editor, it comes 
after it.

Word solves this by having the "scythe" glyph map to the consonant in the 
ToUnicode table: [^bengali-word-lohit-good.pdf] 

This looked suspicious, so I wondered what would happen if I used the 
"scythe" glyph with two different consonants: ( আিমি ). The result was 
[^bengali-word-lohit-bad.pdf]: the glyphs look good, but the text extraction 
is wrong: আিআি . So that is really funny 🤣 but the downside is that for now, we 
have no "gold standard" to look to for guidance and inspiration.
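
The mismatch between typing order and painting order can be sketched as follows. This is a simplified illustration, not the PDFBox implementation: the class PreBaseVowelSketch is hypothetical, it handles only the three Bengali pre-base vowel signs (U+09BF, U+09C7, U+09C8), it treats U+0995 to U+09B9 as an approximate consonant range, and it ignores conjuncts.

```java
// Hypothetical illustration of logical (typed) order versus visual (painted) order.
public class PreBaseVowelSketch {
    // Vowel signs that are painted to the left of their consonant.
    static boolean isPreBaseVowelSign(char c) {
        return c == '\u09BF' || c == '\u09C7' || c == '\u09C8';
    }

    // Approximate range of Bengali consonant letters (KA to HA).
    static boolean isConsonant(char c) {
        return c >= '\u0995' && c <= '\u09B9';
    }

    // Moves each pre-base vowel sign in front of the consonant it follows.
    public static String toVisualOrder(String logical) {
        StringBuilder out = new StringBuilder();
        for (char c : logical.toCharArray()) {
            int last = out.length() - 1;
            if (isPreBaseVowelSign(c) && last >= 0 && isConsonant(out.charAt(last))) {
                out.insert(last, c);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String logical = "\u0986\u09AE\u09BF"; // AA, MA, VOWEL SIGN I as typed
        String visual = toVisualOrder(logical);
        for (char c : visual.toCharArray()) {
            System.out.printf("U+%04X ", (int) c); // U+0986 U+09BF U+09AE
        }
        System.out.println();
    }
}
```

The page content has to use the visual order, while text extraction should return the logical order, so a per-glyph ToUnicode entry alone cannot express the swap.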


was (Author: tilman):
It has been a year, so I wanted to see what's going on, and I concentrated on 
the first word ( আমি ).

example2 has an incorrect visual glyph sequence but correct text extraction.

example3 has a correct visual glyph sequence but incorrect text extraction.

The "scythe" ি  (= "BENGALI VOWEL SIGN I") is painted to the left of the 
consonant it is "influencing", but when typed in an editor, it comes 
after it.

Word solves this by having the "scythe" glyph map to the consonant in the 
ToUnicode table: [^bengali-word-lohit-good.pdf] 

This looked suspicious, so I wondered what would happen if I used the 
"scythe" glyph with two different consonants: ( আিমি ). The result was 
[^bengali-word-lohit-bad.pdf]: the glyphs look good, but the text extraction 
is wrong: আিআি . So that is really funny 🤣 but the downside is that for now, we 
have no "gold standard" to look up to.




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2019-04-27 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827647#comment-16827647
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/27/19 4:02 PM:
--

It has been a year, so I wanted to see what's going on, and I concentrated on 
the first word ( আমি ).

example2 has an incorrect visual glyph sequence but correct text extraction.

example3 has a correct visual glyph sequence but incorrect text extraction.

The "scythe" ি  (= "BENGALI VOWEL SIGN I") is painted to the left of the 
consonant it is "influencing", but when typed in an editor, it comes 
after it.

Word solves this by having the "scythe" glyph map to the consonant in the 
ToUnicode table: [^bengali-word-lohit-good.pdf] 

This looked suspicious, so I wondered what would happen if I used the 
"scythe" glyph with two different consonants: ( আিমি ). The result was 
[^bengali-word-lohit-bad.pdf]: the glyphs look good, but the text extraction 
is wrong: আিআি . So that is really funny 🤣 but the downside is that for now, we 
have no "gold standard" to look up to.


was (Author: tilman):
It has been a year, so I wanted to see what's going on, and I concentrated on 
the first word ( আমি ).

example2 has an incorrect visual glyph sequence but correct text extraction.

example3 has a correct visual glyph sequence but incorrect text extraction.

The "scythe" ি  (= "BENGALI VOWEL SIGN I") is painted to the left of the 
consonant it is "influencing", but when typed in an editor, it comes 
after it.

Word solves this by having the "scythe" glyph map to the consonant in the 
ToUnicode table: [^bengali-word-lohit-good.pdf] 

This looked suspicious, so I wondered what would happen if I used the 
"scythe" glyph with two different consonants: ( আিমি ). The result was 
[^bengali-word-lohit-bad.pdf]: the glyphs look good, but the text extraction 
is wrong 🤣. So that is really funny, but the downside is that for now, we have 
no "gold standard" to look up to.




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2019-04-27 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827647#comment-16827647
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/27/19 4:01 PM:
--

It has been a year, so I wanted to see what's going on, and I concentrated on 
the first word ( আমি ).

example2 has an incorrect visual glyph sequence but correct text extraction.

example3 has a correct visual glyph sequence but incorrect text extraction.

The "scythe" ি  (= "BENGALI VOWEL SIGN I") is painted to the left of the 
consonant it is "influencing", but when typed in an editor, it comes 
after it.

Word solves this by having the "scythe" glyph map to the consonant in the 
ToUnicode table: [^bengali-word-lohit-good.pdf] 

This looked suspicious, so I wondered what would happen if I used the 
"scythe" glyph with two different consonants: ( আিমি ). The result was 
[^bengali-word-lohit-bad.pdf]: the glyphs look good, but the text extraction 
is wrong 🤣. So that is really funny, but the downside is that for now, we have 
no "gold standard" to look up to.


was (Author: tilman):
It has been a year, so I wanted to see what's going on, and I concentrated on 
the first word ( আমি ).

example2 has an incorrect visual glyph sequence but correct text extraction.

example3 has a correct visual glyph sequence but incorrect text extraction.

The "scythe" ি  (= "BENGALI VOWEL SIGN I") is painted to the left of the 
consonant it is "influencing", but when typed in an editor, it comes 
after it.

Word solves this by having the "scythe" glyph map to the consonant in the 
ToUnicode table: [^bengali-word-lohit-good.pdf] 

This looked suspicious, so I wondered what would happen if I used the 
"scythe" glyph with two different consonants. The result was 
[^bengali-word-lohit-bad.pdf]: the glyphs look good, but the text extraction 
is wrong 🤣. So that is really funny, but the downside is that for now, we have 
no "gold standard" to look up to.




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2019-04-27 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827647#comment-16827647
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/27/19 3:51 PM:
--

It has been a year, so I wanted to see what's going on, and I concentrated on 
the first word ( আমি ).

example2 has an incorrect visual glyph sequence but correct text extraction.

example3 has a correct visual glyph sequence but incorrect text extraction.

The "scythe" ি  (= "BENGALI VOWEL SIGN I") is painted to the left of the 
consonant it is "influencing", but when typed in an editor, it comes 
after it.

Word solves this by having the "scythe" glyph map to the consonant in the 
ToUnicode table: [^bengali-word-lohit-good.pdf] 

This looked suspicious, so I wondered what would happen if I used the 
"scythe" glyph with two different consonants. The result was 
[^bengali-word-lohit-bad.pdf]: the glyphs look good, but the text extraction 
is wrong 🤣. So that is really funny, but the downside is that for now, we have 
no "gold standard" to look up to.


was (Author: tilman):
It has been a year, so I wanted to see what's going on, and I concentrated on 
the first word ( আমি ).

example2 has an incorrect visual glyph sequence but correct extraction.

example3 has a correct visual glyph sequence but incorrect extraction.

The "scythe" ি  (= "BENGALI VOWEL SIGN I") is painted to the left of the 
consonant it is "influencing", but when typed in an editor, it comes 
after it.

Word solves this by having the "scythe" glyph map to the consonant in the 
ToUnicode table: [^bengali-word-lohit-good.pdf] 

This looked suspicious, so I wondered what would happen if I used the 
"scythe" glyph with two different consonants. The result was 
[^bengali-word-lohit-bad.pdf]: the glyphs look good, but the text extraction 
is wrong 🤣. So that is really funny, but the downside is that for now, we have 
no "gold standard" to look up to.




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2018-05-10 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471296#comment-16471296
 ] 

John Hewson edited comment on PDFBOX-4189 at 5/10/18 11:56 PM:
---

I'm trying to get ToUnicodeMap generation working properly with GSUB but have 
hit problems introduced by PDFBOX-4106. We'll have to resolve that before I can 
proceed here.


was (Author: jahewson):
I'm trying to ToUnicodeMap generation working properly but have hit problems 
introduced by PDFBOX-4106. We'll have to resolve that before I can proceed here.




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2018-05-08 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16467728#comment-16467728
 ] 

John Hewson edited comment on PDFBOX-4189 at 5/8/18 5:51 PM:
-

{quote}
Maruan Sahyoun added a comment - 29/Apr/18 08:36
Tilman Hausherr Palash Ray could we get a method which returns the 
glyphs/ids/code to use together with the metrics information for a string? 
{quote}

That's exactly what a GlyphVector is. Might be what we need here...
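
As a rough picture of what such a return value could look like, here is a minimal sketch that pairs glyph IDs with their advances. It assumes the comment refers to the idea behind java.awt.font.GlyphVector; the class GlyphRunSketch, its fields, and the sample numbers are hypothetical and are not an existing PDFBox or AWT type.

```java
// Hypothetical value type: the glyphs to show for a string plus their metrics.
public class GlyphRunSketch {
    private final int[] glyphIds;   // glyph IDs after substitution
    private final float[] advances; // horizontal advance per glyph, in font units

    public GlyphRunSketch(int[] glyphIds, float[] advances) {
        if (glyphIds.length != advances.length) {
            throw new IllegalArgumentException("one advance per glyph is required");
        }
        this.glyphIds = glyphIds.clone();
        this.advances = advances.clone();
    }

    public int glyphCount() {
        return glyphIds.length;
    }

    public int glyphId(int index) {
        return glyphIds[index];
    }

    // Width of the whole run, which a caller would need for layout and justification.
    public float totalAdvance() {
        float sum = 0f;
        for (float advance : advances) {
            sum += advance;
        }
        return sum;
    }

    public static void main(String[] args) {
        GlyphRunSketch run = new GlyphRunSketch(new int[] {5, 99}, new float[] {500f, 620f});
        System.out.println(run.glyphCount() + " glyphs, width " + run.totalAdvance());
    }
}
```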


was (Author: jahewson):
{quote}
Maruan Sahyoun added a comment - 29/Apr/18 08:36
Tilman Hausherr Palash Ray could we get a method which returns the 
glyphs/ids/code to use together with the metrics information for a string? 
{quote}

That's exactly what a GlyphVector is.




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2018-04-29 Thread Palash Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458001#comment-16458001
 ] 

Palash Ray edited comment on PDFBOX-4189 at 4/29/18 12:29 PM:
--

I am unable to reproduce this error. It runs fine for me. I confirm that I do 
not have any local changes. Can you please tell me which JRE you are running 
this on? I tested with both 7 and 8, and it's good.

 

Actually, let me do a fresh checkout and re-test.

Thanks,

Palash.


was (Author: paawak):
I am unable to reproduce this error. It runs fine for me. I confirm that I do 
not have any local changes. Can you please tell me which JRE you are running 
this on? I tested with both 7 and 8, and it's good.

Thanks,

Palash.




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2018-04-29 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16457993#comment-16457993
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/29/18 11:51 AM:
---

I'm getting "Exception in thread "main" java.lang.IllegalArgumentException: No 
glyph for U+00E0 in font Lohit-Bengali" when running the example. Could you 
check whether the committed sample text file is the one you used, i.e. that no 
byte was changed? What encoding is used for the text, could it be we have to 
pass the encoding to InputStreamReader?
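
A minimal sketch of why the encoding question matters. It assumes the sample 
text is stored as UTF-8 (the thread does not confirm this); the class name 
EncodingDemo is only for illustration. A Bengali letter in UTF-8 starts with 
the byte 0xE0, and decoding that byte as ISO-8859-1 yields U+00E0, the code 
point named in the exception.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {

    // Decodes the bytes with an explicitly named charset instead of relying
    // on the platform default.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int ch;
            while ((ch = reader.read()) != -1) {
                text.append((char) ch);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // BENGALI LETTER AA (U+0986) is stored in UTF-8 as the bytes E0 A6 86.
        byte[] utf8 = "\u0986".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.printf("UTF-8 decode:      U+%04X%n", (int) right.charAt(0));
        System.out.printf("ISO-8859-1 decode: U+%04X%n", (int) wrong.charAt(0));
    }
}
```

If the platform default charset is a single-byte one, passing 
StandardCharsets.UTF_8 to InputStreamReader would avoid the mismatch.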


was (Author: tilman):
I'm getting "Exception in thread "main" java.lang.IllegalArgumentException: No 
glyph for U+00E0 in font Lohit-Bengali" when running the example. Could you 
check whether the committed sample text file is the one you used? What encoding 
is used for the text, could it be we have to pass the encoding to 
InputStreamReader?




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2018-04-27 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456325#comment-16456325
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/27/18 12:53 PM:
---

Thank you, I committed your change with one minor difference: I used the font 
instead of the font name. The reason is that I don't trust the names to be 
really different (two distinct fonts could report the same name).


was (Author: tilman):
I committed your change with one minor difference, I used the font instead of 
the font name. The reason is that I don't trust the names to be really 
different.




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2018-04-26 Thread Palash Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454699#comment-16454699
 ] 

Palash Ray edited comment on PDFBOX-4189 at 4/26/18 7:03 PM:
-

Well, I share your concern about not affecting other users with these Gsub 
changes. I have a safety feature here: unless your font supports the specific 
script name mentioned in the Language enum, the Gsub system will not kick in. 
Right now the Language enum contains only Bengali, so I think the Gsub feature 
should be pretty safe to have. However, if you find some other vulnerability 
that I might have overlooked, please do let me know; I am more than happy to 
fix it.
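
The safety check described above could be sketched roughly as follows. This is 
not the committed code: the enum here is a simplified stand-in, and 
findSupportedLanguage and the script-tag set are hypothetical names. "beng" 
and "bng2" are the OpenType script tags for Bengali.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

final class ScriptGateDemo {

    // Simplified stand-in for the Language enum mentioned in the comment.
    enum Language {
        BENGALI("bng2", "beng");

        private final String[] scriptTags;

        Language(String... scriptTags) {
            this.scriptTags = scriptTags;
        }

        String[] scriptTags() {
            return scriptTags;
        }
    }

    // Returns a language only if the font's GSUB table lists one of its
    // script tags; otherwise GSUB handling stays switched off.
    static Optional<Language> findSupportedLanguage(Set<String> fontScriptTags) {
        for (Language language : Language.values()) {
            for (String tag : language.scriptTags()) {
                if (fontScriptTags.contains(tag)) {
                    return Optional.of(language);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Set<String> bengaliFont = new HashSet<>(Arrays.asList("DFLT", "beng", "bng2"));
        Set<String> latinFont = new HashSet<>(Arrays.asList("DFLT", "latn"));

        System.out.println(findSupportedLanguage(bengaliFont).isPresent());
        System.out.println(findSupportedLanguage(latinFont).isPresent());
    }
}
```

With only BENGALI in the enum, a font that lists just "latn" never reaches 
the substitution code.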

 

As for the Gsub workers, Tilman, I have taken your advice and created a Map of 
GsubWorkers. Please take a look and let me know whether that works for you:

[https://github.com/apache/pdfbox/pull/49]

 

Thanks,

Palash.


was (Author: paawak):
Well, I share your concern of not impacting others with these Gsub changes. I 
have a safety feature here: unless your Font supports the specific script name 
mentioned in the Language enum, this Gsub system will not kick in. And right 
now I have only the Bengali language in the Language enum. I think due to this 
safety feature, this Gsub feature should be pretty safe to have. However, if 
you find some other vulnerability that I might have overlooked, please do let 
me know, I am more than happy to fix them.

 

As for the Gsub workers, Tilman, I have taken your advice and created a Map of 
GsubWorkers. Please take a look if that agrees with you:

[https://github.com/apache/pdfbox/pull/49]

 

Thanks,

Palash.




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2018-04-25 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453551#comment-16453551
 ] 

Maruan Sahyoun edited comment on PDFBOX-4189 at 4/26/18 6:10 AM:
-

I haven't had the time to look at the details but am following the discussion. 
What about

- detecting the script -> {{Character.UnicodeScript}} in {{java.lang}}
- providing a language setting on top to override or refine the detected 
script, since one script may cover several languages with different needs
- putting the GSUB processing behind a flag/configuration, similar to Apache FOP 
(https://xmlgraphics.apache.org/fop/trunk/complexscripts.html#Disabling-complex-scripts),
 so users can decide whether they want it, for performance and compatibility 
reasons. Maybe similar to what was done in {{PDFMergerUtility}}
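
A rough sketch of the first and third suggestions, assuming script detection 
by the first code point that belongs to a specific script. 
Character.UnicodeScript.of is the real java.lang API; detectScript, 
shouldApplyGsub and the complexScriptsEnabled flag are hypothetical names, not 
PDFBox API.

```java
final class ScriptDetectionDemo {

    // Returns the script of the first code point that belongs to a specific
    // script. COMMON is returned when the text holds only digits, spaces,
    // punctuation and the like.
    static Character.UnicodeScript detectScript(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
            if (script != Character.UnicodeScript.COMMON
                    && script != Character.UnicodeScript.INHERITED
                    && script != Character.UnicodeScript.UNKNOWN) {
                return script;
            }
            i += Character.charCount(codePoint);
        }
        return Character.UnicodeScript.COMMON;
    }

    // Hypothetical opt-in switch: GSUB processing is considered only when the
    // caller enabled it and the text is in the script a worker exists for.
    static boolean shouldApplyGsub(String text, boolean complexScriptsEnabled) {
        return complexScriptsEnabled
                && detectScript(text) == Character.UnicodeScript.BENGALI;
    }

    public static void main(String[] args) {
        String bengali = "\u0986\u09AE\u09BE\u09B0";
        System.out.println(detectScript(bengali));
        System.out.println(detectScript("PDFBox"));
        System.out.println(shouldApplyGsub(bengali, false));
    }
}
```

The language override from the second suggestion is left out here, since the 
thread does not say what form it would take.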

 


was (Author: msahyoun):
I haven't had the time to look at the details but am following the discussion. 
What about

- detecting the Script -> {{Character.UnicodeScript}} in {{java.lang}}
- provide a language setting on top to override/specify further as a script 
might cover several languages which may have different needs
- putting the GSUB processing behind a flag/configuration similar to Apache Fop 
so users can decide if they want this for performance and compatibility 
reasons. Maybe similar to what was done in {{PDFMergerUtility}}

 




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2018-04-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453042#comment-16453042
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/25/18 8:33 PM:
--

With "same as 2.0.9 i.e. no rearrangement" I mean that the new feature should 
not be activated by default, so that people who use 2.0.10 (assuming I commit 
your changes there too) would have the same output as before. The reason is 
that not everybody wants this, for example people who have pixel comparisons of 
their output don't want their tests to fail after a version change because of 
the rearrangement / replacement.

About activating the script - I think it should be independent of the font. 
Some fonts may support several scripts. (I just see that your gsubdata code 
returns one single language so maybe I'm wrong there, I thought of Arial Uni 
that has a lot of different alphabets)

How about caching the workers in the content stream, with the font as the key?
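
One way the caching idea could look, as a sketch only. Font and Worker are 
hypothetical stand-ins, not the PDFBox classes. Font deliberately does not 
override equals/hashCode, so the map distinguishes font objects rather than 
font names.

```java
import java.util.HashMap;
import java.util.Map;

final class WorkerCacheDemo {

    // Hypothetical stand-in for a font; equality is object identity.
    static final class Font {
        final String name;

        Font(String name) {
            this.name = name;
        }
    }

    // Hypothetical stand-in for a GSUB worker bound to one font.
    static final class Worker {
        final Font font;

        Worker(Font font) {
            this.font = font;
        }
    }

    private final Map<Font, Worker> workers = new HashMap<>();
    private int created = 0;

    // Creates the worker on first use and reuses it on later calls.
    Worker workerFor(Font font) {
        return workers.computeIfAbsent(font, f -> {
            created++;
            return new Worker(f);
        });
    }

    int createdCount() {
        return created;
    }

    public static void main(String[] args) {
        WorkerCacheDemo cache = new WorkerCacheDemo();
        Font first = new Font("Lohit-Bengali");
        Font second = new Font("Lohit-Bengali"); // same name, different object

        cache.workerFor(first);
        cache.workerFor(first);
        cache.workerFor(second);
        System.out.println("workers created: " + cache.createdCount());
    }
}
```

Two fonts that share a name still get separate workers, and repeated calls 
for the same font do not hit the factory again.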


was (Author: tilman):
With "same as 2.0.9 i.e. no rearrangement" I mean that the new feature should 
not be activated by default, so that people who use 2.0.10 (assuming I commit 
your changes there too) would have the same output as before. The reason is 
that not everybody wants this, for example people who have pixel comparisons of 
their output don't want their tests to fail after a version change because of 
the rearrangement / replacement.

About activating the script - I think it should be independent of the font. 
Some fonts may support several scripts. (I just see that your gsubdata code 
returns one single language so maybe I'm wrong there, I thought of Arial Uni 
that has a lot of different languages)

How about caching the workers in the content stream, with the font as the key?




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2018-04-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453042#comment-16453042
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/25/18 8:32 PM:
--

With "same as 2.0.9 i.e. no rearrangement" I mean that the new feature should 
not be activated by default, so that people who use 2.0.10 (assuming I commit 
your changes there too) would have the same output as before. The reason is 
that not everybody wants this, for example people who have pixel comparisons of 
their output don't want their tests to fail after a version change because of 
the rearrangement / replacement.

About activating the script - I think it should be independent of the font. 
Some fonts may support several scripts. (I just see that your gsubdata code 
returns one single language so maybe I'm wrong there, I thought of Arial Uni 
that has a lot of different languages)

How about caching the workers in the content stream, with the font as the key?


was (Author: tilman):
With "same as 2.0.9 i.e. no rearrangement" I mean that the new feature should 
not be activated by default, so that people who use 2.0.10 (assuming I commit 
your changes there too) would have the same output as before. The reason is 
that not everybody wants this, for example people who have pixel comparisons of 
their output don't want their tests to fail after a version change because of 
the rearrangement / replacement.

About activating the script - I think it should be independent of the font. 
Some fonts may support several scripts.

How about caching the workers in the content stream, with the font as the key?




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2018-04-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453042#comment-16453042
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/25/18 8:29 PM:
--

With "same as 2.0.9 i.e. no rearrangement" I mean that the new feature should 
not be activated by default, so that people who use 2.0.10 (assuming I commit 
your changes there too) would have the same output as before. The reason is 
that not everybody wants this, for example people who have pixel comparisons of 
their output don't want their tests to fail after a version change because of 
the rearrangement / replacement.

About activating the script - I think it should be independent of the font. 
Some fonts may support several scripts.

How about caching the workers in the content stream, with the font as the key?


was (Author: tilman):
With "same as 2.0.9 i.e. no rearrangement" I mean that the new feature should 
not be activated by default, so that people who use 2.0.10 (assuming I commit 
your changes there too) would have the same output as before. The reason is 
that not everybody wants this, for example people who have pixel comparisons of 
their output don't want their tests to fail after a version change.

About activating the script - I think it should be independent of the font. 
Some fonts may support several scripts.

How about caching the workers in the content stream, with the font as the key?




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2018-04-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16452887#comment-16452887
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/25/18 6:57 PM:
--

I wonder if I'm missing something... 
{{gsubWorkerFactory.getGsubWorker(pdType0Font.getCmapLookup(), gsubData);}} is 
still hit for every call of showTextInternal().

Another thing to do: the default behaviour should be the same as 2.0.9 i.e. no 
rearrangement. What I'm thinking is some setter in PDPageContentStream that 
activates the GSUB worker for that script, e.g. setScript("Bengali").
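
To illustrate the opt-in idea: the sketch below uses a hypothetical 
ContentStream class, not PDPageContentStream, and the registered worker is a 
placeholder that only marks the text. Without a setScript call the text passes 
through unchanged, which is the "same as 2.0.9" default.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

final class OptInScriptDemo {

    // Hypothetical stand-in for a content stream with an opt-in script setter.
    static final class ContentStream {
        private final Map<String, UnaryOperator<String>> workers = new HashMap<>();
        private String script; // null means: behave exactly as before

        void registerWorker(String scriptName, UnaryOperator<String> worker) {
            workers.put(scriptName, worker);
        }

        void setScript(String scriptName) {
            if (!workers.containsKey(scriptName)) {
                throw new IllegalArgumentException("No GSUB worker for script " + scriptName);
            }
            this.script = scriptName;
        }

        // Applies the selected worker, or returns the text untouched by default.
        String prepare(String text) {
            return script == null ? text : workers.get(script).apply(text);
        }
    }

    public static void main(String[] args) {
        ContentStream stream = new ContentStream();
        // Placeholder worker: a real one would substitute glyphs.
        stream.registerWorker("Bengali", text -> "<shaped>" + text);

        System.out.println(stream.prepare("abc"));
        stream.setScript("Bengali");
        System.out.println(stream.prepare("abc"));
    }
}
```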


was (Author: tilman):
I wonder if I'm missing something... 
{{gsubWorkerFactory.getGsubWorker(pdType0Font.getCmapLookup(), gsubData);}} is 
still hit for every call of showTextInternal().

Another thing to do: the default behaviour should be the same as 2.0.9 i.e. no 
rearrangement. What I'm thinking is some language setting that activates the 
GSUB worker for that script.




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2018-04-22 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447116#comment-16447116
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/22/18 10:13 AM:
---

What is to be done if we want to activate ligatures for latin languages - write 
another ccmp and put our own FEATURES_IN_ORDER, here with ccmp, liga and clig, 
and add "latn" to GlyphSubstitutionDataExtractor.SUPPORTED_LANGUAGES?
https://docs.microsoft.com/en-us/typography/script-development/standard
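
For illustration, applying ligature features in a fixed order might look like 
the sketch below. It works on glyph names instead of glyph IDs and uses a 
made-up rule table; FEATURES_IN_ORDER mirrors the name used in the question, 
but applyFeature and shape are hypothetical and not the fontbox code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LigatureDemo {

    // Feature tags are applied in this order; features the font lacks are skipped.
    static final List<String> FEATURES_IN_ORDER = Arrays.asList("ccmp", "liga", "clig");

    // Replaces glyph sequences by their ligature, trying the longest match first.
    static List<String> applyFeature(List<String> glyphs, Map<List<String>, String> rules) {
        int longest = 0;
        for (List<String> sequence : rules.keySet()) {
            longest = Math.max(longest, sequence.size());
        }
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < glyphs.size()) {
            boolean replaced = false;
            for (int len = Math.min(longest, glyphs.size() - i); len >= 1 && !replaced; len--) {
                String ligature = rules.get(glyphs.subList(i, i + len));
                if (ligature != null) {
                    out.add(ligature);
                    i += len;
                    replaced = true;
                }
            }
            if (!replaced) {
                out.add(glyphs.get(i));
                i++;
            }
        }
        return out;
    }

    // Runs every available feature over the glyph list, in FEATURES_IN_ORDER.
    static List<String> shape(List<String> glyphs, Map<String, Map<List<String>, String>> fontFeatures) {
        List<String> current = glyphs;
        for (String tag : FEATURES_IN_ORDER) {
            Map<List<String>, String> rules = fontFeatures.get(tag);
            if (rules != null) {
                current = applyFeature(current, rules);
            }
        }
        return current;
    }

    public static void main(String[] args) {
        Map<List<String>, String> liga = new HashMap<>();
        liga.put(Arrays.asList("f", "i"), "f_i");
        liga.put(Arrays.asList("f", "f", "i"), "f_f_i");
        Map<String, Map<List<String>, String>> fontFeatures = new HashMap<>();
        fontFeatures.put("liga", liga);

        // "office" as single-letter glyph names
        System.out.println(shape(Arrays.asList("o", "f", "f", "i", "c", "e"), fontFeatures));
    }
}
```

Real GSUB lookups also cover contextual and single substitutions, which this 
sketch ignores.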


was (Author: tilman):
What is to be done if we want to activate ligatures for latin languages - write 
another ccmp and put our own FEATURES_IN_ORDER, here with ccmp, liga and clig?
https://docs.microsoft.com/en-us/typography/script-development/standard




[jira] [Comment Edited] (PDFBOX-4189) Enable PDF creation with Indian languages, by reading and utilizing the GSUB table

2018-04-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16446840#comment-16446840
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/22/18 6:35 AM:
--

Thank you [~paawak] I committed your code with slight modifications. I removed 
most formatting changes (they draw attention away from the actual changes) 
and changed the example code as mentioned before.

todo next:
- [~amake] please use the trunk to check if your vertical texts are still ok 
(likely yes, the tests pass and the PDF generated by the sample works too)
- [~paawak] the example output now looks even more different than before - or 
is the text from the screenshot wrong? Is this related to your latest change, 
or could it be I messed up something?
- run sonar tool (done)
- implement something to activate language specific handling
- port to 2.0 after a few days



was (Author: tilman):
Thank you [~paawak] I committed your code with slight modifications. I removed 
most formatting changes (they draw attention away from the actual changes) 
and changed the example code as mentioned before.

todo next:
- [~amake] please use the trunk to check if your vertical texts are still ok 
(likely yes, the tests pass and the PDF generated by the sample works too)
- [~paawak] the example output now looks even more different than before - or 
is the text from the screenshot wrong? Is this related to your latest change, 
or could it be I messed up something?
- run sonar tool
- port to 2.0 after a few days

