[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446840#comment-16446840
 ] 

Tilman Hausherr commented on PDFBOX-4189:
-

Thank you [~paawak], I committed your code with slight modifications. I removed 
most formatting changes (they draw attention away from the actual changes) 
and changed the example code as mentioned before.

todo next:
- [~amake] please use the trunk to check if your vertical texts are still ok 
(likely yes, the tests pass and the PDF generated by the sample works too)
- [~paawak] the example output now looks even more different than before - or 
is the text from the screenshot wrong? Is this related to your latest change, 
or could it be I messed up something?
- run sonar tool
- port to 2.0 after a few days


> Enable rendering of Indian languages, by reading and utilizing the GSUB table
> -
>
> Key: PDFBOX-4189
> URL: https://issues.apache.org/jira/browse/PDFBOX-4189
> Project: PDFBox
>  Issue Type: New Feature
>  Components: FontBox, PDModel
>Reporter: Palash Ray
>Priority: Major
> Attachments: Bengali-text-after.pdf, Bengali-text-before.pdf, 
> BengaliPdfGenerationHelloWorld.java, bengali-example.pdf, 
> bengali-example2.pdf, committed.patch, screenshot.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Implemented proper rendering of Indian languages, which need extensive Glyph 
> substitution. The GSUB table has been read and used effectively to replace 
> some compound words with their respective Glyphs. All tests are passing. I 
> have tested this for the Bengali font. Please review these changes and let me 
> know if it makes sense to incorporate these.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446818#comment-16446818
 ] 

ASF subversion and git services commented on PDFBOX-4189:
-

Commit 1829710 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1829710 ]

PDFBOX-4189: Enable rendering of Indian languages by reading and utilizing the 
GSUB table, by Palash Ray




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446728#comment-16446728
 ] 

Tilman Hausherr commented on PDFBOX-4189:
-

I modified your example so that it is on one page and so that the line breaks 
are the same, please take that one - but I see that there are differences: see 
the last word (ব্যাস নির্ভয় ). However, these differences do not show up when I 
look at the source code. There it looks the same.




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446711#comment-16446711
 ] 

Tilman Hausherr commented on PDFBOX-4189:
-

that's just a warm-up and to get rid of binaries in the patch.




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446709#comment-16446709
 ] 

ASF subversion and git services commented on PDFBOX-4189:
-

Commit 1829697 from [~tilman] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1829697 ]

PDFBOX-4189: add lohit-bengali font for upcoming tests and example




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-21 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446710#comment-16446710
 ] 

ASF subversion and git services commented on PDFBOX-4189:
-

Commit 1829698 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1829698 ]

PDFBOX-4189: add lohit-bengali font for upcoming tests and example




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-21 Thread Palash Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446685#comment-16446685
 ] 

Palash Ray commented on PDFBOX-4189:


I know. If you ask me, it's a real shame. The reason we have abstractions and 
specifications is that we should be able to work out the rules from them 
without having to write language-specific handlers. But I think even the font 
developers are to blame. They should push the big companies who write these 
specifications to do a better job. Anyway, sorry for the rant :)




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446684#comment-16446684
 ] 

Tilman Hausherr commented on PDFBOX-4189:
-

I had a look at Apache FOP a year or two ago and I remember that they also have 
specific code for the different languages.




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-21 Thread Palash Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446682#comment-16446682
 ] 

Palash Ray commented on PDFBOX-4189:


I would like to share some thoughts on the approach I took. I am reading all 
the GSUB tables. Then, based on the language from the ScriptTable, I first 
determine whether I will support GSUB at all. Right now, the supported 
languages (only Bengali for now) are hard-coded in the 
*GlyphSubstitutionDataExtractor* class. Later, we will need to figure out a 
better way. I did this to be safe and not break any existing features, for 
example *vert*.

Next, Microsoft has language-specific guidelines on how to handle the 
various features that are defined in the FeatureTable in GSUB. For Bengali, they 
are here: 
[https://docs.microsoft.com/en-us/typography/script-development/bengali#reor]

In the *PDPageContentStream*, right now, I am just hard-coding the Bengali 
implementation of GSUB. Again, here we need to figure out a way to handle this 
gracefully.
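
A minimal sketch of the allow-list idea described above, using only the JDK. The class and method names (GsubLanguageGate, isGsubEnabled) are hypothetical and are not taken from the patch; "beng" and "bng2" are the OpenType script tags for Bengali. The assumption is that the script tags found in the font's GSUB table are already available as strings.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class GsubLanguageGate
{
    // Scripts for which GSUB processing is switched on (illustrative allow-list).
    private static final Set<String> SUPPORTED_SCRIPT_TAGS =
            Collections.unmodifiableSet(new HashSet<String>(Arrays.asList("beng", "bng2")));

    /** Returns true if at least one script tag found in the font is on the allow-list. */
    public static boolean isGsubEnabled(Collection<String> scriptTagsInFont)
    {
        for (String tag : scriptTagsInFont)
        {
            if (SUPPORTED_SCRIPT_TAGS.contains(tag))
            {
                return true;
            }
        }
        return false;
    }
}
```

With a gate like this, a font that only declares other scripts (for example one that relies on the *vert* feature) would skip the new code path entirely.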

 

The following is still not supported:
 # Copy-pasting PDF text works only partially, i.e. only for the characters 
which have not been GSUB-processed. Imho, this feature is not that important.
 # Right now, certain characters are still placed wrongly. I hope to fix 
this soon. This is a very important feature.

To show how well I am using GSUB, in the example 
*BengaliPdfGenerationHelloWorld* I have added some difficult text on the 1st 
page. On the 2nd page, I have embedded an image of how this text should actually 
look. As and when I add the missing features, these 2 pages should come to look 
the same.

 

Hope this helps in the review process.

 

Thanks,

Palash.

 




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444713#comment-16444713
 ] 

Tilman Hausherr commented on PDFBOX-4189:
-

Wow... I'll try to review all this over the weekend. I made a short test and 
subsetting works, and so does copy & paste.

What does "breaking change" mean in your commits? I looked at one of these and 
it didn't look like it broke the API.

Feel free to add your name (without mail) as "@author" in the classes that you 
introduced, and in those where you made major changes / improvements. But it's 
not required, i.e. some of us do it and some don't.




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-19 Thread Palash Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444671#comment-16444671
 ] 

Palash Ray commented on PDFBOX-4189:


Hi All,

Most of my changes are done. I have taken care of subsetting as well. It's 
working fine. Apart from some minor issues and a few hard-coded values for now, 
everything is almost there. Please take a look and let me know what else I 
should do.

Thanks,

Palash.




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-17 Thread Palash Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441424#comment-16441424
 ] 

Palash Ray commented on PDFBOX-4189:


I have re-instated PDFont::encode() as final and moved the GSUB logic inside
PDPageContentStream#showTextInternal, as suggested by John.

I am enabling GSUB only for Bengali as of now: 
GlyphSubstitutionDataExtractor#SUPPORTED_LANGUAGES

I am still working on subsetting.

Thanks,

Palash.




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-15 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438914#comment-16438914
 ] 

John Hewson commented on PDFBOX-4189:
-

For correct text positioning with mixed languages, information from the 
following tables might be useful:
- GPOS: to adjust the glyph position
- BASE: baseline offsets on a script-by-script basis.
- JSTF: justification information, including whitespace and Kashida adjustments.
- BIDI Mirroring: https://www.unicode.org/Public/10.0.0/ucd/BidiMirroring.txt

bq. here

BASE, JSTF and BiDi are concerned with _paragraph-level_ layout, which happens 
at a higher level than the proposed layout() - which would be concerned with 
only a single script in a single direction (i.e. only OpenType _shaping_). BASE 
and BiDi are related to changes between different scripts, while JSTF is to aid 
in making good line break choices. So all of that functionality will happen 
somewhere else (this fits very closely with the layout code for forms, for 
example). So in layout we're really only going to be concerned with GPOS and 
GSUB features. That way the only options that one might want to pass to layout 
would be this list of which [feature 
flags|https://docs.microsoft.com/en-us/typography/opentype/spec/featurelist] to 
apply.
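
A rough sketch of what "pass the list of feature flags to layout" could look like, using only the JDK. Everything here is an assumption for illustration: the class name FeatureFlagLayout, the representation of a font's features as a map from feature tag to single-glyph substitutions, and the glyph IDs. Real GSUB lookups come in several types and GPOS is left out completely; only the selection of features by tag is shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FeatureFlagLayout
{
    /**
     * Applies the single-glyph substitutions of each enabled feature, in the order
     * the feature tags are given. Tags the font does not define are skipped.
     */
    public static List<Integer> layout(List<Integer> glyphIds,
                                       Map<String, Map<Integer, Integer>> fontFeatures,
                                       List<String> enabledFeatures)
    {
        List<Integer> result = new ArrayList<Integer>(glyphIds);
        for (String tag : enabledFeatures)
        {
            Map<Integer, Integer> substitutions = fontFeatures.get(tag);
            if (substitutions == null)
            {
                continue; // feature not present in this font
            }
            for (int i = 0; i < result.size(); i++)
            {
                Integer replacement = substitutions.get(result.get(i));
                if (replacement != null)
                {
                    result.set(i, replacement);
                }
            }
        }
        return result;
    }
}
```

Calling layout with an empty feature list returns the glyphs unchanged, which is one way to keep existing behaviour as the default.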




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-15 Thread Palash Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438640#comment-16438640
 ] 

Palash Ray commented on PDFBOX-4189:


Thanks a lot guys, for the detailed comments. It seems that there is some more 
work for me to do to ensure that this patch fits into the broader scheme of 
things. I am ready to work with you to make this happen. I think PDFBox is a 
great piece of software, and I am committed to making it more feature-rich. This 
particular feature is important for supporting any Indian or South East Asian 
language. So, from my perspective, I would like to make it happen.

 

John, special thanks for taking the time to explain the architecture. Let 
me do a bit of refactoring and incorporate your suggestions. I will let you 
know how that goes. I also plan to handle subsetting.

 

Thanks,

Palash.




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-15 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438609#comment-16438609
 ] 

Maruan Sahyoun commented on PDFBOX-4189:


The patch is a great and - given several questions we had in the past - 
important addition to PDFBox.

In the longer run I'd see some additions we might already think about 
conceptually and/or start introducing in the public API. As I haven't reviewed 
the patch, the list below is meant as a hint for possible additions. They may 
already be included.

For correct text positioning with mixed languages, information from the 
following tables might be useful:
- GPOS: to adjust the glyph position
- BASE: baseline offsets on a script-by-script basis.
- JSTF: justification information, including whitespace and Kashida adjustments.
- BIDI Mirroring: https://www.unicode.org/Public/10.0.0/ucd/BidiMirroring.txt

To allow the user to override the language system identified by the script 
being used we might want to add {{setLanguage/getLanguage}}, which can be 
called prior to {{showText}} if an override is needed.

Putting that into an internal {{layout}} method as John suggested would also 
allow us to put it behind a feature flag where one could enable/disable the 
processing. We might also mark that feature as **experimental** and specify 
which languages it has been tested with (to some extent).
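
One possible shape for the override and the feature flag, as a sketch only. The class TextLayoutOptions, the method names other than setLanguage/getLanguage, and the plain-string language values ("BEN", "ASM") are all hypothetical; none of this is existing PDFBox API.

```java
public class TextLayoutOptions
{
    // null means "use the language detected from the script in use"
    private String language;
    // experimental processing is off unless explicitly enabled
    private boolean shapingEnabled;

    public void setLanguage(String language)
    {
        this.language = language;
    }

    public String getLanguage()
    {
        return language;
    }

    public void setShapingEnabled(boolean shapingEnabled)
    {
        this.shapingEnabled = shapingEnabled;
    }

    public boolean isShapingEnabled()
    {
        return shapingEnabled;
    }

    /** Returns the language to shape with, or null if the processing is switched off. */
    public String effectiveLanguage(String detectedLanguage)
    {
        if (!shapingEnabled)
        {
            return null;
        }
        return language != null ? language : detectedLanguage;
    }
}
```

An override is useful where one script serves several languages: Bengali script is also used for Assamese, so detection from the script alone cannot tell the two apart.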




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438540#comment-16438540
 ] 

John Hewson commented on PDFBOX-4189:
-

Hi guys, this is a really welcome contribution, thank you. With regards to 
PDFont#encode(String text) being non-final I can add some insight as I was the 
original designer of our current PDFont#encode mechanism.

Basically, the PDFont classes are designed to represent fonts identically to 
how they are represented when embedded in PDF files. So there's no support for 
OpenType, by design. A Type0 font knows nothing about OpenType (by design).

So how can we use OpenType in PDFBox? The answer is that we do it one layer of 
abstraction up, during text _layout_ instead of text _encoding_. So you 
want to put your glyph substitution code inside PDPageContentStream#showText, 
or more precisely inside 
[PDPageContentStream#showTextInternal|https://github.com/apache/pdfbox/blob/7e721643c0b1fca9fdc349f78431f36e68abc097/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDAbstractContentStream.java#L256].

That way PDFont#encode(String text) can stay final :)
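
The layering described above can be sketched roughly as follows: substitution works on a sequence of glyph IDs during layout, and only its output is handed on to encoding. This is a simplified illustration and not the code from the patch. The class name, the map-based ligature table and the glyph IDs are made up, and real GSUB lookups have several types and ordering rules that are ignored here. The sketch replaces the longest matching glyph sequence at each position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LigatureSubstitutionSketch
{
    /** Replaces the longest matching glyph sequence at each position with its ligature glyph. */
    public static List<Integer> substitute(List<Integer> glyphIds,
                                           Map<List<Integer>, Integer> ligatures)
    {
        // Longest sequence in the table bounds how far we need to look ahead.
        int maxLength = 1;
        for (List<Integer> sequence : ligatures.keySet())
        {
            maxLength = Math.max(maxLength, sequence.size());
        }

        List<Integer> result = new ArrayList<Integer>();
        int i = 0;
        while (i < glyphIds.size())
        {
            boolean matched = false;
            for (int len = Math.min(maxLength, glyphIds.size() - i); len >= 1; len--)
            {
                Integer ligature = ligatures.get(glyphIds.subList(i, i + len));
                if (ligature != null)
                {
                    result.add(ligature);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched)
            {
                // No rule starts here: keep the glyph as it is.
                result.add(glyphIds.get(i));
                i++;
            }
        }
        return result;
    }
}
```

Because the font object never sees the substitution rules, its encode step can stay unchanged; it only receives glyphs that have already been laid out.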




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread Palash Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438487#comment-16438487
 ] 

Palash Ray commented on PDFBOX-4189:


I have pushed some changes which take care of most of the issues that you have 
pointed out except:
 # subsetting
 # BengaliPdfGenerationHelloWorld should be integrated into the 
EmbeddedFonts.java example

I will take care of these as well. Meanwhile, please let me know if any other 
changes are needed.

 

Thanks,

Palash.




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438411#comment-16438411
 ] 

Tilman Hausherr commented on PDFBOX-4189:
-

Re subsetting: in your call of {{PDType0Font.load}}, set the last parameter to 
true or remove it, and see what happens. Subsetting 
means PDFBox creates a new font with only the glyphs that are really used, so 
generated files get smaller (for example, the Arial Uni font has a size of 
23MB!). Please have a look at {{PDAbstractContentStream.showTextInternal}}. 
This takes all codepoints and remembers which will be in the subset. I suspect 
that you'd need to know what actual codepoints are used after the substitutions.
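
A minimal sketch of the ordering concern described above, assuming glyphs are 
plain integer IDs and substitutions are simple many-to-one (ligature-style) 
rules. The class and method names are hypothetical and are not PDFBox API, and 
the IDs used are made up. It only illustrates that the set handed to a 
subsetter would have to be collected after substitution, so that it contains 
the replacement glyph.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical illustration only; nothing here is taken from PDFBox. */
final class SubsetAfterGsubSketch
{
    /** Replaces known glyph sequences by their ligature glyph, longest match first. */
    static List<Integer> substitute(List<Integer> glyphIds,
                                    Map<List<Integer>, Integer> ligatures)
    {
        int longest = 1;
        for (List<Integer> key : ligatures.keySet())
        {
            longest = Math.max(longest, key.size());
        }
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < glyphIds.size())
        {
            boolean replaced = false;
            // try the longest candidate sequence starting at position i first
            for (int len = Math.min(longest, glyphIds.size() - i); len >= 2; len--)
            {
                Integer ligature = ligatures.get(glyphIds.subList(i, i + len));
                if (ligature != null)
                {
                    out.add(ligature);
                    i += len;
                    replaced = true;
                    break;
                }
            }
            if (!replaced)
            {
                out.add(glyphIds.get(i));
                i++;
            }
        }
        return out;
    }

    /** The IDs a subsetter would need: those that remain after substitution. */
    static SortedSet<Integer> glyphsForSubset(List<Integer> glyphIds,
                                              Map<List<Integer>, Integer> ligatures)
    {
        return new TreeSet<>(substitute(glyphIds, ligatures));
    }
}
```

With a single rule mapping the sequence 52, 114 to glyph 200, the input 
10, 52, 114, 10 becomes 10, 200, 10, and the subset would need glyphs 10 and 
200 rather than 52 and 114.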

Re GlyphsubstitutionTable, yeah, just move it back, thanks.




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread Palash Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438401#comment-16438401
 ] 

Palash Ray commented on PDFBOX-4189:


* +What is the story of having different data for jdk7 and jdk8+

Out of 323 entries in the GSUB table for the Bengali-Lohit.ttf font, I am 
getting a single entry which differs between jdk1.7 and 1.8. That's the reason 
I had to create the 2 files. I am still investigating this, so maybe I will 
come up with a better solution when I get to the bottom of this.
 * +I'd also need to know where this file came from, or whether you created it 
yourself from other data+

Those .txt files are simple reference data used for testing the correctness of 
the GSUB tables. I created them with some code that transforms Unicode 
characters into base-10 numbers.
 * +BengaliPdfGenerationHelloWorld should be integrated into the 
EmbeddedFonts.java example+

Will do
 * +why a log4j2.xml ? We don't use log4j2 except in preflight where log4j is 
used in Tests+

Agreed, I will remove the log4j2.xml
 * +You disabled subsetting+

I don't understand that yet. Please bear with me, I will make it work even with 
that. Let me take a look.
 * +The move of GlyphsubstitutionTable+

I can move it back if it simplifies things. Should I?
 * +There is a lot of logging done+

Will do
 * +Loosening scope restrictions is a bit of a no-no+

Agreed. I did this as a part of the move of GlyphsubstitutionTable, if I undo 
the move, this will be taken care of.

 
 * +Public methods should have a javadoc+

Will do
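
As an illustration of the kind of transformation mentioned above for the 
reference .txt files (Unicode characters into base-10 numbers), here is a 
minimal sketch. The class name, the method name and the space-separated output 
format are assumptions for illustration; the actual layout of the test data is 
not described in this thread.

```java
import java.util.stream.Collectors;

/** Hypothetical helper; the real generator for the .txt files may differ. */
final class CodePointsSketch
{
    /** Renders each Unicode code point of the text as a decimal number. */
    static String toDecimalCodePoints(String text)
    {
        return text.codePoints()
                   .mapToObj(Integer::toString)
                   .collect(Collectors.joining(" "));
    }
}
```

For example, the Bengali letter U+0995 would be written as 2453.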

 

Thanks,

Palash.

 




[jira] [Commented] (PDFBOX-4189) Enable rendering of Indian languages, by reading and utilizing the GSUB table

2018-04-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438383#comment-16438383
 ] 

Tilman Hausherr commented on PDFBOX-4189:
-

Your patch is very much appreciated of course, thank you. It will probably 
result in thousands of new users / usages. This is a complex patch, so expect 
it to take some time before it is committed. See PDFBOX-4106 for an example of 
a complex patch and the discussion.

Can you please also add an Apache header to the .txt files? See the file 
"pdfbox\src\main\resources\org\apache\pdfbox\resources\glyphlist\additional.txt".
 I'd also need to know where this file came from, or whether you created it 
yourself from other data; if yes, please include a comment explaining how, 
and/or the code that created the file.

About the commits:
 - What is the story of having different data for jdk7 and jdk8?
 - BengaliPdfGenerationHelloWorld should be integrated into the 
EmbeddedFonts.java example
 - why a log4j2.xml ? We don't use log4j2 except in preflight where log4j is 
used in Tests
 - I think I understand why my example didn't work: you disabled subsetting. 
With subsetting, the subsetter has to "know" which glyphs are used. We do need 
subsetting, because otherwise files might get huge.
 - The generated PDF file has trouble with text extraction: "আমি কোন পথƶ §ীরƶর 
ল©ী ষĞ পুতুল Šপো গÄা ঋষি" i.e. there are some unknown glyphs.
 - The move of GlyphsubstitutionTable breaks the API. Like I said in the PR, if 
you keep the API as it is (only expand, not change existing methods) then your 
change could be used for 2.0 too. The release of 3.0 could take years, the 
release of 2.0.10 only a few months.
 - There is a lot of logging done ("WARNUNG: oldValue: [52, 114] will be 
overridden with newValue: [114, 52]"). This should be changed or removed: it 
scares users and they create issues, thinking that something went wrong. If 
you change it to debug, please include a comment on what this is about. See 
also the discussion in PDFBOX-4106 about "Trying to un-substitute a 
never-before-seen gid".
 - Loosening scope restrictions is a bit of a no-no, as done in 
[TTFDataStream.java|https://github.com/apache/pdfbox/pull/46/files#diff-894ae790d373c62634ceed941b264dc3]
 , 
[TTFTable.java|https://github.com/apache/pdfbox/pull/46/files#diff-355fd8e3330f392bdae0778f942dc124]
 , and maybe elsewhere. As preached by "Effective Java", item 15: "make each 
class or member as inaccessible as possible".
 - Public methods should have a javadoc, same for classes. It doesn't have to 
be big, just make it good enough for other people to understand what is done. 
See also [https://pdfbox.apache.org/codingconventions.html] , I think most 
conventions are already respected.

I have not yet done a proper review of the code (looking side-by-side), so more 
questions may be coming.
