Re: [poppler] How to normalize MathematicalPi text?

2019-03-13 Thread Jason Crain
On Wed, Mar 13, 2019 at 01:54:26PM +0100, Jeroen Ooms wrote:
> I think what would be needed is to construct a table that maps the
> Mathematical-Pi characters into their proper unicode values.

The PDF creator should be providing that table, called the ToUnicode
map, in the font's data structures. Since this font doesn't provide one,
poppler has to guess what the Unicode value could be and it guesses
wrong.

If you were to provide a map that says, for this font, character code
"^A" maps to "β", that should work.
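
To make that concrete, here is a minimal sketch of such a per-font map in
Python. It only illustrates the idea and is not poppler code: the font name
"MathematicalPi-One" and the helper to_unicode() are assumptions made up for
this example, and the only table entry taken from this thread is "^A"
(character code 0x01) mapping to "β".

```python
# Hypothetical stand-in for a missing ToUnicode map: one table per font,
# keyed by character code. Only the 0x01 ("^A") -> "β" entry comes from the
# thread; the font name is an assumption for illustration.
FONT_MAPS = {
    "MathematicalPi-One": {0x01: "\u03b2"},  # ^A -> β
}


def to_unicode(font_name, char_codes):
    """Translate raw character codes of one font into a Unicode string.

    Codes without a table entry become U+FFFD so that gaps in the table
    stay visible instead of being silently guessed.
    """
    table = FONT_MAPS.get(font_name, {})
    return "".join(table.get(code, "\ufffd") for code in char_codes)
```

Called as to_unicode("MathematicalPi-One", b"\x01"), this looks up code 0x01
in the table for that font. A real table would need one entry per glyph
actually used in the documents.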
___
poppler mailing list
poppler@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/poppler

Re: [poppler] PDF 2.0 Spec (answer from FSFE)

2019-03-13 Thread Leonard Rosenthol
We have a board meeting on Monday and will be discussing this.  There seems to 
be support - we just have to work through the details...

Leonard

-----Original Message-----
From: poppler On Behalf Of Leonard Rosenthol
Sent: Tuesday, March 12, 2019 9:26 AM
To: Germán Poo-Caamaño; Tobias Deiminger; poppler@lists.freedesktop.org
Subject: Re: [poppler] PDF 2.0 Spec (answer from FSFE)

As a member of the board of the PDF Association - let me check in and see what 
we can do...

-----Original Message-----
From: poppler On Behalf Of Germán Poo-Caamaño
Sent: Tuesday, March 12, 2019 8:49 AM
To: Leonard Rosenthol; Tobias Deiminger; poppler@lists.freedesktop.org
Subject: Re: [poppler] PDF 2.0 Spec (answer from FSFE)

On Sun, 2019-03-10 at 12:51 +, Leonard Rosenthol wrote:
> German, that is not entirely correct.   It is entirely based on the
> type/category of Liaison.  
> 
> The PDF Association holds the highest class (A) of Liaison with the 
> ISO and thus not only has full access to the drafts, but also has
> both voting and meeting attendance rights.   That's why I suggested
> it.

Maybe we are talking about two different things.

I am not disputing the PDF Association's status at ISO. I am talking about the 
Liaison membership category within the PDF Association. The link I cited 
clearly says that Liaison members of the PDFA do not have access to drafts of 
upcoming ISO standards.

If any of us would like access to the drafts by becoming a member of the PDFA, 
then Liaison membership will not work for us, unless some information is 
missing or they make exceptions.

> -----Original Message-----
> From: Germán Poo-Caamaño On Behalf Of Germán Poo-Caamaño
> Sent: Sunday, March 10, 2019 7:55 AM
> To: Leonard Rosenthol; Tobias Deiminger <haxti...@posteo.de>;
> poppler@lists.freedesktop.org
> Subject: Re: [poppler] PDF 2.0 Spec (answer from FSFE)
> 
> On Tue, 2019-03-05 at 16:27 +, Leonard Rosenthol wrote:
> > Another organization option that is already a Class A liaison with 
> > ISO is the PDF Association (http://www.pdfa.org).
> 
> Liaison is the one that fits best for non-profit organizations.
> However, that does not grant access to the ISO drafts.
> 
> Individual or Observer membership would be a straightforward mechanism 
> to obtain access to the ISO drafts. I am leaning toward Observer because 
> it is for organizations. I expect that would cost us around €750/year 
> because I assume that KDE e.V. and/or the GNOME Foundation are small 
> organizations. Otherwise, we can try to get either organization to 
> fund some developers as individuals.
> 
> Here are the details and costs:
> https://www.pdfa.org/member-benefits/
> 
> --
> Germán Poo-Caamaño
> http://calcifer.org/

--
Germán Poo-Caamaño
http://calcifer.org/

[poppler] How to normalize MathematicalPi text?

2019-03-13 Thread Jeroen Ooms
A researcher who is using the R bindings to analyze large numbers of
scientific papers has asked me for advice on the following:

When extracting results from scientific PDFs, math symbols sometimes
cannot be extracted because they are encoded with a custom font
called Mathematical-Pi [1]. An example of such a paper is [2]. When we
extract text via poppler::page::text(), all of the = < > α β characters
come out as random characters from Mathematical-Pi rather than the
expected Unicode symbols. Unfortunately, these characters are critical
for interpreting the results, so we cannot ignore them.

I was wondering if someone has experience with normalizing text in
custom fonts into proper Unicode?

I think what would be needed is to construct a table that maps the
Mathematical-Pi characters to their proper Unicode values. Then we
would need some hook for poppler::page::text() to replace the text of
text boxes that use the Mathematical-Pi font with the corresponding
UTF-8 text.
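
As a rough illustration of the hook described above, the following Python
sketch post-processes text boxes after extraction. It rests on several
assumptions: that each box is available as a (font name, text) pair, which
poppler::page::text() alone does not provide; that the font name contains
"MathematicalPi"; and that the substitution table is filled in by hand.
normalize_boxes() and PI_FONT_TABLE are hypothetical names, and the single
table entry is a placeholder rather than a verified mapping.

```python
# Hypothetical substitution table: character as extracted -> intended symbol.
# The single entry is a placeholder; a real table must be built by comparing
# the rendered glyphs of the font with the extracted characters.
PI_FONT_TABLE = str.maketrans({"\x01": "\u03b2"})


def normalize_boxes(boxes, font_marker="MathematicalPi"):
    """Return the text of each box, remapping boxes set in the custom font.

    `boxes` is an iterable of (font_name, text) pairs; how to obtain the
    font name per box is outside the scope of this sketch.
    """
    out = []
    for font_name, text in boxes:
        # Substring match so subset-prefixed names (e.g. "ABCDEF+...") match too.
        if font_marker in font_name:
            text = text.translate(PI_FONT_TABLE)
        out.append(text)
    return out
```

Boxes set in other fonts pass through unchanged, so the table can only
affect text that actually uses the custom font.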


 [1] https://files.acrobat.com/a/preview/b445ea2f-fcbb-44af-a798-fc854d8dd9b5
 [2] https://github.com/ropensci/pdftools/files/2961444/Ames2004.pdf