[ https://issues.apache.org/jira/browse/NIFI-10218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew M. Lim updated NIFI-10218:
---------------------------------
    Attachment: example.pdf

> ExtractDocumentText processor does not handle certain characters when extracting from a PDF
> -------------------------------------------------------------------------------------------
>
>                 Key: NIFI-10218
>                 URL: https://issues.apache.org/jira/browse/NIFI-10218
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>            Reporter: Andrew M. Lim
>            Priority: Minor
>         Attachments: 625006.pdf, example.pdf
>
>
> When a PDF contains certain special characters ("+", "=", ">", "+-"), the
> text extracted from the document shows different symbols in their place.
> I've attached two PDFs that illustrate the issue differently:
> * 625006.pdf has multiple pages. When the text is extracted from a table,
> certain characters show up as a ? symbol.
> * example.pdf is a single page with the same table. When the text is
> extracted, the same characters show up as " or # symbols.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)