After upgrading to DSpace 5, I attempted to re-extract text from some of our PDF files that contain Arabic characters. Most of those characters were lost in the extraction process.
The text from the same documents was extracted while we were running DSpace 3 or DSpace 4, and the result was reasonably good.

In an attempt to resolve the issue, I upgraded my DSpace 5 instance to use pdfbox 2.0.0 as described in https://jira.duraspace.org/browse/DS-3035, but I am still unable to produce a good text extraction.

I had previously tested the following PR in DSpace 6 (https://github.com/DSpace/DSpace/pull/1287) with good results, but I am now unable to reproduce those results.

Can you recommend any configuration settings that I should review?

--
Terry Brady
Applications Programmer Analyst
Georgetown University Library Information Technology
http://georgetown-university-libraries.github.io/ <https://www.library.georgetown.edu/lit/code>
425-298-5498 (Seattle, WA)