[dspace-tech] Pdfbox Text Extract Issues

Terry Brady Fri, 03 Jun 2016 14:08:58 -0700

After upgrading to DSpace 5, I attempted to re-extract text from some of our
PDF files containing Arabic characters.  Most of these characters were lost
by the extraction process.

Text from the same documents had been extracted while we were running
DSpace 3 or DSpace 4, and that extraction was reasonably good.

In an attempt to resolve the issue, I upgraded my DSpace 5 instance to use
pdfbox 2.0.0 as described in https://jira.duraspace.org/browse/DS-3035, but
I am still unable to produce a good text extraction.

I had previously tested the following PR in DSpace 6
(https://github.com/DSpace/DSpace/pull/1287) and had good results.  I am
now unable to reproduce those results.

Can you recommend any configuration settings that I should review?

-- 
Terry Brady
Applications Programmer Analyst
Georgetown University Library Information Technology
http://georgetown-university-libraries.github.io/
<https://www.library.georgetown.edu/lit/code>
425-298-5498 (Seattle, WA)

