Re: [dspace-tech] Pdfbox Text Extract Issues

2016-06-06 Thread Terry Brady
Ivan,

Thanks for the note.  After investigating further, I found that the
problem lies in how I scripted my call to filter-media, not in the text
extraction code.

Terry

On Sat, Jun 4, 2016 at 3:41 PM, helix84 wrote:

> Hi Terry,
>
> could this be the culprit or the fix?
>
> https://jira.duraspace.org/browse/DS-1187
>
>
> Regards,
> ~~helix84
>



-- 
Terry Brady
Applications Programmer Analyst
Georgetown University Library Information Technology
http://georgetown-university-libraries.github.io/

425-298-5498 (Seattle, WA)

-- 
You received this message because you are subscribed to the Google Groups 
"DSpace Technical Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to dspace-tech+unsubscr...@googlegroups.com.
To post to this group, send email to dspace-tech@googlegroups.com.
Visit this group at https://groups.google.com/group/dspace-tech.
For more options, visit https://groups.google.com/d/optout.


Re: [dspace-tech] Pdfbox Text Extract Issues

2016-06-04 Thread helix84
Hi Terry,

could this be the culprit or the fix?

https://jira.duraspace.org/browse/DS-1187


Regards,
~~helix84



[dspace-tech] Pdfbox Text Extract Issues

2016-06-03 Thread Terry Brady
After upgrading to DSpace 5, I attempted to re-extract text from some of
our PDF files containing Arabic characters.  Most of these characters were
lost by the extraction process.

The text from the same documents had been extracted while running DSpace 3
or DSpace 4 and the extract was reasonably good.
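
One way to put a number on "reasonably good" versus "lost" is to measure
what share of the letters in each extracted text file are Arabic-script
characters, and compare the old and new extracts of the same document.
The following is only a minimal sketch, assuming the extracted text is
available as UTF-8 strings; the function name arabic_ratio is
hypothetical and is not part of DSpace or PDFBox.

```python
import unicodedata


def arabic_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Arabic-script.

    A character counts as Arabic when its Unicode name starts with "ARABIC".
    Returns 0.0 when the text contains no alphabetic characters at all.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("ARABIC")
    )
    return arabic / len(letters)
```

A sharp drop in this ratio between the DSpace 3/4 extract and the DSpace 5
extract of the same PDF would suggest that Arabic characters are being
dropped or replaced, rather than the whole extraction failing.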

In an attempt to resolve the issue, I upgraded my DSpace 5 instance to use
pdfbox 2.0.0 as described in https://jira.duraspace.org/browse/DS-3035, but
I am still unable to produce a good text extraction.

I had previously tested the following PR in DSpace 6
(https://github.com/DSpace/DSpace/pull/1287) and I had good results.  I am
now unable to reproduce those results.

Can you recommend any configuration settings that I should review?
