Fantastic. This is the first lucid comment I've heard on this subject.
It's this program that seems to be the bane of my existence. I like the
-s flag idea. I will definitely look at implementing that.
Thanks in advance,
Jeffrey Trimble
System Librarian
William F. Maag Library
Youngstown State University
330.941.2483 (Office)
[email protected]
http://www.maag.ysu.edu
http://digital.maag.ysu.edu
On Apr 8, 2009, at 10:36 AM, Tim Donohue wrote:
Jeffrey,
I've seen this same issue all too many times to count. From what
I've noticed, it seems that the PDFBox software (which DSpace uses)
occasionally has difficulties with larger PDFs (usually 7MB or
larger) which include OCRed, scanned images. I've never
encountered this problem with PDFs created directly from digital
files (like Word, etc.)...
From what I've seen, occasionally recreating the PDF will resolve
the problem...but, more often than not even that doesn't help. The
problem seems to be more of an issue with how PDFBox loads the
content into memory.
Locally, I've only come up with two possible solutions:
(1) Increase the memory available to the 'filter-media' script (by
bumping up the -Xmx value in the '[dspace]/bin/dsrun' script). This
works for some PDFs, but others will continue to have problems (as
PDFBox seems to use up enormous amounts of memory for some PDFs).
(2) Force those problematic PDFs to be skipped over by the
'filter-media' script (by using the -s flag):
To make this easier on myself, I've started maintaining a
"filter-skiplist" file which lists all the handles of the problematic
PDFs (so far we've encountered 35 of them), with a separate handle on
each line. Then, I pass this "filter-skiplist" file to the cronjob
which runs 'filter-media' like so:
0 2 * * * filter-media -s `less filter-skiplist | tr '\n' ','`
The above script translates all the newlines (\n) to commas (,) in
the 'filter-skiplist' file and passes the result to the
'filter-media' -s (skip) flag. So, in the end, filter-media receives a
comma-separated list of handles of PDFs which it should no longer
process. (Obviously this means any PDFs belonging to items in your
'filter-skiplist' cannot be full-text searched in DSpace.)
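As a rough sketch of the same skip-list idea, the helper below joins
the handles with commas but ignores blank lines and leaves no trailing
comma, in case that matters to the -s flag. The function name
build_skip_arg and the file path are hypothetical, not part of DSpace;
only 'filter-media' and its -s flag come from the setup described
above.

```shell
#!/bin/sh
# Sketch only: build_skip_arg and the skip-list path are assumptions
# made for illustration; they are not part of DSpace.

# build_skip_arg FILE
# Prints the handles listed in FILE (one per line) as a single
# comma-separated string, skipping blank lines and adding no
# trailing comma.
build_skip_arg() {
    grep -v '^[[:space:]]*$' "$1" | paste -sd, -
}

# Hypothetical usage from a wrapper script run by cron:
#   [dspace]/bin/filter-media -s "$(build_skip_arg /path/to/filter-skiplist)"
```

Putting this in a small wrapper script and calling that script from
cron also keeps the crontab entry itself short.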
I'm hoping that in the longer term PDFBox will resolve its memory
issues as it comes out of the "incubation" stage under Apache.
If anyone else has potential solutions, I'd love to hear them, as
I'm in a similar situation as Jeffrey.
- Tim
Jeffrey Trimble wrote:
I've run into a funky situation. After using the distributed PDFBox
and the associated jars (Bouncy Castle), filter-media works really,
really well, until:
We have one PDF that has caused filter-media to produce a memory
dump/Java heap dump. The errors are first reported by the IBM flavor
of the JVM. We removed the offending PDF from the database, and
filter-media went on its way merrily.
Has anyone seen anything like this? I have a copy of the heap dump
and trace. I can reproduce it on demand by placing this PDF back
into the IR.
If you have seen this and were able to resolve it, please let me
know. The only thing I can think of doing is to rescan the PDF file
from the original and see if the problem resolves itself with the
new scan.
Thanks in advance,
Jeffrey Trimble
System Librarian
William F. Maag Library
Youngstown State University
330.941.2483 (Office)
[email protected]
http://www.maag.ysu.edu
http://digital.maag.ysu.edu
------------------------------------------------------------------------
------------------------------------------------------------------------------
This SF.net email is sponsored by:
High Quality Requirements in a Collaborative Environment.
Download a free trial of Rational Requirements Composer Now!
http://p.sf.net/sfu/www-ibm-com
------------------------------------------------------------------------
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
--
Tim Donohue
Research Programmer, IDEALS
http://www.ideals.uiuc.edu/
University of Illinois
[email protected] | (217) 333-4648