Re: [Dspace-tech] Problem with filter-media

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] Thu, 09 Oct 2008 14:25:12 -0700

Hi Graham,

     Are you saying that I need the 1.5 version of PDFFilter.java (or
make the changes in the "diff" link) in order for the new parameter in
dspace.cfg to work?  Will it work if I download a 1.5 version of
PDFFilter.java?


 

     Did any of those other error messages mean anything to you?

Thanks a bunch Graham,

Sue

 

________________________________

From: Graham Triggs [mailto:[EMAIL PROTECTED] 
Sent: Thursday, October 09, 2008 5:02 PM
To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
Cc: dspace-tech@lists.sourceforge.net; Smail, James W. (LARC-B702)[NCI
INFORMATION SYSTEMS]
Subject: Re: [Dspace-tech] Problem with filter-media

 

Hi Susan,

 

These are long-known issues with PDF text extraction. Both are due to
bugs in the underlying libraries that are used, and not necessarily to
the PDF content or size.

 

For the heap space issue, a new configuration option was added to DSpace
1.5 - if you add to your dspace.cfg:

 

pdffilter.skiponmemoryexception=true

 

then it will skip the PDF when an out of memory exception occurs, rather
than failing the process.
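
As a rough illustration of what a skip-on-memory-error setting does, here is a minimal sketch. It is not the actual DSpace PDFFilter code: the class, the Extractor interface, and the filterAll method are hypothetical names used only for this example. The idea is that the extraction call is wrapped so an OutOfMemoryError on one document is caught, and that document is skipped when the flag is set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipOnMemorySketch {
    /** Hypothetical stand-in for a text extractor such as a PDF filter. */
    public interface Extractor {
        String extract(String name);
    }

    /**
     * Runs the extractor over each name. If skipOnMemoryException is true,
     * an OutOfMemoryError on one document is reported and that document is
     * skipped; otherwise the error propagates and ends the whole run.
     */
    public static List<String> filterAll(List<String> names, Extractor extractor,
                                         boolean skipOnMemoryException) {
        List<String> done = new ArrayList<>();
        for (String name : names) {
            try {
                extractor.extract(name);
                done.add(name);
            } catch (OutOfMemoryError e) {
                if (!skipOnMemoryException) {
                    throw e;
                }
                System.err.println("Skipping " + name + ": " + e.getMessage());
            }
        }
        return done;
    }

    public static void main(String[] args) {
        // Simulate a document that exhausts the heap by throwing the error directly.
        Extractor fake = name -> {
            if (name.startsWith("huge")) {
                throw new OutOfMemoryError("Java heap space");
            }
            return "text of " + name;
        };
        System.out.println(filterAll(Arrays.asList("a.pdf", "huge.pdf", "b.pdf"), fake, true));
    }
}
```

With the flag set to false, the same input would stop at the second document, which matches the behaviour described for a run without the option.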

 

But there isn't anything that we can do to extract data from PDFs where
the errors are occurring.

 

Note that if you aren't running DSpace 1.5, you might want to make
changes to your local PDFFilter class, in line with the diff here:

 

http://fisheye3.atlassian.com/browse/dspace/branches/dspace-1_5_x/dspace-api/src/main/java/org/dspace/app/mediafilter/PDFFilter.java?r1=2260&r2=2581

 

G

 

On 9 Oct 2008, at 21:20, Thornton, Susan M. (LARC-B702)[NCI INFORMATION
SYSTEMS] wrote:





     We've been having a problem with filter-media for as long as I can
remember, with DSpace 1.3.1 and now with DSpace 1.4.2.  I've emailed the
list and discussed this problem with some of the developers before, but
we've never had a resolution.  I've been doing some more research on it
myself for the past day or so and here are some interesting things that
I've found:

 

1.      99% of our documents are .pdf files.  filter-media seems to fail
with two different types of errors:

        a.      Java heap space - memory error
        b.      Possibly unreadable character(s) error or problem with
the actual format and/or scanning of the document

 

2.      filter-media does not actually fail with error type (b.) above,
but it does fail with error type (a.).  This error has resulted in
hundreds, maybe thousands of our documents not being filtered and,
consequently, not being full-text searchable.

 

3.      I used to think that perhaps the memory error was caused by our
repository being fairly large (right now we have a total of 101,633
Items and are in the process of loading thousands more) - that perhaps
the memory problem resulted *after* filtering lots of documents - maybe
it had eaten up all the memory in the process.  Today I figured out that
is absolutely not the problem.  In an attempt to get all the unfiltered
documents filtered, I wrote a SQL query that created a filter-media
execution line ("$BINDIR/dsrun
org.dspace.app.mediafilter.MediaFilterManager -n -i 2121/68481 $@") for
each individual Item in DSpace that did NOT have a $$$$$$$.pdf.txt
document in the Bitstream table, then I copied all these lines into one
script and ran it.  So filter-media executes over and over again with
the -i option (where you specify the handle you want filtered), once for
each document that hadn't previously been filtered.  What I found is
that the errors were occurring on the filtering of a *single* document
and were not caused by a "memory accumulation" effect.
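
The per-item script described in this step could be generated along these lines. This is only a sketch: the class and method names are made up for illustration, the list of handles is assumed to come from whatever query finds Items without a .pdf.txt bitstream, and the second handle in the example is invented. The command text itself is copied from the line quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FilterMediaScriptSketch {
    // Command prefix taken from the execution line quoted in the message.
    private static final String PREFIX =
        "$BINDIR/dsrun org.dspace.app.mediafilter.MediaFilterManager -n -i ";

    /** Builds one filter-media invocation per handle, in input order. */
    public static List<String> buildLines(List<String> handles) {
        List<String> lines = new ArrayList<>();
        for (String handle : handles) {
            lines.add(PREFIX + handle + " $@");
        }
        return lines;
    }

    public static void main(String[] args) {
        // "2121/68482" is a made-up handle used only to show a second line.
        for (String line : buildLines(Arrays.asList("2121/68481", "2121/68482"))) {
            System.out.println(line);
        }
    }
}
```

Because each line starts a fresh JVM, a failure on one handle cannot be blamed on memory left over from earlier documents, which is the point of the experiment.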

 

4.      In looking at some of the documents that were causing the
errors, it appears that perhaps it is the larger documents that are
getting the Java heap space error, although I'm not quite sure of this.
Here is one of the errors that occurred:

 

 

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.nio.CharBuffer.wrap(CharBuffer.java:350)
        at java.nio.CharBuffer.wrap(CharBuffer.java:373)
        at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:138)
        at java.lang.StringCoding.decode(StringCoding.java:173)
        at java.lang.String.<init>(String.java:444)
        at java.lang.String.<init>(String.java:516)
        at org.fontbox.cmap.CMapParser.createStringFromBytes(CMapParser.java:418)
        at org.fontbox.cmap.CMapParser.parse(CMapParser.java:152)
        at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
        at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
        at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
        at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
        at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:142)
        at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:169)
        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:344)
        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:313)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:280)
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:219)

 

 

Much of what I've found by Googling this suggests that either the
document is too large to be filtered, or it contains strings that are
too large for the String or substring operation being attempted.

 

 

5.      The other errors seem to be caused by, perhaps, non-readable
characters (maybe a bad scan of the document..??) or something actually
wrong with the scanned document.  Here are some of those errors:

 

ERROR filtering, skipping bitstream #46251 java.io.IOException: Error expected floating point number actual='110.-21'
ERROR filtering, skipping bitstream #46372 java.io.StreamCorruptedException: Error: data is null
ERROR filtering, skipping bitstream #46675 java.io.IOException: Error expected floating point number actual='98.-46'
ERROR filtering, skipping bitstream #46823 java.io.IOException: Error: Expected operator 'ID' actual='IM'
ERROR filtering, skipping bitstream #51652 java.io.EOFException: Unexpected end of ZLIB input stream  (Sue:  WHAT??!!)
ERROR filtering, skipping bitstream #46894 java.io.IOException: Error getting pdf version:java.lang.NumberFormatException: For input string: "fi"  (Sue:  Wow!  This is interesting.....??)
ERROR filtering, skipping bitstream #46938 java.io.IOException: Error: Expected operator 'ID' actual='IM'
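
To decide which documents to send for rescanning, messages like the ones above could be grouped by exception type. The following is a sketch only: the class and method names are hypothetical, and it assumes each message sits on a single line in the captured output and follows the "ERROR filtering, skipping bitstream #N exception: text" shape seen in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SkippedBitstreamSketch {
    // Group 1: bitstream ID. Group 2: exception class name. Group 3: message text.
    private static final Pattern SKIPPED = Pattern.compile(
        "ERROR filtering, skipping bitstream #(\\d+)\\s+([\\w.$]+):\\s*(.*)");

    /** Maps each exception class name to the bitstream IDs that raised it. */
    public static Map<String, List<Integer>> groupByException(List<String> logLines) {
        Map<String, List<Integer>> byException = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = SKIPPED.matcher(line);
            if (m.find()) {
                byException.computeIfAbsent(m.group(2), k -> new ArrayList<>())
                           .add(Integer.parseInt(m.group(1)));
            }
        }
        return byException;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "ERROR filtering, skipping bitstream #46251 java.io.IOException: Error expected floating point number actual='110.-21'",
            "ERROR filtering, skipping bitstream #46372 java.io.StreamCorruptedException: Error: data is null",
            "ERROR filtering, skipping bitstream #46675 java.io.IOException: Error expected floating point number actual='98.-46'");
        // Print one line per exception type with the affected bitstream IDs.
        for (Map.Entry<String, List<Integer>> e : groupByException(sample).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Lines that do not match the pattern, such as stack-trace frames, are simply ignored.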

 

 

I am going to have a few of these documents rescanned to see if that
will correct the problem, however I have no idea how to correct the heap
space error.  Here's what our "dsrun" looks like:

 

java -Xmx3072m -Dfile.encoding=UTF-8 -classpath $FULLPATH "$@"
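
One way to confirm how much heap the JVM actually received from a command line like this is to run a tiny class through the same launcher. This is a sketch with a hypothetical class name; Runtime.maxMemory() is a standard Java call, and the figure it reports may differ slightly from the -Xmx value.

```java
public class HeapCheckSketch {
    /** Returns the maximum heap the running JVM will attempt to use, in megabytes. */
    public static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMegabytes() + " MB");
    }
}
```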

 

We are running PostgreSQL 8.2.5 on Sun Solaris 10 with DSpace 1.4.2 (and
gearing up for 1.5).

 

Can anyone help with this?  This is a serious problem for us, since like
I said, it is causing our full-text searchability to be
inaccurate/incomplete.

 

Thanks in advance,

Sue

 

 

 

 

 

Sue Walker-Thornton

ConITS Contract
NASA Langley Research Center
Integrated Library Systems Application & Database Administrator

130 Research Drive

Hampton, VA  23666

Office: (757) 224-4074
Fax:    (757) 224-4001
Pager: (757) 988-2547 
Email:  [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]> 

 

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech

 

 

 

This e-mail is confidential and should not be used by anyone who is not
the original intended recipient. BioMed Central Limited does not accept
liability for any statements made which are clearly the sender's own and
not expressly made on behalf of BioMed Central Limited. No contracts may
be concluded on behalf of BioMed Central Limited by means of e-mail
communication. BioMed Central Limited Registered in England and Wales
with registered number 3680030 Registered Office Middlesex House, 34-42
Cleveland Street, London W1T 4LB

This email has been scanned by Postini.
For more information please visit http://www.postini.com

