[ How does the PDFBox stay ahead of the game when Adobe is upgrading and
   adding features to the PDF standard? ]

PDFBox could track and target the ISO standard first and Adobe's second.

ISO 32000-1:2008, Document management -- Portable document format -- Part 1: PDF 1.7

http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=51502

-- Van Ly 



-----Original Message-----
From: Jeffrey Trimble [mailto:jtrim...@cc.ysu.edu]
Sent: Thu 4/9/2009 5:46 AM
To: Richard Rodgers
Cc: dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] Java Heap dumps during Filter-Media
 
Got my vote for that.  Until the PDFBox is perfected to really delve
into this, it will be helpful.

How does the PDFBox stay ahead of the game when Adobe is upgrading and
adding features to the PDF standard?  That should be catching up with
all of us sooner or later.

--Jeff

Jeffrey Trimble
System Librarian
William F.  Maag Library
Youngstown State University
330.941.2483 (Office)
jtrim...@cc.ysu.edu
http://www.maag.ysu.edu
http://digital.maag.ysu.edu



On Apr 8, 2009, at 11:53 AM, Richard Rodgers wrote:

> At MIT we came up with a similar approach, which takes some of the
> grunt work out of managing the skips. We extended MediaFilter to
> detect PDFBox (or other) exceptions, then automatically record the
> handles of the failing items in a skip list, which is used for any
> subsequent runs. We'd be glad to give you the code or just put it into
> the next 1.5.X release.
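
A rough shell analogue of that idea can be sketched as follows. This is
not the MIT MediaFilter extension (which is Java and is not shown in this
thread); it only illustrates the "record failures, skip them next time"
pattern. The function name process_with_skiplist and the per-item command
passed to it are hypothetical, and the sketch assumes that command exits
non-zero when it fails on an item.

```shell
#!/bin/sh
# Run a per-item command for each handle, skipping handles that are
# already on the skip list and appending any handle whose run fails.
#   $1  path to the skip-list file (one handle per line)
#   $2  command to run; it receives the handle as its only argument
#   $@  (remaining arguments) the handles to process
process_with_skiplist() {
    skiplist=$1
    cmd=$2
    shift 2
    # Make sure the skip list exists before grepping it.
    touch "$skiplist"
    for handle in "$@"; do
        # -x whole-line match, -F fixed string, -q no output
        if grep -qxF -- "$handle" "$skiplist"; then
            continue
        fi
        # A non-zero exit status is treated as a failure for this item.
        if ! "$cmd" "$handle"; then
            echo "$handle" >> "$skiplist"
        fi
    done
}
```

Because failures are appended to the same file that is consulted at the
start of each iteration, a handle that fails once is not retried on later
runs until someone removes it from the file by hand.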
>
> Thanks,
>
> Richard R
>
> Quoting Tim Donohue <tdono...@illinois.edu>:
>
>> Jeffrey,
>>
>> I've seen this same issue all too many times to count.  From what I've
>> noticed, the PDFBox software (which DSpace uses) occasionally has
>> difficulties with larger PDFs (usually 7MB or larger) that include
>> OCRed, scanned images.  I've never encountered this problem with PDFs
>> created directly from digital files (like Word, etc.)...
>>
>> From what I've seen, occasionally recreating the PDF will resolve the
>> problem...but, more often than not, even that doesn't help.  The
>> problem seems to be more of an issue with how PDFBox loads the content
>> into memory.
>>
>> Locally, I've only come up with two possible solutions:
>>
>> (1) Increase the memory available to the 'filter-media' script (by
>> bumping up the -Xmx value in the '[dspace]/bin/dsrun' script).  This
>> works for some PDFs, but others will continue to have problems (as
>> PDFBox seems to use up enormous amounts of memory for some PDFs).
>>
>> (2) Force those problematic PDFs to be skipped over by the
>> 'filter-media' script (by using the -s flag):
>>
>> To make this easier on myself, I've started maintaining a
>> "filter-skiplist" file which lists all the handles of the problematic
>> PDFs (so far we've encountered 35 of them), with a separate handle on
>> each line.  Then, I pass this "filter-skiplist" file to the cronjob
>> which runs 'filter-media' like so:
>>
>> 0 2 * * * filter-media -s `less filter-skiplist | tr '\n' ','`
>>
>> The above cron entry translates all the newlines (\n) to commas (,) in
>> the 'filter-skiplist' file and passes the result to the 'filter-media'
>> -s (skip) flag.  So, in the end, filter-media receives a comma-separated
>> list of handles of PDFs which it should no longer process.  (Obviously
>> this means any PDFs belonging to items in your 'filter-skiplist' cannot
>> be full text searched in DSpace.)
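
The newline-to-comma step described above can also be sketched as a small
helper that ignores blank lines and stray whitespace and does not leave a
trailing comma. The thread does not say whether filter-media minds a
trailing comma, so this is only a defensive variant, not a required fix.
The function name join_handles and the file path in the usage comment are
hypothetical.

```shell
#!/bin/sh
# Join the handles in a skip-list file into one comma-separated string.
#   $1  path to the skip-list file (one handle per line)
# Leading/trailing whitespace is trimmed and empty lines are dropped,
# so the result has no empty entries and no trailing comma.
join_handles() {
    sed -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' "$1" | paste -s -d ',' -
}

# Possible use in place of the backtick expression (path is an assumption):
#   filter-media -s "$(join_handles /path/to/filter-skiplist)"
```

Quoting the command substitution keeps the whole list as a single
argument to -s even if a line in the file happens to contain a space.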
>>
>> I'm hoping that in the longer term PDFBox will resolve its memory
>> issues as it comes out of the "incubation" stage under Apache.
>>
>> If anyone else has potential solutions, I'd love to hear them, as I'm
>> in a similar situation to Jeffrey.
>>
>> - Tim
>>
>>
>> Jeffrey Trimble wrote:
>>> I've run into a funky situation.  After using the distributed
>>> PDFBox and the associated jars (Bouncy Castle), filter-media works
>>> really, really well, until--
>>>
>>> We have one PDF that has caused filter-media to produce a memory
>>> dump/Java heap dump.  The errors are reported by the IBM flavor of
>>> the JVM.  We removed the offending PDF from the database, and
>>> filter-media went on its way merrily.
>>>
>>> Has anyone seen anything like this?  I have a copy of the heap dump
>>> and trace.  I can reproduce it on demand by placing this PDF back
>>> into the IR.
>>>
>>> If you have seen this, and were able to resolve it, please let me
>>> know.  The only thing I can think of doing is to rescan the PDF file
>>> from the original and see if the problem resolves itself with the
>>> new scan.
>>>
>>> Thanks in advance,
>>>
>>>
>>> Jeffrey Trimble
>>> System Librarian
>>> William F.  Maag Library
>>> Youngstown State University
>>> 330.941.2483 (Office)
>>> jtrim...@cc.ysu.edu <mailto:jtrim...@cc.ysu.edu>
>>> http://www.maag.ysu.edu
>>> http://digital.maag.ysu.edu
>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> ------------------------------------------------------------------------------
>>> This SF.net email is sponsored by:
>>> High Quality Requirements in a Collaborative Environment.
>>> Download a free trial of Rational Requirements Composer Now!
>>> http://p.sf.net/sfu/www-ibm-com
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> DSpace-tech mailing list
>>> DSpace-tech@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>>
>> --
>> Tim Donohue
>> Research Programmer, IDEALS
>> http://www.ideals.uiuc.edu/
>> University of Illinois
>> tdono...@illinois.edu | (217) 333-4648
>>
>>
>
>
>




