Hi Tim,

     So you're saying that our proposed solution would work as long as
we remove (or comment out):

 

filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF

 

from dspace.cfg and make the change to not display the .txt files on the
Item pages?
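
For reference, a minimal sketch of making that edit with a script instead of by hand. It assumes dspace.cfg uses plain "name = value" lines with "#" for comments; the helper name comment_out_property is made up for illustration and is not part of DSpace. Whether commenting out this one line is enough to stop PDFs from being filtered is exactly the question above, so this only shows the edit itself.

```python
def comment_out_property(cfg_text: str, key: str) -> str:
    """Return cfg_text with every active 'key = value' line for `key` commented out."""
    out = []
    for line in cfg_text.splitlines(keepends=True):
        stripped = line.lstrip()
        # Only touch active (uncommented) lines whose property name is exactly `key`.
        if not stripped.startswith("#") and "=" in stripped:
            name = stripped.split("=", 1)[0].strip()
            if name == key:
                line = "# " + line
        out.append(line)
    return "".join(out)


# Property name taken from the message above.
PDF_KEY = "filter.org.dspace.app.mediafilter.PDFFilter.inputFormats"

# Two-line stand-in for a real dspace.cfg (illustrative only).
sample = (
    "filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF\n"
    "pdffilter.largepdfs = true\n"
)
print(comment_out_property(sample, PDF_KEY), end="")
```

Running it twice on the same text leaves the already-commented line alone, so it is safe to re-apply.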

 

Then we would still need to run filter-media, but only to add our .txt
files to the TEXT bundle for each Item?

 

By the way, we have been using the 1.5 version of filter-media, with the
two new configuration parameters added to dspace.cfg, for a while, even
though we are running DSpace 1.4.2.  I did this a while back and yes, it
has stopped the Java heap space errors from killing filter-media
midstream.

 

I do think this new plan is the better way to go for us.  I believe the
advantages would be:

1.  No more filter-media running for soooo long - over 24 hours most of
the time.

2.  We would identify "problematic" .pdf files (ones that possibly
wouldn't filter) prior to importing them into DSpace, instead of
after-the-fact.  When these problems are caught at the scanning point,
they could be dealt with there and then (rescanning/re-ocr'ing, etc).

3.  Our Users wouldn't have such a big job of identifying the
"unfilterable" documents, locating them for rescanning, getting them
back to us for re-import, etc etc.  

4.  Bottom line would be a more accurate full-text searchable
repository.
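
Point 2 could be checked before import with something like the sketch below. It assumes each .pdf is delivered with a same-named .txt in the same directory, and it treats a missing or empty .txt as the sign of a "problematic" scan; the function name find_missing_text and the sample file names are made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def find_missing_text(import_root: Path) -> List[Path]:
    """Return PDFs under import_root with no non-empty same-named .txt beside them."""
    problems = []
    for pdf in sorted(import_root.rglob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        # A missing or zero-byte .txt suggests the text export failed at scan time.
        if not txt.is_file() or txt.stat().st_size == 0:
            problems.append(pdf)
    return problems


# Small demonstration on a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    item = Path(tmp) / "item_001"
    item.mkdir()
    (item / "report.pdf").write_bytes(b"%PDF-1.4")
    (item / "report.txt").write_text("extracted text")
    (item / "memo.pdf").write_bytes(b"%PDF-1.4")  # no memo.txt delivered
    for pdf in find_missing_text(Path(tmp)):
        print("needs rescan or re-OCR:", pdf.name)
```

Anything it reports can go back to the scanning step before the Item ever reaches DSpace.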

 

Thanks a bunch for the detailed feedback.  We are processing a
1,000-document test with this new procedure and will let you know how it
goes!!

Sue

 

-----Original Message-----
From: Tim Donohue [mailto:tdono...@illinois.edu] 
Sent: Thursday, January 15, 2009 11:27 AM
To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
Cc: dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI
INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION
SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

 

Sue,

 

There were some improvements to 'filter-media' in DSpace 1.5.x.
Primarily, there's the addition of two new PDF-specific settings in the
dspace.cfg:

 

pdffilter.largepdfs = true
pdffilter.skiponmemoryexception = true

 

The former ensures that all PDF text-extractions are written to
temporary files during indexing.  This helps avoid the OutOfMemoryError
/ heap space errors that were occasionally caused by larger PDFs being
loaded into system memory all at once.

 

The latter attempts to skip over any PDFs which still cause an
OutOfMemoryError.  So, if that error still occurs on a PDF, then the
PDF is skipped entirely and *not* indexed.  This helps to avoid the
entire 'filter-media' script "crashing" when an OutOfMemoryError
occurs (which used to happen in 1.4.2).
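
To make the two behaviors concrete, here is a rough sketch of the pattern in Python. It is not the DSpace code: Python's MemoryError stands in for Java's out-of-memory error, and filter_all, fake_extract, and the file names are all made up for illustration.

```python
import tempfile
from typing import Callable, IO, Iterable, List, Tuple


def filter_all(
    names: Iterable[str],
    extract_to: Callable[[str, IO[str]], None],
    skip_on_memory_error: bool = True,
) -> Tuple[List[str], List[str]]:
    """Run extract_to on each name; return (filtered names, skipped names)."""
    done: List[str] = []
    skipped: List[str] = []
    for name in names:
        # Extracted text goes to a temporary file, not one big in-memory string.
        with tempfile.TemporaryFile(mode="w+", encoding="utf-8") as out:
            try:
                extract_to(name, out)
            except MemoryError:
                if not skip_on_memory_error:
                    raise  # without the skip option, one bad file ends the run
                skipped.append(name)  # this file is left unindexed
                continue
            done.append(name)
    return done, skipped


def fake_extract(name: str, out: IO[str]) -> None:
    """Stand-in extractor: fails on one file to simulate running out of memory."""
    if name == "huge.pdf":
        raise MemoryError("simulated")
    out.write("text of " + name)


done, skipped = filter_all(["a.pdf", "huge.pdf", "b.pdf"], fake_extract)
print("filtered:", done)
print("skipped:", skipped)
```

The point of the second setting is visible in the loop: the failing file is recorded and passed over, and the files after it still get processed.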

 

Despite these changes in 1.5.x, there is NO guarantee that *all* of your
PDFs will index properly.  As I've mentioned before, the 'filter-media'
script uses third-party software (called PDFBox: http://www.pdfbox.org/)
for indexing of PDF files.  There are some known bugs in PDFBox that
have yet to be fixed, so it does *not* always work for all PDFs.  In
some cases, PDFBox will also work inconsistently (and I don't know why
that is).  I've run into some inconsistency problems with larger-sized
PDFs, which are originally scanned documents with embedded OCR.
Occasionally PDFBox will index them fine, and other times it will cause
an OutOfMemoryError (which, with DSpace 1.5, means that 'filter-media'
will just skip that PDF).

 

So, I guess the best way to sum this up is that DSpace currently cannot
successfully index 100% of all PDFs, since PDFBox cannot do so.  DSpace
1.5 has improvements in helping DSpace to safely handle PDFBox issues
(like the OutOfMemoryError), but it doesn't necessarily have drastic
improvements in indexing capabilities.

 

I answered your other questions inline below...

 

 

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:

 

> 1. Has the filter-media/index-all process changed and/or improved
> significantly in DSpace 1.5?  If so, we may just shelve this issue
> until we've implemented 1.5.

 

See above, obviously...

 

> 2. In DSpace 1.4.2 (and 1.5), does it matter whether your .txt files
> are plain or accessible .txt files?  Can index-all process either
> type?

 

For text files, it doesn't really matter...in either case the
'filter-media' script just pulls out the plain text for indexing.  I
don't believe there'd be any significant difference between the "type"
of .txt file.

 

However, it's worth making this clear: for .txt files, you *still* need
to run the 'filter-media' script for them to be indexed by 'index-all'.
Essentially, 'index-all' only indexes plain text files in the "TEXT"
bundle.  The 'filter-media' script is what adds plain text to the "TEXT"
bundle.

 

> 3. If the process in 1.5 hasn't changed and/or improved significantly
> in 1.5, we are considering having our scanning folks just create the
> .txt files along with the .pdf files at the time the documents are
> scanned.  Then when they send them to us, we would just upload them in
> the import process along with the .pdf files for each Item.  The only
> thing we'd really have to change in our import process is the addition
> of a second file name in the "contents" file and the addition of the
> .txt document in the Item's import directory (right along with the
> .pdf file).  One other issue is we might have to make a small
> modification to DSpace to **not** display the .txt file on the Item
> page unless the User is in the Admin interface since we wouldn't want
> our Users clicking on/opening the .txt files.  If we did this, we
> could completely eliminate the filter-media job altogether.  This
> would ensure that we did not load any "unfilterable" documents into
> DSpace.  It would also eliminate the tedious process of identifying
> which documents did not filter successfully, and the whole process of
> rescanning and replacing them in DSpace.
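
The import-side change described in the question above (a second file name in each Item's "contents" file) could be scripted along these lines. This is only a sketch under assumptions: one file name per line in "contents", anything after a tab on a line treated as options and left untouched, and the .txt sharing the PDF's base name. The function name add_text_to_contents and the sample files are made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def add_text_to_contents(item_dir: Path) -> List[str]:
    """Append each listed PDF's same-named .txt to item_dir/contents if present and unlisted."""
    contents = item_dir / "contents"
    lines = [ln for ln in contents.read_text().splitlines() if ln.strip()]
    # Assumption: the file name is everything before the first tab on a line.
    listed = [ln.split("\t", 1)[0].strip() for ln in lines]
    added: List[str] = []
    for name in listed:
        if not name.lower().endswith(".pdf"):
            continue
        txt = name[:-4] + ".txt"
        if txt not in listed and txt not in added and (item_dir / txt).is_file():
            added.append(txt)
    if added:
        contents.write_text("\n".join(lines + added) + "\n")
    return added


# Demonstration on a throwaway Item directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    item = Path(tmp)
    (item / "contents").write_text("report.pdf\n")
    (item / "report.pdf").write_bytes(b"%PDF-1.4")
    (item / "report.txt").write_text("extracted text")
    print(add_text_to_contents(item))
    print((item / "contents").read_text(), end="")
```

A second run on the same directory adds nothing, since the .txt is already listed.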

 

This sounds like a perfectly reasonable way of doing things, assuming
you have the staff time to pre-generate those .txt files.  You are
correct that you'd no longer need to run 'filter-media' on those PDFs.
But, you'd still need to run 'filter-media' to index those .txt files.
You could do this by modifying the "Media Filter" settings in your
dspace.cfg and *removing* the PDFFilter from the list (so 'filter-media'
would no longer filter PDFs, but it would work on the other types of
content).

 

It would also require some custom coding to hide those .txt files from
normal users, but that shouldn't be too horrible.

 

If you did go this route, I'd make sure that you still OCR the PDFs that
you put in, as it improves their accessibility overall.

 

Hopefully that all makes sense...definitely let us know if you have
further questions.

 

- Tim

 

-- 

Tim Donohue
Research Programmer, IDEALS
http://www.ideals.uiuc.edu/
University of Illinois
tdono...@illinois.edu | (217) 333-4648
