Actually...

Looking at the code of DSIndexer (written by, among others, myself, I'm
sure), we find that only Bitstreams within the "TEXT" bundle are
actually indexed into Lucene:

>     for (int i = 0; i < myBundles.length; i++)
>     {
>         if ((myBundles[i].getName() != null)
>                 && myBundles[i].getName().equals("TEXT"))
>         {

I think this was short-sighted, and the unhappy consequence is that
your text files will not get indexed if you place them into the
"CONTENT" bundle.  There are two solutions:

A.) Put your text Bitstreams into the TEXT bundle.  You then don't have
to worry about them being exposed, because the TEXT bundle will not be.

B.) Put your text Bitstreams in the CONTENT bundle, alter the UI to
hide them, and alter DSIndexer to index the CONTENT bundle.
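
As a rough sketch of the DSIndexer half of option B (this is not the
actual DSIndexer code), the hard-coded "TEXT" comparison could become a
lookup in a set of bundle names.  The class name IndexedBundles, the
method isIndexed, and the choice to keep TEXT indexed alongside CONTENT
are all assumptions made for illustration; the example uses only plain
Java so it runs without DSpace on the classpath.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class IndexedBundles {
    // Assumption: TEXT stays indexed and CONTENT is added for option B.
    private static final Set<String> INDEXED =
            new HashSet<>(Arrays.asList("TEXT", "CONTENT"));

    /** Null-safe check, mirroring the null guard in the quoted loop. */
    public static boolean isIndexed(String bundleName) {
        return bundleName != null && INDEXED.contains(bundleName);
    }

    public static void main(String[] args) {
        // Hypothetical bundle names, including one with no name set.
        String[] names = {"TEXT", "CONTENT", "THUMBNAIL", null};
        for (String name : names) {
            System.out.println(name + " -> " + isIndexed(name));
        }
    }
}
```

Inside the loop quoted above, the condition would then read something
like isIndexed(myBundles[i].getName()).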

Mark

On Jan 16, 2009, at 2:40 PM, Tim Donohue wrote:

> Susan,
>
> Actually, the setting you'd want to change in your DSpace 1.4.2
> dspace.cfg is this one:
>
> plugin.sequence.org.dspace.app.mediafilter.MediaFilter = ...
>
> You'd want to remove the entry for:
> "org.dspace.app.mediafilter.PDFFilter"
>
> That'd ensure that the PDFFilter is no longer used by filter-media.
> The setting that you referenced below just configures the PDF filter
> to process files which are in "Adobe PDF" format.
>
> [NOTE:] If you end up upgrading to DSpace 1.5.x, the above
> "plugin.sequence.org.dspace.app.mediafilter.MediaFilter" setting no
> longer exists.  Instead, it was replaced by a simpler
> "filter.plugins" setting.  In that case, for DSpace 1.5.x, you'd just
> remove "PDF Text Extractor" from the list of enabled "filter.plugins".
> Again, this would ensure that 'filter-media' would no longer use the
> PDF filter.
>
> Hopefully that all makes sense... Beyond that, as you mentioned, you'd
> just need to hide those '*.txt' files from being displayed.
>
> - Tim
>
>
>
> Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
>> Hi Tim,
>>
>> So you're saying that our proposed solution would work as long as
>> we remove (or comment out):
>>
>> filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF
>>
>> from dspace.cfg and make the change to not display the .txt files on
>> the Item pages?
>>
>> Then we would still need to run filter-media, but only to add our
>> .txt files to the TEXT bundle for each Item?
>>
>> By the way, we have been using the 1.5 version of filter-media, with
>> the addition of the two new configuration parameters in dspace.cfg,
>> for a while, even though we are running DSpace 1.4.2.  I did this a
>> while back and yes, it has stopped the Java heap space errors from
>> killing filter-media midstream.
>>
>> I do think this new plan is the better way to go for us.  I believe
>> the advantages would be:
>>
>> 1.  No more filter-media running for soooo long (over 24 hours most
>> of the time).
>>
>> 2.  We would identify “problematic” .pdf files (ones that possibly
>> wouldn’t filter) prior to importing them into DSpace, instead of
>> after the fact.  When these problems are caught at the scanning
>> point, they could be dealt with there and then (rescanning,
>> re-OCR’ing, etc).
>>
>> 3.  Our Users wouldn’t have such a big job of identifying the
>> “unfilterable” documents, locating them for rescanning, getting them
>> back to us for re-import, etc.
>>
>> 4.  Bottom line would be a more accurate full-text searchable
>> repository.
>>
>> Thanks a bunch for the detailed feedback.  We are processing a
>> 1000-document test with this new procedure and will let you know how
>> it goes!!
>>
>> Sue
>>
>> -----Original Message-----
>> From: Tim Donohue [mailto:tdono...@illinois.edu]
>> Sent: Thursday, January 15, 2009 11:27 AM
>> To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
>> Cc: dspace-tech@lists.sourceforge.net;
>>     Kimbrough, Glenn W. (LARC-B7)[NCI INFORMATION SYSTEMS];
>>     Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION SYSTEMS];
>>     Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
>> Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
>>
>> Sue,
>>
>> There were some improvements to 'filter-media' in DSpace 1.5.x.
>> Primarily, there's the addition of two new PDF-specific settings in
>> the dspace.cfg:
>>
>> pdffilter.largepdfs = true
>> pdffilter.skiponmemoryexception = true
>>
>> The former ensures that all PDF text-extractions are written to
>> temporary files during indexing.  This helps avoid
>> OutOfMemoryException & Heap space errors that were occasionally
>> caused by larger PDFs being loaded into system memory all at once.
>>
>> The latter attempts to skip over any PDFs which still cause an
>> OutOfMemoryException.  So, if that exception still occurs on a PDF,
>> then the PDF is skipped entirely and *not* indexed.  This helps to
>> avoid the entire 'filter-media' script "crashing" when an
>> OutOfMemoryException occurs (which used to happen in 1.4.2).
>>
>> Despite these changes in 1.5.x, there is NO guarantee that *all* of
>> your PDFs will index properly.  As I've mentioned before, the
>> 'filter-media' script uses third-party software (called PDFBox:
>> http://www.pdfbox.org/) for indexing of PDF files.  There are some
>> known bugs in PDFBox that have yet to be fixed, so it does *not*
>> always work for all PDFs.  In some cases, PDFBox will also work
>> inconsistently (and I don't know why that is).  I've run into some
>> inconsistency problems with larger-sized PDFs, which are originally
>> scanned documents with embedded OCR.  Occasionally PDFBox will index
>> them fine, and other times it will cause an OutOfMemoryException
>> (which, with DSpace 1.5, means that 'filter-media' will just skip
>> that PDF).
>>
>> So, I guess the best way to sum this up is that DSpace currently
>> cannot successfully index 100% of all PDFs, since PDFBox cannot do
>> so.  DSpace 1.5 has improvements in helping DSpace to safely handle
>> PDFBox issues (like the OutOfMemoryExceptions), but it doesn't
>> necessarily have drastic improvements in indexing capabilities.
>>
>> I answered your other questions inline below...
>>
>> Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
>>
>>> 1.  Has the filter-media/index-all process changed and/or improved
>>> significantly in DSpace 1.5?  If so, we may just shelve this issue
>>> until we’ve implemented 1.5.
>>
>> See above, obviously...
>>
>>> 2.  In DSpace 1.4.2 (and 1.5), does it matter whether your .txt
>>> files are plain or accessible .txt files?  Can index-all process
>>> either type?
>>
>> For text files, it doesn't really matter... in either case the
>> 'filter-media' script just pulls out the plain text for indexing.  I
>> don't believe there'd be any significant difference between the
>> "type" of .txt file.
>>
>> However, it's worth making this clear: for .txt files, you *still*
>> need to run the 'filter-media' script for them to be indexed by
>> 'index-all'.  Essentially, 'index-all' only indexes plain text files
>> in the "TEXT" bundle.  The 'filter-media' script is what adds plain
>> text to the "TEXT" bundle.
>>
>>> 3.  If the process hasn’t changed and/or improved significantly in
>>> 1.5, we are considering having our scanning folks just create the
>>> .txt files along with the .pdf files at the time the documents are
>>> scanned.  Then when they send them to us, we would just upload them
>>> in the import process along with the .pdf files for each Item.  The
>>> only thing we’d really have to change in our import process is the
>>> addition of a second file name in the “contents” file and the
>>> addition of the .txt document in the Item’s import directory (right
>>> along with the .pdf file).  One other issue is we might have to make
>>> a small modification to DSpace to *not* display the .txt file on the
>>> Item page unless the User is in the Admin interface, since we
>>> wouldn’t want our Users clicking on/opening the .txt files.  If we
>>> did this, we could eliminate the filter-media job altogether.  This
>>> would ensure that we did not load any “unfilterable” documents into
>>> DSpace.  It would also eliminate the tedious process of identifying
>>> which documents did not filter successfully, and the whole process
>>> of rescanning and replacing them in DSpace.
>>
>> This sounds like a perfectly reasonable way of doing things, assuming
>> you have the staff time to pre-generate those .txt files.  You are
>> correct that you'd no longer need to run 'filter-media' on those
>> PDFs.  But, you'd still need to run 'filter-media' to index those
>> .txt files.  You could do this by modifying the "Media Filter"
>> settings in your dspace.cfg and *removing* the PDFFilter from the
>> list (so 'filter-media' would no longer filter PDFs, but it would
>> work on the other types of content).
>>
>> It would also require some custom coding to hide those .txt files
>> from normal users, but that shouldn't be too horrible.
>>
>> If you did go this route, I'd make sure that you still OCR the PDFs
>> that you put in, as it improves their accessibility overall.
>>
>> Hopefully that all makes sense... definitely let us know if you have
>> further questions.
>>
>> - Tim
>>
>> -- 
>> Tim Donohue
>> Research Programmer, IDEALS
>> http://www.ideals.uiuc.edu/
>> University of Illinois
>> tdono...@illinois.edu | (217) 333-4648
>>
>
> -- 
> Tim Donohue
> Research Programmer, IDEALS
> http://www.ideals.uiuc.edu/
> University of Illinois
> tdono...@illinois.edu | (217) 333-4648
>
> ------------------------------------------------------------------------------
> This SF.net email is sponsored by:
> SourceForge Community
> SourceForge wants to tell your story.
> http://p.sf.net/sfu/sf-spreadtheword
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech

~~~~~~~~~~~~~
Mark R. Diggory
http://purl.org/net/mdiggory/homepage



