Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] Wed, 21 Jan 2009 15:09:42 -0800

Thanks for all your help Tim!  I think this will help us out a lot!
Best,
Sue


-----Original Message-----
From: Tim Donohue [mailto:tdono...@illinois.edu] 
Sent: Wednesday, January 21, 2009 4:54 PM
To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
Cc: Diggory Mark; dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

Sue,

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
> Hi Tim,
>      Thanks to all for the suggestions.  Basically I am trying to
> prevent filter-media from attempting to filter our .pdf files and I
> want index-all to index only our .txt files.
> 
>      So if I remove the pdffilter parameters from dspace.cfg and I
> have all our .txt files in the TEXT bundle (using one of the 3 options
> you outlined), this should work and we shouldn't have to run
> filter-media at all, right?

That's almost correct, except for the last part of your statement. 
You'll notice that in the options I laid out below, Options #1 and #3 
specifically state you STILL need to run 'filter-media'.  This is 
because in both those options you are starting with the *.txt files in 
the ORIGINAL bundle, and they need to be copied to the TEXT bundle 
before they can be indexed.

Is this starting to make some sense?  The 'filter-media' script is what 
extracts the full text (from PDF, HTML, Word or plain text formats) and 
generates a corresponding *.txt file in the TEXT bundle containing the 
extracted full text.  Since the 'index-all' script will ONLY index *.txt 
files from the TEXT bundle, you will always need to run 'filter-media' 
first unless you've manually added the *.txt files to the TEXT bundle.
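
To make the ordering concrete, here is a minimal Java sketch of the rule 
described above.  It is a toy in-memory model, not the DSpace API: the 
class BundleSketch and its methods add, filterMedia and indexAll are 
hypothetical names invented for illustration, and the real 'filter-media' 
script may well name the derived file differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of one item's bundles. Hypothetical; NOT the DSpace API.
class BundleSketch {
    // Bundle name (e.g. "ORIGINAL", "TEXT") -> bitstream file names.
    private final Map<String, List<String>> bundles = new HashMap<>();

    void add(String bundle, String fileName) {
        bundles.computeIfAbsent(bundle, k -> new ArrayList<>()).add(fileName);
    }

    private List<String> filesIn(String bundle) {
        return bundles.getOrDefault(bundle, Collections.<String>emptyList());
    }

    // Stand-in for 'filter-media': copy *.txt files from ORIGINAL into TEXT,
    // skipping any name that is already present in TEXT.
    void filterMedia() {
        for (String name : new ArrayList<>(filesIn("ORIGINAL"))) {
            if (name.endsWith(".txt") && !filesIn("TEXT").contains(name)) {
                add("TEXT", name);
            }
        }
    }

    // Stand-in for 'index-all': only files in the TEXT bundle are indexed.
    List<String> indexAll() {
        return new ArrayList<>(filesIn("TEXT"));
    }

    public static void main(String[] args) {
        BundleSketch item = new BundleSketch();
        item.add("ORIGINAL", "report.pdf");
        item.add("ORIGINAL", "report.txt");

        System.out.println(item.indexAll()); // nothing in TEXT yet
        item.filterMedia();
        System.out.println(item.indexAll()); // report.txt is now indexable
    }
}
```

In this model, calling indexAll before filterMedia finds nothing to index, 
which mirrors why Options #1 and #3 still require a 'filter-media' run.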

- Tim


> 
> Thanks again,
> Sue
> 
> -----Original Message-----
> From: Tim Donohue [mailto:tdono...@illinois.edu] 
> Sent: Wednesday, January 21, 2009 10:54 AM
> To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
> Cc: Diggory Mark; dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W.
> (LARC-B7)[NCI INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI
> INFORMATION SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION
> SYSTEMS]
> Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
> 
> Sue,
> 
> Sorry, we've all been talking across each other a bit.  As you can 
> probably tell, there's really no "correct" answer on how to do this; 
> rather, there's a variety of options to choose from.
> 
> Essentially, you have 3 options that have been laid out by Mark,
> Claudia and myself.  I'm not certain which will be *easiest* off the
> top of my head:
> 
> [Option 1]  Add the *.txt files to the "ORIGINAL" bundle (which is
> where they are added by default).  If they are in the "ORIGINAL" bundle
> you will have to run 'filter-media' to "filter" them into the "TEXT"
> bundle.  Then, you will run 'index-all' to index them for searching (as
> noted, 'index-all' only indexes documents in the "TEXT" bundle).  You
> will also need to modify the UI if you don't want these *.txt files to
> be visible to normal users.
> 
> [Option 2]  Add the *.txt files to the "TEXT" bundle directly.  There
> is no way to do this via the normal DSpace user interfaces.  You can,
> however, do this during the normal command-line bulk item import
> process by specifying a "bundle" name in the 'contents' file.  See the
> DSpace Docs for more information on this:
> http://dspace.svn.sourceforge.net/viewvc/dspace/trunk/dspace/docs/application.html#itemimporter
> 
> [Option 3]  Claudia's suggestion is very similar to Option #1.
> However, as she notes, an easy way to "hide" the *.txt files from the
> UI is to go into the DSpace Administration UI (specifically the
> "Bitstream Format Registry") and mark the *.txt format as "internal".
> This tells DSpace that ALL *.txt files should be considered internal
> files, and should NEVER be displayed in the UI.  So, you'd only want to
> do this if you never want any *.txt files to be displayed in the UI.
> 
> 
> In my opinion (others may have differing opinions), it'd be safer & 
> potentially easier to go with either option #1 or #3.  The danger of 
> option #2 is that the "TEXT" bundle tends to be managed by the 
> 'filter-media' script in DSpace.  As long as you are always aware that
> you manually added files to this bundle, you should be fine.  But, if 
> you ever ran 'filter-media' in "force" mode (with the -f option), 
> there'd be a possibility the 'filter-media' script would overwrite all
> your manually added *.txt files in that bundle.
> 
> Hopefully that gives you a decent lay of the land.  There may be yet 
> other options out there, but at least this gives you a few to work off
> of.
> 
> - Tim
> 
> 
> 
> Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
>> I did the following query against the bundle table and it seems we
>> only have 3 bundle "name"s in the table:  LICENSE, ORIGINAL, & TEXT:
>>
>>   select count(*)
>>        , name
>>     from bundle
>>    group by 2
>>    order by 2
>>
>> All the .txt files we created in our 1000 document test are in the 
>> ORIGINAL bundle, according to NAME in the bundle table.  So if I run 
>> this query and then run index-all, these .txt files should be 
>> searchable, correct?
>>
>>   UPDATE bundle
>>      SET name = 'TEXT'
>>    WHERE bundle_id =
>>         (SELECT bu.bundle_id
>>            FROM bitstream bi
>>               , bundle2bitstream b2b
>>               , bundle    bu
>>           WHERE bi.bitstream_id = b2b.bitstream_id
>>             AND b2b.bundle_id   = bu.bundle_id
>>             AND bundle.bundle_id = bu.bundle_id
>>             AND bu.name = 'ORIGINAL'
>>             AND bi.name LIKE '%.txt')
>>
>>  
>>
>> Let me know what you think.
>>
>> Thanks again,
>>
>> Sue
>>
>>  
>>
>>  
>>
>> -----Original Message-----
>> From: Tim Donohue [mailto:tdono...@illinois.edu]
>> Sent: Tuesday, January 20, 2009 2:12 PM
>> To: Diggory Mark
>> Cc: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]; 
>> dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI 
>> INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION
>> SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
>> Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
>>
>>  
>>
>> Mark,
>>
>>  
>>
>> That's correct: the indexer only indexes files in the TEXT bundle.
>> But that's why I had recommended that Susan first run the
>> 'filter-media' script.  The 'filter-media' script will take text files
>> in the CONTENT bundle (named "ORIGINAL" in DSpace) and essentially copy
>> them over to the TEXT bundle for indexing.
>>
>>  
>>
>> So, you are correct that the *.txt files could be immediately put in
>> the TEXT bundle (which would also avoid them being exposed publicly).
>> But the alternative would be to put the *.txt files in the CONTENT
>> bundle and run 'filter-media' to "filter" them into the TEXT bundle.
>> (However, as you noted, this latter option would require UI alteration
>> to hide the *.txt files, if they shouldn't be accessible.)
>>
>>  
>>
>> - Tim
>>
>>  
>>
>> Diggory Mark wrote:
>>
>>>  Actually...
>>>  Looking at the code of DSIndexer... I'm sure, written by among
>>>  others... myself.  We find that only Bitstreams within the "TEXT"
>>>  bundle are actually indexed into Lucene:
>>>>  for (int i = 0; i < myBundles.length; i++)
>>>>  {
>>>>      if ((myBundles[i].getName() != null)
>>>>              && myBundles[i].getName().equals("TEXT"))
>>>>      {
>>>  I'm thinking this was a short-sightedness, the unhappy consequence
>>>  of which is that your text files will not get indexed if you place
>>>  them into the "CONTENT" bundle.  There are two solutions:
>>>  A.) Put your text bitstreams into the TEXT bundle and not have to
>>>  worry about them being exposed because the TEXT bundle will not be.
>>>  B.) Put your text bitstreams in the CONTENT bundle, alter the UI to
>>>  hide them, and alter DSIndexer to index the CONTENT bundle.
>>>  Mark
>>>  On Jan 16, 2009, at 2:40 PM, Tim Donohue wrote:
>>>> Susan,
>>>> Actually, the setting you'd want to change in your DSpace 1.4.2
>>>> dspace.cfg is this one:
>>>> plugin.sequence.org.dspace.app.mediafilter.MediaFilter = ...
>>>> You'd want to remove the entry for:
>>>> "org.dspace.app.mediafilter.PDFFilter"
>>>> That'd ensure that the PDFFilter is no longer used by filter-media.
>>>> The setting that you referenced below just configures the PDF filter
>>>> to process files which are in "Adobe PDF" format.
>>>> [NOTE:] If you end up upgrading to DSpace 1.5.x, the above
>>>> "plugin.sequence.org.dspace.app.mediafilter.MediaFilter" setting no
>>>> longer exists.  Instead, it was replaced by a simpler
>>>> "filter.plugins" setting.  In that case, for DSpace 1.5.x, you'd
>>>> just remove "PDF Text Extractor" from the list of enabled
>>>> "filter.plugins".  Again, this would ensure that 'filter-media' would
>>>> no longer use the PDF filter.
>>>> Hopefully that all makes sense...Beyond that, as you mentioned,
>>>> you'd just need to hide those '*.txt' files from being displayed.
>>>> - Tim
>>>> Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
>>>>> Hi Tim,
>>>>>     So you're saying that our proposed solution would work as long
>>>>> as we remove (or comment out):
>>>>> *filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF*
>>>>> from dspace.cfg and make the change to not display the .txt files
>>>>> on the Item pages?
>>>>> Then we would still need to run filter-media, which would only be
>>>>> to basically add our .txt files to the TEXT bundle for each Item?
>>>>> By the way, we have been using the 1.5 version of filter-media,
>>>>> with the addition of the two new configuration parameters in
>>>>> dspace.cfg, for a while, even though we are running DSpace 1.4.2.
>>>>> I did this a while back and yes, it has stopped the Java heap space
>>>>> errors from killing filter-media midstream.
>>>>> I do think this new plan is the better way to go for us.  I
>>>>> believe the advantages would be:
>>>>> 1.  No more filter-media running for soooo long - over 24 hours
>>>>> most of the time.
>>>>> 2.  We would identify "problematic" .pdf files (ones that possibly
>>>>> wouldn't filter) prior to importing them into DSpace, instead of
>>>>> after-the-fact.  When these problems are caught at the scanning
>>>>> point, they could be dealt with there and then
>>>>> (rescanning/re-ocr'ing, etc).
>>>>> 3.  Our Users wouldn't have such a big job of identifying the
>>>>> "unfilterable" documents, locating them for rescanning, getting
>>>>> them back to us for re-import, etc etc.
>>>>> 4.  Bottom line would be a more accurate full-text searchable
>>>>> repository.
>>>>> Thanks a bunch for the detailed feedback.  We are processing a
>>>>> 1000 document test with this new procedure and will let you know
>>>>> how it goes!!
>>>>> Sue
>>>>> -----Original Message-----
>>>>> From: Tim Donohue [mailto:tdono...@illinois.edu]
>>>>> Sent: Thursday, January 15, 2009 11:27 AM
>>>>> To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
>>>>> Cc: dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W.
>>>>> (LARC-B7)[NCI INFORMATION SYSTEMS]; Warren, Douglas Lewis
>>>>> (LARC-B7)[NCI INFORMATION SYSTEMS]; Smail, James W.
>>>>> (LARC-B702)[NCI INFORMATION SYSTEMS]
>>>>> Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
>>>>> Sue,
>>>>> There were some improvements to 'filter-media' in DSpace 1.5.x.
>>>>> Primarily, there's the addition of two new PDF-specific settings
>>>>> in the dspace.cfg:
>>>>> pdffilter.largepdfs = true
>>>>> pdffilter.skiponmemoryexception = true
>>>>> The former ensures that all PDF text-extractions are written to
>>>>> temporary files during indexing.  This helps avoid the
>>>>> OutOfMemoryError (heap space) failures that were occasionally
>>>>> caused by larger PDFs being loaded into system memory all at once.
>>>>> The latter attempts to skip over any PDFs which still cause an
>>>>> OutOfMemoryError.  So, if that error still occurs on a PDF, then
>>>>> the PDF is skipped entirely and *not* indexed.  This helps to
>>>>> avoid the entire 'filter-media' script "crashing" when an
>>>>> OutOfMemoryError occurs (which used to happen in 1.4.2).
>>>>> Despite these changes in 1.5.x, there is NO guarantee that *all*
>>>>> of your PDFs will index properly.  As I've mentioned before, the
>>>>> 'filter-media' script uses third-party software (called PDFBox:
>>>>> http://www.pdfbox.org/) for indexing of PDF files.  There are some
>>>>> known bugs in PDFBox that have yet to be fixed, so it does *not*
>>>>> always work for all PDFs.  In some cases, PDFBox will also work
>>>>> inconsistently (and I don't know why that is).  I've run into some
>>>>> inconsistency problems with larger-sized PDFs, which are originally
>>>>> scanned documents with embedded OCR.  Occasionally PDFBox will index
>>>>> them fine, and other times it will cause an OutOfMemoryError
>>>>> (which, with DSpace 1.5, means that 'filter-media' will just skip
>>>>> that PDF).
>>>>> So, I guess the best way to sum this up is that DSpace currently
>>>>> cannot successfully index 100% of all PDFs, since PDFBox cannot do
>>>>> so.  DSpace 1.5 has improvements in helping DSpace to safely handle
>>>>> PDFBox issues (like the OutOfMemoryErrors), but it doesn't
>>>>> necessarily have drastic improvements in indexing capabilities.
>>>>> I answered your other questions inline below...
>>>>> Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
>>>>>> 1.  Has the filter-media/index-all process changed and/or
>>>>>> improved significantly in DSpace 1.5?  If so, we may just shelve
>>>>>> this issue until we've implemented 1.5.
>>>>> See above, obviously...
>>>>>> 2.  In DSpace 1.4.2 (and 1.5), does it matter whether your .txt
>>>>>> files are plain or accessible .txt files?  Can index-all process
>>>>>> either type?
>>>>> For text files, it doesn't really matter...in either case the
>>>>> 'filter-media' script just pulls out the plain text for indexing.
>>>>> I don't believe there'd be any significant difference between the
>>>>> "type" of .txt file.
>>>>> However, it's worth making this clear: for .txt files, you *still*
>>>>> need to run the 'filter-media' script for them to be indexed by
>>>>> 'index-all'.  Essentially, 'index-all' only indexes plain text
>>>>> files in the "TEXT" bundle.  The 'filter-media' script is what adds
>>>>> plain text to the "TEXT" bundle.
>>>>>> 3.  If the process hasn't changed and/or improved significantly
>>>>>> in 1.5, we are considering having our scanning folks just create
>>>>>> the .txt files along with the .pdf files at the time the documents
>>>>>> are scanned.  Then when they send them to us, we would just upload
>>>>>> them in the import process along with the .pdf files for each
>>>>>> Item.  The only thing we'd really have to change in our import
>>>>>> process is the addition of a second file name in the "contents"
>>>>>> file and the addition of the .txt document in the Item's import
>>>>>> directory (right along with the .pdf file).  One other issue is we
>>>>>> might have to make a small modification to DSpace to **not**
>>>>>> display the .txt file on the Item page unless the User is in the
>>>>>> Admin interface, since we wouldn't want our Users clicking
>>>>>> on/opening the .txt files.  If we did this, we could completely
>>>>>> eliminate the filter-media job altogether.  This would ensure that
>>>>>> we did not load any "unfilterable" documents into DSpace.  It
>>>>>> would also eliminate the tedious process of identifying which
>>>>>> documents did not filter successfully, and the whole process of
>>>>>> rescanning and replacing them in DSpace.
>>>>> This sounds like a perfectly reasonable way of doing things,
>>>>> assuming you have the staff time to pre-generate those .txt files.
>>>>> You are correct that you'd no longer need to run 'filter-media' on
>>>>> those PDFs.  But, you'd still need to run 'filter-media' to index
>>>>> those .txt files.  You could do this by modifying the "Media
>>>>> Filter" settings in your dspace.cfg and *removing* the PDFFilter
>>>>> from the list (so 'filter-media' would no longer filter PDFs, but
>>>>> it would work on the other types of content).
>>>>> It would also require some custom coding to hide those .txt files
>>>>> from normal users, but that shouldn't be too horrible.
>>>>> If you did go this route, I'd make sure that you still OCR the
>>>>> PDFs that you put in, as it improves their accessibility overall.
>>>>> Hopefully that all makes sense...definitely let us know if you
>>>>> have further questions.
>>>>> - Tim
>>>>> --
>>>>> Tim Donohue
>>>>> Research Programmer, IDEALS
>>>>> http://www.ideals.uiuc.edu/
>>>>> University of Illinois
>>>>> tdono...@illinois.edu | (217) 333-4648
>>>> --
>>>> Tim Donohue
>>>> Research Programmer, IDEALS
>>>> http://www.ideals.uiuc.edu/
>>>> University of Illinois
>>>> tdono...@illinois.edu | (217) 333-4648
>>>  ~~~~~~~~~~~~~
>>>  Mark R. Diggory
>>>  http://purl.org/net/mdiggory/homepage
>>  
>>
>> -- 
>>
>> Tim Donohue
>>
>> Research Programmer, IDEALS
>>
>> http://www.ideals.uiuc.edu/
>>
>> University of Illinois
>>
>> tdono...@illinois.edu | (217) 333-4648
>>
> 

-- 
Tim Donohue
Research Programmer, IDEALS
http://www.ideals.uiuc.edu/
University of Illinois
tdono...@illinois.edu | (217) 333-4648

------------------------------------------------------------------------------
This SF.net email is sponsored by:
SourceForge Community
SourceForge wants to tell your story.
http://p.sf.net/sfu/sf-spreadtheword
_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
