
Sorry, we've all been talking across each other a bit.  As you can 
probably tell, there's really no "correct" answer on how to do this; 
rather, there's a variety of options to choose from.

Essentially, you have 3 options that have been laid out by Mark, Claudia 
and myself.  I'm not certain which will be *easiest* off the top of my head:

[Option 1]  Add the *.txt files to the "ORIGINAL" bundle (which is where 
they are added by default).  If they are in the "ORIGINAL" bundle you 
will have to run 'filter-media' to "filter" them into the "TEXT" bundle.  
Then, you will run 'index-all' to index them for searching (as noted, 
'index-all' only indexes documents in the "TEXT" bundle).  You will also 
need to modify the UI if you don't want these *.txt files to be visible 
to normal users.
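
To make the Option 1 flow concrete, here is a toy Java model.  It is not 
DSpace code: the class and method names are made up for illustration, and 
plain maps stand in for an item's bundles.  It mirrors the rule Mark quoted 
from DSIndexer (only the "TEXT" bundle is indexed) and treats 
'filter-media' as simply copying *.txt files from "ORIGINAL" into "TEXT":

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only; none of these names are DSpace API.
class TextBundleSketch {

    // Mirrors the DSIndexer rule quoted in this thread:
    // only files in the bundle named "TEXT" are indexed.
    static List<String> indexedFiles(Map<String, List<String>> bundles) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> bundle : bundles.entrySet()) {
            if ("TEXT".equals(bundle.getKey())) {
                result.addAll(bundle.getValue());
            }
        }
        return result;
    }

    // Stand-in for what 'filter-media' does with plain-text files:
    // copy each *.txt file from ORIGINAL into TEXT.
    // Simplification: the copy keeps the same file name here.
    static void filterTextFiles(Map<String, List<String>> bundles) {
        List<String> original = bundles.getOrDefault("ORIGINAL", new ArrayList<>());
        List<String> text = bundles.computeIfAbsent("TEXT", k -> new ArrayList<>());
        for (String name : original) {
            if (name.endsWith(".txt") && !text.contains(name)) {
                text.add(name);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> bundles = new LinkedHashMap<>();
        bundles.put("ORIGINAL", new ArrayList<>(Arrays.asList("report.pdf", "report.txt")));

        System.out.println(indexedFiles(bundles)); // prints [] (nothing in TEXT yet)
        filterTextFiles(bundles);
        System.out.println(indexedFiles(bundles)); // prints [report.txt]
    }
}
```

The point of the model is the ordering: until the filtering step has 
populated "TEXT", the indexing step has nothing to pick up, even though 
the *.txt file is already attached to the item.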

[Option 2]  Add the *.txt files to the "TEXT" bundle directly.  There is 
no way to do this via normal DSpace user interfaces.  You can however do 
this during the normal command-line bulk item import process by 
specifying a "bundle" name in the 'contents' file.  See the DSpace Docs 
for more information on this:
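
A rough sketch of how the 'contents' file for Option 2 could be generated.  
This is only an illustration under one assumption that you should verify 
against the docs for your DSpace version: that a bundle is named by 
appending a tab and "bundle:NAME" to the file name, and that a line with 
no suffix goes to the default "ORIGINAL" bundle.  The class and method 
names are hypothetical helpers, not part of DSpace:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for building an item's 'contents' file.
class ContentsFileSketch {

    // Assumed line format: "<file name>" or "<file name>\tbundle:<BUNDLE>".
    // *.txt files are routed to TEXT; everything else is left to the default.
    static String contentsLine(String fileName) {
        if (fileName.toLowerCase().endsWith(".txt")) {
            return fileName + "\tbundle:TEXT";
        }
        return fileName;
    }

    // Builds one 'contents' line per file, preserving the input order.
    static List<String> contentsLines(List<String> fileNames) {
        List<String> lines = new ArrayList<>();
        for (String fileName : fileNames) {
            lines.add(contentsLine(fileName));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Example item directory holding a PDF and its pre-generated text file.
        for (String line : contentsLines(Arrays.asList("report.pdf", "report.txt"))) {
            System.out.println(line);
        }
    }
}
```

With a PDF and its matching text file in the item directory, this would 
leave the PDF line bare and tag only the text file for the "TEXT" bundle.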

[Option 3]  Claudia's suggestion is very similar to Option #1.  However, 
as she notes, an easy way to "hide" the *.txt files from the UI is to go 
into the DSpace Administration UI (specifically the "Bitstream Format 
Registry") and mark the *.txt format as "internal".  This tells DSpace 
that ALL *.txt files should be considered internal files, and should 
NEVER be displayed in the UI.  So, you'd only want to do this if you 
never want any *.txt files to be displayed from the UI.

In my opinion (others may have differing opinions), it'd be safer & 
potentially easier to go with either option #1 or #3.  The danger of 
option #2 is that the "TEXT" bundle tends to be managed by the 
"filter-media" script in DSpace.  As long as you are always aware that 
you manually added files to this bundle, you should be fine.  But, if 
you ever ran 'filter-media' in "force" mode (with the -f option), 
there'd be a possibility the 'filter-media' script would overwrite all 
your manually added *.txt files in that bundle.

Hopefully that gives you a decent lay of the land.  There may be yet 
other options out there, but at least this gives you a few to work off of.

- Tim

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
> I did the following query against the bundle table and it seems we only 
> have 3 bundle "name"s in the table:  LICENSE, ORIGINAL, & TEXT:
>   select count(*)
>        , name
>     from bundle
>    group by 2
>    order by 2
>
> All the .txt files we created in our 1000 document test are in the 
> ORIGINAL bundle, according to NAME in the bundle table.  So if I run 
> this query and then run index-all, these .txt files should be 
> searchable, correct?
>   UPDATE bundle
>      SET name = 'TEXT'
>    WHERE bundle_id =
>      (SELECT bu.bundle_id
>         FROM bitstream bi
>            , bundle2bitstream b2b
>            , bundle    bu
>        WHERE bi.bitstream_id = b2b.bitstream_id
>          AND b2b.bundle_id   = bu.bundle_id
>          AND bundle.bundle_id = bu.bundle_id
>          AND = 'ORIGINAL'
>          AND LIKE '%.txt')
> Let me know what you think.
> Thanks again,
> Sue
> -----Original Message-----
> From: Tim Donohue []
> Sent: Tuesday, January 20, 2009 2:12 PM
> To: Diggory Mark
> Cc: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]; 
>; Kimbrough, Glenn W. (LARC-B7)[NCI 
> Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
> Mark,
> That's correct, that the indexer only indexes files in the TEXT bundle.
>   But, that's why I had recommended to Susan to first run 'filter-media'
> script.   The 'filter-media' script will take text files in the CONTENT
> bundle and essentially copy them over to the TEXT bundle for indexing.
> So, you are correct that the *.txt files could be immediately put in the
> TEXT bundle (which would also avoid them being exposed publicly).  But,
> the alternative would be to put the *.txt files in the CONTENT bundle
> and run 'filter-media' to "filter" it into the TEXT bundle. (However, as
> you noted, this latter option would require UI alteration to hide the
> *.txt files, if they shouldn't be accessible).
> - Tim
> Diggory Mark wrote:
>>  Actually...
>>  Looking at the code of DSIndexer... I'm sure, written by among others...
>>  myself.  We find that only Bitstreams within the "TEXT" bundle are
>>  actually indexed into Lucene:
>> >  for (int i = 0; i < myBundles.length; i++)
>> >             {
>> >                 if ((myBundles[i].getName() != null)
>> >                         && myBundles[i].getName().equals("TEXT"))
>> >                 {
>>  I'm thinking this was short-sightedness, the unhappy consequence of
>>  which is that your text files will not get indexed if you place them
>>  into the "CONTENT" Bundle.  There are two solutions:
>>  A.) Put your text bitstreams into the TEXT bundle and not have to worry
>>  about them being exposed because the TEXT bundle will not be.
>>  B.) Put your text Bitstreams in the Content Bundle, alter the UI to hide
>>  them, and alter DSIndexer to index the CONTENT bundle.
>>  Mark
>>  On Jan 16, 2009, at 2:40 PM, Tim Donohue wrote:
>> > Susan,
>> > 
>> > Actually, the setting you'd want to change in your DSpace 1.4.2
>> > dspace.cfg is this one:
>> > 
>> > = ...
>> > 
>> > You'd want to remove the entry for:
>> > ""
>> > 
>> > That'd ensure that the PDFFilter is no longer used by filter-media.  The
>> > setting that you referenced below just configures the PDF filter to
>> > process files which are "Adobe PDF" format.
>> > 
>> > [NOTE:] If you end up upgrading to DSpace 1.5.x, the above
>> > "" setting no
>> > longer exists.  Instead, it was replaced by a more simplistic
>> > "filter.plugins" setting.  In that case, for DSpace 1.5.x, you'd just
>> > remove "PDF Text Extractor" from the list of enabled "filter.plugins".
>> > Again, this would ensure that 'filter-media' would no longer use the PDF
>> > filter.
>> > 
>> > Hopefully that all makes sense...Beyond that, as you mentioned, you'd
>> > just need to hide those '*.txt' files from being displayed.
>> > 
>> > - Tim
>> > 
>> > 
>> > 
>> > Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
>> >> Hi Tim,
>> >> 
>> >>     So you're saying that our proposed solution would work as long as
>> >> we remove (or comment out):
>> >> 
>> >> 
>> >> 
>> >> * = Adobe PDF*
>> >> 
>> >> 
>> >> 
>> >> from dspace.cfg and make the change to not display the .txt files on the
>> >> Item pages?
>> >> 
>> >> 
>> >> 
>> >> Then we would still need to run filter-media which would only be to
>> >> basically add our .txt files to the TEXT bundle for each Item?
>> >> 
>> >> 
>> >> 
>> >> By the way, we have been using the 1.5 version of filter-media, with the
>> >> addition of the two new configuration parameters in dspace.cfg, for
>> >> awhile, even though we are running DSpace 1.4.2.  I did this awhile back
>> >> and yes, it has stopped the JAVA heap space errors from killing
>> >> filter-media midstream.
>> >> 
>> >> 
>> >> 
>> >> I do think this new plan is the better way to go for us.  I believe the
>> >> advantages would be:
>> >> 
>> >> 1.  No more filter-media running for soooo long – over 24 hours most of
>> >> the time.
>> >> 
>> >> 2.  We would identify “problematic” .pdf files (ones that possibly
>> >> wouldn’t filter) prior to importing them into DSpace, instead of
>> >> after-the-fact.  When these problems are caught at the scanning point,
>> >> they could be dealt with there and then (rescanning/re-ocr’ing, etc).
>> >> 
>> >> 3.  Our Users wouldn’t have such a big job of identifying the
>> >> “unfilterable” documents, locating them for rescanning, getting them
>> >> back to us for re-import, etc etc.
>> >> 
>> >> 4.  Bottom line would be a more accurate full-text searchable
>> >> repository.
>> >> 
>> >> 
>> >> 
>> >> Thanks a bunch for the detailed feedback.  We are processing a 1000
>> >> document test with this new procedure and will let you know how it
>> >> goes!!
>> >> 
>> >> Sue
>> >> 
>> >> 
>> >> 
>> >> -----Original Message-----
>> >> From: Tim Donohue []
>> >> Sent: Thursday, January 15, 2009 11:27 AM
>> >> To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
>> >> Cc:; Kimbrough, Glenn W. (LARC-B7)[NCI
>> >> Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
>> >> 
>> >> 
>> >> 
>> >> Sue,
>> >> 
>> >> 
>> >> 
>> >> There were some improvements to 'filter-media' in DSpace 1.5.x.
>> >> Primarily, there's the addition of two new PDF-specific settings in the
>> >> dspace.cfg:
>> >> 
>> >> pdffilter.largepdfs = true
>> >> pdffilter.skiponmemoryexception = true
>> >> 
>> >> 
>> >> 
>> >> The former ensures that all PDF text-extractions are written to
>> >> temporary files during indexing.  This helps avoid OutOfMemoryException
>> >> & Heap space errors that were occasionally caused by larger PDFs being
>> >> loaded into system memory all at once.
>> >> 
>> >> The latter attempts to skip over any PDFs which still cause an
>> >> OutOfMemoryException.  So, if that exception still occurs on a PDF, then
>> >> the PDF is skipped entirely and *not* indexed.  This helps to avoid the
>> >> entire 'filter-media' script "crashing" when an OutOfMemoryException
>> >> occurs (which used to happen in 1.4.2).
>> >> 
>> >> 
>> >> 
>> >> Despite these changes in 1.5.x, there is NO guarantee that *all* of your
>> >> PDFs will index properly.  As I've mentioned before, the 'filter-media'
>> >> script uses third-party software (called PDFBox)
>> >> for indexing of PDF files.  There are some known bugs in PDFBox that
>> >> have yet to be fixed, so it does *not* always work for all PDFs.   In
>> >> some cases, PDFBox will also work inconsistently (and I don't know why
>> >> that is).  I've run into some inconsistency problems with larger-sized
>> >> PDFs, which are originally scanned documents with embedded OCR.
>> >> Occasionally PDFBox will index them fine, and other times it will cause
>> >> an OutOfMemoryException (which, with DSpace 1.5, means that
>> >> 'filter-media' will just skip that pdf).
>> >> 
>> >> 
>> >> 
>> >> So, I guess the best way to sum this up is that DSpace currently cannot
>> >> successfully index 100% of all PDFs, since PDFBox cannot do so.  DSpace
>> >> 1.5 has improvements in helping DSpace to safely handle PDFBox issues
>> >> (like the OutOfMemoryExceptions), but it doesn't necessarily have
>> >> drastic improvements in indexing capabilities.
>> >> 
>> >> 
>> >> 
>> >> I answered your other questions inline below...
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
>> >> 
>> >> 
>> >> 
>> >>> 1.  Has the filter-media/index-all process changed
>> >>> and/or improved significantly in DSpace 1.5?  If so, we may just shelve
>> >>> this issue until we’ve implemented 1.5.
>> >> 
>> >> 
>> >> 
>> >> See above, obviously...
>> >> 
>> >> 
>> >> 
>> >>> 2.  In DSpace 1.4.2 (and 1.5), does it matter whether
>> >>> your .txt files are plain or accessible .txt files?  Can index-all
>> >>> process either type?
>> >> 
>> >> 
>> >> 
>> >> For text files, it doesn't really matter.  In either case the
>> >> 'filter-media' script just pulls out the plain text for indexing.  I
>> >> don't believe there'd be any significant difference between the "type"
>> >> of .txt file.
>> >> 
>> >> However, it's worth making this clear: for .txt files, you *still* need
>> >> to run the 'filter-media' script for them to be indexed by 'index-all'.
>> >> Essentially, 'index-all' only indexes plain text files in the "TEXT"
>> >> bundle.  The 'filter-media' script is what adds plain text to the "TEXT"
>> >> bundle.
>> >> 
>> >> 
>> >> 
>> >>> 
>> >> 
>> >>> 
>> >> 
>> >>> 3.  If the process in 1.5 hasn’t changed and/or
>> >>> improved significantly in 1.5, we are considering having our scanning
>> >>> folks just create the .txt files along with the .pdf files at the time
>> >>> the documents are scanned.  Then when they send them to us, we would
>> >>> just upload them in the import process along with the .pdf files for
>> >>> each Item.  The only thing we’d really have to change in our import
>> >>> process is the addition of a second file name in the “contents” file and
>> >>> the addition of the .txt document in the Item’s import directory (right
>> >>> along with the .pdf file).  One other issue is we might have to make a
>> >>> small modification to DSpace to **not** display the .txt file on the
>> >>> Item page unless the User is in the Admin interface since we wouldn’t
>> >>> want our Users clicking on/opening the .txt files.  If we did this, we
>> >>> could completely eliminate the filter-media job altogether.  This would
>> >>> ensure that we did not load any “unfilterable” documents into DSpace.
>> >>> It would also eliminate the tedious process of identifying which
>> >>> documents did not filter successfully, and the whole process of
>> >>> rescanning and replacing them in DSpace.
>> >> 
>> >> 
>> >> 
>> >> This sounds like a perfectly reasonable way of doing things, assuming
>> >> you have the staff time to pre-generate those .txt files.  You are
>> >> correct that you'd no longer need to run 'filter-media' on those PDFs.
>> >> But, you'd still need to run 'filter-media' to index those .txt files.
>> >> You could do this by modifying the "Media Filter" settings in your
>> >> dspace.cfg and *removing* the PDFFilter from the list (so 'filter-media'
>> >> would no longer filter PDFs, but it would work on the other types of
>> >> content).
>> >> 
>> >> It would also require some custom coding to hide those .txt files from
>> >> normal users, but that shouldn't be too horrible.
>> >> 
>> >> If you did go this route, I'd make sure that you still OCR the PDFs that
>> >> you put in, as it improves their accessibility overall.
>> >> 
>> >> 
>> >> 
>> >> Hopefully that all makes sense...definitely let us know if you have
>> >> further questions.
>> >> 
>> >> 
>> >> 
>> >> - Tim
>> >> 
>> >> 
>> >> 
>> >> --
>> >> 
>> >> Tim Donohue
>> >> 
>> >> Research Programmer, IDEALS
>> >> 
>> >>
>> >> 
>> >> University of Illinois
>> >> 
>> >> | (217) 333-4648
>> >> 
>> > 
>> > --
>> > Tim Donohue
>> > Research Programmer, IDEALS
>> >
>> > University of Illinois
>> > | (217) 333-4648
>> > 
>> > 
> ------------------------------------------------------------------------------
>> > 
>> > This email is sponsored by:
>> > SourcForge Community
>> > SourceForge wants to tell your story.
>> >
>> > _______________________________________________
>> > DSpace-tech mailing list
>> >
>> >
>>  ~~~~~~~~~~~~~
>>  Mark R. Diggory
> -- 
> Tim Donohue
> Research Programmer, IDEALS
> University of Illinois
> | (217) 333-4648

Tim Donohue
Research Programmer, IDEALS
University of Illinois | (217) 333-4648
