Sue,

Sorry, we've all been talking across each other a bit. As you can probably tell, there's really no "correct" answer on how to do this; rather, there's a variety of options to choose from.

Essentially, you have 3 options that have been laid out by Mark, Claudia and myself. I'm not certain which will be *easiest* off the top of my head:

[Option 1] Add the *.txt files to the "ORIGINAL" bundle (which is where they are added by default). If they are in the "ORIGINAL" bundle, you will have to run 'filter-media' to "filter" them into the "TEXT" bundle. Then, you will run 'index-all' to index them for searching (as noted, 'index-all' only indexes documents in the "TEXT" bundle). You will also need to modify the UI if you don't want these *.txt files to be visible to normal users.

[Option 2] Add the *.txt files to the "TEXT" bundle directly. There is no way to do this via the normal DSpace user interfaces. You can, however, do this during the normal command-line bulk item import process by specifying a "bundle" name in the 'contents' file. See the DSpace Docs for more information on this:
http://dspace.svn.sourceforge.net/viewvc/dspace/trunk/dspace/docs/application.html#itemimporter

[Option 3] Claudia's suggestion is very similar to Option #1. However, as she notes, an easy way to "hide" the *.txt files from the UI is to go into the DSpace Administration UI (specifically the "Bitstream Format Registry") and mark the *.txt format as "internal". This tells DSpace that ALL *.txt files should be considered internal files, and should NEVER be displayed in the UI. So, you'd only want to do this if you never want any *.txt files to be displayed from the UI.

In my opinion (others may have differing opinions), it'd be safer & potentially easier to go with either option #1 or #3. The danger of option #2 is that the "TEXT" bundle tends to be managed by the 'filter-media' script in DSpace. As long as you are always aware that you manually added files to this bundle, you should be fine. But, if you ever ran 'filter-media' in "force" mode (with the -f option), there'd be a possibility the 'filter-media' script would overwrite all your manually added *.txt files in that bundle.

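For Option 2, the bundle is named per file in the 'contents' file of each item directory. The item importer documentation describes the format as the file name followed by a tab character and `bundle:` plus the bundle name; treat that exact syntax as an assumption to verify against the docs for your DSpace version. Here is a minimal Python sketch that builds such lines, sending *.txt files to TEXT and leaving everything else in the default bundle. The function name `contents_lines` and the example file names are made up for illustration.

```python
def contents_lines(filenames, text_bundle="TEXT"):
    """Build lines for an item importer 'contents' file.

    Assumed format (check the item importer docs): one file name per line,
    optionally followed by a tab and 'bundle:<name>'. Files without a
    bundle entry go to the importer's default bundle.
    """
    lines = []
    for name in filenames:
        if name.lower().endswith(".txt"):
            # Place pre-generated text directly in the TEXT bundle.
            lines.append(f"{name}\tbundle:{text_bundle}")
        else:
            # No bundle given: the importer uses its default bundle.
            lines.append(name)
    return lines


if __name__ == "__main__":
    # Hypothetical item directory holding one scanned PDF and its text.
    for line in contents_lines(["report.pdf", "report.txt"]):
        print(line)
```

Keep the caveat above in mind: files placed in TEXT this way are ones you added by hand, so a forced 'filter-media' run could replace them.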
Hopefully that gives you a decent lay of the land. There may be yet other options out there, but at least this gives you a few to work off of.

- Tim

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
> I did the following query against the bundle table and it seems we only
> have 3 bundle "name"s in the table: LICENSE, ORIGINAL, & TEXT:
>
>   select count(*)
>        , name
>     from bundle
>    group by 2
>    order by 2
>
> All the .txt files we created in our 1000 document test are in the
> ORIGINAL bundle, according to NAME in the bundle table. So if I run
> this query and then run index-all, these .txt files should be
> searchable, correct?
>
>   UPDATE bundle
>      SET name = 'TEXT'
>    WHERE bundle_id =
>          (SELECT bu.bundle_id
>             FROM bitstream bi
>                , bundle2bitstream b2b
>                , bundle bu
>            WHERE bi.bitstream_id = b2b.bitstream_id
>              AND b2b.bundle_id = bu.bundle_id
>              AND bundle.bundle_id = bu.bundle_id
>              AND bu.name = 'ORIGINAL'
>              AND bi.name LIKE '%.txt')
>
> Let me know what you think.
>
> Thanks again,
>
> Sue
>
> -----Original Message-----
> From: Tim Donohue [mailto:tdono...@illinois.edu]
> Sent: Tuesday, January 20, 2009 2:12 PM
> To: Diggory Mark
> Cc: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS];
> dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI
> INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION
> SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
> Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
>
> > Mark,
> >
> > That's correct, that the indexer only indexes files in the TEXT bundle.
> > But, that's why I had recommended to Susan to first run 'filter-media'
> > script. The 'filter-media' script will take text files in the CONTENT
> > bundle and essentially copy them over to the TEXT bundle for indexing.
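
A note on the UPDATE statement quoted above: it changes the name on rows of the bundle table, so every bitstream that shares a bundle with a .txt file moves along with it. The sketch below illustrates this on a toy SQLite database. The three-table schema is a simplified assumption modeled on the table and column names in the quoted query, not the real DSpace schema; the file names are made up; and the UPDATE is rewritten with IN so that it runs under SQLite.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with one item's bundles (toy schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bundle (bundle_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE bitstream (bitstream_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE bundle2bitstream (bundle_id INTEGER, bitstream_id INTEGER);
        INSERT INTO bundle VALUES (1, 'ORIGINAL'), (2, 'LICENSE');
        INSERT INTO bitstream VALUES
            (10, 'report.pdf'), (11, 'report.txt'), (12, 'license');
        -- The PDF and its text file share the same ORIGINAL bundle.
        INSERT INTO bundle2bitstream VALUES (1, 10), (1, 11), (2, 12);
    """)
    return conn


def rename_bundles_with_txt(conn):
    """Rename every ORIGINAL bundle that contains at least one *.txt file."""
    conn.execute("""
        UPDATE bundle
           SET name = 'TEXT'
         WHERE name = 'ORIGINAL'
           AND bundle_id IN (
               SELECT b2b.bundle_id
                 FROM bundle2bitstream b2b
                 JOIN bitstream bi ON bi.bitstream_id = b2b.bitstream_id
                WHERE bi.name LIKE '%.txt')
    """)


def bitstreams_by_bundle(conn):
    """Return {bundle name: [bitstream names]} for inspection."""
    rows = conn.execute("""
        SELECT bu.name, bi.name
          FROM bundle bu
          JOIN bundle2bitstream b2b ON b2b.bundle_id = bu.bundle_id
          JOIN bitstream bi ON bi.bitstream_id = b2b.bitstream_id
         ORDER BY bu.name, bi.name
    """).fetchall()
    result = {}
    for bundle_name, bitstream_name in rows:
        result.setdefault(bundle_name, []).append(bitstream_name)
    return result
```

In this toy data, report.pdf ends up under TEXT along with report.txt, and no ORIGINAL bundle is left. If the .pdf and .txt files of an Item share one ORIGINAL bundle, that is probably not the intended result.
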
> >
> > So, you are correct that the *.txt files could be immediately put in the
> > TEXT bundle (which would also avoid them being exposed publicly). But,
> > the alternative would be to put the *.txt files in the CONTENT bundle
> > and run 'filter-media' to "filter" it into the TEXT bundle. (However, as
> > you noted, this latter option would require UI alteration to hide the
> > *.txt files, if they shouldn't be accessible).
> >
> > - Tim
> >
> > Diggory Mark wrote:
> >> Actually...
> >>
> >> Looking at the code of DSIndexer... I'm sure, written by among others...
> >> myself. We find that only Bitstreams within the "TEXT" bundle are
> >> actually indexed into Lucene:
> >>
> >> > for (int i = 0; i < myBundles.length; i++)
> >> > {
> >> >     if ((myBundles[i].getName() != null)
> >> >             && myBundles[i].getName().equals("TEXT"))
> >> >     {
> >>
> >> I'm thinking this was a short-sightedness, but the unhappy consequence
> >> of which is that your text files will not get indexed if you place them
> >> into the "CONTENT" Bundle. There are two solutions
> >>
> >> A.) Put your text bitstreams into the TEXT bundle and not have to worry
> >> about them being exposed because the TEXT bundle will not be.
> >>
> >> B.) Put your text Bitstreams in the Content Bundle, alter the UI to hide
> >> them, and alter DSIndexer to index the CONTENT bundle.
> >>
> >> Mark
> >>
> >> On Jan 16, 2009, at 2:40 PM, Tim Donohue wrote:
> >>
> >> > Susan,
> >> >
> >> > Actually, the setting you'd want to change in your DSpace 1.4.2
> >> > dspace.cfg is this one:
> >> >
> >> > plugin.sequence.org.dspace.app.mediafilter.MediaFilter = ...
> >> >
> >> > You'd want to remove the entry for:
> >> > "org.dspace.app.mediafilter.PDFFilter"
> >> >
> >> > That'd ensure that the PDFFilter is no longer used by filter-media. The
> >> > setting that you referenced below just configures the PDF filter to
> >> > process files which are "Adobe PDF" format.
> >> >
> >> > [NOTE:] If you end up upgrading to DSpace 1.5.x, the above
> >> > "plugin.sequence.org.dspace.app.mediafilter.MediaFilter" setting no
> >> > longer exists. Instead, it was replaced by a more simplistic
> >> > "filter.plugins" setting. In that case, for DSpace 1.5.x, you'd just
> >> > remove "PDF Text Extractor" from the list of enabled "filter.plugins".
> >> > Again, this would ensure that 'filter-media' would no longer use the PDF
> >> > filter.
> >> >
> >> > Hopefully that all makes sense...Beyond that, as you mentioned, you'd
> >> > just need to hide those '*.txt' files from being displayed.
> >> >
> >> > - Tim
> >> >
> >> > Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
> >> >> Hi Tim,
> >> >>
> >> >> So you're saying that our proposed solution would work as long as
> >> >> we remove (or comment out):
> >> >>
> >> >> filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF
> >> >>
> >> >> from dspace.cfg and make the change to not display the .txt files on the
> >> >> Item pages?
> >> >>
> >> >> Then we would still need to run filter-media which would only be to
> >> >> basically add our .txt files to the TEXT bundle for each Item?
> >> >>
> >> >> By the way, we have been using the 1.5 version of filter-media, with the
> >> >> addition of the two new configuration parameters in dspace.cfg, for
> >> >> awhile, even though we are running DSpace 1.4.2. I did this awhile back
> >> >> and yes, it has stopped the JAVA heap space errors from killing
> >> >> filter-media midstream.
> >> >>
> >> >> I do think this new plan is the better way to go for us. I believe the
> >> >> advantages would be:
> >> >>
> >> >> 1. No more filter-media running for soooo long – over 24 hours most of
> >> >> the time.
> >> >>
> >> >> 2. We would identify “problematic” .pdf files (ones that possibly
> >> >> wouldn’t filter) prior to importing them into DSpace, instead of
> >> >> after-the-fact. When these problems are caught at the scanning point,
> >> >> they could be dealt with there and then (rescanning/re-ocr’ing, etc).
> >> >>
> >> >> 3. Our Users wouldn’t have such a big job of identifying the
> >> >> “unfilterable” documents, locating them for rescanning, getting them
> >> >> back to us for re-import, etc etc.
> >> >>
> >> >> 4. Bottom line would be a more accurate full-text searchable
> >> >> repository.
> >> >>
> >> >> Thanks a bunch for the detailed feedback. We are processing a 1000
> >> >> document test with this new procedure and will let you know how it
> >> >> goes!!
> >> >>
> >> >> Sue
> >> >>
> >> >> -----Original Message-----
> >> >> From: Tim Donohue [mailto:tdono...@illinois.edu]
> >> >> Sent: Thursday, January 15, 2009 11:27 AM
> >> >> To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
> >> >> Cc: dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI
> >> >> INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION
> >> >> SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
> >> >> Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
> >> >>
> >> >> Sue,
> >> >>
> >> >> There were some improvements to 'filter-media' in DSpace 1.5.x.
> >> >> Primarily, there's the addition of two new PDF-specific settings in the
> >> >> dspace.cfg:
> >> >>
> >> >> pdffilter.largepdfs = true
> >> >> pdffilter.skiponmemoryexception = true
> >> >>
> >> >> The former ensures that all PDF text-extractions are written to
> >> >> temporary files during indexing. This helps avoid OutOfMemoryException
> >> >> & Heap space errors that were occasionally caused by larger PDFs being
> >> >> loaded into system memory all at once.
> >> >>
> >> >> The latter attempts to skip over any PDFs which still cause an
> >> >> OutOfMemoryException. So, if that exception still occurs on a PDF, then
> >> >> the PDF is skipped entirely and *not* indexed. This helps to avoid the
> >> >> entire 'filter-media' script "crashing" when an OutOfMemoryException
> >> >> occurs (which used to happen in 1.4.2).
> >> >>
> >> >> Despite these changes in 1.5.x, there is NO guarantee that *all* of your
> >> >> PDFs will index properly. As I've mentioned before, the 'filter-media'
> >> >> script uses third-party software (called PDFBox: http://www.pdfbox.org/)
> >> >> for indexing of PDF files. There are some known bugs in PDFBox that
> >> >> have yet to be fixed, so it does *not* always work for all PDFs. In
> >> >> some cases, PDFBox will also work inconsistently (and I don't know why
> >> >> that is). I've run into some inconsistency problems with larger-sized
> >> >> PDFs, which are originally scanned documents with embedded OCR.
> >> >> Occasionally PDFBox will index them fine, and other times it will cause
> >> >> an OutOfMemoryException (which, with DSpace 1.5 means that
> >> >> 'filter-media' will just skip that pdf).
> >> >>
> >> >> So, I guess the best way to sum this up is that DSpace currently cannot
> >> >> successfully index 100% of all PDFs, since PDFBox cannot do so. DSpace
> >> >> 1.5 has improvements in helping DSpace to safely handle PDFBox issues
> >> >> (like the OutOfMemoryExceptions), but it doesn't necessarily have
> >> >> drastic improvements in indexing capabilities.
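
The behaviour described for pdffilter.skiponmemoryexception can be pictured as a loop that catches the memory failure per file rather than letting it end the run. A minimal Python sketch of that idea follows. It is not DSpace's implementation; `extract_all`, `extract_text`, and the `skip_on_memory_error` flag are hypothetical names used only for illustration.

```python
def extract_all(documents, extract_text, skip_on_memory_error=True):
    """Run extract_text over (name, data) pairs.

    Returns (extracted, skipped): a dict of name -> text, and the list of
    names whose extraction ran out of memory and were skipped.
    """
    extracted = {}
    skipped = []
    for name, data in documents:
        try:
            extracted[name] = extract_text(data)
        except MemoryError:
            if not skip_on_memory_error:
                # Without the setting, one bad file ends the whole run.
                raise
            # With the setting, note the file and carry on with the rest.
            skipped.append(name)
    return extracted, skipped
```

The skipped list matters: those files are left unindexed, so they still have to be found and dealt with separately.
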
> >> >>
> >> >> I answered your other questions inline below...
> >> >>
> >> >> Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
> >> >>
> >> >>> 1. Has the filter-media/index-all process changed
> >> >>> and/or improved significantly in DSpace 1.5? If so, we may just shelve
> >> >>> this issue until we’ve implemented 1.5.
> >> >>
> >> >> See above, obviously...
> >> >>
> >> >>> 2. In DSpace 1.4.2 (and 1.5), does it matter whether
> >> >>> your .txt files are plain or accessible .txt files? Can index-all
> >> >>> process either type?
> >> >>
> >> >> For text files, it doesn't really matter...in either case the
> >> >> 'filter-media' script just pulls out the plain text for indexing. I
> >> >> don't believe there'd be any significant difference between the "type"
> >> >> of .txt file.
> >> >>
> >> >> However, it's worth making this clear: for .txt files, you *still* need
> >> >> to run the 'filter-media' script for them to be indexed by 'index-all'.
> >> >> Essentially, 'index-all' only indexes plain text files in the "TEXT"
> >> >> bundle. The 'filter-media' script is what adds plain text to the "TEXT"
> >> >> bundle.
> >> >>
> >> >>> 3. If the process in 1.5 hasn’t changed and/or
> >> >>> improved significantly in 1.5, we are considering having our scanning
> >> >>> folks just create the .txt files along with the .pdf files at the time
> >> >>> the documents are scanned. Then when they send them to us, we would
> >> >>> just upload them in the import process along with the .pdf files for
> >> >>> each Item. The only thing we’d really have to change in our import
> >> >>> process is the addition of a second file name in the “contents” file and
> >> >>> the addition of the .txt document in the Item’s import directory (right
> >> >>> along with the .pdf file). One other issue is we might have to make a
> >> >>> small modification to DSpace to *not* display the .txt file on the
> >> >>> Item page unless the User is in the Admin interface since we wouldn’t
> >> >>> want our Users clicking on/opening the .txt files. If we did this, we
> >> >>> could completely eliminate the filter-media job altogether. This would
> >> >>> ensure that we did not load any “unfilterable” documents into DSpace.
> >> >>> It would also eliminate the tedious process of identifying which
> >> >>> documents did not filter successfully, and the whole process of
> >> >>> rescanning and replacing them in DSpace.
> >> >>
> >> >> This sounds like a perfectly reasonable way of doing things, assuming
> >> >> you have the staff time to pre-generate those .txt files. You are
> >> >> correct that you'd no longer need to run 'filter-media' on those PDFs.
> >> >> But, you'd still need to run 'filter-media' to index those .txt files.
> >> >> You could do this by modifying the "Media Filter" settings in your
> >> >> dspace.cfg and *removing* the PDFFilter from the list (so 'filter-media'
> >> >> would no longer filter PDFs, but it would work on the other types of
> >> >> content).
> >> >>
> >> >> It would also require some custom coding to hide those .txt files from
> >> >> normal users, but that shouldn't be too horrible.
> >> >>
> >> >> If you did go this route, I'd make sure that you still OCR the PDFs that
> >> >> you put in, as it improves their accessibility overall.
> >> >>
> >> >> Hopefully that all makes sense...definitely let us know if you have
> >> >> further questions.
> >> >>
> >> >> - Tim
> >> >>
> >> >> --
> >> >> Tim Donohue
> >> >> Research Programmer, IDEALS
> >> >> http://www.ideals.uiuc.edu/
> >> >> University of Illinois
> >> >> tdono...@illinois.edu | (217) 333-4648
> >> >
> >> > _______________________________________________
> >> > DSpace-tech mailing list
> >> > DSpace-tech@lists.sourceforge.net
> >> > https://lists.sourceforge.net/lists/listinfo/dspace-tech
> >>
> >> ~~~~~~~~~~~~~
> >> Mark R. Diggory
> >> http://purl.org/net/mdiggory/homepage

--
Tim Donohue
Research Programmer, IDEALS
http://www.ideals.uiuc.edu/
University of Illinois
tdono...@illinois.edu | (217) 333-4648