Thanks for all your help Tim! I think this will help us out a lot! Best, Sue
-----Original Message-----
From: Tim Donohue [mailto:tdono...@illinois.edu]
Sent: Wednesday, January 21, 2009 4:54 PM
To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
Cc: Diggory Mark; dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

Sue,

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
> Hi Tim,
> Thanks to all for the suggestions. Basically I am trying to
> prevent filter-media from attempting to filter our .pdf files and I want
> index-all to index only our .txt files.
>
> So if I remove the pdffilter parameters from dspace.cfg and I have
> all our .txt files in the TEXT bundle (using one of the 3 options you
> outlined), this should work and we shouldn't have to run filter-media at
> all, right?

That's almost correct, except for the last part of your statement.
You'll notice that in the options I laid out below, Options #1 and #3
specifically state you STILL need to run 'filter-media'. This is because
in both those options you are starting with the *.txt files in the
ORIGINAL bundle, and they need to be copied to the TEXT bundle before
they can be indexed.

Is this starting to make some sense? Filter-media is what does
extraction of full text (from PDF, HTML, Word or Plain text formats) and
generates a corresponding *.txt file in the TEXT bundle containing the
extracted full text. Since the 'index-all' script will ONLY index *.txt
from the TEXT bundle, you will always need to run 'filter-media' first
unless you've manually added *.txt to the TEXT bundle.

- Tim

>
> Thanks again,
> Sue
>
> -----Original Message-----
> From: Tim Donohue [mailto:tdono...@illinois.edu]
> Sent: Wednesday, January 21, 2009 10:54 AM
> To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
> Cc: Diggory Mark; dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W.
> (LARC-B7)[NCI INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI
> INFORMATION SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION
> SYSTEMS]
> Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
>
> Sue,
>
> Sorry, we've all been talking across each other a bit. As you can
> probably tell, there's really no "correct" answer on how to do this,
> rather there's a variety of options to choose from.
>
> Essentially, you have 3 options that have been laid out by Mark, Claudia
> and myself. I'm not certain which will be *easiest* off the top of my
> head:
>
> [Option 1] Add the *.txt files to the "ORIGINAL" bundle (which is where
> they are added by default). If they are in the "ORIGINAL" bundle you
> will have to run 'filter-media' to "filter" them into the "TEXT" bundle.
> Then, you will run 'index-all' to index them for searching (as noted
> 'index-all' only indexes documents in the "TEXT" bundle). You will also
> need to modify the UI if you don't want these *.txt files to be visible
> to normal users.
>
> [Option 2] Add the *.txt files to the "TEXT" bundle directly. There is
> no way to do this via normal DSpace user interfaces. You can however do
> this during the normal command-line bulk item import process by
> specifying a "bundle" name in the 'contents' file. See the DSpace Docs
> for more information on this:
> http://dspace.svn.sourceforge.net/viewvc/dspace/trunk/dspace/docs/application.html#itemimporter
>
> [Option 3] Claudia's suggestion is very similar to Option #1. However,
> as she notes an easy way to "hide" the *.txt files from the UI is to go
> into the DSpace Administration UI (specifically the "Bitstream Format
> Registry" and mark the *.txt format as "internal"). This tells DSpace
> that ALL *.txt files should be considered internal files, and should
> NEVER be displayed in the UI. So, you'd only want to do this if you
> never want any *.txt files to be displayed from the UI.
>
> In my opinion (others may have differing opinions), it'd be safer &
> potentially easier to go with either option #1 or #3. The danger of
> option #2 is that the "TEXT" bundle tends to be managed by the
> "filter-media" script in DSpace. As long as you are always aware that
> you manually added files to this bundle, you should be fine. But, if
> you ever ran 'filter-media' in "force" mode (with the -f option),
> there'd be a possibility the 'filter-media' script would overwrite all
> your manually added *.txt files in that bundle.
>
> Hopefully that gives you a decent lay of the land. There may be yet
> other options out there, but at least this gives you a few to work off
> of.
>
> - Tim
>
> Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
>> I did the following query against the bundle table and it seems we only
>> have 3 bundle "name"s in the table: LICENSE, ORIGINAL, & TEXT:
>>
>>   select count(*)
>>        , name
>>     from bundle
>>    group by 2
>>    order by 2
>>
>> All the .txt files we created in our 1000 document test are in the
>> ORIGINAL bundle, according to NAME in the bundle table. So if I run
>> this query and then run index-all, these .txt files should be
>> searchable, correct?
>>
>>   UPDATE bundle
>>      SET name = 'TEXT'
>>    WHERE bundle_id =
>>          (SELECT bu.bundle_id
>>             FROM bitstream bi
>>                , bundle2bitstream b2b
>>                , bundle bu
>>            WHERE bi.bitstream_id = b2b.bitstream_id
>>              AND b2b.bundle_id = bu.bundle_id
>>              AND bundle.bundle_id = bu.bundle_id
>>              AND bu.name = 'ORIGINAL'
>>              AND bi.name LIKE '%.txt')
>>
>> Let me know what you think.
>>
>> Thanks again,
>>
>> Sue
>>
>> -----Original Message-----
>> From: Tim Donohue [mailto:tdono...@illinois.edu]
>> Sent: Tuesday, January 20, 2009 2:12 PM
>> To: Diggory Mark
>> Cc: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS];
>> dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI
>> INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION
>> SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
>> Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
>>
>> Mark,
>>
>> That's correct, that the indexer only indexes files in the TEXT bundle.
>> But, that's why I had recommended to Susan to first run 'filter-media'
>> script. The 'filter-media' script will take text files in the CONTENT
>> bundle and essentially copy them over to the TEXT bundle for indexing.
>>
>> So, you are correct that the *.txt files could be immediately put in the
>> TEXT bundle (which would also avoid them being exposed publicly). But,
>> the alternative would be to put the *.txt files in the CONTENT bundle
>> and run 'filter-media' to "filter" it into the TEXT bundle. (However, as
>> you noted, this latter option would require UI alteration to hide the
>> *.txt files, if they shouldn't be accessible).
>>
>> - Tim
>>
>> Diggory Mark wrote:
>>> Actually...
>>> Looking at the code of DSIndexer... I'm sure, written by among others...
>>> myself. We find that only Bitstreams within the "TEXT" bundle are
>>> actually indexed into Lucene:
>>>> for (int i = 0; i < myBundles.length; i++)
>>>> {
>>>>     if ((myBundles[i].getName() != null)
>>>>             && myBundles[i].getName().equals("TEXT"))
>>>>     {
>>> I'm thinking this was a short-sightedness, but the unhappy consequence
>>> of which is that your text files will not get indexed if you place them
>>> into the "CONTENT" Bundle. There are two solutions
>>> A.) Put your text bitstreams into the TEXT bundle and not have to worry
>>> about them being exposed because the TEXT bundle will not be.
>>> B.) Put your text Bitstreams in the Content Bundle, alter the UI to hide
>>> them, and alter DSIndexer to index the CONTENT bundle.
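
[Editorial sketch] The relationship described above (only the TEXT bundle is indexed, and 'filter-media' is what populates it from the ORIGINAL bundle) can be illustrated with a small toy model. This is not DSpace code: the `Item` class and the `filter_media` and `index_all` functions are hypothetical stand-ins, and the real scripts' extraction and file-naming details are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Toy stand-in for a DSpace item: bundle name -> list of file names."""
    bundles: dict = field(default_factory=dict)


def filter_media(item):
    """Copy each *.txt file from ORIGINAL into TEXT (plain-text case only)."""
    text = item.bundles.setdefault("TEXT", [])
    for name in item.bundles.get("ORIGINAL", []):
        if name.endswith(".txt") and name not in text:
            text.append(name)


def index_all(item):
    """Return the files that would be indexed: only those in TEXT."""
    return list(item.bundles.get("TEXT", []))


# .txt uploaded alongside the PDF in ORIGINAL: nothing is indexed
# until the filtering step has copied it into TEXT.
item = Item({"ORIGINAL": ["report.pdf", "report.txt"]})
before = index_all(item)
filter_media(item)
after = index_all(item)

# .txt placed directly in TEXT at import time: indexed without filtering.
direct = Item({"ORIGINAL": ["report.pdf"], "TEXT": ["report.txt"]})
```

In this model `before` is empty and `after` contains `report.txt`, which mirrors why running 'index-all' alone is not enough when the .txt files start out in the ORIGINAL bundle.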
>>> Mark
>>> On Jan 16, 2009, at 2:40 PM, Tim Donohue wrote:
>>>> Susan,
>>>> Actually, the setting you'd want to change in your DSpace 1.4.2
>>>> dspace.cfg is this one:
>>>> plugin.sequence.org.dspace.app.mediafilter.MediaFilter = ...
>>>> You'd want to remove the entry for:
>>>> "org.dspace.app.mediafilter.PDFFilter"
>>>> That'd ensure that the PDFFilter is no longer used by filter-media. The
>>>> setting that you referenced below just configures the PDF filter to
>>>> process files which are "Adobe PDF" format.
>>>> [NOTE:] If you end up upgrading to DSpace 1.5.x, the above
>>>> "plugin.sequence.org.dspace.app.mediafilter.MediaFilter" setting no
>>>> longer exists. Instead, it was replaced by a more simplistic
>>>> "filter.plugins" setting. In that case, for DSpace 1.5.x, you'd just
>>>> remove "PDF Text Extractor" from the list of enabled "filter.plugins".
>>>> Again, this would ensure that 'filter-media' would no longer use the PDF
>>>> filter.
>>>> Hopefully that all makes sense...Beyond that, as you mentioned, you'd
>>>> just need to hide those '*.txt' files from being displayed.
>>>> - Tim
>>>> Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
>>>>> Hi Tim,
>>>>> So you're saying that our proposed solution would work as long as
>>>>> we remove (or comment out):
>>>>> filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF
>>>>> from dspace.cfg and make the change to not display the .txt files on the
>>>>> Item pages?
>>>>> Then we would still need to run filter-media which would only be to
>>>>> basically add our .txt files to the TEXT bundle for each Item?
>>>>> By the way, we have been using the 1.5 version of filter-media, with the
>>>>> addition of the two new configuration parameters in dspace.cfg, for
>>>>> awhile, even though we are running DSpace 1.4.2. I did this awhile back
>>>>> and yes, it has stopped the JAVA heap space errors from killing
>>>>> filter-media midstream.
>>>>> I do think this new plan is the better way to go for us. I believe the
>>>>> advantages would be:
>>>>> 1. No more filter-media running for soooo long - over 24 hours most of
>>>>> the time.
>>>>> 2. We would identify "problematic" .pdf files (ones that possibly
>>>>> wouldn't filter) prior to importing them into DSpace, instead of
>>>>> after-the-fact. When these problems are caught at the scanning point,
>>>>> they could be dealt with there and then (rescanning/re-ocr'ing, etc).
>>>>> 3. Our Users wouldn't have such a big job of identifying the
>>>>> "unfilterable" documents, locating them for rescanning, getting them
>>>>> back to us for re-import, etc etc.
>>>>> 4. Bottom line would be a more accurate full-text searchable
>>>>> repository.
>>>>> Thanks a bunch for the detailed feedback. We are processing a 1000
>>>>> document test with this new procedure and will let you know how it
>>>>> goes!!
>>>>> Sue
>>>>> -----Original Message-----
>>>>> From: Tim Donohue [mailto:tdono...@illinois.edu]
>>>>> Sent: Thursday, January 15, 2009 11:27 AM
>>>>> To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
>>>>> Cc: dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI
>>>>> INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION
>>>>> SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
>>>>> Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
>>>>> Sue,
>>>>> There were some improvements to 'filter-media' in DSpace 1.5.x.
>>>>> Primarily, there's the addition of two new PDF-specific settings in the
>>>>> dspace.cfg:
>>>>> pdffilter.largepdfs = true
>>>>> pdffilter.skiponmemoryexception = true
>>>>> The former ensures that all PDF text-extractions are written to
>>>>> temporary files during indexing. This helps avoid OutOfMemoryException
>>>>> & Heap space errors that were occasionally caused by larger PDFs being
>>>>> loaded into system memory all at once.
>>>>> The latter attempts to skip over any PDFs which still cause an
>>>>> OutOfMemoryException. So, if that exception still occurs on a PDF, then
>>>>> the PDF is skipped entirely and *not* indexed. This helps to avoid the
>>>>> entire 'filter-media' script "crashing" when an OutOfMemoryException
>>>>> occurs (which used to happen in 1.4.2).
>>>>> Despite these changes in 1.5.x, there is NO guarantee that *all* of your
>>>>> PDFs will index properly. As I've mentioned before, the 'filter-media'
>>>>> script uses third-party software (called PDFBox: http://www.pdfbox.org/)
>>>>> for indexing of PDF files. There are some known bugs in PDFBox that
>>>>> have yet to be fixed, so it does *not* always work for all PDFs. In
>>>>> some cases, PDFBox will also work inconsistently (and I don't know why
>>>>> that is). I've run into some inconsistency problems with larger-sized
>>>>> PDFs, which are originally scanned documents with embedded OCR.
>>>>> Occasionally PDFBox will index them fine, and other times it will cause
>>>>> an OutOfMemoryException (which, with DSpace 1.5 means that
>>>>> 'filter-media' will just skip that pdf).
>>>>> So, I guess the best way to sum this up is that DSpace currently cannot
>>>>> successfully index 100% of all PDFs, since PDFBox cannot do so. DSpace
>>>>> 1.5 has improvements in helping DSpace to safely handle PDFBox issues
>>>>> (like the OutOfMemoryExceptions), but it doesn't necessarily have
>>>>> drastic improvements in indexing capabilities.
>>>>> I answered your other questions inline below...
>>>>> Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
>>>>>> 1. Has the filter-media/index-all process changed
>>>>>> and/or improved significantly in DSpace 1.5? If so, we may just shelve
>>>>>> this issue until we've implemented 1.5.
>>>>> See above, obviously...
>>>>>> 2. In DSpace 1.4.2 (and 1.5), does it matter whether
>>>>>> your .txt files are plain or accessible .txt files? Can index-all
>>>>>> process either type?
>>>>> For text files, it doesn't really matter...in either case the
>>>>> 'filter-media' script just pulls out the plain text for indexing. I
>>>>> don't believe there'd be any significant difference between the "type"
>>>>> of .txt file.
>>>>> However, it's worth making this clear: for .txt files, you *still* need
>>>>> to run the 'filter-media' script for them to be indexed by 'index-all'.
>>>>> Essentially, 'index-all' only indexes plain text files in the "TEXT"
>>>>> bundle. The 'filter-media' script is what adds plain text to the "TEXT"
>>>>> bundle.
>>>>>> 3. If the process in 1.5 hasn't changed and/or
>>>>>> improved significantly in 1.5, we are considering having our scanning
>>>>>> folks just create the .txt files along with the .pdf files at the time
>>>>>> the documents are scanned. Then when they send them to us, we would
>>>>>> just upload them in the import process along with the .pdf files for
>>>>>> each Item. The only thing we'd really have to change in our import
>>>>>> process is the addition of a second file name in the "contents" file
>>>>>> and the addition of the .txt document in the Item's import directory
>>>>>> (right along with the .pdf file). One other issue is we might have to
>>>>>> make a small modification to DSpace to **not** display the .txt file on
>>>>>> the Item page unless the User is in the Admin interface since we
>>>>>> wouldn't want our Users clicking on/opening the .txt files. If we did
>>>>>> this, we could completely eliminate the filter-media job altogether.
>>>>>> This would ensure that we did not load any "unfilterable" documents
>>>>>> into DSpace. It would also eliminate the tedious process of identifying
>>>>>> which documents did not filter successfully, and the whole process of
>>>>>> rescanning and replacing them in DSpace.
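
[Editorial sketch] The import change described in question 3 amounts to listing a second file name in each item's 'contents' file. The helper below is a hypothetical illustration of that, not part of DSpace. The optional tab-separated `bundle:TEXT` suffix is an assumption about the item importer's 'contents' syntax and should be checked against the item importer documentation for the DSpace version in use.

```python
def contents_lines(file_names, txt_to_text_bundle=False):
    """Build the lines of an item's 'contents' file, one file name per line.

    When txt_to_text_bundle is True, each *.txt entry gets a tab-separated
    "bundle:TEXT" suffix (assumed importer syntax, verify before relying on it).
    """
    lines = []
    for name in file_names:
        if txt_to_text_bundle and name.endswith(".txt"):
            lines.append(f"{name}\tbundle:TEXT")
        else:
            lines.append(name)
    return lines


# Both files land in the default bundle; 'filter-media' is still needed.
default_bundle = contents_lines(["report.pdf", "report.txt"])

# The .txt file is routed to the TEXT bundle at import time.
text_bundle = contents_lines(["report.pdf", "report.txt"], txt_to_text_bundle=True)
```

The first form matches the plan as written in question 3. The second form corresponds to adding the .txt files to the "TEXT" bundle directly at import time.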
>>>>> This sounds like a perfectly reasonable way of doing things, assuming
>>>>> you have the staff time to pre-generate those .txt files. You are
>>>>> correct that you'd no longer need to run 'filter-media' on those PDFs.
>>>>> But, you'd still need to run 'filter-media' to index those .txt files.
>>>>> You could do this by modifying the "Media Filter" settings in your
>>>>> dspace.cfg and *removing* the PDFFilter from the list (so 'filter-media'
>>>>> would no longer filter PDFs, but it would work on the other types of
>>>>> content).
>>>>> It would also require some custom coding to hide those .txt files from
>>>>> normal users, but that shouldn't be too horrible.
>>>>> If you did go this route, I'd make sure that you still OCR the PDFs that
>>>>> you put in, as it improves their accessibility overall.
>>>>> Hopefully that all makes sense...definitely let us know if you have
>>>>> further questions.
>>>>> - Tim
>>>>> --
>>>>> Tim Donohue
>>>>> Research Programmer, IDEALS
>>>>> http://www.ideals.uiuc.edu/
>>>>> University of Illinois
>>>>> tdono...@illinois.edu | (217) 333-4648
>>>> ------------------------------------------------------------------------------
>>>> This SF.net email is sponsored by:
>>>> SourcForge Community
>>>> SourceForge wants to tell your story.
>>>> http://p.sf.net/sfu/sf-spreadtheword
>>>> _______________________________________________
>>>> DSpace-tech mailing list
>>>> DSpace-tech@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>>> ~~~~~~~~~~~~~
>>> Mark R. Diggory
>>> http://purl.org/net/mdiggory/homepage
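
[Editorial sketch] The dspace.cfg change discussed in this thread is the removal of one entry from a comma-separated list: "org.dspace.app.mediafilter.PDFFilter" in the 1.4.2 setting, or "PDF Text Extractor" in the 1.5.x "filter.plugins" setting. The helper below is a hypothetical illustration of that edit on a single-line value. The other list entries used in the example are placeholders, not real DSpace defaults, and line continuations in a real dspace.cfg are not handled.

```python
def remove_plugin(value, plugin):
    """Drop one entry from a comma-separated plugin list, keeping the order."""
    entries = [entry.strip() for entry in value.split(",") if entry.strip()]
    return ", ".join(entry for entry in entries if entry != plugin)


# 1.4.2 style: class names (the second entry is a made-up placeholder).
old_style = remove_plugin(
    "org.dspace.app.mediafilter.PDFFilter, org.example.OtherFilter",
    "org.dspace.app.mediafilter.PDFFilter",
)

# 1.5.x style: plugin names (the second entry is a made-up placeholder).
new_style = remove_plugin(
    "PDF Text Extractor, Some Other Filter",
    "PDF Text Extractor",
)
```

With the PDF entry removed, 'filter-media' keeps running whatever other filters remain in the list, which is the behavior Tim describes for handling the .txt files without re-filtering the PDFs.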