Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
Sue,

Sorry, we've all been talking across each other a bit. As you can probably tell, there's really no single correct answer on how to do this; rather, there's a variety of options to choose from. Essentially, you have 3 options that have been laid out by Mark, Claudia and myself. I'm not certain which will be *easiest* off the top of my head:

[Option 1] Add the *.txt files to the ORIGINAL bundle (which is where they are added by default). If they are in the ORIGINAL bundle, you will have to run 'filter-media' to filter them into the TEXT bundle. Then you will run 'index-all' to index them for searching (as noted, 'index-all' only indexes documents in the TEXT bundle). You will also need to modify the UI if you don't want these *.txt files to be visible to normal users.

[Option 2] Add the *.txt files to the TEXT bundle directly. There is no way to do this via the normal DSpace user interfaces. You can, however, do this during the normal command-line bulk item import process by specifying a bundle name in the 'contents' file. See the DSpace Docs for more information on this:
http://dspace.svn.sourceforge.net/viewvc/dspace/trunk/dspace/docs/application.html#itemimporter

[Option 3] Claudia's suggestion is very similar to Option #1. However, as she notes, an easy way to hide the *.txt files from the UI is to go into the DSpace Administration UI (specifically the Bitstream Format Registry) and mark the *.txt format as internal. This tells DSpace that ALL *.txt files should be considered internal files, and should NEVER be displayed in the UI. So, you'd only want to do this if you never want any *.txt files to be displayed from the UI.

In my opinion (others may have differing opinions), it'd be safer and potentially easier to go with either option #1 or #3. The danger of option #2 is that the TEXT bundle tends to be managed by the filter-media script in DSpace. As long as you are always aware that you manually added files to this bundle, you should be fine.
But if you ever ran 'filter-media' in force mode (with the -f option), there'd be a possibility that the 'filter-media' script would overwrite all your manually added *.txt files in that bundle.

Hopefully that gives you a decent lay of the land. There may be yet other options out there, but at least this gives you a few to work off of.

- Tim

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:

I did the following query against the bundle table and it seems we only have 3 bundle names in the table: LICENSE, ORIGINAL, TEXT:

    select count(*), name
    from bundle
    group by 2
    order by 2

All the .txt files we created in our 1000 document test are in the ORIGINAL bundle, according to NAME in the bundle table. So if I run this query and then run index-all, these .txt files should be searchable, correct?

    UPDATE bundle
    SET name = 'TEXT'
    WHERE bundle_id =
        (SELECT bu.bundle_id
         FROM bitstream bi, bundle2bitstream b2b, bundle bu
         WHERE bi.bitstream_id = b2b.bitstream_id
           AND b2b.bundle_id = bu.bundle_id
           AND bundle.bundle_id = bu.bundle_id
           AND bu.name = 'ORIGINAL'
           AND bi.name LIKE '%.txt')

Let me know what you think.

Thanks again,
Sue

-Original Message-
From: Tim Donohue [mailto:tdono...@illinois.edu]
Sent: Tuesday, January 20, 2009 2:12 PM
To: Diggory Mark
Cc: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]; dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

Mark,

That's correct, the indexer only indexes files in the TEXT bundle. But that's why I had recommended that Susan first run the 'filter-media' script. The 'filter-media' script will take text files in the CONTENT bundle and essentially copy them over to the TEXT bundle for indexing.
So, you are correct that the *.txt files could be put in the TEXT bundle immediately (which would also avoid them being exposed publicly). But the alternative would be to put the *.txt files in the CONTENT bundle and run 'filter-media' to filter them into the TEXT bundle. (However, as you noted, this latter option would require UI alteration to hide the *.txt files, if they shouldn't be accessible.)

- Tim

Diggory Mark wrote:

Actually... Looking at the code of DSIndexer... I'm sure, written by among others... myself. We find that only Bitstreams within the TEXT bundle are actually indexed into Lucene:

    for (int i = 0; i < myBundles.length; i
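The loop Mark quotes is cut off in the archive. As a rough, self-contained sketch of the same check (only bundles whose name is exactly TEXT are handed to the indexer), the following uses a stand-in SketchBundle class, not DSpace's real bundle class, and the file names are made up. The comparison operators and the quotes around TEXT are written back in on the assumption that the archive stripped them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a DSpace bundle; not the real DSpace API.
class SketchBundle {
    private final String name;
    private final List<String> files;

    SketchBundle(String name, String... files) {
        this.name = name;
        this.files = Arrays.asList(files);
    }

    String getName() { return name; }
    List<String> getFiles() { return files; }
}

class TextBundleCheckSketch {
    // Same shape as the quoted condition: null-safe name check, then equals("TEXT").
    static List<String> filesSeenByIndexer(SketchBundle[] myBundles) {
        List<String> indexed = new ArrayList<>();
        for (int i = 0; i < myBundles.length; i++) {
            if ((myBundles[i].getName() != null)
                    && myBundles[i].getName().equals("TEXT")) {
                indexed.addAll(myBundles[i].getFiles());
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        SketchBundle[] bundles = {
            new SketchBundle("ORIGINAL", "report.pdf", "report.txt"),
            new SketchBundle("TEXT", "report.pdf.txt"),
            new SketchBundle(null, "orphan.txt")
        };
        // Only the file in the TEXT bundle is listed.
        System.out.println(filesSeenByIndexer(bundles));
    }
}
```

A .txt file sitting in ORIGINAL is never reached by this loop, which is why the thread keeps coming back to how the text gets into the TEXT bundle.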
Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
Hi Tim,

Thanks to all for the suggestions. Basically I am trying to prevent filter-media from attempting to filter our .pdf files and I want index-all to index only our .txt files. So if I remove the pdffilter parameters from dspace.cfg and I have all our .txt files in the TEXT bundle (using one of the 3 options you outlined), this should work and we shouldn't have to run filter-media at all, right?

Thanks again,
Sue

-Original Message-
From: Tim Donohue [mailto:tdono...@illinois.edu]
Sent: Wednesday, January 21, 2009 10:54 AM
To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
Cc: Diggory Mark; dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
Sue,

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:

Hi Tim, Thanks to all for the suggestions. Basically I am trying to prevent filter-media from attempting to filter our .pdf files and I want index-all to index only our .txt files. So if I remove the pdffilter parameters from dspace.cfg and I have all our .txt files in the TEXT bundle (using one of the 3 options you outlined), this should work and we shouldn't have to run filter-media at all, right?

That's almost correct, except for the last part of your statement. You'll notice that in the options I laid out below, Options #1 and #3 specifically state you STILL need to run 'filter-media'. This is because in both those options you are starting with the *.txt files in the ORIGINAL bundle, and they need to be copied to the TEXT bundle before they can be indexed.

Is this starting to make some sense? Filter-media is what does extraction of full text (from PDF, HTML, Word or Plain text formats) and generates a corresponding *.txt file in the TEXT bundle containing the extracted full text. Since the 'index-all' script will ONLY index *.txt from the TEXT bundle, you will always need to run 'filter-media' first unless you've manually added *.txt to the TEXT bundle.

- Tim
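Tim's rule in this reply ('index-all' only sees the TEXT bundle, so 'filter-media' has to run first unless the text was placed in TEXT by hand) can be pictured with a toy model. This is only a sketch under loud assumptions: the class and its methods are invented for illustration, an item is reduced to a map from bundle name to file names, and nothing here calls DSpace itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of one item: bundle name -> file names. Not DSpace's data model.
class FilterThenIndexSketch {
    private final Map<String, List<String>> bundles = new HashMap<>();

    void add(String bundle, String file) {
        bundles.computeIfAbsent(bundle, k -> new ArrayList<>()).add(file);
    }

    // Stand-in for 'filter-media': copy each *.txt found in ORIGINAL into TEXT.
    void filterMedia() {
        List<String> originals =
                bundles.getOrDefault("ORIGINAL", Collections.<String>emptyList());
        for (String file : originals) {
            boolean alreadyThere =
                    bundles.getOrDefault("TEXT", Collections.<String>emptyList()).contains(file);
            if (file.endsWith(".txt") && !alreadyThere) {
                add("TEXT", file);
            }
        }
    }

    // Stand-in for 'index-all': only the TEXT bundle becomes searchable.
    List<String> indexAll() {
        return new ArrayList<>(bundles.getOrDefault("TEXT", Collections.<String>emptyList()));
    }

    public static void main(String[] args) {
        FilterThenIndexSketch item = new FilterThenIndexSketch();
        item.add("ORIGINAL", "report.pdf");
        item.add("ORIGINAL", "report.txt");
        System.out.println("before filter-media: " + item.indexAll());
        item.filterMedia();
        System.out.println("after filter-media:  " + item.indexAll());
    }
}
```

In the model, indexing before the filter step finds nothing, which matches Options #1 and #3. Adding the file straight to TEXT (Option #2) would make the filter step unnecessary for that file.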
Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
Thanks for all your help Tim! I think this will help us out a lot!

Best,
Sue

-Original Message-
From: Tim Donohue [mailto:tdono...@illinois.edu]
Sent: Wednesday, January 21, 2009 4:54 PM
To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
Cc: Diggory Mark; dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
Mark,

That's correct, the indexer only indexes files in the TEXT bundle. But that's why I had recommended that Susan first run the 'filter-media' script. The 'filter-media' script will take text files in the CONTENT bundle and essentially copy them over to the TEXT bundle for indexing.

So, you are correct that the *.txt files could be put in the TEXT bundle immediately (which would also avoid them being exposed publicly). But the alternative would be to put the *.txt files in the CONTENT bundle and run 'filter-media' to filter them into the TEXT bundle. (However, as you noted, this latter option would require UI alteration to hide the *.txt files, if they shouldn't be accessible.)

- Tim

Diggory Mark wrote:

Actually... Looking at the code of DSIndexer... I'm sure, written by among others... myself. We find that only Bitstreams within the TEXT bundle are actually indexed into Lucene:

    for (int i = 0; i < myBundles.length; i++) {
        if ((myBundles[i].getName() != null)
                && myBundles[i].getName().equals("TEXT")) {

I'm thinking this was short-sighted, and the unhappy consequence is that your text files will not get indexed if you place them into the CONTENT Bundle.

There are two solutions:

A.) Put your text bitstreams into the TEXT bundle and not have to worry about them being exposed, because the TEXT bundle will not be.

B.) Put your text Bitstreams in the Content Bundle, alter the UI to hide them, and alter DSIndexer to index the CONTENT bundle.

Mark

On Jan 16, 2009, at 2:40 PM, Tim Donohue wrote:

Susan,

Actually, the setting you'd want to change in your DSpace 1.4.2 dspace.cfg is this one:

    plugin.sequence.org.dspace.app.mediafilter.MediaFilter = ...

You'd want to remove the entry for:

    org.dspace.app.mediafilter.PDFFilter

That'd ensure that the PDFFilter is no longer used by filter-media. The setting that you referenced below just configures the PDF filter to process files which are Adobe PDF format.
[NOTE:] If you end up upgrading to DSpace 1.5.x, the above plugin.sequence.org.dspace.app.mediafilter.MediaFilter setting no longer exists. Instead, it was replaced by a simpler filter.plugins setting. In that case, for DSpace 1.5.x, you'd just remove PDF Text Extractor from the list of enabled filter.plugins. Again, this would ensure that 'filter-media' would no longer use the PDF filter.

Hopefully that all makes sense... Beyond that, as you mentioned, you'd just need to hide those '*.txt' files from being displayed.

- Tim

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:

Hi Tim,

So you're saying that our proposed solution would work as long as we remove (or comment out):

    filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF

from dspace.cfg and make the change to not display the .txt files on the Item pages? Then we would still need to run filter-media, which would basically only add our .txt files to the TEXT bundle for each Item?

By the way, we have been using the 1.5 version of filter-media, with the addition of the two new configuration parameters in dspace.cfg, for a while, even though we are running DSpace 1.4.2. I did this a while back and yes, it has stopped the JAVA heap space errors from killing filter-media midstream.

I do think this new plan is the better way to go for us. I believe the advantages would be:

1. No more filter-media running for so long – over 24 hours most of the time.

2. We would identify “problematic” .pdf files (ones that possibly wouldn’t filter) prior to importing them into DSpace, instead of after-the-fact. When these problems are caught at the scanning point, they could be dealt with there and then (rescanning/re-ocr’ing, etc).

3. Our Users wouldn’t have such a big job of identifying the “unfilterable” documents, locating them for rescanning, getting them back to us for re-import, etc etc.

4. Bottom line would be a more accurate full-text searchable repository.
Thanks a bunch for the detailed feedback. We are processing a 1000 document test with this new procedure and will let you know how it goes!!

Sue

-Original Message-
From: Tim Donohue [mailto:tdono...@illinois.edu]
Sent: Thursday, January 15, 2009 11:27 AM
To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
Cc: dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

Sue,

There were some improvements to 'filter-media' in DSpace 1.5.x. Primarily, there's the addition of two new PDF-specific settings in the dspace.cfg:

    pdffilter.largepdfs = true
    pdffilter.skiponmemoryexception = true
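For the 1.4.2 change Tim describes earlier in this chain (removing org.dspace.app.mediafilter.PDFFilter from plugin.sequence.org.dspace.app.mediafilter.MediaFilter), the edit amounts to dropping one entry from a comma-separated list. A small sketch of that, with assumptions: the helper name is made up, and the other two filter class names in the sample value are placeholders for illustration, so check your own dspace.cfg for the real list.

```java
import java.util.ArrayList;
import java.util.List;

class PluginSequenceSketch {
    // Drops one class name from a comma-separated plugin list and re-joins the rest.
    static String withoutPlugin(String sequence, String className) {
        List<String> kept = new ArrayList<>();
        for (String entry : sequence.split(",")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty() && !trimmed.equals(className)) {
                kept.add(trimmed);
            }
        }
        return String.join(", ", kept);
    }

    public static void main(String[] args) {
        // Illustrative value only; the non-PDF entries are assumed names.
        String before = "org.dspace.app.mediafilter.PDFFilter, "
                + "org.dspace.app.mediafilter.HTMLFilter, "
                + "org.dspace.app.mediafilter.WordFilter";
        System.out.println(withoutPlugin(before, "org.dspace.app.mediafilter.PDFFilter"));
    }
}
```

In practice this is a one-line manual edit to dspace.cfg; the sketch only makes explicit that the remaining filters stay enabled, so 'filter-media' still handles the .txt files.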
Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
Would taking either of these two suggestions allow our new procedures to work? Basically we are trying to get around the problem of having unfilterable (by PDFBOX) documents in DSpace and to create a repository that is going to return the most accurate search results humanly possible.

Thanks Mark,
Sue

-Original Message-
From: Diggory Mark [mailto:mdigg...@gmail.com]
Sent: Friday, January 16, 2009 6:10 PM
To: Tim Donohue
Cc: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]; dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
Sue,

There were some improvements to 'filter-media' in DSpace 1.5.x. Primarily, there's the addition of two new PDF-specific settings in the dspace.cfg:

    pdffilter.largepdfs = true
    pdffilter.skiponmemoryexception = true

The former ensures that all PDF text-extractions are written to temporary files during indexing. This helps avoid OutOfMemoryException Heap space errors that were occasionally caused by larger PDFs being loaded into system memory all at once.

The latter attempts to skip over any PDFs which still cause an OutOfMemoryException. So, if that exception still occurs on a PDF, then the PDF is skipped entirely and *not* indexed. This helps to avoid the entire 'filter-media' script crashing when an OutOfMemoryException occurs (which used to happen in 1.4.2).

Despite these changes in 1.5.x, there is NO guarantee that *all* of your PDFs will index properly. As I've mentioned before, the 'filter-media' script uses third-party software (called PDFBox: http://www.pdfbox.org/) for indexing of PDF files. There are some known bugs in PDFBox that have yet to be fixed, so it does *not* always work for all PDFs. In some cases, PDFBox will also work inconsistently (and I don't know why that is). I've run into some inconsistency problems with larger-sized PDFs which were originally scanned documents with embedded OCR. Occasionally PDFBox will index them fine, and other times it will cause an OutOfMemoryException (which, with DSpace 1.5, means that 'filter-media' will just skip that PDF).

So, I guess the best way to sum this up is that DSpace currently cannot successfully index 100% of all PDFs, since PDFBox cannot do so.
DSpace 1.5 has improvements in helping DSpace to safely handle PDFBox issues (like the OutOfMemoryExceptions), but it doesn't necessarily have drastic improvements in indexing capabilities.

I answered your other questions inline below...

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:

1. Has the filter-media/index-all process changed and/or improved significantly in DSpace 1.5? If so, we may just shelve this issue until we’ve implemented 1.5.

See above, obviously...

2. In DSpace 1.4.2 (and 1.5), does it matter whether your .txt files are plain or accessible .txt files? Can index-all process either type?

For text files, it doesn't really matter... in either case the 'filter-media' script just pulls out the plain text for indexing. I don't believe there'd be any significant difference between the types of .txt file.

However, it's worth making this clear: for .txt files, you *still* need to run the 'filter-media' script for them to be indexed by 'index-all'. Essentially, 'index-all' only indexes plain text files in the TEXT bundle. The 'filter-media' script is what adds plain text to the TEXT bundle.

3. If the process hasn’t changed and/or improved significantly in 1.5, we are considering having our scanning folks just create the .txt files along with the .pdf files at the time the documents are scanned. Then when they send them to us, we would just upload them in the import process along with the .pdf files for each Item. The only thing we’d really have to change in our import process is the addition of a second file name in the “contents” file and the addition of the .txt document in the Item’s import directory (right along with the .pdf file). One other issue is we might have to make a small modification to DSpace to **not** display the .txt file on the Item page unless the User is in the Admin interface, since we wouldn’t want our Users clicking on/opening the .txt files.
If we did this, we could completely eliminate the filter-media job altogether. This would ensure that we did not load any “unfilterable” documents into DSpace. It would also eliminate the tedious process of identifying which documents did not filter successfully, and the whole process of rescanning and replacing them in DSpace. This sounds like a perfectly reasonable way of doing things, assuming you have the staff time to pre-generate those .txt files. You are correct that you'd no longer need to run 'filter-media' on those PDFs. But, you'd still need to run 'filter-media' to index those .txt files. You could do this by modifying the Media Filter settings in your dspace.cfg and *removing* the PDFFilter from the list (so 'filter-media' would no longer filter PDFs, but it would work on the other types of content). It would also require some custom coding to hide those .txt files from normal users, but that shouldn't be too horrible. If you did go this route, I'd make sure that you still OCR the PDFs
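
The effect of pdffilter.skiponmemoryexception described earlier in this message can be pictured with a small Java sketch. This is not DSpace code: the class, the method names and the fake extractor are all hypothetical, and the heap failure is simulated by throwing java.lang.OutOfMemoryError directly. It only illustrates the difference between aborting the whole batch and skipping the offending file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a filter-media style batch loop; none of these
// names come from the DSpace code base.
class SkipOnMemoryErrorSketch {

    // Pretend text extractor: fails with a simulated heap error for any
    // file whose name contains "huge", otherwise returns dummy text.
    static String extractText(String fileName) {
        if (fileName.contains("huge")) {
            throw new OutOfMemoryError("simulated: Java heap space");
        }
        return "text of " + fileName;
    }

    // Returns the names of the files that were filtered successfully.
    // With skipOnMemoryError = false the first failure aborts the whole run.
    static List<String> filterAll(List<String> fileNames, boolean skipOnMemoryError) {
        List<String> filtered = new ArrayList<>();
        for (String name : fileNames) {
            try {
                extractText(name);
                filtered.add(name);
            } catch (OutOfMemoryError e) {
                if (!skipOnMemoryError) {
                    throw e; // behaviour described for 1.4.2: the batch dies here
                }
                System.err.println("Skipping " + name + ": " + e.getMessage());
            }
        }
        return filtered;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.pdf", "huge-scan.pdf", "b.pdf");
        System.out.println(filterAll(files, true)); // prints [a.pdf, b.pdf]
    }
}
```

Note that a skipped file is simply missing from the result, which matches the point above: skipped PDFs are not indexed and still have to be tracked down separately.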
Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
Susan,

Actually, the setting you'd want to change in your DSpace 1.4.2 dspace.cfg is this one:

    plugin.sequence.org.dspace.app.mediafilter.MediaFilter = ...

You'd want to remove the entry for:

    org.dspace.app.mediafilter.PDFFilter

That would ensure that the PDFFilter is no longer used by filter-media. The setting that you referenced below just configures the PDF filter to process files which are in Adobe PDF format.

[NOTE:] If you end up upgrading to DSpace 1.5.x, the above plugin.sequence.org.dspace.app.mediafilter.MediaFilter setting no longer exists. Instead, it was replaced by a simpler filter.plugins setting. In that case, for DSpace 1.5.x, you'd just remove "PDF Text Extractor" from the list of enabled filter.plugins. Again, this would ensure that 'filter-media' would no longer use the PDF filter.

Hopefully that all makes sense. Beyond that, as you mentioned, you'd just need to hide those '*.txt' files from being displayed.

- Tim

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:

> Hi Tim,
>
> So you're saying that our proposed solution would work as long as we remove (or comment out):
>
>     filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF
>
> from dspace.cfg and make the change to not display the .txt files on the Item pages? Then we would still need to run filter-media, which would basically only add our .txt files to the TEXT bundle for each Item?
>
> By the way, we have been using the 1.5 version of filter-media, with the addition of the two new configuration parameters in dspace.cfg, for a while, even though we are running DSpace 1.4.2. I did this a while back and yes, it has stopped the Java heap space errors from killing filter-media midstream.
>
> I do think this new plan is the better way to go for us. I believe the advantages would be:
>
> 1. No more filter-media running for so long (over 24 hours most of the time).
> 2. We would identify “problematic” .pdf files (ones that possibly wouldn’t filter) prior to importing them into DSpace, instead of after the fact. When these problems are caught at the scanning point, they could be dealt with there and then (rescanning, re-OCR'ing, etc.).
> 3. Our Users wouldn’t have such a big job of identifying the “unfilterable” documents, locating them for rescanning, getting them back to us for re-import, etc.
> 4. Bottom line would be a more accurate full-text searchable repository.
>
> Thanks a bunch for the detailed feedback. We are processing a 1,000-document test with this new procedure and will let you know how it goes!!
>
> Sue
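
The dspace.cfg change Tim describes at the top of this message amounts to dropping one entry from a comma-separated list. As a rough sketch only, the Java helper below shows that edit on a string value. The helper class is hypothetical (it is not a DSpace utility), and the other filter names in the sample list are assumed for illustration, so check the actual value in your own dspace.cfg.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of DSpace: drops one entry from a
// comma-separated dspace.cfg value such as the media filter plugin list.
class PluginListSketch {

    static String removeEntry(String listValue, String entryToRemove) {
        List<String> kept = new ArrayList<>();
        for (String entry : listValue.split(",")) {
            String trimmed = entry.trim();
            // Keep every non-empty entry except the one being removed.
            if (!trimmed.isEmpty() && !trimmed.equals(entryToRemove)) {
                kept.add(trimmed);
            }
        }
        return String.join(", ", kept);
    }

    public static void main(String[] args) {
        // Example value only; the real list in your dspace.cfg may differ.
        String sequence = "org.dspace.app.mediafilter.PDFFilter, "
                + "org.dspace.app.mediafilter.HTMLFilter, "
                + "org.dspace.app.mediafilter.WordFilter";
        System.out.println(removeEntry(sequence, "org.dspace.app.mediafilter.PDFFilter"));
    }
}
```

In practice the edit is made by hand in dspace.cfg; the same idea would apply to removing "PDF Text Extractor" from filter.plugins in 1.5.x.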
Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions
Actually...

Looking at the code of DSIndexer (written by, among others, myself), we find that only Bitstreams within the TEXT bundle are actually indexed into Lucene:

    for (int i = 0; i < myBundles.length; i++)
    {
        if ((myBundles[i].getName() != null)
                && myBundles[i].getName().equals("TEXT"))
        {

I think this was short-sighted, and the unhappy consequence is that your text files will not get indexed if you place them into the ORIGINAL (content) bundle.

There are two solutions:

A.) Put your text Bitstreams into the TEXT bundle and don't worry about them being exposed, because the TEXT bundle is not displayed.

B.) Put your text Bitstreams in the ORIGINAL (content) bundle, alter the UI to hide them, and alter DSIndexer to index that bundle as well.

Mark

On Jan 16, 2009, at 2:40 PM, Tim Donohue wrote:

> Susan,
>
> Actually, the setting you'd want to change in your DSpace 1.4.2 dspace.cfg is this one:
>
>     plugin.sequence.org.dspace.app.mediafilter.MediaFilter = ...
>
> You'd want to remove the entry for:
>
>     org.dspace.app.mediafilter.PDFFilter
>
> That would ensure that the PDFFilter is no longer used by filter-media. The setting that you referenced below just configures the PDF filter to process files which are in Adobe PDF format.
>
> [NOTE:] If you end up upgrading to DSpace 1.5.x, the above plugin.sequence.org.dspace.app.mediafilter.MediaFilter setting no longer exists. Instead, it was replaced by a simpler filter.plugins setting. In that case, for DSpace 1.5.x, you'd just remove "PDF Text Extractor" from the list of enabled filter.plugins. Again, this would ensure that 'filter-media' would no longer use the PDF filter.
>
> Hopefully that all makes sense. Beyond that, as you mentioned, you'd just need to hide those '*.txt' files from being displayed.
>
> - Tim
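
To make the consequence of the DSIndexer check in Mark's message concrete, here is a self-contained Java sketch. It assumes the quoted condition is a `<` loop bound combined with a logical AND on the bundle name, and it uses a tiny hypothetical Bundle class in place of the real DSpace objects. File names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for DSpace's bundle and bitstream objects; these
// classes are hypothetical and exist only to make the example runnable.
class TextBundleIndexSketch {

    static class Bundle {
        final String name;
        final List<String> bitstreams;

        Bundle(String name, String... bitstreams) {
            this.name = name;
            this.bitstreams = Arrays.asList(bitstreams);
        }
    }

    // Same shape as the quoted check: only bundles named "TEXT" contribute
    // bitstreams to the full-text index; every other bundle is ignored.
    static List<String> bitstreamsToIndex(Bundle[] myBundles) {
        List<String> toIndex = new ArrayList<>();
        for (int i = 0; i < myBundles.length; i++) {
            if ((myBundles[i].name != null) && myBundles[i].name.equals("TEXT")) {
                toIndex.addAll(myBundles[i].bitstreams);
            }
        }
        return toIndex;
    }

    public static void main(String[] args) {
        Bundle[] item = {
            new Bundle("ORIGINAL", "report.pdf", "report.txt"),
            new Bundle("TEXT", "report.pdf.txt")
        };
        System.out.println(bitstreamsToIndex(item)); // prints [report.pdf.txt]
    }
}
```

The report.txt placed in ORIGINAL never reaches the index in this model, which is why the thread's options all come down to getting the text into the TEXT bundle, either directly at import or through 'filter-media'.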