Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

2009-01-21 Thread Tim Donohue
Sue,

Sorry, we've all been talking across each other a bit.  As you can 
probably tell, there's really no single correct answer on how to do this; 
rather, there's a variety of options to choose from.

Essentially, you have 3 options that have been laid out by Mark, Claudia 
and myself.  I'm not certain which will be *easiest* off the top of my head:

[Option 1]  Add the *.txt files to the ORIGINAL bundle (which is where 
they are added by default).  If they are in the ORIGINAL bundle, you 
will have to run 'filter-media' to filter them into the TEXT bundle.  
Then, you will run 'index-all' to index them for searching (as noted, 
'index-all' only indexes documents in the TEXT bundle).  You will also 
need to modify the UI if you don't want these *.txt files to be visible 
to normal users.

[Option 2]  Add the *.txt files to the TEXT bundle directly.  There is 
no way to do this via normal DSpace user interfaces.  You can however do 
this during the normal command-line bulk item import process by 
specifying a bundle name in the 'contents' file.  See the DSpace Docs 
for more information on this:
http://dspace.svn.sourceforge.net/viewvc/dspace/trunk/dspace/docs/application.html#itemimporter

[Option 3]  Claudia's suggestion is very similar to Option #1.  However, 
as she notes, an easy way to hide the *.txt files from the UI is to go 
into the DSpace Administration UI (specifically the Bitstream Format 
Registry) and mark the *.txt format as internal.  This tells DSpace 
that ALL *.txt files should be considered internal files, and should 
NEVER be displayed in the UI.  So, you'd only want to do this if you 
never want any *.txt files to be displayed from the UI.


In my opinion (others may have differing opinions), it'd be safer and 
potentially easier to go with either option #1 or #3.  The danger of 
option #2 is that the TEXT bundle tends to be managed by the 
filter-media script in DSpace.  As long as you are always aware that 
you manually added files to this bundle, you should be fine.  But, if 
you ever ran 'filter-media' in force mode (with the -f option), 
there'd be a possibility the 'filter-media' script would overwrite all 
your manually added *.txt files in that bundle.

Hopefully that gives you a decent lay of the land.  There may be yet 
other options out there, but at least this gives you a few to work off of.

- Tim



Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
 I did the following query against the bundle table and it seems we only 
 have 3 bundle names in the table:  LICENSE, ORIGINAL, TEXT:
 
   select count(*)
        , name
     from bundle
    group by 2
    order by 2
 
 All the .txt files we created in our 1000 document test are in the 
 ORIGINAL bundle, according to NAME in the bundle table.  So if I run 
 this query and then run index-all, these .txt files should be 
 searchable, correct?
 
   UPDATE bundle
      SET name = 'TEXT'
    WHERE bundle_id =
          (SELECT bu.bundle_id
             FROM bitstream bi
                , bundle2bitstream b2b
                , bundle bu
            WHERE bi.bitstream_id = b2b.bitstream_id
              AND b2b.bundle_id   = bu.bundle_id
              AND bundle.bundle_id = bu.bundle_id
              AND bu.name = 'ORIGINAL'
              AND bi.name LIKE '%.txt')
 
 Let me know what you think.
 
 Thanks again,
 Sue
 

Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

2009-01-21 Thread Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
Hi Tim,
 Thanks to all for the suggestions.  Basically I am trying to
prevent filter-media from attempting to filter our .pdf files and I want
index-all to index only our .txt files.

 So if I remove the pdffilter parameters from dspace.cfg and I have
all our .txt files in the TEXT bundle (using one of the 3 options you
outlined), this should work and we shouldn't have to run filter-media at
all, right?

Thanks again,
Sue


Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

2009-01-21 Thread Tim Donohue
Sue,

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
 Hi Tim,
  Thanks to all for the suggestions.  Basically I am trying to
 prevent filter-media from attempting to filter our .pdf files and I want
 index-all to index only our .txt files.
 
  So if I remove the pdffilter parameters from dspace.cfg and I have
 all our .txt files in the TEXT bundle (using one of the 3 options you
 outlined), this should work and we shouldn't have to run filter-media at
 all, right?

That's almost correct, except for the last part of your statement. 
You'll notice that in the options I laid out below, Options #1 and #3 
specifically state you STILL need to run 'filter-media'.  This is 
because in both those options you are starting with the *.txt files in 
the ORIGINAL bundle, and they need to be copied to the TEXT bundle 
before they can be indexed.

Is this starting to make some sense?  Filter-media is what does 
extraction of full text (from PDF, HTML, Word or Plain text formats) and 
generates a corresponding *.txt file in the TEXT bundle containing the 
extracted full text.  Since the 'index-all' script will ONLY index *.txt 
from the TEXT bundle, you will always need to run 'filter-media' first 
unless you've manually added *.txt to the TEXT bundle.
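
The ordering described above can be sketched as a toy model. This is only an illustration under stated assumptions: the map of bundle names to bitstream names, the method names `filterMedia` and `indexAll`, and the `.txt` suffix given to the derived bitstream are all hypothetical stand-ins, not the DSpace API or its actual scripts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of one item: bundle name -> bitstream names. Hypothetical; not the DSpace API.
class FilterThenIndexSketch {
    // Stand-in for 'filter-media': derive a TEXT bitstream for each *.txt found in ORIGINAL.
    static void filterMedia(Map<String, List<String>> item) {
        List<String> text = item.computeIfAbsent("TEXT", k -> new ArrayList<>());
        for (String name : item.getOrDefault("ORIGINAL", List.of())) {
            String derived = name + ".txt"; // assumed naming of the derived bitstream
            if (name.endsWith(".txt") && !text.contains(derived)) {
                text.add(derived);
            }
        }
    }

    // Stand-in for 'index-all': only the TEXT bundle is read.
    static List<String> indexAll(Map<String, List<String>> item) {
        return new ArrayList<>(item.getOrDefault("TEXT", List.of()));
    }

    public static void main(String[] args) {
        Map<String, List<String>> item = new HashMap<>();
        item.put("ORIGINAL", new ArrayList<>(List.of("report.pdf", "report.txt")));
        System.out.println(indexAll(item)); // nothing in TEXT yet, so nothing to index
        filterMedia(item);
        System.out.println(indexAll(item)); // the derived text is now visible to the indexer
    }
}
```

In this model, a file placed directly into the TEXT entry of the map would be returned by `indexAll` without calling `filterMedia` at all, which corresponds to Option #2.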

- Tim


 

Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

2009-01-21 Thread Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
Thanks for all your help Tim!  I think this will help us out a lot!
Best,
Sue


Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

2009-01-20 Thread Tim Donohue
Mark,

That's correct: the indexer only indexes files in the TEXT bundle.  
But that's why I had recommended to Susan to first run the 'filter-media' 
script.  The 'filter-media' script will take text files in the CONTENT 
bundle and essentially copy them over to the TEXT bundle for indexing.

So, you are correct that the *.txt files could be immediately put in the 
TEXT bundle (which would also avoid them being exposed publicly).  But, 
the alternative would be to put the *.txt files in the CONTENT bundle 
and run 'filter-media' to filter it into the TEXT bundle. (However, as 
you noted, this latter option would require UI alteration to hide the 
*.txt files, if they shouldn't be accessible).

- Tim

Diggory Mark wrote:
 Actually...
 
 Looking at the code of DSIndexer... I'm sure, written by among others... 
 myself.  We find that only Bitstreams within the TEXT bundle are 
 actually indexed into Lucene:
 
     for (int i = 0; i < myBundles.length; i++)
     {
         if ((myBundles[i].getName() != null)
                 && myBundles[i].getName().equals("TEXT"))
         {
 
 I'm thinking this was short-sightedness, and the unhappy consequence 
 is that your text files will not get indexed if you place them 
 into the CONTENT Bundle.  There are two solutions:
 
 A.) Put your text bitstreams into the TEXT bundle and not have to worry 
 about them being exposed because the TEXT bundle will not be.
 
 B.) Put your text Bitstreams in the Content Bundle, alter the UI to hide 
 them, and alter DSIndexer to index the CONTENT bundle.
 
 Mark
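
The bundle-name check quoted from DSIndexer above can be tried out in isolation with a small sketch. The `Bundle` class below is a hypothetical stand-in, not DSpace's own class; only the "name equals TEXT" comparison is taken from the quoted excerpt.

```java
import java.util.ArrayList;
import java.util.List;

class TextBundleSketch {
    // Hypothetical stand-in for a bundle; not the real DSpace Bundle class.
    static class Bundle {
        final String name;
        final List<String> bitstreams;

        Bundle(String name, List<String> bitstreams) {
            this.name = name;
            this.bitstreams = bitstreams;
        }
    }

    // Same test as the quoted loop: a bundle contributes only if it is named "TEXT".
    static List<String> indexable(List<Bundle> bundles) {
        List<String> result = new ArrayList<>();
        for (Bundle b : bundles) {
            if (b.name != null && b.name.equals("TEXT")) {
                result.addAll(b.bitstreams);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Bundle> item = List.of(
                new Bundle("ORIGINAL", List.of("report.pdf", "report.txt")),
                new Bundle("TEXT", List.of("report.pdf.txt")));
        // report.txt sits in ORIGINAL, so it is not selected.
        System.out.println(indexable(item));
    }
}
```

Under this check, a text file is picked up only once it (or a copy of it) is in a bundle named TEXT, which is the point both solutions A and B work around.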
 
 On Jan 16, 2009, at 2:40 PM, Tim Donohue wrote:
 
 Susan,

 Actually, the setting you'd want to change in your DSpace 1.4.2
 dspace.cfg is this one:

 plugin.sequence.org.dspace.app.mediafilter.MediaFilter = ...

 You'd want to remove the entry for:
 org.dspace.app.mediafilter.PDFFilter

 That'd ensure that the PDFFilter is no longer used by filter-media.  The
 setting that you referenced below just configures the PDF filter to
 process files which are Adobe PDF format.

 [NOTE:] If you end up upgrading to DSpace 1.5.x, the above
 plugin.sequence.org.dspace.app.mediafilter.MediaFilter setting no
 longer exists.  Instead, it was replaced by a simpler
 filter.plugins setting.  In that case, for DSpace 1.5.x, you'd just
 remove "PDF Text Extractor" from the list of enabled filter.plugins.
 Again, this would ensure that 'filter-media' would no longer use the PDF
 filter.
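
As a rough illustration of the 1.4.2 change, the sketch below drops one class name from a comma-separated plugin sequence. The thread elides the real value of the setting ("..."), so the list used in `main` is an assumed example for illustration only, and `withoutFilter` is a hypothetical helper; in practice the same edit is made by hand in dspace.cfg.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class PluginSequenceSketch {
    // Drop one class name from a comma-separated plugin sequence value.
    static String withoutFilter(String sequence, String className) {
        return Arrays.stream(sequence.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty() && !s.equals(className))
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        // Assumed example value; check your own dspace.cfg for the actual list.
        String before = "org.dspace.app.mediafilter.PDFFilter, "
                + "org.dspace.app.mediafilter.HTMLFilter, "
                + "org.dspace.app.mediafilter.WordFilter";
        String after = withoutFilter(before, "org.dspace.app.mediafilter.PDFFilter");
        System.out.println("plugin.sequence.org.dspace.app.mediafilter.MediaFilter = " + after);
    }
}
```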

 Hopefully that all makes sense...Beyond that, as you mentioned, you'd
 just need to hide those '*.txt' files from being displayed.

 - Tim



 Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
 Hi Tim,

 So you're saying that our proposed solution would work as long as
 we remove (or comment out):



 filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF



 from dspace.cfg and make the change to not display the .txt files on the
 Item pages?



 Then we would still need to run filter-media which would only be to
 basically add our .txt files to the TEXT bundle for each Item?



 By the way, we have been using the 1.5 version of filter-media, with the
 addition of the two new configuration parameters in dspace.cfg, for
 awhile, even though we are running DSpace 1.4.2.  I did this awhile back
 and yes, it has stopped the JAVA heap space errors from killing
 filter-media midstream.



 I do think this new plan is the better way to go for us.  I believe the
 advantages would be:

 1.  No more filter-media running for so long – over 24 hours most of
 the time.

 2.  We would identify “problematic” .pdf files (ones that possibly
 wouldn’t filter) prior to importing them into DSpace, instead of
 after-the-fact.  When these problems are caught at the scanning point,
 they could be dealt with there and then (rescanning/re-ocr’ing, etc).

 3.  Our Users wouldn’t have such a big job of identifying the
 “unfilterable” documents, locating them for rescanning, getting them
 back to us for re-import, etc etc.

 4.  Bottom line would be a more accurate full-text searchable 
 repository.



 Thanks a bunch for the detailed feedback.  We are processing a 1000
 document test with this new procedure and will let you know how it 
 goes!!

 Sue



 -Original Message-
 From: Tim Donohue [mailto:tdono...@illinois.edu]
 Sent: Thursday, January 15, 2009 11:27 AM
 To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
 Cc: dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI
 INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION
 SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
 Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions



 Sue,



 There were some improvements to 'filter-media' in DSpace 1.5.x.

 Primarily, there's the addition of two new PDF-specific settings in the

 dspace.cfg:



 pdffilter.largepdfs = true

 pdffilter.skiponmemoryexception = true

Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

2009-01-20 Thread Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
Would taking either of these two suggestions allow our new procedures
to work?  Basically we are trying to get around the problem of having
documents in DSpace that PDFBox cannot filter, and to create a
repository that returns the most accurate search results humanly
possible.
Thanks Mark,
Sue

-Original Message-
From: Diggory Mark [mailto:mdigg...@gmail.com] 
Sent: Friday, January 16, 2009 6:10 PM
To: Tim Donohue
Cc: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS];
dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI
INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION
SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

Actually...

Looking at the code of DSIndexer... I'm sure, written by among  
others... myself.  We find that only Bitstreams within the TEXT  
bundle are actually indexed into Lucene:

    for (int i = 0; i < myBundles.length; i++)
    {
        if ((myBundles[i].getName() != null)
                && myBundles[i].getName().equals("TEXT"))
        {

I'm thinking this was short-sightedness, and the unhappy consequence  
is that your text files will not get indexed if you place  
them into the CONTENT Bundle.  There are two solutions:

A.) Put your text bitstreams into the TEXT bundle and not have to  
worry about them being exposed because the TEXT bundle will not be.

B.) Put your text Bitstreams in the Content Bundle, alter the UI to  
hide them, and alter DSIndexer to index the CONTENT bundle.

Mark


Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

2009-01-20 Thread Claudia Jürgen
, Douglas Lewis (LARC-B7)[NCI INFORMATION
 SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
 Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions



 Sue,



 There were some improvements to 'filter-media' in DSpace 1.5.x.

 Primarily, there's the addition of two new PDF-specific settings in the

 dspace.cfg:



 pdffilter.largepdfs = true

 pdffilter.skiponmemoryexception = true



 The former ensures that all PDF text-extractions are written to
 temporary files during indexing.  This helps avoid OutOfMemoryException
 and heap space errors that were occasionally caused by larger PDFs being
 loaded into system memory all at once.

 The latter attempts to skip over any PDFs which still cause an
 OutOfMemoryException.  So, if that exception still occurs on a PDF, then
 the PDF is skipped entirely and *not* indexed.  This helps to avoid the
 entire 'filter-media' script crashing when an OutOfMemoryException
 occurs (which used to happen in 1.4.2).
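
The skip-on-failure behaviour described above can be sketched generically. This is not the DSpace or PDFBox code: `filterAll`, the `skipOnMemoryError` flag, and the fake extractor are hypothetical, and the failure is simulated rather than a real memory exhaustion. Note that the class Java actually throws in this situation is `java.lang.OutOfMemoryError`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class SkipOnMemoryErrorSketch {
    // Returns the names that were extracted; a failing file is skipped when the flag is set,
    // otherwise the error propagates and ends the whole run.
    static List<String> filterAll(List<String> pdfs,
                                  Function<String, String> extractor,
                                  boolean skipOnMemoryError) {
        List<String> done = new ArrayList<>();
        for (String pdf : pdfs) {
            try {
                extractor.apply(pdf);
                done.add(pdf);
            } catch (OutOfMemoryError e) {
                if (!skipOnMemoryError) {
                    throw e;
                }
                System.err.println("Skipping " + pdf + ": " + e.getMessage());
            }
        }
        return done;
    }

    public static void main(String[] args) {
        Function<String, String> fakeExtractor = name -> {
            if (name.startsWith("huge")) {
                throw new OutOfMemoryError("Java heap space"); // simulated failure
            }
            return "text of " + name;
        };
        System.out.println(filterAll(List.of("a.pdf", "huge.pdf", "b.pdf"), fakeExtractor, true));
    }
}
```

With the flag set to false, the same input would stop at the second file, which mirrors the 1.4.2 behaviour of the whole run dying midstream.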



 Despite these changes in 1.5.x, there is NO guarantee that *all* of your
 PDFs will index properly.  As I've mentioned before, the 'filter-media'
 script uses third-party software (called PDFBox: http://www.pdfbox.org/)
 for indexing of PDF files.  There are some known bugs in PDFBox that
 have yet to be fixed, so it does *not* always work for all PDFs.  In
 some cases, PDFBox will also work inconsistently (and I don't know why
 that is).  I've run into some inconsistency problems with larger-sized
 PDFs, which are originally scanned documents with embedded OCR.
 Occasionally PDFBox will index them fine, and other times it will cause
 an OutOfMemoryException (which, with DSpace 1.5, means that
 'filter-media' will just skip that PDF).

 So, I guess the best way to sum this up is that DSpace currently cannot
 successfully index 100% of all PDFs, since PDFBox cannot do so.  DSpace
 1.5 has improvements in helping DSpace to safely handle PDFBox issues
 (like the OutOfMemoryExceptions), but it doesn't necessarily have
 drastic improvements in indexing capabilities.



 I answered your other questions inline below...

 Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:



 1.   Has the filter-media/index-all process changed
 and/or improved significantly in DSpace 1.5?  If so, we may just shelve
 this issue until we’ve implemented 1.5.


 See above, obviously...



 2.   In DSpace 1.4.2 (and 1.5), does it matter whether
 your .txt files are plain or accessible .txt files?  Can index-all
 process either type?


 For text files, it doesn't really matter...in either case the
 'filter-media' script just pulls out the plain text for indexing.  I
 don't believe there'd be any significant difference between the type
 of .txt file.

 However, it's worth making this clear: for .txt files, you *still* need
 to run the 'filter-media' script for them to be indexed by 'index-all'.
 Essentially, 'index-all' only indexes plain text files in the TEXT
 bundle.  The 'filter-media' script is what adds plain text to the TEXT
 bundle.



 3.   If the process in 1.5 hasn’t changed and/or
 improved significantly in 1.5, we are considering having our scanning
 folks just create the .txt files along with the .pdf files at the time
 the documents are scanned.  Then when they send them to us, we would
 just upload them in the import process along with the .pdf files for
 each Item.  The only thing we’d really have to change in our import
 process is the addition of a second file name in the “contents” file and
 the addition of the .txt document in the Item’s import directory (right
 along with the .pdf file).  One other issue is we might have to make a
 small modification to DSpace to **not** display the .txt file on the
 Item page unless the User is in the Admin interface since we wouldn’t
 want our Users clicking on/opening the .txt files.  If we did this, we
 could completely eliminate the filter-media job altogether.  This would
 ensure that we did not load any “unfilterable” documents into DSpace.
 It would also eliminate the tedious process of identifying which
 documents did not filter successfully, and the whole process of
 rescanning and replacing them in DSpace.


 This sounds like a perfectly reasonable way of doing things, assuming
 you have the staff time to pre-generate those .txt files.  You are
 correct that you'd no longer need to run 'filter-media' on those PDFs.
 But, you'd still need to run 'filter-media' to index those .txt files.
 You could do this by modifying the Media Filter settings in your
 dspace.cfg and *removing* the PDFFilter from the list (so 'filter-media'
 would no longer filter PDFs, but it would work on the other types of
 content).

 It would also require some custom coding to hide those .txt files from
 normal users, but that shouldn't be too horrible.

 If you did go this route, I'd make sure that you still OCR the PDFs

Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

2009-01-16 Thread Tim Donohue
Susan,

Actually, the setting you'd want to change in your DSpace 1.4.2 
dspace.cfg is this one:

plugin.sequence.org.dspace.app.mediafilter.MediaFilter = ...

You'd want to remove the entry for:
org.dspace.app.mediafilter.PDFFilter

That'd ensure that the PDFFilter is no longer used by filter-media.  The 
setting that you referenced below just configures the PDF filter to 
process files which are Adobe PDF format.

[NOTE:] If you end up upgrading to DSpace 1.5.x, the above 
plugin.sequence.org.dspace.app.mediafilter.MediaFilter setting no 
longer exists.  Instead, it was replaced by a simpler 
'filter.plugins' setting.  In that case, for DSpace 1.5.x, you'd just 
remove 'PDF Text Extractor' from the list of enabled filter.plugins. 
Again, this would ensure that 'filter-media' would no longer use the PDF 
filter.
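
As a rough illustration of what "removing the entry" amounts to, here is a small Java sketch that drops one name from a comma-separated plugin list. The class PluginListSketch and the method removePlugin are hypothetical helpers, not part of DSpace, and any entry other than the PDF one mentioned above is a placeholder for whatever your own dspace.cfg lists. In practice you would simply edit the line in dspace.cfg by hand.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class PluginListSketch {

    /**
     * Returns the comma-separated list with every occurrence of the
     * unwanted entry removed. Entries are trimmed before comparison.
     */
    public static String removePlugin(String commaSeparatedList, String unwanted) {
        return Arrays.stream(commaSeparatedList.split(","))
                .map(String::trim)
                .filter(entry -> !entry.isEmpty() && !entry.equals(unwanted))
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        // 1.4.2 style: class names (the non-PDF entry is a placeholder).
        System.out.println(removePlugin(
                "org.dspace.app.mediafilter.PDFFilter, com.example.OtherFilter",
                "org.dspace.app.mediafilter.PDFFilter"));
        // 1.5.x style: plugin names (the non-PDF entry is a placeholder).
        System.out.println(removePlugin(
                "PDF Text Extractor, Some Other Filter",
                "PDF Text Extractor"));
    }
}
```

The same helper covers both configuration styles because only the value of the setting changes between 1.4.2 and 1.5.x, not the comma-separated shape of the list.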

Hopefully that all makes sense...Beyond that, as you mentioned, you'd 
just need to hide those '*.txt' files from being displayed.

- Tim



Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
 Hi Tim,
 
  So you're saying that our proposed solution would work as long as 
 we remove (or comment out):
 
  
 
 *filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF*
 
  
 
 from dspace.cfg and make the change to not display the .txt files on the 
 Item pages?
 
  
 
 Then we would still need to run filter-media which would only be to 
 basically add our .txt files to the TEXT bundle for each Item? 
 
  
 
 By the way, we have been using the 1.5 version of filter-media, with the 
 addition of the two new configuration parameters in dspace.cfg, for 
 awhile, even though we are running DSpace 1.4.2.  I did this awhile back 
 and yes, it has stopped the JAVA heap space errors from killing 
 filter-media midstream.
 
  
 
 I do think this new plan is the better way to go for us.  I believe the 
 advantages would be:
 
 1.  No more filter-media running for so long – over 24 hours most of 
 the time.
 
 2.  We would identify “problematic” .pdf files (ones that possibly 
 wouldn’t filter) prior to importing them into DSpace, instead of 
 after-the-fact.  When these problems are caught at the scanning point, 
 they could be dealt with there and then (rescanning/re-ocr’ing, etc).
 
 3.  Our Users wouldn’t have such a big job of identifying the 
 “unfilterable” documents, locating them for rescanning, getting them 
 back to us for re-import, etc etc. 
 
 4.  Bottom line would be a more accurate full-text searchable repository.
 
  
 
 Thanks a bunch for the detailed feedback.  We are processing a 1000 
 document test with this new procedure and will let you know how it goes!!
 
 Sue
 
  
 

Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

2009-01-16 Thread Diggory Mark
Actually...

Looking at the code of DSIndexer... I'm sure, written by among  
others... myself.  We find that only Bitstreams within the TEXT  
bundle are actually indexed into Lucene:

    for (int i = 0; i < myBundles.length; i++)
    {
        if ((myBundles[i].getName() != null)
                && myBundles[i].getName().equals("TEXT"))
        {

I think this was short-sighted, and the unhappy consequence is that your
text files will not get indexed if you place them into the CONTENT
Bundle.  There are two solutions:

A.) Put your text bitstreams into the TEXT bundle.  Then you do not have
to worry about them being exposed, because the TEXT bundle is not exposed.

B.) Put your text Bitstreams in the Content Bundle, alter the UI to
hide them, and alter DSIndexer to index the CONTENT bundle.
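
For anyone who wants to see the effect of that check in isolation, here is a self-contained Java sketch. It does not use the DSpace API: the Bundle class below is a hypothetical stand-in holding only a name and a list of bitstream file names, and the bundle and file names in the usage are made up. Only the name test mirrors the DSIndexer excerpt above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BundleSelectionSketch {

    /** Hypothetical stand-in for a DSpace bundle: a name plus bitstream file names. */
    public static final class Bundle {
        private final String name;
        private final List<String> bitstreams;

        public Bundle(String name, String... bitstreams) {
            this.name = name;
            this.bitstreams = Arrays.asList(bitstreams);
        }

        public String getName() {
            return name;
        }

        public List<String> getBitstreams() {
            return bitstreams;
        }
    }

    /** Collects the bitstreams of every bundle whose name equals bundleName. */
    public static List<String> indexableBitstreams(List<Bundle> bundles, String bundleName) {
        List<String> result = new ArrayList<>();
        for (Bundle bundle : bundles) {
            // Same shape as the DSIndexer check: null-safe name comparison.
            if (bundle.getName() != null && bundle.getName().equals(bundleName)) {
                result.addAll(bundle.getBitstreams());
            }
        }
        return result;
    }
}
```

Called with "TEXT", the method ignores a .txt file that sits in any other bundle, which is exactly why solution A works without touching the indexer and solution B needs a code change.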

Mark

 -Original Message-
 From: Tim Donohue [mailto:tdono...@illinois.edu]
 Sent: Thursday, January 15, 2009 11:27 AM
 To: Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]
 Cc: dspace-tech@lists.sourceforge.net; Kimbrough, Glenn W. (LARC-B7)[NCI
 INFORMATION SYSTEMS]; Warren, Douglas Lewis (LARC-B7)[NCI INFORMATION
 SYSTEMS]; Smail, James W. (LARC-B702)[NCI INFORMATION SYSTEMS]
 Subject: Re: [Dspace-tech] DSpace 1.4.2 / 1.5.x filter-media questions

 Sue,

 There were some improvements to 'filter-media' in DSpace 1.5.x.
 Primarily, there's the addition of two new PDF-specific settings in the
 dspace.cfg:

 pdffilter.largepdfs = true
 pdffilter.skiponmemoryexception = true

 The former ensures that all PDF text-extractions are written to
 temporary files during indexing.  This helps avoid OutOfMemoryException
 & Heap space errors that were occasionally caused by larger PDFs being
 loaded into system memory all at once.

 The latter attempts to skip over any PDFs which still cause an
 OutOfMemoryException.  So, if that exception still occurs on a PDF, then
 the PDF is skipped entirely and *not* indexed.  This helps to avoid the
 entire 'filter-media' script crashing when an OutOfMemoryException
 occurs (which used to happen in 1.4.2).

 Despite these changes in 1.5.x, there is NO guarantee that *all* of your
 PDFs will index properly.  As I've mentioned