Hi,

We are currently running DSpace 1.4.2 and are in the process of
upgrading to 1.5.  We have had numerous problems with filter-media:
documents are "unfilterable" because of a variety of errors (including
Java heap space errors), and filter-media takes so long to run that it
impacts Production performance.  We are looking for a better, less
painful way to filter our documents (99% are .pdf documents) and create
the most accurate full-text searchable repository possible in DSpace.
My questions follow:

 

1. Has the filter-media/index-all process changed and/or improved
significantly in DSpace 1.5?  If so, we may just shelve this issue
until we've implemented 1.5.

 

2. In DSpace 1.4.2 (and 1.5), does it matter whether your .txt files
are plain or accessible .txt files?  Can index-all process either type?

 

3. If the process hasn't changed and/or improved significantly in 1.5,
we are considering having our scanning folks create the .txt files
along with the .pdf files at the time the documents are scanned.  When
they send them to us, we would upload them in the import process along
with the .pdf files for each Item.  The only thing we'd really have to
change in our import process is the addition of a second file name in
the "contents" file and the addition of the .txt document in the Item's
import directory (right along with the .pdf file).  We might also have
to make a small modification to DSpace to *not* display the .txt file
on the Item page unless the User is in the Admin interface, since we
wouldn't want our Users clicking on/opening the .txt files.  If we did
this, we could eliminate the filter-media job altogether.  This would
ensure that we did not load any "unfilterable" documents into DSpace.
It would also eliminate the tedious process of identifying which
documents did not filter successfully, and then rescanning and
replacing them in DSpace.
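
To make the import change above concrete, here is a rough sketch of how
the "contents" file could be updated.  It assumes each Item's import
directory holds a "contents" file with one file name per line, and that
the scanned text file shares the .pdf file's base name (report.pdf and
report.txt).  The function name add_text_files and that naming
convention are my own assumptions for illustration, not anything DSpace
provides.

```python
from pathlib import Path


def add_text_files(item_dir):
    """Append the matching .txt name for each .pdf listed in an item's contents file.

    Returns the list of .txt names that were added.
    """
    item_dir = Path(item_dir)
    contents = item_dir / "contents"
    # Assumption: one file name per line; anything after a tab is left alone.
    lines = contents.read_text().splitlines()
    listed = {line.split("\t")[0].strip() for line in lines if line.strip()}

    added = []
    for name in sorted(listed):
        if not name.lower().endswith(".pdf"):
            continue
        txt_name = name[:-4] + ".txt"
        # Only list the .txt if the scanning staff actually supplied it.
        if txt_name not in listed and (item_dir / txt_name).is_file():
            added.append(txt_name)

    if added:
        contents.write_text("\n".join(lines + added) + "\n")
    return added
```

Running this over every Item directory before the import would list any
supplied .txt file next to its .pdf, and running it a second time
changes nothing.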

 

What we need to know is whether this plan sounds feasible, or whether
it could be problematic in a way we haven't considered.

 

Thanks in advance,

Sue

 

 

Sue Walker-Thornton
ConITS Contract
NASA Langley Research Center
Integrated Library Systems Application & Database Administrator
130 Research Drive
Hampton, VA  23666
Office: (757) 224-4074
Fax:    (757) 224-4001
Pager:  (757) 988-2547
Email:  susan.m.thorn...@nasa.gov

 
