Hi,
I have spent a lot of time recently working on the filter-media
cron job, specifically on the errors that occur when it encounters a
document that cannot be filtered for some reason. It seems that there
are several different reasons why filter-media fails:
1. The document is very large and a "Java Heap Space" error
occurs. In the original version of filter-media in DSpace 1.4.2 (I'm
not sure about 1.5), this error causes the whole process to fail.
Someone recently and kindly emailed me some code changes for
MediaFilterManager.java that allow the process to continue when this
error occurs instead of failing; a rough sketch of the idea appears
after this list.
2. The document is very large, but the process does not fail.
Instead, a blank .txt file is created.
3. The document contains unreadable characters and cannot be
filtered. These documents are "skipped".
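For what it's worth, the change I received was along these lines. This is
only a minimal sketch of the idea, not the actual patch or the real
MediaFilterManager.java code; processOne() and the bitstream list are
stand-ins I made up for illustration:

    // Standalone sketch: keep the filter-media loop going when one
    // bitstream blows the heap. processOne() stands in for the real
    // per-bitstream filter call; all names here are hypothetical.
    public class FilterLoopSketch
    {
        public static void main(String[] args)
        {
            String[] bitstreams = { "doc1.pdf", "huge.pdf", "doc3.pdf" };
            for (String name : bitstreams)
            {
                try
                {
                    processOne(name);
                }
                catch (OutOfMemoryError oome)
                {
                    // Catching an Error is normally bad practice, but here
                    // it lets the cron job finish the rest of the queue.
                    System.err.println("Heap exhausted on " + name
                        + "; skipping.");
                }
                catch (Exception e)
                {
                    System.err.println("Failed to filter " + name + ": " + e);
                }
            }
        }

        // Stand-in for the real media filter; throws to simulate failure.
        private static void processOne(String name) throws Exception
        {
            if (name.startsWith("huge"))
            {
                throw new OutOfMemoryError("Java heap space");
            }
            System.out.println("Filtered " + name);
        }
    }

One caveat: an OutOfMemoryError can leave the JVM in a bad state, so
skipping and continuing is a best-effort recovery, not a guarantee.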
I have come up with a temporary workaround for the continual
failure of filter-media, in addition to the fix offered in the email I
received from the list. While that fix allows the filtering process to
continue past an error, my solution goes a step further: it does not even
attempt to filter a document that a previous run has already tried and
found to be "non-filterable".

I've accomplished this by creating a new table in PostgreSQL to hold
the bitstream_id of each document that filter-media attempts to filter
and cannot, for whatever reason. On an initial run with this
modification in place, MediaFilterManager.java writes a row to this
table for each document it fails to filter. On subsequent runs, before
filtering a document, it checks whether that bitstream_id exists in the
table; if so, no filtering attempt is made, and if not, the program
continues as normal. After every run of filter-media, a query lists the
handles of the documents that are currently not filterable. The scanning
department then rescans and replaces those documents in LDR, in the hope
that the newly scanned copies can be filtered successfully. Once a
document is replaced in LDR, it has a new bitstream_id, and filter-media
will attempt to filter it on the next run. A rough sketch of the table
and the skip check follows below.
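In case it helps anyone picture it, here is roughly what the pieces look
like. The table and column names (filter_failures, bitstream_id,
failed_date) are just my local choices, and the plain JDBC plumbing is
simplified for illustration; the real code goes through DSpace's
database layer:

    // One-time setup, run against the DSpace PostgreSQL database:
    //   CREATE TABLE filter_failures (
    //       bitstream_id INTEGER PRIMARY KEY,
    //       failed_date  TIMESTAMP DEFAULT now()
    //   );
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class FilterFailures
    {
        private final Connection conn;

        public FilterFailures(Connection conn)
        {
            this.conn = conn;
        }

        // Called before filtering: skip the bitstream if a previous
        // run already found it non-filterable.
        public boolean previouslyFailed(int bitstreamId) throws SQLException
        {
            PreparedStatement ps = conn.prepareStatement(
                "SELECT 1 FROM filter_failures WHERE bitstream_id = ?");
            ps.setInt(1, bitstreamId);
            ResultSet rs = ps.executeQuery();
            boolean failed = rs.next();
            rs.close();
            ps.close();
            return failed;
        }

        // Called when a filtering attempt fails, for whatever reason.
        public void recordFailure(int bitstreamId) throws SQLException
        {
            PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO filter_failures (bitstream_id) VALUES (?)");
            ps.setInt(1, bitstreamId);
            ps.executeUpdate();
            ps.close();
        }

        // The post-run report is then a join from filter_failures back
        // to the handle table, something along these lines (simplified;
        // the exact joins depend on your schema version):
        //   SELECT h.handle
        //     FROM filter_failures f
        //     JOIN bundle2bitstream b2b ON b2b.bitstream_id = f.bitstream_id
        //     JOIN item2bundle i2b ON i2b.bundle_id = b2b.bundle_id
        //     JOIN handle h ON h.resource_id = i2b.item_id
        //                  AND h.resource_type_id = 2;
    }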
My question is this: does anyone know whether there is a size limit
or cutoff point beyond which a document is simply too large to be
filtered? If there is, then I have no idea what to do about these
documents. The largest document in our repository is 1,862,628,176
bytes, or about 1.9 GB. I don't suppose a .zip file can be filtered? :-)
Thanks in advance,
Best,
Sue
Sue Walker-Thornton
ConITS Contract
NASA Langley Research Center
Integrated Library Systems Application & Database Administrator
130 Research Drive
Hampton, VA 23666
Office: (757) 224-4074
Fax: (757) 224-4001
Pager: (757) 988-2547
Email: [EMAIL PROTECTED]