Dear all,
We're running DSpace 6.0 on Windows Server 2012 R2 and currently store
over 300,000 items in our repository (each item has two bitstreams: a
PDF document and a digital timestamp).
DSpace version: 6.0
SCM revision: 0fea17436854acf9048b0e11fbf988333ea02956
SCM branch: UNKNOWN
OS: Windows Server 2012 R2(amd64) version 6.3
Discovery: enabled.
JRE: Oracle Corporation version 1.8.0_131
Ant version: Apache Ant(TM) version 1.10.0 compiled on December 27 2016
Maven version: 3.3.9
DSpace home: D:\dspace
We also use Apache Solr for full-text search, so a nightly batch job runs
the filter-media command. The log file of this job reports that the text
files and thumbnail images are created from the PDF documents (as seen
below).
FILTERED: bitstream 13e851e4-59e1-46e1-98a6-27afad4c7382 (item:
archivsuisse/92351) and created '4403_Y_20170721T122639.PDF.txt'
File: 4403_Y_20170721T122639.PDF.jpg
FILTERED: bitstream 13e851e4-59e1-46e1-98a6-27afad4c7382 (item:
archivsuisse/92351) and created '4403_Y_20170721T122639.PDF.jpg'
File: 4406_B_20170721T121321.PDF.txt
FILTERED: bitstream 202d2e4e-33e6-47af-baba-41c1b58a938c (item:
archivsuisse/92359) and created '4406_B_20170721T121321.PDF.txt'
File: 4406_B_20170721T121321.PDF.jpg
FILTERED: bitstream 202d2e4e-33e6-47af-baba-41c1b58a938c (item:
archivsuisse/92359) and created '4406_B_20170721T121321.PDF.jpg'
File: 4407_Y_20170721T121334.PDF.txt
FILTERED: bitstream 17fdd605-3a14-49de-882c-e38c9d6b57b0 (item:
archivsuisse/92361) and created '4407_Y_20170721T121334.PDF.txt'
File: 4407_Y_20170721T121334.PDF.jpg
But when I check the processed items in DSpace, no text file or thumbnail
image has been created, and the full text of the PDF documents never gets
indexed by Apache Solr. Only a fraction of the items have a generated text
document. For a few months no new text documents have been added to the
DSpace items, and the numbers in the Discovery widget in the DSpace JSPUI
have not changed:
Discover
Has File(s)
333641 false
8970 true
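To compare the nightly log with what Discovery shows, the status lines in
the filter-media output can be tallied per item. The following is only a
minimal Python sketch, assuming the log uses the FILTERED / SKIPPED /
PROCESSING line format shown in this message. The function name summarize
and the log file name in the usage note are hypothetical and not part of
DSpace.

```python
import re
from collections import Counter

# Matches the status lines of the filter-media output quoted in this message.
# Assumption: every status line has the form
#   "<STATUS>: bitstream <uuid> (item: <handle>) ..."
STATUS_RE = re.compile(
    r"(FILTERED|SKIPPED|PROCESSING): bitstream (\S+) \(item: (\S+?)\)"
)


def summarize(log_text):
    """Return (counts per status, set of item handles per status)."""
    # Collapse all whitespace so that wrapped lines are matched as well.
    flat = " ".join(log_text.split())
    counts = Counter()
    items = {}
    for status, _bitstream, handle in STATUS_RE.findall(flat):
        counts[status] += 1
        items.setdefault(status, set()).add(handle)
    return counts, items
```

For example, counts, items = summarize(open("filter-media.log",
encoding="utf-8").read()) would give the number of FILTERED lines and the
set of item handles they refer to, which can then be checked by hand against
the bitstreams of those items in the repository.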
When I try to filter a single item with the following command
dspace filter-media -m 1 -v
the process first skips the items that have already been filtered, then
stops at the first item that still needs filtering and never finishes. The
output also reports that the maximum value '1' is invalid and is ignored.
D:\dspace\bin>dspace filter-media -m 1 -v
Using DSpace installation in: D:\dspace
Invalid maximum value '1' - ignoring
The following MediaFilters are enabled:
Full Filter Name: org.dspace.app.mediafilter.WordFilter
org.dspace.app.mediafilter.WordFilter
Full Filter Name: org.dspace.app.mediafilter.JPEGFilter
org.dspace.app.mediafilter.JPEGFilter
Full Filter Name: org.dspace.app.mediafilter.PowerPointFilter
org.dspace.app.mediafilter.PowerPointFilter
Full Filter Name: org.dspace.app.mediafilter.HTMLFilter
org.dspace.app.mediafilter.HTMLFilter
Full Filter Name: org.dspace.app.mediafilter.ExcelFilter
org.dspace.app.mediafilter.ExcelFilter
Full Filter Name: org.dspace.app.mediafilter.PDFFilter
org.dspace.app.mediafilter.PDFFilter
Full Filter Name: org.dspace.app.mediafilter.PDFBoxThumbnail
org.dspace.app.mediafilter.PDFBoxThumbnail
SKIPPED: bitstream 170bc089-2aba-4f38-8f5f-44cfb91092c6 (item:
123456789/72) because 'py-tutorial-de.pdf.txt' already exists
SKIPPED: bitstream 170bc089-2aba-4f38-8f5f-44cfb91092c6 (item:
123456789/72) because 'py-tutorial-de.pdf.jpg' already exists
SKIPPED: bitstream b56999cd-ef41-4bbf-a18a-6abf0d2be4a7 (item:
123456789/95) because '200001_gelb.pdf.txt' already exists
SKIPPED: bitstream b56999cd-ef41-4bbf-a18a-6abf0d2be4a7 (item:
123456789/95) because '200001_gelb.pdf.jpg' already exists
SKIPPED: bitstream cc04eb6e-5860-4359-8374-dc751ba07ed2 (item:
123456789/96) because '200001_grün.pdf.txt' already exists
SKIPPED: bitstream cc04eb6e-5860-4359-8374-dc751ba07ed2 (item:
123456789/96) because '200001_grün.pdf.jpg' already exists
SKIPPED: bitstream 833ee548-76e9-4ef3-9f52-dbffc3a28df3 (item:
123456789/97) because '200001_rosa.pdf.txt' already exists
SKIPPED: bitstream 833ee548-76e9-4ef3-9f52-dbffc3a28df3 (item:
123456789/97) because '200001_rosa.pdf.jpg' already exists
SKIPPED: bitstream d6fc9e85-a85d-427e-9907-ba3aecff34fc (item:
123456789/98) because '200001_rot.pdf.txt' already exists
SKIPPED: bitstream d6fc9e85-a85d-427e-9907-ba3aecff34fc (item:
123456789/98) because '200001_rot.pdf.jpg' already exists
PROCESSING: bitstream e7d9a3dd-494d-461c-9320-513f8a68f773 (item:
archivsuisse/51880)
File: 96020_P_20170707T142205.PDF.txt
FILTERED: bitstream e7d9a3dd-494d-461c-9320-513f8a68f773 (item:
archivsuisse/51880) and created '96020_P_20170707T142205.PDF.txt'
I also checked the DSpace log files but found no warnings or errors. I
don't know what caused DSpace to stop filtering our PDF documents or what
can be done to fix the problem. Are there any other log files I can check?
Any help is greatly appreciated.
Heinz Gnehm
archivsuisse AG
Bernstrasse 23
3122 Kehrsatz
Switzerland