Dear all

We're running DSpace 6.0 on Windows Server 2012 R2 and currently store 
over 300'000 items in our repository (each item has two bitstreams: a 
PDF document and a digital timestamp).

   DSpace version:  6.0
     SCM revision:  0fea17436854acf9048b0e11fbf988333ea02956
       SCM branch:  UNKNOWN
               OS:  Windows Server 2012 R2(amd64) version 6.3
        Discovery:  enabled.
              JRE:  Oracle Corporation version 1.8.0_131
      Ant version:  Apache Ant(TM) version 1.10.0 compiled on December 27 2016
    Maven version:  3.3.9
      DSpace home:  D:\dspace

We also use Apache Solr for full-text search, so we run a filter-media 
batch job every night. The log file of this job reports that the 
respective text files and thumbnail images are created from the PDF 
documents (as seen below).

   FILTERED: bitstream 13e851e4-59e1-46e1-98a6-27afad4c7382 (item: archivsuisse/92351) and created '4403_Y_20170721T122639.PDF.txt'
   File: 4403_Y_20170721T122639.PDF.jpg
   FILTERED: bitstream 13e851e4-59e1-46e1-98a6-27afad4c7382 (item: archivsuisse/92351) and created '4403_Y_20170721T122639.PDF.jpg'
   File: 4406_B_20170721T121321.PDF.txt
   FILTERED: bitstream 202d2e4e-33e6-47af-baba-41c1b58a938c (item: archivsuisse/92359) and created '4406_B_20170721T121321.PDF.txt'
   File: 4406_B_20170721T121321.PDF.jpg
   FILTERED: bitstream 202d2e4e-33e6-47af-baba-41c1b58a938c (item: archivsuisse/92359) and created '4406_B_20170721T121321.PDF.jpg'
   File: 4407_Y_20170721T121334.PDF.txt
   FILTERED: bitstream 17fdd605-3a14-49de-882c-e38c9d6b57b0 (item: archivsuisse/92361) and created '4407_Y_20170721T121334.PDF.txt'
   File: 4407_Y_20170721T121334.PDF.jpg

But when I check the processed items in DSpace, no text file or thumbnail 
image has actually been created, and the full text of the PDF documents 
never gets indexed by Solr. Only a fraction of the items have a generated 
text document. No new text documents have been added to the DSpace items 
for a few months, and the numbers in the Discovery widget of the DSpace 
JSPUI haven't changed in a while.

   Discover
   Has File(s)
   333641 false
   8970 true

When I try to filter a single item with the following command

   dspace filter-media -m 1 -v
   
the process first skips the items that have already been filtered, then 
stops at the first item that still needs filtering and never finishes.

   D:\dspace\bin>dspace filter-media -m 1 -v
   Using DSpace installation in: D:\dspace
   Invalid maximum value '1' - ignoring
   The following MediaFilters are enabled:
   Full Filter Name: org.dspace.app.mediafilter.WordFilter
   org.dspace.app.mediafilter.WordFilter
   Full Filter Name: org.dspace.app.mediafilter.JPEGFilter
   org.dspace.app.mediafilter.JPEGFilter
   Full Filter Name: org.dspace.app.mediafilter.PowerPointFilter
   org.dspace.app.mediafilter.PowerPointFilter
   Full Filter Name: org.dspace.app.mediafilter.HTMLFilter
   org.dspace.app.mediafilter.HTMLFilter
   Full Filter Name: org.dspace.app.mediafilter.ExcelFilter
   org.dspace.app.mediafilter.ExcelFilter
   Full Filter Name: org.dspace.app.mediafilter.PDFFilter
   org.dspace.app.mediafilter.PDFFilter
   Full Filter Name: org.dspace.app.mediafilter.PDFBoxThumbnail
   org.dspace.app.mediafilter.PDFBoxThumbnail
   SKIPPED: bitstream 170bc089-2aba-4f38-8f5f-44cfb91092c6 (item: 123456789/72) because 'py-tutorial-de.pdf.txt' already exists
   SKIPPED: bitstream 170bc089-2aba-4f38-8f5f-44cfb91092c6 (item: 123456789/72) because 'py-tutorial-de.pdf.jpg' already exists
   SKIPPED: bitstream b56999cd-ef41-4bbf-a18a-6abf0d2be4a7 (item: 123456789/95) because '200001_gelb.pdf.txt' already exists
   SKIPPED: bitstream b56999cd-ef41-4bbf-a18a-6abf0d2be4a7 (item: 123456789/95) because '200001_gelb.pdf.jpg' already exists
   SKIPPED: bitstream cc04eb6e-5860-4359-8374-dc751ba07ed2 (item: 123456789/96) because '200001_grün.pdf.txt' already exists
   SKIPPED: bitstream cc04eb6e-5860-4359-8374-dc751ba07ed2 (item: 123456789/96) because '200001_grün.pdf.jpg' already exists
   SKIPPED: bitstream 833ee548-76e9-4ef3-9f52-dbffc3a28df3 (item: 123456789/97) because '200001_rosa.pdf.txt' already exists
   SKIPPED: bitstream 833ee548-76e9-4ef3-9f52-dbffc3a28df3 (item: 123456789/97) because '200001_rosa.pdf.jpg' already exists
   SKIPPED: bitstream d6fc9e85-a85d-427e-9907-ba3aecff34fc (item: 123456789/98) because '200001_rot.pdf.txt' already exists
   SKIPPED: bitstream d6fc9e85-a85d-427e-9907-ba3aecff34fc (item: 123456789/98) because '200001_rot.pdf.jpg' already exists
   PROCESSING: bitstream e7d9a3dd-494d-461c-9320-513f8a68f773 (item: archivsuisse/51880)
   File: 96020_P_20170707T142205.PDF.txt
   FILTERED: bitstream e7d9a3dd-494d-461c-9320-513f8a68f773 (item: archivsuisse/51880) and created '96020_P_20170707T142205.PDF.txt'
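
To pinpoint which bitstreams the run stops on, a small script along these 
lines might help. It is only a rough sketch in Python (not part of DSpace): 
it assumes the log uses exactly the FILTERED / SKIPPED / PROCESSING line 
format shown above, and that every PDF should end up with both a .txt and 
a .jpg derivative. The function names and the "expected" parameter are made 
up for illustration. It also tolerates status lines that a mail client or 
console has wrapped after "(item:".

```python
import re
from collections import defaultdict

# Status lines as they appear in the filter-media output quoted above.
STATUS_RE = re.compile(
    r"(FILTERED|SKIPPED|PROCESSING): bitstream (\S+) \(item: ([^)]+)\)"
    r"(?: (?:and created|because) '([^']+)')?"
)


def unwrap(text):
    """Re-join status lines that were wrapped right after '(item:'."""
    return re.sub(r"\(item:\s*\n\s*", "(item: ", text)


def derivatives_by_bitstream(text):
    """Map bitstream UUID -> set of derivative suffixes (e.g. 'txt', 'jpg')
    that the log reports as FILTERED or SKIPPED (i.e. present)."""
    seen = defaultdict(set)
    for status, uuid, _item, name in STATUS_RE.findall(unwrap(text)):
        seen[uuid]  # register the bitstream even if only PROCESSING was logged
        if status in ("FILTERED", "SKIPPED") and name:
            seen[uuid].add(name.rsplit(".", 1)[-1].lower())
    return dict(seen)


def missing_derivatives(text, expected=("txt", "jpg")):
    """Return {uuid: [missing suffixes]} for bitstreams lacking a derivative.
    'expected' is an assumption about this repository, not a DSpace rule."""
    result = {}
    for uuid, suffixes in derivatives_by_bitstream(text).items():
        missing = sorted(set(expected) - suffixes)
        if missing:
            result[uuid] = missing
    return result
```

Calling missing_derivatives() on the text of a saved log would list the 
bitstreams for which no .txt or .jpg was reported. A missing entry does 
not prove a hang by itself, it only narrows down where to look.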

I also checked the DSpace log files but found no warnings or errors. I 
don't know what caused DSpace to stop filtering our PDF documents or what 
can be done to fix the problem. Are there any other log files I should 
check?

Any help is greatly appreciated.

Heinz Gnehm
archivsuisse AG
Bernstrasse 23
3122 Kehrsatz
Switzerland
