Re: [dspace-tech] No extracted text from PDFs

admin Sat, 13 Apr 2019 01:03:04 -0700

Hi Tim, I use DSpace 6.3.

Sorry for the late response, but I was testing my website: when I removed the 
ImageMagick Image Thumbnail and ImageMagick PDF Thumbnail filters and went 
back to the default DSpace ones, text extraction started working again.
Maybe something was set up incorrectly (apart from those two ImageMagick 
filters, I had additional settings enabled: 
org.dspace.app.mediafilter.ImageMagickThumbnailFilter.flatten, 
org.dspace.app.mediafilter.ImageMagickThumbnailFilter.cmyk_profile, and 
org.dspace.app.mediafilter.ImageMagickThumbnailFilter.srgb_profile, with 
corresponding values).
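
For anyone comparing configurations, here is a minimal sketch in Python of how 
one might list the filters named in filter.plugins before and after such a 
change. It is not part of DSpace: the function name enabled_filters and the 
sample excerpt are hypothetical, and it only assumes the usual properties-file 
conventions (comments starting with #, a trailing backslash continuing a line).

```python
def enabled_filters(cfg_text):
    """Return the plugin names listed in filter.plugins, in order.

    Hypothetical helper: it reads configuration text only and does not
    call DSpace in any way.
    """
    logical_lines = []
    buffer = ""
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not buffer and (not line or line.startswith("#")):
            continue  # skip blank lines and comments outside a continuation
        if line.endswith("\\"):
            buffer += line[:-1]  # trailing backslash continues the line
            continue
        logical_lines.append(buffer + line)
        buffer = ""
    if buffer:
        logical_lines.append(buffer)

    names = []
    for entry in logical_lines:
        key, sep, value = entry.partition("=")
        if sep and key.strip() == "filter.plugins":
            # a later definition replaces an earlier one
            names = [n.strip() for n in value.split(",") if n.strip()]
    return names


# Hypothetical excerpt, not a complete DSpace configuration.
SAMPLE_CFG = """\
# media filters enabled on this (imaginary) site
filter.plugins = PDF Text Extractor, \\
                 ImageMagick Image Thumbnail, \\
                 ImageMagick PDF Thumbnail
"""

if __name__ == "__main__":
    filters = enabled_filters(SAMPLE_CFG)
    print("PDF Text Extractor" in filters)
```

Running the same check on the old and the new configuration would show 
whether PDF Text Extractor was listed in both, which helps separate a missing 
filter from a filter that is enabled but failing.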


Anyway, the default filters fit the bill.


Peter

On Friday, 29 March 2019 at 19:14:45 UTC, Tim Donohue wrote:
>
> Hi Peter,
>
> You didn't mention what version of DSpace you are using.  But, our 
> documentation for the Media Filters (which index all files) is available 
> at: 
> https://wiki.duraspace.org/display/DSDOC6x/Mediafilters+for+Transforming+DSpace+Content
>
> The PDF text extractor does not require an extra plugin, it just needs to 
> be enabled.  However, the PDF text extractor is a third party library, and 
> it won't work 100% of the time. I've found it can have problems (at times) 
> with large or complex PDFs and some PDFs with OCR.  
>
> You should also look at the output of the filter-media command to see if it 
> is processing the PDFs, and whether it is throwing any errors.  If the 
> "filter-media" command is having difficulty with a single file, it might 
> throw an error and exit unexpectedly (which could mean it won't even 
> process other files).
>
> You also may want to check the dspace.log files immediately after running 
> filter-media to see if any errors are being thrown there.
>
> In any case, my suspicion is there may be an error occurring here that is 
> causing the Media Filters to not work properly.  If you find more 
> information, feel free to send it to this list and we can try to help.
>
> Tim
>
> On Wed, Mar 27, 2019 at 5:05 AM admin <ad...@ispan.waw.pl> wrote:
>
>> Hi,
>>
>> I noticed that for some time my new items have not had an extracted text 
>> bitstream (earlier items have this file, but I might have reconfigured 
>> something since then).
>> They are indexed, because I can find them by searching for the title (which 
>> is part of the metadata), but the full text is not available for search.
>>
>> These are editable PDFs and I have filter.plugins = PDF Text Extractor 
>> enabled. I also run filter-media every day. I tried filter-media -f but it 
>> didn't help.
>>
>> I suppose some plugin is missing, or some setting for an enabled plugin 
>> is invalid?
>>
>>
>> Best, Peter
>>
>> -- 
>> All messages to this mailing list should adhere to the DuraSpace Code of 
>> Conduct: https://duraspace.org/about/policies/code-of-conduct/
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "DSpace Technical Support" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to dspac...@googlegroups.com.
>> To post to this group, send email to dspac...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/dspace-tech.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
> -- 
>
> Tim Donohue
> Technical Lead for DSpace & DSpaceDirect
> DuraSpace.org | DSpace.org | DSpaceDirect.org
>
>

