[magnolia-user] PDF full-text indexing/searching

Pietro Pagani (via Magnolia Forums) Wed, 11 Jun 2014 01:09:13 -0700

Hi, I'm having problems with full-text search in Magnolia 5.2.5 EE. 
It works properly for Word files (.doc, .docx), but I can't search the 
content of PDF files.


Here is my test case.
I have uploaded a DOC file and a PDF file, both containing the keyword "worldcup".
Using the following FTL expression:
[#assign results_dam = cmsfn.simpleSearch("dam", "worldcup", "mgnl:asset", "/") /]
only the .doc file is returned.
I have also tried running the following query directly:
select * from [nt:base] as t where ISDESCENDANTNODE([/]) AND contains(t.*, 'worldcup')
but still the PDF file is not returned.
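
For reference, the statement above can be assembled in Java as sketched below. This is only an illustration under my own assumptions: the class and method names (FullTextQuery, build, quote) are hypothetical and not part of Magnolia or Jackrabbit, and the code only builds the query string, it does not execute it.

```java
public final class FullTextQuery {
    private FullTextQuery() {}

    /** Wraps the term in single quotes, doubling embedded quotes as JCR-SQL2 string literals require. */
    public static String quote(String term) {
        return "'" + term.replace("'", "''") + "'";
    }

    /** Builds the full-text statement from the post, with node type, root path and term as parameters. */
    public static String build(String nodeType, String rootPath, String term) {
        return "select * from [" + nodeType + "] as t"
                + " where ISDESCENDANTNODE([" + rootPath + "])"
                + " AND contains(t.*, " + quote(term) + ")";
    }

    public static void main(String[] args) {
        // Reproduces the query from the test case.
        System.out.println(build("nt:base", "/", "worldcup"));
    }
}
```

The resulting string would then be passed to the standard JCR call QueryManager.createQuery(statement, Query.JCR_SQL2) on a session for the "dam" workspace. Swapping "nt:base" for "mgnl:asset" should narrow the query to the same node type used in the FTL call.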

What could be the reason? Is there some required configuration that is not 
included in a standard installation?

I have tried modifying the Jackrabbit configuration file (in my local dev 
environment this is jackrabbit-bundle-derby-search.xml), adding the following 
configuration to <SearchIndex>:

[code]
...
<param name="textFilterClasses" 
value="org.apache.jackrabbit.extractor.PlainTextExtractor,
              org.apache.jackrabbit.extractor.MsWordTextExtractor,
              org.apache.jackrabbit.extractor.MsExcelTextExtractor,
              org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
              org.apache.jackrabbit.extractor.PdfTextExtractor,
              org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
              org.apache.jackrabbit.extractor.RTFTextExtractor,
              org.apache.jackrabbit.extractor.HTMLTextExtractor,
              org.apache.jackrabbit.extractor.XMLTextExtractor"/>
...
[/code]
                  
but I found the following warning in the log file during Magnolia startup:

[i]WARN  rg.apache.jackrabbit.core.query.lucene.SearchIndex: 
The textFilterClasses configuration parameter has been deprecated, and the 
configured value will be ignored: 
org.apache.jackrabbit.extractor.PlainTextExtractor, 
org.apache.jackrabbit.extractor.MsWordTextExtractor,org.apache.jackrabbit.extractor.MsExcelTextExtractor,org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,org.apache.jackrabbit.extractor.RTFTextExtractor,org.apache.jackrabbit.extractor.HTMLTextExtractor,org.apache.jackrabbit.extractor.XMLTextExtractor[/i]


Thanks,
Pietro

-- 
Context is everything: 
http://forum.magnolia-cms.com/forum/thread.html?threadId=6411247e-6641-49d6-8fac-d84b171a91af

