Hi all, I'm new to Solr. I recently downloaded Solr 8.0.0 and have been following the tutorials. Using the two example instances created, I'm trying to create my own collection. I made a copy of the _default configset and used it to create my collection.
In my case, the files I want to index are PDF files composed of images. I have Tesseract installed, and I can parse the PDF files correctly using a Tika server instance I downloaded, i.e. I can get the extracted text from the images.

I'm following the instructions on the page "Uploading Data with Solr Cell Using Apache Tika" to properly configure PDF image extraction, but I haven't been able to get it working. My aim is for the content of the PDF file to go into a field named content that I've created in my schema. In my attempts, this field is either missing or, when it exists, does not contain the expected text from the parsed images.

The lib clauses are present in my solrconfig.xml, and the ExtractingRequestHandler section is as below:

    <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
      <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="fmap.content">content</str>
      </lst>
      <str name="parseContext.config">parseContext.xml</str>
    </requestHandler>

And my parseContext.xml file is:

    <?xml version="1.0" encoding="UTF-8" ?>
    <entries>
      <entry class="org.apache.tika.parser.pdf.PDFParserConfig" impl="org.apache.tika.parser.pdf.PDFParserConfig">
        <property name="extractInlineImages" value="true" />
      </entry>
    </entries>

Any help on how to correctly extract the text from the PDF images would be great.

Thanks,
Miguel