Hi,

We were on Solr 5.2.1, using TikaEntityProcessor to index PDF documents through 
DIH, and it was working fine. The jars were tika-core-1.4.jar and 
tika-parsers-1.4.jar.

Below is my schema.xml (P.S. all field types have been defined):

<fields>
   <field name="title" type="string" indexed="true" stored="true"/>
   <field name="author" type="string" indexed="true" stored="true" />
   <field name="text" type="exact_text" indexed="true" stored="true" />
   <field name="path" type="string" indexed="true" stored="true" />
   <field name="size" type="string" indexed="true" stored="true" />
   <field name="lastmodified" type="string" indexed="true" stored="true" />
   <field name="fileName" type="string" indexed="true" stored="true" />
</fields>

And my tika-data-config.xml:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<dataConfig>
    <dataSource type="BinFileDataSource"/>
    <document>
        <entity name="f" processor="FileListEntityProcessor" dataSource="null"
                baseDir="C:\help"
                fileName=".*\.(PDF)|(pdf)|(doc)|(docx)|(DOC)|(DOCX)|(txt)|(ppt)|(xls)|(csv)"
                recursive="true" rootEntity="false" onError="skip">
            <field column="fileAbsolutePath" name="path"/>
            <field column="fileSize" name="size"/>
            <field column="fileLastModified" name="lastmodified"/>
            <field column="file" name="fileName"/>
            <entity name="tika-test" processor="TikaEntityProcessor"
                    url="${f.fileAbsolutePath}" format="text">
                <field column="Author" meta="true" name="author"/>
                <field column="title" meta="true" name="title"/>
                <field column="text" name="text"/>
            </entity>
        </entity>
    </document>
</dataConfig>
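
A side note on the fileName pattern in the config above: because the alternation is not grouped, the `.*\.` prefix binds only to `(PDF)`, and every other alternative is a bare substring. The standalone sketch below compares it with a grouped variant. Two things here are assumptions and not taken from the config: that the pattern is applied as a partial (find-style) match, and the GROUPED pattern itself, which is only a hypothetical alternative. The class and method names are made up for illustration. This probably does not explain the empty text field, but it may affect which files get picked up.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: compares two fileName patterns.
class FileNamePatternCheck {
    // The pattern exactly as it appears in tika-data-config.xml.
    static final String ORIGINAL =
        ".*\\.(PDF)|(pdf)|(doc)|(docx)|(DOC)|(DOCX)|(txt)|(ppt)|(xls)|(csv)";

    // Assumed alternative: one group after the dot, anchored at the end.
    static final String GROUPED =
        ".*\\.(PDF|pdf|doc|docx|DOC|DOCX|txt|ppt|xls|csv)$";

    // Assumption: the pattern is applied as a partial match (find),
    // so any alternative may match anywhere in the file name.
    static boolean found(String regex, String fileName) {
        return Pattern.compile(regex).matcher(fileName).find();
    }

    public static void main(String[] args) {
        String[] names = {"guide.pdf", "report.docx", "docker.log", "a.pdf.bak"};
        for (String name : names) {
            System.out.printf("%-12s original=%b grouped=%b%n",
                name, found(ORIGINAL, name), found(GROUPED, name));
        }
    }
}
```

Under those assumptions, names such as docker.log and a.pdf.bak satisfy the original pattern (through the bare `doc` and `pdf` alternatives) but not the grouped one, while guide.pdf and report.docx satisfy both.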

We have now upgraded to Solr 8.4.1. When I put the above jars in place and run 
the index, only the following fields get indexed:

{
        "fileName":"01 - System-Wide Functions.pdf",
        "size":"2524884",
        "lastmodified":"Mon Jul 15 06:26:52 UTC 2019",
        "path":"D:\\tssindex\\server\\solr\\help\\help\\01 - System-Wide Functions.pdf",
        "text":"",
        "_version_":1664474933885927424},
{

As you can see, the text field is empty and the author and title fields are not 
indexed at all, so searches on the text field do not return these documents.

Please help me in this regard.


Thanks,
Srinivas

