Hi folks,

Please suggest a solution for importing and indexing PDF files
*incrementally*. My requirement is to pull the PDF files from a remote
network folder. This folder receives a new set of PDF files at certain
intervals (say every 20 seconds) and is emptied each time before the new
set is copied in. I do not want to lose the index built from the earlier
files while doing the next incremental import.
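
From the DIH documentation, my understanding is that the clean parameter
defaults to true on a full-import, which would wipe the earlier index on
every run. Would passing clean=false each time, e.g.

http://localhost:8983/solr/<core>/dataimport?command=full-import&clean=false&commit=true

be the right way to preserve the previously indexed documents? (Host,
port and <core> above are just placeholders for my setup.)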

Currently, I am using Solr 6.6 for this research.

My DataImportHandler (DIH) config currently looks like this:

<!-- Remote Access -->
<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="K2FileEntity" processor="FileListEntityProcessor"
            dataSource="null"
            recursive="true"
            baseDir="\\CLDSINGH02\RemoteFileDepot"
            fileName=".*pdf" rootEntity="false">

      <field column="file" name="id"/>
      <field column="fileSize" name="size"/>
      <field column="fileLastModified" name="lastmodified"/>

      <entity name="pdf" processor="TikaEntityProcessor"
              onError="skip"
              url="${K2FileEntity.fileAbsolutePath}" format="text">

        <field column="title" name="title" meta="true"/>
        <field column="dc:format" name="format" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
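
For the 20-second interval, I am thinking of an external poller that
keeps hitting the DIH endpoint, since as far as I know DIH has no
built-in scheduler. A minimal sketch in Python (host, port and the core
name "pdfcore" are assumptions, adjust to your setup):

import time
import urllib.request

# full-import with clean=false keeps the documents from earlier runs,
# so emptying the source folder does not wipe the index.
DIH_URL = ("http://localhost:8983/solr/pdfcore/dataimport"
           "?command=full-import&clean=false&commit=true")

while True:
    with urllib.request.urlopen(DIH_URL) as resp:
        # print status and the start of the body as a quick sanity check
        print(resp.status, resp.read(200))
    time.sleep(20)  # roughly matches the interval at which new PDFs arrive

Does this approach look reasonable, or is there a cleaner way to handle
the incremental part?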


Kind regards,
Karan Singh
