Can't index all docs in a local folder with DIH in Solr 5.0.0

Gary Taylor Wed, 25 Feb 2015 08:16:53 -0800

I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly add a Solr document for each epub file in my local directory.

I've just downloaded Solr 5.0.0, on a Windows 7 PC. I ran "solr start" and then "solr create -c hn2" to create a new core.

I want to index a load of epub files that I've got in a directory. So I created a data-import.xml (in solr\hn2\conf):


<dataConfig>
    <dataSource type="BinFileDataSource" name="bin" />
    <document>
        <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
            onError="skip"
            recursive="true">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity name="documentImport" processor="TikaEntityProcessor"
                url="${files.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
                <field column="file" name="fileName"/>
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="content"/>
            </entity>
        </entity>
    </document>
</dataConfig>

In my solrconfig.xml, I added a requestHandler entry to reference my data-import.xml:

  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
          <str name="config">data-import.xml</str>
      </lst>
  </requestHandler>

I renamed managed-schema to schema.xml, and ensured the following doc fields were set up:

      <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
      <field name="fileName" type="string" indexed="true" stored="true" />
      <field name="author" type="string" indexed="true" stored="true" />
      <field name="title" type="string" indexed="true" stored="true" />
      <field name="size" type="long" indexed="true" stored="true" />
      <field name="lastModified" type="date" indexed="true" stored="true" />
      <field name="content" type="text_en" indexed="false" stored="true" multiValued="false"/>
      <field name="text" type="text_en" indexed="true" stored="false" multiValued="true"/>

      <copyField source="content" dest="text"/>

I copied all the jars from dist and contrib\* into server\solr\lib.

Stopping and restarting Solr then creates a new managed-schema file and renames schema.xml to schema.xml.back.


All good so far.

Now I go to the web admin for dataimport (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to execute a full import.

But the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed: 1", i.e. it only adds one document (the very first one) even though it has iterated over 58!
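
To confirm independently of Solr that 58 is the expected number of files, the directory can be walked with a short script. This is only a minimal sketch, assuming Python 3 is available; the function name `count_matching_files` is hypothetical, and using `re.search` (a match anywhere in the file name) is an assumption about how the `fileName` pattern is applied.

```python
import os
import re


def count_matching_files(base_dir, pattern=r".*epub"):
    """Recursively count files under base_dir whose name matches pattern.

    The default pattern mirrors the fileName attribute in data-import.xml.
    Matching anywhere in the name (re.search) is an assumption.
    """
    name_re = re.compile(pattern)
    count = 0
    # os.walk descends into subdirectories, like recursive="true" in the config.
    for _root, _dirs, files in os.walk(base_dir):
        count += sum(1 for name in files if name_re.search(name))
    return count
```

For example, `count_matching_files("c:/Users/gt/Documents/epub")` should agree with the "Fetched" figure if the file listing itself is working as intended.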


No errors are reported in the logs.

I can search on the contents of that first epub document, so it's extracting OK in Tika, but there's a problem somewhere in my config that's causing only 1 document to be indexed in Solr.
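
The number of documents actually in the index can also be checked outside the admin UI by asking the core's select handler for a match-all count. The following is a rough sketch rather than a definitive tool: it assumes the default `/select` handler is enabled on the `hn2` core at localhost:8983 and returns the usual JSON response with a `response.numFound` entry; the helper names `count_url`, `num_found` and `indexed_doc_count` are made up for illustration.

```python
import json
import urllib.request


def count_url(core, host="localhost", port=8983):
    """Build a match-all query URL that returns no rows, only the hit count."""
    return f"http://{host}:{port}/solr/{core}/select?q=*:*&rows=0&wt=json"


def num_found(body):
    """Extract response.numFound from a Solr JSON response body (a string)."""
    return json.loads(body)["response"]["numFound"]


def indexed_doc_count(core):
    """Query a running Solr instance (assumed reachable) for its document count."""
    with urllib.request.urlopen(count_url(core)) as resp:
        return num_found(resp.read().decode("utf-8"))
```

With Solr running, `indexed_doc_count("hn2")` returning 1 after a full import would match the "Processed: 1" figure reported by the dataimport page.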


Thanks for any assistance / pointers.

Regards,
Gary

--
Gary Taylor | www.inovem.com | www.kahootz.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.
