[ https://issues.apache.org/jira/browse/SOLR-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-7174:
-----------------------------
    Summary: DIH should reset TikaEntityProcessor so that it is capable of re-use.
    (was: DIH should reset TikaEntityProcessor so that it is not capable of re-use.)
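
For context, the retitled summary describes the intended behaviour: the Tika sub-entity is re-initialised for each row of the parent FileListEntityProcessor, so any per-run state it keeps (presumably a completion flag) has to be cleared on each init() or only the first file is ever parsed. Below is a minimal, fragment-style sketch of the kind of reset the attached patch presumably makes; the field name "done" and the exact hook are assumptions made here, not taken from SOLR-7174.patch.

    // Sketch of the presumed change, not the contents of SOLR-7174.patch.
    // Assumes TikaEntityProcessor keeps a boolean "done" flag that nextRow()
    // checks and sets; only the reset in init() is new here.
    public class TikaEntityProcessor extends EntityProcessorBase {

      private boolean done = false;   // assumed per-run completion flag

      @Override
      public void init(Context context) {
        super.init(context);
        done = false;   // clear state so the processor can be re-used for the next file row
      }

      // nextRow() (elided) would return null once "done" is true, which is why only
      // the first epub produced a document if the flag was never cleared.
    }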

> DIH should reset TikaEntityProcessor so that it is capable of re-use.
> ---------------------------------------------------------------------
>
>                 Key: SOLR-7174
>                 URL: https://issues.apache.org/jira/browse/SOLR-7174
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>    Affects Versions: 5.0
>         Environment: Windows 7.  Ubuntu 14.04.
>            Reporter: Gary Taylor
>            Assignee: Noble Paul
>              Labels: dataimportHandler, tika, text-extraction
>         Attachments: SOLR-7174.patch
>
>
> Downloaded Solr 5.0.0 on a Windows 7 PC. I ran "solr start" and then "solr create -c hn2" to create a new core.
> I want to index a load of epub files that I've got in a directory, so I created a data-import.xml (in solr\hn2\conf):
> <dataConfig>
>     <dataSource type="BinFileDataSource" name="bin" />
>     <document>
>         <entity name="files" dataSource="null" rootEntity="false"
>             processor="FileListEntityProcessor"
>             baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
>             onError="skip"
>             recursive="true">
>             <field column="fileAbsolutePath" name="id" />
>             <field column="fileSize" name="size" />
>             <field column="fileLastModified" name="lastModified" />
>             <entity name="documentImport" processor="TikaEntityProcessor"
>                 url="${files.fileAbsolutePath}" format="text"
>                 dataSource="bin" onError="skip">
>                 <field column="file" name="fileName"/>
>                 <field column="Author" name="author" meta="true"/>
>                 <field column="title" name="title" meta="true"/>
>                 <field column="text" name="content"/>
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
> In my solrconfig.xml, I added a requestHandler entry to reference my data-import.xml:
>   <requestHandler name="/dataimport"
>       class="org.apache.solr.handler.dataimport.DataImportHandler">
>       <lst name="defaults">
>           <str name="config">data-import.xml</str>
>       </lst>
>   </requestHandler>
> I renamed managed-schema to schema.xml, and ensured the following doc fields were set up:
>       <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
>       <field name="fileName" type="string" indexed="true" stored="true" />
>       <field name="author" type="string" indexed="true" stored="true" />
>       <field name="title" type="string" indexed="true" stored="true" />
>       <field name="size" type="long" indexed="true" stored="true" />
>       <field name="lastModified" type="date" indexed="true" stored="true" />
>       <field name="content" type="text_en" indexed="false" stored="true" multiValued="false"/>
>       <field name="text" type="text_en" indexed="true" stored="false" multiValued="true"/>
>     <copyField source="content" dest="text"/>
> I copied all the jars from dist and contrib\* into server\solr\lib.
> Stopping and restarting Solr then creates a new managed-schema file and renames schema.xml to schema.xml.back.
> All good so far.
> Now I go to the web admin for dataimport
> (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to execute a full import.
> But the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed:1" -
> i.e. it only adds one document (the very first one) even though it has iterated over 58!
> No errors are reported in the logs.
> I can repeat this on Ubuntu 14.04 using the same steps, so it's not Windows-specific.
> -----------------
> If I change the data-import.xml to use FileDataSource and PlainTextEntityProcessor and parse txt files, e.g.:
> <dataConfig>  
>       <dataSource type="FileDataSource" name="bin" />
>       <document>
>               <entity name="files" dataSource="null" rootEntity="false"
>                       processor="FileListEntityProcessor"
>                       baseDir="c:/Users/gt/Documents/epub" fileName=".*txt">
>                       <field column="fileAbsolutePath" name="id" />
>                       <field column="fileSize" name="size" />
>                       <field column="fileLastModified" name="lastModified" />
>                       <entity name="documentImport" processor="PlainTextEntityProcessor"
>                               url="${files.fileAbsolutePath}" format="text"
>                               dataSource="bin">
>                               <field column="plainText" name="content"/>
>                       </entity>
>               </entity>
>       </document> 
> </dataConfig> 
> This works, so it is the combination of BinFileDataSource and TikaEntityProcessor that is failing.
> On Windows, I ran Process Monitor and spotted that only the very first epub file is actually being read (repeatedly).
> With verbose and debug on when running the DIH, I get the following response:
> ....
>   "verbose-output": [
>     "entity:files",
>     [
>       null,
>       "----------- row #1-------------",
>       "fileSize",
>       2609004,
>       "fileLastModified",
>       "2015-02-25T11:37:25.217Z",
>       "fileAbsolutePath",
>       "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
>       "fileDir",
>       "c:\\Users\\gt\\Documents\\epub",
>       "file",
>       "issue018.epub",
>       null,
>       "---------------------------------------------",
>       "entity:documentImport",
>       [
>         "document#1",
>         [
>           "query",
>           "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
>           "time-taken",
>           "0:0:0.0",
>           null,
>           "----------- row #1-------------",
>           "text",
>           "< ... parsed epub text - snip ... >",
>           "title",
>           "Issue 18 title",
>           "Author",
>           "Author text",
>           null,
>           "---------------------------------------------"
>         ],
>         "document#2",
>         []
>       ],
>       null,
>       "----------- row #2-------------",
>       "fileSize",
>       4428804,
>       "fileLastModified",
>       "2015-02-25T11:37:36.399Z",
>       "fileAbsolutePath",
>       "c:\\Users\\gt\\Documents\\epub\\issue019.epub",
>       "fileDir",
>       "c:\\Users\\gt\\Documents\\epub",
>       "file",
>       "issue019.epub",
>       null,
>       "---------------------------------------------",
>       "entity:documentImport",
>       [
>         "document#2",
>         []
>       ],
>       null,
>       "----------- row #3-------------",
>       "fileSize",
>       2580266,
>       "fileLastModified",
>       "2015-02-25T11:37:41.188Z",
>       "fileAbsolutePath",
>       "c:\\Users\\gt\\Documents\\epub\\issue020.epub",
>       "fileDir",
>       "c:\\Users\\gt\\Documents\\epub",
>       "file",
>       "issue020.epub",
>       null,
>       "---------------------------------------------",
>       "entity:documentImport",
>       [
>         "document#2",
>         []
>       ],
> ....
> ....
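
The debug output above shows the shape of the failure: row #1 of entity:files yields a fully parsed document#1, but every later row re-uses the same documentImport processor and gets an empty result (document#2 is emitted with no fields), matching the "Fetched: 58 ... Processed: 1" counters. The following self-contained toy program models that lifecycle; it does not use the Solr classes and the names in it are illustrative only, but it shows why a completion flag that is never cleared between parent rows yields exactly one document.

import java.util.List;
import java.util.Map;

// Toy model of the DIH nesting in this report: an outer "file list" loop that
// re-initialises an inner single-row processor for each file. Not Solr code.
public class DoneFlagDemo {

  /** Minimal stand-in for a single-document (Tika-like) entity processor. */
  static class SingleRowProcessor {
    private boolean done = false;
    private String currentFile;

    // With the bug: init() forgets to clear "done".
    // With the fix: uncomment the reset and every file is processed.
    void init(String file) {
      this.currentFile = file;
      // this.done = false;
    }

    Map<String, String> nextRow() {
      if (done) return null;          // second and later files hit this immediately
      done = true;
      return Map.of("text", "parsed contents of " + currentFile);
    }
  }

  public static void main(String[] args) {
    List<String> files = List.of("issue018.epub", "issue019.epub", "issue020.epub");
    SingleRowProcessor processor = new SingleRowProcessor();  // one instance, re-used

    int fetched = 0, processed = 0;
    for (String file : files) {
      fetched++;
      processor.init(file);                     // re-initialised per parent row, as DIH does
      for (Map<String, String> row; (row = processor.nextRow()) != null; ) {
        processed++;
        System.out.println("document#" + processed + " <- " + row);
      }
    }
    System.out.println("Fetched: " + fetched + ", Processed: " + processed);
  }
}

As written it prints one parsed document followed by "Fetched: 3, Processed: 1" - the same pattern as the report's 58 vs 1; re-enabling the reset line in init() yields all three.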


