[ https://issues.apache.org/jira/browse/SOLR-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357242#comment-14357242 ]

Tim Allison commented on SOLR-7174:
-----------------------------------

Could be Tika, but it isn't RecursiveParserWrapper.  That has to be called 
explicitly in the invoking code (i.e. it isn't under the hood of 
AutoDetectParser), and it would wrap AutoDetectParser or the user-configured 
parser.  


> DIH should reset TikaEntityProcessor so that it is capable of re-use.
> ---------------------------------------------------------------------
>
>                 Key: SOLR-7174
>                 URL: https://issues.apache.org/jira/browse/SOLR-7174
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 5.0
>         Environment: Windows 7.  Ubuntu 14.04.
>            Reporter: Gary Taylor
>            Assignee: Noble Paul
>              Labels: dataimportHandler, tika,text-extraction
>         Attachments: SOLR-7174.patch
>
>
> Downloaded Solr 5.0.0, on a Windows 7 PC.   I ran "solr start" and then "solr 
> create -c hn2" to create a new core.
> I want to index a load of epub files that I've got in a directory. So I 
> created a data-import.xml (in solr\hn2\conf):
> <dataConfig>
>     <dataSource type="BinFileDataSource" name="bin" />
>     <document>
>         <entity name="files" dataSource="null" rootEntity="false"
>             processor="FileListEntityProcessor"
>             baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
>             onError="skip"
>             recursive="true">
>             <field column="fileAbsolutePath" name="id" />
>             <field column="fileSize" name="size" />
>             <field column="fileLastModified" name="lastModified" />
>             <entity name="documentImport" processor="TikaEntityProcessor"
>                 url="${files.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
>                 <field column="file" name="fileName"/>
>                 <field column="Author" name="author" meta="true"/>
>                 <field column="title" name="title" meta="true"/>
>                 <field column="text" name="content"/>
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
> In my solrconfig.xml, I added a requestHandler entry to reference my 
> data-import.xml:
>   <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
>       <lst name="defaults">
>           <str name="config">data-import.xml</str>
>       </lst>
>   </requestHandler>
> I renamed managed-schema to schema.xml, and ensured the following doc fields 
> were set up:
>       <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
>       <field name="fileName" type="string" indexed="true" stored="true" />
>       <field name="author" type="string" indexed="true" stored="true" />
>       <field name="title" type="string" indexed="true" stored="true" />
>       <field name="size" type="long" indexed="true" stored="true" />
>       <field name="lastModified" type="date" indexed="true" stored="true" />
>       <field name="content" type="text_en" indexed="false" stored="true" multiValued="false"/>
>       <field name="text" type="text_en" indexed="true" stored="false" multiValued="true"/>
>     <copyField source="content" dest="text"/>
> I copied all the jars from dist and contrib\* into server\solr\lib.
> Stopping and restarting solr then creates a new managed-schema file and 
> renames schema.xml to schema.xml.back
> All good so far.
> Now I go to the web admin for dataimport 
> (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to execute 
> a full import.
> But the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed:1", 
> i.e. it only adds one document (the very first one) even though it has 
> iterated over 58!
> No errors are reported in the logs. 
> I can repeat this on Ubuntu 14.04 using the same steps, so it's not Windows 
> specific.
> -----------------
> If I change the data-import.xml to use FileDataSource and 
> PlainTextEntityProcessor and parse txt files, e.g.: 
> <dataConfig>  
>       <dataSource type="FileDataSource" name="bin" />
>       <document>
>               <entity name="files" dataSource="null" rootEntity="false"
>                       processor="FileListEntityProcessor"
>                       baseDir="c:/Users/gt/Documents/epub" fileName=".*txt">
>                       <field column="fileAbsolutePath" name="id" />
>                       <field column="fileSize" name="size" />
>                       <field column="fileLastModified" name="lastModified" />
>                       <entity name="documentImport" processor="PlainTextEntityProcessor"
>                               url="${files.fileAbsolutePath}" format="text" dataSource="bin">
>                               <field column="plainText" name="content"/>
>                       </entity>
>               </entity>
>       </document> 
> </dataConfig> 
> This works.  So it's a combo of BinFileDataSource and TikaEntityProcessor 
> that is failing.
> On Windows, I ran Process Monitor, and spotted that only the very first epub 
> file is actually being read (repeatedly).
> With verbose and debug on when running the DIH, I get the following response:
> ....
>   "verbose-output": [
>     "entity:files",
>     [
>       null,
>       "----------- row #1-------------",
>       "fileSize",
>       2609004,
>       "fileLastModified",
>       "2015-02-25T11:37:25.217Z",
>       "fileAbsolutePath",
>       "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
>       "fileDir",
>       "c:\\Users\\gt\\Documents\\epub",
>       "file",
>       "issue018.epub",
>       null,
>       "---------------------------------------------",
>       "entity:documentImport",
>       [
>         "document#1",
>         [
>           "query",
>           "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
>           "time-taken",
>           "0:0:0.0",
>           null,
>           "----------- row #1-------------",
>           "text",
>           "< ... parsed epub text - snip ... >",
>           "title",
>           "Issue 18 title",
>           "Author",
>           "Author text",
>           null,
>           "---------------------------------------------"
>         ],
>         "document#2",
>         []
>       ],
>       null,
>       "----------- row #2-------------",
>       "fileSize",
>       4428804,
>       "fileLastModified",
>       "2015-02-25T11:37:36.399Z",
>       "fileAbsolutePath",
>       "c:\\Users\\gt\\Documents\\epub\\issue019.epub",
>       "fileDir",
>       "c:\\Users\\gt\\Documents\\epub",
>       "file",
>       "issue019.epub",
>       null,
>       "---------------------------------------------",
>       "entity:documentImport",
>       [
>         "document#2",
>         []
>       ],
>       null,
>       "----------- row #3-------------",
>       "fileSize",
>       2580266,
>       "fileLastModified",
>       "2015-02-25T11:37:41.188Z",
>       "fileAbsolutePath",
>       "c:\\Users\\gt\\Documents\\epub\\issue020.epub",
>       "fileDir",
>       "c:\\Users\\gt\\Documents\\epub",
>       "file",
>       "issue020.epub",
>       null,
>       "---------------------------------------------",
>       "entity:documentImport",
>       [
>         "document#2",
>         []
>       ],
> ....
> ....
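
The verbose output quoted above (a row for document#1, then empty results for 
every later file) is the pattern you would expect if a child processor is 
re-used across parent rows without its per-document state being reset, which 
is what the issue title asks DIH to fix. A minimal sketch of that failure 
mode, assuming a hypothetical processor with a "done" flag; the class and 
function names are invented for illustration and this is not Solr's code:

```python
class StatefulProcessor:
    """Hypothetical child-entity processor; not Solr's TikaEntityProcessor."""

    def __init__(self, reset_on_init):
        self.reset_on_init = reset_on_init
        self.done = False
        self.url = None

    def init(self, url):
        # Called once per parent row (one file per row).
        self.url = url
        if self.reset_on_init:
            self.done = False  # the reset that makes re-use safe

    def next_row(self):
        # Emits exactly one row per document, then signals "no more rows".
        if self.done:
            return None
        self.done = True
        return {"text": "parsed " + self.url}


def run_import(files, processor):
    """Re-uses a single processor instance for every file."""
    docs = []
    for path in files:
        processor.init(path)
        row = processor.next_row()
        while row is not None:
            docs.append(row)
            row = processor.next_row()
    return docs


files = ["issue018.epub", "issue019.epub", "issue020.epub"]
stale = run_import(files, StatefulProcessor(reset_on_init=False))
fresh = run_import(files, StatefulProcessor(reset_on_init=True))
print(len(stale), len(fresh))  # prints "1 3"
```

Without the reset, only the first file produces a row, matching the 
"Processed:1" result reported above; with it, each file yields one.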



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
