[ 
https://issues.apache.org/jira/browse/SOLR-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356789#comment-14356789
 ] 

Alexandre Rafalovitch commented on SOLR-7174:
---------------------------------------------

This may actually be a regression; see SOLR-7222. That means we need to 
change CHANGES.txt, and also that something else may be affected.

So either the Tika upgrade did it or something in DIH did, possibly related 
to the RecursiveParserWrapper mentioned in SOLR-7189.
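
To make the "capable of re-use" symptom in the issue title concrete, here is a toy model in Python. It is only a sketch under an assumption: that a child processor instance is shared across parent rows and keeps a "done" flag that is never cleared. ToyChildProcessor, run_import and reset_on_init are hypothetical names for illustration, not Solr or Tika APIs, and this does not claim to be the actual cause.

```python
class ToyChildProcessor:
    """Toy stand-in for a child entity processor. NOT the Solr class."""

    def __init__(self, reset_on_init):
        self.reset_on_init = reset_on_init  # hypothetical switch for the demo
        self.done = False
        self.source = None

    def init(self, source):
        # Called once per parent row with that row's file path.
        self.source = source
        if self.reset_on_init:
            self.done = False

    def next_row(self):
        # Emit exactly one row per file, then signal exhaustion with None.
        if self.done:
            return None
        self.done = True
        return {"text": "parsed " + self.source}


def run_import(files, processor):
    """Reuse one processor instance for every file, as a parent entity would."""
    docs = []
    for path in files:
        processor.init(path)
        row = processor.next_row()
        while row is not None:
            docs.append(row)
            row = processor.next_row()
    return docs


files = ["issue018.epub", "issue019.epub", "issue020.epub"]
stale = run_import(files, ToyChildProcessor(reset_on_init=False))
fixed = run_import(files, ToyChildProcessor(reset_on_init=True))
print(len(stale), len(fixed))  # prints "1 3"
```

With the stale flag only the first file produces a document, which resembles the "Fetched: 58 ... Processed:1" result and the empty "document#2" entries in the verbose output quoted below.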

> DIH should reset TikaEntityProcessor so that it is capable of re-use.
> ---------------------------------------------------------------------
>
>                 Key: SOLR-7174
>                 URL: https://issues.apache.org/jira/browse/SOLR-7174
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 5.0
>         Environment: Windows 7.  Ubuntu 14.04.
>            Reporter: Gary Taylor
>            Assignee: Noble Paul
>              Labels: dataimportHandler, tika, text-extraction
>         Attachments: SOLR-7174.patch
>
>
> Downloaded Solr 5.0.0, on a Windows 7 PC.   I ran "solr start" and then "solr 
> create -c hn2" to create a new core.
> I want to index a load of epub files that I've got in a directory. So I 
> created a data-import.xml (in solr\hn2\conf):
> <dataConfig>
>     <dataSource type="BinFileDataSource" name="bin" />
>     <document>
>         <entity name="files" dataSource="null" rootEntity="false"
>             processor="FileListEntityProcessor"
>             baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
>             onError="skip"
>             recursive="true">
>             <field column="fileAbsolutePath" name="id" />
>             <field column="fileSize" name="size" />
>             <field column="fileLastModified" name="lastModified" />
>             <entity name="documentImport" processor="TikaEntityProcessor"
>                 url="${files.fileAbsolutePath}" format="text"
>                 dataSource="bin" onError="skip">
>                 <field column="file" name="fileName"/>
>                 <field column="Author" name="author" meta="true"/>
>                 <field column="title" name="title" meta="true"/>
>                 <field column="text" name="content"/>
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
> In my solrconfig.xml, I added a requestHandler entry to reference my 
> data-import.xml:
>   <requestHandler name="/dataimport"
>       class="org.apache.solr.handler.dataimport.DataImportHandler">
>       <lst name="defaults">
>           <str name="config">data-import.xml</str>
>       </lst>
>   </requestHandler>
> I renamed managed-schema to schema.xml, and ensured the following doc fields 
> were set up:
>       <field name="id" type="string" indexed="true" stored="true"
>           required="true" multiValued="false" />
>       <field name="fileName" type="string" indexed="true" stored="true" />
>       <field name="author" type="string" indexed="true" stored="true" />
>       <field name="title" type="string" indexed="true" stored="true" />
>       <field name="size" type="long" indexed="true" stored="true" />
>       <field name="lastModified" type="date" indexed="true" stored="true" />
>       <field name="content" type="text_en" indexed="false" stored="true"
>           multiValued="false"/>
>       <field name="text" type="text_en" indexed="true" stored="false"
>           multiValued="true"/>
>     <copyField source="content" dest="text"/>
> I copied all the jars from dist and contrib\* into server\solr\lib.
> Stopping and restarting solr then creates a new managed-schema file and 
> renames schema.xml to schema.xml.back
> All good so far.
> Now I go to the web admin for dataimport 
> (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to execute 
> a full import.
> But the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed:1", 
> i.e. it only adds one document (the very first one) even though it has iterated 
> over 58!
> No errors are reported in the logs. 
> I can repeat this on Ubuntu 14.04 using the same steps, so it's not Windows 
> specific.
> -----------------
> If I change the data-import.xml to use FileDataSource and 
> PlainTextEntityProcessor and parse txt files, e.g.: 
> <dataConfig>  
>       <dataSource type="FileDataSource" name="bin" />
>       <document>
>               <entity name="files" dataSource="null" rootEntity="false"
>                       processor="FileListEntityProcessor"
>                       baseDir="c:/Users/gt/Documents/epub" fileName=".*txt">
>                       <field column="fileAbsolutePath" name="id" />
>                       <field column="fileSize" name="size" />
>                       <field column="fileLastModified" name="lastModified" />
>                       <entity name="documentImport"
>                               processor="PlainTextEntityProcessor"
>                               url="${files.fileAbsolutePath}" format="text"
>                               dataSource="bin">
>                               <field column="plainText" name="content"/>
>                       </entity>
>               </entity>
>       </document> 
> </dataConfig> 
> This works.  So it's a combo of BinFileDataSource and TikaEntityProcessor 
> that is failing.
> On Windows, I ran Process Monitor, and spotted that only the very first epub 
> file is actually being read (repeatedly).
> With verbose and debug on when running the DIH, I get the following response:
> ....
>   "verbose-output": [
>     "entity:files",
>     [
>       null,
>       "----------- row #1-------------",
>       "fileSize",
>       2609004,
>       "fileLastModified",
>       "2015-02-25T11:37:25.217Z",
>       "fileAbsolutePath",
>       "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
>       "fileDir",
>       "c:\\Users\\gt\\Documents\\epub",
>       "file",
>       "issue018.epub",
>       null,
>       "---------------------------------------------",
>       "entity:documentImport",
>       [
>         "document#1",
>         [
>           "query",
>           "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
>           "time-taken",
>           "0:0:0.0",
>           null,
>           "----------- row #1-------------",
>           "text",
>           "< ... parsed epub text - snip ... >",
>           "title",
>           "Issue 18 title",
>           "Author",
>           "Author text",
>           null,
>           "---------------------------------------------"
>         ],
>         "document#2",
>         []
>       ],
>       null,
>       "----------- row #2-------------",
>       "fileSize",
>       4428804,
>       "fileLastModified",
>       "2015-02-25T11:37:36.399Z",
>       "fileAbsolutePath",
>       "c:\\Users\\gt\\Documents\\epub\\issue019.epub",
>       "fileDir",
>       "c:\\Users\\gt\\Documents\\epub",
>       "file",
>       "issue019.epub",
>       null,
>       "---------------------------------------------",
>       "entity:documentImport",
>       [
>         "document#2",
>         []
>       ],
>       null,
>       "----------- row #3-------------",
>       "fileSize",
>       2580266,
>       "fileLastModified",
>       "2015-02-25T11:37:41.188Z",
>       "fileAbsolutePath",
>       "c:\\Users\\gt\\Documents\\epub\\issue020.epub",
>       "fileDir",
>       "c:\\Users\\gt\\Documents\\epub",
>       "file",
>       "issue020.epub",
>       null,
>       "---------------------------------------------",
>       "entity:documentImport",
>       [
>         "document#2",
>         []
>       ],
> ....
> ....



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

