[ https://issues.apache.org/jira/browse/SOLR-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346412#comment-14346412 ]
ASF subversion and git services commented on SOLR-7174:
-------------------------------------------------------

Commit 1663858 from [~noble.paul] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1663858 ]

SOLR-7174: DIH should reset TikaEntityProcessor so that it is capable of re-use

> DIH should reset TikaEntityProcessor so that it is capable of re-use.
> ---------------------------------------------------------------------
>
>                 Key: SOLR-7174
>                 URL: https://issues.apache.org/jira/browse/SOLR-7174
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>    Affects Versions: 5.0
>         Environment: Windows 7. Ubuntu 14.04.
>            Reporter: Gary Taylor
>            Assignee: Noble Paul
>              Labels: dataimportHandler, tika,text-extraction
>         Attachments: SOLR-7174.patch
>
>
> Downloaded Solr 5.0.0, on a Windows 7 PC. I ran "solr start" and then "solr
> create -c hn2" to create a new core.
> I want to index a load of epub files that I've got in a directory. So I
> created a data-import.xml (in solr\hn2\conf):
> <dataConfig>
>     <dataSource type="BinFileDataSource" name="bin" />
>     <document>
>         <entity name="files" dataSource="null" rootEntity="false"
>             processor="FileListEntityProcessor"
>             baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
>             onError="skip"
>             recursive="true">
>             <field column="fileAbsolutePath" name="id" />
>             <field column="fileSize" name="size" />
>             <field column="fileLastModified" name="lastModified" />
>             <entity name="documentImport" processor="TikaEntityProcessor"
>                 url="${files.fileAbsolutePath}" format="text"
>                 dataSource="bin" onError="skip">
>                 <field column="file" name="fileName"/>
>                 <field column="Author" name="author" meta="true"/>
>                 <field column="title" name="title" meta="true"/>
>                 <field column="text" name="content"/>
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
> In my solrconfig.xml, I added a requestHandler entry to reference my
> data-import.xml:
>   <requestHandler name="/dataimport"
>       class="org.apache.solr.handler.dataimport.DataImportHandler">
>     <lst name="defaults">
>       <str name="config">data-import.xml</str>
>     </lst>
>   </requestHandler>
> I renamed managed-schema to schema.xml, and ensured the following doc fields
> were setup:
>   <field name="id" type="string" indexed="true" stored="true"
>       required="true" multiValued="false" />
>   <field name="fileName" type="string" indexed="true" stored="true" />
>   <field name="author" type="string" indexed="true" stored="true" />
>   <field name="title" type="string" indexed="true" stored="true" />
>   <field name="size" type="long" indexed="true" stored="true" />
>   <field name="lastModified" type="date" indexed="true" stored="true" />
>   <field name="content" type="text_en" indexed="false" stored="true"
>       multiValued="false"/>
>   <field name="text" type="text_en" indexed="true" stored="false"
>       multiValued="true"/>
>   <copyField source="content" dest="text"/>
> I copied all the jars from dist and contrib\* into server\solr\lib.
> Stopping and restarting solr then creates a new managed-schema file and
> renames schema.xml to schema.xml.back
> All good so far.
> Now I go to the web admin for dataimport
> (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and execute
> a full import.
> But, the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed:1" -
> ie. it only adds one document (the very first one) even though it's iterated
> over 58!
> No errors are reported in the logs.
> I can repeat this on Ubuntu 14.04 using the same steps, so it's not Windows
> specific.
> -----------------
> If I change the data-import.xml to use FileDataSource and
> PlainTextEntityProcessor and parse txt files, eg:
> <dataConfig>
>     <dataSource type="FileDataSource" name="bin" />
>     <document>
>         <entity name="files" dataSource="null" rootEntity="false"
>             processor="FileListEntityProcessor"
>             baseDir="c:/Users/gt/Documents/epub" fileName=".*txt">
>             <field column="fileAbsolutePath" name="id" />
>             <field column="fileSize" name="size" />
>             <field column="fileLastModified" name="lastModified" />
>             <entity name="documentImport"
>                 processor="PlainTextEntityProcessor"
>                 url="${files.fileAbsolutePath}" format="text"
>                 dataSource="bin">
>                 <field column="plainText" name="content"/>
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
> This works. So it's a combo of BinFileDataSource and TikaEntityProcessor
> that is failing.
> On Windows, I ran Process Monitor, and spotted that only the very first epub
> file is actually being read (repeatedly).
> With verbose and debug on when running the DIH, I get the following response:
> ....
>   "verbose-output": [
>     "entity:files",
>     [
>       null,
>       "----------- row #1-------------",
>       "fileSize",
>       2609004,
>       "fileLastModified",
>       "2015-02-25T11:37:25.217Z",
>       "fileAbsolutePath",
>       "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
>       "fileDir",
>       "c:\\Users\\gt\\Documents\\epub",
>       "file",
>       "issue018.epub",
>       null,
>       "---------------------------------------------",
>       "entity:documentImport",
>       [
>         "document#1",
>         [
>           "query",
>           "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
>           "time-taken",
>           "0:0:0.0",
>           null,
>           "----------- row #1-------------",
>           "text",
>           "< ... parsed epub text - snip ... >"
>           "title",
>           "Issue 18 title",
>           "Author",
>           "Author text",
>           null,
>           "---------------------------------------------"
>         ],
>         "document#2",
>         []
>       ],
>       null,
>       "----------- row #2-------------",
>       "fileSize",
>       4428804,
>       "fileLastModified",
>       "2015-02-25T11:37:36.399Z",
>       "fileAbsolutePath",
>       "c:\\Users\\gt\\Documents\\epub\\issue019.epub",
>       "fileDir",
>       "c:\\Users\\gt\\Documents\\epub",
>       "file",
>       "issue019.epub",
>       null,
>       "---------------------------------------------",
>       "entity:documentImport",
>       [
>         "document#2",
>         []
>       ],
>       null,
>       "----------- row #3-------------",
>       "fileSize",
>       2580266,
>       "fileLastModified",
>       "2015-02-25T11:37:41.188Z",
>       "fileAbsolutePath",
>       "c:\\Users\\gt\\Documents\\epub\\issue020.epub",
>       "fileDir",
>       "c:\\Users\\gt\\Documents\\epub",
>       "file",
>       "issue020.epub",
>       null,
>       "---------------------------------------------",
>       "entity:documentImport",
>       [
>         "document#2",
>         []
>       ],
> ....
> ....

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org