[ https://issues.apache.org/jira/browse/SOLR-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750855#action_12750855 ]
Noble Paul edited comment on SOLR-1358 at 12/8/09 3:29 PM:
-----------------------------------------------------------

Let us provide a new TikaEntityProcessor

{code:xml}
<dataConfig>
  <!-- use any of type DataSource<InputStream> -->
  <dataSource type="BinURLDataSource"/>
  <document>
    <!-- The value of format can be text|xml|html. The implicit field 'text' will have that format.
         The default value is 'text' (if not specified). -->
    <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}" format="text">
      <!-- Do the appropriate mapping here. meta="true" means it is a metadata field. -->
      <field column="Author" meta="true" name="author"/>
      <field column="title" meta="true" name="docTitle"/>
      <!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it appropriately. -->
      <field column="text"/>
    </entity>
  </document>
</dataConfig>
{code}

With format=xml|html, XPathEntityProcessor can be nested. This may help users extract more nested data from a file. It is even possible to create multiple documents from a single file.

was (Author: noble.paul):
Let us provide a new TikaEntityProcessor

{code:xml}
<dataConfig>
  <!-- use any of type DataSource<InputStream> -->
  <dataSource type="BinURLDataSource"/>
  <document>
    <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}">
      <!-- Do the appropriate mapping here. meta="true" means it is a metadata field. -->
      <field column="Author" meta="true" name="author"/>
      <field column="title" meta="true" name="docTitle"/>
      <!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it appropriately. -->
      <field column="text"/>
    </entity>
  </document>
</dataConfig>
{code}

This would most likely need a BinURLDataSource/BinContentStreamDataSource because Tika uses binary inputs.

My suggestion is that TikaEntityProcessor live in the extraction contrib so that managing dependencies is easier. But we will have to make extraction have a compile-time dependency on DIH.

Grant, what do you think?
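
To illustrate the nesting mentioned for format=xml|html, a configuration along the following lines might work. This is only a sketch under assumptions not confirmed in this issue: that the proposed TikaEntityProcessor exposes its output as the 'text' column of the parent row, and that DIH's FieldReaderDataSource can hand that column to a nested XPathEntityProcessor through dataField. The entity names ("tika", "para"), the data source names ("bin", "fld"), the forEach path, and the "paragraph" field are hypothetical.

{code:xml}
<dataConfig>
  <dataSource name="bin" type="BinURLDataSource"/>
  <!-- assumption: reads its input from a field of the parent entity's row -->
  <dataSource name="fld" type="FieldReaderDataSource"/>
  <document>
    <!-- rootEntity="false" so that rows of the nested entity become the Solr documents -->
    <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
            tikaConfig="tikaconfig.xml" url="${some.var.goes.here}"
            format="xml" rootEntity="false">
      <field column="title" meta="true" name="docTitle"/>
      <!-- hypothetical: one document per paragraph of the extracted markup -->
      <entity name="para" processor="XPathEntityProcessor" dataSource="fld"
              dataField="tika.text" forEach="/html/body/p">
        <field column="paragraph" xpath="/html/body/p"/>
      </entity>
    </entity>
  </document>
</dataConfig>
{code}

The actual element paths depend on the markup Tika produces for a given file type, so the forEach and xpath values would need to be adjusted after inspecting real output.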
> Integration of Tika and DataImportHandler
> -----------------------------------------
>
>                 Key: SOLR-1358
>                 URL: https://issues.apache.org/jira/browse/SOLR-1358
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: Sascha Szott
>            Assignee: Noble Paul
>         Attachments: SOLR-1358.patch, SOLR-1358.patch
>
>
> At the moment, it is impossible to configure Solr such that it builds up
> documents by using data that comes from both PDF documents and database table
> columns. Currently, to accomplish this task, it is up to the user to add some
> preprocessing that converts PDF files into plain text files. Therefore, I
> would like to see an integration of Solr Cell into DIH that makes that
> preprocessing obsolete.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.