I don't know much about Tika, but in the example data-config.xml that
you posted, the "xpath" attribute on the field "text" won't work:
xpath is honored only by XPathEntityProcessor, and the "tika" entity
uses TikaEntityProcessor, which ignores it.
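
One possible workaround (an untested sketch, not something I have run
against your data) is to let Tika emit the HTML as it does now and trim
it afterwards with a DIH ScriptTransformer. The function name
"extractContent" is made up for illustration, and the regex assumes the
"text" column still contains the <div id="content"> markup, which should
be the case with htmlMapper="identity" and format="html":

    <dataConfig>
      <script><![CDATA[
        // Hypothetical helper: keep only the markup inside <div id="content">.
        function extractContent(row) {
          var html = row.get('text');
          if (html != null) {
            // Non-greedy match: stops at the first </div>, so a nested
            // <div> inside the content block would truncate the result.
            var re = /<div[^>]*id=["']content["'][^>]*>([\s\S]*?)<\/div>/i;
            var m = re.exec(String(html));
            if (m) {
              row.put('text', m[1]);
            }
          }
          return row;
        }
      ]]></script>

      <!-- dataSource elements and the outer "rec" entity stay as in your
           config; only the inner "tika" entity changes: -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip"
              htmlMapper="identity" format="html"
              transformer="script:extractContent">
        <field column="text" />
      </entity>
    </dataConfig>

If the content div contains nested divs, a regex like this is too naive
and you would need a real HTML parser instead.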

On Thu, Aug 29, 2013 at 10:20 PM, Andreas Owen <a...@conx.ch> wrote:
> I want Tika to index only the content in <div id="content">...</div> for the 
> field "text". Unfortunately it's indexing the whole page. Can't xpath do this?
>
> data-config.xml:
>
> <dataConfig>
>         <dataSource type="BinFileDataSource" name="data"/>
>         <dataSource type="BinURLDataSource" name="dataUrl"/>
>         <dataSource type="URLDataSource" name="main"/>
> <document>
>         <entity name="rec" processor="XPathEntityProcessor" 
> url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" 
> dataSource="main"> <!--transformer="script:GenerateId"-->
>                 <field column="title" xpath="//title" />
>                 <field column="id" xpath="//id" />
>                 <field column="file" xpath="//file" />
>                 <field column="path" xpath="//path" />
>                 <field column="url" xpath="//url" />
>                 <field column="Author" xpath="//author" />
>
>                 <entity name="tika" processor="TikaEntityProcessor" 
> url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" 
> htmlMapper="identity" format="html" >
>                         <field column="text" xpath="//div[@id='content']" />
>
>                 </entity>
>         </entity>
> </document>
> </dataConfig>



-- 
Regards,
Shalin Shekhar Mangar.
