I don't know much about Tika, but in the example data-config.xml that you posted, the "xpath" attribute on the field "text" won't work: the xpath attribute is used only by XPathEntityProcessor, and that field sits inside the entity that uses TikaEntityProcessor.
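
One possible workaround, offered as an untested sketch rather than a confirmed fix: have Tika emit the page as XML, then hand that output to a nested XPathEntityProcessor through a FieldReaderDataSource. The data source name "fld", the entity name "content", format="xml", and forEach="/html" are my assumptions. XPathEntityProcessor supports only a subset of XPath, so the expression may need to be rewritten, for example as a full path from the root.

```xml
<!-- Untested sketch. The names "fld" and "content" are illustrative assumptions. -->

<!-- Goes next to the other dataSource elements, directly under dataConfig -->
<dataSource type="FieldReaderDataSource" name="fld"/>

<!-- Replaces the existing "tika" entity inside the "rec" entity -->
<entity name="tika" processor="TikaEntityProcessor"
        url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip"
        htmlMapper="identity" format="xml">

    <!-- Reads the XML that Tika produced (column "text" of entity "tika")
         and applies the xpath with a processor that understands it -->
    <entity name="content" processor="XPathEntityProcessor"
            dataSource="fld" dataField="tika.text"
            forEach="/html" onError="skip">
        <!-- flatten="true" collects the text of all nested elements -->
        <field column="text" xpath="//div[@id='content']" flatten="true"/>
    </entity>
</entity>
```

Note that Tika's own "text" column may still be mapped to the schema field of the same name, so the outer entity's output might need to be routed to a different column; that part is left out of the sketch.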

On Thu, Aug 29, 2013 at 10:20 PM, Andreas Owen <a...@conx.ch> wrote:
> I want Tika to index only the content in <div id="content">...</div> for the
> field "text". Unfortunately, it's indexing the whole page. Can't xpath do this?
>
> data-config.xml:
>
> <dataConfig>
>     <dataSource type="BinFileDataSource" name="data"/>
>     <dataSource type="BinURLDataSource" name="dataUrl"/>
>     <dataSource type="URLDataSource" name="main"/>
>     <document>
>         <entity name="rec" processor="XPathEntityProcessor"
>                 url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc"
>                 dataSource="main"> <!--transformer="script:GenerateId"-->
>             <field column="title" xpath="//title" />
>             <field column="id" xpath="//id" />
>             <field column="file" xpath="//file" />
>             <field column="path" xpath="//path" />
>             <field column="url" xpath="//url" />
>             <field column="Author" xpath="//author" />
>
>             <entity name="tika" processor="TikaEntityProcessor"
>                     url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip"
>                     htmlMapper="identity" format="html" >
>                 <field column="text" xpath="//div[@id='content']" />
>             </entity>
>         </entity>
>     </document>
> </dataConfig>

--
Regards,
Shalin Shekhar Mangar.