Or could I use a filter in schema.xml, where I define a fieldType and use a filter that understands XPath?

On 4. Sep 2013, at 11:52 AM, Shalin Shekhar Mangar wrote:

> No that wouldn't work. It seems that you probably need a custom
> Transformer to extract the right div content. I do not know if
> TikaEntityProcessor supports such a thing.
>
> On Wed, Sep 4, 2013 at 12:38 PM, Andreas Owen <a...@conx.ch> wrote:
>> so could i just nest it in a XPathEntityProcessor to filter the html or is
>> there something like xpath for tika?
>>
>> <entity name="htm" processor="XPathEntityProcessor" url="${rec.file}"
>>         forEach="/div[@id='content']" dataSource="main">
>>   <entity name="tika" processor="TikaEntityProcessor"
>>           url="${htm}" dataSource="dataUrl" onError="skip" htmlMapper="identity"
>>           format="html" >
>>     <field column="text" />
>>   </entity>
>> </entity>
>>
>> but now i don't know how to pass the text to tika, what do i put in url and
>> datasource?
>>
>> On 3. Sep 2013, at 5:56 PM, Shalin Shekhar Mangar wrote:
>>
>>> I don't know much about Tika but in the example data-config.xml that
>>> you posted, the "xpath" attribute on the field "text" won't work
>>> because the xpath attribute is used only by a XPathEntityProcessor.
>>>
>>> On Thu, Aug 29, 2013 at 10:20 PM, Andreas Owen <a...@conx.ch> wrote:
>>>> I want tika to only index the content in <div id="content">...</div> for
>>>> the field "text". unfortunately it's indexing the whole page. Can't xpath
>>>> do this?
>>>>
>>>> data-config.xml:
>>>>
>>>> <dataConfig>
>>>>   <dataSource type="BinFileDataSource" name="data"/>
>>>>   <dataSource type="BinURLDataSource" name="dataUrl"/>
>>>>   <dataSource type="URLDataSource" name="main"/>
>>>>   <document>
>>>>     <entity name="rec" processor="XPathEntityProcessor"
>>>>             url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc"
>>>>             dataSource="main"> <!--transformer="script:GenerateId"-->
>>>>       <field column="title" xpath="//title" />
>>>>       <field column="id" xpath="//id" />
>>>>       <field column="file" xpath="//file" />
>>>>       <field column="path" xpath="//path" />
>>>>       <field column="url" xpath="//url" />
>>>>       <field column="Author" xpath="//author" />
>>>>
>>>>       <entity name="tika" processor="TikaEntityProcessor"
>>>>               url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip"
>>>>               htmlMapper="identity" format="html" >
>>>>         <field column="text" xpath="//div[@id='content']" />
>>>>
>>>>       </entity>
>>>>     </entity>
>>>>   </document>
>>>> </dataConfig>
>>>
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
>
> --
> Regards,
> Shalin Shekhar Mangar.
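
For reference, the custom Transformer suggested in the thread could be sketched with DIH's ScriptTransformer. This is a minimal, untested sketch under several assumptions, not a confirmed solution: it assumes TikaEntityProcessor with format="html" places the page's XHTML in the "text" column, that the target element is serialized as <div id="content">, and that the JVM provides a JavaScript engine. The function name extractContentDiv is hypothetical; the entity attributes are copied from the data-config.xml quoted above, with the unused fields left out for brevity.

```xml
<dataConfig>
  <script><![CDATA[
    // Hypothetical helper: keep only the inner markup of <div id="content">.
    // Assumes row.get('text') holds the XHTML produced by Tika.
    function extractContentDiv(row) {
      var raw = row.get('text');
      if (raw == null) return row;
      var html = String(raw);            // Java String -> JavaScript string
      var open = html.search(/<div[^>]*\bid=["']content["'][^>]*>/i);
      if (open < 0) return row;          // div not found: leave the row unchanged
      var start = html.indexOf('>', open) + 1;
      var tag = /<(\/?)div\b[^>]*>/gi;   // matches opening and closing div tags
      tag.lastIndex = start;
      var depth = 1, m;
      while ((m = tag.exec(html)) !== null) {
        if (/\/>$/.test(m[0])) continue; // self-closing <div/> does not nest
        depth += (m[1] === '/') ? -1 : 1;
        if (depth === 0) {               // reached the matching </div>
          row.put('text', html.substring(start, m.index));
          break;
        }
      }
      return row;
    }
  ]]></script>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor"
            url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc"
            dataSource="main">
      <field column="id" xpath="//id" />
      <field column="path" xpath="//path" />
      <field column="file" xpath="//file" />
      <entity name="tika" processor="TikaEntityProcessor"
              url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip"
              htmlMapper="identity" format="html"
              transformer="script:extractContentDiv">
        <field column="text" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

The extracted value still contains whatever markup sits inside the div. If only the plain text should be searchable, the field type in schema.xml could add HTMLStripCharFilterFactory to its analyzer; note that this strips tags from the indexed tokens only, not from the stored value.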