Re: dataimporter tika doesn't extract certain div

2013-09-04 Thread Andreas Owen
So could I just nest it in an XPathEntityProcessor to filter the HTML, or is there something like XPath for Tika?

<entity name="htm" processor="XPathEntityProcessor" url="${rec.file}" forEach="/div[@id='content']" dataSource="main">
<entity name="tika" processor="TikaEntityProcessor"

Re: dataimporter tika doesn't extract certain div

2013-09-04 Thread Shalin Shekhar Mangar
No, that wouldn't work. It seems that you probably need a custom Transformer to extract the right div content. I do not know if TikaEntityProcessor supports such a thing.

On Wed, Sep 4, 2013 at 12:38 PM, Andreas Owen a...@conx.ch wrote:
> so could i just nest it in a XPathEntityProcessor to filter
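For illustration, the custom Transformer idea might look roughly like the sketch below. It is not taken from the thread. The class name ContentDivTransformer and the column name text are hypothetical, and the sketch assumes the column arrives as well-formed XHTML that still contains the div and its id attribute, which Tika's HTML handling may not guarantee. It also assumes DIH will call a plain class exposing transformRow(Map) when it is named in the entity's transformer attribute; check this against the DataImportHandler documentation for your Solr version.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical transformer: keeps only the text inside <div id="content">.
public class ContentDivTransformer {

    // local-name() is used so the match works whether or not the
    // XHTML namespace is declared on the document.
    private static final String CONTENT_DIV =
            "//*[local-name()='div' and @id='content']";

    // Assumption: the row holds XHTML markup in the "text" column.
    public Object transformRow(Map<String, Object> row) {
        Object raw = row.get("text");
        if (raw == null) {
            return row;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(raw.toString())));
            Node div = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(CONTENT_DIV, doc, XPathConstants.NODE);
            if (div != null) {
                // Replace the whole page with the text of the matching div.
                row.put("text", div.getTextContent().trim());
            }
        } catch (Exception e) {
            // Not parseable as XML: leave the row unchanged.
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("text", "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div id=\"nav\">menu</div>"
                + "<div id=\"content\">Hello <b>world</b></div>"
                + "</body></html>");
        new ContentDivTransformer().transformRow(row);
        System.out.println(row.get("text")); // prints "Hello world"
    }
}
```

The sketch parses with the JDK's XML parser, so it only helps when the markup is well-formed; real-world HTML would need a lenient parser in front of it.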

Re: dataimporter tika doesn't extract certain div

2013-09-04 Thread Andreas Owen
Or could I use a filter in schema.xml, where I define a fieldType and use some filter that understands XPath?

On 4. Sep 2013, at 11:52 AM, Shalin Shekhar Mangar wrote:
> No that wouldn't work. It seems that you probably need a custom Transformer to extract the right div content. I do not know if

Re: dataimporter tika doesn't extract certain div

2013-09-03 Thread Shalin Shekhar Mangar
I don't know much about Tika, but in the example data-config.xml that you posted, the xpath attribute on the field text won't work, because the xpath attribute is used only by an XPathEntityProcessor.

On Thu, Aug 29, 2013 at 10:20 PM, Andreas Owen a...@conx.ch wrote:
> I want tika to only index the

dataimporter tika doesn't extract certain div

2013-08-29 Thread Andreas Owen
I want Tika to index only the content in <div id="content">...</div> for the field text. Unfortunately it's indexing the whole page. Can't XPath do this?

data-config.xml:

<dataConfig>
<dataSource type="BinFileDataSource" name="data"/>
<dataSource type="BinURLDataSource" name="dataUrl"/>