So could I just nest it in an XPathEntityProcessor to filter the HTML, or is
there something like XPath for Tika?
<entity name="htm" processor="XPathEntityProcessor" url="${rec.file}"
        forEach="/div[@id='content']" dataSource="main">
  <entity name="tika" processor="TikaEntityProcessor"
No that wouldn't work. It seems that you probably need a custom
Transformer to extract the right div content. I do not know if
TikaEntityProcessor supports such a thing.
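For illustration, the div extraction that such a custom Transformer would have to perform can be sketched with the JDK's built-in XPath support. This is only a rough sketch, not a DIH Transformer: the class name DivContentExtractor and the method extractContent are made up for this example, the input is assumed to be well-formed XHTML (real pages often are not), and wiring the logic into the data import handler is left out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Solr or Tika.
public class DivContentExtractor {

    /**
     * Returns the trimmed text inside the first <div id="content">,
     * or an empty string if the page has no such div.
     * Assumes the input is well-formed XHTML.
     */
    public static String extractContent(String xhtml) throws Exception {
        // Parse the markup into a DOM tree.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Select the div by its id attribute, wherever it sits in the tree.
        Node div = (Node) XPathFactory.newInstance().newXPath()
                .evaluate("//div[@id='content']", doc, XPathConstants.NODE);

        // getTextContent() concatenates the text of all descendant nodes.
        return div == null ? "" : div.getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id=\"nav\">Menu</div>"
                + "<div id=\"content\"><p>Hello</p> <p>Solr</p></div>"
                + "</body></html>";
        // Prints only the text of the content div, without the nav text.
        System.out.println(extractContent(page));
    }
}
```

A real Transformer would apply something like this to the row value holding the page markup before the text field is filled in. For pages that are not well-formed, an HTML-tolerant parser would be needed in place of the plain XML parser assumed here.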
On Wed, Sep 4, 2013 at 12:38 PM, Andreas Owen a...@conx.ch wrote:
Or could I use a filter in schema.xml, where I define a fieldType and use some
filter that understands XPath?
On 4. Sep 2013, at 11:52 AM, Shalin Shekhar Mangar wrote:
I don't know much about Tika, but in the example data-config.xml that
you posted, the xpath attribute on the field "text" won't work,
because the xpath attribute is used only by XPathEntityProcessor.
On Thu, Aug 29, 2013 at 10:20 PM, Andreas Owen a...@conx.ch wrote:
I want Tika to only index the content in <div id="content">...</div> for the
field "text". Unfortunately, it's indexing the whole page. Can't XPath do this?
data-config.xml:
<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>