Re: dataimporter tika doesn't extract certain div
So could I just nest it in an XPathEntityProcessor to filter the HTML, or is there something like XPath for Tika?

<entity name="htm" processor="XPathEntityProcessor" url="${rec.file}" forEach="/div[@id='content']" dataSource="main">
  <entity name="tika" processor="TikaEntityProcessor" url="${htm}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html">
    <field column="text" />
  </entity>
</entity>

But now I don't know how to pass the text to Tika. What do I put in url and dataSource?

On 3. Sep 2013, at 5:56 PM, Shalin Shekhar Mangar wrote:
> I don't know much about Tika, but in the example data-config.xml that you posted, the xpath attribute on the text field won't work because the xpath attribute is used only by an XPathEntityProcessor.
Re: dataimporter tika doesn't extract certain div
No, that wouldn't work. You probably need a custom Transformer to extract the right div content. I do not know whether TikaEntityProcessor supports such a thing.

On Wed, Sep 4, 2013 at 12:38 PM, Andreas Owen a...@conx.ch wrote:
> So could I just nest it in an XPathEntityProcessor to filter the HTML, or is there something like XPath for Tika?
> [...]
> But now I don't know how to pass the text to Tika. What do I put in url and dataSource?

--
Regards,
Shalin Shekhar Mangar.
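One way the "custom Transformer" idea could be sketched is as a DIH ScriptTransformer function, the same mechanism the commented-out transformer=script:GenerateId in the original config points at. The function name extractContent is hypothetical. The sketch assumes that the text column of the tika entity holds the XHTML string Tika produced, that the div with id="content" survives in it (which is what htmlMapper=identity is meant to allow), and that a script transformer can be attached to a TikaEntityProcessor entity at all, which this thread does not confirm. It matches div tags by counting nesting depth with regular expressions rather than parsing the markup, so treat it as a starting point, not a robust HTML filter.

```javascript
// Hypothetical ScriptTransformer function: keeps only the markup of the
// first <div id="content"> found in the row's "text" column.
// Assumption: "text" holds the XHTML string produced by the Tika entity.
function extractContent(row) {
  var raw = row.get('text');
  if (raw == null) {
    return row;                       // nothing to filter
  }
  var html = String(raw);             // Java String -> JavaScript string

  // Locate the opening tag of the target div.
  var start = html.search(/<div\b[^>]*\sid\s*=\s*["']content["'][^>]*>/);
  if (start < 0) {
    return row;                       // div not found: leave the row unchanged
  }

  // Walk over every div tag from there, tracking nesting depth, until the
  // close tag that matches the target div is reached.
  var tag = /<(\/?)div\b[^>]*?(\/?)>/g;
  tag.lastIndex = start;
  var depth = 0;
  var m;
  while ((m = tag.exec(html)) !== null) {
    if (m[2] === '/') {
      continue;                       // self-closing <div/> does not nest
    }
    depth += (m[1] === '/') ? -1 : 1;
    if (depth === 0) {
      row.put('text', html.substring(start, tag.lastIndex));
      break;
    }
  }
  return row;
}
```

In DIH the function would go inside a script element (wrapped in CDATA) directly under dataConfig, and the tika entity would reference it with transformer="script:extractContent". The resulting value still contains the div's HTML tags; stripping them is a separate step.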
Re: dataimporter tika doesn't extract certain div
Or could I use a filter in schema.xml, where I define a fieldType and use some filter that understands XPath?

On 4. Sep 2013, at 11:52 AM, Shalin Shekhar Mangar wrote:
> No, that wouldn't work. You probably need a custom Transformer to extract the right div content. I do not know whether TikaEntityProcessor supports such a thing.
Re: dataimporter tika doesn't extract certain div
I don't know much about Tika, but in the example data-config.xml that you posted, the xpath attribute on the text field won't work because the xpath attribute is used only by an XPathEntityProcessor.

On Thu, Aug 29, 2013 at 10:20 PM, Andreas Owen a...@conx.ch wrote:
> I want Tika to index only the content in <div id="content">...</div> for the text field. Unfortunately it's indexing the whole page. Can't XPath do this?
> [...]

--
Regards,
Shalin Shekhar Mangar.
dataimporter tika doesn't extract certain div
I want Tika to index only the content in <div id="content">...</div> for the text field. Unfortunately it's indexing the whole page. Can't XPath do this?

data-config.xml:

<dataConfig>
  <dataSource type="BinFileDataSource" name="data"/>
  <dataSource type="BinURLDataSource" name="dataUrl"/>
  <dataSource type="URLDataSource" name="main"/>
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" dataSource="main">
      <!--transformer="script:GenerateId"-->
      <field column="title" xpath="//title" />
      <field column="id" xpath="//id" />
      <field column="file" xpath="//file" />
      <field column="path" xpath="//path" />
      <field column="url" xpath="//url" />
      <field column="Author" xpath="//author" />
      <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html">
        <field column="text" xpath="//div[@id='content']" />
      </entity>
    </entity>
  </document>
</dataConfig>