Re: dataimporter tika doesn't extract certain div

2013-09-04 Thread Andreas Owen
So could I just nest it in an XPathEntityProcessor to filter the HTML, or is
there something like XPath for Tika?

<entity name="htm" processor="XPathEntityProcessor" url="${rec.file}"
        forEach="/div[@id='content']" dataSource="main">
    <entity name="tika" processor="TikaEntityProcessor"
            url="${htm}" dataSource="dataUrl" onError="skip" htmlMapper="identity"
            format="html">
        <field column="text" />
    </entity>
</entity>

But now I don't know how to pass the text to Tika. What do I put in url and
dataSource?





Re: dataimporter tika doesn't extract certain div

2013-09-04 Thread Shalin Shekhar Mangar
No, that wouldn't work. It seems that you probably need a custom
Transformer to extract the right div content. I do not know if
TikaEntityProcessor supports such a thing.
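
The per-row work such a transformer would have to do can be sketched as
follows. This is Python purely for illustration: DIH transformers are
Java classes or script functions, and none of the names below
(DivTextExtractor, transform_row) are Solr or Tika APIs. It assumes the
row already holds the page markup as a string in the "text" column and
that non-void tags are balanced, as they would be in XHTML output.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DivTextExtractor(HTMLParser):
    """Collects the text found inside the first <div> with a given id."""

    def __init__(self, div_id):
        super().__init__(convert_charrefs=True)
        self.div_id = div_id
        self.depth = 0      # 0 means we are outside the target div
        self.done = False   # True once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            # Already inside the target div: track nesting.
            self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.div_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def transform_row(row, column="text", div_id="content"):
    """Replace row[column] (markup) with the plain text of the target div.

    If the div is not present, the column ends up as an empty string.
    """
    parser = DivTextExtractor(div_id)
    parser.feed(row.get(column) or "")
    parser.close()
    # Join the text chunks and collapse runs of whitespace.
    row[column] = " ".join(" ".join(parser.chunks).split())
    return row
```

The sketch uses a tolerant HTML parser rather than XPath because real
pages are often not well-formed XML; all other columns of the row are
left untouched.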


-- 
Regards,
Shalin Shekhar Mangar.


Re: dataimporter tika doesn't extract certain div

2013-09-04 Thread Andreas Owen
Or could I use a filter in schema.xml, where I define a field type that uses
some filter that understands XPath?




Re: dataimporter tika doesn't extract certain div

2013-09-03 Thread Shalin Shekhar Mangar
I don't know much about Tika, but in the example data-config.xml that
you posted, the xpath attribute on the field "text" won't work
because the xpath attribute is used only by an XPathEntityProcessor.




-- 
Regards,
Shalin Shekhar Mangar.


dataimporter tika doesn't extract certain div

2013-08-29 Thread Andreas Owen
I want Tika to index only the content in <div id="content">...</div> for the
field "text". Unfortunately it's indexing the whole page. Can't XPath do this?

data-config.xml:

<dataConfig>
    <dataSource type="BinFileDataSource" name="data"/>
    <dataSource type="BinURLDataSource" name="dataUrl"/>
    <dataSource type="URLDataSource" name="main"/>
    <document>
        <entity name="rec" processor="XPathEntityProcessor"
                url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc"
                dataSource="main"> <!--transformer="script:GenerateId"-->
            <field column="title" xpath="//title" />
            <field column="id" xpath="//id" />
            <field column="file" xpath="//file" />
            <field column="path" xpath="//path" />
            <field column="url" xpath="//url" />
            <field column="Author" xpath="//author" />

            <entity name="tika" processor="TikaEntityProcessor"
                    url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip"
                    htmlMapper="identity" format="html">
                <field column="text" xpath="//div[@id='content']" />
            </entity>
        </entity>
    </document>
</dataConfig>
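
For reference, the selection that the xpath on the "text" field is meant
to perform can be reproduced outside Solr. The following is only a
sketch in Python, not DIH configuration. It assumes the page is
available as well-formed XHTML in the http://www.w3.org/1999/xhtml
namespace, which is what format="html" output is expected to look like
but should be verified. The function name select_div_text is made up
for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: the markup is XHTML, so element names live in this namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def select_div_text(xhtml, div_id="content"):
    """Return the text of the first <div id=div_id>, or None if it is absent."""
    root = ET.fromstring(xhtml)
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    div = root.find(f".//x:div[@id='{div_id}']", XHTML_NS)
    if div is None:
        return None
    # itertext() walks the div and all of its descendants in document order.
    return " ".join("".join(div.itertext()).split())
```

Running this against a sample page is a quick way to check that the
expression matches the intended div before looking for a way to apply
the same filter inside the import.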