Or could I use a filter in schema.xml, where I define a fieldType and use some 
filter that understands XPath?

On 4. Sep 2013, at 11:52 AM, Shalin Shekhar Mangar wrote:

> No that wouldn't work. It seems that you probably need a custom
> Transformer to extract the right div content. I do not know if
> TikaEntityProcessor supports such a thing.
> 
> On Wed, Sep 4, 2013 at 12:38 PM, Andreas Owen <a...@conx.ch> wrote:
>> So could I just nest it in an XPathEntityProcessor to filter the HTML, or is 
>> there something like XPath for Tika?
>> 
>> <entity name="htm" processor="XPathEntityProcessor" url="${rec.file}" 
>> forEach="/div[@id='content']" dataSource="main">
>>                        <entity name="tika" processor="TikaEntityProcessor" 
>> url="${htm}" dataSource="dataUrl" onError="skip" htmlMapper="identity" 
>> format="html" >
>>                                <field column="text" />
>>                        </entity>
>>                </entity>
>> 
>> But now I don't know how to pass the text to Tika. What do I put in url and 
>> dataSource?
>> 
>> 
>> On 3. Sep 2013, at 5:56 PM, Shalin Shekhar Mangar wrote:
>> 
>>> I don't know much about Tika but in the example data-config.xml that
>>> you posted, the "xpath" attribute on the field "text" won't work
>>> because the xpath attribute is used only by an XPathEntityProcessor.
>>> 
>>> On Thu, Aug 29, 2013 at 10:20 PM, Andreas Owen <a...@conx.ch> wrote:
>>>> I want Tika to index only the content in <div id="content">...</div> for 
>>>> the field "text". Unfortunately it's indexing the whole page. Can't XPath 
>>>> do this?
>>>> 
>>>> data-config.xml:
>>>> 
>>>> <dataConfig>
>>>>       <dataSource type="BinFileDataSource" name="data"/>
>>>>       <dataSource type="BinURLDataSource" name="dataUrl"/>
>>>>       <dataSource type="URLDataSource" name="main"/>
>>>> <document>
>>>>       <entity name="rec" processor="XPathEntityProcessor" 
>>>> url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" 
>>>> dataSource="main"> <!--transformer="script:GenerateId"-->
>>>>               <field column="title" xpath="//title" />
>>>>               <field column="id" xpath="//id" />
>>>>               <field column="file" xpath="//file" />
>>>>               <field column="path" xpath="//path" />
>>>>               <field column="url" xpath="//url" />
>>>>               <field column="Author" xpath="//author" />
>>>> 
>>>>               <entity name="tika" processor="TikaEntityProcessor" 
>>>> url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" 
>>>> htmlMapper="identity" format="html" >
>>>>                       <field column="text" xpath="//div[@id='content']" />
>>>> 
>>>>               </entity>
>>>>       </entity>
>>>> </document>
>>>> </dataConfig>
>>> 
>>> 
>>> 
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
>> 
> 
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.
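
For the archive, here is a minimal sketch of the transformer route suggested 
above, written with DIH's ScriptTransformer rather than a compiled Transformer 
class. The function name extractContentDiv is made up for illustration. The 
sketch assumes that Tika (with htmlMapper="identity" and format="html") fills 
the "text" column with markup in which the target element appears literally as 
<div id="content" ...>. It has not been run against a real import.

```xml
<dataConfig>
  <!-- Hypothetical helper: keeps only the inner markup of <div id="content">. -->
  <script><![CDATA[
    function extractContentDiv(row) {
      var raw = row.get('text');
      if (raw == null) { return row; }
      var html = String(raw);                  // Java String to JS string
      var open = html.indexOf('<div id="content"');
      if (open < 0) { return row; }            // no such div: leave the row as is
      var gt = html.indexOf('>', open);
      if (gt < 0 || html.charAt(gt - 1) == '/') { return row; }
      var start = gt + 1;
      var depth = 1, pos = start, end = -1;
      // Walk forward, counting nested <div> tags until the matching </div>.
      while (depth > 0) {
        var nextOpen = html.indexOf('<div', pos);
        var nextClose = html.indexOf('</div>', pos);
        if (nextClose < 0) { return row; }     // unbalanced markup: give up
        if (nextOpen >= 0 && nextOpen < nextClose) {
          var tagEnd = html.indexOf('>', nextOpen);
          if (html.charAt(tagEnd - 1) != '/') { depth++; }  // skip <div/>
          pos = tagEnd + 1;
        } else {
          depth--;
          end = nextClose;
          pos = nextClose + 6;                 // 6 is the length of '</div>'
        }
      }
      row.put('text', html.substring(start, end));
      return row;
    }
  ]]></script>

  <!-- dataSource and outer "rec" entity definitions stay as in the original config -->
  <entity name="tika" processor="TikaEntityProcessor"
          url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip"
          htmlMapper="identity" format="html"
          transformer="script:extractContentDiv">
    <field column="text" />
  </entity>
</dataConfig>
```

This is a plain string scan, not XPath: it only finds the div when the id 
attribute comes first and is written exactly as above, and the '<div' prefix 
test would also count any other tag whose name starts with "div". A Java 
Transformer backed by a real HTML parser would be the more robust variant of 
the same idea.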
