Thanks, but the first suggestion is already implemented and the second didn't work.
I have also tried htmlMapper="identity", but that made no difference.

I also tried this, but the HTML was stripped in both fields:

<entity name="tika" processor="TikaEntityProcessor" url="${rec.urlParse}"
        dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
        transformer="HTMLStripTransformer">
    <field column="text" name="text" stripHTML="false" />
    <field column="text" name="text_nohtml" stripHTML="true" />

But in the end I think it's best to cut Tika out, because I'm not getting any
benefit from it. I would just need to get this to work:

        <field xpath="//h:h1" column="h_1" />
        <field column="text" xpath="/xhtml:html/xhtml:body" />

The fields are empty and I'm not getting any errors in the logs.
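
For reference, a minimal sketch of what a Tika-free setup might look like. It is
untested and rests on several assumptions: the data source name "htmlUrl" and
the entity name "page" are made up for illustration, the pages are assumed to
be well-formed XHTML (XPathEntityProcessor runs a streaming XML parser, not an
HTML parser), and the un-prefixed paths assume the parser matches plain element
names rather than namespace prefixes such as h: or xhtml:.

```xml
<!-- Hypothetical source name. XPathEntityProcessor reads a character stream,
     so this assumes a URLDataSource rather than a binary source. -->
<dataSource name="htmlUrl" type="URLDataSource" encoding="UTF-8" />

<!-- Assumed to be nested inside the existing "rec" entity, in place of the
     tika entity. -->
<entity name="page" processor="XPathEntityProcessor"
        url="${rec.urlParse}" dataSource="htmlUrl"
        forEach="/html" onError="skip">
    <!-- Un-prefixed element names: an assumption, not verified here. -->
    <field column="h_1" xpath="//h1" />
    <!-- flatten="true" collects the text under all child tags into one
         value, so this field would hold the body text without markup. -->
    <field column="text" xpath="/html/body" flatten="true" />
</entity>
```

If the fields stay empty with a setup like this, the forEach path and the field
xpath values are the first things to compare against the actual document root.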


On 28. Sep 2013, at 2:43 AM, Alexandre Rafalovitch wrote:

> This is a rather complicated example to chew through, but try the following
> two things:
> *) dataField="${tika.text}"  => dataField="text" (or less likely htmlMapper
> tika.text)
> You might be trying to read content of the field rather than passing
> reference to the field that seems to be expected. This might explain the
> exception.
> 
> *) It may help to be aware of
> https://issues.apache.org/jira/browse/SOLR-4530 . There is a new
> htmlMapper="identity" flag on Tika entries to ensure more of HTML structure
> passing through. By default, Tika strips out most of the HTML tags.
> 
> Regards,
>   Alex.
> 
> On Thu, Sep 26, 2013 at 5:17 PM, Andreas Owen <a...@conx.ch> wrote:
> 
>>                <entity name="tika" processor="TikaEntityProcessor"
>> url="${rec.urlParse}" dataSource="dataUrl" onError="skip" format="html">
>>                        <field column="text"/>
>> 
>>                        <entity name="detail" type="XPathEntityProcessor"
>> forEach="/html" dataSource="fld" dataField="${tika.text}" rootEntity="true"
>> onError="skip">
>>                                <field xpath="//h1" column="h_1" />
>>                        </entity>
>>                </entity>
>> 
> 
> 
> 
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
