Thanks, but the first suggestion is already implemented and the second didn't work. I have also tried htmlMapper="identity", but nothing worked.

I also tried this, but the HTML was stripped in both fields:

    <entity name="tika" processor="TikaEntityProcessor" url="${rec.urlParse}"
            dataSource="dataUrl" onError="skip" htmlMapper="identity"
            format="html" transformer="HTMLStripTransformer">
        <field column="text" name="text" stripHTML="false" />
        <field column="text" name="text_nohtml" stripHTML="true" />
    </entity>

In the end, though, I think it's best to cut Tika out, because I'm not getting any benefit from it. I would just need to get this to work:

    <field xpath="//h:h1" column="h_1" />
    <field column="text" xpath="/xhtml:html/xhtml:body" />

The fields are empty, and I'm not getting any errors in the logs.

On 28. Sep 2013, at 2:43 AM, Alexandre Rafalovitch wrote:

> This is a rather complicated example to chew through, but try the following
> two things:
> *) dataField="${tika.text}" => dataField="text" (or less likely htmlMapper
> tika.text)
> You might be trying to read content of the field rather than passing
> reference to the field that seems to be expected. This might explain the
> exception.
>
> *) It may help to be aware of
> https://issues.apache.org/jira/browse/SOLR-4530 . There is a new
> htmlMapper="identity" flag on Tika entries to ensure more of HTML structure
> passing through. By default, Tika strips out most of the HTML tags.
>
> Regards,
> Alex.
>
> On Thu, Sep 26, 2013 at 5:17 PM, Andreas Owen <a...@conx.ch> wrote:
>
>> <entity name="tika" processor="TikaEntityProcessor"
>> url="${rec.urlParse}" dataSource="dataUrl" onError="skip" format="html">
>> <field column="text"/>
>>
>> <entity name="detail" type="XPathEntityProcessor"
>> forEach="/html" dataSource="fld" dataField="${tika.text}" rootEntity="true"
>> onError="skip">
>> <field xpath="//h1" column="h_1" />
>> </entity>
>> </entity>
>>
>
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)