Re: HTMLStripTransformer not working with data importer

James Ostheimer Thu, 11 Mar 2010 09:07:14 -0800

Hi-

I can't seem to make any of the transformers work. I am using the
DataImporter to pull in data from a WordPress instance (see below). Neither
REGEX nor HTMLStrip seems to do anything to my content.


Do I have to include a separate jar with the transformers?  Are the
transformers in 1.4 (particularly the HTMLStrip)?
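
One variant I could still test is sketched here. It is an untested guess, not
a confirmed fix: it assumes the transformer looks up each row value by the
exact column name, so that "POST_CONTENT" in the field definition might not
match the lowercase post_content returned by the SELECT. The only change from
the config quoted below is the case of the two column attributes.

```xml
<dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost:3306/econetsm"
                user="*******" password="*******" />
    <document>
        <entity name="post" transformer="HTMLStripTransformer"
                query="SELECT id, post_content, post_title FROM elinstmkting_posts e"
                onError="abort">
            <!-- Assumption: column names are matched case-sensitively, so they
                 are written here exactly as they appear in the SELECT list. -->
            <field column="post_title" name="post_title" stripHTML="false" />
            <field column="post_content" name="post_content" stripHTML="true" />
        </entity>
    </document>
</dataConfig>
```

If the HTML is still indexed with this variant, the column case is not the
cause and the question about a missing jar still stands.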

James

On Wed, Mar 10, 2010 at 10:47 PM, James Ostheimer <[email protected]
> wrote:

> Hi-
>
> I am working on a contract to index some WordPress data. The posts of
> course have HTML in the content column, and I'd like to strip it out.
> Here is my data importer config:
>
> <dataConfig>
>     <dataSource driver="com.mysql.jdbc.Driver"
>                 url="jdbc:mysql://localhost:3306/econetsm"
>                 user="*******" password="*******" />
>     <document>
>         <entity name="post" transformer="HTMLStripTransformer"
>                 query="SELECT id, post_content, post_title FROM elinstmkting_posts e"
>                 onError="abort"
>                 deltaQuery="SELECT * FROM elinstmkting_posts e where post_modified_gmt > '${dataimporter.last_index_time}'">
>             <field column="POST_TITLE" name="post_title" stripHTML="false" />
>             <field column="POST_CONTENT" name="post_content" stripHTML="true" />
>         </entity>
>     </document>
> </dataConfig>
>
> This looks correct according to the wiki docs, but a search for "strong"
> (the <strong> tag) still matches, and HTML is returned in the field.
>
> I assume I am doing something stupid. I am using the latest stable
> Solr (1.4.0).
>
> Does it matter that the post data is not a complete HTML document (it
> doesn't have an <html> start tag or a <body> tag)?
>
> James
>
