[DIH] How to combine Regex and HTML transformers

Pulkit Singhal Thu, 15 Sep 2011 13:44:31 -0700

Hello,

I need to pull out the price and imageURL for products in an Amazon RSS feed.


PROBLEM STATEMENT:
The following:
            <field column="description"
                   xpath="/rss/channel/item/description"
                   />
            <field column="price"
                   regex=".*?\$(\d*.\d*)"
                   sourceColName="description"
                   />
            <field column="imageUrl"
                   regex=".*?img src=&quot;(.*?)&quot;.*"
                   sourceColName="description"
                   />
works, but I am left with HTML junk inside the description!

USELESS WORKAROUND:
If I strip the HTML from the data being fed into description, and
point price and imageUrl directly at the RSS feed field via xpath
instead of sourceColName, like so:
            <field column="description"
                   xpath="/rss/channel/item/description"
                   stripHTML="true"
                   />
            <field column="price"
                   regex=".*?\$(\d*.\d*)"
                   xpath="/rss/channel/item/description"
                   />
            <field column="imageUrl"
                   regex=".*?img src=&quot;(.*?)&quot;.*"
                   xpath="/rss/channel/item/description"
                   />
then this fails: only the last configured field in this list
(imageUrl) ends up with any imported data.
Is this a bug?

CRUX OF THE PROBLEM:
I then tried creating a field just to store the raw HTML, like so,
but this configuration yields no results for the description field,
so I'm back where I started:
            <field column="rawDescription"
                   xpath="/rss/channel/item/description"
                   />
            <field column="description"
                   regex=".*"
                   sourceColName="rawDescription"
                   stripHTML="true"
                   />
            <field column="price"
                   regex=".*?\$(\d*.\d*)"
                   sourceColName="rawDescription"
                   />
            <field column="imageUrl"
                   regex=".*?img src=&quot;(.*?)&quot;.*"
                   sourceColName="rawDescription"
                   />
I was suspicious of combining sourceColName with stripHTML to begin
with. I suppose I was hoping that the regex transformer would run
first and copy all the HTML as-is, and that the HTML transformer would
then strip it out afterwards, but this didn't work. Why? What else
can I do?
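
For context, here is a rough sketch of the kind of entity the field
snippets above would sit in. It is illustrative only: the entity name,
the url value, and the dataSource line are placeholders, not copied
from a real config. The one part that matters for the question is the
transformer attribute, since DIH applies the listed transformers in
the order they are named there:
            <!-- Sketch only: name, url and dataSource are placeholders -->
            <dataConfig>
              <dataSource type="URLDataSource"/>
              <document>
                <entity name="amazonFeed"
                        processor="XPathEntityProcessor"
                        url="http://example.com/feed.rss"
                        forEach="/rss/channel/item"
                        transformer="RegexTransformer,HTMLStripTransformer">
                  <!-- RegexTransformer is listed first, so it runs
                       before HTMLStripTransformer on each row -->
                  <!-- the field definitions shown above go here -->
                </entity>
              </document>
            </dataConfig>
With that ordering the regex fields should be evaluated before any
HTML stripping happens, so whether description ends up populated
presumably also depends on what the regex itself returns.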

Thanks!
- Pulkit
