Hello, I need to pull out the price and imageURL for products in an Amazon RSS feed.
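For reference, here is a minimal Python sketch of the result I am after, using patterns like those in the configuration below. The sample description string and the names (SAMPLE_DESCRIPTION, extract) are made up for illustration and are not taken from the actual feed, and the tag-stripping regex is only a rough stand-in for what stripHTML does:

```python
import re

# Hypothetical description payload; the real Amazon feed markup may differ.
SAMPLE_DESCRIPTION = (
    '<a href="http://example.com/item">'
    '<img src="http://example.com/item.jpg" /></a>'
    '<p>Widget</p><b>Price: $12.99</b>'
)

# Same idea as the regex attributes in the field definitions.
PRICE_RE = re.compile(r".*?\$(\d*\.\d*)")
IMAGE_RE = re.compile(r'.*?img src="(.*?)".*')

# Crude tag stripper, assumed here only to approximate stripHTML.
TAG_RE = re.compile(r"<[^>]+>")


def extract(description):
    """Return the three values wanted from one raw description string."""
    price = PRICE_RE.match(description)
    image = IMAGE_RE.match(description)
    plain = " ".join(TAG_RE.sub(" ", description).split())
    return {
        "price": price.group(1) if price else None,
        "imageUrl": image.group(1) if image else None,
        "description": plain,
    }


if __name__ == "__main__":
    print(extract(SAMPLE_DESCRIPTION))
```

The point is that all three values come from the same raw string: price and imageUrl need the HTML intact, while description needs it removed.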
PROBLEM STATEMENT: The following configuration works, but I am left with HTML junk inside the description:

    <field column="description" xpath="/rss/channel/item/description" />
    <field column="price" regex=".*?\$(\d*\.\d*)" sourceColName="description" />
    <field column="imageUrl" regex=".*?img src=&quot;(.*?)&quot;.*" sourceColName="description" />

USELESS WORKAROUND: If I strip the HTML from the data being fed into description, and instead point price and imageUrl directly at the xpath of the RSS feed field, like so:

    <field column="description" xpath="/rss/channel/item/description" stripHTML="true" />
    <field column="price" regex=".*?\$(\d*\.\d*)" xpath="/rss/channel/item/description" />
    <field column="imageUrl" regex=".*?img src=&quot;(.*?)&quot;.*" xpath="/rss/channel/item/description" />

then this fails: only the last configured field in the list (imageUrl) ends up with any data imported. Is this a bug?

CRUX OF THE PROBLEM: I also tried creating a field just to store the raw HTML, like so, but this configuration yields no results for the description field, so I'm back to where I started:

    <field column="rawDescription" xpath="/rss/channel/item/description" />
    <field column="description" regex=".*" sourceColName="rawDescription" stripHTML="true" />
    <field column="price" regex=".*?\$(\d*\.\d*)" sourceColName="rawDescription" />
    <field column="imageUrl" regex=".*?img src=&quot;(.*?)&quot;.*" sourceColName="rawDescription" />

I was suspicious of combining sourceColName with stripHTML to begin with. I suppose I was hoping that the regex transformer would run first and copy all the HTML as-is, which would then be stripped by the HTML transformer, but this didn't work. Why? What else can I do?

Thanks!
- Pulkit