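There is another way to ingest this data: use the DataImportHandler (DIH) and check out the HTMLStripTransformer. In the entity below, stripHTML="true" on the description field strips the HTML markup before indexing.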

<entity name="CDC"
        pk="link"
        datasource="filedatasource"
        url="http://www2c.cdc.gov/podcasts/createrss.asp?t=r&amp;c=19"
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        transformer="DateFormatTransformer,HTMLStripTransformer">
  <!-- channel-level fields; commonField="true" copies them onto every item record -->
  <field column="source" xpath="/rss/channel/title" commonField="true" />
  <field column="source-link" xpath="/rss/channel/link" commonField="true" />
  <field column="subject" xpath="/rss/channel/description" commonField="true" />
  <!-- item-level fields -->
  <field column="title" xpath="/rss/channel/item/title" />
  <field column="link" xpath="/rss/channel/item/link" />
  <!-- stripHTML="true" is handled by HTMLStripTransformer -->
  <field column="description" xpath="/rss/channel/item/description" stripHTML="true" />
  <field column="creator" xpath="/rss/channel/item/creator" />
  <field column="item-subject" xpath="/rss/channel/item/subject" />
  <field column="author" xpath="/rss/channel/item/author" />
  <field column="comments" xpath="/rss/channel/item/comments" />
  <!-- dateTimeFormat is handled by DateFormatTransformer -->
  <field column="pubdate" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, dd MMM yyyy HH:mm:sss z" />
  <field column="dcdate" xpath="/rss/channel/item/date" dateTimeFormat="yyyy-MM-dd'T'HH:mm:sss'Z'" />
  <field column="lat" xpath="/rss/channel/item/lat" />
  <field column="lng" xpath="/rss/channel/item/long" />
</entity>

On Fri, Jan 14, 2011 at 11:10 AM, arnaud gaudinat <arnaud.gaudi...@gmail.com> wrote:

> I just saw TagSoup, and it seems to clean up bad HTML tags to produce a
> well-formed HTML file.
> What BoilerPipe does is try to eliminate HTML content that is not part
> of the useful content for a human reader (i.e. navigation, ads,
> comments...).
> Take a look here: http://boilerpipe-web.appspot.com/ and try it with one of
> your URLs.
>
> Another application of this type is 'Readability', which is aimed more at
> end users (http://lab.arc90.com/experiments/readability/).
>
> On 14.01.2011 16:51, Adam Estrada wrote:
>
>> Is there a drastic difference between this and TagSoup, which is already
>> included in Solr?
>>
>> On Fri, Jan 14, 2011 at 6:57 AM, arnaud gaudinat
>> <arnaud.gaudi...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I would like to use BoilerPipe (a very good program that cleans surplus
>>> "clutter" out of HTML content).
>>> I saw that BoilerPipe is included in Tika 0.8 and so should be accessible
>>> from Solr, am I right?
>>>
>>> How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml
>>> (with org.apache.solr.handler.extraction.ExtractingRequestHandler)?
>>>
>>> Or do I need to modify some code inside Solr?
>>>
>>> I saw something like TikaCLI -F in the Tika forum
>>> (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration).
>>> Is that the right way?
>>>
>>> Thanks in advance,
>>>
>>> Arno.
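
Note on wiring: the CDC entity at the top of this message only runs once DIH itself is registered in solrconfig.xml. A minimal sketch of that registration follows. The handler path /dataimport and the file name data-config.xml are assumptions chosen for illustration; nothing in this thread states them.

```xml
<!-- solrconfig.xml: register the DataImportHandler.
     "/dataimport" and "data-config.xml" are assumed names, not taken from the thread. -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- file holding the <dataConfig> that contains the CDC entity -->
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

With a handler registered this way, an import is normally started by requesting the handler with command=full-import (for example /dataimport?command=full-import on the core in question).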