Re: HTML encode extracted docs - Problems with solr.HTMLStripCharFilter

2010-06-01 Thread Damian Bursztyn
> > -----Original Message----- > > From: Lance Norskog [mailto:goks...@gmail.com] > > Sent: 09 March 2010 04:36 > > To: solr-user@lucene.apache.org > > Subject: Re: HTML encode extracted docs > > > > A Tika integration with the DataIm

Re: HTML encode extracted docs - Problems with solr.HTMLStripCharFilter

2010-03-13 Thread Lance Norskog
> -----Original Message----- > From: Lance Norskog [mailto:goks...@gmail.com] > Sent: 09 March 2010 04:36 > To: solr-user@lucene.apache.org > Subject: Re: HTML encode extracted docs > > A Tika integration with the DataImportHandler is in the Solr trunk. > With this

RE: HTML encode extracted docs - Problems with solr.HTMLStripCharFilter

2010-03-09 Thread Mark Roberts
-user@lucene.apache.org Subject: Re: HTML encode extracted docs A Tika integration with the DataImportHandler is in the Solr trunk. With this, you can copy the raw HTML into different fields and process one copy with Tika. If it's just straight HTML, would the HTMLStripCharFilter be good

Re: HTML encode extracted docs

2010-03-08 Thread Lance Norskog
A Tika integration with the DataImportHandler is in the Solr trunk. With this, you can copy the raw HTML into different fields and process one copy with Tika. If it's just straight HTML, would the HTMLStripCharFilter be good enough? http://www.lucidimagination.com/search/document/CDRG_ch05_5.7.2
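For anyone unsure what "stripping" means for an HTML snippet, here is a rough Python sketch of the idea: drop the tags, decode the entities, keep the text. It is not Solr's HTMLStripCharFilter (which runs inside the analysis chain); `TextExtractor` and `strip_html` are made-up names for illustration, and the sketch does not special-case `<script>` or `<style>` content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper, not Solr code)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.parts.append(data)


def strip_html(fragment: str) -> str:
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b> &amp; more</p>"))
```

This works on fragments as well as full documents, which is the case the "include" files fall into.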

HTML encode extracted docs

2010-03-08 Thread Mark Roberts
I'm uploading .htm files to be extracted - some of these files are "include" files that have snippets of HTML rather than fully formed html documents. solr-cell stores the raw HTML for these items, rather than extracting the text. Is there any way I can get solr to encode this content prior to s
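One possible workaround, if the upload step is under your control, is to HTML-encode the snippet on the client before it is sent. This is only a sketch with Python's standard library; `encode_fragment` is a hypothetical name, and it is an assumption that client-side encoding fits the setup described above.

```python
import html


def encode_fragment(fragment: str) -> str:
    """Escape &, <, > and quotes so the markup is stored as literal text."""
    return html.escape(fragment, quote=True)


if __name__ == "__main__":
    print(encode_fragment('<div class="x">a & b</div>'))
```

The encoded string can be reversed later with `html.unescape` if the original markup is needed for display.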