> > -Original Message-
> > From: Lance Norskog [mailto:goks...@gmail.com]
> > Sent: 09 March 2010 04:36
> > To: solr-user@lucene.apache.org
> > Subject: Re: HTML encode extracted docs
> >
A Tika integration with the DataImportHandler is in the Solr trunk.
With this, you can copy the raw HTML into different fields and process
one copy with Tika.
If it's just straight HTML, would the HTMLStripCharFilter be good enough?
http://www.lucidimagination.com/search/document/CDRG_ch05_5.7.2
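If preprocessing the files on the client before upload is an option, something like the following might be enough for plain HTML snippets. This is only a rough sketch in Python using the standard library, not Solr's HTMLStripCharFilter and not anything solr-cell does itself. The names strip_html, encode_html and TextExtractor are made up for illustration.

```python
from html import escape
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags; tolerates partial HTML snippets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_html(snippet: str) -> str:
    """Return the text content of an HTML snippet with whitespace collapsed.

    Assumption: text runs are joined with a space, so a word that is split
    by inline markup (e.g. 'wor<b>ld</b>') would come out as two words.
    """
    parser = TextExtractor()
    parser.feed(snippet)
    parser.close()  # flushes any text still buffered by the parser
    return " ".join(" ".join(parser.parts).split())


def encode_html(snippet: str) -> str:
    """Return the snippet with &, <, > and quotes replaced by entities."""
    return escape(snippet, quote=True)


if __name__ == "__main__":
    include_file = '<td class="x">Price: <b>10</b> &amp; up</td>'
    print(strip_html(include_file))
    print(encode_html(include_file))
```

The stripped or encoded string would then be sent to Solr in place of the raw file content. Which of the two is appropriate depends on whether the markup needs to be recoverable later.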
I'm uploading .htm files to be extracted - some of these files are "include"
files that have snippets of HTML rather than fully formed HTML documents.
solr-cell stores the raw HTML for these items, rather than extracting the text.
Is there any way I can get solr to encode this content prior to s