Re: Indexing HTML

Thierry Collogne Mon, 27 Aug 2007 06:35:28 -0700

I think you can use the HTMLStripWhitespaceTokenizerFactory, which strips the
HTML markup from the field value before tokenizing it on whitespace.

Look here:


http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e
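
As a rough sketch of how that tokenizer could be wired into schema.xml: the
field type name "html_text" below is made up for illustration, and the "line"
field is taken from your sample document. I have not tested this against your
setup, so treat it as a starting point only.

```xml
<!-- Goes in the <types> section of schema.xml.
     "html_text" is a hypothetical name; pick whatever suits your schema. -->
<fieldType name="html_text" class="solr.TextField">
  <analyzer>
    <!-- Strips HTML markup, then splits the remaining text on whitespace -->
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Goes in the <fields> section of schema.xml.
     stored="true" keeps the value exactly as it was submitted;
     only the indexed terms go through the analyzer above. -->
<field name="line" type="html_text" indexed="true" stored="true"/>
```

Since the stored value is kept separately from the analyzed terms, the
original line, tags included, should still come back in search results.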

I hope this helps.


On 27/08/07, Michael Kimsal <[EMAIL PROTECTED]> wrote:
>
> Hello
>
> I'm trying to index individual lines of an HTML file, and I'm hitting this
> error:
>
> TEXT must be immediately followed by END_TAG and not START_TAG
>
> I've got something that looks like
>
> <add>
> <doc>
> <field name="id">4</field>
> <field name="line"><a href="foobar"><b><i>linktext</i></b></a></field>
> </doc>
> </add>
>
> Actually, that sample code above, as its own data file POSTed to SOLR,
> throws
>
> parser must be on START_TAG or TEXT to read text (position: START_TAG seen
> ...<field name="line"><a href="foobar">... @4:37
>
> as an error.
>
> Any clues as to how I can do this?  I'd like to keep the original copy of
> each line intact in the index.
>
> Thanks!
>
> --
> Michael Kimsal
> http://webdevradio.com
>
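
About the parser errors themselves: they look like the XML parser treating the
<a>, <b> and <i> tags as child elements of <field> rather than as text. A
minimal sketch of the same document with the HTML escaped, assuming nothing
else about your setup, would be:

```xml
<add>
  <doc>
    <field name="id">4</field>
    <!-- The markup is escaped, so the XML parser reads it as plain text -->
    <field name="line">&lt;a href="foobar"&gt;&lt;b&gt;&lt;i&gt;linktext&lt;/i&gt;&lt;/b&gt;&lt;/a&gt;</field>
  </doc>
</add>
```

Wrapping the value in a CDATA section, <![CDATA[ ... ]]>, should work just as
well as long as the HTML itself never contains "]]>". Either way the field
value arrives as the original HTML string.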
