The intention was to remove scripts.
I developed HTMLStripReader by just looking at a bunch of real-world HTML.
I hadn't run across an uppercase SCRIPT tag, so I didn't do a
case-insensitive check.
The code is currently:
if (name.equals("script") || name.equals("style")) {
Should be easy enough to change unless there is a good reason not to.
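
A minimal sketch of what that change could look like, assuming the tag name
is available as a plain String. The class and method names here (TagCheck,
isIgnoredTag) are hypothetical and are not part of HTMLStripReader:

```java
final class TagCheck {
    // Hypothetical helper: true when the content of this tag should be dropped.
    // equalsIgnoreCase makes "SCRIPT", "Script" and "script" all match, and
    // putting the literal first avoids a NullPointerException on a null name.
    static boolean isIgnoredTag(String name) {
        return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
    }

    public static void main(String[] args) {
        System.out.println(isIgnoredTag("SCRIPT")); // uppercase tag from the report
        System.out.println(isIgnoredTag("style"));
        System.out.println(isIgnoredTag("body"));
    }
}
```

String.equalsIgnoreCase avoids the locale pitfalls of lowercasing the name
first with toLowerCase().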
-Yonik
On Thu, Apr 10, 2008 at 5:05 AM, Walter Ferrara <[EMAIL PROTECTED]> wrote:
> I've noticed that passing HTML to a field that uses
> HTMLStripWhitespaceTokenizerFactory leaves some JavaScript in the output too.
> For example, using an analyzer like:
> <fieldType name="HTMLStripper2" class="solr.TextField" >
>   <analyzer>
>     <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
>
> with a text such as:
> <html>
> <head><title>title</title></head>
> <body>
> pre
> <SCRIPT LANGUAGE="JavaScript">
> var time = new Date();
> ordval= (time.getTime());
> </SCRIPT>
> post <!-- comment -->
> </body>
> </html>
>
> Analysis.jsp produces these tokens:
> title
> pre
> var
> time
> =
> new
> Date();
> ordval=
> (time.getTime());
> post
>
> If the script in the page is commented out, however, everything works fine.
> Is this a design choice? Shouldn't scripts be removed in both cases?
> (Solr Implementation Version: 2008-03-24_09-57-01 ${svnversion} - hudson -
> 2008-03-24 09:59:40)
>
> Walter
>
>