Hi, Tim.

I think whitelisting the content type from the meta tag can be a solution. We
can whitelist "text/html" + options (like "text/html; charset=...") and
"application/xhtml+xml" + options. That way, users who have a valid value
(text/html or application/xhtml+xml in <meta http-equiv="Content-Type" ...>)
will get the same behavior as in 1.7 and won't end up with some weird
content type when the HTML meta tag is invalid.

On the other hand, we could provide a configuration switch to choose the
preferred Content-Type source for the html/xhtml parsers. But
content-type whitelisting is still a good idea, IMHO.
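
To make the whitelisting idea concrete, here is a rough sketch in plain Java.
It is not Tika code: the class name MetaContentTypeFilter and the methods
baseType and choose are hypothetical names made up for illustration, and the
sketch assumes we only compare the base media type and ignore the options.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika: decides whether a Content-Type taken
// from <meta http-equiv="Content-Type"> may override the detected type.
final class MetaContentTypeFilter {

    // Base types we are willing to accept from the meta tag.
    private static final Set<String> ALLOWED = new HashSet<String>(
            Arrays.asList("text/html", "application/xhtml+xml"));

    // Returns the base media type (options dropped, lower-cased),
    // or null if the value is missing or empty.
    static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        base = base.trim().toLowerCase(Locale.ROOT);
        return base.isEmpty() ? null : base;
    }

    // Returns the meta tag value if its base type is whitelisted,
    // otherwise falls back to what the detector reported.
    static String choose(String detected, String metaValue) {
        String base = baseType(metaValue);
        if (base != null && ALLOWED.contains(base)) {
            return metaValue.trim();
        }
        return detected;
    }
}
```

With something like this, a meta value of "text/html; charset=iso-8859-1"
would be kept as in 1.7, while "application/pdf" or
"anythingIFeelLikeInserting" would be ignored in favor of the detected type.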

-- 
Best regards,
Konstantin Gribov

Thu, 9 Apr 2015 at 19:03, Allison, Timothy B. <talli...@mitre.org>:

> I just finished the run against govdocs1 with 1.7 vs. 1.8-rc1, and all looks
> good, with one major change... at first glance.
>
> Because of my "fix" on TIKA-1519 and the law of unintended consequences,
> files that start like so:
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
>
> have different Content-Type(s) between the two versions.
>
> In Tika 1.7, they used to have a Content-Type of: text/html;
> charset=iso-8859-1
>
> In Tika 1.8-rc1, they now have a Content-Type of: application/xhtml+xml
>
> This is a major change.
>
> Do we want this?
>
> Or do we want to revert to the old behavior but add some kind of filter
> to prevent crazy Content-Type information like the following from
> overwriting what the detector detected:
> <meta http-equiv="Content-Type" content="application/pdf" />
> or
> <meta http-equiv="Content-Type" content="anythingIFeelLikeInserting" />
>
> -----Original Message-----
> From: David Meikle [mailto:loo...@gmail.com]
> Sent: Wednesday, April 08, 2015 8:06 PM
> To: dev@tika.apache.org
> Subject: Re: [VOTE] Release Apache Tika 1.8 Candidate #1
>
> Hey Tyler,
>
> > On 7 Apr 2015, at 19:54, Tyler Palsulich <tpalsul...@apache.org> wrote:
> >
> > [ ] +1 Release this package as Apache Tika 1.8
> > [ ] -1 Do not release this package because...
>
> Whilst my testing with the release is good so far on Mac and Linux with
> Windows to go, and I am inclined to +1, it would be good if you were able
> to get your code signing key signed by someone nearby to avoid the warning
> below?
>
> amadeaus-air:release david$ gpg --verify tika-1.8-src.zip.asc
> gpg: Signature made Tue  7 Apr 19:45:15 2015 EDT using RSA key ID D4F10117
> gpg: Good signature from "Tyler Palsulich <tpalsul...@apache.org>"
> gpg: WARNING: This key is not certified with a trusted signature!
> gpg:          There is no indication that the signature belongs to the
> owner.
> Primary key fingerprint: 1D32 9CC2 D69C 821B FBE4  183E 8810 BB19 D4F1 0117
>
> Not sure if Chris, Lewis et al. are near you and could do this quickly?
>
> Cheers,
> Dave
>
