Hi Tim,

I think whitelisting the Content-Type taken from the meta tag could be a solution. We can whitelist "text/html" plus options (like "text/html; charset=...") and "application/xhtml+xml" plus options. That way, users whose documents have a valid value (text/html or application/xhtml+xml in <meta http-equiv="Content-Type" ...>) get the same behavior as in 1.7, and nobody ends up with a weird Content-Type when the HTML meta tag is invalid.
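
To make the whitelisting idea concrete, here is a minimal sketch in plain Java. It is not Tika code: the class name MetaContentTypeWhitelist and the methods isAllowed and choose are hypothetical, and it assumes the only thing that matters is the base type before the first ';', compared case-insensitively.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika: decides whether a Content-Type
// value found in <meta http-equiv="Content-Type"> may override the
// type that the detector reported.
final class MetaContentTypeWhitelist {

    // Base types accepted from the meta tag; parameters such as
    // "; charset=..." are allowed after them.
    private static final Set<String> ALLOWED = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("text/html", "application/xhtml+xml")));

    private MetaContentTypeWhitelist() {
    }

    // Returns true if the base type (the part before the first ';')
    // is on the whitelist. Null and unknown values are rejected.
    static boolean isAllowed(String metaContentType) {
        if (metaContentType == null) {
            return false;
        }
        int semicolon = metaContentType.indexOf(';');
        String base = semicolon < 0
                ? metaContentType
                : metaContentType.substring(0, semicolon);
        return ALLOWED.contains(base.trim().toLowerCase(Locale.ROOT));
    }

    // Keeps the meta value when it is whitelisted; otherwise falls back
    // to whatever the detector reported.
    static String choose(String detectedType, String metaContentType) {
        return isAllowed(metaContentType) ? metaContentType.trim() : detectedType;
    }
}
```

With this sketch, a meta value of "text/html; charset=iso-8859-1" is kept as in 1.7, while "application/pdf" or "anythingIFeelLikeInserting" is ignored and the detected type is used instead.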
On the other hand, we could provide a configuration switch to choose which Content-Type source the html/xhtml parsers prefer. But Content-Type whitelisting is still a good idea, IMHO.

-- 
Best regards,
Konstantin Gribov

Thu, 9 Apr 2015 at 19:03, Allison, Timothy B. <talli...@mitre.org>:

> I just finished the against govdocs1 with 1.7 vs. 1.8-rc1, and all looks
> good with one major change... on first glance.
>
> Because of my "fix" on TIKA-1519 and the law of unintended consequences,
> files that start like so:
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "
> http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
>
> Have different Content-Type(s) between the
>
> In Tika 1.7, they used to have a Content-Type of: text/html;
> charset=iso-8859-1
>
> In Tika 1.8-rc1, they now have a Content-Type of: application/xhtml+xml
>
> This is a major change.
>
> Do we want this?
>
> Or do we want to revert to the old behavior but add some kind of filter
> to prevent crazy Content-Type information like the following from
> overwriting what the detector detected:
> <meta http-equiv="Content-Type" content="application/pdf" />
> or
> <meta http-equiv="Content-Type" content="anythingIFeelLikeInserting" />
>
> -----Original Message-----
> From: David Meikle [mailto:loo...@gmail.com]
> Sent: Wednesday, April 08, 2015 8:06 PM
> To: dev@tika.apache.org
> Subject: Re: [VOTE] Release Apache Tika 1.8 Candidate #1
>
> Hey Tyler,
>
> > On 7 Apr 2015, at 19:54, Tyler Palsulich <tpalsul...@apache.org> wrote:
> >
> > [ ] +1 Release this package as Apache Tika 1.8
> > [ ] -1 Do not release this package because...
>
> Whilst my testing with the release is good so far on Mac and Linux with
> Windows to go, and I am inclined to +1, it would be good if you were able
> to get your code signing key signed by someone nearby to avoid the warning
> below?
>
> amadeaus-air:release david$ gpg --verify tika-1.8-src.zip.asc
> gpg: Signature made Tue 7 Apr 19:45:15 2015 EDT using RSA key ID D4F10117
> gpg: Good signature from "Tyler Palsulich <tpalsul...@apache.org>"
> gpg: WARNING: This key is not certified with a trusted signature!
> gpg: There is no indication that the signature belongs to the owner.
> Primary key fingerprint: 1D32 9CC2 D69C 821B FBE4 183E 8810 BB19 D4F1 0117
>
> Not sure if Chris, Lewis et al are near you and do this quickly?
>
> Cheers,
> Dave
>