Re: [ANNOUNCE] Welcome Giuseppe Totaro As Tika Committer + PMC Member

2015-04-09 Thread Mattmann, Chris A (3980)
Yah Giuseppe! ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.go

[ANNOUNCE] Welcome Giuseppe Totaro As Tika Committer + PMC Member

2015-04-09 Thread David Meikle
Hello All, Please welcome Giuseppe Totaro as he joins us as the latest Tika committer and PMC Member. He's recently been VOTEd in and now has his account all set up so is ready to roll! Giuseppe, please feel free to say a bit about yourself as an introduction to the group. Welcome aboard, Da

[jira] [Commented] (TIKA-1519) Don't allow whatever is in http-equiv Content-Type to overwrite actual Content-Type in HtmlParser

2015-04-09 Thread Konstantin Gribov (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487708#comment-14487708 ] Konstantin Gribov commented on TIKA-1519: It can be either refinement or not. E.g.

[jira] [Commented] (TIKA-1519) Don't allow whatever is in http-equiv Content-Type to overwrite actual Content-Type in HtmlParser

2015-04-09 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487696#comment-14487696 ] Ken Krugler commented on TIKA-1519: After thinking about this more, I don't think it's a

[jira] [Commented] (TIKA-1519) Don't allow whatever is in http-equiv Content-Type to overwrite actual Content-Type in HtmlParser

2015-04-09 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487673#comment-14487673 ] Tim Allison commented on TIKA-1519: In the above example, would we want the Content-Type

[jira] [Reopened] (TIKA-1519) Don't allow whatever is in http-equiv Content-Type to overwrite actual Content-Type in HtmlParser

2015-04-09 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-1519: The fix for this led to ignoring valid encoding detection when the overall id was identified as {{applicat

RE: [VOTE] Release Apache Tika 1.8 Candidate #1

2015-04-09 Thread Ken Krugler
> From: Allison, Timothy B. > Sent: April 9, 2015 9:02:44am PDT > To: dev@tika.apache.org > Subject: RE: [VOTE] Release Apache Tika 1.8 Candidate #1 > > I just finished the run against govdocs1 with 1.7 vs. 1.8-rc1, and all looks good > with one major change... on first glance. > > Because of my "f

Re: [VOTE] Release Apache Tika 1.8 Candidate #1

2015-04-09 Thread Konstantin Gribov
Hi, Tim. I think whitelisting on the content-type from the meta tag can be a solution. We can whitelist "text/html" + options (like "text/html; charset=...") and "application/xhtml+xml" + options. So users who had valid (text/html or application/xhtml+xml in ) will have the same behavior as it was in 1.7 a
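
A minimal sketch of the whitelisting idea in the message above, written in Python for brevity (Tika itself is Java). The names ALLOWED_META_TYPES and meta_content_type_allowed are hypothetical and are not part of Tika's API; the sketch assumes the check looks only at the base media type and ignores options such as charset.

```python
# Hypothetical whitelist of base media types accepted from a meta http-equiv
# Content-Type value. Illustrative only; not Tika's actual implementation.
ALLOWED_META_TYPES = {"text/html", "application/xhtml+xml"}


def meta_content_type_allowed(value: str) -> bool:
    """Return True if the base media type of `value` is on the whitelist.

    Options after the first ';' (for example 'charset=UTF-8') are ignored.
    """
    base = value.split(";", 1)[0].strip().lower()
    return base in ALLOWED_META_TYPES
```

Under these assumptions, a value such as "text/html; charset=UTF-8" passes the check, while an unrelated type such as "application/pdf" found in a meta tag would be ignored rather than allowed to override the detected type.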

RE: [VOTE] Release Apache Tika 1.8 Candidate #1

2015-04-09 Thread Allison, Timothy B.
For those who want to take a look at the reports (much more work is needed on processing stack traces for SORT_STACK_TRACE): https://github.com/tballison/share/blob/master/tika_comparisons/tika_1_7_v_1_8-rc1.zip

RE: [VOTE] Release Apache Tika 1.8 Candidate #1

2015-04-09 Thread Allison, Timothy B.
I just finished the run against govdocs1 with 1.7 vs. 1.8-rc1, and all looks good with one major change... on first glance. Because of my "fix" on TIKA-1519 and the law of unintended consequences, files that start like so: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";> http://www.w3.org

[jira] [Commented] (TIKA-1599) Switch from TagSoup to JSoup

2015-04-09 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487012#comment-14487012 ] Julien Nioche commented on TIKA-1599: - FWIW we've just added a JSoup based parser to s