Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Luís Filipe Nassif
From a forensic use case it is better just to say we are trying another parser and not reset the content handler, because the first parser can extract relevant content before the exception. To avoid spooling everything to temp files to re-read the stream, I think we can create an optional

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Chris Mattmann
I think we should just say, OK now we're trying a different parser. On 2/5/18, 9:51 AM, "Allison, Timothy B." wrote: To my mind, the real challenge is what to do with content that should be ignored... If the strategy is back-off-on-exception (try the

RE: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Allison, Timothy B.
On the metadata stuff, I'm coming around to Ray Gauss's proposal. I wanted too much back then, and his solution is super elegant, IIRC. -Original Message- From: Nick Burch [mailto:apa...@gagravarr.org] Sent: Monday, February 5, 2018 11:37 AM To: dev@tika.apache.org Subject: Re:

RE: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Allison, Timothy B.
To my mind, the real challenge is what to do with content that should be ignored... If the strategy is back-off-on-exception (try the DOCX parser, but if there's an exception, use the Zip parser), what do we do with the sax elements that have already been written? Do we need a new handler
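
To make the back-off-on-exception question concrete, here is a minimal Java sketch; it is not Tika code. `SketchParser`, `FallbackSketch`, and the `keepPartial` flag are hypothetical names invented for illustration. A `StringBuilder` stands in for a SAX content handler, and the input is assumed to fit in memory as a byte array, so each attempt re-reads from the start rather than spooling to a temp file. Each attempt writes to its own buffer, and the flag decides whether output written before an exception is kept (the forensic preference raised in this thread) or discarded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Hypothetical stand-in for a parser; this is NOT Tika's Parser interface.
interface SketchParser {
    void parse(InputStream in, StringBuilder out) throws Exception;
}

final class FallbackSketch {
    // Tries each parser in order, each against a fresh stream over the same bytes.
    // Output from a failed attempt is kept only when keepPartial is true.
    static String parseWithFallback(byte[] data, List<SketchParser> parsers,
                                    boolean keepPartial) {
        StringBuilder result = new StringBuilder();
        Exception last = null;
        for (SketchParser parser : parsers) {
            StringBuilder attempt = new StringBuilder(); // per-attempt buffer
            try (InputStream in = new ByteArrayInputStream(data)) {
                parser.parse(in, attempt);
                return result.append(attempt).toString();
            } catch (Exception e) {
                last = e;
                if (keepPartial) {
                    result.append(attempt); // retain content written before the failure
                }
            }
        }
        throw new IllegalStateException("all parsers failed", last);
    }

    public static void main(String[] args) {
        // Simulates a parser that emits some content and then fails.
        SketchParser failing = (in, out) -> {
            out.append("partial;");
            throw new IOException("simulated parse failure");
        };
        // Simulates a simpler fallback parser (readAllBytes needs Java 9+).
        SketchParser working = (in, out) -> out.append("bytes=").append(in.readAllBytes().length);
        byte[] data = {1, 2, 3};
        System.out.println(parseWithFallback(data, List.of(failing, working), false)); // prints "bytes=3"
        System.out.println(parseWithFallback(data, List.of(failing, working), true));  // prints "partial;bytes=3"
    }
}
```

With real SAX output the per-attempt buffer would have to record events rather than text, which is where the "new handler" question comes in; the sketch only shows the control flow.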

RE: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Allison, Timothy B.
Spool to temp file? -Original Message- From: Mattmann, Chris A (1761) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Monday, February 5, 2018 12:29 PM To: dev@tika.apache.org Subject: Re: Not-yet-broken breaking changes for Tika 2? Our solution is just to run the parser 2x... yes I get it

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Mattmann, Chris A (1761)
Our solution is just to run the parser 2x... yes I get it will induce overhead, but as a start, why not? In short just run through the stream 2x ++ Chris Mattmann, Ph.D. Associate Chief Technology and Innovation Officer,

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Nick Burch
On Mon, 5 Feb 2018, Chris Mattmann wrote: Let's have a go at implementing it! You know my thoughts (make it like OODT ;) ) I'm still keen to hear how we can do the text content like OODT! I have tried to copy the OODT model for the proposed metadata case though :) Nick On 2/5/18, 8:37

[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-05 Thread NW Brad (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352666#comment-16352666 ] NW Brad commented on TIKA-2562: --- Thanks. I'll take a look at it. It definitely looks like the same issue,

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Chris Mattmann
Let's have a go at implementing it! You know my thoughts (make it like OODT ;) ) On 2/5/18, 8:37 AM, "Nick Burch" wrote: Ping - anyone got any thoughts on the proposed metadata parser stuff, and any ideas on the content part? On Tue, 2 Jan 2018, Nick

Re: relying on a non-Maven central repo?

2018-02-05 Thread Chris Mattmann
I think we can't merge this b/c it references an external repository: https://blog.sonatype.com/2010/03/why-external-repos-are-being-phased-out-of-central/ https://blog.sonatype.com/2009/02/why-putting-repositories-in-your-poms-is-a-bad-idea/ Before it can be merged it needs to be uploaded to

Re: relying on a non-Maven central repo?

2018-02-05 Thread Chris Mattmann
Hmmm...the problem here is that Sonatype won't let us publish to Central with the below. It's not even an ASF policy thing - it's a Sonatype thing On 2/5/18, 5:55 AM, "Allison, Timothy B." wrote: Sorry for the duplication, but I wanted to check on this and didn't want

[jira] [Updated] (TIKA-2563) Extract embedded objects in HTML and javascript

2018-02-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2563: -- Summary: Extract embedded objects in HTML and javascript (was: Extract embedded files in HTML) >

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Nick Burch
Ping - anyone got any thoughts on the proposed metadata parser stuff, and any ideas on the content part? On Tue, 2 Jan 2018, Nick Burch wrote: On Thu, 26 Oct 2017, Chris Mattmann wrote: On collision, the precedence order defines what key takes precedence and _overwrites_ the other. Overwrite
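
One possible reading of precedence-based overwriting, sketched in Java. `MetadataMergeSketch` and `merge` are hypothetical names, a plain `Map<String, String>` stands in for Tika's metadata object, and the convention that later sources have higher precedence is an assumption, since the proposal text is cut off here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MetadataMergeSketch {
    // Sources are assumed to be ordered from lowest to highest precedence,
    // so a later source overwrites an earlier one when keys collide.
    static Map<String, String> merge(List<Map<String, String>> sourcesLowToHigh) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> source : sourcesLowToHigh) {
            merged.putAll(source);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> fromFirstParser = Map.of("title", "Draft", "author", "alice");
        Map<String, String> fromSecondParser = Map.of("title", "Final");
        Map<String, String> merged = merge(List.of(fromFirstParser, fromSecondParser));
        System.out.println(merged.get("title"));  // prints "Final"
        System.out.println(merged.get("author")); // prints "alice"
    }
}
```

Keys that appear in only one source survive unchanged; only colliding keys are overwritten.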

[jira] [Created] (TIKA-2566) Move logging in tika-core to log4j via slf4j as we do in the rest of Tika

2018-02-05 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2566: - Summary: Move logging in tika-core to log4j via slf4j as we do in the rest of Tika Key: TIKA-2566 URL: https://issues.apache.org/jira/browse/TIKA-2566 Project: Tika

[jira] [Closed] (TIKA-2083) Tika 2.0 - Audit master branch against 2.x branch

2018-02-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-2083. - Resolution: Fixed Current plan is to use 2.x branch as a model, to redo [~bobpaulin]'s awesome work on

[jira] [Updated] (TIKA-2085) Tika 2.0.0 -- Overarching task list for what we need to do before 2.0.0

2018-02-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2085: -- Summary: Tika 2.0.0 -- Overarching task list for what we need to do before 2.0.0 (was: Tika 2.0 --

[jira] [Updated] (TIKA-1983) Tika 2.0 - remove tika-app's legacy server

2018-02-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1983: -- Issue Type: Sub-task (was: Task) Parent: TIKA-2085 > Tika 2.0 - remove tika-app's legacy server

[jira] [Commented] (TIKA-2564) Tika client cannot extract files from embedded archive formats

2018-02-05 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352596#comment-16352596 ] Hudson commented on TIKA-2564: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1431 (See

RE: relying on a non-Maven central repo?

2018-02-05 Thread Allison, Timothy B.
Thank you, Nick! That was my memory, but it was hazy. I can't quickly figure out where that is documented...any pointers? Or should we look to document it via a LEGAL issue or somewhere else? -Original Message- From: Nick Burch [mailto:apa...@gagravarr.org] Sent: Monday, February 5,

[jira] [Commented] (TIKA-1983) Tika 2.0 - remove tika-app's legacy server

2018-02-05 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352513#comment-16352513 ] Hudson commented on TIKA-1983: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1430 (See

[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352515#comment-16352515 ] Tim Allison commented on TIKA-2562: --- Thank you for looking into this.  IIUC, [~rgauss] offers a way to

[jira] [Resolved] (TIKA-2564) Tika client cannot extract files from embedded archive formats

2018-02-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2564. --- Resolution: Fixed Fix Version/s: 2.0.0 1.18 Thank you for opening this! >

[jira] [Assigned] (TIKA-2564) Tika client cannot extract files from embedded archive formats

2018-02-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reassigned TIKA-2564: - Assignee: Tim Allison > Tika client cannot extract files from embedded archive formats >

Re: relying on a non-Maven central repo?

2018-02-05 Thread Nick Burch
On Mon, 5 Feb 2018, Allison, Timothy B. wrote: Sorry for the duplication, but I wanted to check on this and didn't want it to get lost in a github comment. Fellow devs on Apache Tika, are we ok with relying on a non-Maven central repo? Nope. ASF policy is that we can only rely on maven

[jira] [Resolved] (TIKA-1983) Tika 2.0 - remove tika-app's legacy server

2018-02-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1983. --- Resolution: Fixed Assignee: Tim Allison Fix Version/s: 2.0.0 Fixed on {{master}}. >

[jira] [Reopened] (TIKA-1983) Tika 2.0 - remove tika-app's legacy server

2018-02-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison reopened TIKA-1983: --- This was done on the initial 2.x branch.  We need to redo it on master. > Tika 2.0 - remove tika-app's

relying on a non-Maven central repo?

2018-02-05 Thread Allison, Timothy B.
Sorry for the duplication, but I wanted to check on this and didn't want it to get lost in a github comment. >Fellow devs on Apache Tika, are we ok with relying on a non-Maven central repo? -Original Message- From: ASF GitHub Bot (JIRA) [mailto:j...@apache.org] Sent: Monday, February

[jira] [Commented] (TIKA-2490) Turn off stderr warnings in Tika-app

2018-02-05 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352398#comment-16352398 ] Tim Allison commented on TIKA-2490: --- +1 We are currently using the jul logger in tika-core for this very

[jira] [Commented] (TIKA-2565) Upgrade edu.ucar dependencies to 4.6.11

2018-02-05 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352395#comment-16352395 ] ASF GitHub Bot commented on TIKA-2565: -- tballison commented on a change in pull request #218: