For a forensic use case, it is better to just say that we are trying another
parser and not reset the content handler, because the first parser can
extract relevant content before the exception.
To not spool everything to temp files to re-read the stream, I think we can
create an optional
I think we should just say, OK now we're trying a different parser
On 2/5/18, 9:51 AM, "Allison, Timothy B." wrote:
To my mind, the real challenge is what to do with content that should be
ignored...
If the strategy is back-off-on-exception (try the
On the metadata stuff, I'm coming around to Ray Gauss's proposal. I wanted too
much back then, and his solution is super elegant, IIRC.
-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Monday, February 5, 2018 11:37 AM
To: dev@tika.apache.org
Subject: Re:
To my mind, the real challenge is what to do with content that should be
ignored...
If the strategy is back-off-on-exception (try the DOCX parser, but if there's
an exception, use the Zip parser), what do we do with the sax elements that
have already been written? Do we need a new handler
Spool to temp file?
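
One way to picture the back-off-on-exception question above is a handler that buffers events and only forwards them if the parser finishes. This is a minimal sketch, not Tika code: SketchParser, BufferingHandler and FallbackSketch are hypothetical names made up for illustration, the input is held as a byte[] to sidestep the stream re-read problem, and only characters() events are buffered to keep it short. The SAX types are the standard org.xml.sax ones from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical parser interface for this sketch; not Tika's Parser API. */
interface SketchParser {
    void parse(byte[] data, ContentHandler handler) throws Exception;
}

/** Records character events so they can be replayed later or thrown away. */
final class BufferingHandler extends DefaultHandler {
    private final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        chunks.add(new String(ch, start, length));
    }

    /** Forwards every buffered chunk to the real handler. */
    void replayTo(ContentHandler target) throws SAXException {
        for (String chunk : chunks) {
            char[] c = chunk.toCharArray();
            target.characters(c, 0, c.length);
        }
    }
}

final class FallbackSketch {
    /** Tries each parser in order; only a parser that completes writes to out. */
    static void parseWithFallback(byte[] data, ContentHandler out,
                                  List<SketchParser> parsers) throws Exception {
        Exception last = null;
        for (SketchParser parser : parsers) {
            BufferingHandler buffer = new BufferingHandler();
            try {
                parser.parse(data, buffer);
                buffer.replayTo(out); // success: flush the buffered events
                return;
            } catch (Exception e) {
                last = e; // failure: the partial output in buffer is dropped
            }
        }
        if (last != null) {
            throw last;
        }
        throw new IllegalArgumentException("no parsers supplied");
    }
}
```

The trade-off is that partial output from a failed parser is discarded. For the forensic case raised elsewhere in this thread, the same loop could write straight to out and simply note that a different parser is being tried.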
-----Original Message-----
From: Mattmann, Chris A (1761) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Monday, February 5, 2018 12:29 PM
To: dev@tika.apache.org
Subject: Re: Not-yet-broken breaking changes for Tika 2?
Our solution is just to run the parser 2x. Yes, I get that it will induce
overhead, but as a start, why not?
In short just run through the stream 2x
++
Chris Mattmann, Ph.D.
Associate Chief Technology and Innovation Officer,
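
A rough sketch of one reading of the "run through the stream 2x" suggestion above: spool the input to a temp file once, then hand each pass its own fresh stream. StreamPass and TwoPassSketch are hypothetical names for illustration only, and what each pass actually does (which parser, which handler) is left open because the thread does not settle it. Only standard java.nio.file calls are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical callback standing in for one parser pass over the stream. */
interface StreamPass {
    void run(InputStream in) throws IOException;
}

final class TwoPassSketch {
    /** Spools source to a temp file, then runs both passes over fresh streams. */
    static void runTwice(InputStream source, StreamPass first, StreamPass second)
            throws IOException {
        Path spool = Files.createTempFile("two-pass-", ".bin");
        try {
            // createTempFile already created the file, so overwrite it
            Files.copy(source, spool, StandardCopyOption.REPLACE_EXISTING);
            try (InputStream in = Files.newInputStream(spool)) {
                first.run(in);
            }
            try (InputStream in = Files.newInputStream(spool)) {
                second.run(in);
            }
        } finally {
            Files.deleteIfExists(spool); // always clean up the spool file
        }
    }
}
```

The overhead mentioned above shows up here as one full copy plus two full reads of the input.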
On Mon, 5 Feb 2018, Chris Mattmann wrote:
Let's have a go at implementing it! You know my thoughts (make it like
OODT ;) )
I'm still keen to hear how we can do the text content like OODT!
I have tried to copy the OODT model for the proposed metadata case though
:)
Nick
On 2/5/18, 8:37
[
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352666#comment-16352666
]
NW Brad commented on TIKA-2562:
---
Thanks. I'll take a look at it. It definitely looks like the same issue,
Let's have a go at implementing it! You know my thoughts (make it like OODT ;) )
On 2/5/18, 8:37 AM, "Nick Burch" wrote:
Ping - anyone got any thoughts on the proposed metadata parser stuff, and
any ideas on the content part?
On Tue, 2 Jan 2018, Nick
I think we can't merge this b/c it references an external repository:
https://blog.sonatype.com/2010/03/why-external-repos-are-being-phased-out-of-central/
https://blog.sonatype.com/2009/02/why-putting-repositories-in-your-poms-is-a-bad-idea/
Before it can be merged it needs to be uploaded to
Hmmm...the problem here is that Sonatype won't let us publish to Central with
the below. It's not even an ASF policy thing - it's a Sonatype thing
On 2/5/18, 5:55 AM, "Allison, Timothy B." wrote:
Sorry for the duplication, but I wanted to check on this and didn't want
[
https://issues.apache.org/jira/browse/TIKA-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-2563:
--
Summary: Extract embedded objects in HTML and javascript (was: Extract
embedded files in HTML)
>
Ping - anyone got any thoughts on the proposed metadata parser stuff, and
any ideas on the content part?
On Tue, 2 Jan 2018, Nick Burch wrote:
On Thu, 26 Oct 2017, Chris Mattmann wrote:
On collision, the precedence order defines what key takes precedence and
_overwrites_ the other. Overwrite
Tim Allison created TIKA-2566:
-
Summary: Move logging in tika-core to log4j via slf4j as we do in
the rest of Tika
Key: TIKA-2566
URL: https://issues.apache.org/jira/browse/TIKA-2566
Project: Tika
[
https://issues.apache.org/jira/browse/TIKA-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison closed TIKA-2083.
-
Resolution: Fixed
Current plan is to use 2.x branch as a model, to redo [~bobpaulin]'s awesome
work on
[
https://issues.apache.org/jira/browse/TIKA-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-2085:
--
Summary: Tika 2.0.0 -- Overarching task list for what we need to do before
2.0.0 (was: Tika 2.0 --
[
https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-1983:
--
Issue Type: Sub-task (was: Task)
Parent: TIKA-2085
> Tika 2.0 - remove tika-app's legacy server
[
https://issues.apache.org/jira/browse/TIKA-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352596#comment-16352596
]
Hudson commented on TIKA-2564:
--
SUCCESS: Integrated in Jenkins build Tika-trunk #1431 (See
Thank you, Nick! That was my memory, but it was hazy. I can't quickly figure
out where that is documented...any pointers? Or should we look to document it
via a LEGAL issue or somewhere else?
-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Monday, February 5,
[
https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352513#comment-16352513
]
Hudson commented on TIKA-1983:
--
SUCCESS: Integrated in Jenkins build Tika-trunk #1430 (See
[
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352515#comment-16352515
]
Tim Allison commented on TIKA-2562:
---
Thank you for looking into this. IIUC, [~rgauss] offers a way to
[
https://issues.apache.org/jira/browse/TIKA-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison resolved TIKA-2564.
---
Resolution: Fixed
Fix Version/s: 2.0.0
1.18
Thank you for opening this!
>
[
https://issues.apache.org/jira/browse/TIKA-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison reassigned TIKA-2564:
-
Assignee: Tim Allison
> Tika client cannot extract files from embedded archive formats
>
On Mon, 5 Feb 2018, Allison, Timothy B. wrote:
Sorry for the duplication, but I wanted to check on this and didn't want
it to get lost in a github comment.
Fellow devs on Apache Tika, are we ok with relying on a non-Maven central repo?
Nope. ASF policy is that we can only rely on maven
[
https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison resolved TIKA-1983.
---
Resolution: Fixed
Assignee: Tim Allison
Fix Version/s: 2.0.0
Fixed on {{master}}.
>
[
https://issues.apache.org/jira/browse/TIKA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison reopened TIKA-1983:
---
This was done on the initial 2.x branch. We need to redo it on master.
> Tika 2.0 - remove tika-app's
Sorry for the duplication, but I wanted to check on this and didn't want it to
get lost in a github comment.
>Fellow devs on Apache Tika, are we ok with relying on a non-Maven central repo?
-----Original Message-----
From: ASF GitHub Bot (JIRA) [mailto:j...@apache.org]
Sent: Monday, February
[
https://issues.apache.org/jira/browse/TIKA-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352398#comment-16352398
]
Tim Allison commented on TIKA-2490:
---
+1 We are currently using the jul logger in tika-core for this very
[
https://issues.apache.org/jira/browse/TIKA-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352395#comment-16352395
]
ASF GitHub Bot commented on TIKA-2565:
--
tballison commented on a change in pull request #218: