Re: Call for Microsoft OneNote experts for help on OneNote parsing in Tika

2019-11-27 Thread Nick Burch
On Sun, 24 Nov 2019, Nicholas DiPiazza wrote: Basically I just need some help understanding some of the finer details of the OneNote format and how to extract info from it. https://stackoverflow.com/questions/59008205/onenote-parsing-how-to-get-to-the-text-blobs-in-the-document https://stackover

Re: Feature to extract duration of an AMR file

2020-01-27 Thread Nick Burch
On Mon, 27 Jan 2020, Saurabh Bhardwaj wrote: Currently, Tika is able to figure out whether given file is AMR file or not but doesn't return one of the most useful information for an AMR file i.e. its duration. Generally that means we have mime-magic for detection, but don't have a parser for

Re: Tika master branch not building

2020-04-06 Thread Nick Burch
On Mon, 6 Apr 2020, Eric Pugh wrote: Maybe this needs better documentation, however this is a “works as designed” feature! To avoid the build failing, run mvn package -Dossindex.fail=false Should we maybe have this set to false by default, and only enabled on release builds? (We shouldn't

Re: Issue with > 200% CPU after bulk usage

2020-04-16 Thread Nick Burch
On Wed, 15 Apr 2020, hans.mei...@avident-it.se wrote: I have encountered an issue with Tika running locally on a box that the Java runtime goes up to over 200% CPU, after running a bulk load of documents over a couple of days, it is more than 3 million documents. Can you do a thread dump to sh

Re: Fwd: New mailing list queued for creation: corpora-...@tika.apache.org

2020-06-04 Thread Nick Burch
On Thu, 4 Jun 2020, Tim Allison wrote: Following guidance from https://issues.apache.org/jira/browse/INFRA-20376, I've requested a corpora-...@tika.apache.org mail list. If we need separate user/private, we can request those. Let me know. I don't think we need user or private at this stage -

Re: Problem in resolving tika parser in Gradle projects

2020-06-05 Thread Nick Burch
On Thu, 4 Jun 2020, Dupinder Singh wrote: My project is Gradle-based, so I was trying to resolve the build as you described in your documentation, but this is not resolving the dependency. dependencies { runtime 'org.apache.tika:tika-parsers:1.24.1' } That looks like it ought to be fine, e

Mime type magic and repeated similar blocks - thoughts?

2020-06-09 Thread Nick Burch
Hi All At the moment, to detect RFC822 emails, we try and check for a bunch of common header lines right at the start. If not, we check for a few "could be an unusual header, could be some text", followed by checking for common headers in a larger area of text below. For example, starts with
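
The two-stage check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Tika's actual mime-magic definitions: the class name `Rfc822Sniffer`, the header list, and the 1024-byte window are all hypothetical choices made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the two-stage idea; not Tika's real detection logic.
final class Rfc822Sniffer {
    // Assumed list of "common" headers, chosen for illustration only.
    private static final List<String> COMMON_HEADERS = List.of(
            "from:", "to:", "subject:", "date:", "received:",
            "message-id:", "return-path:", "mime-version:");

    // Assumed size of the "larger area of text" that gets scanned.
    private static final int WINDOW = 1024;

    // True if the line begins with one of the common header names.
    static boolean startsWithCommonHeader(String line) {
        String lower = line.toLowerCase(Locale.ROOT);
        for (String header : COMMON_HEADERS) {
            if (lower.startsWith(header)) {
                return true;
            }
        }
        return false;
    }

    // True if the line has the shape "Name: value", where Name is made of
    // printable ASCII without spaces. Could be an unusual header, could be text.
    static boolean looksLikeHeader(String line) {
        int colon = line.indexOf(':');
        if (colon <= 0) {
            return false;
        }
        for (int i = 0; i < colon; i++) {
            char c = line.charAt(i);
            if (c <= 32 || c >= 127) {
                return false;
            }
        }
        return true;
    }

    static boolean looksLikeEmail(byte[] data) {
        int len = Math.min(data.length, WINDOW);
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        String[] lines = head.split("\r?\n");
        if (lines.length == 0) {
            return false;
        }
        // Stage 1: a common header right at the start.
        if (startsWithCommonHeader(lines[0])) {
            return true;
        }
        // Stage 2: an ambiguous first line, confirmed by a common header below it.
        if (!looksLikeHeader(lines[0])) {
            return false;
        }
        for (int i = 1; i < lines.length; i++) {
            if (startsWithCommonHeader(lines[i])) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] sample = "X-Custom: 1\r\nSubject: hi\r\n\r\nbody"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeEmail(sample));
    }
}
```

The sketch accepts input whose first line is a common header, and also input whose first line merely looks like a header as long as a common header follows within the window. Text that just happens to start with a "Name: value" line is rejected if no common header follows.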

Re: Request for access to edit the ASF Tika wiki

2020-06-22 Thread Nick Burch
On Mon, 22 Jun 2020, Vegard Stikbakke wrote: I would like to update outdated installation instructions here: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR Specifically, installation on Mac. So I'm kindly requesting access to edit! Can you please create yourself an account on our con

Re: Should we add Apache Commons Lang to tika-core as a dependency?

2020-08-03 Thread Nick Burch
On Mon, 3 Aug 2020, Peter Lee wrote: I'm working with TIKA-3141 recently and pushed a PR in github. As Keith suggested in the PR, maybe we should add Commons Lang to tika-core, as it seems Commons Lang are being used elsewhere in tika but not tika-core. Historically, we have tried to keep the

Expected private/secret keys in the source (TIKA-3205)

2020-09-29 Thread Nick Burch
Hey All Just a quick heads-up that for TIKA-3205 I generated a few new small private keys (RSA, DSA, EC) and added them to the parser test documents folder, for unit testing the new mime magics for keys and certificates. They're not protecting or using anything. One automated security scanni

Re: Fwd: XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory handle it?

2020-10-10 Thread Nick Burch
On Fri, 9 Oct 2020, Tim Allison wrote: Do you think we should follow up on the Tika side? Do we know if we can handle this? I thought we did, but checking POIFSContainerDetector I can't actually see that case covered I think we (Tika) can handle it in a similar way to CompObj Over on

Re: Fwd: XLSX wrapped in an OLE2 CompObj/Package - should WorkbookFactory handle it?

2020-10-13 Thread Nick Burch
On Tue, 13 Oct 2020, Tim Allison wrote: Ha, y, this file exercises those bits of code: https://github.com/apache/tika/blob/main/tika-parser-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testPPT_oleWorkbook.ppt Nick, does this match the features of the SO question? Yup,

Re: Tika 1.25 release date?

2020-10-21 Thread Nick Burch
On Wed, 21 Oct 2020, Alexander Klimetschek wrote: Regarding xmpcore: I would love to help but it‘s a different department :-) If you can use internal contacts to find the people we need to prod / lobby / smile at, that'd be a big help! And/or if you can try to bribe that team with sending cho

xmpDM:duration - units?

2020-11-18 Thread Nick Burch
Hi All This question prompted by https://stackoverflow.com/q/64888488/685641 Is there / should there be fixed units on the xmpDM:duration metadata property? And if so, what? Currently, MP3 seems to use milliseconds via http://tika.apache.org/1.24.1/api/org/apache/tika/parser/mp3/AudioFrame.h

Re: xmpDM:duration - units?

2020-11-19 Thread Nick Burch
tter to me, even though it's a breaking change. Thoughts? Nick On Wed, Nov 18, 2020 at 3:26 PM Nick Burch wrote: Hi All This question prompted by https://stackoverflow.com/q/64888488/685641 Is there / should there be fixed units on the xmpDM:duration metadata property? And if so, w

Re: Tika 2.0.0-ALPHA?

2020-11-30 Thread Nick Burch
On Mon, 30 Nov 2020, Tim Allison wrote: Now that 1.25 is released, I'm going to work on refactoring tika-eval and tika-server shortly. Then add back in the osgi bundle. After that, shall we go with 2.0.0-ALPHA? Seems ok to me, assuming you're happy to do the work! :) Thanks Nick

Re: xmpDM:duration - units?

2020-11-30 Thread Nick Burch
On Thu, 19 Nov 2020, Nick Burch wrote: On Thu, 19 Nov 2020, Tim Allison wrote: Looks like 'scale' needs to be taken into consideration? See 1.2.6.9 https://www.adobe.com/content/dam/acom/en/devnet/xmp/pdfs/XMPSDKReleasecc-2020/XMPSpecificationPart2.pdf Ah, yes, check the spec
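
As a rough illustration of why 'scale' matters: if a duration is carried as an integer value plus a rational scale giving seconds per unit (my reading of the XMP Time type, which should be checked against the spec linked above), then the duration in seconds is the product of the two. The class and method names below are hypothetical and not part of Tika.

```java
// Hypothetical helper, not a Tika API. Assumes the scale is a rational
// written as "numerator/denominator" and expresses seconds per unit of value.
final class XmpDuration {
    static double toSeconds(long value, String scale) {
        String[] parts = scale.trim().split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected numerator/denominator: " + scale);
        }
        double numerator = Double.parseDouble(parts[0].trim());
        double denominator = Double.parseDouble(parts[1].trim());
        if (denominator == 0.0) {
            throw new IllegalArgumentException("Scale denominator must not be zero");
        }
        return value * numerator / denominator;
    }

    public static void main(String[] args) {
        // A value counted in milliseconds corresponds to a scale of 1/1000.
        System.out.println(toSeconds(90000L, "1/1000"));
    }
}
```

Under this reading, a bare number without a scale is ambiguous, which is the units question raised earlier in the thread.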

Re: site?

2021-01-18 Thread Nick Burch
On Mon, 18 Jan 2021, Tim Allison wrote: I did only minimal updates to our site so that there's still mostly info about 1.25, javadocs, etc. are still 1.25. I want to make it clear that that is the "production" release. If desired, I can do the full suite of updates for 2.0.0-ALPHA. Let me kn

Re: load error handler in TikaConfig for 2.x?

2021-02-09 Thread Nick Burch
On Mon, 8 Feb 2021, Tim Allison wrote: Do we still need the LoadErrorHandler for TikaConfig 2.x? IIRC, we added that so that folks who didn't want a dependency could prevent the loading of the dependency and then silence complaints -- if set to ignore. Would we just swap to throwing an Excepti

Re: load error handler in TikaConfig for 2.x?

2021-02-09 Thread Nick Burch
On Tue, 9 Feb 2021, Tim Allison wrote: Would we just swap to throwing an Exception if a parser can't be found / loaded? Y, that'd be my inclination. Seems ok to me what do we do if someone gives us a Tika Config that references a Parser that doesn't exist? My preference would be to throw

FW: OSS-Fuzz integration

2021-03-05 Thread Nick Burch
Hi All For those who don't follow dev@commons, there's yet another fuzzing tool on the block! Details below. Looks pretty neat, and is now being used on a few Apache Commons projects, including Commons Compress which we use What do people think about more fuzzing? Worth doing? Or just too muc

Re: high level parser module names in 2.x

2021-03-10 Thread Nick Burch
On Tue, 9 Mar 2021, Tim Allison wrote: Would this be better? tika-parsers-basic tika-parsers-complex tika-parsers-¯\_(ツ)_/¯ GStreamer has 4 levels of plugins, Base, Good, Ugly and Bad. Descriptions of what qualifies for what at https://gstreamer.freedesktop.org/modules/ . I can see developer

Re: [DISCUSS] Support Elasticsearch in the tika-pipes module?

2021-07-27 Thread Nick Burch
On Mon, 26 Jul 2021, Tim Allison wrote: Currently the OpenSearch emitter works with the 7.x version of Elasticsearch. Going forward, when the projects diverge: a) do we want to support Elasticsearch and I think we should try, but I'm not sure if it should be "we = Apache Tika" or "we = Tika

Kaitai - might be worth trying for new formats

2021-08-09 Thread Nick Burch
Hi All I came across Kaitai - http://kaitai.io/ - yesterday. Based on the experiences documented in this twitter thread on understanding + parsing an embedded filesystem: https://twitter.com/wrongbaud/status/1424380510671880198 Looks like it might be worth a look for if we need to write our

Re: versions?

2021-08-11 Thread Nick Burch
On Wed, 11 Aug 2021, Tim Allison wrote: A) I think we should maintain the 1.x branch and continue to put out bug fixes for a bit. Any objections to nominally calling the next release 1.27.1 on JIRA at least? I agree we should probably try to keep 1.x going for at least a few months, to allow

Re: [DISCUSS] upgrading log4j to to log4j2 in Tika's 1.x branch

2021-12-15 Thread Nick Burch
On Wed, 15 Dec 2021, Tim Allison wrote: I think we should keep the 1.x branch open for security upgrades for a bit...middle of next year? I have _not_ been adding new features or even some bug fixes to 1.x, and I encourage people to migrate to 2.x. We've seen quite a few queries from people s

Re: [DISCUSS] upgrading log4j to to log4j2 in Tika's 1.x branch

2021-12-15 Thread Nick Burch
On Wed, 15 Dec 2021, Tim Allison wrote: Sounds good, Nick. Unless there are objections, I'll add an EOL September 30, 2022 for the 1.x branch on our github README and maybe our site somewhere? Maybe just mention it in the news section at the end of any 1.x fix releases? Nick

Re: wiki editor access request

2022-01-07 Thread Nick Burch
On Fri, 7 Jan 2022, Josh Burchard wrote: I wrote to Tim about making a small update to https://cwiki.apache.org/confluence/display/TIKA/TikaServerEndpointsCompared and he suggested that I email this dev list to see if someone could grant me editor access. Is that a possibility? Can you sign up

Re-implementing tika-dgn-detector in Tika itself - any objections?

2022-04-22 Thread Nick Burch
Hi Steven Over on https://issues.apache.org/jira/browse/TIKA-3721, one of our users alerted us to your tika-dgn-detector github project. If possible, we'd like to fold the detector logic and mime type definitions into Tika itself. (Converting it to Java in the process and putting the detecto

Re: Re-implementing tika-dgn-detector in Tika itself - any objections?

2022-04-22 Thread Nick Burch
Maven, so if it's a problem, feel free to change or ignore. Cheers On Fri, 22 Apr 2022 at 11:57, Nick Burch wrote: Hi Steven Over on https://issues.apache.org/jira/browse/TIKA-3721, one of our users alerted us to your tika-dgn-detector github project. If possible, we'd like to fold the

Re: Issue related to file mime type detection

2022-09-09 Thread Nick Burch
On Fri, 9 Sep 2022, Sindhu Mahadevappa wrote: We are using tika-parsers 1.23 Tika 1.23 was released in December 2019! You should really use something much more recent for comparing uploaded file mime type from file name as well as from file content for security purpose. Apache Tika's dete

RE: Issue related to file mime type detection

2022-09-15 Thread Nick Burch
On Thu, 15 Sep 2022, Sindhu Mahadevappa wrote: We have been looking for the latest Tika 2.4.1 jar file, looks like it is not available anywhere. You can get the Tika App and Tika Server jars for 2.4.1 from https://tika.apache.org/download.html For the core and parser jars, manually downloading

Re: GUI mods?

2022-09-24 Thread Nick Burch
On Sat, 24 Sep 2022, Tim Allison wrote: Given that this is greenfields, should I start w javafx or stick w swing or is there another framework I should try? Give the Tika Server an optional snazzy web UI, then wrap it as an electron app for people who want a native program to start? (plus avoi

Re: GUI mods?

2022-09-25 Thread Nick Burch
On Sat, 24 Sep 2022, Tim Allison wrote: Electron and which framework? I'd say there's two choice mechanisms. One is to pick whatever most excites you / is likely to look best on your next funding application, and say that since you're doing most of the initial work you can choose! The othe

Possibly speeding up tests with Gradle - anyone interested?

2022-10-05 Thread Nick Burch
Hi All At ApacheCon this week, a Bob and myself ended up chatting with the folks from Gradle, who are keen to help ASF projects, and are discussing with the Infra team. The easier bit - they think they might be able to help speed up our maven build, especially the running of tests. Anyone ha

Re: Possibly speeding up tests with Gradle - anyone interested?

2022-10-06 Thread Nick Burch
On Wed, 5 Oct 2022, Oleg Tikhonov wrote: Honestly I am trying to port our project to Gradle. But it is not going well. It is a good idea. If some folks can help, we can do it together. Apparently Gradle Enterprise works with both Gradle and Maven! So we don't even have to change our build - https://

Re: Possibly speeding up tests with Gradle - anyone interested?

2022-10-06 Thread Nick Burch
On Wed, 5 Oct 2022, Nicholas DiPiazza wrote: Are they offering the Gradle Build Cache stuff free for apache projects? There's an announcement at ApacheCon in about an hour... I think the Infra team are still working out the details on how it'll all work. However, there's an additional offer

Re: Possibly speeding up tests with Gradle - anyone interested?

2022-10-06 Thread Nick Burch
On Thu, 6 Oct 2022, Tim Allison wrote: Happy to chat. Please put them in touch. Excellent, thanks Tim! Other than your past talks, have we got any info (eg on the wiki?) about how to run the regression corpus? I've been really impressed with what the POI team has done migrating from ant to

Re: idea about creation of accounts

2023-03-13 Thread Nick Burch
On Mon, 13 Mar 2023, Nicholas DiPiazza wrote: can we require that the request form for creating a jira account contains the first issue they would like to create? You'd need to ask on users@infra about that, it's an ASF wide thing (to avoid a huge spam problem) and not something our project cu

Re: Document chunking

2024-04-08 Thread Nick Burch
On Mon, 8 Apr 2024, Tim Allison wrote: Not sure we should jump on the bandwagon, but anything we can do to support smart chunking would benefit us. Could just be more integrations with parsers that turn out to be useful. I haven’t had much joy with some. Here’s one that I haven’t evaluated yet:

Re: junk cves -- rant

2024-04-12 Thread Nick Burch
On Thu, 11 Apr 2024, Tim Allison wrote: I just excluded joda-time because of this: CVE-2024-23080 https://nvd.nist.gov/vuln/detail/CVE-2024-23080 This is an NPE in joda-time version 2.12.5. That's two versions before the current...is it actually still in there. And more importantly, an NPE is no

Re: Copilot license for open source?

2024-04-21 Thread Nick Burch
On Fri, 19 Apr 2024, Nicholas DiPiazza wrote: Can I get an open source license for GitHub copilot? I've not heard of anyone offering that. Some of the open and open-ish models are quite good on coding tasks, though you'd need to hop to a different interface to ask for help (unlike the in-line

Re: Copilot license for open source?

2024-04-22 Thread Nick Burch
On Sun, 21 Apr 2024, Michael Wechner wrote: Thanks for the pointer to the Generative Tooling rules, which I was not aware of so far. At the bottom it says that the ASF does not tell developers what tools to use, but I think it would be useful to have some concrete examples, which w

[jira] [Commented] (TIKA-1521) Handle password protected 7zip files

2015-01-23 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289605#comment-14289605 ] Nick Burch commented on TIKA-1521: -- All unit tests (including that one) pass just fin

[jira] [Commented] (TIKA-1530) MP4Parser parses duration but does not set it

2015-01-26 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292029#comment-14292029 ] Nick Burch commented on TIKA-1530: -- Any chance you could look at something like the

[jira] [Resolved] (TIKA-1530) MP4Parser parses duration but does not set it

2015-01-27 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1530. -- Resolution: Fixed Fix Version/s: 1.8 The MP3 and Vorbis parsers already had duration support, so

[jira] [Commented] (TIKA-1521) Handle password protected 7zip files

2015-01-27 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293927#comment-14293927 ] Nick Burch commented on TIKA-1521: -- The test should be working fine as-is, without

[jira] [Commented] (TIKA-1532) DIF Parser

2015-01-28 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295498#comment-14295498 ] Nick Burch commented on TIKA-1532: -- For the mimetype part, do you have a small sa

[jira] [Commented] (TIKA-1329) Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser

2015-01-28 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295800#comment-14295800 ] Nick Burch commented on TIKA-1329: -- Website still needs updating - just use the sni

[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-30 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298416#comment-14298416 ] Nick Burch commented on TIKA-1511: -- Few minor things on Tim's github branch

[jira] [Commented] (TIKA-1532) DIF Parser

2015-01-30 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298418#comment-14298418 ] Nick Burch commented on TIKA-1532: -- {{text/plain}} is no good, we need to have a un

[jira] [Commented] (TIKA-1536) Upgrade compiler definition in pom's to Java 7

2015-02-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300768#comment-14300768 ] Nick Burch commented on TIKA-1536: -- I don't think it needs to wait for 2.0 -

[jira] [Commented] (TIKA-1538) Wrong mimetype detection

2015-02-03 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303089#comment-14303089 ] Nick Burch commented on TIKA-1538: -- I've just tried with {{java -jar tika-ap

[jira] [Commented] (TIKA-1539) GRB file magic bytes and extension matching

2015-02-03 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303102#comment-14303102 ] Nick Burch commented on TIKA-1539: -- The GRB parser will need one (or possibly a few)

[jira] [Commented] (TIKA-1538) Wrong mimetype detection

2015-02-03 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303128#comment-14303128 ] Nick Burch commented on TIKA-1538: -- It's possible that it could still be a bug

[jira] [Commented] (TIKA-1539) GRB file magic bytes and extension matching

2015-02-07 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310709#comment-14310709 ] Nick Burch commented on TIKA-1539: -- Thanks for this. I've made a minor tweak in

[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-07 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310711#comment-14310711 ] Nick Burch commented on TIKA-1541: -- I'm not sure if we want to be activatin

[jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika

2015-02-07 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310879#comment-14310879 ] Nick Burch commented on TIKA-1541: -- Looks like we still need some unit tests for

[jira] [Commented] (TIKA-1509) Create configurable strategies for composite parsers

2015-02-09 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312369#comment-14312369 ] Nick Burch commented on TIKA-1509: -- Two things now spring to mind as possible prob

[jira] [Commented] (TIKA-1509) Create configurable strategies for composite parsers

2015-02-09 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312414#comment-14312414 ] Nick Burch commented on TIKA-1509: -- I've had a go at the parser for the "

[jira] [Commented] (TIKA-1545) Create tika-server Frontend

2015-02-10 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314643#comment-14314643 ] Nick Burch commented on TIKA-1545: -- Is this not duplicated by TIKA-1269 ? If we can

[jira] [Commented] (TIKA-1545) Create tika-server Frontend

2015-02-10 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314680#comment-14314680 ] Nick Burch commented on TIKA-1545: -- We have only a stopgap thing for now! The orig

[jira] [Commented] (TIKA-1554) Improve EMF file detection

2015-02-20 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329138#comment-14329138 ] Nick Burch commented on TIKA-1554: -- Do you have any small files which incorrectly tri

[jira] [Commented] (TIKA-1557) Create TesseractOCR Option to Never Run

2015-02-21 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330168#comment-14330168 ] Nick Burch commented on TIKA-1557: -- Being able to blacklist a parser is part of the

[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-02-21 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330171#comment-14330171 ] Nick Burch commented on TIKA-1558: -- Is it not better to do this via a custom tika co

[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-02-24 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334779#comment-14334779 ] Nick Burch commented on TIKA-1558: -- I've updated the Tika Config example

[jira] [Commented] (TIKA-1561) GCMD Directory Interchange Format (.dif) identification

2015-02-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337878#comment-14337878 ] Nick Burch commented on TIKA-1561: -- Are any of those DIF files you mention und

[jira] [Commented] (TIKA-1560) OutOfMemoryError analyzinig specific file

2015-02-25 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337885#comment-14337885 ] Nick Burch commented on TIKA-1560: -- Looks like we might need to move the "is thi

[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-02-28 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341527#comment-14341527 ] Nick Burch commented on TIKA-1558: -- As of r1662940, it is now possible to blacklist

[jira] [Commented] (TIKA-89) Rename MimeType and MimeTypes

2015-02-28 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341984#comment-14341984 ] Nick Burch commented on TIKA-89: I think this might have already been done?

[jira] [Commented] (TIKA-289) Add magic byte patterns from file(1)

2015-02-28 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341993#comment-14341993 ] Nick Burch commented on TIKA-289: - There are a few issues with integrating it: * Very

[jira] [Reopened] (TIKA-289) Add magic byte patterns from file(1)

2015-02-28 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch reopened TIKA-289: - I've just checked, and there are actually a handful of mime types defined in the file magic which we

[jira] [Updated] (TIKA-289) Add magic byte patterns from file(1)

2015-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-289: Attachment: file-mimes-missing.txt Attached is the list of mime types extracted from the file(1) magic

[jira] [Updated] (TIKA-289) Add magic byte patterns from file(1)

2015-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-289: Attachment: file-has-magic-tika-missing.txt {{file-has-magic-tika-missing.txt}} is the list of mime types

[jira] [Commented] (TIKA-289) Add magic byte patterns from file(1)

2015-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342347#comment-14342347 ] Nick Burch commented on TIKA-289: - As of r1663136, you can now run the Tika CLI with

[jira] [Resolved] (TIKA-1563) Use .gz as the default extension for application/gzip

2015-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1563. -- Resolution: Fixed Fix Version/s: 1.8 Fixed in r1663138. > Use .gz as the default extension

[jira] [Commented] (TIKA-289) Add magic byte patterns from file(1)

2015-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342393#comment-14342393 ] Nick Burch commented on TIKA-289: - Yup, from there, or from a source tarball, or from

[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2015-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342738#comment-14342738 ] Nick Burch commented on TIKA-879: - It might be good to try the widened versions with

[jira] [Updated] (TIKA-634) Command Line Parser for Metadata Extraction

2015-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch updated TIKA-634: Labels: new-parser (was: ) > Command Line Parser for Metadata Extract

[jira] [Commented] (TIKA-634) Command Line Parser for Metadata Extraction

2015-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342743#comment-14342743 ] Nick Burch commented on TIKA-634: - We still seem to lack proper unit tests

[jira] [Commented] (TIKA-675) PackageExtractor should track names of recursively nested resources

2015-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342749#comment-14342749 ] Nick Burch commented on TIKA-675: - I think this is already handled by

[jira] [Commented] (TIKA-712) Master slide text isn't extracted

2015-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342750#comment-14342750 ] Nick Burch commented on TIKA-712: - I think it might already be as fixed as it can be

[jira] [Resolved] (TIKA-727) Improve the outputed XHTML by HSLFExtractor

2015-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-727. - Resolution: Fixed I believe this has been fixed for some time, so I'm closing it. If you still have

[jira] [Commented] (TIKA-770) New ODF metadata keys

2015-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342764#comment-14342764 ] Nick Burch commented on TIKA-770: - I think this probably wants to be a Tika 2.0 fix

[jira] [Commented] (TIKA-1531) Upgrade to POI 3.12-beta1 when available

2015-03-01 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342768#comment-14342768 ] Nick Burch commented on TIKA-1531: -- Apache POI 3.12 beta 1 was released over the wee

[jira] [Commented] (TIKA-928) Separation of Tika Core Properties From Metadata Processing

2015-03-02 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343331#comment-14343331 ] Nick Burch commented on TIKA-928: - Yup, come Tika 2.0 we can remove the backw

[jira] [Commented] (TIKA-944) Extend tika-server API to be consistent with tika-app CLI

2015-03-02 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343566#comment-14343566 ] Nick Burch commented on TIKA-944: - With the server as-is, you can get the metadata as

[jira] [Commented] (TIKA-1039) Raw image file detected as audio/mpeg

2015-03-03 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346415#comment-14346415 ] Nick Burch commented on TIKA-1039: -- Without writing a dedicated detector, I'm

[jira] [Commented] (TIKA-1568) AutoDetectReader performance problem

2015-03-09 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353224#comment-14353224 ] Nick Burch commented on TIKA-1568: -- Maybe we could look at putting the EncodingDete

[jira] [Resolved] (TIKA-1569) Doc typo Mime Magic Detction

2015-03-09 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1569. -- Resolution: Fixed Fix Version/s: 1.8 Good spot! I've fixed it across all affected versions

[jira] [Commented] (TIKA-1573) Not possible to restrict default mime types

2015-03-11 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356856#comment-14356856 ] Nick Burch commented on TIKA-1573: -- If you only want a handful of types, why not

[jira] [Commented] (TIKA-1286) Adding MS Visio VSDX to mime-types detection

2015-03-11 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356859#comment-14356859 ] Nick Burch commented on TIKA-1286: -- Any chance you could create very small sample f

[jira] [Commented] (TIKA-1573) Not possible to restrict default mime types

2015-03-11 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356924#comment-14356924 ] Nick Burch commented on TIKA-1573: -- Detecting text types should be quick, I'

[jira] [Commented] (TIKA-1286) Adding MS Visio VSDX to mime-types detection

2015-03-11 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357169#comment-14357169 ] Nick Burch commented on TIKA-1286: -- Thanks for all this! Note that the types given

[jira] [Commented] (TIKA-1573) Not possible to restrict default mime types

2015-03-11 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357408#comment-14357408 ] Nick Burch commented on TIKA-1573: -- My hunch is that your profiling will show almos

[jira] [Resolved] (TIKA-1286) Adding MS Visio VSDX to mime-types detection

2015-03-12 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1286. -- Resolution: Fixed Fix Version/s: 1.8 > Adding MS Visio VSDX to mime-types detect

[jira] [Commented] (TIKA-682) Creative Suite formats support

2015-03-13 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360369#comment-14360369 ] Nick Burch commented on TIKA-682: - [~cdedels] We already have a more specific InDe

[jira] [Commented] (TIKA-1082) Incorrect date in Doc metadata

2015-03-16 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363973#comment-14363973 ] Nick Burch commented on TIKA-1082: -- IIRC, 1601-01-01 is a date value of 0. Tika

[jira] [Commented] (TIKA-1245) Incorrect MIME type detection

2015-03-16 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363994#comment-14363994 ] Nick Burch commented on TIKA-1245: -- Maybe we should keep the current magic as-is,

[jira] [Commented] (TIKA-1195) XLSB support

2015-03-16 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364023#comment-14364023 ] Nick Burch commented on TIKA-1195: -- No POI support as yet - will take a non-tri
