[jira] [Commented] (TIKA-3992) Add common missing mimes based on Common Crawl data

2023-03-29 Thread Andrew Jackson (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706400#comment-17706400 ] Andrew Jackson commented on TIKA-3992: -- Sounds interesting! Just wanted to note that

[jira] [Commented] (TIKA-2632) Analyze unknown govdocs files

2018-04-17 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441077#comment-16441077 ] Andrew Jackson commented on TIKA-2632: -- It would be great to see the old PowerPoint si

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-21 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635032#comment-14635032 ] Andrew Jackson commented on TIKA-1678: -- Sorry for the delay. Here are the results: *

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-15 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627960#comment-14627960 ] Andrew Jackson commented on TIKA-1678: -- As far as I can tell, the PDF spec seems to im

[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-15 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627913#comment-14627913 ] Andrew Jackson commented on TIKA-1678: -- I'm seeing this in about 220,000 out of 21,204

[jira] [Updated] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-14 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-1678: - Summary: PDF metadata extraction fails to spot UTF-16 encoded title (was: PDF metadata extraction

[jira] [Created] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded data

2015-07-14 Thread Andrew Jackson (JIRA)
Andrew Jackson created TIKA-1678: Summary: PDF metadata extraction fails to spot UTF-16 encoded data Key: TIKA-1678 URL: https://issues.apache.org/jira/browse/TIKA-1678 Project: Tika Issue Ty

[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2015-03-19 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368858#comment-14368858 ] Andrew Jackson commented on TIKA-1154: -- Yes, thanks - that's the behaviour I'd hoped f

[jira] [Updated] (TIKA-1486) Minor issues with the Tika MIME type magic file

2014-11-27 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-1486: - Attachment: tika-mime-info-extensions-namespace.patch The attached patch adds a namespace declarati

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-26 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226415#comment-14226415 ] Andrew Jackson commented on TIKA-1302: -- We have two more sets of data. One is the same

[jira] [Commented] (TIKA-1486) Minor issues with the Tika MIME type magic file

2014-11-25 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224745#comment-14224745 ] Andrew Jackson commented on TIKA-1486: -- A-ha! I didn't notice the {{isregex="true"}} a

[jira] [Commented] (TIKA-1486) Minor issues with the Tika MIME type magic file

2014-11-25 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224714#comment-14224714 ] Andrew Jackson commented on TIKA-1486: -- There's no problem with adding an XML namespac

[jira] [Created] (TIKA-1486) Minor issues with the Tika MIME type magic file

2014-11-25 Thread Andrew Jackson (JIRA)
Andrew Jackson created TIKA-1486: Summary: Minor issues with the Tika MIME type magic file Key: TIKA-1486 URL: https://issues.apache.org/jira/browse/TIKA-1486 Project: Tika Issue Type: Improv

[jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-13 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209757#comment-14209757 ] Andrew Jackson edited comment on TIKA-1302 at 11/13/14 1:42 PM: -

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-13 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209757#comment-14209757 ] Andrew Jackson commented on TIKA-1302: -- [~talli...@apache.org] I've created a download

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-28 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186718#comment-14186718 ] Andrew Jackson commented on TIKA-1302: -- Shall I go ahead and extract the XML errors? O

[jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-21 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178361#comment-14178361 ] Andrew Jackson edited comment on TIKA-1302 at 10/21/14 12:59 PM:

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-21 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178361#comment-14178361 ] Andrew Jackson commented on TIKA-1302: -- Okay, so the c.300,000 exceptions are here: h

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176934#comment-14176934 ] Andrew Jackson commented on TIKA-1302: -- I have 2,358,167 errors from one collection (2

[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-20 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176892#comment-14176892 ] Andrew Jackson commented on TIKA-1302: -- At the UK Web Archive we run Apache Tika over

[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-09-08 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125384#comment-14125384 ] Andrew Jackson commented on TIKA-1232: -- Looks like this is fixed and in the 1.6 releas

[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-03-05 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920698#comment-13920698 ] Andrew Jackson commented on TIKA-1232: -- Does anyone have a copy of Acrobat 9.1? That v

[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-21 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908402#comment-13908402 ] Andrew Jackson commented on TIKA-1232: -- Going by my original intention, then I would p

[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2014-02-13 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900156#comment-13900156 ] Andrew Jackson commented on TIKA-1154: -- I've had no response on the metadata-extractor

[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-10 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13896697#comment-13896697 ] Andrew Jackson commented on TIKA-1232: -- Multiple dc:formats appears to be a reasonable

[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-07 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894376#comment-13894376 ] Andrew Jackson commented on TIKA-1232: -- Great! For (1), very happy for that code to g

[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-05 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892210#comment-13892210 ] Andrew Jackson commented on TIKA-1232: -- Yes, you can't identify > 1.7 PDF or the PDF/A

[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757042#comment-13757042 ] Andrew Jackson commented on TIKA-1170: -- Fair point! Thanks for accepting the changes.

[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756981#comment-13756981 ] Andrew Jackson commented on TIKA-1170: -- Thanks, that's great. If you prefer, you shoul

[jira] [Updated] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-1170: - Attachment: 0002-Added-example-malformed-HTML-file-that-was-being-mis.patch This additional patch

[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756577#comment-13756577 ] Andrew Jackson commented on TIKA-1170: -- I'm not sure that commit is right. I see this

[jira] [Updated] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-02 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-1170: - Attachment: 0001-Added-CGM-test-file-test-and-improved-magic.patch Patch containing test file, tes

[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-02 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756064#comment-13756064 ] Andrew Jackson commented on TIKA-1170: -- I was able to create an example file, using [G

[jira] [Updated] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-02 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-1170: - Attachment: plotutils-example.cgm This is an example version 3 binary CGM file, generated using GN

[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-02 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756051#comment-13756051 ] Andrew Jackson commented on TIKA-1170: -- My corpus is a chunk of the Internet Archive,

[jira] [Updated] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-02 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-1170: - Summary: Insufficiently specific magic for binary image/cgm files (was: Possibly erroneous magic

[jira] [Created] (TIKA-1170) Possibly erroneous magic for image/cgm files

2013-09-02 Thread Andrew Jackson (JIRA)
Andrew Jackson created TIKA-1170: Summary: Possibly erroneous magic for image/cgm files Key: TIKA-1170 URL: https://issues.apache.org/jira/browse/TIKA-1170 Project: Tika Issue Type: Bug

[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2013-07-25 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719631#comment-13719631 ] Andrew Jackson commented on TIKA-1154: -- Okay, I submitted an issue here: https://code

[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2013-07-25 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719594#comment-13719594 ] Andrew Jackson commented on TIKA-1154: -- We could exclude the package from coming in vi

[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2013-07-25 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719513#comment-13719513 ] Andrew Jackson commented on TIKA-1154: -- Thanks for the stacktrace, which led me to th

[jira] [Updated] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2013-07-25 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-1154: - Attachment: tika-breaker.html This file makes Tika hang. If you remove both of the binary characte

[jira] [Created] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2013-07-25 Thread Andrew Jackson (JIRA)
Andrew Jackson created TIKA-1154: Summary: Tika hangs on format detection of malformed HTML file. Key: TIKA-1154 URL: https://issues.apache.org/jira/browse/TIKA-1154 Project: Tika Issue Type:

[jira] [Created] (TIKA-1117) IWorkPackageParser should not close the InputStream

2013-05-01 Thread Andrew Jackson (JIRA)
Andrew Jackson created TIKA-1117: Summary: IWorkPackageParser should not close the InputStream Key: TIKA-1117 URL: https://issues.apache.org/jira/browse/TIKA-1117 Project: Tika Issue Type: Bu

[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-06 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429426#comment-13429426 ] Andrew Jackson commented on TIKA-970: - Hi, I noticed the updated version includes a bit

[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-03 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428116#comment-13428116 ] Andrew Jackson commented on TIKA-970: - He's added the Apache licence here: https://gith

[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-03 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428108#comment-13428108 ] Andrew Jackson commented on TIKA-970: - I assume I'll need him to confirm an Apache 2 lic

[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-03 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428096#comment-13428096 ] Andrew Jackson commented on TIKA-970: - I should be able to sort that out. I know the aut

[jira] [Commented] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-03 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428085#comment-13428085 ] Andrew Jackson commented on TIKA-970: - BTW, this set of signatures rather clumsily repea

[jira] [Created] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-03 Thread Andrew Jackson (JIRA)
Andrew Jackson created TIKA-970: --- Summary: Full identification of the JPEG 2000 family of formats Key: TIKA-970 URL: https://issues.apache.org/jira/browse/TIKA-970 Project: Tika Issue Type: New

[jira] [Updated] (TIKA-970) Full identification of the JPEG 2000 family of formats

2012-08-03 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-970: Attachment: custom-mimetype.xml > Full identification of the JPEG 2000 family of formats > --

[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images

2012-04-23 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-900: Description: I have been testing Tika's ability to identify ISO9660 disk image file systems, and disc

[jira] [Commented] (TIKA-900) Tika fails to detect ISO9660 disk images

2012-04-23 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259615#comment-13259615 ] Andrew Jackson commented on TIKA-900: - I re-uploaded the patch as it had an extra format

[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images

2012-04-23 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-900: Description: I have been testing Tika's ability to identify ISO9660 disk image file systems, and disc

[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images

2012-04-23 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-900: Attachment: iso-image-detection.patch Patch to fix ISO image magic, and extended the buffer size so t

[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images

2012-04-23 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-900: Attachment: (was: iso-image-detection.patch) > Tika fails to detect ISO9660 disk images > ---

[jira] [Updated] (TIKA-900) Tika fails to detect ISO9660 disk images

2012-04-23 Thread Andrew Jackson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Jackson updated TIKA-900: Attachment: iso-image-detection.patch Patch to increase buffer size and fix ISO image detection.

[jira] [Created] (TIKA-900) Tika fails to detect ISO9660 disk images

2012-04-23 Thread Andrew Jackson (JIRA)
Andrew Jackson created TIKA-900: --- Summary: Tika fails to detect ISO9660 disk images Key: TIKA-900 URL: https://issues.apache.org/jira/browse/TIKA-900 Project: Tika Issue Type: Bug Com