Re: [metadata] roadmap proposal available on the wiki

2012-04-26 Thread Antoni Mylka
2012/04/26 Mattmann, Chris A (388J) wrote: Hi Guys, One comment RE: the below too -- this is precisely where I see Any23 coming into play and why there is a strong relationship between it and Tika: http://incubator.apache.org/any23/ I'm the current Champion for the project and the Tika

Re: [metadata] roadmap proposal available on the wiki

2012-04-26 Thread Antoni Mylka
2012/04/25 Joerg Ehrlich wrote: Hi, I have put a proposal of a roadmap for the metadata features in Tika on the wiki: http://wiki.apache.org/tika/MetadataRoadmap The proposal is based on a discussion around this topic I have had with Jukka. Please review and feel free to edit the wiki

[jira] [Commented] (TIKA-854) No text extraction for Word macroenabled template

2012-01-31 Thread Antoni Mylka (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196952#comment-13196952 ] Antoni Mylka commented on TIKA-854: --- Remember TIKA-560. It's best if media type

[jira] [Closed] (TIKA-823) Detect StarOffice files

2011-12-21 Thread Antoni Mylka (Closed) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka closed TIKA-823. - Resolution: Fixed Fix Version/s: 1.1 Committed in r1221686. Thanks for the tip about

[jira] [Updated] (TIKA-823) Detect StarOffice files

2011-12-20 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-823: -- Attachment: testStarOffice-5.2-write.sdw testStarOffice-5.2-impress.sdd

[jira] [Created] (TIKA-823) Detect StarOffice files

2011-12-20 Thread Antoni Mylka (Created) (JIRA)
Detect StarOffice files --- Key: TIKA-823 URL: https://issues.apache.org/jira/browse/TIKA-823 Project: Tika Issue Type: Improvement Affects Versions: 1.1 Reporter: Antoni Mylka I would like both

[jira] [Commented] (TIKA-821) Support detecting old Microsoft Works Word Processor formats

2011-12-20 Thread Antoni Mylka (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173267#comment-13173267 ] Antoni Mylka commented on TIKA-821: --- Committed in r1221323 >

[jira] [Created] (TIKA-821) Support detecting old Microsoft Works Word Processor formats

2011-12-20 Thread Antoni Mylka (Created) (JIRA)
Components: mime Affects Versions: 1.1 Reporter: Antoni Mylka Assignee: Antoni Mylka An issue similar to TIKA-812. This time it's about old Works Word Processor formats. They use an OLE2 structure, but the top-level entry is called "MatOST", they are not s

[jira] [Commented] (TIKA-686) Split tika-parsers into separate components

2011-12-20 Thread Antoni Mylka (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173147#comment-13173147 ] Antoni Mylka commented on TIKA-686: --- Why keep this issue open? PdfParser appeare

[jira] [Closed] (TIKA-814) Increase the amount of bytes read by TextDetector

2011-12-19 Thread Antoni Mylka (Closed) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka closed TIKA-814. - Resolution: Fixed Fix Version/s: 1.1 Committed in r1220698. This is a change, which theoretically

[jira] [Closed] (TIKA-813) Webarchive detection.

2011-12-19 Thread Antoni Mylka (Closed) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka closed TIKA-813. - Resolution: Fixed Fix Version/s: 1.1 Committed the magics and the unit tests in r1220696. Thanks

[jira] [Closed] (TIKA-812) Improve the detection of Works Spreadsheet 7.0 files

2011-12-19 Thread Antoni Mylka (Closed) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka closed TIKA-812. - Resolution: Fixed Fix Version/s: 1.1 Committed tika-812-ver2.patch in r1220687

Re: Pushing parsers upstream

2011-12-16 Thread Antoni Mylka
On 2011-12-16 20:32, Jukka Zitting wrote: Hi, On Fri, Dec 16, 2011 at 7:45 PM, Antoni Mylka wrote: The moment upstream libraries start depending on tika-core, they stop being upstream libraries and become "side-stream" libraries. Putting POI between core and parsers in the

Re: Pushing parsers upstream

2011-12-16 Thread Antoni Mylka
's not get carried away about creating yet another ultimate solution. Antoni Mylka antoni.my...@gmail.com

Re: Pushing parsers upstream

2011-12-16 Thread Antoni Mylka
ing a bit more complexity to the module setup I still feel it's worth it though. WDYT? Antoni Mylka antoni.my...@gmail.com

[jira] [Commented] (TIKA-810) Upgrade to PDFbox 1.7.0 as available

2011-12-16 Thread Antoni Mylka (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171098#comment-13171098 ] Antoni Mylka commented on TIKA-810: --- That's a very important question IMHO, c

[jira] [Closed] (TIKA-791) Fix the detection of protected OOXML files

2011-12-14 Thread Antoni Mylka (Closed) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka closed TIKA-791. - Resolution: Fixed Fix Version/s: 1.1 This seems fixed. > Fix the detection

[jira] [Updated] (TIKA-812) Improve the detection of Works Spreadsheet 7.0 files

2011-12-14 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-812: -- Attachment: tika-812-ver2.patch A second version of the patch. Contains a magic pattern for

[jira] [Updated] (TIKA-813) Webarchive detection.

2011-12-14 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-813: -- Attachment: testWEBARCHIVE.webarchive tika-813.patch A second version of the patch which

[jira] [Updated] (TIKA-813) Webarchive detection.

2011-12-14 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-813: -- Attachment: (was: tika-webarchive-detection.patch) > Webarchive detect

[jira] [Updated] (TIKA-814) Increase the amount of bytes read by TextDetector

2011-12-13 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-814: -- Attachment: tika-textdetector.patch A patch, which makes the text detector work on the entire array

[jira] [Created] (TIKA-814) Increase the amount of bytes read by TextDetector

2011-12-13 Thread Antoni Mylka (Created) (JIRA)
Reporter: Antoni Mylka Attachments: tika-textdetector.patch In TIKA-688 Jukka implemented a plain text detector. It is fired automatically inside MimeTypes. I find a number of files in my collections, which are binary but are still detected as plain text. They wouldn't be if the
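To illustrate why the number of inspected bytes matters for the problem reported in TIKA-814, here is a minimal sketch of a prefix-based text heuristic. It is not Tika's TextDetector: the class name, the set of tolerated control characters and the limits used in `main` are assumptions made for illustration only.

```java
final class PrefixTextHeuristic {

    // Returns true when none of the first `limit` bytes is a control
    // character other than common whitespace, form feed or escape.
    static boolean looksLikeText(byte[] data, int limit) {
        int n = Math.min(limit, data.length);
        for (int i = 0; i < n; i++) {
            int b = data[i] & 0xFF;
            boolean tolerated = b == '\t' || b == '\n' || b == '\r'
                    || b == 0x0C || b == 0x1B;
            if (b < 0x20 && !tolerated) {
                return false; // binary-looking byte inside the inspected prefix
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 600 printable bytes followed by a single NUL byte.
        byte[] data = new byte[601];
        java.util.Arrays.fill(data, 0, 600, (byte) 'a');
        data[600] = 0;

        // A short prefix never reaches the NUL byte; the full array does.
        System.out.println(looksLikeText(data, 512));
        System.out.println(looksLikeText(data, data.length));
    }
}
```

A file whose binary content starts only after the inspected prefix passes such a check, which matches the symptom described in the issue; inspecting more of the available bytes narrows that gap at the cost of more work per file.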

[jira] [Created] (TIKA-813) Webarchive detection.

2011-12-13 Thread Antoni Mylka (Created) (JIRA)
Webarchive detection. - Key: TIKA-813 URL: https://issues.apache.org/jira/browse/TIKA-813 Project: Tika Issue Type: Improvement Components: mime Affects Versions: 1.1 Reporter: Antoni Mylka

[jira] [Updated] (TIKA-813) Webarchive detection.

2011-12-13 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-813: -- Attachment: tika-webarchive-detection.patch A patch which adds the appropriate rules to tika

Re: Pushing parsers upstream

2011-12-13 Thread Antoni Mylka
apshot jar to maven central (and label it with a version number which includes the date or something). There are such jars, but how does it look in practice? Who decides if a jar can or cannot be uploaded? Antoni Mylka antoni.my...@gmail.com

[jira] [Updated] (TIKA-812) Improve the detection of Works Spreadsheet 7.0 files

2011-12-13 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-812: -- Attachment: tika-812.patch testWORKSSpreadsheet7.0.xlr Attached a test file and a patch

[jira] [Closed] (TIKA-798) Distinguish between EMF and WMF

2011-12-13 Thread Antoni Mylka (Closed) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka closed TIKA-798. - Resolution: Fixed Fix Version/s: 1.1 Thanks. > Distinguish between EMF and

[jira] [Created] (TIKA-812) Improve the detection of Works Spreadsheet 7.0 files

2011-12-13 Thread Antoni Mylka (Created) (JIRA)
Affects Versions: 1.1 Reporter: Antoni Mylka This was originally part of ver3 of my patch submitted to TIKA-806. Works Spreadsheet files are weird. Versions up to 3.0 used a Quattro Pro magic, version 4.0 used its own magic, while version 7.0 (probably later ones as well) use an

Re: Pushing parsers upstream

2011-12-13 Thread Antoni Mylka
of Tika functionality when the libraries are missing, without ugly ClassNotFoundErrors. (probably the only reliable way). I'm all for. Antoni Mylka antoni.my...@gmail.com

[jira] [Resolved] (TIKA-806) MS Word Detection magics are a bit overzealous

2011-12-13 Thread Antoni Mylka (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka resolved TIKA-806. --- Resolution: Not A Problem Fix Version/s: 1.1 Assignee: Antoni Mylka You're righ

JIRA rights.

2011-12-13 Thread Antoni Mylka
Hi, What are the rules wrt. JIRA rights? I can't close issues. It's not much of a problem, just thought I'd ask. My JIRA id is "antheque", coupled with the gmail address. Antoni Mylka antoni.my...@gmail.com

[jira] [Commented] (TIKA-806) MS Word Detection magics are a bit overzealous

2011-12-13 Thread Antoni Mylka (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168334#comment-13168334 ] Antoni Mylka commented on TIKA-806: --- If you put it like this, then it becomes a matte

Re: [ANNOUNCE] Welcome Antoni Mylka as Tika committer + PMC member

2011-12-12 Thread Antoni Mylka
On 2011-12-12 17:58, Mattmann, Chris A (388J) wrote: Hi Folks, Please welcome Antoni Mylka to the ranks of the Tika PMC and as a Tika committer. He's just been VOTEd in and we're really happy to have him around. Antoni, please feel free to say a bit about yourself. Thanks a

[jira] [Updated] (TIKA-806) MS Word Detection magics are a bit overzealous

2011-12-12 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-806: -- Attachment: tika-806-ver3.zip It turns out that the XLR files are not detected by POIFSContainerDetector

[jira] [Commented] (TIKA-806) MS Word Detection magics are a bit overzealous

2011-12-12 Thread Antoni Mylka (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167509#comment-13167509 ] Antoni Mylka commented on TIKA-806: --- Probably not. Just that I don'

[jira] [Updated] (TIKA-806) MS Word Detection magics are a bit overzealous

2011-12-09 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-806: -- Attachment: tika-806-ver2.patch A second version of the patch which doesn't break the build. The

[jira] [Updated] (TIKA-806) MS Word Detection magics are a bit overzealous

2011-12-09 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-806: -- Attachment: (was: tika-806.patch) > MS Word Detection magics are a bit overzeal

[jira] [Updated] (TIKA-806) MS Word Detection magics are a bit overzealous

2011-12-09 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-806: -- Attachment: tika-806.patch A patch which removes those magics from tika-mimetypes.xml

[jira] [Created] (TIKA-806) MS Word Detection magics are a bit overzealous

2011-12-09 Thread Antoni Mylka (Created) (JIRA)
: 1.1 Reporter: Antoni Mylka tika-mimetypes.xml contains the following magic for MS Word: {noformat} {noformat} So if a file is an MS Office document (parent Office magic) and has the WordDocument string within the given offsets, then it's Word. I have a few (regrettably confide

[jira] [Updated] (TIKA-798) Distinguish between EMF and WMF

2011-12-02 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-798: -- Attachment: tika-emfwmf.zip A patch with two example files. From now on WMF stays at application/x

[jira] [Created] (TIKA-798) Distinguish between EMF and WMF

2011-12-02 Thread Antoni Mylka (Created) (JIRA)
Reporter: Antoni Mylka I'd like MimeTypes to distinguish between EMF and WMF. These are different formats with different magics. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/s

[jira] [Updated] (TIKA-797) MimeType.getExtension for application/vnd.ms-powerpoint returns ppz. I'd expect ppt.

2011-12-02 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-797: -- Attachment: tika-powerpointextension.patch A patch which reversed the order of globs for vnd.ms

[jira] [Created] (TIKA-797) MimeType.getExtension for application/vnd.ms-powerpoint returns ppz. I'd expect ppt.

2011-12-02 Thread Antoni Mylka (Created) (JIRA)
Tika Issue Type: Wish Components: mime Affects Versions: 1.0 Reporter: Antoni Mylka

[jira] [Updated] (TIKA-791) Fix the detection of protected OOXML files

2011-11-28 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-791: -- Attachment: tika-791-ver2.zip Attached an updated patch which uses a new media type "application/x

[jira] [Commented] (TIKA-791) Fix the detection of protected OOXML files

2011-11-25 Thread Antoni Mylka (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157216#comment-13157216 ] Antoni Mylka commented on TIKA-791: --- A specific mime type seems like a better idea in

[jira] [Updated] (TIKA-791) Fix the detection of protected OOXML files

2011-11-25 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-791: -- Attachment: tika-791.zip A ZIP file with the patch and some test documents. They differ from the ones in

[jira] [Created] (TIKA-791) Fix the detection of protected OOXML files

2011-11-25 Thread Antoni Mylka (Created) (JIRA)
: 1.1 Environment: Windows 7 64 bit Reporter: Antoni Mylka TIKA-437 patch allowed Tika to work with OOXML files protected with the default VelvetSweatshop password. I feel there is room for improvement. # The POIFSContainerDetector lies when it sees such a file. It should be

[jira] [Updated] (TIKA-779) Detection of Microsoft Works 2000 Word Processor files

2011-11-10 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-779: -- Attachment: tika-779.patch My workaround + test. > Detection of Microsoft Works 2

[jira] [Updated] (TIKA-779) Detection of Microsoft Works 2000 Word Processor files

2011-11-10 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-779: -- Attachment: microsoft-works-word-processor-2000.wps A test WPS file with no SPELLING top-level name

[jira] [Updated] (TIKA-779) Detection of Microsoft Works 2000 Word Processor files

2011-11-10 Thread Antoni Mylka (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-779: -- Description: In older versions of Tika, our Microsoft Works 2000 Word Processor example file would get

[jira] [Created] (TIKA-779) Detection of Microsoft Works 2000 Word Processor files

2011-11-10 Thread Antoni Mylka (Created) (JIRA)
Environment: Windows 7, 64 bit Reporter: Antoni Mylka In older versions of Tika, our Microsoft Works 2000 Word Processor example file would get recognized properly by the POIFSContainerDetector. Now it isn't. Some debugging revealed that the improvements from TIKA-704 brok

Re: Support for Open Graph meta tags

2011-09-23 Thread Antoni Mylka
On 2011-09-23 15:12, Jukka Zitting wrote: So I think I'll just patch my local copy to do the Q&D thing, and wait for someone with more XML/RDF-fu to deal with it properly. Until Someone (TM, :-) does that, I'd be very happy to see the simple property=xxx mapping you described added to HtmlP

Re: Appending Mime Types

2011-08-23 Thread Antoni Mylka
On 2011-08-22 20:37, Tom Grant wrote: Here's the use case that I'm attempting to solve. I have a customer with many legacy systems, some of which are completely custom. These systems have data files that will never be seen outside of their environment. For example, some are XML files with

[jira] [Commented] (TIKA-686) Split tika-parsers into separate components

2011-07-28 Thread Antoni Mylka (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072328#comment-13072328 ] Antoni Mylka commented on TIKA-686: --- FWIW I would say that fewer is better.

Re: [Aperture-devel] TikaMimeTypeIdentifier in Aperture

2011-06-14 Thread Antoni Mylka
On 2011-06-14 16:13, Maxim Valyanskiy wrote: Tika detects datatype and extracts text in one pass through supplied input stream. OOXML parser requires random access to ZIP archive files, so there are only two alternatives - to buffer data in memory or store it on disk. Overhead appears only wh

Re: [Aperture-devel] TikaMimeTypeIdentifier in Aperture

2011-06-14 Thread Antoni Mylka
On 2011-06-14 16:11, Nick Burch wrote: On Tue, 14 Jun 2011, Antoni Mylka wrote: We'll need to buffer the whole file for zip either way. The current way will create a temp file if you start with an input stream (not if you have a file already), will scan through the file looking for en

Re: [Aperture-devel] TikaMimeTypeIdentifier in Aperture

2011-06-14 Thread Antoni Mylka
On 2011-06-14 15:02, Nick Burch wrote: On Tue, 14 Jun 2011, Antoni Mylka wrote: You are right. There is still room for improvement. ZipContainerDetector creates a temp file, which I'd rather avoid We'll need to buffer the whole file for zip either way. The current way will cre

Re: [Aperture-devel] TikaMimeTypeIdentifier in Aperture

2011-06-14 Thread Antoni Mylka
On 2011-06-14 11:50, Arjohn Kampman wrote: On 11/06/2011 02:22, Antoni Mylka wrote: Brought our TikaMimeTypeIdentifier up to date with the latest Tika trunk. I increased the number of bytes passed to the identifier to 512KB. It's a lot, but these days CPU is cheap. This large buffer

Re: TikaMimeTypeIdentifier in Aperture

2011-06-10 Thread Antoni Mylka
On 2010-12-03 01:07, Antoni Mylka wrote: Hello Aperture (cc tika-dev, may be interesting for you too) Brought our TikaMimeTypeIdentifier up to date with the latest Tika trunk. I increased the number of bytes passed to the identifier to 512KB. It's a lot, but these days CPU is

TikaMimeTypeIdentifier in Aperture

2010-12-02 Thread Antoni Mylka
Hello Aperture (cc tika-dev, may be interesting for you too) As you know Tika has made certain advances in the field of mime type identification, which we (Aperture) wanted to implement for a long time. This is the feature request 3043080 but it applies to a bug 3025427 and feature requests: 2210

[jira] Resolved: (TIKA-560) Improve detection of .mht, Foxmail, and OOXML files

2010-11-30 Thread Antoni Mylka (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka resolved TIKA-560. --- Resolution: Fixed This is fixed from my POV, if you don't want to accept null stre

[jira] Commented: (TIKA-560) Improve detection of .mht, Foxmail, and OOXML files

2010-11-30 Thread Antoni Mylka (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965592#action_12965592 ] Antoni Mylka commented on TIKA-560: --- It's about detecting the testEXCEL.xlsb

[jira] Commented: (TIKA-562) In tika-mimetypes.xml OpenXML types should have x-tika-ooxml as their parent

2010-11-30 Thread Antoni Mylka (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965590#action_12965590 ] Antoni Mylka commented on TIKA-562: --- Your unit tests test identification by name an

[jira] Created: (TIKA-563) .vor files are Staroffice Templates, not Staroffice Writer documents

2010-11-30 Thread Antoni Mylka (JIRA)
Reporter: Antoni Mylka The current tika-mimetypes.xml states that *.vor files are from vnd.stardivision.writer. This is not true. The vor extension is used by templates from all staroffice applications. Moreover all of them have the msoffice magic number.

[jira] Updated: (TIKA-563) .vor files are Staroffice Templates, not Staroffice Writer documents

2010-11-30 Thread Antoni Mylka (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-563: -- Attachment: staroffice.5.2.templates.zip staroffice-templates.patch A patch and some

[jira] Updated: (TIKA-562) In tika-mimetypes.xml OpenXML types should have x-tika-ooxml as their parent

2010-11-30 Thread Antoni Mylka (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-562: -- Attachment: ooxml-children.patch > In tika-mimetypes.xml OpenXML types should have x-tika-ooxml as th

[jira] Created: (TIKA-562) In tika-mimetypes.xml OpenXML types should have x-tika-ooxml as their parent

2010-11-30 Thread Antoni Mylka (JIRA)
Type: Bug Reporter: Antoni Mylka A couple of file types have application/x-tika-msoffice as their parent, when they should have application/x-tika-ooxml. This error is exhibited when you try to identify those files with both name and data. The data is found to be x-tika-ooxml, while
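The effect of a wrong parent on combined name-and-data detection can be sketched as follows. This is a simplified model, not Tika's MimeTypes implementation: the `MediaTypeHierarchy` class and its `resolve` rule (prefer the name-based type only when it is a descendant of the data-based type) are assumptions chosen to illustrate the report.

```java
import java.util.HashMap;
import java.util.Map;

final class MediaTypeHierarchy {

    // child media type -> declared parent media type
    private final Map<String, String> parents = new HashMap<>();

    void setParent(String child, String parent) {
        parents.put(child, parent);
    }

    // Walks up the parent chain; the hop limit guards against cyclic declarations.
    boolean isSpecializationOf(String type, String ancestor) {
        String current = parents.get(type);
        for (int hops = 0; current != null && hops < 32; hops++) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parents.get(current);
        }
        return false;
    }

    // Assumed rule: the name-based guess wins only if it refines the data-based one.
    String resolve(String byData, String byName) {
        if (byName != null && isSpecializationOf(byName, byData)) {
            return byName;
        }
        return byData;
    }

    public static void main(String[] args) {
        String docx = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
        MediaTypeHierarchy h = new MediaTypeHierarchy();

        h.setParent(docx, "application/x-tika-msoffice"); // the parent reported as wrong
        System.out.println(h.resolve("application/x-tika-ooxml", docx));

        h.setParent(docx, "application/x-tika-ooxml");    // the parent the issue asks for
        System.out.println(h.resolve("application/x-tika-ooxml", docx));
    }
}
```

Under this model the specific type from the file name is discarded while the parent points at application/x-tika-msoffice, and accepted once the parent is application/x-tika-ooxml.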

[jira] Issue Comment Edited: (TIKA-560) Improve detection of .mht, Foxmail, and OOXML files

2010-11-30 Thread Antoni Mylka (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965428#action_12965428 ] Antoni Mylka edited comment on TIKA-560 at 11/30/10 3:49 PM: -

[jira] Commented: (TIKA-560) Improve detection of .mht, Foxmail, and OOXML files

2010-11-30 Thread Antoni Mylka (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965428#action_12965428 ] Antoni Mylka commented on TIKA-560: --- It seems that when applying changes to

[jira] Commented: (TIKA-560) Improve detection of .mht, Foxmail, and OOXML files

2010-11-26 Thread Antoni Mylka (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12936090#action_12936090 ] Antoni Mylka commented on TIKA-560: --- MimeTypes, when you pass a null stream - uses

[jira] Updated: (TIKA-561) Support EMLX file detection

2010-11-25 Thread Antoni Mylka (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-561: -- Attachment: tika-561.patch A patch which contains the modifications and the test file. It overlaps with

[jira] Created: (TIKA-561) Support EMLX file detection

2010-11-25 Thread Antoni Mylka (JIRA)
Support EMLX file detection --- Key: TIKA-561 URL: https://issues.apache.org/jira/browse/TIKA-561 Project: Tika Issue Type: Improvement Reporter: Antoni Mylka Apple Mail generates email files in .emlx

[jira] Updated: (TIKA-560) Improve detection of .mht, Foxmail, and OOXML files

2010-11-25 Thread Antoni Mylka (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-560: -- Attachment: test-documents.zip tika-560.patch A patch with my solution proposal, and the

[jira] Created: (TIKA-560) Improve detection of .mht, Foxmail, and OOXML files

2010-11-25 Thread Antoni Mylka (JIRA)
Improve detection of .mht, Foxmail, and OOXML files --- Key: TIKA-560 URL: https://issues.apache.org/jira/browse/TIKA-560 Project: Tika Issue Type: Improvement Reporter: Antoni

[jira] Updated: (TIKA-487) ContainerAwareDetector doesn't support truncated Open XML files

2010-08-19 Thread Antoni Mylka (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-487: -- Attachment: tika-truncated-ooxml-file.patch A patch with a test that exposes the problem

[jira] Created: (TIKA-487) ContainerAwareDetector doesn't support truncated Open XML files

2010-08-19 Thread Antoni Mylka (JIRA)
ement Reporter: Antoni Mylka Attachments: tika-truncated-ooxml-file.patch When I try to run the detector on a truncated Open XML file I get an exception java.util.zip.ZipException: error in opening zip file at java.util.zip.ZipFile.open(Native Method)
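One way to keep detection from failing on a truncated archive is to catch the exception and fall back to the type already established by the magic number. The sketch below uses only java.util.zip; the class name, the `detect` signature and the fallback parameter are hypothetical and not part of Tika's API.

```java
import java.io.File;
import java.io.IOException;
import java.util.zip.ZipFile;

final class TruncatedZipFallback {

    // Tries to inspect the archive; if it cannot be opened (for example
    // because it is truncated), returns the caller-supplied fallback type.
    static String detect(File file, String fallback) {
        try (ZipFile zip = new ZipFile(file)) {
            // OOXML packages carry a [Content_Types].xml entry at the top level.
            if (zip.getEntry("[Content_Types].xml") != null) {
                return "application/x-tika-ooxml";
            }
            return fallback;
        } catch (IOException e) {
            // ZipException is a subclass of IOException, so truncated or
            // otherwise unreadable archives end up here.
            return fallback;
        }
    }

    public static void main(String[] args) {
        if (args.length == 1) {
            System.out.println(detect(new File(args[0]), "application/zip"));
        }
    }
}
```

The design choice here is to treat an unreadable container as "still a ZIP according to its magic" rather than as an error, so that one damaged file does not abort the processing of a whole collection.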

[jira] Updated: (TIKA-486) ContainerAwareDetector doesn't support non-MSOffice files which use the same magic

2010-08-19 Thread Antoni Mylka (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-486: -- Attachment: tika-non-office-files-with-office-magic.patch test-documents.zip Four test

[jira] Created: (TIKA-486) ContainerAwareDetector doesn't support non-MSOffice files which use the same magic

2010-08-19 Thread Antoni Mylka (JIRA)
Tika Issue Type: Improvement Reporter: Antoni Mylka There are many applications which use the MSOffice magic number. I know of Corel Presentations X3, Corel Quattro Pro 7 and X3 and Microsoft Works Word Processor. They have their own mime types. They aren't properly su

[jira] Updated: (TIKA-485) ContainerAwareDetector doesn't support truncated POI files

2010-08-19 Thread Antoni Mylka (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoni Mylka updated TIKA-485: -- Attachment: tika-truncated-excel-file.patch a patch with a test that exposes the issue

[jira] Created: (TIKA-485) ContainerAwareDetector doesn't support truncated POI files

2010-08-19 Thread Antoni Mylka (JIRA)
ement Reporter: Antoni Mylka Attachments: tika-truncated-excel-file.patch If a file has a POI magic number but the call to new POIFSFileSystem(new FileInputStream(stream.getFile())); throws an exception because the file is broken - the entire process will fail. A simple try-catch around the ca

Working with multiple mime type definition files

2010-08-17 Thread Antoni Mylka
Hi, The Tika mime type detection code has improved greatly since I last looked at it a while ago. The root-XML-based detection and ContainerAwareDetector are things we (Aperture) have wanted to do ourselves since at least 2007 but never got round to it :) Unfortunately there are many subtle dif