2012/04/26 Mattmann, Chris A (388J) wrote:
Hi Guys,
One comment RE: the below too -- this is precisely where I see
Any23 coming into play and why there is a strong relationship
between it and Tika:
http://incubator.apache.org/any23/
I'm the current Champion for the project and the Tika
2012/04/25 Joerg Ehrlich wrote:
Hi,
I have put a proposal of a roadmap for the metadata features in Tika on the
wiki:
http://wiki.apache.org/tika/MetadataRoadmap
The proposal is based on a discussion around this topic I have had with Jukka.
Please review and feel free to edit the wiki
[
https://issues.apache.org/jira/browse/TIKA-854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196952#comment-13196952
]
Antoni Mylka commented on TIKA-854:
---
Remember TIKA-560. It's best if media type
[
https://issues.apache.org/jira/browse/TIKA-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka closed TIKA-823.
-
Resolution: Fixed
Fix Version/s: 1.1
Committed in r1221686. Thanks for the tip about
[
https://issues.apache.org/jira/browse/TIKA-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-823:
--
Attachment: testStarOffice-5.2-write.sdw
testStarOffice-5.2-impress.sdd
Detect StarOffice files
---
Key: TIKA-823
URL: https://issues.apache.org/jira/browse/TIKA-823
Project: Tika
Issue Type: Improvement
Affects Versions: 1.1
Reporter: Antoni Mylka
I would like both
[
https://issues.apache.org/jira/browse/TIKA-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173267#comment-13173267
]
Antoni Mylka commented on TIKA-821:
---
Committed in r1221323
>
Components: mime
Affects Versions: 1.1
Reporter: Antoni Mylka
Assignee: Antoni Mylka
An issue similar to TIKA-812. This time it's about old Works Word Processor
formats. They use an OLE2 structure, but the top-level entry is called
"MatOST", they are not s
[
https://issues.apache.org/jira/browse/TIKA-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173147#comment-13173147
]
Antoni Mylka commented on TIKA-686:
---
Why keep this issue open?
PdfParser appeare
[
https://issues.apache.org/jira/browse/TIKA-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka closed TIKA-814.
-
Resolution: Fixed
Fix Version/s: 1.1
Committed in r1220698.
This is a change, which theoretically
[
https://issues.apache.org/jira/browse/TIKA-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka closed TIKA-813.
-
Resolution: Fixed
Fix Version/s: 1.1
Committed the magics and the unit tests in r1220696. Thanks
[
https://issues.apache.org/jira/browse/TIKA-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka closed TIKA-812.
-
Resolution: Fixed
Fix Version/s: 1.1
Committed tika-812-ver2.patch in r1220687
On 2011-12-16 20:32, Jukka Zitting wrote:
Hi,
On Fri, Dec 16, 2011 at 7:45 PM, Antoni Mylka wrote:
The moment upstream libraries start depending on tika-core, they stop being
upstream libraries and become "side-stream" libraries. Putting POI between
core and parsers in the
's not
get carried away about creating yet another ultimate solution.
Antoni Mylka
antoni.my...@gmail.com
ing a bit more complexity to the module setup
I still feel it's worth it though.
WDYT?
Antoni Mylka
antoni.my...@gmail.com
[
https://issues.apache.org/jira/browse/TIKA-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171098#comment-13171098
]
Antoni Mylka commented on TIKA-810:
---
That's a very important question IMHO, c
[
https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka closed TIKA-791.
-
Resolution: Fixed
Fix Version/s: 1.1
This seems fixed.
> Fix the detection
[
https://issues.apache.org/jira/browse/TIKA-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-812:
--
Attachment: tika-812-ver2.patch
A second version of the patch. Contains a magic pattern for
[
https://issues.apache.org/jira/browse/TIKA-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-813:
--
Attachment: testWEBARCHIVE.webarchive
tika-813.patch
A second version of the patch which
[
https://issues.apache.org/jira/browse/TIKA-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-813:
--
Attachment: (was: tika-webarchive-detection.patch)
> Webarchive detect
[
https://issues.apache.org/jira/browse/TIKA-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-814:
--
Attachment: tika-textdetector.patch
A patch, which makes the text detector work on the entire array
Reporter: Antoni Mylka
Attachments: tika-textdetector.patch
In TIKA-688 Jukka implemented a plain text detector. It is fired automatically
inside MimeTypes. I find a number of files in my collections, which are binary
but are still detected as plain text. They wouldn't be if the
Webarchive detection.
-
Key: TIKA-813
URL: https://issues.apache.org/jira/browse/TIKA-813
Project: Tika
Issue Type: Improvement
Components: mime
Affects Versions: 1.1
Reporter: Antoni Mylka
[
https://issues.apache.org/jira/browse/TIKA-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-813:
--
Attachment: tika-webarchive-detection.patch
A patch which adds the appropriate rules to tika
apshot jar to maven
central (and label it with a version number which includes the date or
something). There are such jars, but how does it look in practice? Who
decides if a jar can or cannot be uploaded?
Antoni Mylka
antoni.my...@gmail.com
[
https://issues.apache.org/jira/browse/TIKA-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-812:
--
Attachment: tika-812.patch
testWORKSSpreadsheet7.0.xlr
Attached a test file and a patch
[
https://issues.apache.org/jira/browse/TIKA-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka closed TIKA-798.
-
Resolution: Fixed
Fix Version/s: 1.1
Thanks.
> Distinguish between EMF and
Affects Versions: 1.1
Reporter: Antoni Mylka
This was originally part of ver3 of my patch submitted to TIKA-806.
Works Spreadsheet files are weird. Versions up to 3.0 used a Quattro Pro magic,
version 4.0 used its own magic, while version 7.0 (probably later ones as well)
use an
" of Tika functionality when
the libraries are missing, without ugly ClassNotFoundErrors. (probably
the only reliable way).
I'm all for.
Antoni Mylka
antoni.my...@gmail.com
[
https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka resolved TIKA-806.
---
Resolution: Not A Problem
Fix Version/s: 1.1
Assignee: Antoni Mylka
You're righ
Hi,
What are the rules wrt. JIRA rights? I can't close issues. It's not much
of a problem, just thought I'd ask. My JIRA id is "antheque", coupled
with the gmail address.
Antoni Mylka
antoni.my...@gmail.com
[
https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168334#comment-13168334
]
Antoni Mylka commented on TIKA-806:
---
If you put it like this, then it becomes a matte
On 2011-12-12 17:58, Mattmann, Chris A (388J) wrote:
Hi Folks,
Please welcome Antoni Mylka to the ranks of the Tika PMC and as a Tika
committer.
He's just been VOTEd in and we're really happy to have him around.
Antoni, please feel free to say a bit about yourself. Thanks a
[
https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-806:
--
Attachment: tika-806-ver3.zip
It turns out that the XLR files are not detected by POIFSContainerDetector
[
https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167509#comment-13167509
]
Antoni Mylka commented on TIKA-806:
---
Probably not. Just that I don'
[
https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-806:
--
Attachment: tika-806-ver2.patch
A second version of the patch which doesn't break the build. The
[
https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-806:
--
Attachment: (was: tika-806.patch)
> MS Word Detection magics are a bit overzeal
[
https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-806:
--
Attachment: tika-806.patch
A patch which removes those magics from tika-mimetypes.xml
: 1.1
Reporter: Antoni Mylka
tika-mimetypes.xml contains the following magic for MS Word:
{noformat}
{noformat}
So if a file is an MS Office document (parent Office magic) and has the
WordDocument string within the given offsets, then it's Word. I have a few
(regrettably confide
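The rule described above (Office container magic first, then the WordDocument string inside an offset window) can be sketched roughly as follows. This is only an illustration, not the rule from tika-mimetypes.xml: the class name is made up, the offset window is passed in as hypothetical parameters because the original values are not preserved here, and the sketch assumes the entry name is stored as UTF-16LE, as OLE2 directory entry names are.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a "parent magic plus string within offsets" check. */
final class WordMagicSketch {
    // OLE2 compound document signature found at offset 0 of MS Office binary files.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Assumption: the entry name is searched for in its UTF-16LE form.
    static final byte[] WORD_ENTRY =
        "WordDocument".getBytes(StandardCharsets.UTF_16LE);

    /** True if the pattern occurs in data starting exactly at offset. */
    static boolean startsWith(byte[] data, byte[] pattern, int offset) {
        if (offset < 0 || offset + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    /** True if the pattern starts at any offset in [from, to], both inclusive. */
    static boolean containsWithin(byte[] data, byte[] pattern, int from, int to) {
        int last = Math.min(to, data.length - pattern.length);
        for (int off = Math.max(from, 0); off <= last; off++) {
            if (startsWith(data, pattern, off)) {
                return true;
            }
        }
        return false;
    }

    /** Parent magic must match first; only then is the child pattern checked. */
    static boolean looksLikeWord(byte[] data, int from, int to) {
        return startsWith(data, OLE2_MAGIC, 0)
            && containsWithin(data, WORD_ENTRY, from, to);
    }
}
```

A check of this shape says nothing about where the string sits in the container structure, so any OLE2 file that happens to carry those bytes inside the window would also match.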
[
https://issues.apache.org/jira/browse/TIKA-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-798:
--
Attachment: tika-emfwmf.zip
A patch with two example files. From now on, WMF stays at application/x
Reporter: Antoni Mylka
I'd like MimeTypes to distinguish between EMF and WMF. These are different
formats with different magics.
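To illustrate the "different magics" point, here is a minimal sketch that tells the two formats apart by their leading bytes. It is not the patch attached to the issue: the class name and the returned labels are made up, and the byte patterns are my understanding of the formats (the placeable WMF key 0x9AC6CDD7 stored little-endian at offset 0; an EMF header record of type 1 at offset 0 with the " EMF" signature at offset 40). WMF files without the placeable header are not covered.

```java
/** Hypothetical sketch: distinguish EMF from placeable WMF by magic bytes. */
final class MetafileMagicSketch {

    /** Returns "WMF", "EMF" or "unknown"; these are labels, not media types. */
    static String detect(byte[] data) {
        // Placeable WMF: key 0x9AC6CDD7, little-endian, at offset 0.
        if (matches(data, 0, 0xD7, 0xCD, 0xC6, 0x9A)) {
            return "WMF";
        }
        // EMF: header record type 1 at offset 0 and " EMF" at offset 40.
        if (matches(data, 0, 0x01, 0x00, 0x00, 0x00)
                && matches(data, 40, 0x20, 0x45, 0x4D, 0x46)) {
            return "EMF";
        }
        return "unknown";
    }

    /** Compares unsigned byte values starting at the given offset. */
    static boolean matches(byte[] data, int offset, int... expected) {
        if (data == null || offset < 0 || offset + expected.length > data.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if ((data[offset + i] & 0xFF) != expected[i]) {
                return false;
            }
        }
        return true;
    }
}
```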
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/s
[
https://issues.apache.org/jira/browse/TIKA-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-797:
--
Attachment: tika-powerpointextension.patch
A patch which reversed the order of globs for vnd.ms
Tika
Issue Type: Wish
Components: mime
Affects Versions: 1.0
Reporter: Antoni Mylka
[
https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-791:
--
Attachment: tika-791-ver2.zip
Attached an updated patch which uses a new media type
"application/x
[
https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157216#comment-13157216
]
Antoni Mylka commented on TIKA-791:
---
A specific mime type seems like a better idea in
[
https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-791:
--
Attachment: tika-791.zip
A ZIP file with the patch and some test documents. They differ from the ones in
: 1.1
Environment: Windows 7 64 bit
Reporter: Antoni Mylka
TIKA-437 patch allowed Tika to work with OOXML files protected with the default
VelvetSweatshop password. I feel there is room for improvement.
# The POIFSContainerDetector lies when it sees such a file. It should be
[
https://issues.apache.org/jira/browse/TIKA-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-779:
--
Attachment: tika-779.patch
My workaround + test.
> Detection of Microsoft Works 2
[
https://issues.apache.org/jira/browse/TIKA-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-779:
--
Attachment: microsoft-works-word-processor-2000.wps
A test WPS file with no SPELLING top-level name
[
https://issues.apache.org/jira/browse/TIKA-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-779:
--
Description:
In older versions of Tika, our Microsoft Works 2000 Word Processor example file
would get
Environment: Windows 7, 64 bit
Reporter: Antoni Mylka
In older versions of Tika, our Microsoft Works 2000 Word Processor example file
would get recognized properly by the POIFSContainerDetector. Now it isn't. Some
debugging revealed that the improvements from TIKA-704 brok
On 2011-09-23 15:12, Jukka Zitting wrote:
So I think I'll just patch my local copy to do the Q&D thing, and wait for
someone with more XML/RDF-fu to deal with it properly.
Until Someone (TM, :-) does that, I'd be very happy to see the simple
property=xxx mapping you described added to HtmlP
On 2011-08-22 20:37, Tom Grant wrote:
Here's the use case that I'm attempting to solve. I have a customer with
many legacy systems, some of which are completely custom. These systems
have data files that will never be seen outside of their environment. For
example, some are XML files with
[
https://issues.apache.org/jira/browse/TIKA-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072328#comment-13072328
]
Antoni Mylka commented on TIKA-686:
---
FWIW I would say that fewer is better.
On 2011-06-14 16:13, Maxim Valyanskiy wrote:
Tika detects datatype and extracts text in one pass through supplied
input stream. OOXML parser requires random access to ZIP archive files,
so there are only two alternatives - to buffer data in memory or store
it on disk. Overhead appears only wh
On 2011-06-14 16:11, Nick Burch wrote:
On Tue, 14 Jun 2011, Antoni Mylka wrote:
We'll need to buffer the whole file for zip either way. The current way
will create a temp file if you start with an input stream (not if you
have
a file already), will scan through the file looking for en
On 2011-06-14 15:02, Nick Burch wrote:
On Tue, 14 Jun 2011, Antoni Mylka wrote:
You are right. There is still room for improvement. ZipContainerDetector
creates a temp file, which I'd rather avoid
We'll need to buffer the whole file for zip either way. The current way
will cre
On 2011-06-14 11:50, Arjohn Kampman wrote:
On 11/06/2011 02:22, Antoni Mylka wrote:
Brought our TikaMimeTypeIdentifier up to date with the latest Tika
trunk. I increased the number of bytes passed to the identifier to
512KB. It's a lot, but these days CPU is cheap. This large buffer
On 2010-12-03 01:07, Antoni Mylka wrote:
Hello Aperture
(cc tika-dev, may be interesting for you too)
Brought our TikaMimeTypeIdentifier up to date with the latest Tika
trunk. I increased the number of bytes passed to the identifier to
512KB. It's a lot, but these days CPU is
Hello Aperture
(cc tika-dev, may be interesting for you too)
As you know Tika has made certain advances in the field of mime type
identification, which we (Aperture) wanted to implement for a long
time. This is the feature request 3043080 but it applies to a bug
3025427 and feature requests: 2210
[
https://issues.apache.org/jira/browse/TIKA-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka resolved TIKA-560.
---
Resolution: Fixed
This is fixed from my POV, if you don't want to accept null stre
[
https://issues.apache.org/jira/browse/TIKA-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965592#action_12965592
]
Antoni Mylka commented on TIKA-560:
---
It's about detecting the testEXCEL.xlsb
[
https://issues.apache.org/jira/browse/TIKA-562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965590#action_12965590
]
Antoni Mylka commented on TIKA-562:
---
Your unit tests test identification by name an
Reporter: Antoni Mylka
The current tika-mimetypes.xml states that *.vor files are from
vnd.stardivision.writer. This is not true. The vor extension is used by
templates from all staroffice applications. Moreover all of them have the
msoffice magic number.
[
https://issues.apache.org/jira/browse/TIKA-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-563:
--
Attachment: staroffice.5.2.templates.zip
staroffice-templates.patch
A patch and some
[
https://issues.apache.org/jira/browse/TIKA-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-562:
--
Attachment: ooxml-children.patch
> In tika-mimetypes.xml OpenXML types should have x-tika-ooxml as th
Type: Bug
Reporter: Antoni Mylka
A couple of file types have application/x-tika-msoffice as their parent, when
they should have application/x-tika-ooxml. This error is exhibited when you try
to identify those files with both name and data. The data is found to be
x-tika-ooxml, while
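Why the declared parent matters when name and data are combined can be shown with a small sketch. It assumes one plausible combination rule (prefer the name-based type only when it is a specialization of the data-based type) and is not taken from the Tika sources; the class, its methods and the child type "example/ooxml-child" are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of combining name-based and data-based detection. */
final class TypeHierarchySketch {
    // Maps a type to its declared parent type.
    private final Map<String, String> parents = new HashMap<>();

    void setParent(String type, String parent) {
        parents.put(type, parent);
    }

    /** Walks the parent chain of type looking for ancestor. */
    boolean isSpecializationOf(String type, String ancestor) {
        String current = parents.get(type);
        // The hop limit guards against accidental cycles in the parent map.
        for (int hops = 0; current != null && hops < 32; hops++) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parents.get(current);
        }
        return false;
    }

    /** Assumed rule: the name wins only if it refines what the data says. */
    String combine(String byName, String byData) {
        if (byName != null && isSpecializationOf(byName, byData)) {
            return byName;
        }
        return byData;
    }
}
```

Under that rule, a child type registered under application/x-tika-msoffice is never chosen when the data says application/x-tika-ooxml, so the result stays at the generic container type until the parent is corrected.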
[
https://issues.apache.org/jira/browse/TIKA-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965428#action_12965428
]
Antoni Mylka edited comment on TIKA-560 at 11/30/10 3:49 PM:
-
[
https://issues.apache.org/jira/browse/TIKA-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965428#action_12965428
]
Antoni Mylka commented on TIKA-560:
---
It seems that when applying changes to
[
https://issues.apache.org/jira/browse/TIKA-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12936090#action_12936090
]
Antoni Mylka commented on TIKA-560:
---
MimeTypes, when you pass a null stream - uses
[
https://issues.apache.org/jira/browse/TIKA-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-561:
--
Attachment: tika-561.patch
A patch which contains the modifications and the test file. It overlaps with
Support EMLX file detection
---
Key: TIKA-561
URL: https://issues.apache.org/jira/browse/TIKA-561
Project: Tika
Issue Type: Improvement
Reporter: Antoni Mylka
Apple Mail generates email files in .emlx
[
https://issues.apache.org/jira/browse/TIKA-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-560:
--
Attachment: test-documents.zip
tika-560.patch
A patch with my solution proposal, and the
Improve detection of .mht, Foxmail, and OOXML files
---
Key: TIKA-560
URL: https://issues.apache.org/jira/browse/TIKA-560
Project: Tika
Issue Type: Improvement
Reporter: Antoni
[
https://issues.apache.org/jira/browse/TIKA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-487:
--
Attachment: tika-truncated-ooxml-file.patch
A patch with a test that exposes the problem
ement
Reporter: Antoni Mylka
Attachments: tika-truncated-ooxml-file.patch
When I try to run the detector on a truncated Open XML file I get an exception
java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
[
https://issues.apache.org/jira/browse/TIKA-486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-486:
--
Attachment: tika-non-office-files-with-office-magic.patch
test-documents.zip
Four test
Tika
Issue Type: Improvement
Reporter: Antoni Mylka
There are many applications which use the MSOffice magic number. I know of
Corel Presentations X3, Corel Quattro Pro 7 and X3 and Microsoft Works Word
Processor. They have their own mime types.
They aren't properly su
[
https://issues.apache.org/jira/browse/TIKA-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoni Mylka updated TIKA-485:
--
Attachment: tika-truncated-excel-file.patch
a patch with a test that exposes the issue
ement
Reporter: Antoni Mylka
Attachments: tika-truncated-excel-file.patch
If a file has a POI magic number but the call to new POIFSFileSystem(new
FileInputStream(stream.getFile())); throws an exception because the file is
broken - the entire process will fail. A simple try-catch around the ca
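The guard suggested here can be sketched without POI on the classpath. In the sketch below the ContainerInspector interface is a hypothetical stand-in for the POIFSFileSystem call, and the type strings passed in are placeholders; the point is only that a failure while opening a broken container falls back to a less specific answer instead of aborting detection.

```java
import java.io.IOException;

/** Hypothetical sketch: keep detection alive when container parsing fails. */
final class SafeContainerDetectionSketch {

    /** Stand-in for the code that opens the container and names its type. */
    interface ContainerInspector {
        String inspect(byte[] data) throws IOException;
    }

    /**
     * Runs the inspector and returns its answer, or fallbackType if the
     * container turns out to be truncated or otherwise unreadable.
     */
    static String detect(byte[] data, ContainerInspector inspector, String fallbackType) {
        try {
            return inspector.inspect(data);
        } catch (IOException | RuntimeException e) {
            // Broken file: report the generic type rather than failing.
            return fallbackType;
        }
    }
}
```

Catching RuntimeException as well is a design choice in this sketch, on the assumption that a parser fed a damaged file may fail in ways other than IOException.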
Hi,
The tika mime type detection code has improved greatly since I last
looked at it a while ago. The root-XML-based detection and
ContainerAwareDetector are things we (Aperture) have wanted to do
ourselves since at least 2007 but never got round to it :)
Unfortunately there are many subtle dif
81 matches