[ https://issues.apache.org/jira/browse/TIKA-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Margolis updated TIKA-3150:
---------------------------------
Summary: MimeType Regex End of Binary File Fails (was: MimeType Regex End
of Binary Fails)
> MimeType Regex End of Binary File Fails
> ---------------------------------------
>
> Key: TIKA-3150
> URL: https://issues.apache.org/jira/browse/TIKA-3150
> Project: Tika
> Issue Type: Improvement
> Components: config, detector
> Affects Versions: 1.24, 1.24.1
> Reporter: David Margolis
> Priority: Major
> Labels: config, detection, mime, regex
> Fix For: 1.24.1
>
>
> h1. Summary
> Regular expressions used to match MIME types in custom Tika config files fail
> when they try to match exactly up to the end of the file using the regex $
> anchor or {n} quantifiers.
> h1. Steps to reproduce
> Suppose we have a binary file that begins with 3 arbitrary bytes followed by
> four 0x00 bytes, and this whole 7-byte pattern repeats 5 times. The
> following config should match that file:
> {code:xml}
> <mime-type type='application/MY_CUSTOM_FORMAT'>
>   <acronym>custom</acronym>
>   <magic priority='50'>
>     <match value="^([\\S\\s]{3}(\\x00){4}){5}$" type="regex" offset="24"/>
>   </magic>
> </mime-type>
> {code}
> The $ anchor causes this regex to fail. Additionally, changing the quantifier
> from exactly 5 repetitions to 6 does not cause the regex to fail, even
> though the regex would then have to match past the end of the file. Is this
> because the regex is wrapping around from the end of the file back to the
> beginning?