[ 
https://issues.apache.org/jira/browse/TIKA-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Margolis updated TIKA-3150:
---------------------------------
    Summary: MimeType Regex End of Binary File Fails  (was: MimeType Regex End 
of Binary Fails)

> MimeType Regex End of Binary File Fails
> ---------------------------------------
>
>                 Key: TIKA-3150
>                 URL: https://issues.apache.org/jira/browse/TIKA-3150
>             Project: Tika
>          Issue Type: Improvement
>          Components: config, detector
>    Affects Versions: 1.24, 1.24.1
>            Reporter: David Margolis
>            Priority: Major
>              Labels: config, detection, mime, regex
>             Fix For: 1.24.1
>
>
> h1. Summary
> Regular expressions used to match MIME types in custom Tika config files 
> fail when they try to match exactly up to the end of the file using the 
> regex $ anchor or {} quantifiers.
> h1. Steps to reproduce
> Let's say, for example, we have a binary file that begins with 3 bytes 
> followed by four 0x00 bytes, and this whole 7-byte pattern repeats 5 times. 
> The following should work for that situation:
> {code:xml}
> <mime-type type='application/MY_CUSTOM_FORMAT'>
>     <acronym>custom</acronym>
>     <magic priority='50'>
>         <match value="^([\\S\\s]{3}(\\x00){4}){5}$" type="regex" offset="24"/>
>     </magic>
> </mime-type>
> {code}
> The $ anchor causes this regex to fail. Additionally, changing the 
> repetition count from 5 to 6 does not cause the regex to fail, even though 
> the pattern would then have to match past the end of the file. Is this 
> because the regex is wrapping around the whole file back to the beginning?
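
As a baseline for the expectation described above, here is a minimal sketch of how the same pattern behaves in plain java.util.regex, outside of Tika. It does not reproduce Tika's detector. The class and method names (EndAnchorSketch, buildSample, matchesWhole, matchesInRegion) and the sample bytes are hypothetical, and the bytes are decoded as ISO-8859-1 only so that each byte maps to exactly one char. The last method shows one documented java.util.regex behaviour that may or may not be relevant here: when a Matcher is restricted to a region, ^ and $ anchor to the region bounds by default rather than to the full input.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: plain java.util.regex, not Tika's detector code.
class EndAnchorSketch {

    // Hypothetical sample data: `groups` repetitions of 3 non-zero bytes
    // followed by 4 0x00 bytes (7 bytes per group).
    static byte[] buildSample(int groups) {
        byte[] data = new byte[groups * 7];
        for (int g = 0; g < groups; g++) {
            data[g * 7] = 'A';
            data[g * 7 + 1] = 'B';
            data[g * 7 + 2] = 'C';
            // the remaining 4 bytes of the group stay 0x00
        }
        return data;
    }

    // Same pattern as in the report, with a configurable repetition count.
    static Pattern pattern(int groups) {
        return Pattern.compile("^([\\S\\s]{3}(\\x00){4}){" + groups + "}$");
    }

    // ISO-8859-1 maps every byte value to exactly one char.
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    // Search the whole input; ^ and $ anchor to its start and end.
    static boolean matchesWhole(byte[] data, int groups) {
        return pattern(groups).matcher(decode(data)).find();
    }

    // Restrict matching to [start, end). With the default anchoring bounds,
    // ^ and $ match at the region boundaries, not at the input boundaries.
    static boolean matchesInRegion(byte[] data, int groups, int start, int end) {
        Matcher m = pattern(groups).matcher(decode(data));
        m.region(start, end);
        return m.lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(buildSample(5), 5));           // true
        System.out.println(matchesWhole(buildSample(5), 6));           // false
        System.out.println(matchesWhole(buildSample(6), 5));           // false
        System.out.println(matchesInRegion(buildSample(6), 5, 0, 35)); // true
    }
}
```

In plain java.util.regex, then, a count of 6 against a 5-group input does not match, and $ rejects trailing data unless matching is confined to a region that ends exactly where the pattern ends. Whether Tika's magic detection feeds the regex a bounded window in this way is not established by this report.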



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
