[ https://issues.apache.org/jira/browse/TIKA-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Margolis updated TIKA-3150:
---------------------------------
    Description: 
h1. Summary
Regular expressions for matching MIME types in custom Tika config files fail 
when the pattern has to match exactly up to the end of a binary file using the 
regex $ anchor or {} quantifiers.

h1. Steps to reproduce
Suppose, for example, we have a binary file that begins with 3 arbitrary bytes 
followed by four 0x00 bytes, and this 7-byte pattern repeats 5 times. The 
following should work for that situation:

{code:xml}
<mime-type type='application/MY_CUSTOM_FORMAT'>
    <acronym>custom</acronym>
    <magic priority='50'>
        <match value="^([\\S\\s]{3}(\\x00){4}){5}$" type="regex" offset="24"/>
    </magic>
</mime-type>
{code}

The $ anchor causes this regex to fail. Additionally, changing the repetition 
count from exactly 5 to exactly 6 does not cause the regex to fail, even though 
the regex would then have to match past the end of the file. Is this because 
the regex wraps around from the end of the file back to the beginning?
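
As a point of comparison, the sketch below runs the same pattern through plain 
java.util.regex, outside Tika, against a 35-byte buffer built to the layout 
described above. It assumes each byte is mapped to one char via ISO-8859-1, 
which may not be what Tika's detector does, and it ignores the offset 
attribute. The class and method names (RegexEndOfFileCheck, sample, matches) 
are hypothetical and only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class RegexEndOfFileCheck {

    // Builds `repeats` groups of 3 arbitrary bytes followed by four 0x00 bytes.
    static byte[] sample(int repeats) {
        byte[] data = new byte[repeats * 7];
        for (int i = 0; i < repeats; i++) {
            data[i * 7] = 'A';
            data[i * 7 + 1] = 'B';
            data[i * 7 + 2] = 'C';
            // The remaining 4 bytes of each group keep their default value 0x00.
        }
        return data;
    }

    // Assumption: one byte becomes one char (ISO-8859-1). Uses find(), so the
    // ^ and $ anchors in the pattern decide where the match starts and ends.
    static boolean matches(String regex, byte[] data) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        byte[] data = sample(5);
        // Exactly 5 repetitions, anchored at both ends.
        System.out.println(matches("^([\\S\\s]{3}(\\x00){4}){5}$", data));
        // 6 repetitions would need 42 bytes, more than the buffer holds.
        System.out.println(matches("^([\\S\\s]{3}(\\x00){4}){6}$", data));
    }
}
```

With standard regex semantics, the {5} pattern with $ is expected to match 
this buffer and the {6} variant is not, which is the behavior the config above 
was written to get.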

  was:
h1. Summary
Regular expressions for matching MIME types in custom Tika config files fail 
when the pattern has to match exactly up to the end of file using the regex $ 
anchor or {} quantifiers.

h1. Steps to reproduce
Suppose, for example, we have a binary file that begins with 3 arbitrary bytes 
followed by four 0x00 bytes, and this 7-byte pattern repeats 5 times. The 
following should work for that situation:

{code:xml}
<mime-type type='application/MY_CUSTOM_FORMAT'>
    <acronym>custom</acronym>
    <magic priority='50'>
        <match value="^([\\S\\s]{3}(\\x00){4}){5}$" type="regex" offset="24"/>
    </magic>
</mime-type>
{code}

The $ anchor causes this regex to fail. Additionally, changing the repetition 
count from exactly 5 to exactly 6 does not cause the regex to fail, even though 
the regex would then have to match past the end of the file. Is this because 
the regex wraps around from the end of the file back to the beginning?


> MimeType Regex End of Binary File Fails
> ---------------------------------------
>
>                 Key: TIKA-3150
>                 URL: https://issues.apache.org/jira/browse/TIKA-3150
>             Project: Tika
>          Issue Type: Improvement
>          Components: config, detector
>    Affects Versions: 1.24, 1.24.1
>            Reporter: David Margolis
>            Priority: Major
>              Labels: config, detection, mime, regex
>             Fix For: 1.24.1
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
