[ 
https://issues.apache.org/jira/browse/JCR-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298779#comment-16298779
 ] 

Tim Allison edited comment on JCR-4215 at 12/20/17 5:31 PM:
------------------------------------------------------------

Tika's behavior, even in 1.16, was to sniff the bytes and trust the result over 
the {{Content-Type}} passed in via the metadata.  Before, you weren't sending any 
bytes, so it relied on what you told it.  Now, you're sending bytes to avoid 
the {{ZeroByteException}}, so it sniffs those bytes, detects text, and 
ignores the MIME type you are sending in.

To trigger the BlockingParser:
 1. Change the line you mentioned above to:
{noformat}
            resource.setProperty("jcr:data",
                    "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>"
                    + "<blocked>FOOBAR</blocked>", PropertyType.BINARY);
{noformat}
 2. Add a file called {{custom-mimetypes.xml}} in 
{{test/resources/org/apache/tika/mime}} that looks like this:
{noformat}
<?xml version="1.0" encoding="UTF-8"?>
<!-- ASL 2.0 -->
<mime-info>
    <!-- add this for detection to trigger the BlockingParser -->
    <mime-type type="application/x-blocked">
        <root-XML localName="blocked"/>
        <sub-class-of type="application/xml"/>
    </mime-type>
</mime-info>
{noformat}
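
To make the effect of the {{root-XML}} rule above concrete, here is a rough, JDK-only sketch of the idea: read the stream until the first start element and map its local name to a type. This is not Tika's implementation. The class {{RootXmlSniffer}}, its method names, and the {{text/plain}} fallback are hypothetical and only illustrate why a {{<blocked>}} root element would end up as {{application/x-blocked}}.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of Tika or Jackrabbit.
final class RootXmlSniffer {

    /** Returns the local name of the root element, or null if the bytes are empty or not XML. */
    static String rootLocalName(byte[] bytes) {
        if (bytes == null || bytes.length == 0) {
            return null; // nothing to sniff
        }
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or external entities while sniffing.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(bytes));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName(); // first start element is the root
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Not well-formed XML; treated as "no root element" below.
        }
        return null;
    }

    /** Assumed mapping that mirrors the single rule in custom-mimetypes.xml. */
    static String detect(byte[] bytes) {
        String root = rootLocalName(bytes);
        if ("blocked".equals(root)) {
            return "application/x-blocked";
        }
        // Crude fallback for this sketch: any other XML, else plain text.
        return root != null ? "application/xml" : "text/plain";
    }

    public static void main(String[] args) {
        byte[] data = ("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>"
                + "<blocked>FOOBAR</blocked>").getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(data));
    }
}
```

With the test data from step 1, the first start element is {{blocked}}, so the sketch reports the custom type; other well-formed XML falls back to the parent type {{application/xml}}.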


As a side note: if you want to override the detector and have it believe 
whatever you tell it the file is, you can do this with 
{{TikaCoreProperties.CONTENT_TYPE_OVERRIDE}} as of 1.17.
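
The precedence described in this comment can be summarized in a toy model. This is only a sketch of the behavior as described here, not Tika's detector code: the class {{TypeResolution}} and its parameters are hypothetical, and the {{application/octet-stream}} default is an assumption for the case where nothing is known.

```java
// Hypothetical model for illustration only; not Tika's detector.
final class TypeResolution {

    /**
     * @param override type set explicitly to bypass detection, or null
     * @param sniffed  type detected from the bytes, or null if no bytes were sent
     * @param declared Content-Type passed in via the metadata, or null
     */
    static String resolve(String override, String sniffed, String declared) {
        if (override != null) {
            return override;  // an explicit override is believed as-is
        }
        if (sniffed != null) {
            return sniffed;   // bytes were sent: the sniffed result wins
        }
        if (declared != null) {
            return declared;  // no bytes: rely on what the caller said
        }
        return "application/octet-stream"; // assumed default when nothing is known
    }

    public static void main(String[] args) {
        // Bytes sniffed as text win over the declared type.
        System.out.println(resolve(null, "text/plain", "application/x-blocked"));
    }
}
```

In this model, the test failure corresponds to the second branch: once bytes are present, the sniffed {{text/plain}} replaces the declared type unless an override is supplied.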



> Use Tika version 1.17
> ---------------------
>
>                 Key: JCR-4215
>                 URL: https://issues.apache.org/jira/browse/JCR-4215
>             Project: Jackrabbit Content Repository
>          Issue Type: Task
>          Components: parent
>            Reporter: Julian Reschke
>            Assignee: Julian Reschke
>             Fix For: 2.18
>
>         Attachments: 
> TEST-org.apache.jackrabbit.core.query.lucene.IndexingQueueTest.xml, 
> org.apache.jackrabbit.core.query.lucene.IndexingQueueTest.log, 
> org.apache.jackrabbit.core.query.lucene.IndexingQueueTest.txt
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
