[
https://issues.apache.org/jira/browse/TIKA-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18010803#comment-18010803
]
Manish S N commented on TIKA-4459:
----------------------------------
I collected some more OpenDocument and MS Office files from the internet and ran the test again.
The results are:
run 1:
{code:java}
#$# spooling: true, fileCount: 151, meanTime: 35.35, stdDeviation: 58.17,
minTime: 1.0, maxTime: 364.0, medianTime: 11.0
#$# spooling: false, fileCount: 151, meanTime: 38.40, stdDeviation: 68.07,
minTime: 1.0, maxTime: 422.0, medianTime: 10.0 {code}
run 2:
{code:java}
#$# spooling: true, fileCount: 151, meanTime: 35.65, stdDeviation: 60.67,
minTime: 1.0, maxTime: 411.0, medianTime: 11.0
#$# spooling: false, fileCount: 151, meanTime: 39.63, stdDeviation: 70.73,
minTime: 0.0, maxTime: 416.0, medianTime: 11.0 {code}
run 3:
{code:java}
#$# spooling: true, fileCount: 151, meanTime: 35.87, stdDeviation: 58.69,
minTime: 1.0, maxTime: 378.0, medianTime: 10.0
#$# spooling: false, fileCount: 151, meanTime: 41.48, stdDeviation: 74.02,
minTime: 0.0, maxTime: 440.0, medianTime: 10.0 {code}
You can see that the spooling variant has a better mean and maxTime in all three runs.
*_Hence it is inferred that the parser is more efficient with ZipFile than with
ZipInputStream._*
(It is also the variant that handles errors properly, throwing EncryptedDocumentException instead of ZipException for protected files.)
So can we change the OpenDocumentParser to spool files by default?
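To make the two code paths being compared concrete, here is a minimal, Tika-independent sketch using plain java.util.zip (the class and entry names are illustrative, not Tika code): ZipInputStream walks local file headers sequentially, which is the handleZipStream path, while ZipFile requires a file on disk but reads the central directory, which is what spooling buys.
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipPathsDemo {

    // Lists entry names twice: via ZipInputStream (streaming path) and via
    // ZipFile (spooled path); returns {streamed, spooled}.
    static String[] listBothWays() throws IOException {
        // Build a tiny two-entry zip in a temp file (stand-in for an ODF package).
        Path tmp = Files.createTempFile("demo", ".zip");
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(tmp))) {
            zos.putNextEntry(new ZipEntry("mimetype"));
            zos.write("application/vnd.oasis.opendocument.text"
                    .getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("content.xml"));
            zos.write("<office/>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }

        // Streaming: entries are discovered one by one from local file headers
        // only, which is why STORED entries with EXT descriptors fail here.
        StringBuilder streamed = new StringBuilder();
        try (ZipInputStream zis = new ZipInputStream(Files.newInputStream(tmp))) {
            for (ZipEntry e; (e = zis.getNextEntry()) != null; ) {
                streamed.append(e.getName()).append(';');
            }
        }

        // Spooled: ZipFile needs a real file on disk, but it reads the central
        // directory, so entry metadata is complete and access is random.
        StringBuilder spooled = new StringBuilder();
        try (ZipFile zf = new ZipFile(tmp.toFile())) {
            zf.stream().forEach(e -> spooled.append(e.getName()).append(';'));
        }

        Files.delete(tmp);
        return new String[] { streamed.toString(), spooled.toString() };
    }

    public static void main(String[] args) throws IOException {
        String[] r = listBothWays();
        System.out.println("streamed: " + r[0] + "  spooled: " + r[1]);
    }
}
{code}
Both paths list the same entries for a well-formed archive; the difference only bites on archives (like encrypted ODF) whose entries cannot be decoded from local headers alone.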
P.S.: As for the SSD write-limit concern, there is
[this|https://linustechtips.com/topic/811454-should-i-be-worried-of-ssd-write-limit/]
Linus Tech Tips discussion and
[this|https://superuser.com/questions/345997/what-happens-when-an-ssd-wears-out]
Super User thread; both agree that wearing out a modern SSD through normal write
volume is largely a myth.
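In the meantime, a caller-side workaround consistent with the issue description is to spool the stream yourself before parsing, so the parser can take the ZipFile path; a minimal sketch (the handler and metadata setup are illustrative, and this assumes tika-core and tika-parsers on the classpath):
{code:java}
import java.io.InputStream;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class SpoolBeforeParse {
    static String parse(InputStream raw) throws Exception {
        try (TikaInputStream tis = TikaInputStream.get(raw)) {
            // Spools the stream to a temporary file on disk, so protected ODF
            // files throw EncryptedDocumentException instead of ZipException.
            tis.getPath();
            BodyContentHandler handler = new BodyContentHandler();
            new AutoDetectParser().parse(tis, handler, new Metadata(), new ParseContext());
            return handler.toString();
        }
    }
}
{code}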
> protected ODF encryption detection fail
> ---------------------------------------
>
> Key: TIKA-4459
> URL: https://issues.apache.org/jira/browse/TIKA-4459
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 3.2.1
> Environment: Ubuntu 24.04.2 LTS x86_64
> Reporter: Manish S N
> Priority: Minor
> Labels: encryption, odf, open-document-format, protected,
> regression, zip
> Fix For: 4.0.0, 3.2.2
>
> Attachments: protected.odt, testProtected.odp
>
>
> When passing an InputStream of a protected (encrypted) ODF file to Tika, we get a
> ZipException instead of an EncryptedDocumentException.
> This works correctly and throws EncryptedDocumentException if you create the
> TikaInputStream with a Path or call TikaInputStream.getPath(), as that spools the
> stream to a temporary file on disk.
> But when working with plain InputStreams we get the following zip exception:
>
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.odf.OpenDocumentParser@bae47a0
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
> at org.apache.tika.Tika.parseToString(Tika.java:525)
> at org.apache.tika.Tika.parseToString(Tika.java:495)
> at org.manish.AttachmentParser.parse(AttachmentParser.java:21)
> at org.manish.AttachmentParser.lambda$testParse$1(AttachmentParser.java:72)
> at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
> at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
> at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
> at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
> at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
> at org.manish.AttachmentParser.testParse(AttachmentParser.java:64)
> at org.manish.AttachmentParser.main(AttachmentParser.java:57)
> Caused by: java.util.zip.ZipException: only DEFLATED entries can have EXT descriptor
> at java.base/java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:313)
> at java.base/java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:125)
> at org.apache.tika.parser.odf.OpenDocumentParser.handleZipStream(OpenDocumentParser.java:218)
> at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:169)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> ... 19 more
>
> (We use Tika to detect encrypted documents.)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)