[jira] [Commented] (TIKA-1489) PDF Text extraction without permission

Tilman Hausherr (JIRA) Wed, 26 Nov 2014 09:30:43 -0800

    [ https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226500#comment-14226500 ]


Tilman Hausherr commented on TIKA-1489:
---------------------------------------

No, permissions are connected to encryption. Encrypted files have two 
passwords: the user password and the owner password. The user password, if 
correct (it is often empty), lets you view the file but may withhold certain 
permissions, very often the permission to extract text. The owner password 
lets you "do everything".

Tika's PDF2XHTML.java doesn't check permissions at all, and neither does its 
parent class PDFTextStripper. Oh, oh.
{quote}Again, if I understand correctly, Tilman Hausherr's point is that 
applications have a responsibility to respect the document's desired access 
irrespective of encryption.{quote}
That is correct.
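
To illustrate what such a check could look like, here is a minimal sketch in 
plain Java. It assumes the permission flags are available as the integer from 
the /P entry of the encryption dictionary, where the PDF specification numbers 
bits from 1 and bit 5 covers copying or extracting text and graphics. The 
class PdfPermissions, its methods and the choice of SecurityException are 
hypothetical names for illustration only; they are not Tika or PDFBox API.

```java
/**
 * Sketch of an application-side permission check for text extraction.
 * Bit positions follow the /P entry of the PDF encryption dictionary
 * (bits numbered from 1, lowest-order bit first).
 * All names here are hypothetical, not Tika or PDFBox API.
 */
final class PdfPermissions {
    // Bit 5: copy or otherwise extract text and graphics.
    static final int EXTRACT_BIT = 5;

    /** Returns true if the given 1-based bit is set in the /P value. */
    static boolean isSet(int p, int bit) {
        return (p & (1 << (bit - 1))) != 0;
    }

    /** The owner password lifts all restrictions; otherwise bit 5 decides. */
    static boolean mayExtractText(int p, boolean openedWithOwnerPassword) {
        return openedWithOwnerPassword || isSet(p, EXTRACT_BIT);
    }

    /** Guard that an extractor could call before reading any page content. */
    static void requireExtractPermission(int p, boolean openedWithOwnerPassword) {
        if (!mayExtractText(p, openedWithOwnerPassword)) {
            throw new SecurityException("Document does not permit text extraction");
        }
    }

    public static void main(String[] args) {
        int allAllowed = -1;                                    // every bit set
        int noExtract = allAllowed & ~(1 << (EXTRACT_BIT - 1)); // bit 5 cleared

        System.out.println(mayExtractText(allAllowed, false));
        System.out.println(mayExtractText(noExtract, false));
        System.out.println(mayExtractText(noExtract, true));
    }
}
```

Because the text is not protected separately, nothing technical stops a 
parser from skipping this guard, which is why the check has to live in the 
application itself.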


> PDF Text extraction without permission
> --------------------------------------
>
>                 Key: TIKA-1489
>                 URL: https://issues.apache.org/jira/browse/TIKA-1489
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.7
>            Reporter: Tilman Hausherr
>
> In TIKA-1442, text extraction works on files like 717226.pdf that don't 
> grant text extraction permission. Permissions in PDF files are enforced only 
> by the application (i.e. PDFBox); the text itself isn't stored separately in 
> encrypted form. 
> The PDFBox ExtractText command line tool does throw an exception.
> So I wonder why Tika is able to extract the text. Either Tika or the PDFBox 
> call it uses bypasses the permission check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
