[jira] [Commented] (TIKA-1489) PDF Text extraction without permission

Nick Burch (JIRA) Mon, 01 Dec 2014 05:17:47 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229768#comment-14229768
 ]


Nick Burch commented on TIKA-1489:
----------------------------------

If we make the change, then all sorts of things will silently stop working. 
People indexing PDFs via Solr will find some of them stop showing up in their 
indexes. People calling out to Tika from Elasticsearch will stop finding 
documents. People searching PDFs on their Hadoop clusters, or using the Tika 
Server, or calling tika-app will miss content they get now. Many of those users 
have no easy way to interact with a PDFParserConfig object.

For most of those use cases, Tika is not the end-user application, so it is not 
the right place to be deciding what should and shouldn't be included or 
excluded. 

I'd probably rather we fetched those permission fields and passed them 
downstream in the metadata object, for the final end-user application to use. 
We could then look at a way of people setting something on the PDFParserConfig, 
or probably a more general thing on ParseContext, to say "please do 
permissions enforcement for my application". The default behaviour would remain 
as now, so we don't break anything for our current users; the difference is 
that they can opt in, or check the metadata, as appropriate for their exact 
use case.
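
To make the opt-in idea concrete, here is a rough sketch only. It uses a plain 
Map to stand in for the Tika metadata object, and both the key name 
"pdf:canExtractText" and the enforce flag are hypothetical names invented for 
illustration, not existing Tika or PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;

class PermissionSketch {
    // Hypothetical metadata key, invented for this sketch; not a real Tika name.
    static final String CAN_EXTRACT = "pdf:canExtractText";

    /**
     * Decide whether extracted text should be handed on to the caller.
     * enforce == false mirrors the current default: always extract.
     * enforce == true stands for the hypothetical opt-in setting.
     */
    static boolean mayExtract(Map<String, String> metadata, boolean enforce) {
        if (!enforce) {
            return true;
        }
        // A missing value is treated as "allowed"; only an explicit "false" blocks.
        return !"false".equalsIgnoreCase(metadata.get(CAN_EXTRACT));
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CAN_EXTRACT, "false");

        System.out.println(mayExtract(metadata, false)); // prints "true"
        System.out.println(mayExtract(metadata, true));  // prints "false"
    }
}
```

With the flag off nothing changes for existing users, while an end-user 
application that cares can either switch it on or read the permission value 
from the metadata and decide for itself.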

> PDF Text extraction without permission
> --------------------------------------
>
>                 Key: TIKA-1489
>                 URL: https://issues.apache.org/jira/browse/TIKA-1489
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.7
>            Reporter: Tilman Hausherr
>
> In TIKA-1442, text extraction works on files like 717226.pdf that don't have 
> text extraction permission. The permissions in PDF files are only enforced 
> by the application (here, PDFBox); the text information isn't stored 
> separately in encrypted form. 
> The PDFBox ExtractText command line does throw an exception.
> So I wonder why Tika is able to extract text. Either Tika or the PDFBox call 
> it uses bypasses the permission checking.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
