[jira] [Commented] (TIKA-100) Structured PDF parsing

David vandendriessche (JIRA) Fri, 01 Mar 2013 03:05:18 -0800

    [ https://issues.apache.org/jira/browse/TIKA-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590425#comment-13590425 ]


David vandendriessche commented on TIKA-100:
--------------------------------------------

At the moment I'm using PDFBox directly to upload my data to Solr (a search 
engine), since Tika doesn't support page extraction.

I'm pretty sure that if Tika gets this feature (Solr uses Tika when you use 
the extracting request handler), the Solr developers might change Solr so that 
it can return page hits for PDFs.
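
A minimal sketch of the page-level indexing idea described above, under stated assumptions. It assumes the text of each page has already been extracted separately (for example with PDFBox's PDFTextStripper, restricting it to one page at a time via setStartPage and setEndPage) and only shows how per-page records could be shaped so that a search hit can report its page. The class PageDocs and the field names "id", "source", "page" and "text" are hypothetical and are not part of PDFBox, Tika or Solr.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of PDFBox, Tika, or Solr.
final class PageDocs {

    // Builds one field map per non-empty page. The field names
    // ("id", "source", "page", "text") are assumptions, not a Solr schema.
    static List<Map<String, Object>> toPageDocuments(String sourceId, List<String> pageTexts) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i) == null ? "" : pageTexts.get(i).trim();
            if (text.isEmpty()) {
                continue; // skip blank pages but keep the original page numbering
            }
            int pageNumber = i + 1; // 1-based, like page numbers in a PDF viewer
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("id", sourceId + "#page=" + pageNumber);
            doc.put("source", sourceId);
            doc.put("page", pageNumber);
            doc.put("text", text);
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        // Placeholder page texts; in practice these would come from a
        // per-page extraction step such as the PDFBox approach noted above.
        List<String> pages = Arrays.asList("Introduction ...", "", "Results ...");
        for (Map<String, Object> doc : toPageDocuments("report.pdf", pages)) {
            System.out.println(doc);
        }
    }
}
```

Indexing one record per page, rather than one per file, is what would let a query return the page on which a match occurs; the cost is more documents in the index and the need to group hits by the "source" field when a per-file result is wanted.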


                
> Structured PDF parsing
> ----------------------
>
>                 Key: TIKA-100
>                 URL: https://issues.apache.org/jira/browse/TIKA-100
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> The PDF parser currently extracts and outputs document content as a single 
> string. PDFBox could be used to support structuring at least down to page and 
> paragraph (not sure how accurate) level.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
