[ 
https://issues.apache.org/jira/browse/TIKA-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356792#comment-15356792
 ] 

Joeran commented on TIKA-2018:
------------------------------

Hey there,

I am one of the creators of "Docear's PDF Inspector". You are very welcome to 
use "our part" of the code under the Apache 2.0 licence. I say "our part" 
because Docear's PDF Inspector uses the external JPOD PDF library, and of 
course we cannot change JPOD's licence ;-). However, JPOD's licence may be 
compatible with Apache anyway (I haven't checked). 

If you are interested, I could also send you the source code of the predecessor 
of Docear's PDF Inspector, named "SciPlore Xtract". SciPlore Xtract uses PDFBox, 
but its source code and process are much less clean than those of Docear's PDF 
Inspector.



> Attempt to get Title from Full text if not present in MetaData ( 
> Application/Pdf )
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-2018
>                 URL: https://issues.apache.org/jira/browse/TIKA-2018
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Florent Valdelievre
>            Priority: Minor
>
> The vast majority of PDF documents don't fill in their metadata. 
> As a result, Tika often can't retrieve information such as the title.
> There is a [nice 
> scientific|http://docear.org/papers/SciPlore%20Xtract%20--%20Extracting%20Titles%20from%20Scientific%20PDF%20Documents%20by%20Analyzing%20Style%20Information%20%28Font%20Size%29-preprint.pdf]
>  document explaining how to get the title from the styles present in the 
> document using a simple rule-based heuristic. We can probably ask for the 
> source code if necessary.
> Also, I have tested another library, https://github.com/Docear/PDF-Inspector, 
> which does a great job. However, it seems to work exclusively with File 
> objects, which is not suitable in a Hadoop and Nutch context. It would have 
> been nice if it worked with streams.
> What do you think?
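
To make the font-size idea from the issue description concrete, here is a 
minimal sketch. It is not the SciPlore Xtract or Docear implementation: the 
class TitleHeuristic, the type TextRun and the method guessTitle are 
hypothetical names, and the rule used (take the first contiguous block of 
first-page text set in the largest font size) is an assumed simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: guesses a title from the styled text runs of a first page. */
class TitleHeuristic {

    /** A piece of text plus the font size it was drawn in (hypothetical type). */
    static final class TextRun {
        final String text;
        final float fontSize;

        TextRun(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Returns the first contiguous block of runs set in the largest font size,
     * joined with spaces, or an empty string if there is no non-blank text.
     * Runs are expected in reading order.
     */
    static String guessTitle(List<TextRun> firstPageRuns) {
        // Pass 1: find the largest font size among non-blank runs.
        float max = 0f;
        for (TextRun run : firstPageRuns) {
            if (!run.text.trim().isEmpty() && run.fontSize > max) {
                max = run.fontSize;
            }
        }

        // Pass 2: collect the first contiguous block of runs in that size.
        List<String> parts = new ArrayList<>();
        for (TextRun run : firstPageRuns) {
            String text = run.text.trim();
            if (text.isEmpty()) {
                continue; // blank runs neither start nor end the block
            }
            boolean isLargest = Math.abs(run.fontSize - max) < 0.01f;
            if (isLargest) {
                parts.add(text);
            } else if (!parts.isEmpty()) {
                break; // the block of largest-font text has ended
            }
        }
        return String.join(" ", parts);
    }
}
```

Because the heuristic only needs text and font sizes, the runs could in 
principle be collected by any parser that reads from a stream rather than a 
File, which is the concern raised for the Hadoop and Nutch context.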



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
