Hello,
I've begun experimenting with the PdfTextExtractor in iText as a replacement for PDFBox. So far I'm very pleased with the results in many cases. However, I've noticed several documents where all the words in the extracted text run together without spaces, so perhaps the character distance calculations need some tweaking. Here is an example PDF where most of the spaces are missing:

http://nepomuk.semanticdesktop.org/xwiki/bin/download/Main1/Publications/Minack%202008.pdf

Sample output:

Abstract. WiththegrowthoftheSemanticWeb,therequirements
onstoringandqueryingRDFhasbecomemoresophisticated.When
alargeramountofdatahastobemanaged,queriesinstructuredquery
languages,suchasSPARQL,arenotalwayspowerfulenough.Useofad-
ditionalkeywordsforqueryingcanfurtherreducetheresultsettowards
theactualrelevantanswers,however,SPARQLonlyprovidescomplete
stringmatchingorlteringbasedonregularexpressions,whichisavery
slowoperation.Incontrast,stateoftheartInformationRetrieval(IR)
techniquesprovidesophisticatedfeaturessuchaskeywordsearch,lemma-
tisation,stemmingandranking.Inthispaperwepresentacombination
ofstructuredRDFqueriesandfull-textsearch.Itisimplementedasan
extensionofanestablishedRDFstore(Sesame)withIRcapabilitiesus-
ingthetextsearchlibraryLucene ,withoutrequiringmodicationsto
existingRDFquerylanguages.
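
To illustrate what I mean by the character distance calculation, here is a minimal sketch of a gap-based heuristic. It is not iText's actual code: the class GapSpacingSketch, the Chunk type, and the threshold fractions are all hypothetical names and values chosen for the example. The idea is that a space is emitted only when the horizontal gap between two consecutive text chunks exceeds some fraction of the font's space width, so a threshold that is too strict for a tightly set document drops every space.

```java
import java.util.ArrayList;
import java.util.List;

public class GapSpacingSketch {

    /** Hypothetical positioned text chunk on one line: text plus start/end x coordinates. */
    public static final class Chunk {
        final String text;
        final float startX;
        final float endX;
        final float spaceWidth; // assumed width of a space in this chunk's font and size

        public Chunk(String text, float startX, float endX, float spaceWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.spaceWidth = spaceWidth;
        }
    }

    /**
     * Joins chunks in the order given, inserting a space whenever the gap
     * to the previous chunk is larger than fraction * spaceWidth.
     */
    public static String join(List<Chunk> chunks, float fraction) {
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : chunks) {
            if (prev != null) {
                float gap = c.startX - prev.endX;
                if (gap > fraction * prev.spaceWidth) {
                    sb.append(' ');
                }
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: words separated by gaps of 2 units, space width of 3 units.
        List<Chunk> line = new ArrayList<>();
        line.add(new Chunk("With", 0f, 20f, 3f));
        line.add(new Chunk("the", 22f, 34f, 3f));
        line.add(new Chunk("growth", 36f, 64f, 3f));

        System.out.println(join(line, 1.0f)); // threshold 3.0, gap 2.0: prints "Withthegrowth"
        System.out.println(join(line, 0.5f)); // threshold 1.5, gap 2.0: prints "With the growth"
    }
}
```

With the made-up numbers above, requiring a full space width reproduces the run-together output, while a half-width threshold recovers the word breaks. The right fraction for real documents is something I can only guess at.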


Thanks for the great work, and I hope this isn't too complicated to solve!

Cheers,
Alex
