[iText-questions] Missing spaces in extracted text

Alex Vigdor Tue, 21 Jul 2009 15:00:26 -0700

Hello,

I've begun experimenting with the PdfTextExtractor in iText as a replacement for PDFBox. So far I'm very pleased with the results in many cases, however I've noticed several examples where all the words in the extracted text run together without spaces, so perhaps some tweaking is needed in the character distance calculations. See this PDF as an example where most spaces are missing:


http://nepomuk.semanticdesktop.org/xwiki/bin/download/Main1/Publications/Minack%202008.pdf

Sample output:

Abstract. WiththegrowthoftheSemanticWeb,therequirements
onstoringandqueryingRDFhasbecomemoresophisticated.When
alargeramountofdatahastobemanaged,queriesinstructuredquery
languages,suchasSPARQL,arenotalwayspowerfulenough.Useofad-
ditionalkeywordsforqueryingcanfurtherreducetheresultsettowards
theactualrelevantanswers,however,SPARQLonlyprovidescomplete
stringmatchingorlteringbasedonregularexpressions,whichisavery
slowoperation.Incontrast,stateoftheartInformationRetrieval(IR)
techniquesprovidesophisticatedfeaturessuchaskeywordsearch,lemma-
tisation,stemmingandranking.Inthispaperwepresentacombination
ofstructuredRDFqueriesandfull-textsearch.Itisimplementedasan
extensionofanestablishedRDFstore(Sesame)withIRcapabilitiesus-
ingthetextsearchlibraryLucene ,withoutrequiringmodicationsto
existingRDFquerylanguages.
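
For illustration only, here is a minimal sketch of the kind of character distance check referred to above. It is not iText's actual code: the names GapSpacer, Chunk and join, and the threshold values, are hypothetical. It assumes the PDF positions each word separately without emitting space glyphs, so the extractor has to infer a space from the horizontal gap between neighbouring text runs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of gap-based space insertion; not iText's implementation. */
class GapSpacer {

    /** One run of text with its horizontal extent on the line (hypothetical type). */
    static final class Chunk {
        final String text;
        final float startX;
        final float endX;

        Chunk(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /**
     * Joins the chunks of a single line, assumed to be sorted left to right.
     * A space is inserted whenever the gap between two neighbouring chunks
     * is larger than threshold * spaceWidth.
     */
    static String join(List<Chunk> chunks, float spaceWidth, float threshold) {
        StringBuilder out = new StringBuilder();
        Chunk previous = null;
        for (Chunk current : chunks) {
            if (previous != null) {
                float gap = current.startX - previous.endX;
                if (gap > threshold * spaceWidth) {
                    out.append(' ');
                }
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Three words placed 2.5 units apart, with an assumed space width of 5 units.
        List<Chunk> line = new ArrayList<>();
        line.add(new Chunk("With", 0f, 20f));
        line.add(new Chunk("the", 22.5f, 35f));
        line.add(new Chunk("growth", 37.5f, 65f));
        float spaceWidth = 5f;

        System.out.println(join(line, spaceWidth, 0.3f)); // prints "With the growth"
        System.out.println(join(line, spaceWidth, 0.6f)); // prints "Withthegrowth"
    }
}
```

With the looser threshold the 2.5-unit gaps count as spaces; with the stricter one they do not, which gives run-together output like the sample above. Whether this is what actually happens for this PDF is an assumption.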


Thanks for the great work, and hope this isn't too complicated to solve!

Cheers,
Alex


_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php
Check the site with examples before you ask questions: 
http://www.1t3xt.info/examples/
You can also search the keywords list: http://1t3xt.info/tutorials/keywords/
