Sorry to respond so quickly to my own message, but I thought I would at
least demonstrate a naive patch. Obviously this would need to be validated
against many other sources, but at least it solves this particular case.
Interestingly, I observed that in some instances words that should be
separated had a kerning value of exactly 0, hence the second condition.

public void displayPdfString(PdfString string, float tj){
        String unicode = decode(string);

        // this is width in unscaled units - we have to normalize by the Tm scaling
        float width = getStringWidth(unicode, tj);
+        if(tj < -200 || tj==0){
+            unicode = " ".concat(unicode);
+        }
        Matrix nextTextMatrix = new Matrix(width, 0).multiply(textMatrix);

        displayText(unicode, nextTextMatrix);

        textMatrix = nextTextMatrix;
    }
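
To make the heuristic easier to test outside of iText, here is a minimal
standalone sketch of the same rule. The class and method names
(SpaceHeuristicSketch, startsNewWord, join) are hypothetical and not part of
iText; the only things taken from the patch are the -200 threshold and the
tj == 0 check. It assumes each chunk arrives with the adjustment value that
preceded it, the way tj is passed to displayPdfString above.

```java
public final class SpaceHeuristicSketch {
    // Threshold copied from the patch; assumed to be in the same units as tj.
    static final float WORD_GAP_THRESHOLD = -200f;

    // Same condition as the patch: a large negative adjustment, or exactly 0.
    static boolean startsNewWord(float tj) {
        return tj < WORD_GAP_THRESHOLD || tj == 0f;
    }

    // Concatenates chunks, prepending a space wherever the rule fires.
    // adjustments[i] is the value that preceded chunks[i].
    static String join(String[] chunks, float[] adjustments) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < chunks.length; i++) {
            if (startsNewWord(adjustments[i])) {
                out.append(' ');
            }
            out.append(chunks[i]);
        }
        return out.toString();
    }
}
```

Note that because of the tj == 0 check, any chunk with no preceding
adjustment also gets a leading space, so the output may need trimming. That
is one of the things that would have to be checked against other PDFs.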

-Alex

On Tue, Jul 21, 2009 at 4:23 PM, Alex Vigdor <[email protected]> wrote:

> Hello,
> I've begun experimenting with the PdfTextExtractor in iText as a
> replacement for PDFBox.  So far I'm very pleased with the results in many
> cases, however I've noticed several examples where all the words in the
> extracted text run together without spaces, so perhaps some tweaking is
> needed in the character distance calculations.  See this PDF as an example
> where most spaces are missing:
>
>
> http://nepomuk.semanticdesktop.org/xwiki/bin/download/Main1/Publications/Minack%202008.pdf
> Sample output:
>
> Abstract. WiththegrowthoftheSemanticWeb,therequirements
> onstoringandqueryingRDFhasbecomemoresophisticated.When
> alargeramountofdatahastobemanaged,queriesinstructuredquery
> languages,suchasSPARQL,arenotalwayspowerfulenough.Useofad-
> ditionalkeywordsforqueryingcanfurtherreducetheresultsettowards
> theactualrelevantanswers,however,SPARQLonlyprovidescomplete
> stringmatchingorlteringbasedonregularexpressions,whichisavery
> slowoperation.Incontrast,stateoftheartInformationRetrieval(IR)
> techniquesprovidesophisticatedfeaturessuchaskeywordsearch,lemma-
> tisation,stemmingandranking.Inthispaperwepresentacombination
> ofstructuredRDFqueriesandfull-textsearch.Itisimplementedasan
> extensionofanestablishedRDFstore(Sesame)withIRcapabilitiesus-
> ingthetextsearchlibraryLucene ,withoutrequiringmodicationsto
> existingRDFquerylanguages.
>
>
> Thanks for the great work, and hope this isn't too complicated to solve!
>
> Cheers,
> Alex
>
>
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> iText-questions mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> Buy the iText book: http://www.1t3xt.com/docs/book.php
> Check the site with examples before you ask questions:
> http://www.1t3xt.info/examples/
> You can also search the keywords list:
> http://1t3xt.info/tutorials/keywords/
>