[
https://issues.apache.org/jira/browse/TIKA-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782097#action_12782097
]
MRIT64 commented on TIKA-331:
-----------------------------
Spacing issue
--------------------
Compare lines 10 and 11 of test2.pdf with lines 11 and 12 of the Tika parsing
result (Parsing_Result2.txt):
ðLocalisation des zones de livraison et de stockage
ðLocalisation des zones dangereuses
There is no space between ð and Localisation (ð is how Tika renders the
Wingdings "Rightwards white arrow").
If you copy and paste lines 10 and 11 of test2.pdf into a Notepad window, you
get:
ð Localisation des zones de livraison et de stockage
ð Localisation des zones dangereuses
...with a space between ð and Localisation.
In my case, the missing space in the Tika output causes downstream processing
to treat "ðLocalisation" as a single word.
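
Until the parser preserves the space, a post-processing step along the
following lines could re-insert it. This is only a minimal sketch under stated
assumptions: the class BulletSpacingFix and the method separateBullets are
hypothetical names (not part of Tika), and it assumes the arrow always comes
out as ð (U+00F0) at the start of a line, as in Parsing_Result2.txt.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: post-processes already extracted text.
public class BulletSpacingFix {

    // Assumption: the Wingdings arrow bullet shows up as U+00F0 at the start
    // of a line. The lookahead only matches when the next character is not
    // whitespace, so bullets that already have a space are left untouched.
    private static final Pattern GLUED_BULLET =
            Pattern.compile("(?m)^\u00F0(?=\\S)");

    // Inserts a single space between the bullet and the word glued to it.
    public static String separateBullets(String text) {
        return GLUED_BULLET.matcher(text).replaceAll("\u00F0 ");
    }

    public static void main(String[] args) {
        String extracted = "\u00F0Localisation des zones dangereuses";
        System.out.println(separateBullets(extracted));
    }
}
```

This only hides the symptom in the extracted text; legitimate words that start
with ð at the beginning of a line (for example in Icelandic text) would also be
split, so it is not a general fix.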
Regards
> Windings font recognition in Tika parsing + spacing issue
> ---------------------------------------------------------
>
> Key: TIKA-331
> URL: https://issues.apache.org/jira/browse/TIKA-331
> Project: Tika
> Issue Type: Wish
> Components: parser
> Affects Versions: 0.4
> Environment: Windows XP / Java JDK 1.6.0_15
> Reporter: MRIT64
> Attachments: Parsing_Result1.txt, Parsing_Result2.txt, test1.pdf,
> test2.pdf
>
>
> I have PDF files that include some characters in the Wingdings font.
> The Tika parser replaces them with Unicode characters that have nothing to
> do with the originals and, in some cases, with alphabetic characters (which
> is expected given these characters' codes).
> Would it be possible to improve the parsing and replace these characters
> with more accurate Unicode characters?
> (See http://www.alanwood.net/demos/wingdings.html for possible
> correspondences.)
> I will attach example files once this issue is created (would it be
> possible to attach files directly when creating issues?)