Hello mpsuzuki,

On 08.05.2018 at 18:33, suzuki toshiya wrote:
> Dear Adam,
>
> Now I don't have sufficient time to draft something, but
>
>> Generally speaking, I would say that some well-defined format like JSON
>> or YAML would be preferable to the ad-hoc encoding?
>
> I agree. I had no special reason to use an ad-hoc format.
>
> I prefer JSON if I have to choose between JSON and YAML, but, considering
> that pdftotext has XML output, XML output might be expected too?

Yes, JSON would be my personal preference as well, but consistency
trumps preferences IMHO, so XML seems more sensible.

> Regards,
> mpsuzuki

Best regards,
Adam

> Adam Reichold wrote:
>> Hello mpsuzuki,
>>
>> attached is a version of your patch with some inline comments.
>>
>> Generally speaking, I would say that some well-defined format like JSON
>> or YAML would be preferable to the ad-hoc encoding?
>>
>> Best regards,
>> Adam
>>
>> On 03.05.2018 at 13:50, suzuki toshiya wrote:
>>> Current poppler-dump (a testing tool of cpp-frontend) has no option to
>>> demonstrate the per-character bbox feature.
>>> The attached patch adds an option to demonstrate it (I'm not saying "this is
>>> ready to use, please use", I want to understand your request and whether
>>> existing features could cover some part of your requests).
>>>
>>> The patched poppler-dump can work like this:
>>>
>>> $ cpp/tests/poppler-dump --show-glyph-list test.pdf
>>> Page 1/1:
>>> ---
>>> [Please] @ ( x=72 y=72.624 w=61.32 h=21.6 )
>>> [0] @ ( x=72 y=72.624 w=13.344 h=21.6 )
>>> [1] @ ( x=85.344 y=72.624 w=6.672 h=21.6 )
>>> [2] @ ( x=92.016 y=72.624 w=10.656 h=21.6 )
>>> [3] @ ( x=102.672 y=72.624 w=10.656 h=21.6 )
>>> [4] @ ( x=113.328 y=72.624 w=9.336 h=21.6 )
>>> [5] @ ( x=122.664 y=72.624 w=10.656 h=21.6 )
>>> [wait...] @ ( x=139.32 y=72.624 w=59.328 h=21.6 )
>>> [0] @ ( x=139.32 y=72.624 w=17.328 h=21.6 )
>>> [1] @ ( x=156.648 y=72.624 w=10.656 h=21.6 )
>>> [2] @ ( x=167.304 y=72.624 w=6.672 h=21.6 )
>>> [3] @ ( x=173.976 y=72.624 w=6.672 h=21.6 )
>>> [4] @ ( x=180.648 y=72.624 w=6 h=21.6 )
>>> [5] @ ( x=186.648 y=72.624 w=6 h=21.6 )
>>> [6] @ ( x=192.648 y=72.624 w=6 h=21.6 )
>>> [If] @ ( x=72 y=112.428 w=7.992 h=10.8 )
>>> [0] @ ( x=72 y=112.428 w=3.996 h=10.8 )
>>> [1] @ ( x=75.996 y=112.428 w=3.996 h=10.8 )
>>> [this] @ ( x=82.992 y=112.428 w=17.34 h=10.8 )
>>> [0] @ ( x=82.992 y=112.428 w=3.336 h=10.8 )
>>> [1] @ ( x=86.328 y=112.428 w=6 h=10.8 )
>>> [2] @ ( x=92.328 y=112.428 w=3.336 h=10.8 )
>>> [3] @ ( x=95.664 y=112.428 w=4.668 h=10.8 )
>>> ...
>>>
>>> Regards,
>>> mpsuzuki
>>>
>>> suzuki toshiya wrote:
>>>> Dear obsidian,
>>>>
>>>> Too many posts about similar issues :-)
>>>> I'm not sure whether the poppler maintainers are interested in enhancing
>>>> pdftotext,
>>>> but recently Jeroen and I were working with cpp-frontend to have similar
>>>> features.
>>>>
>>>> In the latest version of poppler,
>>>> cpp-frontend has a feature to retrieve the list of words with bounding boxes,
>>>> and it can retrieve the bounding box for each glyph in a word.
>>>>
>>>> --
>>>>
>>>> Also, I proposed a patch to retrieve the font family and point size:
>>>> https://lists.freedesktop.org/archives/poppler/2018-April/013035.html
>>>>
>>>> It might be waiting for the maintainers' review. The discussion and result
>>>> can be found here:
>>>> https://github.com/ropensci/pdftools/issues/29
>>>>
>>>> --
>>>>
>>>>> - style, i.e. none, bold, italic
>>>> if the document producer has a bold font and used it in the document, like
>>>> Helvetica-Bold,
>>>> it would be found by the family name.
>>>> but if the document producer has no bold font and lets the word processor
>>>> software synthesize the emboldened font,
>>>> it would be difficult for the PDF renderer to recognize it as an emboldened
>>>> font,
>>>> because the emboldening is done by showing the same glyph with a subtle shift.
>>>> Simple PDF renderers would be unable to distinguish "normal font but
>>>> layered" and "emboldened font".
>>>>
>>>> Regards,
>>>> mpsuzuki
>>>>
>>>> obsidian . wrote:
>>>>> I'm using "pdftotext -bbox file.pdf" to convert a pdf file into html.
>>>>>
>>>>> Here's a sample line from the output:
>>>>> <word xMin="359.852025" yMin="462.548936" xMax="365.689478"
>>>>> yMax="467.681498">foo</word>
>>>>>
>>>>> Is there a way to get font information for every word like:
>>>>> - font family, e.g. Verdana
>>>>> - style, i.e. none, bold, italic
>>>>> - size, e.g. font size 9
>>>>>
>>>>> I'm using pdftotext version 0.55.0 on Windows.
>>>>>
>>>> _______________________________________________
>>>> poppler mailing list
>>>> poppler@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
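
To make the "XML instead of the ad-hoc encoding" idea from this thread concrete, here is a minimal sketch that post-processes the quoted --show-glyph-list listing into XML. It is not what the patch does and not an existing poppler-dump feature. The xMin/yMin/xMax/yMax attribute names mirror the pdftotext -bbox sample quoted above; the page and glyph elements and the text and index attributes are hypothetical names chosen for illustration. The listing as quoted carries no indentation, so glyph lines are recognised by a heuristic (a label equal to the next expected glyph index), which is also an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Matches one listing line such as "[Please] @ ( x=72 y=72.624 w=61.32 h=21.6 )".
LINE_RE = re.compile(
    r"\[(?P<label>.*)\]\s*@\s*\(\s*x=(?P<x>\S+)\s+y=(?P<y>\S+)"
    r"\s+w=(?P<w>\S+)\s+h=(?P<h>\S+)\s*\)"
)


def bbox_attrs(match):
    """Convert x/y/w/h into xMin/yMin/xMax/yMax strings with 3 decimals."""
    x, y, w, h = (float(match.group(k)) for k in "xywh")
    values = {"xMin": x, "yMin": y, "xMax": x + w, "yMax": y + h}
    return {name: f"{value:.3f}" for name, value in values.items()}


def listing_to_xml(text):
    """Build a <page> element (hypothetical schema) from the ad-hoc listing."""
    page = ET.Element("page")
    word = None
    next_index = 0
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # skips "Page 1/1:", "---", "..." and blank lines
        label = match.group("label")
        # Heuristic: a label equal to the next expected index is a glyph of
        # the current word. A word whose text is exactly that number would be
        # misread as a glyph; the unindented listing cannot disambiguate this.
        if word is not None and label == str(next_index):
            ET.SubElement(word, "glyph", index=label, **bbox_attrs(match))
            next_index += 1
        else:
            word = ET.SubElement(page, "word", text=label, **bbox_attrs(match))
            next_index = 0
    return page


SAMPLE = """\
Page 1/1:
---
[If] @ ( x=72 y=112.428 w=7.992 h=10.8 )
[0] @ ( x=72 y=112.428 w=3.996 h=10.8 )
[1] @ ( x=75.996 y=112.428 w=3.996 h=10.8 )
"""

page = listing_to_xml(SAMPLE)
print(ET.tostring(page, encoding="unicode"))
```

Emitting structured output directly from the tool would avoid the index heuristic entirely, which is one practical argument for a well-defined format over parsing the listing after the fact.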