On Saturday, 31 July 2010, mpsuz...@hiroshima-u.ac.jp wrote:
> Hi,
>
> Sorry for the long silence. Checking the source,
> I found the following points.
>
> 1) poppler-qt4 page object issue
>
> In the Page::getText() method, poppler's TextOutputDev
> object is created, and its getText() method is invoked.
> When creating the TextOutputDev, we can tune its
> configuration to enable/disable physical layout,
> enable/disable raw order mode, etc. I think that when
> vertical text is re-laid-out for the horizontal text
> renderer, the resulting order is logically broken
> for MS Office's tricky vertical text.
>
> If I test the TextOutputDev::displayPageSlice() method,
> especially with the rawOrder option, the text is not
> re-laid-out. For MS Office's tricky vertical text,
> this is slightly better. However, the displayPageSlice()
> method is designed for a FILE stream. If we could pass
> a memory buffer to be filled by displayPageSlice(),
> it would be useful, but such a change requires many modifications,
> because displayPageSlice() is a pan-device method.
>
> # changing TextOutputDev.cc is insufficient, I
> # have to change SplashOutputDev.cc, PSOutputDev.cc,
> # CairoOutputDev.cc, ArthurOutputDev.cc, ABWOutputDev.cc...
> # I cannot test all of them.
>
> On the other hand, getText() is a device-specific method,
> found only in TextOutputDev.cc, so changing getText() is
> easier.
>
> 2) TextOutputDev::getText() issue
>
> Because most PDF generators do not draw spaces with font
> glyphs but simply move the current point, the task of TextOutputDev
> is not only to collect the objects drawn by fonts. It also tracks
> the movement of the current point to insert a space character
> (U+0020) at the appropriate position. Thus, TextOutputDev is
> a layout-aware device, like the other output devices.
>
> TextOutputDev has optional switches for "force physical
> layout" and "force raw order" of the internal text processing.
> The results of "pdftotext -layout msword2007-vert.pdf -"
> and "pdftotext -raw msword2007-vert.pdf -" show the existence
> of layout-aware routines in TextOutputDev very clearly.
>
> I think raw-ordered text from MS Office's tricky vertical
> text is usable for text search, but physically
> laid-out text is not.
>
> 2-a) re-layout in vertical writing mode is required?
>
> We can find several interesting "TODO" comments in
> TextOutputDev.cc:
>
> 2342 void TextPage::coalesce(GBool physLayout, GBool doHTML) {
> ...
> 2535   //----- assemble the blocks
> 2536
> 2537   //~ add an outer loop for writing mode (vertical text)
> 2538
> 2539   // build blocks for each rotation value
> 2540   for (rot = 0; rot < 4; ++rot) {
> ...
> 2830   //~ need to compute the primary writing mode (horiz/vert) in
> 2831   //~ addition to primary rotation
> ...
> 3316   // build the flows
> 3317   //~ this needs to be adjusted for writing mode (vertical text)
> 3318   //~ this also needs to account for right-to-left column ordering
> 3319   flow = NULL;
> 3320   while (flows) {
> 3321     flow = flows;
> 3322     flows = flows->next;
> 3323     delete flow;
> 3324   }
> 3325   flows = lastFlow = NULL;
> 3326   // assume blocks are already in reading order,
> 3327   // and construct flows accordingly.
>
> ...
>
> 3589 GooString *TextPage::getText(double xMin, double yMin,
> 3590                              double xMax, double yMax) {
> ...
> 3632   //~ writing mode (horiz/vert)
> 3633
> 3634   // collect the line fragments that are in the rectangle
>
> ...
>
> 4651 void TextPage::dump(void *outputStream, TextOutputFunc outputFunc,
> 4652                     GBool physLayout) {
>
> ...
>
> 4689   //~ writing mode (horiz/vert)
> 4690
> 4691   // output the page in raw (content stream) order
> 4692   if (rawOrder) {
> ...
>
> From the comments, the authors of TextOutputDev.cc seem to
> be aware that the current layout analysis is specific to
> horizontal text.
> I think it's homework for CJK people,
> but at the moment I don't have sufficient time to work on this issue fully.
>
> # we can also find a few comments about right-to-left scripts.
>
> But if we restrict our scope to text search in PDF,
> I think raw-ordered extraction can work for most cases.
>
> 2-b) getText() for rawOrder TextOutputDev?
>
> As I wrote above, the default or rawOrder mode
> of pdftotext is useful for MS Office's tricky vertical text.
> The rawOrder mode can be specified when the TextOutputDev object
> is created. But... when I create a TextOutputDev object in
> poppler-qt4 to extract raw-ordered text, TextOutputDev::getText()
> returns empty text. Oops. This is the designed behaviour of
> TextOutputDev::getText(). You can find the following lines in
> TextOutputDev.cc.
>
> 3589 GooString *TextPage::getText(double xMin, double yMin,
> 3590                              double xMax, double yMax) {
>
> ...
>
> 3605
> 3606   s = new GooString();
> 3607
> 3608   if (rawOrder) {
> 3609     return s;
> 3610   }
>
> I'm not yet sure why the rawOrder case is discarded.
> As an experiment, I wrote rawOrder text extraction code like this:
>
> diff --git a/poppler/TextOutputDev.cc b/poppler/TextOutputDev.cc
> index f244639..1803629 100644
> --- a/poppler/TextOutputDev.cc
> +++ b/poppler/TextOutputDev.cc
> @@ -3702,10 +3702,6 @@ GooString *TextPage::getText(double xMin, double yMin,
>
>    s = new GooString();
>
> -  if (rawOrder) {
> -    return s;
> -  }
> -
>    // get the output encoding
>    if (!(uMap = globalParams->getTextEncoding())) {
>      return s;
> @@ -3726,6 +3722,23 @@ GooString *TextPage::getText(double xMin, double yMin,
>      break;
>    }
>
> +  if (rawOrder) {
> +    TextWord* word;
> +    for (word = rawWords; word && word <= rawLastWord; word = word->next) {
> +      for (j = 0; j < word->getLength(); ++j) {
> +        double gXMin, gXMax, gYMin, gYMax;
> +        word->getCharBBox(j, &gXMin, &gYMin, &gXMax, &gYMax);
> +        if (xMin <= gXMin && gXMax <= xMax && yMin <= gYMin && gYMax <= yMax)
> +        {
> +          char mbc[16]; /* XXX: uMap should know the limit !*/
> +          int mbc_len = uMap->mapUnicode( *(word->getChar(j)), mbc, sizeof(mbc) );
> +          s->append(mbc, mbc_len);
> +        }
> +      }
> +    }
> +    return s;
> +  }
> +
>    //~ writing mode (horiz/vert)
>
>    // collect the line fragments that are in the rectangle
>
> Now TextOutputDev::getText() can extract the text from
> a TextOutputDev object in rawOrder mode.
>
> 2-c) Line-joining issue in TextOutputDev::getText()
>
> The raw text in a rawOrder TextOutputDev object has no spaces
> between words. Here, "word" means a group of glyphs drawn by
> fonts without external current point shifting. My experimental
> patch above inserts spaces between words. The insertion
> of spaces between words makes English text better, but has
> bad effects on MS Office's tricky vertical text. In MS Office's
> tricky vertical text, each glyph is drawn after a vertical shift
> of the current point, so every word consists of 1 glyph.
>
> At present, I have 2 ideas to prevent such bad insertion of
> spaces in tricky vertical text.
>
> idea i:
> Track the current point and the distance between glyphs,
> and determine whether 2 glyphs belong to the same vertical or
> horizontal line.
>
> idea ii:
> Refer to the line breaking algorithm in Unicode and determine
> whether a space should be inserted between the glyphs.
> - If the codepoints are Latin, the space is inserted.
> - If the codepoints are CJK Ideographs, the space is NOT inserted.
> - ...
>
> I think idea ii is simple and a good starting point for an experiment,
> although it can be acceptable for poppler.
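
For reference, idea ii above could be prototyped roughly as below. This is only a sketch, not poppler code: isCjkLike(), needsSpace() and joinRawWords() are hypothetical names, the code point ranges are a rough subset rather than the full Unicode line breaking tables, and treating kana like ideographs and suppressing the space in the mixed Latin/CJK case are assumptions the mail does not settle.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true for code points that normally take no
// inter-word space. A rough subset of ranges, not a UAX #14 table.
constexpr bool isCjkLike(char32_t c) {
  return (c >= 0x3040 && c <= 0x30FF)     // Hiragana, Katakana (assumption)
      || (c >= 0x3400 && c <= 0x4DBF)     // CJK Unified Ideographs Ext. A
      || (c >= 0x4E00 && c <= 0x9FFF)     // CJK Unified Ideographs
      || (c >= 0xF900 && c <= 0xFAFF)     // CJK Compatibility Ideographs
      || (c >= 0x20000 && c <= 0x2A6DF);  // CJK Unified Ideographs Ext. B
}

// Decide whether U+0020 goes between two adjacent raw-order words.
// Assumption: a space is inserted only when neither side is CJK-like.
constexpr bool needsSpace(char32_t lastOfPrev, char32_t firstOfNext) {
  return !isCjkLike(lastOfPrev) && !isCjkLike(firstOfNext);
}

// Join words in raw (content stream) order, inserting a space only
// where needsSpace() says so. Empty words are skipped.
std::u32string joinRawWords(const std::vector<std::u32string> &words) {
  std::u32string out;
  for (const std::u32string &w : words) {
    if (w.empty()) {
      continue;
    }
    if (!out.empty() && needsSpace(out.back(), w.front())) {
      out.push_back(U' ');
    }
    out += w;
  }
  return out;
}
```

With one-glyph "words" such as those in the MS Office vertical text, joinRawWords() would concatenate the ideographs without spaces, while Latin words would still be separated.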
WoW, that's a huge mail :D

So my understanding is that "proper" CJK searching is a lot of work and you
advocate for just exposing the raw text to the upper layers (users of
poppler-qt4) so they can do the work if they need it?

Albert

>
> Regards,
> mpsuzuki
>
> P.S.
> I've attached a patch "20100801a.diff" to extend
> 1) TextOutputDev::getText() to support rawOrder mode.
> 2) Qt4 Page::text() to take an extra boolean flag for rawOrder.
> 3) a test program for poppler-qt's text extraction.
>
> On Wed, 28 Jul 2010 16:32:20 +0900
>
> mpsuz...@hiroshima-u.ac.jp wrote:
> >Hi,
> >
> >On Wed, 28 Jul 2010 15:04:53 +0800 (CST)
> >
> >"cobra.yu" <cobra...@hyweb.com.tw> wrote:
> >> Of course, such fake vertical writing mode is unacceptable.
> >
> >Thanks.
> >
> >>So, it shows that we can't only count on the wMode of the font
> >>information, but must also take the real arrangement of text words on
> >>pages into consideration?
> >
> >Yes, WMode is insufficient. As Deri analyzed, the MS Office addin
> >draws vertical text by repeating "draw a glyph, move current
> >point vertically, draw a glyph...". So, it might be possible
> >to detect the text flow direction by tracking the movement of the
> >current point. But if our interest is only text search, the
> >tracking of the current point won't be essential, I think. Maybe
> >collecting all glyphs in drawing order is sufficient for text
> >search. I will check the poppler-qt4 binding in more detail.
> >
> >Regards,
> >mpsuzuki
> >
> >>-----Original message-----
> >>From:suzuki toshiya <mpsuz...@hiroshima-u.ac.jp>
> >>To:cobra...@hyweb.com.tw
> >>Cc:poppler <poppler@lists.freedesktop.org>
> >>Date:Wed, 28 Jul 2010 15:18:58 +0900
> >>Subject:Re: [poppler] Vertical or horizontal writing?
> >>
> >>
> >>Hi,
> >>
> >>Please find attached fake vertical text produced by MS Excel
> >>2007. Is it acceptable for you to exclude such fake vertical
> >>text from your target?
> >>
> >>If you try to select the text on Adobe Reader, you can find
> >>that the order of glyph drawing is horizontal, it is stupid
> >>fake from the viewpoint of page rendering language.
> >>
> >>Regards,
> >>mpsuzuki
> >>
> >>cobra.yu wrote:
> >>> Hi,
> >>>
> >>> The original requirement to detect the direction of text flow is
> >>> for "searching". The present "search" function of Poppler::Page
> >>> is searching horizontally only. So, for CJK users, I must add one
> >>> vertical search function for the vertical writing mode. I could
> >>> sort out all the textboxes in every page by (x,y) of the bounding
> >>> box to make a vertical-like textbox list, but I encountered a
> >>> fundamental problem: If I can't know the exact direction of text
> >>> flow first, how could I know when to use vertical or horizontal
> >>> search? BTW, I've accomplished the vertical text selection by the
> >>> same way as my vertical search right now, but it's rather simpler
> >>> than searching indeed.
> >>>
> >>> Cobra
> >>>
> >>> -----Original message-----
> >>> From:mpsuz...@hiroshima-u.ac.jp
> >>> To:Deri James <d...@chuzzlewit.demon.co.uk>
> >>> Cc:poppler@lists.freedesktop.org,cobra...@hyweb.com.tw
> >>> Date:Wed, 28 Jul 2010 01:59:40 +0900
> >>> Subject:Re: [poppler] Vertical or horizontal writing?
> >>>
> >>> Dear Deri,
> >>>
> >>> On Tue, 27 Jul 2010 17:22:14 +0100
> >>>
> >>> Deri James <d...@chuzzlewit.demon.co.uk> wrote:
> >>>> When looking at the two PDFs you are using with acroread using the
> >>>> text selection tool:-
> >>>>
> >>>> P1 of 'vert-horiz-ipa-std.pdf' selection caret is drawn horizontally.
> >>>> 'msword2010-vert2.pdf' selection caret is drawn vertically.
> >>>>
> >>>> So, it seems acroread can't detect the vertical text in this file,
> >>>> i.e. it is actually horizontal text placed one glyph at a time (apart
> >>>> from 'MS Word 2010' which is horizontal text rotated 90 degrees).
> >>>>
> >>>> The contents of the stream confirm this:-
> >>>>
> >>>> stream
> >>>> /P <</MCID 0/Lang (en-US)>> BDC BT
> >>>> /F1 10.56 Tf
> >>>> 0.000000001 -1 1 0.000000001 496.54 756.84 Tm
> >>>> 0 g
> >>>> 0 G
> >>>> [(MS)6( )5(W)61(ord)-4( )5(20)10(10)] TJ
> >>>> ET
> >>>> EMC /P <</MCID 1>> BDC BT
> >>>> /F2 10.56 Tf
> >>>> 1 0.000000017 -0.000000017 1 495.29 673.7 Tm
> >>>> <085B>Tj
> >>>> ET
> >>>> EMC /P <</MCID 2>> BDC BT
> >>>> 1 0.000000017 -0.000000017 1 495.29 663.14 Tm
> >>>> <29AA>Tj
> >>>>
> >>>>
> >>>>
> >>>> ...
> >>>>
> >>>> So this PDF does not have any true vertical text.
> >>>
> >>> Yes, yes, I've just reached exactly the same conclusion.
> >>> Thank you for checking the content of the PDF.
> >>>
> >>> The PDF generated by the MS Office addin uses the font object
> >>> for horizontal writing mode, in PDF design, at least. So
> >>> text flow detection at the PDF font level does not work
> >>> with such a PDF. Higher-level recognition is needed.
> >>>
> >>> It brings up a philosophical question: what is vertical text?
> >>> Some people make a vertical series of CJK glyphs by using a
> >>> very, very narrow text box; is this wrong vertical text?
> >>> If it is not vertical text, why should we distinguish it?
> >>> The invalid shape of the punctuation & arrows? Or...
> >>>
> >>> I have to ask Cobra about the original requirement:
> >>> why the text direction should be detected. Cobra, could
> >>> you describe why you needed to detect the direction of
> >>> text flow?
> >>>
> >>> Regards,
> >>> mpsuzuki
> >
> >_______________________________________________
> >poppler mailing list
> >poppler@lists.freedesktop.org
> >http://lists.freedesktop.org/mailman/listinfo/poppler
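
As a footnote to the stream quoted above: the two single-glyph Tm operators place their origins at (495.29, 673.7) and (495.29, 663.14), so x is unchanged and y drops by 10.56, exactly the font size. The current-point tracking discussed in the thread could be sketched roughly as follows. Everything here is hypothetical (Flow, Point, classifyStep(), dominantFlow() and the 10% tolerance are illustrative names and values, not poppler API), and the sketch assumes glyph origins are taken from the translation components of Tm in unrotated user space.

```cpp
#include <cstddef>
#include <vector>

enum class Flow { Horizontal, Vertical, Unknown };

struct Point {
  double x;
  double y;
};

constexpr double absValue(double v) { return v < 0 ? -v : v; }

// Classify the step between two consecutive glyph origins.
// Assumption: a step counts as "no movement" on an axis when it is
// within 10% of the font size.
constexpr Flow classifyStep(Point a, Point b, double fontSize) {
  const double tol = 0.1 * fontSize;
  const double dx = absValue(b.x - a.x);
  const double dy = absValue(b.y - a.y);
  if (dx <= tol && dy > tol) {
    return Flow::Vertical;    // same column, moved up or down
  }
  if (dy <= tol && dx > tol) {
    return Flow::Horizontal;  // same baseline, moved left or right
  }
  return Flow::Unknown;       // diagonal jump or no movement at all
}

// Majority vote over all consecutive steps in a run of glyph origins.
Flow dominantFlow(const std::vector<Point> &origins, double fontSize) {
  int vertical = 0;
  int horizontal = 0;
  for (std::size_t i = 1; i < origins.size(); ++i) {
    const Flow f = classifyStep(origins[i - 1], origins[i], fontSize);
    if (f == Flow::Vertical) {
      ++vertical;
    } else if (f == Flow::Horizontal) {
      ++horizontal;
    }
  }
  if (vertical > horizontal) {
    return Flow::Vertical;
  }
  if (horizontal > vertical) {
    return Flow::Horizontal;
  }
  return Flow::Unknown;
}
```

Applied to the two origins from the stream with font size 10.56, classifyStep() reports a vertical step, which matches Deri's reading that the glyphs are stacked one at a time with a horizontal-mode font.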