[tesseract-ocr] Tesseract hOCR to produce xml, not (x)html, or a tool to simulate this, or a tool to collect the div (etc) attributes

Kim Rönnberg Sat, 26 Mar 2016 03:56:38 -0700

Is there a way to make Tesseract produce "real" XML instead of the (X)HTML
that its hOCR output uses, i.e. to create XML tags like <ocr_page id='page_1'
title='...'> instead of "<div class='ocr_page' id='page_1'...", <ocr_area
id='...' title='...'> instead of "<div class='ocr_carea' id='block_1_1'..."
etc.?

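One way to simulate this after the fact is to post-process the hOCR file. The
following is only a minimal sketch, in Python rather than PHP, and not a
Tesseract feature: it assumes the hOCR file parses as well-formed XML, and it
uses each element's class value unchanged as the new tag name (so the result
contains <ocr_carea>, not <ocr_area>). The function name hocr_to_xml and the
sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def hocr_to_xml(hocr_text):
    """Return hOCR markup with class-based elements renamed to XML tags."""
    root = ET.fromstring(hocr_text)
    for element in root.iter():
        # Drop the XHTML namespace so tags serialize without a prefix.
        if element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
        classes = element.get("class", "").split()
        # hOCR element classes start with "ocr" (ocr_page, ocrx_word, ...).
        if classes and classes[0].startswith("ocr"):
            element.tag = classes[0]
            del element.attrib["class"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily shortened hOCR input.
SAMPLE_HOCR = (
    "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
    "<div class='ocr_page' id='page_1' title='bbox 0 0 100 100'>"
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 1 2 3 4; x_wconf 90'>Hi</span>"
    "</div></body></html>"
)

print(hocr_to_xml(SAMPLE_HOCR))
```

The id and title attributes are carried over untouched, so the bounding boxes
still have to be read out of the title string afterwards.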

Or is there a ready-made tool that converts the (X)HTML that hOCR produces
into a format that is easier to parse as XML? Even better would be something
that takes the hOCR output as it is now and gives me the divs, spans and ps
grouped per word, line, area and page, ready to load into a (PHP) array for
insertion into a database.

For example, for each word: "file_name", "page_nr", "area_id", "line_nr",
"word_nr", "word bbox x1 y1 x2 y2", "the word value". I realise this means a
lot of rows (one per word in a document), but this is what I need.

I have spent some days on this, trying to find something that works in PHP,
but have not managed to find anything.

Regards

Kim Rönnberg

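The per-word rows described in the message can be collected in a single pass
over the hOCR markup. The sketch below is in Python rather than PHP and is
only an illustration under stated assumptions: it expects the class names
ocr_page, ocr_carea, ocr_line and ocrx_word, and a title attribute on each
word of the form "bbox x1 y1 x2 y2; ...". The names HocrWordRows and
hocr_word_rows and the sample markup are hypothetical. Line numbers restart on
each page and word numbers on each line, which is a choice made here, not
something the message specifies.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordRows(HTMLParser):
    """Collect one dict per ocrx_word element."""

    def __init__(self, file_name):
        super().__init__()
        self.file_name = file_name
        self.rows = []
        self.page_nr = 0
        self.area_id = None
        self.line_nr = 0
        self.word_nr = 0
        self._in_word = False
        self._nested = 0  # open tags inside the current word, e.g. <strong>
        self._bbox = (None,) * 4
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self._in_word:
            self._nested += 1
        elif "ocr_page" in classes:
            self.page_nr += 1
            self.line_nr = 0
        elif "ocr_carea" in classes:
            self.area_id = attributes.get("id")
        elif "ocr_line" in classes:
            self.line_nr += 1
            self.word_nr = 0
        elif "ocrx_word" in classes:
            self.word_nr += 1
            match = BBOX.search(attributes.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else (None,) * 4
            self._in_word = True
            self._nested = 0
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        if not self._in_word:
            return
        if self._nested:
            self._nested -= 1
            return
        # This end tag closes the word element itself: emit the row.
        x1, y1, x2, y2 = self._bbox
        self.rows.append({
            "file_name": self.file_name, "page_nr": self.page_nr,
            "area_id": self.area_id, "line_nr": self.line_nr,
            "word_nr": self.word_nr, "x1": x1, "y1": y1, "x2": x2, "y2": y2,
            "word": "".join(self._text).strip(),
        })
        self._in_word = False


def hocr_word_rows(hocr_text, file_name):
    parser = HocrWordRows(file_name)
    parser.feed(hocr_text)
    parser.close()
    return parser.rows


# Hypothetical, shortened hOCR fragment.
SAMPLE_PAGE = """<div class='ocr_page' id='page_1' title='bbox 0 0 200 100'>
<div class='ocr_carea' id='block_1_1' title='bbox 10 10 190 40'>
<p class='ocr_par' id='par_1_1'>
<span class='ocr_line' id='line_1_1' title='bbox 10 10 190 40'>
<span class='ocrx_word' id='word_1_1' title='bbox 10 10 60 40; x_wconf 93'>Hello</span>
<span class='ocrx_word' id='word_1_2' title='bbox 70 10 130 40; x_wconf 91'><strong>world</strong></span>
</span></p></div></div>"""

for row in hocr_word_rows(SAMPLE_PAGE, "scan.hocr"):
    print(row)
```

Each dict corresponds to one database row, and the same structure maps
directly onto a PHP associative array if the logic is ported.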