Re: Is there a way to extract text on a page basis from odt ?

Ram Kane Wed, 28 Sep 2011 09:22:56 -0700

I'm using Symphony 3 and LibreOffice 3.3.2. They both display the
document with the same overall structure. That is, page X has the same
footer, header, footnotes, comments and main text in both
applications.


As you mention, I think my only chance for now is to try to understand
the underlying logic these applications use to render the document as
a series of pages.
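
One thing I might try first is sketched below. ODF 1.1 and later define a
text:soft-page-break element that a consumer may write into content.xml as
a hint about where it broke pages. Whether Symphony or LibreOffice actually
writes it for a given document is an assumption I have not verified, and the
function name is my own. The sketch splits the body text at those markers
and returns one chunk per span. Footnote text is left inline where the note
is anchored.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"
# Paragraphs and headings; a space is added after each so words do not fuse.
BLOCKS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def split_on_soft_page_breaks(content_xml):
    """Return the body text as a list of chunks, one per soft-page-break span.

    Assumption: the producing application saved soft page breaks. If it did
    not, the whole body comes back as a single chunk.
    """
    root = ET.fromstring(content_xml)
    body = root.find(f"{{{OFFICE_NS}}}body/{{{OFFICE_NS}}}text")
    if body is None:
        return []
    pages = [[]]

    def walk(elem):
        # A soft page break starts a new chunk; it may sit between
        # paragraphs or in the middle of one.
        if elem.tag == SOFT_BREAK:
            pages.append([])
        if elem.text:
            pages[-1].append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                pages[-1].append(child.tail)
        if elem.tag in BLOCKS:
            pages[-1].append(" ")

    walk(body)
    # Collapse runs of whitespace inside each chunk.
    return [" ".join("".join(chunk).split()) for chunk in pages]
```

Even when the markers are present, they only record how one consumer laid
out the document at save time, so another application may paginate
differently.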



On Tue, Sep 27, 2011 at 4:03 PM, Dennis E. Hamilton
<[email protected]> wrote:
>
> I think the answer is you can't get there from here today, and it will be an 
> unpredictable time before the answer would change.
>
>  - Dennis
>
> JUST FOR FUN, More questions:
>
> Where are you seeing what the pages are?
>
> That is, what are you looking at where you see what is page X, what is on 
> page X, and what are those things that apply to it (headers, footers, notes, 
> frames, tables, etc.).  What do you have to say to go to page X directly and 
> have it in view?
>
> It is important to note that the OpenDocument Format is not page oriented 
> (in contrast with final-form formats like PDF, which are).  I think you 
> understand that from the APIs.
>
> It is some ODF Consumer that puts together the presentation you are looking 
> at.  There is no normative answer to those questions looking at the ODF 
> format alone.  It is pretty much all determined by an ODF Consumer.  What 
> Consumer are you using that you see the pages that you are interested in?
>
> For the time being, it appears that you need to rely on the programmability 
> of that consumer, if any, to be able to derive page-relative actions, because 
> you are interested in features of the rendered document, not the recorded 
> format.
>
> Unless there is a simpler way of addressing a concrete case that could work 
> well enough in the short term.  (Mining PDFs might be better, but there might 
> not be enough structure left.  There are doubtless tools for working on PDFs 
> that might address your problem.)
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Ram 
> Kane
> Sent: Monday, September 26, 2011 06:56
> To: [email protected]
> Subject: Re: Is there a way to extract text on a page basis from odt ?
>
> Thanks all for the replies.
>
>
> > It seems best to revisit the problem statement and extract a
> > grounded case: What is the problem that needs to be solved;
> > what are the constraints on an acceptable solution.
> >
> > Ram, can you please say more about the problem you want to solve?
> > What would be the simplest-acceptable result?
>
>
> I need to extract content for a given page inside a doc. By content I
> mean header, footer, footnotes, comments, main text from body.
> I need to have the option of extracting each of these elements of the
> page separately (extracting header for page X, footer for page X, body
> text for page X) and not just getting all the content as a single
> string.
>
> I've uploaded a doc that I found on your svn to use as an example here
> -> http://goo.gl/OMIEw
>
> Using the example doc and assuming that I need to extract content for
> page 1, I'd need to extract:
>
>    _ header ("ODFDOM in a header")
>    _ footer ("ODFDOM in a footer")
>    _ footnotes for page ("ODFDOM in a footnote")
>    _ main text and all additional content in the page body (" ODFDOM
> in a title ODFDOM in a section header ODFDOM in paragraph1 ...")
>
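
The separate pieces listed above can at least be pulled out document-wide,
without any notion of pages. Below is a minimal sketch using only the Python
standard library. It assumes the usual ODT package layout, with body text,
notes and comments in content.xml and headers and footers under
style:master-page in styles.xml. The function names and the returned
dictionary keys are my own. Headers and footers belong to master pages, not
to individual pages, so mapping them to "page X" still depends on the
consumer's layout.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}
NOTE = "{%s}note" % NS["text"]
NOTE_CLASS = "{%s}note-class" % NS["text"]
ANNOTATION = "{%s}annotation" % NS["office"]
BLOCKS = {"{%s}p" % NS["text"], "{%s}h" % NS["text"]}


def flat_text(elem, skip=()):
    """Concatenate the text under elem, ignoring subtrees whose tag is in skip."""
    out = []

    def walk(e):
        if e.tag in skip:
            return
        if e.text:
            out.append(e.text)
        for child in e:
            walk(child)
            if child.tail:
                out.append(child.tail)
        if e.tag in BLOCKS:
            out.append(" ")  # keep paragraphs from fusing together

    walk(elem)
    return " ".join("".join(out).split())


def extract_parts(odt):
    """Return headers, footers, footnotes, comments and body text separately.

    odt is a path or a binary file-like object holding an ODT package.
    """
    with zipfile.ZipFile(odt) as package:
        content = ET.fromstring(package.read("content.xml"))
        styles = ET.fromstring(package.read("styles.xml"))

    footnotes = []
    for note in content.iter(NOTE):
        note_body = note.find("text:note-body", NS)
        if note.get(NOTE_CLASS) == "footnote" and note_body is not None:
            footnotes.append(flat_text(note_body))

    # Only the paragraphs of a comment, not its author or date metadata.
    comments = [
        " ".join(flat_text(p) for p in a.iterfind("text:p", NS))
        for a in content.iter(ANNOTATION)
    ]

    body = content.find("office:body/office:text", NS)
    return {
        "headers": [flat_text(h) for h in
                    styles.iterfind(".//style:master-page/style:header", NS)],
        "footers": [flat_text(f) for f in
                    styles.iterfind(".//style:master-page/style:footer", NS)],
        "footnotes": footnotes,
        "comments": comments,
        # Body text with note and comment content left out.
        "body": flat_text(body, skip=(NOTE, ANNOTATION)) if body is not None else "",
    }
```

Calling extract_parts("example.odt") (the file name is a placeholder) gives
each kind of content as its own entry, so what remains is the per-page
assignment.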
