hslf ppt-file-format.xml book.xml quick-guide.xml

nick Thu, 09 Jun 2005 11:19:48 -0700

nick        2005/06/09 06:12:59

  Modified:    src/documentation/content/xdocs/hslf book.xml
                        quick-guide.xml
  Added:       src/documentation/content/xdocs/hslf ppt-file-format.xml
  Log:
  A few small updates to the HSLF usage docs, and adding some initial documentation on the PowerPoint file format
  
  Revision  Changes    Path
  1.2       +1 -0      jakarta-poi/src/documentation/content/xdocs/hslf/book.xml
  
  Index: book.xml
  ===================================================================
  RCS file: /home/cvs/jakarta-poi/src/documentation/content/xdocs/hslf/book.xml,v
  retrieving revision 1.1
  retrieving revision 1.2
  diff -u -r1.1 -r1.2
  --- book.xml  28 May 2005 19:28:22 -0000      1.1
  +++ book.xml  9 Jun 2005 13:12:59 -0000       1.2
  @@ -13,6 +13,7 @@
       <menu label="HSLF">
           <menu-item label="Overview" href="index.html"/>
           <menu-item label="Quick Guide" href="quick-guide.html"/>
  +        <menu-item label="PPT File Format" href="ppt-file-format.html"/>
        </menu>
        
   </book>
  
  
  
  1.2       +36 -9     jakarta-poi/src/documentation/content/xdocs/hslf/quick-guide.xml
  
  Index: quick-guide.xml
  ===================================================================
  RCS file: /home/cvs/jakarta-poi/src/documentation/content/xdocs/hslf/quick-guide.xml,v
  retrieving revision 1.1
  retrieving revision 1.2
  diff -u -r1.1 -r1.2
  --- quick-guide.xml   28 May 2005 19:28:22 -0000      1.1
  +++ quick-guide.xml   9 Jun 2005 13:12:59 -0000       1.2
  @@ -15,8 +15,9 @@
           <section><title>Basic Text Extraction</title>
           <p>For basic text extraction, make use of 
   <code>org.apache.poi.extractor.PowerPointExtractor</code>. It accepts a file or an input
  -stream. The <code>getText()</code> method can be used to get the text from the slides,
  -from the notes, or from both.
  +stream. The <code>getText()</code> method can be used to get the text from the slides, and the <code>getNotes()</code> method can be used to get the text
  +from the notes. Finally, <code>getText(true,true)</code> will get the text
  +from both.
                </p>
                </section>
                
  @@ -31,19 +32,45 @@
                </p>
                </section>
                
  +        <section><title>Poor Quality Text Extraction</title>
  +        <p>If speed is the most important thing for you, you don't care
  +             about getting duplicate blocks of text, you don't care about 
  +             getting text from master sheets, and you don't care about getting
  +             old text, then 
  +             <code>org.apache.poi.extractor.QuickButCruddyTextExtractor</code>
  +             might be of use.</p>
  +             <p>QuickButCruddyTextExtractor doesn't use the normal record 
  +             parsing code, instead it uses a tree structure blind search 
  +             method to get all text holding records. You will get all the text,
  +             including lots of text you normally wouldn't ever want. However,
  +             you will get it back very very fast!</p>
  +             <p>There are two ways of getting the text back. 
  +             <code>getTextAsString()</code> will return a single string with all
  +             the text in it. <code>getTextAsVector()</code> will return a 
  +             vector of strings, one for each text record found in the file.
  +             </p>
  +             </section>
  +
                <section><title>Changing Text</title>
  -             <p>It is possible to change the text via <code>TextRun.setText(String)</code>. However, if
  -the length of the text is changed, things will break because PowerPoint has
  -internal file references in byte offsets, which are not yet all updated when
  -the size changes.
  +             <p>It is possible to change the text via 
  +             <code>TextRun.setText(String)</code>. However, if the length of 
  +             the text is changed, things will break because PowerPoint has
  +             internal file references in byte offsets. We currently update all
  +             of these byte references that we know about when writing out, but
  +             there are a few more still to be found.
                </p>
                </section>
                
                <section><title>Guide to key classes</title>
                <ul>
                <li><code>org.apache.poi.hslf.HSLFSlideShow</code>
  -  Handles reading in and writing out files. Generates a tree of the records
  -  in the file
  +             Handles reading in and writing out files. Calls 
  +             <code>org.apache.poi.hslf.record.record</code> to build a tree
  +             of all the records in the file, which it allows access to.
  +             </li>
  +             <li><code>org.apache.poi.hslf.record.record</code>
  +             Base class of all records. Also provides the main record generation
  +             code, which will build up a tree of records for a file.
                </li>
                <li><code>org.apache.poi.hslf.usermode.SlideShow</code>
     Builds up model entries from the records, and presents a user facing
  @@ -55,4 +82,4 @@
                </ul>
                </section>
        </body>
  -</document>
  \ No newline at end of file
  +</document>
  
  
  
  1.1                  jakarta-poi/src/documentation/content/xdocs/hslf/ppt-file-format.xml
  
  Index: ppt-file-format.xml
  ===================================================================
  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Copyright (C) 2004 The Apache Software Foundation. All rights reserved. -->
  <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd">
  
  <document>
      <header>
          <title>POI-HSLF - A Guide to the PowerPoint File Format</title>
          <subtitle>Overview</subtitle>
          <authors>
              <person name="Nick Burch" email="nick at torchbox dot com"/>
          </authors>
      </header>
  
      <body>
          <section><title>Records, Containers and Atoms</title>
                <p>
                PowerPoint documents are made up of a tree of records. A record may
                contain either other records (in which case it is a Container),
                or data (in which case it's an Atom). A record can't hold both.
                </p>
                <p>
                PowerPoint documents don't have one overall container record. Instead,
                there are a number of different container records to be found at
                the top level.
                </p>
                <p>
                Any numbers or strings stored in the records are always stored in
                Little Endian format (least significant byte first). This is the case
                no matter what platform the file was written on - be that a 
                Little Endian or a Big Endian system.
                </p>
                <p>
                PowerPoint may have Escher (DDF) records embedded in it. These
                are always held as the children of a PPDrawing record (record
                type 1036). Escher records have the same format as PowerPoint
                records.
                </p>
                </section>
                
                <section><title>Record Headers</title>
                <p>
                All records, be they containers or atoms, have the same standard
                8 byte header. It is:
                </p>
                <ul><li>1/2 byte container flag</li>
                <li>1.5 byte option field</li>
                <li>2 byte record type</li>
                <li>4 byte record length</li></ul>
                <p>
                If the first byte of the header, bitwise ANDed with 0x0f, is 0x0f,
                then the record is a container. Otherwise, it's an atom. The rest
                of the first two bytes are used to store the "options" for the
                record. Most commonly, this is used to indicate the version of
                the record, but the exact usage is record specific.
                </p>
                <p>
                The record type is a little endian number, which tells you what
                kind of record you're dealing with. Each different kind of record
                has its own value that gets stored here. PowerPoint records have
                a type that's normally less than 6000 (decimal). Escher records
                normally have a type between 0xF000 and 0xF1FF.
                </p>
                <p>
                The record length is another little endian number. For an atom,
                it's the size of the data part of the record, i.e. the length
                of the record <em>less</em> its 8 byte record header. For a
                container, it's the total size of all the records that are children
                of this record, including their own headers. That means that the
                size of a container record is the length, plus 8 bytes for its
                record header.
                </p>
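                <p>
                As a rough illustration of the header layout described above, the
                sketch below decodes one 8 byte header from a byte array. The class
                and method names are made up for this example and are not part of
                HSLF. Treating the options as the upper 12 bits of the first two
                bytes (read as a little endian number) is one reading of the
                "1.5 byte option field" above.
                </p>
  <source><![CDATA[
```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper (not an HSLF class): decodes the standard 8 byte record header. */
public final class RecordHeaderSketch {
    private RecordHeaderSketch() {}

    /** A record is a container if the low 4 bits of the first header byte are all set. */
    public static boolean isContainer(byte[] data, int offset) {
        return (data[offset] & 0x0f) == 0x0f;
    }

    /** Assumption: the options are the remaining 12 bits of the first two bytes. */
    public static int options(byte[] data, int offset) {
        return (le(data).getShort(offset) & 0xffff) >>> 4;
    }

    /** 2 byte little endian record type, at bytes 2-3 of the header. */
    public static int type(byte[] data, int offset) {
        return le(data).getShort(offset + 2) & 0xffff;
    }

    /** 4 byte little endian record length, at bytes 4-7 of the header. */
    public static long length(byte[] data, int offset) {
        return le(data).getInt(offset + 4) & 0xffffffffL;
    }

    /** Size of the whole record: the stored length plus the 8 byte header. */
    public static long sizeOnDisk(byte[] data, int offset) {
        return length(data, offset) + 8;
    }

    private static ByteBuffer le(byte[] data) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
    }
}
```
  ]]></source>
                <p>
                To step over a record without parsing it, a reader could advance
                by <code>sizeOnDisk</code> bytes from the start of its header.
                </p>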
                </section>
  
                <section><title>CurrentUserAtom, UserEditAtom and PersistPtrIncrementalBlock</title>
                <p><strong>aka Records that care about the byte level position of other records</strong></p>
                <p>
                A small number of records contain byte level position offsets to other
                records. If you change the position of any records in the file, then
                there's a good chance that you will need to update some of these
                special records.
                </p>
                <p>
                First up, CurrentUserAtom. This is actually stored in a different
                OLE2 (POIFS) stream to the main PowerPoint document. It contains
                a few bits of information on who last edited the file. Most
                importantly, at byte 8 of its contents, it stores (as a 32 bit
                little endian number) the offset in the main stream to the most
                recent UserEditAtom.
                </p>
                <p>
                The UserEditAtom contains two byte level offsets (again as 32 bit
                little endian numbers). At byte 12 is the offset to the 
                PersistPtrIncrementalBlock associated with this UserEditAtom
                (each UserEditAtom has one and only one PersistPtrIncrementalBlock).
                At byte 8, there's the offset to the previous UserEditAtom. If this
                is 0, then you're at the first one.
                </p>
                <p>
                Every time you do a non-full save in PowerPoint, it tacks on another
                UserEditAtom and another PersistPtrIncrementalBlock. The 
                CurrentUserAtom is updated to point to this new UserEditAtom, and the
                new UserEditAtom points back to the previous UserEditAtom. You then
                end up with a chain, starting from the CurrentUserAtom, linking
                back through all the UserEditAtoms, until you reach the first one
                from a full save.
                </p>
  <source>
  /-------------------------------\
  | CurrentUserAtom (own stream)  |
  |   OffsetToCurrentEdit = 10562 |==\
  \-------------------------------/  |
                                     |
  /==================================/
  |                                         /-----------------------------------\
  |                                         | PersistPtrIncrementalBlock @ 6144 |
  |                                         \-----------------------------------/
  |  /---------------------------------\                  |
  |  | UserEditAtom @ 6176             |                  |
  |  |   LastUserEditAtomOffset = 0    |                  |
  |  |   PersistPointersOffset =  6144 |==================/
  |  \---------------------------------/
  |                 |                       /-----------------------------------\
  |                 \====================\  | PersistPtrIncrementalBlock @ 8646 |
  |                                      |  \-----------------------------------/
  |  /---------------------------------\ |                |
  |  | UserEditAtom @ 8674             | |                |
  |  |   LastUserEditAtomOffset = 6176 |=/                |
  |  |   PersistPointersOffset =  8646 |==================/
  |  \---------------------------------/
  |                 |                       /------------------------------------\
  |                 \====================\  | PersistPtrIncrementalBlock @ 10538 |
  |                                      |  \------------------------------------/
  |  /---------------------------------\ |                |
  \==| UserEditAtom @ 10562            | |                |
     |   LastUserEditAtomOffset = 8674 |=/                |
     |   PersistPointersOffset = 10538 |==================/
     \---------------------------------/
  </source>
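                <p>
                Walking this chain could be sketched roughly as follows. The class
                and method names are hypothetical. The sketch assumes that bytes 8
                and 12 of the UserEditAtom are counted from the start of its
                contents (i.e. after the 8 byte header), as is stated for the
                CurrentUserAtom, and that each older UserEditAtom sits earlier in
                the stream than the one pointing back to it. The second assumption
                is only used as a guard against looping forever on a corrupt file.
                </p>
  <source><![CDATA[
```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not an HSLF class) of following the UserEditAtom chain. */
public final class EditChainSketch {
    private static final int HEADER_SIZE = 8;

    private EditChainSketch() {}

    /**
     * Returns the offsets of all PersistPtrIncrementalBlocks, newest first.
     *
     * @param mainStream        bytes of the main PowerPoint document stream
     * @param currentEditOffset offset of the most recent UserEditAtom, as
     *                          read from the CurrentUserAtom
     */
    public static List<Integer> persistBlockOffsets(byte[] mainStream, int currentEditOffset) {
        ByteBuffer buf = ByteBuffer.wrap(mainStream).order(ByteOrder.LITTLE_ENDIAN);
        List<Integer> offsets = new ArrayList<>();
        int editOffset = currentEditOffset;
        while (true) {
            // Assumption: byte positions are relative to the record contents
            int contents = editOffset + HEADER_SIZE;
            int previousEdit = buf.getInt(contents + 8);
            offsets.add(buf.getInt(contents + 12));
            if (previousEdit == 0) {
                break; // reached the UserEditAtom from the full save
            }
            if (previousEdit >= editOffset) {
                // Assumption: older edits are stored earlier in the stream
                throw new IllegalStateException("UserEditAtom chain does not move backwards");
            }
            editOffset = previousEdit;
        }
        return offsets;
    }
}
```
  ]]></source>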
                <p>
                The PersistPtrIncrementalBlock contains byte offsets to all the
                Slides, Notes, Documents and MasterSlides in the file. The first
                PersistPtrIncrementalBlock will point to all the ones that
                were present the first time the file was saved. Subsequent 
                PersistPtrIncrementalBlocks will contain pointers to all the ones
                that were changed in that edit. To find the offset to a given
                sheet in the latest version, start with the most recent
                PersistPtrIncrementalBlock. If this knows about the sheet, use the
                offset it has. If it doesn't, then work back through older
                PersistPtrIncrementalBlocks until you find one which does, and
                use that.
                </p>
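                <p>
                The lookup just described might be sketched like this, assuming
                each PersistPtrIncrementalBlock has already been decoded into a
                map from sheet number to offset, and that the maps are supplied
                newest first. The class and method names are invented for the
                example.
                </p>
  <source><![CDATA[
```java
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical sketch (not an HSLF class) of finding the latest offset for a sheet. */
public final class SheetLookupSketch {
    private SheetLookupSketch() {}

    /**
     * @param blocksNewestFirst one map per PersistPtrIncrementalBlock
     *                          (sheet number to offset), most recent first
     * @param sheetNumber       the sheet to look up
     * @return the offset from the newest block that knows about the sheet,
     *         or empty if no block mentions it
     */
    public static OptionalInt latestOffset(List<Map<Integer, Integer>> blocksNewestFirst,
                                           int sheetNumber) {
        for (Map<Integer, Integer> block : blocksNewestFirst) {
            Integer offset = block.get(sheetNumber);
            if (offset != null) {
                return OptionalInt.of(offset);
            }
        }
        return OptionalInt.empty();
    }
}
```
  ]]></source>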
                <p>
                Each PersistPtrIncrementalBlock can contain a number of entry
                blocks. Each block holds information on a sequence of sheets.
                Each block starts with a 32 bit little endian integer. Once read
                into memory, the lower 20 bits contain the starting number for the
                sequence of sheets to be described. The higher 12 bits contain
                the count of the number of sheets described. Following that is
                one 32 bit little endian integer for each sheet in the sequence, 
                the value being the offset to that sheet. If there is any data
                left after parsing a block, then it corresponds to the next block.
                </p>
  <source>
  hex on disk      decimal        description
  -----------      -------        -----------
  0000             0              No options
  7217             6002           Record type is 6002
  2000 0000        32             Length of data is 32 bytes
  0100 5000        5242881        Count is 5 (12 highest bits)
                                  Starting number is 1 (20 lowest bits)
  0000 0000        0              Sheet (1+0)=1 starts at offset 0
  900D 0000        3472           Sheet (1+1)=2 starts at offset 3472
  E403 0000        996            Sheet (1+2)=3 starts at offset 996
  9213 0000        5010           Sheet (1+3)=4 starts at offset 5010
  BE15 0000        5566           Sheet (1+4)=5 starts at offset 5566
  0900 1000        1048585        Count is 1 (12 highest bits)
                                  Starting number is 9 (20 lowest bits)
  4418 0000        6212           Sheet (9+0)=9 starts at offset 6212
  </source>
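                <p>
                Decoding the data part of a PersistPtrIncrementalBlock, as laid
                out above, might look roughly like the following. This is a sketch
                only: the class and method names are hypothetical, and the input
                is assumed to be the record data with the 8 byte header already
                removed. A truncated block would surface as a
                <code>BufferUnderflowException</code>.
                </p>
  <source><![CDATA[
```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch (not an HSLF class) of decoding PersistPtrIncrementalBlock data. */
public final class PersistPtrSketch {
    private PersistPtrSketch() {}

    /**
     * @param data the data part of the record (header already stripped)
     * @return sheet number to byte offset, in the order found in the data
     */
    public static Map<Integer, Integer> decode(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        Map<Integer, Integer> offsets = new LinkedHashMap<>();
        while (buf.remaining() >= 4) {
            int info = buf.getInt();
            int start = info & 0xFFFFF; // lower 20 bits: first sheet number
            int count = info >>> 20;    // upper 12 bits: number of sheets
            for (int i = 0; i < count; i++) {
                offsets.put(start + i, buf.getInt());
            }
        }
        return offsets;
    }
}
```
  ]]></source>
                <p>
                Fed the 32 data bytes from the table above, this yields entries
                for sheets 1 to 5 and for sheet 9.
                </p>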
                </section>
        </body>
  </document>

