https://issues.apache.org/bugzilla/show_bug.cgi?id=51317

             Bug #: 51317
           Summary: Need ability to stream and chunk data out of MS
                    Publisher documents
           Product: POI
           Version: 3.2-FINAL
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: critical
          Priority: P2
         Component: HPBF
        AssignedTo: [email protected]
        ReportedBy: [email protected]
    Classification: Unclassified


This is a follow-up to 45602 (Add Java API for MS Publisher .pub files).

Basically, we need to be able to stream text data out of pub files and have
enough API hooks to control its chunking.

Right now, HPBFDocument doesn't support the NIO version of the POI file system,
so it loads the whole document into memory.

Text extraction is done from the QuillContents object (it probably needs to
examine the other parts too, such as Main and Escher, but that is the subject
of another ticket). QuillContents currently reads the whole document input
stream into a single byte buffer, parses it, splits it into bits, and then
picks out the text and hyperlink bits.

For streaming, we'd want a way to avoid loading everything at once and instead:
a. emit bits as they're encountered
b. make their contents streamable/chunkable, since a single bit may contain a
lot of text data
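
To make (a) and (b) concrete, here is a rough sketch of what such a hook could
look like. It is not the HPBF API and does not parse the real Quill layout:
the class BitStreamSketch, the BitHandler callback and the record format (a
4-byte ASCII type tag, a 4-byte big-endian length, then the payload) are all
made up for illustration. The point is only that each bit is handed to the
caller as soon as its header is read, and that its payload is exposed as a
bounded stream the caller can read in chunks of its own choosing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class BitStreamSketch {

    /** Hypothetical callback: called for each bit as soon as its header is read. */
    interface BitHandler {
        void onBit(String type, InputStream payload) throws IOException;
    }

    /** Exposes at most 'remaining' bytes of the underlying stream. */
    static final class BoundedInputStream extends InputStream {
        private final InputStream in;
        private long remaining;

        BoundedInputStream(InputStream in, long remaining) {
            this.in = in;
            this.remaining = remaining;
        }

        @Override public int read() throws IOException {
            if (remaining <= 0) return -1;
            int b = in.read();
            if (b < 0) throw new EOFException("truncated bit payload");
            remaining--;
            return b;
        }

        @Override public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) return 0;
            if (remaining <= 0) return -1;
            int n = in.read(buf, off, (int) Math.min(len, remaining));
            if (n < 0) throw new EOFException("truncated bit payload");
            remaining -= n;
            return n;
        }

        /** Discards whatever the handler left unread, so the next header lines up. */
        void drain() throws IOException {
            byte[] scratch = new byte[4096];
            while (read(scratch, 0, scratch.length) >= 0) { /* discard */ }
        }
    }

    /** Walks the (made-up) record format without buffering whole payloads. */
    static void streamBits(InputStream raw, BitHandler handler) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] tag = new byte[4];
        while (true) {
            int first = in.read();
            if (first < 0) return;              // clean end of stream
            tag[0] = (byte) first;
            in.readFully(tag, 1, 3);
            int length = in.readInt();
            if (length < 0) throw new IOException("negative bit length");
            BoundedInputStream payload = new BoundedInputStream(in, length);
            handler.onBit(new String(tag, StandardCharsets.US_ASCII), payload);
            payload.drain();
        }
    }

    /** Collects the text bits as chunks of at most chunkBytes bytes each. */
    static List<String> demo(byte[] data, int chunkBytes) throws IOException {
        List<String> chunks = new ArrayList<>();
        streamBits(new ByteArrayInputStream(data), (type, payload) -> {
            if (!type.equals("TEXT")) return;   // other bits are skipped, never buffered
            byte[] chunk = new byte[chunkBytes];
            int n;
            while ((n = payload.read(chunk, 0, chunk.length)) > 0) {
                chunks.add(new String(chunk, 0, n, StandardCharsets.US_ASCII));
            }
        });
        return chunks;
    }

    /** Builds three sample bits in the made-up format. */
    static byte[] sample() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        String[][] bits = {{"TEXT", "Hello, Publisher"}, {"FONT", "ignored"}, {"TEXT", "bye"}};
        for (String[] bit : bits) {
            byte[] body = bit[1].getBytes(StandardCharsets.US_ASCII);
            out.write(bit[0].getBytes(StandardCharsets.US_ASCII));
            out.writeInt(body.length);
            out.write(body);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo(sample(), 8));
    }
}
```

The demo decodes ASCII so that any chunk boundary is safe. A real
implementation would have to keep chunk boundaries aligned with the text
encoding, and would have to sit on top of a document input stream that can
itself be read incrementally.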

I've attempted to implement this but ran into exceptions in
NDocumentInputStream, which is the subject of another ticket.

Additionally, this functionality would ideally cover Publisher 2010 files,
which I don't believe it currently does (subject of another ticket).
