Re: Musings on POI Architecture

Javen O'Neal Wed, 01 Jun 2016 07:40:24 -0700

> Unfortunately, XML is a highly inefficient format, and maybe it would be
> better, with respect to memory use, to model documents internally in a more
> efficient format, and at save time convert the document to its binary or
> XML format as necessary.


This is done in the H??F classes, where each field is read from the binary
stream, with Little Endian conversion applied to multi-byte values. This
means that each class instance uses roughly as many bytes of memory as that
element occupies in the binary stream, provided the class does not include
additional data structures to improve performance.
Meanwhile, most X??F classes store these fields in data types that are
often larger (short->int) and unpack multiple codes that were encoded in one
short into multiple 1-byte boolean fields. This is usually done while keeping
the XML nodes in memory and writing changes to the nodes. Full
deserialization and reserialization would be more memory efficient, but
requires us to implement every feature that could exist on that element
(otherwise updating a document could result in loss of data or corruption).
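
To make the contrast concrete, here is a minimal sketch in plain Java. It
does not use POI's own classes; PackedFieldsSketch, CompactFont,
UnpackedFont and the two bit masks are hypothetical names invented for
illustration, not real record layouts.

```java
final class PackedFieldsSketch {
    // Hypothetical bit masks for two flags packed into one 16-bit options field.
    static final int BOLD_MASK = 0x0001;
    static final int ITALIC_MASK = 0x0002;

    /** Reads an unsigned 16-bit little-endian value: low byte first, then high byte. */
    static int readUShortLE(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    /** Compact style: keeps the packed short as read and decodes flags on demand. */
    static final class CompactFont {
        private final short options;

        CompactFont(byte[] data, int offset) {
            this.options = (short) readUShortLE(data, offset);
        }

        boolean isBold() { return (options & BOLD_MASK) != 0; }
        boolean isItalic() { return (options & ITALIC_MASK) != 0; }
    }

    /** Unpacked style: one boolean field per flag, so more memory per instance. */
    static final class UnpackedFont {
        final boolean bold;
        final boolean italic;

        UnpackedFont(int options) {
            this.bold = (options & BOLD_MASK) != 0;
            this.italic = (options & ITALIC_MASK) != 0;
        }
    }

    public static void main(String[] args) {
        byte[] record = {0x03, 0x00}; // both hypothetical flags set
        CompactFont compact = new CompactFont(record, 0);
        UnpackedFont unpacked = new UnpackedFont(readUShortLE(record, 0));
        System.out.println(compact.isBold() + " " + unpacked.italic);
    }
}
```

Both styles answer the same questions; the difference is only in what each
instance holds in memory.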

I think reading the XML into regular Java data structures and discarding
the XML nodes from memory at read time, then recreating the XML at save
time, would be a good direction to aim for, but it's such a large task that
no one has done it.
As difficult as it is for me to ask IT at my day job to provide 16GB of RAM
to engineers who use internal POI-powered applications, it's less work than
memory-optimizing just the subset of XSSF classes that we use.
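
A rough sketch of that read-discard-recreate direction, using the JDK's
StAX API rather than anything in POI. The class PlainRunSketch and the
un-namespaced element names r, b and t are simplified placeholders, not the
real schema. Note that any element the reader does not know about is
silently dropped on save, which is exactly the data-loss risk mentioned
above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical plain-Java model of a text run; no XML nodes are retained. */
final class PlainRunSketch {
    boolean bold;
    String text = "";

    /** Streams through the XML once and copies the known fields into plain fields. */
    static PlainRunSketch read(String xml) {
        PlainRunSketch run = new PlainRunSketch();
        try {
            XMLStreamReader in =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (in.hasNext()) {
                if (in.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = in.getLocalName();
                    if (name.equals("b")) {
                        run.bold = true;
                    } else if (name.equals("t")) {
                        run.text = in.getElementText();
                    }
                    // Unknown elements are ignored here, so they are lost on save.
                }
            }
            in.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse run", e);
        }
        return run;
    }

    /** Recreates the XML from the plain fields at save time. */
    String write() {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("r");
            if (bold) {
                w.writeEmptyElement("b");
            }
            w.writeStartElement("t");
            w.writeCharacters(text);
            w.writeEndElement(); // closes t
            w.writeEndElement(); // closes r
            w.flush();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not serialize run", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        PlainRunSketch run = read("<r><b/><t>Hello</t></r>");
        System.out.println(run.write());
    }
}
```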

Don't let the magnitude of this task turn you off. Chisel away at bloaty
classes as you are able/interested.
On Jun 1, 2016 07:13, "Javen O'Neal" <[email protected]> wrote:

> > create a branch and start experimenting! :)
> Forking the Git mirror might be the easiest way to manage these
> contributions.
> On Jun 1, 2016 06:35, "Nick Burch" <[email protected]> wrote:
>
>> On Wed, 1 Jun 2016, Murphy, Mark wrote:
>>
>>> At work I have been using the SS side of POI, and have become fairly
>>> comfortable with it. I realize that there are some things still that need
>>> to be done, and some issues with XML Beans that have been discussed, but it
>>> seems fairly well organized. Recently I have also been working with the WP
>>> side as well, and it is obviously still a work in progress.
>>>
>>
>> There's not a lot of link between HWPF and XWPF. I tried to put one in,
>> but the formats have a surprising number of differences in concepts and
>> approaches, more-so than HSSF/XSSF. Coupled with less XWPF contributions,
>> and HWPF needing lots of love after the loss of the main developer, and
>> that's how we end up in the situation today...
>>
>>> I have found that XWPF does not yet have a clear separation between the
>>> model and the usermodel.
>>>
>>
>> For anything done by POI committers, it should do. However, we've taken a
>> lot of community contributions, and many of those steer more towards "get
>> it done" than "build a full solution perfectly". That's why you see a lot
>> of "leakages" of the low-level XML stuff. It'd be great to wrap all of that
>> stuff up! And required for dropping xmlbeans - we need to get everyone off
>> the CT classes if we want to be able to replace them
>>
>>> I would like to propose a change to the POI architecture with respect to
>>> SS, as it already has a well-defined architecture. This change would allow
>>> us to more easily move away from XML Beans, and potentially reduce memory
>>> consumption in the XML format space. It seems to me that one of the reasons
>>> we use XML Beans is that it allows us to update XML documents in place.
>>>
>>
>> On the whole, you can buy/beg/rent more memory, or faster machines. The
>> resource we really lack in POI is contributors writing code or
>> documentation or tests. xmlbeans makes development of the X??F stuff
>> quicker, and that's what we tend to optimise for!
>>
>>> Unfortunately, XML is a highly inefficient format, and maybe it would be
>>> better, with respect to memory use, to model documents internally in a more
>>> efficient format, and at save time convert the document to its binary or
>>> XML format as necessary.
>>>
>>
>> The binary and XML formats have more differences than you'd ideally
>> expect or like, which in part is why we don't have more shared stuff
>> between them. Not saying that this plan wouldn't work, just that it might
>> not be as clean as you'd like especially for more fiddly stuff like
>> formatting, colours or the like
>>
>>> The WP side is a perfect place to try this out since it does not really
>>> have a well-defined separation between model and usermodel. If I go on any
>>> more, this thought will totally fall apart, so I will leave this open for
>>> discussion, and I hope that no one feels that I am stepping on toes. That
>>> is not my intention.
>>>
>>
>> As long as it doesn't make new contributions to POI harder or slower (we
>> need more contributions!), and as long as you want to do the work, create a
>> branch and start experimenting! :)
>>
>> Nick
