Musings on POI Architecture

Murphy, Mark Wed, 01 Jun 2016 05:50:48 -0700

I want to apologize in advance for this stream-of-consciousness post. I hope it 
makes sense to someone.


At work I have been using the SS side of POI, and have become fairly 
comfortable with it. I realize that some things still need to be done, and 
that some issues with XML Beans have been discussed, but it seems fairly well 
organized. Recently I have been working with the WP side as well, and it is 
obviously still a work in progress. Likely there are fewer 
developers contributing there. But as I sat here considering the best way to 
get the things done that I need, I thought about the need to have a common POI 
architecture between the pieces of the project. This may exist, I just haven't 
found it yet. I have found that XWPF does not yet have a clear separation 
between the model and the usermodel. For example, to build headers and footers, 
the user must dip into the model to get a key object that has not yet been 
exposed in the usermodel. Significant parts also still require use of CT and ST 
classes. This is likely due to the early level of development of the WP portion 
of POI, but I feel that this is a great place to start if we intend to replace 
XML Beans.

I would like to propose a change to the POI architecture with respect to SS, as 
it already has a well-defined architecture. This change would allow us to more 
easily move away from XML Beans, and potentially reduce memory consumption in 
the XML format space. It seems to me that one of the reasons we use XML Beans 
is that it allows us to update XML documents in place. Unfortunately, XML is a 
highly inefficient format, and maybe it would be better, with respect to memory 
use, to model documents internally in a more efficient format, and at save time 
convert the document to its binary or XML format as necessary. In this case, 
the model would be the internal representation of the document, and the 
usermodel would be the API we expose to users of the library. In this manner we 
could have a single model and user model for each document type: spreadsheet, 
word processor, diagram, etc. Then on write we would convert to the binary or 
XML format as requested. In addition to the potential memory savings, this 
would enable a few things: We could more easily support additional formats 
(such as .ods and .csv) because we would not have to manipulate those formats 
internally. We could move XML Beans or its replacement to the periphery making 
it easier to swap out that piece. We would not run into issues such as the one 
we currently have with the swapRows() method in XSSF where the file data is 
hard to sort because of the tight coupling with XML Beans.
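
To make the idea a bit more concrete, here is a minimal sketch in Java. Every 
name in it (SheetModel, SheetWriter, CsvWriter, XmlWriter) is hypothetical and 
does not exist in POI, and the XML it emits is a toy format, not SpreadsheetML. 
It only illustrates the shape of the proposal: the model is plain Java data, 
swapping rows is a simple list operation, and the output formats sit at the 
periphery behind one interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only: none of these types are part of POI.
public class NeutralModelSketch {

    /** Internal model: plain Java collections, no XML objects behind the rows. */
    static final class SheetModel {
        private final List<List<String>> rows = new ArrayList<>();

        void addRow(List<String> cells) {
            rows.add(new ArrayList<>(cells)); // defensive copy
        }

        /** Reordering is a list operation because rows are not bound to XML nodes. */
        void swapRows(int a, int b) {
            Collections.swap(rows, a, b);
        }

        List<List<String>> rows() {
            return Collections.unmodifiableList(rows);
        }
    }

    /** Format-specific code lives behind this interface, at the periphery. */
    interface SheetWriter {
        String write(SheetModel model);
    }

    /** Naive CSV output; quoting and escaping are deliberately left out. */
    static final class CsvWriter implements SheetWriter {
        @Override
        public String write(SheetModel model) {
            StringBuilder out = new StringBuilder();
            for (List<String> row : model.rows()) {
                out.append(String.join(",", row)).append('\n');
            }
            return out.toString();
        }
    }

    /** Toy XML output (not SpreadsheetML), produced only at save time. */
    static final class XmlWriter implements SheetWriter {
        @Override
        public String write(SheetModel model) {
            StringBuilder out = new StringBuilder("<sheet>");
            for (List<String> row : model.rows()) {
                out.append("<row>");
                for (String cell : row) {
                    out.append("<c>").append(escape(cell)).append("</c>");
                }
                out.append("</row>");
            }
            return out.append("</sheet>").toString();
        }

        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }
    }

    public static void main(String[] args) {
        SheetModel model = new SheetModel();
        model.addRow(Arrays.asList("a", "1"));
        model.addRow(Arrays.asList("b", "2"));
        model.swapRows(0, 1); // no XML is touched here

        // The same model is serialized to whichever format is requested.
        System.out.print(new CsvWriter().write(model));
        System.out.println(new XmlWriter().write(model));
    }
}
```

Supporting another format in this arrangement would mean adding one more 
SheetWriter implementation, and replacing the XML binding would only touch the 
XML writer (and the matching reader, which is not shown).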

The WP side is a perfect place to try this out since it does not really have a 
well-defined separation between model and usermodel. If I go on any more, this 
thought will totally fall apart, so I will leave this open for discussion, and 
I hope that no one feels that I am stepping on toes. That is not my intention.

Mark Murphy
