+1
And thank you Damjan for your continued effort!
 
Best,
Pedro

> On 10/15/2022 5:21 PM WEST Matthias Seidel <matthias.sei...@hamburg.de> wrote:
>  
>  
> 
> Hi Damjan,
> 
> I am not a developer, so I can't comment on the technical issues.
> 
> But generally I like the idea of reducing dependencies and simplifying the 
> code.
> 
> Maybe we can create a separate git branch for this task?
> 
> Regards,
> 
>    Matthias
> 
> Am 15.10.22 um 10:32 schrieb Damjan Jovanovic:
> 
> > 
> > Hi
> > 
> > We currently use 2 XML libraries in our C/C++ code, expat and libxml2.
> > This is unnecessary, one of them can be removed, and I propose we
> > remove expat.
> > 
> > expat is a small library that only does SAX parsing, and lacks DOM,
> > lacks XSLT, lacks schema validation or any more advanced features we
> > use. By comparison, libxml2 is probably the most complete open-source
> > XML library in the world for C, with endless features, many users,
> > stable API, no mandatory dependencies, portable, highly standards
> > compliant and compatible with other XML libraries, performs well
> > (https://xmlbench.sourceforge.net/results/benchmark200910/index.html),
> > is a requirement for libxslt, is permissively licensed, takes security
> > seriously (the maintainer made money by fixing its security bugs in
> > Google's bug bounties), actively developed, documented, nice readable
> > source code, and it's generally of very high quality.
> > 
> > We generally don't like unnecessary dependencies, and expat has been
> > difficult to upgrade to newer versions due to compiler issues. Since
> > expat only implements a small subset of libxml2's features, could we
> > replace it with libxml2?
> > 
> > Where is expat used? The output of "grep expat */prj/build.lst" gives
> > us these modules:
> > lucene
> > openssl
> > sax
> > shell
> > tools
> > 
> > lucene and openssl don't really require expat. Removing expat from
> > their build.lst and stopping the expat/ module from building still
> > gets lucene and openssl to build successfully, so clearly their
> > dependency on expat was always unnecessary. As of commit
> > 461facd3d94599cc708db50510b7e42483407720 they no longer depend on
> > expat.
> > 
> > tools probably also has an unnecessary dependency on expat, but this
> > is harder to prove as tools -> i18pool -> sax -> expat, so let's
> > revisit this later.
> > 
> > Only the sax and shell modules really use expat.
> > 
> > In sax, it is used to implement the com.sun.star.xml.sax.Parser UNO
> > service (com.sun.star.comp.extensions.xml.sax.ParserExpat
> > implementation name), which is used fairly widely throughout AOO,
> > including for loading OpenDocument files (sax -> svx -> xmloff), but
> > all the expat usage is contained in a single file:
> > sax/source/expatwrap/sax_expat.cxx.
> > 
> > In shell, it is used to implement a SAX parser internal to that
> > module, which is then consumed by several libraries, including
> > source/unix/sysshell which deals with the recently used file list in
> > $HOME/.recently-used.
> > 
> > --------------------------------
> > Porting sax to libxml2
> > --------------------------------
> > 
> > Patching the sax module to use libxml2 instead of expat was successful
> > enough to get AOO to build, run and load documents, even though it's
> > still incomplete.
> > 
> > Most of the SAX callbacks are compatible between expat and libxml2,
> > and converting those was easy. String types had to be changed, and
> > sometimes the parameter list has to be reordered, or expat-only
> > parameters removed. CDATA is reported with a single event in libxml2,
> > which needed conversion to startCDATA + characters + endCDATA events
> > on our UNO callbacks.
> > 
> > Unfortunately expat's XML_SetExternalEntityRefHandler() works quite
> > differently on libxml2, and has been commented out for now while I try
> > to understand what external entities are and how you generally load
> > them. Also expat's XML_SetDefaultHandlerExpand() is not yet
> > implemented and I am not sure what to do there.
> > 
> > One gotcha was that when there are no attributes on an element,
> > libxml2 calls the startElement() callback with NULL attributes, while
> > expat called it with attributes to a 1 element char** array containing
> > NULL. This was crashing i18npool's saxparser during the build, in
> > SaxExpatParser_Impl::callbackStartElement() during the conversion of
> > attributes to its attribute list. Luckily it was easy to fix.
> > 
> > The code for sax_expat.cxx has become quite a bit shorter, because
> > libxml2 always uses UTF-8 unlike expat which can be built with UTF-16
> > or UTF-8, so the XmlNChar2OUString() and XmlChar2OUString() functions
> > have been removed. Also the getErrorMessage() function no longer has
> > to have a list of hardcoded error codes with our error message for
> > each, as libxml2's xmlCtxtGetLastError() returns the official error
> > message.
> > 
> > Also the sax module uses the SAX1 API. libxml2 supports both SAX1 and
> > SAX2, but SAX1 is optional, and will be disabled if
> > LIBXML_SAX1_ENABLED is undefined in libxml2's
> > include/libxml/xmlversion.h. Currently it's hardcoded to 1, so it
> > should always be defined, but when we use the system libxml2, we don't
> > know that for sure. For now, I've added a check for
> > LIBXML_SAX1_ENABLED and preprocessor #error if missing. It's also
> > possible for the sax module to convert SAX2 callbacks to SAX1 on the
> > UNO side, but that's inefficient (eg. we'd have to concatenate
> > namespace + ":" + localname, which the consuming code will later just
> > split up again). Ideally we would make a XDocumentHandler2 UNO
> > interface based on SAX2, and upstream UNO consumers would use that, so
> > we could use SAX2 throughout AOO. This is more of a long-term plan
> > though, for now SAX1 should be fine.
> > 
> > Even with these issues, it works perfectly, building and running and
> > loading many documents. I am sure libxml2 is being used to load
> > documents instead of expat, because a debugger breakpoint on expat's
> > XML_Parse() doesn't get triggered.
> > 
> > --------------------------------
> > Porting shell to libxml2
> > --------------------------------
> > 
> > I haven't started this yet (but you can ;). It seems easier, fewer
> > expat APIs are used, and since the module is internal and we control
> > both the library and its consumers, it might even be possible to use
> > SAX2 throughout, instead of the custom parsing of tag names into
> > namespaces that it currently does.
> > 
> > --------------
> > In closing
> > --------------
> > 
> > They say "release early, release often" is what made open-source
> > software good, so I am attaching a patch with what I've done so far.
> > Please review, test, and if you can, help implement
> > XML_SetExternalEntityRefHandler() and XML_SetDefaultHandlerExpand(),
> > and port the shell module.
> > 
> > It would also be good to have benchmarks of how long it takes to load
> > a large document with expat vs libxml2.
> > 
> > Thank you
> > Damjan
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@openoffice.apache.org 
> > mailto:dev-unsubscr...@openoffice.apache.org
> > For additional commands, e-mail: dev-h...@openoffice.apache.org 
> > mailto:dev-h...@openoffice.apache.org
> > 
> 

Reply via email to