+1 And thank you Damjan for your continued effort! Best, Pedro
> On 10/15/2022 5:21 PM WEST Matthias Seidel <matthias.sei...@hamburg.de> wrote: > > > > Hi Damjan, > > I am not a developer, so I can't comment on the technical issues. > > But generally I like the idea of reducing dependencies and simplifying the > code. > > Maybe we can create a separate git branch for this task? > > Regards, > > Matthias > > Am 15.10.22 um 10:32 schrieb Damjan Jovanovic: > > > > > Hi > > > > We currently use 2 XML libraries in our C/C++ code, expat and libxml2. > > This is unnecessary, one of them can be removed, and I propose we > > remove expat. > > > > expat is a small library that only does SAX parsing, and lacks DOM, > > lacks XSLT, lacks schema validation or any more advanced features we > > use. By comparison, libxml2 is probably the most complete open-source > > XML library in the world for C, with endless features, many users, > > stable API, no mandatory dependencies, portable, highly standards > > compliant and compatible with other XML libraries, performs well > > (https://xmlbench.sourceforge.net/results/benchmark200910/index.html), > > is a requirement for libxslt, is permissively licensed, takes security > > seriously (the maintainer made money by fixing its security bugs in > > Google's bug bounties), actively developed, documented, nice readable > > source code, and it's generally of very high quality. > > > > We generally don't like unnecessary dependencies, and expat has been > > difficult to upgrade to newer versions due to compiler issues. Since > > expat only implements a small subset of libxml2's features, could we > > replace it with libxml2? > > > > Where is expat used? The output of "grep expat */prj/build.lst" gives > > us these modules: > > lucene > > openssl > > sax > > shell > > tools > > > > lucene and openssl don't really require expat. Removing expat from > > their build.lst and stopping the expat/ module from building still > > gets lucene and openssl to build successfully, so clearly their > > dependency on expat was always unnecessary. As of commit > > 461facd3d94599cc708db50510b7e42483407720 they no longer depend on > > expat. > > > > tools probably also has an unnecessary dependency on expat, but this > > is harder to prove as tools -> i18pool -> sax -> expat, so let's > > revisit this later. > > > > Only the sax and shell modules really use expat. > > > > In sax, it is used to implement the com.sun.star.xml.sax.Parser UNO > > service (com.sun.star.comp.extensions.xml.sax.ParserExpat > > implementation name), which is used fairly widely throughout AOO, > > including for loading OpenDocument files (sax -> svx -> xmloff), but > > all the expat usage is contained in a single file: > > sax/source/expatwrap/sax_expat.cxx. > > > > In shell, it is used to implement a SAX parser internal to that > > module, which is then consumed by several libraries, including > > source/unix/sysshell which deals with the recently used file list in > > $HOME/.recently-used. > > > > -------------------------------- > > Porting sax to libxml2 > > -------------------------------- > > > > Patching the sax module to use libxml2 instead of expat was successful > > enough to get AOO to build, run and load documents, even though it's > > still incomplete. > > > > Most of the SAX callbacks are compatible between expat and libxml2, > > and converting those was easy. String types had to be changed, and > > sometimes the parameter list has to be reordered, or expat-only > > parameters removed. CDATA is reported with a single event in libxml2, > > which needed conversion to startCDATA + characters + endCDATA events > > on our UNO callbacks. > > > > Unfortunately expat's XML_SetExternalEntityRefHandler() works quite > > differently on libxml2, and has been commented out for now while I try > > to understand what external entities are and how you generally load > > them. Also expat's XML_SetDefaultHandlerExpand() is not yet > > implemented and I am not sure what to do there. > > > > One gotcha was that when there are no attributes on an element, > > libxml2 calls the startElement() callback with NULL attributes, while > > expat called it with attributes to a 1 element char** array containing > > NULL. This was crashing i18npool's saxparser during the build, in > > SaxExpatParser_Impl::callbackStartElement() during the conversion of > > attributes to its attribute list. Luckily it was easy to fix. > > > > The code for sax_expat.cxx has become quite a bit shorter, because > > libxml2 always uses UTF-8 unlike expat which can be built with UTF-16 > > or UTF-8, so the XmlNChar2OUString() and XmlChar2OUString() functions > > have been removed. Also the getErrorMessage() function no longer has > > to have a list of hardcoded error codes with our error message for > > each, as libxml2's xmlCtxtGetLastError() returns the official error > > message. > > > > Also the sax module uses the SAX1 API. libxml2 supports both SAX1 and > > SAX2, but SAX1 is optional, and will be disabled if > > LIBXML_SAX1_ENABLED is undefined in libxml2's > > include/libxml/xmlversion.h. Currently it's hardcoded to 1, so it > > should always be defined, but when we use the system libxml2, we don't > > know that for sure. For now, I've added a check for > > LIBXML_SAX1_ENABLED and preprocessor #error if missing. It's also > > possible for the sax module to convert SAX2 callbacks to SAX1 on the > > UNO side, but that's inefficient (eg. we'd have to concatenate > > namespace + ":" + localname, which the consuming code will later just > > split up again). Ideally we would make a XDocumentHandler2 UNO > > interface based on SAX2, and upstream UNO consumers would use that, so > > we could use SAX2 throughout AOO. This is more of a long-term plan > > though, for now SAX1 should be fine. > > > > Even with these issues, it works perfectly, building and running and > > loading many documents. I am sure libxml2 is being used to load > > documents instead of expat, because a debugger breakpoint on expat's > > XML_Parse() doesn't get triggered. > > > > -------------------------------- > > Porting shell to libxml2 > > -------------------------------- > > > > I haven't started this yet (but you can ;). It seems easier, fewer > > expat APIs are used, and since the module is internal and we control > > both the library and its consumers, it might even be possible to use > > SAX2 throughout, instead of the custom parsing of tag names into > > namespaces that it currently does. > > > > -------------- > > In closing > > -------------- > > > > They say "release early, release often" is what made open-source > > software good, so I am attaching a patch with what I've done so far. > > Please review, test, and if you can, help implement > > XML_SetExternalEntityRefHandler() and XML_SetDefaultHandlerExpand(), > > and port the shell module. > > > > It would also be good to have benchmarks of how long it takes to load > > a large document with expat vs libxml2. > > > > Thank you > > Damjan > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: dev-unsubscr...@openoffice.apache.org > > mailto:dev-unsubscr...@openoffice.apache.org > > For additional commands, e-mail: dev-h...@openoffice.apache.org > > mailto:dev-h...@openoffice.apache.org > > >