On 19/06/2020 13:44, Boris Kolpackov wrote:
Hi Roger,
Thanks for getting the ball rolling. See my comments below.
Roger Leigh <rle...@codelibre.net> writes:
One of the issues I encountered was difficulty in building on modern
platforms, Windows in particular, which was the impetus for developing
the CMake build now incorporated officially in the Xerces-C 3.2.x
releases.
I would suggest we drop support for autotools in 4.0.0. I personally
view both (autotools and CMake) as pretty bad and I don't see a reason
to maintain two bad options.
I wouldn't object to this. It halves the testing required on Unix
platforms. I can't say I adore CMake. It's a big improvement over the
autotools, though still not as nice as I would like, but it serves its
purpose and my use of it is primarily pragmatic (I have submitted a
fair few upstream contributions though, including FindXercesC and
FindXalanC).
These are too deeply entrenched to consider removing at this point, but
C++11 brings native char16_t and char32_t character types which could
alleviate some of the problems.
Agree. Do you know what's the story with char16_t vs wchar_t on Windows?
Specifically, will (Windows-only) codebases that pass L""-strings to
Xerces-C++ API need to be changed?
They are not directly compatible, but a simple typecast is sufficient to
convert them. We already do this in the Win32 transcoder if I recall
correctly.
I'm not entirely sure how to class code written using L"". It's not
really portable, being Windows-only as you say (Windows being the only
platform where wchar_t is 16-bit and usable as XMLCh). And it's not
strictly portable even to different builds of Xerces-C, given that XMLCh
is configurable and that wchar_t isn't even the default (char16_t is the
default for C++11 and above).
As a result, I think any users who are using L"" would have three
options:
1. Replace L"" with u"". This would work with Xerces-C 4.0 and C++11,
but would not be backward compatible with older Xerces-C versions.
2. Add reinterpret_cast<const XMLCh *>() around L"" strings when passing
them to Xerces-C (a static_cast between unrelated pointer types will not
compile). This works with older Xerces versions as well as new ones, so
would be useful as a transitional step for Windows codebases which want
to support 4.0 and 3.2 and earlier.
3. Add reinterpret_cast<const XMLCh *>() around u"" strings when passing
them to Xerces-C. As for (2) this provides the same portability to old
and new versions (it's a no-op when XMLCh is char16_t). This might be a
better option when you want to use C++11 but you also need to support
3.2 and earlier.
So long as we had these clearly documented, I think that would provide
reasonable guidance and an effective means to transition while retaining
compatibility with older versions.
Personally, I'd go with (3) for my own projects which are already using
C++11 and UTF-8-encoded source files and then switch to (1) once all the
target platforms are using Xerces-C 4.x.
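To make options (2) and (3) a bit more concrete, here is a minimal sketch. It deliberately does not include the Xerces-C headers: the XMLCh alias and xml_length() are stand-ins made up for illustration, and as_xmlch() is a hypothetical helper, not part of any Xerces-C API. Strictly speaking, reading wchar_t data through a char16_t pointer goes beyond what the standard guarantees, so treat the L"" case as a pragmatic Windows-only measure.

```cpp
#include <cstddef>

// Assumption: stand-in for the Xerces-C XMLCh typedef (char16_t is the
// default with C++11 and later). The real headers are not included here.
using XMLCh = char16_t;

// Hypothetical helper, not part of Xerces-C: view a 16-bit string as XMLCh.
// It is a no-op when Ch is already XMLCh, and the static_assert rejects
// platforms where the widths differ (e.g. 32-bit wchar_t on Unix).
template <typename Ch>
const XMLCh* as_xmlch(const Ch* s) {
    static_assert(sizeof(Ch) == sizeof(XMLCh),
                  "character type must be the same width as XMLCh");
    return reinterpret_cast<const XMLCh*>(s);
}

// Stand-in for any API taking const XMLCh*: counts code units up to the
// terminating zero.
std::size_t xml_length(const XMLCh* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}
```

With this, xml_length(as_xmlch(u"root")) compiles wherever XMLCh is 16-bit, while xml_length(as_xmlch(L"root")) compiles only where wchar_t is also 16-bit, which in practice means Windows.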
* having exceptions derive from std::runtime_error; existing types can
remain compatible with wide strings
* support use of streams, including stringstreams, directly with Xerces e.g.
InputSource, perhaps as a set of adaptors
* where the C++ language and standard replace functionality in Xerces, it
would be worth considering replacement where there is a benefit; language
thread support might come under this category
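For the first bullet above, one possible shape is sketched below: an exception that derives from std::runtime_error for what() while still carrying a UTF-16 message for existing callers. WideMessageError and its getMessage() accessor are hypothetical names for illustration only; this is not the existing Xerces-C exception hierarchy.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical exception type for illustration; not a Xerces-C class.
class WideMessageError : public std::runtime_error {
public:
    // narrow: message exposed through std::exception::what()
    // wide:   UTF-16 message kept for callers that expect wide strings
    WideMessageError(const std::string& narrow, std::u16string wide)
        : std::runtime_error(narrow), wide_(std::move(wide)) {}

    // Wide-string accessor for existing-style callers.
    const char16_t* getMessage() const noexcept { return wide_.c_str(); }

private:
    std::u16string wide_;
};
```

Code that only knows about the standard library can then catch const std::exception& and call what(), while existing code can keep using the wide message.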
Sounds great! If we require C++11 (or later), I see no reason not to
switch to std::thread & friends.
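As a rough illustration of what the standard facilities offer here, a lock built on std::mutex and std::lock_guard needs no platform-specific code at all. GuardedCounter is a made-up class for this sketch and is not taken from the Xerces-C sources.

```cpp
#include <mutex>

// Hypothetical class for illustration; not part of Xerces-C.
class GuardedCounter {
public:
    // Increment under the lock and return the new value.
    int increment() {
        std::lock_guard<std::mutex> lock(mutex_);  // released on scope exit
        return ++value_;
    }

    // Read the current value under the same lock.
    int value() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;  // mutable so value() can lock in a const method
    int value_ = 0;
};
```

The same code builds unchanged with any C++11 compiler, so no per-platform mutex wrapper would be needed for a case like this.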
I think here we could also look to Xalan-C for examples. It has
historically made use of C++98 features including some support for
strings, streams etc., and we might be able to directly copy some of its
implementation choices (providing they make sense).
One of the nice aspects of Xerces-C is that it compiles like greased
lightning due to not making heavy use of standard library headers. It
would be nice to retain that if possible. We could possibly constrain
use of streams to specific modules, for example.
On the maintainability side, I'd like to reduce the number of configuration
options to keep testing and support within reason.
Agree. One thing that I would like to keep is the ability to build
Xerces-C++ without any third-party library dependencies.
Yes, I think that point is well made, particularly when it comes to the
network accessors, transcoders and the like. For each of these, it
should be possible to choose either a self-contained implementation or
none at all.
* We have three message loaders and three sets of translations for en_US,
but no other translations. Some or all of them might be worth thinking
about dropping given the complete lack of utility these provide.
To me keeping only ICU and inmemory sounds like the way to go.
Agreed.
* We have several network accessors. But with the modern push for using
HTTPS everywhere, should Xerces be providing its own or should we simply
require CURL or platform-specific functionality?
Yes, I believe there should be only two options: no network support and
CURL.
There are also several transcoder options. Again, I think we should
only keep the built-in stuff and ICU.
These both make sense. For the networking side, I think we could also
argue the case for the macOS accessor (cfurl), so long as it
supports SSL. The ones I think we should drop are the two direct socket
implementations (unix socket and win32 socket).
Finally, I should note that while the above might look quite disruptive, I'm
not suggesting any sort of API breakage at this point.
I agree. I think we should be mindful of migration efforts that will be
required on the user's side.
Absolutely. Many users of Xerces-C have well established codebases, and
we should take care not to break them. I think every change suggested
here should be possible without making any API break.
The only possible source of breakage is the XMLCh switch, and that's not
really a break at all given how flexible that type is today: portable
code already handles it being of varying type.
Another thing worth discussing is which C++ standard we should target. I
think at a minimum C++11 but perhaps we should be bold and aim a bit
higher?
In fact, IMO, talking about targeting a C++ standard like C++11 or C++14
is not very useful since every major C++ compiler (GCC, Clang, MSVC)
completes support for the next standard over multiple releases. As
a result, while compilers may not have complete support, they often
include a perfectly usable subset of the features.
In this light, what we found more useful is to specify the minimum
versions of the three major compilers that we are willing to support
and any features that are available in all three are fair game.
I would consider C++11 a minimum, but I would prefer C++14 as the
baseline if possible. In practice, most compilers supporting C++11 also
support a useful subset of C++14.
I agree that picking the minimum versions of MSVC, GCC and Clang is a
useful means of determining the subset of features which are permitted.
I should note that when Xerces-C 3.2 was released, and included in the
FreeBSD ports, I built it with C++11 and XMLCh=char16_t and ensured that
every (packaged) open source user of Xerces-C++ was capable of being
built both with C++11 and with char16_t [amberfish,
apache-xml-security-c, cegui, enigma, freecad, gdal, glest, kdepim,
libepp-nicbr, libkolabxml, opensaml, passwordsafe, pktanon, qbox, qgis,
qgis-ltr, shibboleth-sp, sumo, traingame, xalan-c, xmlcopyeditor,
xmltooling, xsd, zorba]. Every patch has been submitted and
incorporated upstream as far as I'm aware. This means that making the
switch to C++11 or C++14 should be completely transparent for most
Xerces-C users.
In another project I am working on (build2) we had good results with
picking GCC 4.9, Clang 3.7, and MSVC 14u3 and getting a very usable
subset of C++14 (including move capture and generic lambdas).
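For reference, the two C++14 features mentioned above look like this in practice. This is only a small sketch to show the syntax; demo() is a made-up function and nothing here comes from build2 or Xerces-C.

```cpp
#include <memory>
#include <string>
#include <utility>

// Made-up function exercising a generic lambda and a move capture (C++14).
std::string demo() {
    // Generic lambda: parameter types are deduced separately for each call.
    auto add = [](auto a, auto b) { return a + b; };

    // Move capture: the lambda takes ownership of the unique_ptr.
    std::unique_ptr<std::string> name(new std::string("xerces"));
    auto with_suffix = [p = std::move(name)](const std::string& s) {
        return *p + s;
    };

    // add() is used once with ints and once with std::string arguments.
    return with_suffix(add(std::string("-c "), std::to_string(add(3, 1))));
}
```

A compiler that accepts this file supports both features, which is the kind of check that picking minimum compiler versions would stand in for.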
I'm on slightly more recent versions, as are the AppVeyor and Travis CI
builds, but so long as we can agree on the required C++ subset we can
identify the minimum version requirement.
Another issue that we will need to decide on is which standard we
are going to build for (at least by default) and whether we will
be making it configurable. The problem here is that there is no
guarantee that code built for different standards is ABI-compatible
(and there are cases where the C++ standard itself broke this
compatibility). As a result, the only sure way to avoid surprises
is to build everything (Xerces-C++ and the application that uses
it) for the same standard.
For example, in build2 by default we use the latest available
standard for any given compiler/version but there is also a way to
override it for the entire build configuration. I am not sure if
CMake has anything like this.
It does, see https://cmake.org/cmake/help/latest/prop_tgt/CXX_STANDARD.html
We currently have it set like this:
https://github.com/apache/xerces-c/blob/master/CMakeLists.txt#L37 :
# Try C++17, then fall back to C++14, C++11 then C++98. Used for
# feature tests for optional features.
set(CMAKE_CXX_STANDARD 17)
CMAKE_CXX_STANDARD sets the requested standard version. If unavailable,
it will fall back to earlier standards. Since Xerces-C++ doesn't have a
minimum standard at present, we allow it to fall back without
restriction and then do specific feature tests to see what's available.
If we require C++14, we will need to add
set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
in order to mandate C++14. It will then fail if it is not possible to
achieve this.
What I do in my projects is something like this:
# Prefer C++17 support
if(NOT CMAKE_CXX_STANDARD)
  set(CMAKE_CXX_STANDARD 17)
  set(CMAKE_CXX_STANDARD_REQUIRED FALSE)
endif()
This is to permit the user to override it, and only default if unset.
Kind regards,
Roger