Re: XML stream writer library

2021-01-12 Thread Richard Kimberly Heck
On 1/12/21 6:19 PM, Thibaut Cuvelier wrote:
> On Tue, 12 Jan 2021 at 16:33, Lorenzo Bertini
> mailto:lorenzobertin...@gmail.com>> wrote:
>
> Il 08/01/21 03:00, Thibaut Cuvelier ha scritto:
> > A tour of some C++ libraries for XML:
> > - RapidXML: mostly unmaintained since 2013, no support for
> namespaces
> > (except in forks: https://github.com/dwd/rapidxml
> 
> > >)
> > - Boost Property Tree: no XML parser, which limits further use
> (it can
> > use RapidXML though, see above)
> > - libstudxml: C++ library, designed for speed, no DOM
> > - libxml2: C library, designed for features and not speed (also
> includes
> > XPath and XSLT, DTD and XML Schema, namespaces), "mature" and
> barely not
> > evolving anymore
> > - libxml++: depends on glibmm2
> > - Xerces-C++: C++ library, designed for features and not speed
> (also
> > includes XPath, DTD and XML Schema, namespaces), "mature" and
> barely not
> > evolving anymore; no XSLT (Xalan could be used, but it only
> works with a
> > ancient version of Xerces; XQuilla implemented XPath 2, but is
> no more
> > developed since 2016)
> > - Expat: C library, designed for speed, no DOM by default
> (provided by
> > https://github.com/kolotsey/expat-dom
> 
> >  >), with namespaces
> > - tinyxml2: C++ library, designed for speed only (also includes
> XPath
> > through the unmaintained
> https://github.com/stanthomas/tinyxml2-ex
> 
> >  >, no validation, no
> > namespaces), mature and slowly evolving
> > - pugixml: C++ library, designed for speed with a few features
> (like
> > XPath, no validation, no namespaces), mature and evolving
> > - libroxml: C library, no clear design goal (includes XPath,
> namespaces,
> > no validation), evolving
> > - Saxon-C: C/C++ wrapper of the state-of-the-art Java library,
> largest
> > amount of features (XPath and XSLT 3, DTD and XML Schema
> validation --
> > extension for RelaxNG: http://www.cfoster.net/saxon-jing/
> 
> >  > --, namespaces), very mature,
> > really evolving (both performance and features), but it requires
> a JVM
> > (Excelsior is built-in, even though it's not been maintained for
> quite a
> > long time)
> > - Qt: no, I was joking :). Qt XML is not supported anymore, it's
> > recommended to switch to QXmlStreamReader and QXmlStreamWriter
> (which
> > are only SAX-like). Qt XML Patterns used to have XPath, XSLT,
> and XML
> > Schema, but it's been deprecated a while ago (Qt 5.13 for the last
> > wake-up call, but it hasn't been touched since Qt 4, basically)
> >
> > If LyX is being really serious about XML (i.e. moving as many
> things as
> > possible to XML technologies), Saxon is probably the way to go.
> > Otherwise, it's going to be too heavy to ship Saxon and a JVM
> along with
> > LyX. Instead, pugixml seems to me like a good choice: a few
> features
> > (XPath is the most relevant for LyX, and included in the base
> library,
> > no need for addons), good performance, still maintained (there is a
> > chance to have bugs fixed in a newer version, plus security
> > vulnerabilities taken care of).
> Was this addressed in the virtual meeting? 
>
>
> As far as I know, it wasn't discussed.

We were pretty focused on planning for 2.4.0.

 

> Anyhow, I think that for a start we'd need only the most basic
> features
> (tag insertion, indent), as was the purpose of #12055 in the first
> place
> (I'm sorry to have opened this pandora's box), so maybe no harm will
> come if we start wrapping pugi.
>
> Let me know what you think, and if this is not the time for this, as
> with LyX 2.4 coming out there might be other things that need focus.
>
>
> It looks like the patches cannot get integrated into the master
> development branch before 2.4 is out (or at least branched). However,
> in the meantime, I think I can create a feature branch and push your
> patches there (https://www.lyx.org/trac/browser/features
> ).

Yes, that would be the way to go.

Riki



-- 
lyx-devel mailing list
lyx-devel@lists.lyx.org
http://lists.lyx.org/mailman/listinfo/lyx-devel


Re: XML stream writer library

2021-01-12 Thread Thibaut Cuvelier
On Tue, 12 Jan 2021 at 16:33, Lorenzo Bertini 
wrote:

> Il 08/01/21 03:00, Thibaut Cuvelier ha scritto:
> > A tour of some C++ libraries for XML:
> > - RapidXML: mostly unmaintained since 2013, no support for namespaces
> > (except in forks: https://github.com/dwd/rapidxml
> > )
> > - Boost Property Tree: no XML parser, which limits further use (it can
> > use RapidXML though, see above)
> > - libstudxml: C++ library, designed for speed, no DOM
> > - libxml2: C library, designed for features and not speed (also includes
> > XPath and XSLT, DTD and XML Schema, namespaces), "mature" and barely not
> > evolving anymore
> > - libxml++: depends on glibmm2
> > - Xerces-C++: C++ library, designed for features and not speed (also
> > includes XPath, DTD and XML Schema, namespaces), "mature" and barely not
> > evolving anymore; no XSLT (Xalan could be used, but it only works with a
> > ancient version of Xerces; XQuilla implemented XPath 2, but is no more
> > developed since 2016)
> > - Expat: C library, designed for speed, no DOM by default (provided by
> > https://github.com/kolotsey/expat-dom
> > ), with namespaces
> > - tinyxml2: C++ library, designed for speed only (also includes XPath
> > through the unmaintained https://github.com/stanthomas/tinyxml2-ex
> > , no validation, no
> > namespaces), mature and slowly evolving
> > - pugixml: C++ library, designed for speed with a few features (like
> > XPath, no validation, no namespaces), mature and evolving
> > - libroxml: C library, no clear design goal (includes XPath, namespaces,
> > no validation), evolving
> > - Saxon-C: C/C++ wrapper of the state-of-the-art Java library, largest
> > amount of features (XPath and XSLT 3, DTD and XML Schema validation --
> > extension for RelaxNG: http://www.cfoster.net/saxon-jing/
> >  --, namespaces), very mature,
> > really evolving (both performance and features), but it requires a JVM
> > (Excelsior is built-in, even though it's not been maintained for quite a
> > long time)
> > - Qt: no, I was joking :). Qt XML is not supported anymore, it's
> > recommended to switch to QXmlStreamReader and QXmlStreamWriter (which
> > are only SAX-like). Qt XML Patterns used to have XPath, XSLT, and XML
> > Schema, but it's been deprecated a while ago (Qt 5.13 for the last
> > wake-up call, but it hasn't been touched since Qt 4, basically)
> >
> > If LyX is being really serious about XML (i.e. moving as many things as
> > possible to XML technologies), Saxon is probably the way to go.
> > Otherwise, it's going to be too heavy to ship Saxon and a JVM along with
> > LyX. Instead, pugixml seems to me like a good choice: a few features
> > (XPath is the most relevant for LyX, and included in the base library,
> > no need for addons), good performance, still maintained (there is a
> > chance to have bugs fixed in a newer version, plus security
> > vulnerabilities taken care of).
> Was this addressed in the virtual meeting?


As far as I know, it wasn't discussed.


> Also, since Xerces-C was the
> most feature full and mature after Saxon-C, I was curious as to why you
> didn't mention it.
>

Actually, Xerces-C and Xerces-C++ are just the same thing (the official
name being Xerces-C++ and the name of the packages Xerces-C, if I got it
correctly).


> Anyhow, I think that for a start we'd need only the most basic features
> (tag insertion, indent), as was the purpose of #12055 in the first place
> (I'm sorry to have opened this pandora's box), so maybe no harm will
> come if we start wrapping pugi.
>
> Let me know what you think, and if this is not the time for this, as
> with LyX 2.4 coming out there might be other things that need focus.
>

It looks like the patches cannot get integrated into the master development
branch before 2.4 is out (or at least branched). However, in the meantime,
I think I can create a feature branch and push your patches there (
https://www.lyx.org/trac/browser/features).
-- 
lyx-devel mailing list
lyx-devel@lists.lyx.org
http://lists.lyx.org/mailman/listinfo/lyx-devel


Re: XML stream writer library

2021-01-12 Thread Lorenzo Bertini

Il 08/01/21 03:00, Thibaut Cuvelier ha scritto:

A tour of some C++ libraries for XML:
- RapidXML: mostly unmaintained since 2013, no support for namespaces 
(except in forks: https://github.com/dwd/rapidxml 
)
- Boost Property Tree: no XML parser, which limits further use (it can 
use RapidXML though, see above)

- libstudxml: C++ library, designed for speed, no DOM
- libxml2: C library, designed for features and not speed (also includes 
XPath and XSLT, DTD and XML Schema, namespaces), "mature" and barely not 
evolving anymore

- libxml++: depends on glibmm2
- Xerces-C++: C++ library, designed for features and not speed (also 
includes XPath, DTD and XML Schema, namespaces), "mature" and barely not 
evolving anymore; no XSLT (Xalan could be used, but it only works with a 
ancient version of Xerces; XQuilla implemented XPath 2, but is no more 
developed since 2016)
- Expat: C library, designed for speed, no DOM by default (provided by 
https://github.com/kolotsey/expat-dom 
), with namespaces
- tinyxml2: C++ library, designed for speed only (also includes XPath 
through the unmaintained https://github.com/stanthomas/tinyxml2-ex 
, no validation, no 
namespaces), mature and slowly evolving
- pugixml: C++ library, designed for speed with a few features (like 
XPath, no validation, no namespaces), mature and evolving
- libroxml: C library, no clear design goal (includes XPath, namespaces, 
no validation), evolving
- Saxon-C: C/C++ wrapper of the state-of-the-art Java library, largest 
amount of features (XPath and XSLT 3, DTD and XML Schema validation -- 
extension for RelaxNG: http://www.cfoster.net/saxon-jing/ 
 --, namespaces), very mature, 
really evolving (both performance and features), but it requires a JVM 
(Excelsior is built-in, even though it's not been maintained for quite a 
long time)
- Qt: no, I was joking :). Qt XML is not supported anymore, it's 
recommended to switch to QXmlStreamReader and QXmlStreamWriter (which 
are only SAX-like). Qt XML Patterns used to have XPath, XSLT, and XML 
Schema, but it's been deprecated a while ago (Qt 5.13 for the last 
wake-up call, but it hasn't been touched since Qt 4, basically)


If LyX is being really serious about XML (i.e. moving as many things as 
possible to XML technologies), Saxon is probably the way to go. 
Otherwise, it's going to be too heavy to ship Saxon and a JVM along with 
LyX. Instead, pugixml seems to me like a good choice: a few features 
(XPath is the most relevant for LyX, and included in the base library, 
no need for addons), good performance, still maintained (there is a 
chance to have bugs fixed in a newer version, plus security 
vulnerabilities taken care of).
Was this addressed in the virtual meeting? Also, since Xerces-C was the 
most feature full and mature after Saxon-C, I was curious as to why you 
didn't mention it.


Anyhow, I think that for a start we'd need only the most basic features 
(tag insertion, indent), as was the purpose of #12055 in the first place 
(I'm sorry to have opened this pandora's box), so maybe no harm will 
come if we start wrapping pugi.


Let me know what you think, and if this is not the time for this, as 
with LyX 2.4 coming out there might be other things that need focus.

--
lyx-devel mailing list
lyx-devel@lists.lyx.org
http://lists.lyx.org/mailman/listinfo/lyx-devel


Re: XML stream writer library

2021-01-07 Thread Thibaut Cuvelier
On Thu, 7 Jan 2021 at 18:23, Thibaut Cuvelier  wrote:

> On Thu, 7 Jan 2021, 12:52 Lorenzo Bertini, 
> wrote:
>
>> I think almost all the options are on the table at this point. For the
>> sake of completeness I think it's worth mentioning DOM library Boost
>> Property Tree, which popped up frequently while searching.
>>
>> I think Thibaut is right when saying that, for the way LyX is structured
>> now, a SAX writer would be more appropriate, because we won't work on
>> xml directly, but convert the LyX file. However most of the libraries
>> have a DOM approach, and also, if someday we'll convert LyX format to
>> something xml-like, we might have to start all of this again.
>>
>> I did a small benchmark with pugixml and to both read and write a xml
>> document of 2.2Mb of equivalent ~100/120 pages chock full of math: it
>> takes negligble time to both read and write on my really modest laptop
>> A10-9600). Peak memory consumption was 14Mb, but since some MathML was
>> corrupted (it has trouble with backslash \) it's possible it might be
>> way less once fixed: LyX consumption opening the corresponding LyX file
>> was ~120Mb. The benchmark table in
>> <
>> http://rapidxml.sourceforge.net/manual.html#namespacerapidxml_1performance>
>>
>> seems to indicate that pugixml and RapidXML have performance just one
>> order greater than strlen, so I don't think parse time will ever be a
>> problem.
>
>
> Thanks for your benchmark. For me, the major difference between the two
> libraries is that pugixml is still maintained, but not really RapidXML. And
> XML parsing is very often a source of security problems (not just XXE).
>
> I'm unfamiliar with the concept of "wrapping" libraries and "layers": is
>> it when you write your own classes and methods on top of some common
>> stuff those libraries do, so if for whatever reason you have to switch
>> you can "plug" another easily?
>>
>
> Yes, exactly.
>

Below is my take on
https://stackoverflow.com/questions/9387610/what-xml-parser-should-i-use-in-c
and https://github.com/fffaraz/awesome-cpp#xml

XPath would be very useful if LyX switches to an XML representation (easy
queries on an XML document, think of SQL for XML).
XSLT is a way to describe transformations from XML to anything. If LyX
switches to an XML representation, it might be used to replace C++
exporters (but formula conversion will be a pain!). It might lower the
entry bar for new contributors, even though XSLT is not an easy language.
XQuery is a script language for XML processes.
Apart from Java libraries, only versions 1.0 are implemented: apart from
XPath, it really limits their use… A state-of-the-art implementation of the
current norms is Saxon, which has a C binding.

To allow for validation of XML files (i.e. check they respect some
grammar), DTD is the oldest way (inherited from SGML), XML Schema adds many
features over DTD (like types). The best technology nowadays is RelaxNG
(it's not recent: 2005), which is much more powerful than XML Schema.

XInclude is the XML way of specifying includes of other files (not
necessarily XML). Think \input in LaTeX or LyX child documents with a few
more features.

Name spaces are similar to those of C++, and are especially useful when
mixing several standards (like MathML and DocBook).

A tour of some C++ libraries for XML:
- RapidXML: mostly unmaintained since 2013, no support for namespaces
(except in forks: https://github.com/dwd/rapidxml)
- Boost Property Tree: no XML parser, which limits further use (it can use
RapidXML though, see above)
- libstudxml: C++ library, designed for speed, no DOM
- libxml2: C library, designed for features and not speed (also includes
XPath and XSLT, DTD and XML Schema, namespaces), "mature" and barely not
evolving anymore
- libxml++: depends on glibmm2
- Xerces-C++: C++ library, designed for features and not speed (also
includes XPath, DTD and XML Schema, namespaces), "mature" and barely not
evolving anymore; no XSLT (Xalan could be used, but it only works with a
ancient version of Xerces; XQuilla implemented XPath 2, but is no more
developed since 2016)
- Expat: C library, designed for speed, no DOM by default (provided by
https://github.com/kolotsey/expat-dom), with namespaces
- tinyxml2: C++ library, designed for speed only (also includes XPath
through the unmaintained https://github.com/stanthomas/tinyxml2-ex, no
validation, no namespaces), mature and slowly evolving
- pugixml: C++ library, designed for speed with a few features (like XPath,
no validation, no namespaces), mature and evolving
- libroxml: C library, no clear design goal (includes XPath, namespaces, no
validation), evolving
- Saxon-C: C/C++ wrapper of the state-of-the-art Java library, largest
amount of features (XPath and XSLT 3, DTD and XML Schema validation --
extension for RelaxNG: http://www.cfoster.net/saxon-jing/ --, namespaces),
very mature, really evolving (both performance and features), but it
requires a JVM (Excelsior is built-in, even though it's 

Re: XML stream writer library

2021-01-07 Thread Thibaut Cuvelier
On Thu, 7 Jan 2021, 12:52 Lorenzo Bertini, 
wrote:

> I think almost all the options are on the table at this point. For the
> sake of completeness I think it's worth mentioning DOM library Boost
> Property Tree, which popped up frequently while searching.
>
> I think Thibaut is right when saying that, for the way LyX is structured
> now, a SAX writer would be more appropriate, because we won't work on
> xml directly, but convert the LyX file. However most of the libraries
> have a DOM approach, and also, if someday we'll convert LyX format to
> something xml-like, we might have to start all of this again.
>
> I did a small benchmark with pugixml and to both read and write a xml
> document of 2.2Mb of equivalent ~100/120 pages chock full of math: it
> takes negligble time to both read and write on my really modest laptop
> A10-9600). Peak memory consumption was 14Mb, but since some MathML was
> corrupted (it has trouble with backslash \) it's possible it might be
> way less once fixed: LyX consumption opening the corresponding LyX file
> was ~120Mb. The benchmark table in
> <
> http://rapidxml.sourceforge.net/manual.html#namespacerapidxml_1performance>
>
> seems to indicate that pugixml and RapidXML have performance just one
> order greater than strlen, so I don't think parse time will ever be a
> problem.


Thanks for your benchmark. For me, the major difference between the two
libraries is that pugixml is still maintained, but not really RapidXML. And
XML parsing is very often a source of security problems (not just XXE).

I'm unfamiliar with the concept of "wrapping" libraries and "layers": is
> it when you write your own classes and methods on top of some common
> stuff those libraries do, so if for whatever reason you have to switch
> you can "plug" another easily?
>

Yes, exactly.

>
-- 
lyx-devel mailing list
lyx-devel@lists.lyx.org
http://lists.lyx.org/mailman/listinfo/lyx-devel


Re: XML stream writer library

2021-01-07 Thread Lorenzo Bertini
I think almost all the options are on the table at this point. For the 
sake of completeness I think it's worth mentioning DOM library Boost 
Property Tree, which popped up frequently while searching.


I think Thibaut is right when saying that, for the way LyX is structured 
now, a SAX writer would be more appropriate, because we won't work on 
xml directly, but convert the LyX file. However most of the libraries 
have a DOM approach, and also, if someday we'll convert LyX format to 
something xml-like, we might have to start all of this again.


I did a small benchmark with pugixml and to both read and write a xml 
document of 2.2Mb of equivalent ~100/120 pages chock full of math: it 
takes negligble time to both read and write on my really modest laptop 
A10-9600). Peak memory consumption was 14Mb, but since some MathML was 
corrupted (it has trouble with backslash \) it's possible it might be 
way less once fixed: LyX consumption opening the corresponding LyX file 
was ~120Mb. The benchmark table in 
 
seems to indicate that pugixml and RapidXML have performance just one 
order greater than strlen, so I don't think parse time will ever be a 
problem.


I'm unfamiliar with the concept of "wrapping" libraries and "layers": is 
it when you write your own classes and methods on top of some common 
stuff those libraries do, so if for whatever reason you have to switch 
you can "plug" another easily?


Thanks, Lo.
--
lyx-devel mailing list
lyx-devel@lists.lyx.org
http://lists.lyx.org/mailman/listinfo/lyx-devel


Re: XML stream writer library

2021-01-06 Thread Thibaut Cuvelier
On Tue, 5 Jan 2021 at 10:37, Joel Kulesza  wrote:

> On Tue, Jan 5, 2021 at 1:19 AM Pavel Sanda  wrote:
>
>> On Mon, Jan 04, 2021 at 09:48:42PM +0100, Thibaut Cuvelier wrote:
>> > There are multiple issues here. What is needed to generate HTML and
>> DocBook
>> > is a simple SAX writer, not a parser. I've done plenty of research about
>> > it, there's no XML library that does that. Most of them are using a DOM,
>> > which is a total waste of memory for such an application: it stores a
>> > complete XML tree in memory before serialising it. With SAX, you just
>> need
>> > a string backend, which is much more lightweight (by several factors).
>>
>> After little bit more thinking, is using DOM actually that big issue?
>> I mean how much it takes - for document of length n its O(n) in space?
>>
>> Sure, it might be cut to constant, but practically speaking when you have
>> 100 pages document what is the real time/memory consumption. Timewise
>> you spent 1s in XML compared to next 30s in conversion figures to pdf or
>> whatever format? Spacewise probably one more time than what we
>> already allocated for document itself.
>>
>> If using more heavy-weight caliber xml lib is not pain from API point
>> of view (and I do not know, you are the expert here) then we might
>> actually consider it, given the difficulties in SAX space?
>>
>
> I had a similar thought and will note that I've had good success on other
> projects with pugixml.
>

It's typical to have a DOM tree that is two to five times larger than the
raw text, that's not always negligible (Xerces is close to 2, Java
implementations anywhere between 2 and 5, I haven't checked pugixml or
TinyXML2 for this specific criterion). But that's not the real issue: for
generating HTML and DocBook, for now, DOM is not so useful from a developer
point of view, DOM is more suitable to handle an existing document or to
modify it, not really to generate one from scratch. A SAX writer is really
what's the most appropriate, given the way LyX is internally structured:
there is very little need to go backward when generating the file (e.g.,
add something to the header when encountering some LyX inset).

Using DOM will not really simplify the code (I'm speaking for the DocBook
export, which is highly similar to HTML). However, it might make its logic
easier to understand for a newcomer. Nevertheless, DOM comes with more
complex syntax: with SAX, you are only appending content to the file, with
only strings; with DOM, you have to indicate where you want to write
something (with methods like InsetEndChild), and you pass around complete
XML nodes (built from the same strings).

More specifically, in SAX (where stream is mostly a large string object
with helper methods):

stream.writeStartTag("tag");

With DOM, taking the example of TinyXML2 (where document is the root of the
DOM tree and node the node in the tree that is being filled):

node->InsertEndChild( document->NewElement("tag") );

Both are perfectly good choices, though. If we write a thin layer on top of
a DOM writer (as Riki suggested, this would allow decoupling with the
actual XML library), we might be able to have a syntax close to that of SAX
while having the extra flexibility of DOM. This way, the LyX code would be
clean, and avoid current intricacies to output things at the right place
(in DocBook, especially the  tag).

More specifically, @Pavel: for DocBook, you spend 0% of your time dealing
with images, as it's supposed to be done by the DocBook processor
afterwards. Any gain in the XML part of LyX will be noticeable by the user
for large documents (book-sized).
(And I won't say that something being O(n) is negligible in this case: I'm
using daily exponential-time algorithms that work so much faster than
polynomial-time ones…)
-- 
lyx-devel mailing list
lyx-devel@lists.lyx.org
http://lists.lyx.org/mailman/listinfo/lyx-devel


Re: XML stream writer library

2021-01-05 Thread Joel Kulesza
On Tue, Jan 5, 2021 at 1:19 AM Pavel Sanda  wrote:

> On Mon, Jan 04, 2021 at 09:48:42PM +0100, Thibaut Cuvelier wrote:
> > There are multiple issues here. What is needed to generate HTML and
> DocBook
> > is a simple SAX writer, not a parser. I've done plenty of research about
> > it, there's no XML library that does that. Most of them are using a DOM,
> > which is a total waste of memory for such an application: it stores a
> > complete XML tree in memory before serialising it. With SAX, you just
> need
> > a string backend, which is much more lightweight (by several factors).
>
> After little bit more thinking, is using DOM actually that big issue?
> I mean how much it takes - for document of length n its O(n) in space?
>
> Sure, it might be cut to constant, but practically speaking when you have
> 100 pages document what is the real time/memory consumption. Timewise
> you spent 1s in XML compared to next 30s in conversion figures to pdf or
> whatever format? Spacewise probably one more time than what we
> already allocated for document itself.
>
> If using more heavy-weight caliber xml lib is not pain from API point
> of view (and I do not know, you are the expert here) then we might
> actually consider it, given the difficulties in SAX space?
>

I had a similar thought and will note that I've had good success on other
projects with pugixml.

Regards,
Joel
-- 
lyx-devel mailing list
lyx-devel@lists.lyx.org
http://lists.lyx.org/mailman/listinfo/lyx-devel


Re: XML stream writer library

2021-01-05 Thread Pavel Sanda
On Mon, Jan 04, 2021 at 09:48:42PM +0100, Thibaut Cuvelier wrote:
> There are multiple issues here. What is needed to generate HTML and DocBook
> is a simple SAX writer, not a parser. I've done plenty of research about
> it, there's no XML library that does that. Most of them are using a DOM,
> which is a total waste of memory for such an application: it stores a
> complete XML tree in memory before serialising it. With SAX, you just need
> a string backend, which is much more lightweight (by several factors). 

After little bit more thinking, is using DOM actually that big issue?
I mean how much it takes - for document of length n its O(n) in space? 

Sure, it might be cut to constant, but practically speaking when you have 
100 pages document what is the real time/memory consumption. Timewise
you spent 1s in XML compared to next 30s in conversion figures to pdf or
whatever format? Spacewise probably one more time than what we
already allocated for document itself.

If using more heavy-weight caliber xml lib is not pain from API point
of view (and I do not know, you are the expert here) then we might
actually consider it, given the difficulties in SAX space?

Pavel
-- 
lyx-devel mailing list
lyx-devel@lists.lyx.org
http://lists.lyx.org/mailman/listinfo/lyx-devel


Re: XML stream writer library

2021-01-04 Thread Richard Kimberly Heck
On 1/4/21 5:10 PM, Pavel Sanda wrote:
> On Mon, Jan 04, 2021 at 09:48:42PM +0100, Thibaut Cuvelier wrote:
>> My recommendation, based on a quite long study of XML libraries (i.e.
>> several years, but quite far from full-time): either use QXmlStreamWriter
>> (which is mostly a SAX implementation in C++) or write our own.
>> QXmlStreamWriter is almost 4k-line long, but it can substantially be
>> simplified in our case (
>> https://github.com/qt/qtbase/blob/54875be84de059374920e4c0deacd13a41caaa13/src/corelib/serialization/qxmlstream.cpp).
>>
>>
>> TinyXML2 (https://github.com/leethomason/tinyxml2), pugixml (
>> https://github.com/zeux/pugixml), and Xerces-C++ (
>> https://xerces.apache.org/xerces-c/) are only DOM-based. There are quite a
>> few C libraries, like libxml2, that can be SAX-like, but C libraries are
>> horrible to use (http://www.xmlsoft.org/examples/testWriter.c).

I did some searching and, yes, I see the problem. Word is that recent
versions of libxml and libxml2 have dependencies on Gnome libraries that
we don't want.

I'll let you know if I get any answers to my question on the Fedora list.


> I do not dare to make any qualified recommendation between the choices
> above. But thinking aloud -- if there de facto isn't an alternative
> to QXmlStreamWriter, would it be hard to separate that class from
> the rest of Qt, fork and include it as an internal lyx routine?
> We would have full control over that code without unnecessary surprises
> of Qt's development.

I was going to suggest something in this spirit.

If, as our usual policy has been, we confine QXmlStreamWrapper to
support/, then what that basically means is writing our own LyX API as a
kind of wrapper around the Qt stuff. (Thibaut, if you haven't already,
you might look at how the FileName class. Much of it is a wrapper around
QFile.) Some, even many, of the routines might just directly call the Qt
equivalent (probably after a call to toqstr, from qstring_helpers). This
would be a relatively quick way to get something that worked and was
easy to use, and work on adapting DocBook and HTML export to this code
could proceed.

At that point, we could then write our own XML backend, possibly
adapting it from the Qt code. There are quite a few dependencies there,
but I'll guess some of them we do not need (e.g., the QApplication and
QFile dependencies). Our we build a lightweight library from scratch.
(It does seem like maybe there's a general need for that.) With the
already functioning backend from QXmlStreamWrapper, it would be easy to
test our own code and make sure it was producing the same output.

Riki


-- 
lyx-devel mailing list
lyx-devel@lists.lyx.org
http://lists.lyx.org/mailman/listinfo/lyx-devel


Re: XML stream writer library

2021-01-04 Thread Pavel Sanda
On Mon, Jan 04, 2021 at 09:48:42PM +0100, Thibaut Cuvelier wrote:
> My recommendation, based on a quite long study of XML libraries (i.e.
> several years, but quite far from full-time): either use QXmlStreamWriter
> (which is mostly a SAX implementation in C++) or write our own.
> QXmlStreamWriter is almost 4k-line long, but it can substantially be
> simplified in our case (
> https://github.com/qt/qtbase/blob/54875be84de059374920e4c0deacd13a41caaa13/src/corelib/serialization/qxmlstream.cpp).
> 
> 
> TinyXML2 (https://github.com/leethomason/tinyxml2), pugixml (
> https://github.com/zeux/pugixml), and Xerces-C++ (
> https://xerces.apache.org/xerces-c/) are only DOM-based. There are quite a
> few C libraries, like libxml2, that can be SAX-like, but C libraries are
> horrible to use (http://www.xmlsoft.org/examples/testWriter.c).

I do not dare to make any qualified recommendation between the choices
above. But thinking aloud -- if there de facto isn't an alternative
to QXmlStreamWriter, would it be hard to separate that class from
the rest of Qt, fork and include it as an internal lyx routine?
We would have full control over that code without unnecessary surprises
of Qt's development.

Pavel
-- 
lyx-devel mailing list
lyx-devel@lists.lyx.org
http://lists.lyx.org/mailman/listinfo/lyx-devel


Re: XML stream writer library

2021-01-04 Thread Yuriy Skalko

TinyXML2 (https://github.com/leethomason/tinyxml2), pugixml (
https://github.com/zeux/pugixml), and Xerces-C++ (
https://xerces.apache.org/xerces-c/) are only DOM-based. There are quite a
few C libraries, like libxml2, that can be SAX-like, but C libraries are
horrible to use (http://www.xmlsoft.org/examples/testWriter.c).


There are several C++ wrappers for libxml2 on GitHub. Maybe they can be 
useful:


https://github.com/libxmlplusplus/libxmlplusplus
https://github.com/rioki/libxmlmm


Yuriy
--
lyx-devel mailing list
lyx-devel@lists.lyx.org
http://lists.lyx.org/mailman/listinfo/lyx-devel


Re: XML stream writer library

2021-01-04 Thread Thibaut Cuvelier
On Mon, 4 Jan 2021 at 20:30, Richard Kimberly Heck  wrote:

> On 1/3/21 3:37 PM, Lorenzo Bertini wrote:
>
> Hello list,
>
> In 12055 , discussing the merge of
> some MathMLStream and XmlStream components, we were contemplating the
> possibility of using an external library to handle XML streams, for example
> with indentation and tag insertion. One of the candidates was
> QXmlStreamWriter  class,
> but with the talk about removing unnecessary Qt components we thought to
> ask the list.
>
> Lest us know what do you think it's the best course, and if you know of
> other libraries we should look.
>
> As I mention in the bug, I looked over various XML libraries a while ago,
> when I was thinking about the long-standing idea of converting LyX's own
> format to XML. There seemed to be a myriad of options, and I never settled
> upon one. But it looks like there's a general feeling that we don't want to
> get too married to Qt---any more than we already are. That is in part
> because Qt seems to break itself fairly frequently (especially on OSX) and
> partly because they keep changing their attitude towards open source. There
> was some thing not long ago about how recent updates would only be
> available to paid subscribers right away, or something like that.
>
> So I'd generally suggest searching around for good, well-maintained XML
> libraries, maybe asking on Stack Exchange what people like. I'll send an
> email to the Fedora list and see what suggestions pop up.
>
There are multiple issues here. What is needed to generate HTML and DocBook
is a simple SAX writer, not a parser. I've done plenty of research about
it, there's no XML library that does that. Most of them are using a DOM,
which is a total waste of memory for such an application: it stores a
complete XML tree in memory before serialising it. With SAX, you just need
a string backend, which is much more lightweight (by several factors). In
this case, as the content is generated without ever looking back, SAX is
the best choice.

You have more choices in the Java world, and the standard library is often
enough (well, the standard extensions javax and JAXP). If you need a good
XML tool, chances are it will be written in Java, especially if it's open
source (Saxon for XSLT or XQuery, eXist or MarkLogic for XML database).

On the other hand, if you want to represent a complete LyX document and
work on it, you'd rather go for DOM, as you will always have the whole
structure in memory: you may want to edit things at any point in the
document. (Unless there is never an operation on the file structures, and
only on the set of insets of the document)

My recommendation, based on a quite long study of XML libraries (i.e.
several years, but quite far from full-time): either use QXmlStreamWriter
(which is mostly a SAX implementation in C++) or write our own.
QXmlStreamWriter is almost 4k-line long, but it can substantially be
simplified in our case (
https://github.com/qt/qtbase/blob/54875be84de059374920e4c0deacd13a41caaa13/src/corelib/serialization/qxmlstream.cpp).


TinyXML2 (https://github.com/leethomason/tinyxml2), pugixml (
https://github.com/zeux/pugixml), and Xerces-C++ (
https://xerces.apache.org/xerces-c/) are only DOM-based. There are quite a
few C libraries, like libxml2, that can be SAX-like, but C libraries are
horrible to use (http://www.xmlsoft.org/examples/testWriter.c).
-- 
lyx-devel mailing list
lyx-devel@lists.lyx.org
http://lists.lyx.org/mailman/listinfo/lyx-devel


Re: XML stream writer library

2021-01-04 Thread Richard Kimberly Heck
On 1/3/21 3:37 PM, Lorenzo Bertini wrote:
>
> Hello list,
>
> In 12055 , discussing the merge
> of some MathMLStream and XmlStream components, we were contemplating
> the possibility of using an external library to handle XML streams,
> for example with indentation and tag insertion. One of the candidates
> was QXmlStreamWriter 
> class, but with the talk about removing unnecessary Qt components we
> thought to ask the list.
>
> Lest us know what do you think it's the best course, and if you know
> of other libraries we should look.
>
As I mention in the bug, I looked over various XML libraries a while
ago, when I was thinking about the long-standing idea of converting
LyX's own format to XML. There seemed to be a myriad of options, and I
never settled upon one. But it looks like there's a general feeling that
we don't want to get too married to Qt---any more than we already are.
That is in part because Qt seems to break itself fairly frequently
(especially on OSX) and partly because they keep changing their attitude
towards open source. There was some thing not long ago about how recent
updates would only be available to paid subscribers right away, or
something like that.

So I'd generally suggest searching around for good, well-maintained XML
libraries, maybe asking on Stack Exchange what people like. I'll send an
email to the Fedora list and see what suggestions pop up.

Riki


-- 
lyx-devel mailing list
lyx-devel@lists.lyx.org
http://lists.lyx.org/mailman/listinfo/lyx-devel