Re: [xml] Constraint validation for huge documents

2021-01-05 Thread Liam R E Quin
On Tue, 2021-01-05 at 19:12 +0100, Stefan de Konink wrote:
> Yesterday I wrote a custom validator in lxml for key/keyref and
> unique constraints.

Could you do this instead using schematron?

It may be somewhat slower, but it would be easier to maintain.

Liam


-- 
Liam Quin - web slave for https://www.fromoldbooks.org/
with fabulous vintage art and fascinating texts to read.
Click here to have the slave beaten.

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Constraint validation for huge documents

2021-01-05 Thread Stefan de Konink

Hi Nick,


Thanks for your reply. The patch does have a noticeable impact; although I
compiled libxml2-git yesterday, I had overlooked it.



With the single constraint file:
libxml2-2.9.10
User time (seconds): 90.81
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:31.60

libxml2-git
User time (seconds): 49.57
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:50.57

With the full constraint file:
libxml2-2.9.10
Not completed after 1 hour 30 min

libxml2-git
User time (seconds): 900.60
Elapsed (wall clock) time (h:mm:ss or m:ss): 15:02.87


Yesterday I wrote a custom validator in lxml for key/keyref and unique
constraints. It first validates syntactically using the normal libxml2
code, then fetches all constraints (this might be a shortcut) and builds
a hash set per constraint. This step can be executed in parallel, one task
per constraint. If the number of elements per constraint is taken into
account (estimated heuristically when the same XSD is used over time), the
work can be scheduled so that all threads stay busy for longer.
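
To make the idea above concrete, here is a minimal sketch of the
"one hash set per constraint, one task per constraint" approach. It is not
the validator described in this message: it uses the standard library's
ElementTree instead of lxml so that it runs without extra packages, and the
constraint list, the element names and the functions collect() and
validate() are all hypothetical. A real validator would read the selectors
and fields from the xs:key, xs:keyref and xs:unique declarations in the XSD
and would also report missing key fields.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical constraints; a real tool would extract these from the schema.
CONSTRAINTS = [
    {"name": "stopKey", "kind": "key", "selector": ".//stop", "field": "id"},
    {"name": "stopRef", "kind": "keyref", "selector": ".//call",
     "field": "stop", "refer": "stopKey"},
]


def collect(root, constraint):
    """Build the hash set for one constraint and note duplicate values."""
    seen, duplicates = set(), []
    for element in root.iterfind(constraint["selector"]):
        value = element.get(constraint["field"])
        if value is None:
            continue  # simplification: absent fields are ignored here
        if value in seen:
            duplicates.append(value)
        seen.add(value)
    return seen, duplicates


def validate(root, constraints, workers=8):
    """Collect every constraint in its own task, then cross-check keyrefs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: collect(root, c), constraints))
    tables = {c["name"]: r for c, r in zip(constraints, results)}

    errors = []
    for c in constraints:
        values, duplicates = tables[c["name"]]
        if c["kind"] in ("key", "unique"):
            errors += [f"{c['name']}: duplicate value {v!r}" for v in duplicates]
        else:  # keyref: every value must exist in the referenced key's set
            missing = values - tables[c["refer"]][0]
            errors += [f"{c['name']}: no match for {v!r}" for v in sorted(missing)]
    return errors


DOC = ("<net><stop id='a'/><stop id='b'/><stop id='a'/>"
       "<call stop='a'/><call stop='z'/></net>")
errors = validate(ET.fromstring(DOC), CONSTRAINTS)
print(errors)
```

Threads are used here only to show the per-constraint split. In pure
Python the GIL limits how much they help, so a process pool or code that
releases the GIL would be needed for a real speedup.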


With multithreading (8):
User time (seconds): 1136.37
Percent of CPU this job got: 388%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:57.09

Without multithreading:

User time (seconds): 709.82
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:52.15
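
For reference, the figures above can be turned into a speedup and an
overhead estimate. This is only arithmetic on the numbers reported here;
the helper to_seconds() is a hypothetical name, and "efficiency" is defined
as wall-clock speedup divided by the average number of busy cores (388%).

```python
def to_seconds(clock: str) -> float:
    """Convert a 'h:mm:ss' or 'm:ss' wall-clock string to seconds."""
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


single_wall = to_seconds("11:52.15")   # without multithreading
multi_wall = to_seconds("4:57.09")     # with multithreading (8)

speedup = single_wall / multi_wall     # wall-clock speedup
cpu_overhead = 1136.37 / 709.82        # extra user time spent when threaded
efficiency = speedup / 3.88            # speedup per core actually kept busy

print(f"speedup {speedup:.2f}x, CPU time {cpu_overhead:.2f}x, "
      f"efficiency {efficiency:.0%}")
```

In other words, the threaded run is roughly 2.4 times faster in wall-clock
terms while using about 1.6 times the CPU time.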



The optimisation currently present in git looks like a serious
improvement. It is still not 'perfect', though, and I think that doing the
validation in parallel is worth exploring.


--
Stefan


Re: [xml] Constraint validation for huge documents

2021-01-05 Thread Nick Wellnhofer via xml
The XML Schemas code hasn't been actively maintained for more than 10 years,
so you're unlikely to receive a helpful answer regarding the code.


There was a recent patch which might help:


https://gitlab.gnome.org/GNOME/libxml2/-/commit/faea2fa9b890cc329f33ce518dfa1648e64e14d6

Other than that, you'll have to dig through the sources yourself.

Nick