[lxml] Re: Cannot <include> from a network location

Paul Higgs Fri, 25 Jun 2021 22:21:37 -0700

Thanks for the hint regarding parsers.

After spending a few hours trying to understand what special tricks I needed to 
put in a resolver, I realized that there were none. The resolver just needs to 
fetch the data (I would have expected the parser to do this itself).


      import io
      from lxml import etree

      schema4=io.StringIO('''<schema xmlns="http://www.w3.org/2001/XMLSchema" 
xmlns:patch="urn:paulhiggs:my-patch" targetNamespace="urn:paulhiggs:my-patch" 
elementFormDefault="qualified" attributeFormDefault="unqualified">
          <include 
schemaLocation="https://www.iana.org/assignments/xml-registry/schema/patch-ops.xsd"/>
          <element name="Patch" type="patch:PatchType"/>
          <complexType name="PatchType">
              <choice minOccurs="1" maxOccurs="unbounded">
                  <element name="add" type="patch:add"/>
                  <element name="remove" type="patch:remove"/>
                  <element name="replace" type="patch:replace"/>
              </choice>
              <attribute name="paulsAttrib" type="string" use="required"/>
          </complexType>
      </schema>''')

      import requests
      class PrefixResolver(etree.Resolver):
        # https://lxml.de/resolvers.html
        def __init__(self, prefix):
          self.prefix = prefix.lower()

        def resolve(self, url, pubid, context):
          if url.lower().startswith(self.prefix):
            res=requests.get(url, allow_redirects=True)
            return self.resolve_string(res.text, context)


      parser=etree.XMLParser(load_dtd=True, no_network=False, huge_tree=True, 
resolve_entities=True)
      parser.resolvers.add( PrefixResolver("https") )
      parser.resolvers.add( PrefixResolver("http") )

      my_schema=etree.XMLSchema(etree.parse(schema4, parser))

This now works OK for an include of HTTP and HTTPS!
I need to look into the workings of libxml2 to see if loading for INCLUDE and 
IMPORT are somehow handled differently – I have never had a problem with an 
HTTP or HTTPS IMPORT
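
The mechanism described above (resolvers registered on the parser are consulted when the schema's include is loaded) can be tried without any network access. What follows is a minimal sketch under stated assumptions, not the code from this thread: the `DictResolver` class, the `example.invalid` URL, and both schema texts are hypothetical and only stand in for a real fetch such as `requests.get`.

```python
from lxml import etree

# Hypothetical URL and schema texts, used only for this illustration.
INCLUDED_URL = "https://example.invalid/types.xsd"
INCLUDED_XSD = b"""<schema xmlns="http://www.w3.org/2001/XMLSchema"
    targetNamespace="urn:example:demo" elementFormDefault="qualified">
  <simpleType name="Small">
    <restriction base="integer"><maxInclusive value="10"/></restriction>
  </simpleType>
</schema>"""

MAIN_XSD = b"""<schema xmlns="http://www.w3.org/2001/XMLSchema"
    xmlns:d="urn:example:demo"
    targetNamespace="urn:example:demo" elementFormDefault="qualified">
  <include schemaLocation="https://example.invalid/types.xsd"/>
  <element name="value" type="d:Small"/>
</schema>"""


class DictResolver(etree.Resolver):
    """Serve documents from an in-memory mapping instead of the network."""

    def __init__(self, documents):
        super().__init__()
        self.documents = documents
        self.requested = []  # URLs the parser asked for, kept for inspection

    def resolve(self, url, pubid, context):
        self.requested.append(url)
        data = self.documents.get(url)
        if data is not None:
            # Passing bytes lets the parser handle any encoding declaration.
            return self.resolve_string(data, context)
        return None  # fall through to other resolvers / default loading


parser = etree.XMLParser(no_network=False)
resolver = DictResolver({INCLUDED_URL: INCLUDED_XSD})
parser.resolvers.add(resolver)

# The resolver is attached to the parser that parses the main schema document.
schema = etree.XMLSchema(etree.fromstring(MAIN_XSD, parser))

ok = schema.validate(etree.fromstring(b'<value xmlns="urn:example:demo">5</value>'))
bad = schema.validate(etree.fromstring(b'<value xmlns="urn:example:demo">50</value>'))
print(ok, bad, resolver.requested)
```

Handing the raw bytes to `resolve_string` (for instance `res.content` rather than `res.text` when using requests) may be the safer choice if the fetched schema declares an encoding other than UTF-8.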

Paul



-----Original Message-----
From: Paul Higgs <[email protected]>
Sent: 25 June 2021 14:51
To: [email protected]; [email protected]
Subject: [lxml] Re: Cannot <include> from a network location

Thanks Holger

First, I am not behind any proxy - CURL to http: and https: both give the schema

Second, your test_schema.py using http: also works for me; however, if 
https://www.loc.gov... is used then I get an error
>python test_schema.py
Traceback (most recent call last):
  File "G:\lxml-test\test_schema.py", line 14, in <module>
    schema = etree.XMLSchema(tree)
  File "src\lxml\xmlschema.pxi", line 88, in lxml.etree.XMLSchema.__init__
lxml.etree.XMLSchemaParseError: Element 
'{http://www.w3.org/2001/XMLSchema}include': Failed to load the document 
'https://www.loc.gov/standards/xlink/xlink.xsd' for inclusion., line 8

Third, 'https://www.loc.gov/standards/xlink/xlink.xsd' does exist - CURL 
retrieves it OK
>curl --head https://www.loc.gov/standards/xlink/xlink.xsd
HTTP/2 200
date: Fri, 25 Jun 2021 13:44:27 GMT
content-type: text/xml
content-length: 3180
last-modified: Thu, 23 Aug 2007 19:02:01 GMT
etag: "119c982-c6c-867afc40"
accept-ranges: bytes
cf-cache-status: DYNAMIC
cf-request-id: 0ae5033c37000054ab1ab16000000001
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 664ea1738c7e54ab-MAN

>curl --head http://www.loc.gov/standards/xlink/xlink.xsd
HTTP/1.1 200 OK
Date: Fri, 25 Jun 2021 13:45:16 GMT
Content-Type: text/xml
Content-Length: 3180
Connection: keep-alive
Last-Modified: Thu, 23 Aug 2007 19:02:01 GMT
ETag: "119c982-c6c-43862867afc40"
Accept-Ranges: bytes
X-Frame-Options: deny
Set-Cookie: HttpOnly
CF-Cache-Status: DYNAMIC
cf-request-id: 0ae503f9cd0000000a56ab5000000001
Server: cloudflare
CF-RAY: 664ea2a2ed37000a-MAN

Paul

-----Original Message-----
From: [email protected]
Sent: 25 June 2021 13:31
To: [email protected]
Subject: [lxml] Re: Cannot <include> from a network location

> Just a thought: Might this be proxy- or https-related? Does it work if you 
> locally serve the xs:included schema with http?
>
> I *think* libxml2 respects http_proxy but I don’t know anything about https 
> support.

Just found that for an example adapted from 
https://bugs.launchpad.net/lxml/+bug/1234114/comments/3
(xs:include instead of xs:import):

##############

# test_schema.py
XSD = b"""<?xml version="1.0" encoding="UTF-8"?>
<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns="lxmltest"
    targetNamespace="http://www.w3.org/1999/xlink"
    elementFormDefault="qualified"
    attributeFormDefault="unqualified">
    <xs:include schemaLocation="http://www.loc.gov/standards/xlink/xlink.xsd"/>
</xs:schema>
"""
from lxml import etree
parser = etree.XMLParser(
    load_dtd=True, no_network=True, huge_tree=True, resolve_entities=True)
tree = etree.fromstring(XSD, parser=parser)
schema = etree.XMLSchema(tree)
print(schema)

##############

I can successfully run this if the xs:include location is http but not https 
(the xlink.xsd is available both with https and http URLs).

If I change it to https I get

Traceback (most recent call last):
  File "test_schema.py", line 16, in <module>
    schema = etree.XMLSchema(tree)
  File "src/lxml/xmlschema.pxi", line 88, in lxml.etree.XMLSchema.__init__
lxml.etree.XMLSchemaParseError: Element 
'{http://www.w3.org/2001/XMLSchema}include': Failed to load the document 
'https://www.loc.gov/standards/xlink/xlink.xsd' for inclusion., line 9
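
When a traceback like the one above appears, the exception's error log can give a little more detail than the single-line message. The following is a minimal sketch for inspecting it; the `schema_errors` helper and the deliberately missing include target are hypothetical, chosen so the failure can be reproduced without network access.

```python
from lxml import etree

# Hypothetical schema whose include target does not exist, so the
# "failed to load ... for inclusion" case occurs without any network.
BROKEN_XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:include schemaLocation="no-such-file-for-this-demo.xsd"/>
</xs:schema>"""


def schema_errors(xsd_bytes):
    """Return schema compilation errors as (line, message) tuples."""
    tree = etree.fromstring(xsd_bytes)
    try:
        etree.XMLSchema(tree)
    except etree.XMLSchemaParseError as exc:
        # error_log holds the individual libxml2 messages behind the exception.
        return [(entry.line, entry.message) for entry in exc.error_log]
    return []


errors = schema_errors(BROKEN_XSD)
for line, message in errors:
    print(line, message)
```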

As I'm behind a proxy:
I also just found out that while curl happily accepts 
http_proxy=my.proxy.address.net:8080 lxml (libxml2) only works if this is set 
to http_proxy=http://my.proxy.address.net:8080 i.e. with an explicit <scheme>://
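
One way to guard against the scheme-less form is to normalize the variable before lxml fetches anything. This is only a sketch: the `with_scheme` helper is hypothetical, and it assumes that rewriting `http_proxy` in the process environment early enough (before the first network fetch) is sufficient for libxml2 to pick it up.

```python
import os


def with_scheme(proxy, default_scheme="http"):
    """Prefix a proxy address with a scheme if it does not already have one."""
    if not proxy or "://" in proxy:
        return proxy
    return f"{default_scheme}://{proxy}"


# Rewrite the variable only if it is set; do this before any lxml network use.
proxy = os.environ.get("http_proxy")
if proxy:
    os.environ["http_proxy"] = with_scheme(proxy)
```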

Cheers, H.









_______________________________________________
lxml - The Python XML Toolkit mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/lxml.python.org/
Member address: [email protected]
