Hi Simon,
On 14/11/2019 06:52, Simon Opper wrote:
Hi folks
I'm having an issue importing/converting a local html file using
sparql motion scripts as opposed to a web file at a URL. A web html
file works fine in my tests.
I can use the importXHTML module on a web URL fine e.g.
wwww.examplesite.com/htmlfile
But if I try to point it at a local file, it fails. I've tried the
following file: URL forms without success, e.g.
file:///fileLocation/htmlfile (specifying the .html file type makes no
difference, with or without the .html extension on the file on disk).
I also tried file://localhost/fileLocation/htmlfile with no success.
I also tried converting the html file to xhtml using oXygen XML, but
this made no difference.
Are you referring to this error?
Caused by: java.net.MalformedURLException: Only http & https protocols supported
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:636)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:629)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:261)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:250)
at org.topbraid.html2xml.HTML2XML.parseFromURL(HTML2XML.java:28)
at org.topbraid.sparqlmotion.lib.internal.ImportXHTMLModule.execute(ImportXHTMLModule.java:37)
... 7 more
Is there some aspect of the tidy function, or something else at play,
that I'm either missing or can't control for a local file?
I guess we could switch to this if the URL is a local file:
https://jsoup.org/cookbook/input/load-document-from-file
Would this solve your use case? (There still is time for the 6.3 final
release).
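A minimal sketch of that switch, not the actual TopBraid code: it only
decides which loader a given URL would need, using the Java standard
library. The class and method names (HtmlSource, isLocalFile,
toLocalPath) are hypothetical, and the jsoup calls appear only in
comments because jsoup is assumed not to be on the classpath here.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of TopBraid or jsoup.
final class HtmlSource {

    /** True when the URL uses the file: scheme (file:///... or file://localhost/...). */
    public static boolean isLocalFile(String url) {
        return "file".equalsIgnoreCase(URI.create(url).getScheme());
    }

    /** Converts a file: URL to a Path, dropping any authority such as "localhost". */
    public static Path toLocalPath(String url) {
        try {
            URI uri = URI.create(url);
            // Rebuild the URI as file:/path so Paths.get accepts both URL forms.
            return Paths.get(new URI("file", null, uri.getPath(), null));
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a usable file URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        String[] urls = {
            "file:///fileLocation/htmlfile",
            "file://localhost/fileLocation/htmlfile",
            "http://localhost:8083/tbl/lib/myfiles/hk.html"
        };
        for (String url : urls) {
            if (isLocalFile(url)) {
                // With jsoup available: Jsoup.parse(toLocalPath(url).toFile(), "UTF-8")
                System.out.println("local file: " + toLocalPath(url));
            } else {
                // With jsoup available: Jsoup.connect(url).get()
                System.out.println("remote URL: " + url);
            }
        }
    }
}
```

Both file: forms tried in the original message resolve to the same
local path here, so either would take the file-based branch.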
Given that the current version only supports HTTP, could you use the
EDG/TBL server to access the files? For example, with TBC-ME:
1) Create a folder in the workspace such as myfiles.www
2) Copy your .html file(s) into that folder, e.g. hk.html
3) Use sml:url http://localhost:8083/tbl/lib/myfiles/hk.html
In my quick test that worked fine.
As a related question on debugging this: is it possible to see more
info anywhere about these modules, other than the basic info in the
TBC-ME help and the SPIN vocab files, which are only of limited help?
For example, more details on the underlying classes and signatures?
Not that I could think of. The stack traces should show you some of
the internals, but only if something goes wrong.
Maybe the rest of the email can be ignored if the solution above works
for you?
Holger
I then tried using another route such as the convertXMLtoRDF module.
The usage note for the module says that sml:xmlType can be set to
XHTML so that it "treats input as html source". See the reference below.
" sml:xml: The XML document that shall be converted to RDF. To avoid
character encoding issues, we strongly recommend this value to be a
reference to an already parsed XML document, and not a literal. In
other words, use "Add SPARQL expression" from the drop down menu and
enter ?varName and do not use a string value such as {?varName}. The
actual document parsing should be handled by predecessing modules such
as sml:ImportXMLFromURL.
sml:xmlType (xsd:string): [Optional] An (optional) type indicator for
the Semantic XML conversion. Current supported values are "XHTML"
(treats the input as HTML source, and may run a tidy algorithm in case
the HTML is not well-formed XHTML). "
I experimented with a few ways of processing the html to xml, such as
importTextFile and importXMLfile, but I assume this doesn't work
because the html is not valid xml.
e.g.
warnings:ImportTextFile_2
a sml:ImportTextFile ;
sm:next warnings:Convert_html_XMLToRDF_2 ;
sm:nodeX 617 ;
sm:nodeY 39 ;
sm:outputVariable "textOut" ;
sml:sourceFilePath "mfu@id=4851.txt" ;
rdfs:label "Import text file xml test" ;
.
# The xmlToRDF module below fails. It looks like a character encoding issue.
# Exception message:
Caused by: org.topbraid.spin.sparqlmotion.modules.SMException:
org.xml.sax.SAXParseException; lineNumber: 9; columnNumber: 43; The
reference to entity "l" must end with the ';' delimiter.
warnings:Convert_html_XMLToRDF_2
a sml:ConvertXMLToRDF ;
sm:nodeX 601 ;
sm:nodeY 272 ;
sml:baseURI "www.example2.com" ;
sml:xml [
sp:varName "textOut" ;
] ;
sml:xmlType "xhtml" ;
rdfs:label "Convert html XMLTo RDF 2" ;
.
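The SAXParseException quoted above is the kind of error a strict XML
parser typically reports for a bare & in HTML, for instance in a link's
query string, so it may not be a character encoding problem as such. A
minimal sketch with the JDK's built-in parser follows. The sample markup
and the names XmlCheck and xmlError are made up for illustration; the
actual content of line 9 in the input file is not known.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of TopBraid.
final class XmlCheck {

    /** Returns null if the text is well-formed XML, otherwise the parser's error message. */
    public static String xmlError(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(text)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Valid HTML, but not well-formed XML: "&l" looks like an unterminated entity reference.
        String raw = "<p><a href=\"page?x=1&l=2\">link</a></p>";
        // The same markup with the ampersand escaped is well-formed.
        String escaped = "<p><a href=\"page?x=1&amp;l=2\">link</a></p>";

        System.out.println("raw:     " + xmlError(raw));
        System.out.println("escaped: " + xmlError(escaped));
    }
}
```

On a typical JDK the first call should report a message of the same
form as the one in the trace, and the second should print null. This is
why HTML read as plain text generally needs a tidy or HTML-aware parsing
step before it is handed to an XML parser.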
--
You received this message because you are subscribed to the Google
Groups "TopBraid Suite Users" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to topbraid-users+unsubscr...@googlegroups.com
<mailto:topbraid-users+unsubscr...@googlegroups.com>.
To view this discussion on the web visit
https://groups.google.com/d/msgid/topbraid-users/17f9a123-98b7-49e2-bc67-11524e0e1911%40googlegroups.com
<https://groups.google.com/d/msgid/topbraid-users/17f9a123-98b7-49e2-bc67-11524e0e1911%40googlegroups.com?utm_medium=email&utm_source=footer>.