[ https://issues.apache.org/jira/browse/ANY23-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526916#comment-16526916 ]
ASF GitHub Bot commented on ANY23-347:
--------------------------------------

GitHub user HansBrende reopened a pull request:

    https://github.com/apache/any23/pull/88

    ANY23-347 fixed RDFParseExceptions caused by unbound xml namespaces

    My solution to this problem was to strip all xml namespaces from attributes and tag names except the "xml:" and "xmlns:" namespaces. I also added a test case.

    mvn clean test -> all tests pass

    @lewismc any comments?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HansBrende/any23 ANY23-347

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/any23/pull/88.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #88

----
commit f82f34b22fed2368f26205612eed64c5357fe74a
Author: Hans <firedrake93@...>
Date:   2018-06-28T16:57:24Z

    ANY23-347 fixed RDFParseExceptions caused by unbound xml prefixes

----


> RDFParseException: the prefix "pw" is not bound
> -----------------------------------------------
>
>                 Key: ANY23-347
>                 URL: https://issues.apache.org/jira/browse/ANY23-347
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: extractors
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Assignee: Hans Brende
>            Priority: Minor
>             Fix For: 2.3
>
>
> I get the following error log for the site:
> https://69.agendaculturel.fr/concert/
> Haven't had time to debug this.
> {code}
> ERROR org.apache.any23.extractor.rdf.BaseRDFExtractor - Error while parsing RDF document.
> org.eclipse.rdf4j.rio.RDFParseException: org.xml.sax.SAXParseException; lineNumber: 165; columnNumber: 101; The prefix "pw" for attribute "pw:twitter-via" associated with an element type "div" is not bound.
>     at org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:111)
>     at org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:95)
>     at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:158)
>     at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:57)
>     at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:471)
>     at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:259)
>     at org.apache.any23.Any23.extract(Any23.java:302)
>     at org.apache.any23.Any23.extract(Any23.java:437)
>     at com.utownapp.crawl.tripledb.Triples.lambda$extractTriples$0(Triples.java:146)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.semarglproject.rdf.ParseException: org.xml.sax.SAXParseException; lineNumber: 165; columnNumber: 101; The prefix "pw" for attribute "pw:twitter-via" associated with an element type "div" is not bound.
>     at org.semarglproject.rdf.rdfa.RdfaParser.processException(RdfaParser.java:1141)
>     at org.semarglproject.source.XmlSource.process(XmlSource.java:50)
>     at org.semarglproject.source.StreamProcessor.processInternal(StreamProcessor.java:87)
>     at org.semarglproject.source.BaseStreamProcessor.process(BaseStreamProcessor.java:167)
>     at org.semarglproject.source.BaseStreamProcessor.process(BaseStreamProcessor.java:154)
>     at org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:109)
>     ... 12 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 165; columnNumber: 101; The prefix "pw" for attribute "pw:twitter-via" associated with an element type "div" is not bound.
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>     at org.semarglproject.source.XmlSource.process(XmlSource.java:48)
>     ... 16 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
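
For readers who want a concrete picture of the approach the pull request describes (stripping namespace prefixes from attribute and tag names except "xml:" and "xmlns:"), here is a minimal sketch of the name-rewriting rule only. It is not the code from the pull request: the class `PrefixStripper` and the method `stripPrefix` are hypothetical names invented for illustration, and the sketch assumes a name contains at most one colon.

```java
// Hypothetical illustration; not taken from the Any23 code base.
final class PrefixStripper {

    private PrefixStripper() {
        // static helper, not meant to be instantiated
    }

    /**
     * Returns the given tag or attribute name with its prefix removed,
     * unless the prefix is "xml" or "xmlns", in which case the name is
     * returned unchanged. Names without a colon are returned unchanged.
     */
    static String stripPrefix(String qName) {
        int colon = qName.indexOf(':');
        if (colon < 0) {
            return qName; // no prefix present
        }
        String prefix = qName.substring(0, colon);
        if (prefix.equals("xml") || prefix.equals("xmlns")) {
            return qName; // reserved prefixes are kept as-is
        }
        return qName.substring(colon + 1); // keep only the local part
    }

    public static void main(String[] args) {
        System.out.println(stripPrefix("pw:twitter-via")); // twitter-via
        System.out.println(stripPrefix("xml:lang"));       // xml:lang
        System.out.println(stripPrefix("xmlns:pw"));       // xmlns:pw
        System.out.println(stripPrefix("div"));            // div
    }
}
```

Under this rule the attribute from the error log, "pw:twitter-via", would reach the XML parser as "twitter-via", so no "pw" binding is needed. How the actual fix walks the document and applies the renaming is not shown in this message.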