[jira] [Commented] (ANY23-271) Address "...The entity "raquo" was referenced, but not declared" SAXParseException

2018-01-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338870#comment-16338870
 ] 

Hudson commented on ANY23-271:
--

SUCCESS: Integrated in Jenkins build Any23-trunk #1530 (See 
[https://builds.apache.org/job/Any23-trunk/1530/])
ANY23-227, ANY23-268, ANY23-317, ANY23-271, ANY23-273, ANY23-326, (Hans: rev 
0dd9837798a53b5f5a84c2b84891eaf9e8a99494)
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue227.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue273-and-317.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue326-and-267.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue268-and-317.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue271-and-317.html
* (edit) 
core/src/test/java/org/apache/any23/extractor/rdfa/RDFa11ExtractorTest.java


> Address "...The entity "raquo" was referenced, but not declared" 
> SAXParseException
> --
>
> Key: ANY23-271
> URL: https://issues.apache.org/jira/browse/ANY23-271
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 1.1
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.2
>
>
> When attempting extractions on the following URL
> http://data.brandweeraa.nl/data/incident/2016/32601/deployment/201601272048400
> I get the following exception with the web service at any23.org:
> {code}
> 
> 
> Could not parse input.
> 
> 
> 
> 
> {code}
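
The error in the issue title is what a strict XML parser reports when it meets a named HTML entity such as `&raquo;` in a document that declares no DTD. The sketch below reproduces that behaviour with the JDK's built-in parser only; it is not Any23's extraction code, and the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Any23.
class UndeclaredEntityDemo {

    /** Returns true if the JDK's XML parser accepts the markup as well-formed. */
    static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler still rethrows fatal errors; it only avoids the default stderr logging.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Named HTML entity without a DTD: rejected by the XML parser.
        System.out.println(parsesAsXml("<p>next &raquo;</p>"));
        // Numeric character reference for the same character: accepted.
        System.out.println(parsesAsXml("<p>next &#187;</p>"));
    }
}
```

This suggests why real-world HTML needs to be normalised by a lenient HTML parser before it is handed to an XML-based RDFa parser.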



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ANY23-227) not extracting opengraph rdfa

2018-01-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338867#comment-16338867
 ] 

Hudson commented on ANY23-227:
--

SUCCESS: Integrated in Jenkins build Any23-trunk #1530 (See 
[https://builds.apache.org/job/Any23-trunk/1530/])
ANY23-227, ANY23-268, ANY23-317, ANY23-271, ANY23-273, ANY23-326, (Hans: rev 
0dd9837798a53b5f5a84c2b84891eaf9e8a99494)
* (edit) 
core/src/test/java/org/apache/any23/extractor/rdfa/RDFa11ExtractorTest.java
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue271-and-317.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue326-and-267.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue273-and-317.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue227.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue268-and-317.html


> not extracting opengraph rdfa
> -
>
> Key: ANY23-227
> URL: https://issues.apache.org/jira/browse/ANY23-227
> Project: Apache Any23
>  Issue Type: Bug
>Affects Versions: 1.0
>Reporter: hadar
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.2
>
>
> Unable to extract Open Graph data using Any23 default settings.
> Example page:
> http://www.last.fm/music/Bread





[jira] [Commented] (ANY23-268) Entire extraction task fails due to "Element type "t.length" must be followed by either attribute specifications, ">" or "/>"

2018-01-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338868#comment-16338868
 ] 

Hudson commented on ANY23-268:
--

SUCCESS: Integrated in Jenkins build Any23-trunk #1530 (See 
[https://builds.apache.org/job/Any23-trunk/1530/])
ANY23-227, ANY23-268, ANY23-317, ANY23-271, ANY23-273, ANY23-326, (Hans: rev 
0dd9837798a53b5f5a84c2b84891eaf9e8a99494)
* (edit) 
core/src/test/java/org/apache/any23/extractor/rdfa/RDFa11ExtractorTest.java
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue271-and-317.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue326-and-267.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue273-and-317.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue227.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue268-and-317.html


> Entire extraction task fails due to "Element type "t.length" must be followed 
> by either attribute specifications, ">" or "/>"
> -
>
> Key: ANY23-268
> URL: https://issues.apache.org/jira/browse/ANY23-268
> Project: Apache Any23
>  Issue Type: Sub-task
>  Components: core
>Affects Versions: 1.1
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.2
>
>
> WebService API call
> http://any23.org/rdfxml/http://data.gov
> {code}
> Could not parse input.
> 
>  BEGIN Exception context 
> ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:http://www.data.gov/)
> Errors {
> }
>  END   Exception context 
> org.apache.any23.extractor.ExtractionException: Error while parsing RDF 
> document.
>   at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:109)
>   at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:41)
>   at 
> org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:463)
>   at 
> org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:255)
>   at org.apache.any23.Any23.extract(Any23.java:298)
>   at org.apache.any23.Any23.extract(Any23.java:450)
>   at 
> org.apache.any23.servlet.WebResponder.runExtraction(WebResponder.java:114)
>   at org.apache.any23.servlet.Servlet.doGet(Servlet.java:79)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:618)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:725)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:301)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at 
> org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:239)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
>   at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:106)
>   at 
> org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:503)
>   at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:136)
>   at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:74)
>   at 
> org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:610)
>   at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88)
>   at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:526)
>   at 
> org.apache.coyote.ajp.AbstractAjpProcessor.process(AbstractAjpProcessor.java:794)
>   at 
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:652)
>   at 
> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1575)
>   at 
> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1533)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.openrdf.rio.RDFParseException: org.xml.sax.SAXParseException; 
> lineNumber: 5; columnNumber: 367; Element type "t.length" must be followed by 
> either attribute specifications, ">" or "/>".
>   at 
> org.semarglproject.sesame.rdf.rdfa

[jira] [Commented] (ANY23-273) The content of elements must consist of well-formed character data or markup - no bogus comments

2018-01-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338871#comment-16338871
 ] 

Hudson commented on ANY23-273:
--

SUCCESS: Integrated in Jenkins build Any23-trunk #1530 (See 
[https://builds.apache.org/job/Any23-trunk/1530/])
ANY23-227, ANY23-268, ANY23-317, ANY23-271, ANY23-273, ANY23-326, (Hans: rev 
0dd9837798a53b5f5a84c2b84891eaf9e8a99494)
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue227.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue273-and-317.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue326-and-267.html
* (edit) 
core/src/test/java/org/apache/any23/extractor/rdfa/RDFa11ExtractorTest.java
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue268-and-317.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue271-and-317.html


> The content of elements must consist of well-formed character data or markup 
> - no bogus comments
> 
>
> Key: ANY23-273
> URL: https://issues.apache.org/jira/browse/ANY23-273
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 1.2
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.2
>
>
> Whilst attempting to address ANY23-131, I tried running the any23.org service 
> over the [example 
> URL|https://www.otto.de/p/aeg-waschmaschine-lavamat-l14as7-aplusplusplus-7-kg-1400-u-min-508571361/#variationId=504747671-M48]
>  with the following failure:
> {code}
> 
> 
> Could not parse input.
> 
> 
> 
> 
> {code}





[jira] [Commented] (ANY23-317) Any23 fails when dealing with JavaScript

2018-01-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338869#comment-16338869
 ] 

Hudson commented on ANY23-317:
--

SUCCESS: Integrated in Jenkins build Any23-trunk #1530 (See 
[https://builds.apache.org/job/Any23-trunk/1530/])
ANY23-227, ANY23-268, ANY23-317, ANY23-271, ANY23-273, ANY23-326, (Hans: rev 
0dd9837798a53b5f5a84c2b84891eaf9e8a99494)
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue227.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue273-and-317.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue326-and-267.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue268-and-317.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue271-and-317.html
* (edit) 
core/src/test/java/org/apache/any23/extractor/rdfa/RDFa11ExtractorTest.java


> Any23 fails when dealing with JavaScript
> 
>
> Key: ANY23-317
> URL: https://issues.apache.org/jira/browse/ANY23-317
> Project: Apache Any23
>  Issue Type: Bug
>  Components: core, extractors
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Priority: Critical
> Fix For: 2.2
>
>
> Any23 always crashes when attempting to parse 

[jira] [Commented] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338872#comment-16338872
 ] 

Hudson commented on ANY23-326:
--

SUCCESS: Integrated in Jenkins build Any23-trunk #1530 (See 
[https://builds.apache.org/job/Any23-trunk/1530/])
ANY23-227, ANY23-268, ANY23-317, ANY23-271, ANY23-273, ANY23-326, (Hans: rev 
0dd9837798a53b5f5a84c2b84891eaf9e8a99494)
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue227.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue273-and-317.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue326-and-267.html
* (edit) 
core/src/test/java/org/apache/any23/extractor/rdfa/RDFa11ExtractorTest.java
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue268-and-317.html
* (add) test-resources/src/test/resources/html/rdfa/rdfa-issue271-and-317.html


> parsing unclosed meta and input tags fails
> --
>
> Key: ANY23-326
> URL: https://issues.apache.org/jira/browse/ANY23-326
> Project: Apache Any23
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 2.1
> Environment: ubuntu 17.04
>Reporter: Ben Roberts
>Priority: Major
> Fix For: 2.2
>
>
> Parsing fails as soon as it hits an unclosed input or meta tag. As an example, 
> try
>  ./bin/any23 rover https://ben.thatmustbe.me/note/2017/12/28/1
> [Fatal Error] :170:3: The element type "input" must be terminated by the 
> matching end-tag "</input>".
>  
> It seems like the issue might be that this is using a very old version of 
> jsoup, at least as best I could tell.
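
The fatal error quoted above is typical of XML parsing applied to HTML: in HTML, `input` and `meta` are void elements that need no end tag, but an XML parser insists on one. The following is a minimal sketch using only the JDK's XML parser (not Any23's own pipeline, and with a hypothetical class name) that shows the difference between an unclosed and a self-closed `input`.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Any23.
class UnclosedTagDemo {

    /** Returns the parser's error message, or null if the markup is well-formed XML. */
    static String xmlParseError(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Keep fatal errors as exceptions but suppress the default "[Fatal Error]" stderr line.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Valid HTML, but not well-formed XML: <input> is never closed.
        System.out.println(xmlParseError("<form><input type=\"text\"></form>"));
        // Self-closed form is accepted by the XML parser.
        System.out.println(xmlParseError("<form><input type=\"text\"/></form>"));
    }
}
```

An HTML-aware parser would close such elements automatically before the XML stage runs, which is presumably what a fix along the lines suggested by the reporter relies on.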





Jenkins build is back to normal : Any23-trunk #1530

2018-01-24 Thread Apache Jenkins Server
See 




Build failed in Jenkins: Any23-trunk #1529

2018-01-24 Thread Apache Jenkins Server
See 

--
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on ubuntu-eu2 (ubuntu trusty) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url 
 > https://git-wip-us.apache.org/repos/asf/any23.git # timeout=10
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://git-wip-us.apache.org/repos/asf/any23.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:825)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1092)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1123)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1202)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
at hudson.model.Run.execute(Run.java:1724)
at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:421)
Caused by: hudson.plugins.git.GitException: Command "git config 
remote.origin.url https://git-wip-us.apache.org/repos/asf/any23.git" returned 
status code 4:
stdout: 
stderr: error: failed to write new configuration file .git/config.lock

at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1970)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1938)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1934)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1572)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1584)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.setRemoteUrl(CliGitAPIImpl.java:1218)
at hudson.plugins.git.GitAPI.setRemoteUrl(GitAPI.java:160)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:922)
at 
hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:896)
at 
hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:853)
at hudson.remoting.UserRequest.perform(UserRequest.java:207)
at hudson.remoting.UserRequest.perform(UserRequest.java:53)
at hudson.remoting.Request$2.run(Request.java:358)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to 
ubuntu-eu2
at 
hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1693)
at hudson.remoting.UserResponse.retrieve(UserRequest.java:310)
at hudson.remoting.Channel.call(Channel.java:908)
at 
hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:281)
at com.sun.proxy.$Proxy110.setRemoteUrl(Unknown Source)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl.setRemoteUrl(RemoteGitImpl.java:295)
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:813)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1092)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1123)
at hudson.scm.SCM.checkout(SCM.java:495)
at 
hudson.model.AbstractProject.checkout(AbstractProject.java:1202)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
at 
jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
at hudson.model.Run.execute(Run.java:1724)
at 
hudson.maven.MavenModuleSetBui

[jira] [Commented] (ANY23-266) Fix Issues with Failing WebService Examples

2018-01-24 Thread Hans Brende (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338787#comment-16338787
 ] 

Hans Brende commented on ANY23-266:
---

[~lewismc] this issue is probably fixed now FYI.

> Fix Issues with Failing WebService Examples
> ---
>
> Key: ANY23-266
> URL: https://issues.apache.org/jira/browse/ANY23-266
> Project: Apache Any23
>  Issue Type: Bug
>  Components: service
>Affects Versions: 1.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.2
>
>
> Right now we have a number of examples on the public Any23 service.
> It is pretty important that these examples always work, e.g.
> {code}
> http://any23.org/best/twitter.com/cygri
> http://any23.org/rdfxml/http://data.gov
> http://any23.org/ttl/http://www.w3.org/People/Berners-Lee/card
> http://any23.org/?uri=http://dbpedia.org/resource/Berlin
> http://any23.org/?format=nt&uri=http://dbpedia.org/resource/Berlin
> {code}
> This is a parent issue for addressing the various issues which have arisen 
> recently when I updated the Service and rebooted the VM at any23-vm.apache.org
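
The example URLs above follow two patterns: a path style (`/<format>/<target>`) and a query style (`/?format=<format>&uri=<target>`). As a rough sketch, a client-side smoke test could build them as below. The helper names are hypothetical, and the target URI is deliberately left unencoded to match the examples as listed; a real client may need to percent-encode it.

```java
// Hypothetical helper for building the example request URLs; not part of Any23.
class Any23ServiceUrls {

    // Assumption: the public endpoint named in the issue description.
    static final String BASE = "http://any23.org";

    /** Path style: BASE/<format>/<target>, e.g. /rdfxml/http://data.gov */
    static String pathStyle(String format, String target) {
        return BASE + "/" + format + "/" + target;
    }

    /** Query style: BASE/?format=<format>&uri=<target> */
    static String queryStyle(String format, String target) {
        return BASE + "/?format=" + format + "&uri=" + target;
    }

    public static void main(String[] args) {
        System.out.println(pathStyle("rdfxml", "http://data.gov"));
        System.out.println(queryStyle("nt", "http://dbpedia.org/resource/Berlin"));
    }
}
```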





[jira] [Assigned] (ANY23-227) not extracting opengraph rdfa

2018-01-24 Thread Hans Brende (JIRA)

 [ 
https://issues.apache.org/jira/browse/ANY23-227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende reassigned ANY23-227:
-

Assignee: Hans Brende

> not extracting opengraph rdfa
> -
>
> Key: ANY23-227
> URL: https://issues.apache.org/jira/browse/ANY23-227
> Project: Apache Any23
>  Issue Type: Bug
>Affects Versions: 1.0
>Reporter: hadar
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.2
>
>
> Unable to extract Open Graph data using Any23 default settings.
> Example page:
> http://www.last.fm/music/Bread





[jira] [Resolved] (ANY23-227) not extracting opengraph rdfa

2018-01-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ANY23-227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved ANY23-227.

Resolution: Fixed

> not extracting opengraph rdfa
> -
>
> Key: ANY23-227
> URL: https://issues.apache.org/jira/browse/ANY23-227
> Project: Apache Any23
>  Issue Type: Bug
>Affects Versions: 1.0
>Reporter: hadar
>Priority: Major
> Fix For: 2.2
>
>
> Unable to extract Open Graph data using Any23 default settings.
> Example page:
> http://www.last.fm/music/Bread





Build failed in Jenkins: Any23-trunk #1528

2018-01-24 Thread Apache Jenkins Server
See 

--
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on ubuntu-eu2 (ubuntu trusty) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url 
 > https://git-wip-us.apache.org/repos/asf/any23.git # timeout=10
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://git-wip-us.apache.org/repos/asf/any23.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:825)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1092)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1123)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1202)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
at hudson.model.Run.execute(Run.java:1724)
at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:421)
Caused by: hudson.plugins.git.GitException: Command "git config 
remote.origin.url https://git-wip-us.apache.org/repos/asf/any23.git" returned 
status code 4:
stdout: 
stderr: error: failed to write new configuration file .git/config.lock

at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1970)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1938)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1934)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1572)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1584)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.setRemoteUrl(CliGitAPIImpl.java:1218)
at hudson.plugins.git.GitAPI.setRemoteUrl(GitAPI.java:160)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:922)
at 
hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:896)
at 
hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:853)
at hudson.remoting.UserRequest.perform(UserRequest.java:207)
at hudson.remoting.UserRequest.perform(UserRequest.java:53)
at hudson.remoting.Request$2.run(Request.java:358)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to 
ubuntu-eu2
at 
hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1693)
at hudson.remoting.UserResponse.retrieve(UserRequest.java:310)
at hudson.remoting.Channel.call(Channel.java:908)
at 
hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:281)
at com.sun.proxy.$Proxy110.setRemoteUrl(Unknown Source)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl.setRemoteUrl(RemoteGitImpl.java:295)
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:813)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1092)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1123)
at hudson.scm.SCM.checkout(SCM.java:495)
at 
hudson.model.AbstractProject.checkout(AbstractProject.java:1202)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
at 
jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
at hudson.model.Run.execute(Run.java:1724)
at 
hudson.maven.MavenModuleSetBui

[GitHub] any23 pull request #61: ANY23-227,ANY23-268,ANY23-317,ANY23-271,ANY23-273,AN...

2018-01-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/any23/pull/61


---


[jira] [Commented] (ANY23-227) not extracting opengraph rdfa

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338760#comment-16338760
 ] 

ASF GitHub Bot commented on ANY23-227:
--

Github user asfgit closed the pull request at:

https://github.com/apache/any23/pull/61


> not extracting opengraph rdfa
> -
>
> Key: ANY23-227
> URL: https://issues.apache.org/jira/browse/ANY23-227
> Project: Apache Any23
>  Issue Type: Bug
>Affects Versions: 1.0
>Reporter: hadar
>Priority: Major
> Fix For: 2.2
>
>
> Unable to extract Open Graph data using Any23 default settings.
> Example page:
> http://www.last.fm/music/Bread





[jira] [Commented] (ANY23-227) not extracting opengraph rdfa

2018-01-24 Thread Hans Brende (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338758#comment-16338758
 ] 

Hans Brende commented on ANY23-227:
---

[~lewismc] are you sure? When I tested, I got the correct triples out. I've 
submitted a new PR with the additional tests. Let me know if I should add more. 

https://github.com/apache/any23/pull/61

> not extracting opengraph rdfa
> -
>
> Key: ANY23-227
> URL: https://issues.apache.org/jira/browse/ANY23-227
> Project: Apache Any23
>  Issue Type: Bug
>Affects Versions: 1.0
>Reporter: hadar
>Priority: Major
> Fix For: 2.2
>
>
> Unable to extract Open Graph data using Any23 default settings.
> Example page:
> http://www.last.fm/music/Bread





[jira] [Commented] (ANY23-227) not extracting opengraph rdfa

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338757#comment-16338757
 ] 

ASF GitHub Bot commented on ANY23-227:
--

GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/61

ANY23-227,ANY23-268,ANY23-317,ANY23-271,ANY23-273,ANY23-326,ANY23-267

Added tests to ensure that all of these issues were fixed by PR #59, and so 
that we don't regress on a subsequent PR.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-227

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/61.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #61


commit 0dd9837798a53b5f5a84c2b84891eaf9e8a99494
Author: Hans 
Date:   2018-01-25T05:15:41Z

ANY23-227, ANY23-268, ANY23-317, ANY23-271, ANY23-273, ANY23-326, ANY23-267 
Wrote tests to ensure that all of these issues were fixed by PR #59.




> not extracting opengraph rdfa
> -
>
> Key: ANY23-227
> URL: https://issues.apache.org/jira/browse/ANY23-227
> Project: Apache Any23
>  Issue Type: Bug
>Affects Versions: 1.0
>Reporter: hadar
>Priority: Major
> Fix For: 2.2
>
>
> Unable to extract Open Graph data using Any23 default settings.
> Example page:
> http://www.last.fm/music/Bread





[GitHub] any23 pull request #61: ANY23-227,ANY23-268,ANY23-317,ANY23-271,ANY23-273,AN...

2018-01-24 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/61

ANY23-227,ANY23-268,ANY23-317,ANY23-271,ANY23-273,ANY23-326,ANY23-267

Added tests to ensure that all of these issues were fixed by PR #59, and so 
that we don't regress on a subsequent PR.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-227

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/61.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #61


commit 0dd9837798a53b5f5a84c2b84891eaf9e8a99494
Author: Hans 
Date:   2018-01-25T05:15:41Z

ANY23-227, ANY23-268, ANY23-317, ANY23-271, ANY23-273, ANY23-326, ANY23-267 
Wrote tests to ensure that all of these issues were fixed by PR #59.




---


[jira] [Commented] (ANY23-291) JSON-LD should be looked up in entire HTML document, not just in <head>

2018-01-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338750#comment-16338750
 ] 

Hudson commented on ANY23-291:
--

SUCCESS: Integrated in Jenkins build Any23-trunk #1527 (See 
[https://builds.apache.org/job/Any23-trunk/1527/])
ANY23-291 Allow JSONLD scripts to be located anywhere in document (Hans: rev 
d69558268b5d8e8d57f00d94b864c54ec2eaf75f)
* (edit) 
core/src/main/java/org/apache/any23/extractor/html/EmbeddedJSONLDExtractor.java
* (add) 
test-resources/src/test/resources/html/html-body-embedded-jsonld-extractor.html
* (add) 
test-resources/src/test/resources/html/html-head-and-body-embedded-jsonld-extractor.html
* (edit) 
core/src/test/java/org/apache/any23/extractor/html/EmbeddedJSONLDExtractorTest.java


> JSON-LD should be looked up in entire HTML document, not just in <head>
> ---
>
> Key: ANY23-291
> URL: https://issues.apache.org/jira/browse/ANY23-291
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: extractors
>Affects Versions: 1.2
>Reporter: Thomas Francart
>Assignee: Hans Brende
>Priority: Minor
> Fix For: 2.2
>
> Attachments: example-embedded-jsonld.html
>
>
> In 
> org.apache.any23.extractor.html.EmbeddedJSONLDExtractor.extractJSONLDScript(),
>  I think this line:
> List scriptNodes = DomUtils.findAll(in, "/HTML/HEAD/SCRIPT");
> is too restrictive. Scripts containing JSON-LD can be placed anywhere in the 
> page, and some CMS/WordPress plugins that insert JSON-LD actually generate 
> their output in the body, not in the head.
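
To illustrate the difference the reporter describes, here is a small sketch using the JDK's standard XPath API rather than Any23's `DomUtils`. It compares the head-only path from the issue with a document-wide lookup. The sample page, the class name, and the `//SCRIPT[@type='application/ld+json']` expression are assumptions chosen for illustration; this is not necessarily how the committed fix is written.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Any23.
class ScriptLookupDemo {

    /** Counts the nodes in the given well-formed markup that match the XPath expression. */
    static int count(String markup, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample page: one JSON-LD script in the head, one in the body.
        String page =
              "<HTML><HEAD><SCRIPT type=\"application/ld+json\">{}</SCRIPT></HEAD>"
            + "<BODY><SCRIPT type=\"application/ld+json\">{}</SCRIPT></BODY></HTML>";

        // Head-only lookup, as in the line quoted in the issue: misses the body script.
        System.out.println(count(page, "/HTML/HEAD/SCRIPT"));
        // Document-wide lookup restricted to JSON-LD scripts: finds both.
        System.out.println(count(page, "//SCRIPT[@type='application/ld+json']"));
    }
}
```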





[jira] [Commented] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338740#comment-16338740
 ] 

Hudson commented on ANY23-326:
--

SUCCESS: Integrated in Jenkins build Any23-trunk #1526 (See 
[https://builds.apache.org/job/Any23-trunk/1526/])
ANY23-326 fixed rdfa issue with unclosed input & meta tags (Hans: rev 
eefa208db3b4ad176ab3636fb3cc539bc00ea100)
* (edit) api/src/main/resources/default-configuration.properties
* (edit) core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/html/TagSoupParser.java
* (edit) 
core/src/main/java/org/apache/any23/extractor/html/TagSoupParsingConfiguration.java
* (add) core/src/main/java/org/apache/any23/extractor/html/JsoupUtils.java


> parsing unclosed meta and input tags fails
> --
>
> Key: ANY23-326
> URL: https://issues.apache.org/jira/browse/ANY23-326
> Project: Apache Any23
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 2.1
> Environment: ubuntu 17.04
>Reporter: Ben Roberts
>Priority: Major
> Fix For: 2.2
>
>
> parsing fails as soon as it hits an unclosed input or meta tag, as an example 
> try
>  ./bin/any23 rover https://ben.thatmustbe.me/note/2017/12/28/1
> [Fatal Error] :170:3: The element type "input" must be terminated by the 
> matching end-tag "</input>".
>  
> It seems like the issue might be that this is using a very old version of 
> jsoup.  at least as best I could tell.
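
For illustration, the error quoted above can be reproduced with the JDK's strict XML parser alone. This is a minimal sketch, not Any23 code: the class name UnclosedTagSketch and the sample markup are assumptions, and it says nothing about how the fix in Any23 works.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class UnclosedTagSketch {
    /** Returns null if the markup parses as XML, otherwise the parser's error message. */
    public static String parseError(String markup) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            // A strict XML parser treats an unclosed <input> as a fatal error.
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Valid HTML, but not well-formed XML: <input> is never closed.
        System.out.println(parseError("<html><body><input type=\"text\"></body></html>"));
        // Self-closing form is well-formed, so no error is reported (prints null).
        System.out.println(parseError("<html><body><input type=\"text\"/></body></html>"));
    }
}
```

Any HTML void element (meta, input, br, ...) written without a closing slash would trip an XML-only parser in the same way, which is why the input needs to be normalized before it reaches such a parser.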



[DISCUSS] Release Any23 2.2?

2018-01-24 Thread lewis john mcgibbney
Hi Folks,
Any objections to pushing a release candidate for Any23 2.2?
Our 2.2 development progress can be seen at
https://issues.apache.org/jira/projects/ANY23/versions/12341626
Lewis

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[jira] [Resolved] (ANY23-291) JSON-LD should be looked up in entire HTML document, not just in <head>

2018-01-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ANY23-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved ANY23-291.

Resolution: Fixed

Thank you [~HansBrende]

> JSON-LD should be looked up in entire HTML document, not just in <head>
> ---
>
> Key: ANY23-291
> URL: https://issues.apache.org/jira/browse/ANY23-291
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: extractors
>Affects Versions: 1.2
>Reporter: Thomas Francart
>Assignee: Hans Brende
>Priority: Minor
> Fix For: 2.2
>
> Attachments: example-embedded-jsonld.html
>
>
> In 
> org.apache.any23.extractor.html.EmbeddedJSONLDExtractor.extractJSONLDScript(),
>  I think this line :
> List scriptNodes = DomUtils.findAll(in, "/HTML/HEAD/SCRIPT");
> is too restrictive. scripts containing json-ld can be placed anywhere in the 
> page, and actually some CMS/Wordpress plugin inserting JSON-LD are generating 
> their output in the body, not in the head.
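
To illustrate the point about the XPath being too restrictive, here is a minimal sketch that uses only the JDK's XPath API rather than Any23's DomUtils. The class name, the sample page, and the replacement expression //SCRIPT are assumptions for illustration; the thread does not show the expression actually used in the fix.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ScriptLookupSketch {
    // Hypothetical page with one JSON-LD script in the head and one in the body.
    public static final String PAGE =
        "<HTML><HEAD><SCRIPT type=\"application/ld+json\">{}</SCRIPT></HEAD>"
      + "<BODY><SCRIPT type=\"application/ld+json\">{}</SCRIPT></BODY></HTML>";

    /** Counts the nodes that the given XPath expression selects in the document. */
    public static int count(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(PAGE, "/HTML/HEAD/SCRIPT")); // prints 1: head only
        System.out.println(count(PAGE, "//SCRIPT"));          // prints 2: whole document
    }
}
```

An absolute path such as /HTML/HEAD/SCRIPT matches only direct children of the head, whereas a descendant expression matches script elements at any depth.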



[jira] [Commented] (ANY23-291) JSON-LD should be looked up in entire HTML document, not just in <head>

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338733#comment-16338733
 ] 

ASF GitHub Bot commented on ANY23-291:
--

Github user asfgit closed the pull request at:

https://github.com/apache/any23/pull/60


> JSON-LD should be looked up in entire HTML document, not just in <head>
> ---
>
> Key: ANY23-291
> URL: https://issues.apache.org/jira/browse/ANY23-291
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: extractors
>Affects Versions: 1.2
>Reporter: Thomas Francart
>Assignee: Hans Brende
>Priority: Minor
> Fix For: 2.2
>
> Attachments: example-embedded-jsonld.html
>
>
> In 
> org.apache.any23.extractor.html.EmbeddedJSONLDExtractor.extractJSONLDScript(),
>  I think this line :
> List scriptNodes = DomUtils.findAll(in, "/HTML/HEAD/SCRIPT");
> is too restrictive. scripts containing json-ld can be placed anywhere in the 
> page, and actually some CMS/Wordpress plugin inserting JSON-LD are generating 
> their output in the body, not in the head.



[GitHub] any23 pull request #60: ANY23-291 Allow JSONLD scripts to be located anywher...

2018-01-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/any23/pull/60


---


[jira] [Assigned] (ANY23-291) JSON-LD should be looked up in entire HTML document, not just in <head>

2018-01-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ANY23-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned ANY23-291:
--

Assignee: Hans Brende

> JSON-LD should be looked up in entire HTML document, not just in <head>
> ---
>
> Key: ANY23-291
> URL: https://issues.apache.org/jira/browse/ANY23-291
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: extractors
>Affects Versions: 1.2
>Reporter: Thomas Francart
>Assignee: Hans Brende
>Priority: Minor
> Fix For: 2.2
>
> Attachments: example-embedded-jsonld.html
>
>
> In 
> org.apache.any23.extractor.html.EmbeddedJSONLDExtractor.extractJSONLDScript(),
>  I think this line :
> List scriptNodes = DomUtils.findAll(in, "/HTML/HEAD/SCRIPT");
> is too restrictive. scripts containing json-ld can be placed anywhere in the 
> page, and actually some CMS/Wordpress plugin inserting JSON-LD are generating 
> their output in the body, not in the head.



[jira] [Resolved] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved ANY23-326.

Resolution: Fixed

Fixed via https://github.com/apache/any23/pull/59

> parsing unclosed meta and input tags fails
> --
>
> Key: ANY23-326
> URL: https://issues.apache.org/jira/browse/ANY23-326
> Project: Apache Any23
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 2.1
> Environment: ubuntu 17.04
>Reporter: Ben Roberts
>Priority: Major
> Fix For: 2.2
>
>
> parsing fails as soon as it hits an unclosed input or meta tag, as an example 
> try
>  ./bin/any23 rover https://ben.thatmustbe.me/note/2017/12/28/1
> [Fatal Error] :170:3: The element type "input" must be terminated by the 
> matching end-tag "</input>".
>  
> It seems like the issue might be that this is using a very old version of 
> jsoup.  at least as best I could tell.



[jira] [Resolved] (ANY23-271) Address "...The entity "raquo" was referenced, but not declared" SAXParseException

2018-01-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ANY23-271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved ANY23-271.

Resolution: Fixed

Fixed via https://github.com/apache/any23/pull/59

> Address "...The entity "raquo" was referenced, but not declared" 
> SAXParseException
> --
>
> Key: ANY23-271
> URL: https://issues.apache.org/jira/browse/ANY23-271
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 1.1
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.2
>
>
> When attempting extractions on the following URL
> http://data.brandweeraa.nl/data/incident/2016/32601/deployment/201601272048400
> I get the following Exception with the Webservice at any23.org
> {code}
> 
> 
> Could not parse input.
> 
> 
> 
> 
> {code}
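
As a rough illustration of the exception named in the issue title: named HTML entities such as &raquo; are not predefined in XML, so a strict XML parser rejects them unless a DTD declares them. The sketch below uses only the JDK parser; the class name EntitySketch and the sample markup are assumptions and are not taken from the failing page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class EntitySketch {
    /** Returns true if the markup is accepted by a strict XML parser. */
    public static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Raised, for example, when an undeclared entity is referenced.
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Named HTML entity without a DTD: rejected by the XML parser.
        System.out.println(parsesAsXml("<p>a &raquo; b</p>"));  // prints false
        // Numeric character reference for the same character: accepted.
        System.out.println(parsesAsXml("<p>a &#187; b</p>"));   // prints true
    }
}
```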



[jira] [Resolved] (ANY23-268) Entire extraction task fails due to "Element type "t.length" must be followed by either attribute specifications, ">" or "/>"

2018-01-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ANY23-268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved ANY23-268.

Resolution: Fixed

Fixed via https://github.com/apache/any23/pull/59

> Entire extraction task fails due to "Element type "t.length" must be followed 
> by either attribute specifications, ">" or "/>"
> -
>
> Key: ANY23-268
> URL: https://issues.apache.org/jira/browse/ANY23-268
> Project: Apache Any23
>  Issue Type: Sub-task
>  Components: core
>Affects Versions: 1.1
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.2
>
>
> WebService API call
> http://any23.org/rdfxml/http://data.gov
> {code}
> Could not parse input.
> 
>  BEGIN Exception context 
> ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:http://www.data.gov/)
> Errors {
> }
>  END   Exception context 
> org.apache.any23.extractor.ExtractionException: Error while parsing RDF 
> document.
>   at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:109)
>   at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:41)
>   at 
> org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:463)
>   at 
> org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:255)
>   at org.apache.any23.Any23.extract(Any23.java:298)
>   at org.apache.any23.Any23.extract(Any23.java:450)
>   at 
> org.apache.any23.servlet.WebResponder.runExtraction(WebResponder.java:114)
>   at org.apache.any23.servlet.Servlet.doGet(Servlet.java:79)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:618)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:725)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:301)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at 
> org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:239)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
>   at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:106)
>   at 
> org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:503)
>   at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:136)
>   at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:74)
>   at 
> org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:610)
>   at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88)
>   at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:526)
>   at 
> org.apache.coyote.ajp.AbstractAjpProcessor.process(AbstractAjpProcessor.java:794)
>   at 
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:652)
>   at 
> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1575)
>   at 
> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1533)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.openrdf.rio.RDFParseException: org.xml.sax.SAXParseException; 
> lineNumber: 5; columnNumber: 367; Element type "t.length" must be followed by 
> either attribute specifications, ">" or "/>".
>   at 
> org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:111)
>   at 
> org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:95)
>   at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:105)
>   ... 29 more
> Caused by: org.semarglproject.rdf.ParseException: 
> org.xml.sax.SAXParseException; lineNumber: 5; columnNumber: 367; Element type 
> "t.length" must be followed by either attribute specifications, ">" or "/>".
>   at 
> org.semarglproject.rdf.rdfa.RdfaParser.processException(RdfaParser.java:1130)
>   at org.semarglproject.source.XmlSource.process(XmlSource.java:50)
>   at 
> org.se

[jira] [Commented] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338726#comment-16338726
 ] 

ASF GitHub Bot commented on ANY23-326:
--

Github user asfgit closed the pull request at:

https://github.com/apache/any23/pull/59


> parsing unclosed meta and input tags fails
> --
>
> Key: ANY23-326
> URL: https://issues.apache.org/jira/browse/ANY23-326
> Project: Apache Any23
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 2.1
> Environment: ubuntu 17.04
>Reporter: Ben Roberts
>Priority: Major
> Fix For: 2.2
>
>
> parsing fails as soon as it hits an unclosed input or meta tag, as an example 
> try
>  ./bin/any23 rover https://ben.thatmustbe.me/note/2017/12/28/1
> [Fatal Error] :170:3: The element type "input" must be terminated by the 
> matching end-tag "</input>".
>  
> It seems like the issue might be that this is using a very old version of 
> jsoup.  at least as best I could tell.



[GitHub] any23 pull request #59: ANY23-326 fixed rdfa issue with unclosed input & met...

2018-01-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/any23/pull/59


---


[jira] [Commented] (ANY23-227) not extracting opengraph rdfa

2018-01-24 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338720#comment-16338720
 ] 

Lewis John McGibbney commented on ANY23-227:


Using PR 59 I am not able to extract the following
{code}




https://www.last.fm/music/Bread"; data-replaceable-head-tag />
https://lastfm-img2.akamaized.net/i/u/ar0/c41e3b80d0044973b56bd3c36df99aa2.jpg";
 data-replaceable-head-tag>






https://lastfm-img2.akamaized.net/i/u/ar0/c41e3b80d0044973b56bd3c36df99aa2.jpg";
 data-replaceable-head-tag>


{code}

> not extracting opengraph rdfa
> -
>
> Key: ANY23-227
> URL: https://issues.apache.org/jira/browse/ANY23-227
> Project: Apache Any23
>  Issue Type: Bug
>Affects Versions: 1.0
>Reporter: hadar
>Priority: Major
> Fix For: 2.2
>
>
> unable to extract opengraph data using any23 default settings.
> example page.
> http://www.last.fm/music/Bread



[jira] [Commented] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread Hans Brende (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338630#comment-16338630
 ] 

Hans Brende commented on ANY23-326:
---

[~ben.thatmustbe.me] I fixed the wrong html parser in the last PR haha, sorry. 
But the issue should be fixed now in PR #59. Can you confirm?

https://github.com/apache/any23/pull/59

> parsing unclosed meta and input tags fails
> --
>
> Key: ANY23-326
> URL: https://issues.apache.org/jira/browse/ANY23-326
> Project: Apache Any23
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 2.1
> Environment: ubuntu 17.04
>Reporter: Ben Roberts
>Priority: Major
> Fix For: 2.2
>
>
> parsing fails as soon as it hits an unclosed input or meta tag, as an example 
> try
>  ./bin/any23 rover https://ben.thatmustbe.me/note/2017/12/28/1
> [Fatal Error] :170:3: The element type "input" must be terminated by the 
> matching end-tag "</input>".
>  
> It seems like the issue might be that this is using a very old version of 
> jsoup.  at least as best I could tell.



[jira] [Commented] (ANY23-291) JSON-LD should be looked up in entire HTML document, not just in <head>

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338600#comment-16338600
 ] 

ASF GitHub Bot commented on ANY23-291:
--

GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/60

ANY23-291 Allow JSONLD scripts to be located anywhere in document

Pretty self explanatory. Simply changed one xpath expression.

mvn clean install -> all tests pass.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-291

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/60.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #60


commit ce4ff34ad31c12dae6d3b0a3df8fcb5b1d6932e9
Author: Hans 
Date:   2018-01-25T01:58:25Z

ANY23-291 Allow JSONLD scripts to be located anywhere in document




> JSON-LD should be looked up in entire HTML document, not just in <head>
> ---
>
> Key: ANY23-291
> URL: https://issues.apache.org/jira/browse/ANY23-291
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: extractors
>Affects Versions: 1.2
>Reporter: Thomas Francart
>Priority: Minor
> Fix For: 2.2
>
> Attachments: example-embedded-jsonld.html
>
>
> In 
> org.apache.any23.extractor.html.EmbeddedJSONLDExtractor.extractJSONLDScript(),
>  I think this line :
> List scriptNodes = DomUtils.findAll(in, "/HTML/HEAD/SCRIPT");
> is too restrictive. scripts containing json-ld can be placed anywhere in the 
> page, and actually some CMS/Wordpress plugin inserting JSON-LD are generating 
> their output in the body, not in the head.



[GitHub] any23 pull request #60: ANY23-291 Allow JSONLD scripts to be located anywher...

2018-01-24 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/60

ANY23-291 Allow JSONLD scripts to be located anywhere in document

Pretty self explanatory. Simply changed one xpath expression.

mvn clean install -> all tests pass.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-291

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/60.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #60


commit ce4ff34ad31c12dae6d3b0a3df8fcb5b1d6932e9
Author: Hans 
Date:   2018-01-25T01:58:25Z

ANY23-291 Allow JSONLD scripts to be located anywhere in document




---


[jira] [Commented] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338526#comment-16338526
 ] 

ASF GitHub Bot commented on ANY23-326:
--

Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/59
  
@lewismc I rebased all the commits into a single commit to make it easier 
to see what has changed. Everything should be fully functional now.


> parsing unclosed meta and input tags fails
> --
>
> Key: ANY23-326
> URL: https://issues.apache.org/jira/browse/ANY23-326
> Project: Apache Any23
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 2.1
> Environment: ubuntu 17.04
>Reporter: Ben Roberts
>Priority: Major
> Fix For: 2.2
>
>
> parsing fails as soon as it hits an unclosed input or meta tag, as an example 
> try
>  ./bin/any23 rover https://ben.thatmustbe.me/note/2017/12/28/1
> [Fatal Error] :170:3: The element type "input" must be terminated by the 
> matching end-tag "</input>".
>  
> It seems like the issue might be that this is using a very old version of 
> jsoup.  at least as best I could tell.



[GitHub] any23 issue #59: ANY23-326 fixed rdfa issue with unclosed input & meta tags

2018-01-24 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/59
  
@lewismc I rebased all the commits into a single commit to make it easier 
to see what has changed. Everything should be fully functional now.


---


[jira] [Commented] (ANY23-267) Entire extractions fail due to "The element type 'meta' must be terminated by the matching end-tag "

2018-01-24 Thread Hans Brende (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338416#comment-16338416
 ] 

Hans Brende commented on ANY23-267:
---

[~lewismc] fixed in the latest commit.

> Entire extractions fail due to "The element type 'meta' must be terminated by 
> the matching end-tag "
> ---
>
> Key: ANY23-267
> URL: https://issues.apache.org/jira/browse/ANY23-267
> Project: Apache Any23
>  Issue Type: Sub-task
>  Components: core
>Affects Versions: 1.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.2
>
>
> WebService API call
> http://any23.org/best/twitter.com/cygri
> {code}
> Could not parse input.
> 
>  BEGIN Exception context 
> ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:https://twitter.com/cygri)
> Errors {
> }
>  END   Exception context 
> org.apache.any23.extractor.ExtractionException: Error while parsing RDF 
> document.
>   at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:109)
>   at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:41)
>   at 
> org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:463)
>   at 
> org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:255)
>   at org.apache.any23.Any23.extract(Any23.java:298)
>   at org.apache.any23.Any23.extract(Any23.java:450)
>   at 
> org.apache.any23.servlet.WebResponder.runExtraction(WebResponder.java:114)
>   at org.apache.any23.servlet.Servlet.doGet(Servlet.java:79)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:618)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:725)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:301)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at 
> org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:239)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
>   at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:106)
>   at 
> org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:503)
>   at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:136)
>   at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:74)
>   at 
> org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:610)
>   at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88)
>   at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:526)
>   at 
> org.apache.coyote.ajp.AbstractAjpProcessor.process(AbstractAjpProcessor.java:794)
>   at 
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:652)
>   at 
> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1575)
>   at 
> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1533)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.openrdf.rio.RDFParseException: org.xml.sax.SAXParseException; 
> lineNumber: 15; columnNumber: 116; The element type "meta" must be terminated 
> by the matching end-tag "</meta>".
>   at 
> org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:111)
>   at 
> org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:95)
>   at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:105)
>   ... 29 more
> Caused by: org.semarglproject.rdf.ParseException: 
> org.xml.sax.SAXParseException; lineNumber: 15; columnNumber: 116; The element 
> type "meta" must be terminated by the matching end-tag "</meta>".
>   at 
> org.semarglproject.rdf.rdfa.RdfaParser.processException(RdfaParser.java:1130)
>   at org.semarglproject.source.XmlSource.process(XmlSource.java:50)
>   at 
> org.semarglproject.source.Stream

[jira] [Commented] (ANY23-227) not extracting opengraph rdfa

2018-01-24 Thread Hans Brende (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338407#comment-16338407
 ] 

Hans Brende commented on ANY23-227:
---

[~lewismc] Now it is!

> not extracting opengraph rdfa
> -
>
> Key: ANY23-227
> URL: https://issues.apache.org/jira/browse/ANY23-227
> Project: Apache Any23
>  Issue Type: Bug
>Affects Versions: 1.0
>Reporter: hadar
>Priority: Major
> Fix For: 2.2
>
>
> unable to extract opengraph data using any23 default settings.
> example page.
> http://www.last.fm/music/Bread



[jira] [Commented] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338388#comment-16338388
 ] 

ASF GitHub Bot commented on ANY23-326:
--

Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/59
  
@lewismc my latest commit should fix ANY23-227 and ANY23-268.


> parsing unclosed meta and input tags fails
> --
>
> Key: ANY23-326
> URL: https://issues.apache.org/jira/browse/ANY23-326
> Project: Apache Any23
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 2.1
> Environment: ubuntu 17.04
>Reporter: Ben Roberts
>Priority: Major
> Fix For: 2.2
>
>
> parsing fails as soon as it hits an unclosed input or meta tag, as an example 
> try
>  ./bin/any23 rover https://ben.thatmustbe.me/note/2017/12/28/1
> [Fatal Error] :170:3: The element type "input" must be terminated by the 
> matching end-tag "</input>".
>  
> It seems like the issue might be that this is using a very old version of 
> jsoup.  at least as best I could tell.



[GitHub] any23 issue #59: ANY23-326 fixed rdfa issue with unclosed input & meta tags

2018-01-24 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/59
  
@lewismc my latest commit should fix ANY23-227 and ANY23-268.


---


[jira] [Commented] (ANY23-267) Entire extractions fail due to "The element type 'meta' must be terminated by the matching end-tag "

2018-01-24 Thread Hans Brende (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338349#comment-16338349
 ] 

Hans Brende commented on ANY23-267:
---

[~lewismc] hmmm, it appears that the jsoup Parser.xmlParser() does not parse 
javascript correctly. I'll switch back to the Parser.htmlParser() and update my 
PR soon.

> Entire extractions fail due to "The element type 'meta' must be terminated by 
> the matching end-tag "
> ---
>
> Key: ANY23-267
> URL: https://issues.apache.org/jira/browse/ANY23-267
> Project: Apache Any23
>  Issue Type: Sub-task
>  Components: core
>Affects Versions: 1.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.2
>
>
> WebService API call
> http://any23.org/best/twitter.com/cygri
> {code}
> Could not parse input.
> 
>  BEGIN Exception context 
> ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:https://twitter.com/cygri)
> Errors {
> }
>  END   Exception context 
> org.apache.any23.extractor.ExtractionException: Error while parsing RDF 
> document.
>   at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:109)
>   at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:41)
>   at 
> org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:463)
>   at 
> org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:255)
>   at org.apache.any23.Any23.extract(Any23.java:298)
>   at org.apache.any23.Any23.extract(Any23.java:450)
>   at 
> org.apache.any23.servlet.WebResponder.runExtraction(WebResponder.java:114)
>   at org.apache.any23.servlet.Servlet.doGet(Servlet.java:79)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:618)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:725)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:301)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at 
> org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:239)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
>   at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:106)
>   at 
> org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:503)
>   at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:136)
>   at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:74)
>   at 
> org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:610)
>   at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88)
>   at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:526)
>   at 
> org.apache.coyote.ajp.AbstractAjpProcessor.process(AbstractAjpProcessor.java:794)
>   at 
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:652)
>   at 
> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1575)
>   at 
> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1533)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.openrdf.rio.RDFParseException: org.xml.sax.SAXParseException; 
> lineNumber: 15; columnNumber: 116; The element type "meta" must be terminated 
> by the matching end-tag "</meta>".
>   at 
> org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:111)
>   at 
> org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:95)
>   at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:105)
>   ... 29 more
> Caused by: org.semarglproject.rdf.ParseException: 
> org.xml.sax.SAXParseException; lineNumber: 15; columnNumber: 116; The element 
> type "meta" must be terminated by the matching end-tag "</meta>".
>   at 
> org.semarglproject.rdf.rdfa.RdfaParser.processException(RdfaParser.ja

[jira] [Commented] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338251#comment-16338251
 ] 

ASF GitHub Bot commented on ANY23-326:
--

Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/59#discussion_r163683520
  
--- Diff: 
core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java ---
@@ -105,7 +109,24 @@ public void run(
 
parser.getParserConfig().addNonFatalError(BasicParserSettings.NORMALIZE_DATATYPE_VALUES);
 //ByteBuffer seems to represent incorrect content. Need to 
make sure it is the content
 //of the 

[GitHub] any23 pull request #59: ANY23-326 fixed rdfa issue with unclosed input & met...

2018-01-24 Thread HansBrende
Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/59#discussion_r163683520
  
--- Diff: 
core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java ---
@@ -105,7 +109,24 @@ public void run(
 
parser.getParserConfig().addNonFatalError(BasicParserSettings.NORMALIZE_DATATYPE_VALUES);
 //ByteBuffer seems to represent incorrect content. Need to 
make sure it is the content
 //of the 

[jira] [Commented] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338239#comment-16338239
 ] 

ASF GitHub Bot commented on ANY23-326:
--

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/any23/pull/59#discussion_r163680939
  
--- Diff: 
core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java ---
@@ -105,7 +109,24 @@ public void run(
 
parser.getParserConfig().addNonFatalError(BasicParserSettings.NORMALIZE_DATATYPE_VALUES);
 //ByteBuffer seems to represent incorrect content. Need to 
make sure it is the content
 //of the 

[GitHub] any23 pull request #59: ANY23-326 fixed rdfa issue with unclosed input & met...

2018-01-24 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/any23/pull/59#discussion_r163680939
  
--- Diff: 
core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java ---
@@ -105,7 +109,24 @@ public void run(
 
parser.getParserConfig().addNonFatalError(BasicParserSettings.NORMALIZE_DATATYPE_VALUES);
 //ByteBuffer seems to represent incorrect content. Need to 
make sure it is the content
 //of the 

[jira] [Commented] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338238#comment-16338238
 ] 

ASF GitHub Bot commented on ANY23-326:
--

Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/59#discussion_r163680710
  
--- Diff: 
core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java ---
@@ -105,7 +109,24 @@ public void run(
 
parser.getParserConfig().addNonFatalError(BasicParserSettings.NORMALIZE_DATATYPE_VALUES);
 //ByteBuffer seems to represent incorrect content. Need to 
make sure it is the content
 //of the 

[GitHub] any23 pull request #59: ANY23-326 fixed rdfa issue with unclosed input & met...

2018-01-24 Thread HansBrende
Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/59#discussion_r163680710
  
--- Diff: 
core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java ---
@@ -105,7 +109,24 @@ public void run(
 
parser.getParserConfig().addNonFatalError(BasicParserSettings.NORMALIZE_DATATYPE_VALUES);
 //ByteBuffer seems to represent incorrect content. Need to 
make sure it is the content
 //of the 

[GitHub] any23 pull request #59: ANY23-326 fixed rdfa issue with unclosed input & met...

2018-01-24 Thread HansBrende
Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/59#discussion_r163679410
  
--- Diff: 
core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java ---
@@ -22,16 +22,20 @@
 import org.apache.any23.extractor.ExtractionParameters;
 import org.apache.any23.extractor.ExtractionResult;
 import org.apache.any23.extractor.Extractor;
-import org.eclipse.rdf4j.rio.RDFHandlerException;
-import org.eclipse.rdf4j.rio.RDFParseException;
-import org.eclipse.rdf4j.rio.RDFParser;
-import org.eclipse.rdf4j.rio.RioSetting;
+import org.eclipse.rdf4j.rio.*;
--- End diff --

Sure, sorry, that was something that IntelliJ did. I have no idea why. Will 
fix.


---


[jira] [Commented] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338230#comment-16338230
 ] 

ASF GitHub Bot commented on ANY23-326:
--

Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/59#discussion_r163679410
  
--- Diff: 
core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java ---
@@ -22,16 +22,20 @@
 import org.apache.any23.extractor.ExtractionParameters;
 import org.apache.any23.extractor.ExtractionResult;
 import org.apache.any23.extractor.Extractor;
-import org.eclipse.rdf4j.rio.RDFHandlerException;
-import org.eclipse.rdf4j.rio.RDFParseException;
-import org.eclipse.rdf4j.rio.RDFParser;
-import org.eclipse.rdf4j.rio.RioSetting;
+import org.eclipse.rdf4j.rio.*;
--- End diff --

Sure, sorry, that was something that IntelliJ did. I have no idea why. Will 
fix.


> parsing unclosed meta and input tags fails
> --
>
> Key: ANY23-326
> URL: https://issues.apache.org/jira/browse/ANY23-326
> Project: Apache Any23
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 2.1
> Environment: ubuntu 17.04
>Reporter: Ben Roberts
>Priority: Major
> Fix For: 2.2
>
>
> parsing fails as soon as it hits an unclosed input or meta tag, as an example 
> try
>  ./bin/any23 rover https://ben.thatmustbe.me/note/2017/12/28/1
> [Fatal Error] :170:3: The element type "input" must be terminated by the 
> matching end-tag "</input>".
>  
> It seems like the issue might be that this is using a very old version of 
> jsoup.  at least as best I could tell.





[GitHub] any23 pull request #59: ANY23-326 fixed rdfa issue with unclosed input & met...

2018-01-24 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/any23/pull/59#discussion_r163679147
  
--- Diff: 
core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java ---
@@ -105,7 +109,24 @@ public void run(
 
parser.getParserConfig().addNonFatalError(BasicParserSettings.NORMALIZE_DATATYPE_VALUES);
 //ByteBuffer seems to represent incorrect content. Need to 
make sure it is the content
 //of the 

[jira] [Commented] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338229#comment-16338229
 ] 

ASF GitHub Bot commented on ANY23-326:
--

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/any23/pull/59#discussion_r163679147
  
--- Diff: 
core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java ---
@@ -105,7 +109,24 @@ public void run(
 
parser.getParserConfig().addNonFatalError(BasicParserSettings.NORMALIZE_DATATYPE_VALUES);
 //ByteBuffer seems to represent incorrect content. Need to 
make sure it is the content
 //of the 

[jira] [Commented] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338228#comment-16338228
 ] 

ASF GitHub Bot commented on ANY23-326:
--

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/any23/pull/59#discussion_r163678949
  
--- Diff: 
core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java ---
@@ -22,16 +22,20 @@
 import org.apache.any23.extractor.ExtractionParameters;
 import org.apache.any23.extractor.ExtractionResult;
 import org.apache.any23.extractor.Extractor;
-import org.eclipse.rdf4j.rio.RDFHandlerException;
-import org.eclipse.rdf4j.rio.RDFParseException;
-import org.eclipse.rdf4j.rio.RDFParser;
-import org.eclipse.rdf4j.rio.RioSetting;
+import org.eclipse.rdf4j.rio.*;
--- End diff --

Can you please use explicit imports rather than a wildcard import?


> parsing unclosed meta and input tags fails
> --
>
> Key: ANY23-326
> URL: https://issues.apache.org/jira/browse/ANY23-326
> Project: Apache Any23
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 2.1
> Environment: ubuntu 17.04
>Reporter: Ben Roberts
>Priority: Major
> Fix For: 2.2
>
>
> parsing fails as soon as it hits an unclosed input or meta tag, as an example 
> try
>  ./bin/any23 rover https://ben.thatmustbe.me/note/2017/12/28/1
> [Fatal Error] :170:3: The element type "input" must be terminated by the 
> matching end-tag "</input>".
>  
> It seems like the issue might be that this is using a very old version of 
> jsoup.  at least as best I could tell.





[jira] [Commented] (ANY23-227) not extracting opengraph rdfa

2018-01-24 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338227#comment-16338227
 ] 

Lewis John McGibbney commented on ANY23-227:


This issue is not fixed by https://github.com/apache/any23/pull/59 [~HansBrende]

> not extracting opengraph rdfa
> -
>
> Key: ANY23-227
> URL: https://issues.apache.org/jira/browse/ANY23-227
> Project: Apache Any23
>  Issue Type: Bug
>Affects Versions: 1.0
>Reporter: hadar
>Priority: Major
> Fix For: 2.2
>
>
> unable to extract opengraph data using any23 default settings.
> example page.
> http://www.last.fm/music/Bread





[GitHub] any23 pull request #59: ANY23-326 fixed rdfa issue with unclosed input & met...

2018-01-24 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/any23/pull/59#discussion_r163678949
  
--- Diff: 
core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java ---
@@ -22,16 +22,20 @@
 import org.apache.any23.extractor.ExtractionParameters;
 import org.apache.any23.extractor.ExtractionResult;
 import org.apache.any23.extractor.Extractor;
-import org.eclipse.rdf4j.rio.RDFHandlerException;
-import org.eclipse.rdf4j.rio.RDFParseException;
-import org.eclipse.rdf4j.rio.RDFParser;
-import org.eclipse.rdf4j.rio.RioSetting;
+import org.eclipse.rdf4j.rio.*;
--- End diff --

Can you please use explicit imports rather than a wildcard import?


---


[jira] [Commented] (ANY23-271) Address "...The entity "raquo" was referenced, but not declared" SAXParseException

2018-01-24 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338218#comment-16338218
 ] 

Lewis John McGibbney commented on ANY23-271:


When I run the above extraction with the patch provided at I get the following 
issues... note they are still related to the RDFa1.1 Extractor. Also note 
however that the entity "raquo" issue is now resolved so this issue is fixed.

{code}
html-head-meta
html-embedded-jsonld
html-head-title
html-rdfa11
Can't resolve term profile
Can't resolve term pingback
Can't resolve term dns-prefetch
Can't resolve term dns-prefetch
Can't resolve term dns-prefetch
Element type "i.length" must be followed by either attribute specifications, ">" or "/>".
{code}

> Address "...The entity "raquo" was referenced, but not declared" 
> SAXParseException
> --
>
> Key: ANY23-271
> URL: https://issues.apache.org/jira/browse/ANY23-271
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 1.1
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.2
>
>
> When attempting extractions on the following URL
> http://data.brandweeraa.nl/data/incident/2016/32601/deployment/201601272048400
> I get the following Exception with the Webservice at any23.org
> {code}
> 
> 
> Could not parse input.
> 
> 
> 
> 
> {code}
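
The exception in this report can be reproduced outside Any23 with a strict XML parser, because `&raquo;` is an HTML named entity that XML does not predefine. The sketch below is only an illustration using the JDK's built-in SAX parser; the names `EntityDemo` and `parseError` are made up here and are not Any23 or semargl code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class EntityDemo {
    // Hypothetical helper: returns null when the markup is well-formed XML,
    // otherwise the message of the parser's fatal error.
    static String parseError(String markup) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler()); // DefaultHandler rethrows fatal errors
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // HTML named entity with no DTD declaring it: rejected by the XML parser.
        System.out.println(parseError("<p>next &raquo;</p>"));
        // Numeric character reference for the same character: accepted, prints null.
        System.out.println(parseError("<p>next &#187;</p>"));
    }
}
```

Replacing named HTML entities with numeric character references before the strict parse is one way to avoid the error; whether the patch does exactly that is not stated in this thread.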





[jira] [Comment Edited] (ANY23-271) Address "...The entity "raquo" was referenced, but not declared" SAXParseException

2018-01-24 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338218#comment-16338218
 ] 

Lewis John McGibbney edited comment on ANY23-271 at 1/24/18 9:05 PM:
-

When I run the above extraction with the patch provided at 
https://github.com/apache/any23/pull/59 I get the following issues... note they 
are still related to the RDFa1.1 Extractor. Also note however that the entity 
"raquo" issue is now resolved so this issue is fixed.

{code}
html-head-meta
html-embedded-jsonld
html-head-title
html-rdfa11
Can't resolve term profile
Can't resolve term pingback
Can't resolve term dns-prefetch
Can't resolve term dns-prefetch
Can't resolve term dns-prefetch
Element type "i.length" must be followed by either attribute specifications, ">" or "/>".
{code}


was (Author: lewismc):
When I run the above extraction with the patch provided at I get the following 
issues... note they are still related to the RDFa1.1 Extractor. Also note 
however that the entity "raquo" issue is now resolved so this issue is fixed.

{code}
html-head-meta
html-embedded-jsonld
html-head-title
html-rdfa11
Can't resolve term profile
Can't resolve term pingback
Can't resolve term dns-prefetch
Can't resolve term dns-prefetch
Can't resolve term dns-prefetch
Element type "i.length" must be followed by either attribute specifications, ">" or "/>".
{code}

> Address "...The entity "raquo" was referenced, but not declared" 
> SAXParseException
> --
>
> Key: ANY23-271
> URL: https://issues.apache.org/jira/browse/ANY23-271
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 1.1
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.2
>
>
> When attempting extractions on the following URL
> http://data.brandweeraa.nl/data/incident/2016/32601/deployment/201601272048400
> I get the following Exception with the Webservice at any23.org
> {code}
> 
> 
> Could not parse input.
> 
> 
> 
> 
> {code}





[jira] [Commented] (ANY23-267) Entire extractions fail due to "The element type 'meta' must be terminated by the matching end-tag "

2018-01-24 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338211#comment-16338211
 ] 

Lewis John McGibbney commented on ANY23-267:


When I execute an extraction, with https://github.com/apache/any23/pull/59, I 
get the following [~HansBrende]. There is still an issue as outlined below.
{code}
html-mf-xfn
html-head-meta
html-head-title
Element type "t.length" must be followed by either attribute specifications, ">" or "/>".
{code}
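
A plausible cause of the remaining `"t.length"` error, stated here as an assumption rather than something confirmed in this thread, is inline JavaScript such as `n<t.length;` reaching a strict XML parser, which reads `<t.length` as the start of an element. The JDK-only sketch below (hypothetical names `ScriptDemo` and `parseError`) shows that behaviour and how a CDATA section avoids it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ScriptDemo {
    // Hypothetical helper: returns null when the markup is well-formed XML,
    // otherwise the message of the parser's fatal error.
    static String parseError(String markup) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler()); // DefaultHandler rethrows fatal errors
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        String js = "for (var n = 0; n<t.length; n++) {}";
        // Raw script text: "<t.length" is parsed as markup and rejected.
        System.out.println(parseError("<html><script>" + js + "</script></html>"));
        // Same script inside CDATA: treated as character data, prints null.
        System.out.println(parseError(
                "<html><script><![CDATA[" + js + "]]></script></html>"));
    }
}
```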

> Entire extractions fail due to "The element type 'meta' must be terminated by 
> the matching end-tag "
> ---
>
> Key: ANY23-267
> URL: https://issues.apache.org/jira/browse/ANY23-267
> Project: Apache Any23
>  Issue Type: Sub-task
>  Components: core
>Affects Versions: 1.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.2
>
>
> WebService API call
> http://any23.org/best/twitter.com/cygri
> {code}
> Could not parse input.
> 
>  BEGIN Exception context 
> ExtractionContext(urn:x-any23:html-rdfa11:root-extraction-result-id:https://twitter.com/cygri)
> Errors {
> }
>  END   Exception context 
> org.apache.any23.extractor.ExtractionException: Error while parsing RDF 
> document.
>   at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:109)
>   at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:41)
>   at 
> org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:463)
>   at 
> org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:255)
>   at org.apache.any23.Any23.extract(Any23.java:298)
>   at org.apache.any23.Any23.extract(Any23.java:450)
>   at 
> org.apache.any23.servlet.WebResponder.runExtraction(WebResponder.java:114)
>   at org.apache.any23.servlet.Servlet.doGet(Servlet.java:79)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:618)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:725)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:301)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at 
> org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:239)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
>   at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:106)
>   at 
> org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:503)
>   at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:136)
>   at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:74)
>   at 
> org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:610)
>   at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88)
>   at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:526)
>   at 
> org.apache.coyote.ajp.AbstractAjpProcessor.process(AbstractAjpProcessor.java:794)
>   at 
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:652)
>   at 
> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1575)
>   at 
> org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1533)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.openrdf.rio.RDFParseException: org.xml.sax.SAXParseException; 
> lineNumber: 15; columnNumber: 116; The element type "meta" must be terminated 
> by the matching end-tag "</meta>".
>   at 
> org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:111)
>   at 
> org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:95)
>   at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:105)
>   ... 29 more
> Caused by: org.semarglproject.rdf.ParseException: 
> org.xml.sax.SAXParseException; lineNumber: 15

[jira] [Resolved] (ANY23-324) Replace net.sourceforge.nekohtml with jsoup

2018-01-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ANY23-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved ANY23-324.

Resolution: Fixed

> Replace net.sourceforge.nekohtml with jsoup 
> 
>
> Key: ANY23-324
> URL: https://issues.apache.org/jira/browse/ANY23-324
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: core
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.2
>
>
> A long standing issue relates to the performance of the existing default 
> [TagSoupParser.java|https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/html/TagSoupParser.java].
>  There are a number of issues which now relate to limitations in the way 
> nekohtml parses HTML5 for example 
> [ANY23-317|https://issues.apache.org/jira/browse/ANY23-317], 
> [ANY23-273|https://issues.apache.org/jira/browse/ANY23-273], 
> [ANY23-267|https://issues.apache.org/jira/browse/ANY23-267]... there are 
> several others.
> I propose to @Deprecate the TagSoupParser.java implementation for the next 
> release (possibly making it configurable via 
> default-configuration.properties). I also propose to replace it with 
> https://jsoup.org/. AFAIK, Apache Tika also did this several years ago.





[jira] [Resolved] (ANY23-273) The content of elements must consist of well-formed character data or markup - no bogus comments

2018-01-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ANY23-273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved ANY23-273.

Resolution: Fixed

> The content of elements must consist of well-formed character data or markup 
> - no bogus comments
> 
>
> Key: ANY23-273
> URL: https://issues.apache.org/jira/browse/ANY23-273
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 1.2
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.2
>
>
> Whilst attempting to address ANY23-131, I tried running the any23.org service 
> over the [example 
> URL|https://www.otto.de/p/aeg-waschmaschine-lavamat-l14as7-aplusplusplus-7-kg-1400-u-min-508571361/#variationId=504747671-M48]
>  with the following failure
> {code}
> 
> 
> Could not parse input.
> 
> 
> 
> 
> {code}





[jira] [Resolved] (ANY23-317) Any23 fails when dealing with JavaScript

2018-01-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ANY23-317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved ANY23-317.

Resolution: Fixed

> Any23 fails when dealing with JavaScript
> 
>
> Key: ANY23-317
> URL: https://issues.apache.org/jira/browse/ANY23-317
> Project: Apache Any23
>  Issue Type: Bug
>  Components: core, extractors
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Priority: Critical
> Fix For: 2.2
>
>
> Any23 always crashes when attempting to parse 

[jira] [Updated] (ANY23-295) Implement ability to use librdfa

2018-01-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ANY23-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated ANY23-295:
---
Labels: gsoc2017 gsoc2018  (was: gsoc2017)

> Implement ability to use librdfa
> 
>
> Key: ANY23-295
> URL: https://issues.apache.org/jira/browse/ANY23-295
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: core, extractors
>Reporter: Lewis John McGibbney
>Priority: Major
>  Labels: gsoc2017, gsoc2018
> Fix For: 2.2
>
>
> It would be cool for us to see what kind of speed up we can get by 
> implementing RDFa parsing via librdfa C implementation at 
> https://github.com/rdfa/librdfa
> The C implementation also states "...It currently supports XML+RDFa, 
> XHTML+RDFa, SVG+RDFa, HTML4+RDFa and HTML5+RDFa for both RDFa 1.0 and RDFa 
> 1.1."





[jira] [Commented] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337540#comment-16337540
 ] 

ASF GitHub Bot commented on ANY23-326:
--

GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/59

ANY23-326 fixed rdfa issue with unclosed input & meta tags

This PR should also fix ANY23-317, ANY23-273, ANY23-267, ANY23-271, and 
ANY23-227 (this time, for realz).

These all have to do with the RDFa implementation failing to parse HTML.

My previous commit attempted to fix these issues by changing the default 
parser from NekoHTML to Jsoup. But alas, it turns out the RDFa implementation 
is using a completely different html parser under the hood, and it's the RDFa 
parser that's too strict, not ours, so changing ours from NekoHTML to Jsoup had 
no effect (although it did come with a nice 20% speed increase, so there's 
that). It seems that, for rio parsers, the document is parsed with Jsoup *only 
to get the document language* and then parsed **again** under the hood with who 
knows what.

Now, I simply check the RDF format to see if we're putting out XHTML. If we 
are, I first XHTML-ify the stream with Jsoup before sending it on to the rio 
RDF parser.
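
The normalize-then-parse idea described above can be sketched with the JDK alone. Note the substitution: the PR uses Jsoup to produce XHTML, whereas this sketch uses a deliberately naive regex that only self-closes a handful of void elements, purely so the example runs without extra dependencies. The names `NormalizeThenParse`, `selfCloseVoidTags`, and `parseError` are hypothetical and do not appear in the PR.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class NormalizeThenParse {
    // Toy stand-in for the Jsoup step: matches a few void elements,
    // whether or not they are already self-closed.
    private static final Pattern VOID_TAG = Pattern.compile(
            "<(meta|input|br|hr|img|link)\\b([^>]*?)\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    // Rewrites every matched void element as a self-closed tag.
    static String selfCloseVoidTags(String html) {
        return VOID_TAG.matcher(html).replaceAll("<$1$2/>");
    }

    // Returns null when the markup is well-formed XML,
    // otherwise the message of the parser's fatal error.
    static String parseError(String markup) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler()); // DefaultHandler rethrows fatal errors
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><meta charset=\"utf-8\"></head>"
                + "<body><input type=\"text\"></body></html>";
        // Strict parse of the raw HTML fails on the unclosed tags.
        System.out.println(parseError(html));
        // After normalization the same strict parser accepts it, prints null.
        System.out.println(parseError(selfCloseVoidTags(html)));
    }
}
```

A real HTML parser such as Jsoup handles far more than this regex does (unquoted attributes, misnested tags, named entities), which is presumably why the PR relies on it rather than on string rewriting.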

mvn clean install -> all tests passed.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-326

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/59.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #59


commit 74b2909b6d91cc4989093d90a38baef1c34c603f
Author: Hans 
Date:   2018-01-24T12:26:40Z

ANY23-326 fixed rdfa issue with unclosed input & meta tags




> parsing unclosed meta and input tags fails
> --
>
> Key: ANY23-326
> URL: https://issues.apache.org/jira/browse/ANY23-326
> Project: Apache Any23
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 2.1
> Environment: ubuntu 17.04
>Reporter: Ben Roberts
>Priority: Major
> Fix For: 2.2
>
>
> parsing fails as soon as it hits an unclosed input or meta tag, as an example 
> try
>  ./bin/any23 rover https://ben.thatmustbe.me/note/2017/12/28/1
> [Fatal Error] :170:3: The element type "input" must be terminated by the 
> matching end-tag "</input>".
>  
> It seems like the issue might be that this is using a very old version of 
> jsoup.  at least as best I could tell.





[GitHub] any23 pull request #59: ANY23-326 fixed rdfa issue with unclosed input & met...

2018-01-24 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/59

ANY23-326 fixed rdfa issue with unclosed input & meta tags

This PR should also fix ANY23-317, ANY23-273, ANY23-267, ANY23-271, and 
ANY23-227 (this time, for realz).

These all have to do with the RDFa implementation failing to parse HTML.

My previous commit attempted to fix these issues by changing the default 
parser from NekoHTML to Jsoup. But alas, it turns out the RDFa implementation 
is using a completely different html parser under the hood, and it's the RDFa 
parser that's too strict, not ours, so changing ours from NekoHTML to Jsoup had 
no effect (although it did come with a nice 20% speed increase, so there's 
that). It seems that, for rio parsers, the document is parsed with Jsoup *only 
to get the document language* and then parsed **again** under the hood with who 
knows what.

Now, I simply check the RDF format to see if we're putting out XHTML. If we 
are, I first XHTML-ify the stream with Jsoup before sending it on to the rio 
RDF parser.

mvn clean install -> all tests passed.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-326

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/59.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #59


commit 74b2909b6d91cc4989093d90a38baef1c34c603f
Author: Hans 
Date:   2018-01-24T12:26:40Z

ANY23-326 fixed rdfa issue with unclosed input & meta tags




---


[jira] [Commented] (ANY23-324) Replace net.sourceforge.nekohtml with jsoup

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337404#comment-16337404
 ] 

ASF GitHub Bot commented on ANY23-324:
--

Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/58
  
@lewismc Yeah, I just realized that this PR fixes none of the issues we 
thought it would... because the TagSoupParser is not what was causing the 
problem... the semargl parser is causing the problem. Don't worry, I've got 
another PR coming shortly!


> Replace net.sourceforge.nekohtml with jsoup 
> 
>
> Key: ANY23-324
> URL: https://issues.apache.org/jira/browse/ANY23-324
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: core
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.2
>
>
> A long standing issue relates to the performance of the existing default 
> [TagSoupParser.java|https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/html/TagSoupParser.java].
>  There are a number of issues which now relate to limitations in the way 
> nekohtml parses HTML5 for example 
> [ANY23-317|https://issues.apache.org/jira/browse/ANY23-317], 
> [ANY23-273|https://issues.apache.org/jira/browse/ANY23-273], 
> [ANY23-267|https://issues.apache.org/jira/browse/ANY23-267]... there are 
> several others.
> I propose to @Deprecate the TagSoupParser.java implementation for the next 
> release (possibly making it configurable via 
> default-configuration.properties). I also propose to replace it with 
> https://jsoup.org/. AFAIK, Apache Tika also did this several years ago.





[GitHub] any23 issue #58: ANY23-324 Changed default html parser from NekoHTML to Jsou...

2018-01-24 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/58
  
@lewismc Yeah, I just realized that this PR fixes none of the issues we 
thought it would... because the TagSoupParser is not what was causing the 
problem... the semargl parser is causing the problem. Don't worry, I've got 
another PR coming shortly!


---


[jira] [Commented] (ANY23-326) parsing unclosed meta and input tags fails

2018-01-24 Thread Hans Brende (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337280#comment-16337280
 ] 

Hans Brende commented on ANY23-326:
---

Just for reference, here is the actual stack trace:

org.eclipse.rdf4j.rio.RDFParseException: org.xml.sax.SAXParseException; 
lineNumber: 170; columnNumber: 3; The element type "input" must be terminated 
by the matching end-tag "</input>".
 at 
org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:111)
 at 
org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:95)
 at 
org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:117)
 at 
org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:47)
 at 
org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:473)
 at 
org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:261)
 at org.apache.any23.Any23.extract(Any23.java:300)
 at org.apache.any23.Any23.extract(Any23.java:452)
 at org.apache.any23.cli.Rover.performExtraction(Rover.java:182)
...
Caused by: org.semarglproject.rdf.ParseException: 
org.xml.sax.SAXParseException; lineNumber: 170; columnNumber: 3; The element 
type "input" must be terminated by the matching end-tag "</input>".
 at 
org.semarglproject.rdf.rdfa.RdfaParser.processException(RdfaParser.java:1141)
 at org.semarglproject.source.XmlSource.process(XmlSource.java:50)
 at 
org.semarglproject.source.StreamProcessor.processInternal(StreamProcessor.java:87)
 at 
org.semarglproject.source.BaseStreamProcessor.process(BaseStreamProcessor.java:167)
 at 
org.semarglproject.source.BaseStreamProcessor.process(BaseStreamProcessor.java:154)
 at 
org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:109)
 ... 10 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 170; columnNumber: 3; The 
element type "input" must be terminated by the matching end-tag "</input>".
 at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
 at org.semarglproject.source.XmlSource.process(XmlSource.java:48)
 ... 14 more

> parsing unclosed meta and input tags fails
> --
>
> Key: ANY23-326
> URL: https://issues.apache.org/jira/browse/ANY23-326
> Project: Apache Any23
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 2.1
> Environment: ubuntu 17.04
>Reporter: Ben Roberts
>Priority: Major
> Fix For: 2.2
>
>
> parsing fails as soon as it hits an unclosed input or meta tag, as an example 
> try
>  ./bin/any23 rover https://ben.thatmustbe.me/note/2017/12/28/1
> [Fatal Error] :170:3: The element type "input" must be terminated by the 
> matching end-tag "</input>".
>  
> It seems like the issue might be that this is using a very old version of 
> jsoup.  at least as best I could tell.


