[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/132 No objections from me! +1. ---
[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/132 See also: https://issues.apache.org/jira/browse/CXF-7899 ---
[GitHub] any23 issue #131: ANY23-418 improve TikaEncodingDetector
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/131 @lewismc I've simplified the code a lot so it should be a whole lot easier to see what's going on now. Also, I improved the UTF-8 detector by reverse engineering jchardet's methodology for UTF-8 detection, and created a UTF-8 state machine which does the same thing as jchardet (in a much more human-readable manner), plus fixed two bugs in jchardet's UTF-8 detector along the way (possibly due to the lack of human-readability in the original source code). I started looking into jchardet because, according to [TIKA-2038](https://issues.apache.org/jira/browse/TIKA-2038), using it to detect UTF-8 before anything else increased the accuracy of charset detection from ~72% to ~96%. Our encoding detector should now be at least as accurate. Any thoughts on the methodology, as compared to what we had before? ---
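Editor's sketch of the UTF-8 state-machine idea described above. This is not the code from the PR and not jchardet's algorithm. It is a minimal well-formedness check following the byte ranges in RFC 3629, and the class and method names are hypothetical. A real detector would also need to weigh how much multi-byte evidence it has seen before declaring a stream UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class Utf8StateMachineSketch {

    /** Returns true if the bytes are well-formed UTF-8 (no overlongs, no surrogates, no truncation). */
    static boolean isWellFormedUtf8(byte[] bytes) {
        int remaining = 0;   // state: continuation bytes still expected
        int lower = 0x80;    // allowed range for the next continuation byte
        int upper = 0xBF;
        for (byte value : bytes) {
            int b = value & 0xFF;
            if (remaining == 0) {
                if (b <= 0x7F) {
                    continue;                                        // ASCII
                } else if (b >= 0xC2 && b <= 0xDF) {
                    remaining = 1;                                   // 2-byte sequence
                } else if (b == 0xE0) {
                    remaining = 2; lower = 0xA0;                     // reject overlong 3-byte forms
                } else if (b == 0xED) {
                    remaining = 2; upper = 0x9F;                     // reject UTF-16 surrogates
                } else if (b >= 0xE1 && b <= 0xEF) {
                    remaining = 2;                                   // other 3-byte sequences
                } else if (b == 0xF0) {
                    remaining = 3; lower = 0x90;                     // reject overlong 4-byte forms
                } else if (b == 0xF4) {
                    remaining = 3; upper = 0x8F;                     // reject code points above U+10FFFF
                } else if (b >= 0xF1 && b <= 0xF3) {
                    remaining = 3;                                   // other 4-byte sequences
                } else {
                    return false;                                    // stray continuation or invalid lead byte
                }
            } else {
                if (b < lower || b > upper) {
                    return false;                                    // bad continuation byte
                }
                lower = 0x80;                                        // only the first continuation is restricted
                upper = 0xBF;
                remaining--;
            }
        }
        return remaining == 0;                                       // false if the stream ends mid-sequence
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isWellFormedUtf8(utf8));
        System.out.println(isWellFormedUtf8(latin1));
    }
}
```

Tracking the allowed range of the next continuation byte keeps the machine to a single counter plus two bounds, which is one way to make the accepted byte sequences easy to audit by eye.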
[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/132 Created an issue about this in Tika: https://issues.apache.org/jira/projects/TIKA/issues/TIKA-2778 ---
[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/132 ALSO: it appears that `javax.activation:activation:1.1.1` has been replaced by `com.sun.activation:javax.activation:1.2.0` and `javax.activation:javax.activation-api:1.2.0`. However, I'm a bit fuzzy on how this works because it appears that the `javax.activation-api` sources are a subset of the `javax.activation` sources (i.e., `javax.activation:1.2.0` does not *depend* on `javax.activation-api:1.2.0`, but rather simply copies the source files... I think). In any case, `org.glassfish.jaxb:jaxb-runtime:2.3.1` *depends on* `javax.activation:javax.activation-api:1.2.0`, but does *not* depend on `com.sun.activation:javax.activation:1.2.0`. That leads me to believe that if we include both `jaxb-runtime:2.3.1` and `javax.activation:1.2.0`, we might have to exclude `javax.activation-api` from `jaxb-runtime` to avoid duplicate classes.

Cf. https://stackoverflow.com/questions/46493613/what-is-the-replacement-for-javax-activation-package-in-java-9
Cf. https://stackoverflow.com/questions/52921879/migration-to-jdk-11-has-error-occure-java-lang-noclassdeffounderror-javax-acti
Cf. https://stackoverflow.com/questions/48204141/replacements-for-deprecated-jpms-modules-with-java-ee-apis
Cf. https://mvnrepository.com/artifact/org.glassfish.jaxb/jaxb-runtime/2.3.1
---
[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/132 ALSO: it appears that, as of `org.glassfish.jaxb:jaxb-runtime` version 2.3.1, `jaxb-core` is no longer required (as it has been merged into `jaxb-runtime`). ---
[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/132 Indeed, when I navigate to https://javalibs.com/bom/com.sun.xml.bind/jaxb-impl I see:

> This artifact has been retired! New location is: org.glassfish.jaxb:jaxb-runtime

---
[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/132 @lewismc @ansell of interest may be [TIKA-2743](https://issues.apache.org/jira/projects/TIKA/issues/TIKA-2743), entitled "Replace com.sun.xml.bind:jaxb-impl and jaxb-core by org.glassfish.jaxb:jaxb-runtime and jaxb-core", which states:

> com.sun.xml.bind:* is actually the old name and is currently a repackaging of org.glassfish.jaxb:*. probably kept as a retro compatibility

---
[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/132 @ansell Well, this is what I see when I run `mvn dependency:tree` on the current state of the Any23 trunk:

```
[INFO] |  +- org.apache.tika:tika-core:jar:1.19.1:compile
[INFO] |  +- org.apache.tika:tika-parsers:jar:1.19.1:compile
[INFO] |  |  +- org.glassfish.jaxb:jaxb-core:jar:2.3.0.1:compile
[INFO] |  |  |  +- javax.xml.bind:jaxb-api:jar:2.3.0:compile
[INFO] |  |  |  +- org.glassfish.jaxb:txw2:jar:2.3.0.1:compile
[INFO] |  |  |  \- com.sun.istack:istack-commons-runtime:jar:3.0.5:compile
[INFO] |  |  +- org.glassfish.jaxb:jaxb-runtime:jar:2.3.0.1:compile
[INFO] |  |  |  +- org.jvnet.staxex:stax-ex:jar:1.7.8:compile
[INFO] |  |  |  \- com.sun.xml.fastinfoset:FastInfoset:jar:1.2.13:compile
[INFO] |  |  +- javax.activation:activation:jar:1.1.1:compile
```
---
[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/132 @ansell Right. And I know that Tika (as of 1.19.0) already pulls in all but one of these dependencies (I think). The only difference being, Tika uses the jaxb-core module from glassfish rather than `com.sun.xml.bind`. So, should we exclude the Tika dependencies and use these? Or just use Tika's, with the addition of `jaxws-api`? (Basically, I'm just trying to avoid overlapping class names). ---
[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/132 @lewismc when I try to use `sudo` to, for example, read a text file using

```
sudo cat /opt/tomcat9/BUILDING.txt
```

it gives me:

```
sudo: PAM authentication error: User not known to the underlying authentication module
```

Any ideas?
---
[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/132 @lewismc Also, now that they've granted me access, how do I actually access the VM? https://issues.apache.org/jira/browse/INFRA-17224 ---
[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/132 @lewismc I have zero experience with Tomcat, so I'm just as in-the-dark as you in this regard. Did you get an error message or stacktrace you could post here? I'd recommend removing all the newly added dependencies except for `javax.xml.ws:jaxws-api`, since that's the only one (I think) that Tika doesn't already pull in, and see how far that gets us. (Because otherwise we might need to add additional exclusions to our tika-parsers dependency to avoid conflicting class names if anyone's using the `maven-shade-plugin` or similar). ---
[GitHub] any23 issue #131: ANY23-418 improve TikaEncodingDetector
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/131 @lewismc I've added some additional unit tests which test against the main issues we've been having with encoding detection. Unfortunately, the only real way to comprehensively test this is to compare against millions of webpages "in the wild", but I am confident that it represents a huge improvement over what we have *now*, based on our past problems with encoding detection, plus discussions over in Tika regarding the various issues *they've* been having with encoding detection. Compare to the original version of this file [here](https://github.com/apache/any23/blob/bd607c1cc8c63225f9678ec967c73daa474b45aa/encoding/src/main/java/org/apache/any23/encoding/TikaEncodingDetector.java). Since that time, I've made a couple changes to the algorithm to fix up problems we've encountered along the way, but those tweaks weren't as comprehensive as this one is. Ideally, I'd like to compare this more comprehensive solution against our original solution across millions of webpages, but I'm not yet sure how to proceed in that regard. ---
[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/132 @lewismc FYI, I believe that some of these dependencies are already pulled in by Tika. Running `mvn dependency:tree` on the service module, I see:

- `javax.xml.bind:jaxb-api:jar:2.3.0` is pulled in by tika-parsers
- `org.glassfish.jaxb:jaxb-core:jar:2.3.0.1` is pulled in by tika-parsers
- `javax.activation:activation:jar:1.1.1` is pulled in by tika-parsers

leaving (I believe) `jaxws-api` as the only library not already pulled in by Tika. Out of curiosity though, where are these libraries actually used by the service? Are they required by jetty?
---
[GitHub] any23 issue #131: ANY23-418 improve TikaEncodingDetector
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/131 @lewismc any thoughts about this? ---
[GitHub] any23 pull request #131: ANY23-418 improve TikaEncodingDetector
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/131

ANY23-418 improve TikaEncodingDetector

Improves TikaEncodingDetector by:

1. Not second-guessing UTF-8 if there is *any* indication that a stream is UTF-8-encoded. We can't afford false positives from obscure, obsolete charsets such as IBM500 (See [TIKA-2771](https://issues.apache.org/jira/browse/TIKA-2771)).
2. Taking entire stream into account rather than a prefix (this shouldn't be a huge memory issue, as we are already holding the entire stream in memory to pass to each extractor, and extractors such as RDFa already parse the entire content into a DOM before generating the triples. If we want to make Any23 "streaming"-capable in the future to reduce memory requirements, we can look into that, but for now, since we're not, we may as well use that to our advantage to be more accurate in charset detection.)
3. Taking [TIKA-2771](https://issues.apache.org/jira/browse/TIKA-2771), [TIKA-2038](https://issues.apache.org/jira/browse/TIKA-2038), and [TIKA-539](https://issues.apache.org/jira/browse/TIKA-539) into account.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-418 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/131.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #131 commit d64dac9dfe0752c45d3ff9fbca37bbe447e5c55b Author: Hans Date: 2018-11-06T21:27:00Z ANY23-418 improve TikaEncodingDetector ---
[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/124 I don't think we need to worry about these remaining 5 tests before the 2.3 release. So I'm going to add code to ignore them for now. ---
[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/124

## Update

Merging ANY23-410 and ANY23-411 into master resulted in a reduction of failed tests from 11 to 5. Failed tests are now as follows:

- Test 0073: Vocabulary Expansion test with rdfs:subPropertyOf
- Test 0074: Vocabulary Expansion test with owl:equivalentProperty
- Test 0081: Simple `@itemprop-reverse` (experimental)
- Test 0082: `@itemprop-reverse` with `@itemscope` value (experimental)
- Test 0084: `@itemprop-reverse` with `@itemprop` (experimental)

---
[GitHub] any23 pull request #130: ANY23-410 fix microdata itemrefs
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/130 ANY23-410 fix microdata itemrefs This fixes the regression introduced in version 2.2 causing Any23 to ignore itemrefs. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-410 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/130.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #130 commit 13d04c7426b10c7bf982dacfc5cfd2bee2385b0e Author: Hans Date: 2018-10-25T22:45:09Z ANY23-410 fix microdata itemrefs ---
[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/124

## Update

Merging ANY23-309 into master resulted in a reduction of failed tests from 12 to 11. Failed tests are now as follows:

- Test 0062: `@itemref` to single id
- Test 0063: `@itemref` generates property values
- Test 0064: `@itemref` to single id with different types
- Test 0065: `@itemref` to multiple ids
- Test 0066: `@itemref` with chaining
- Test 0067: Shared `@itemref`
- Test 0073: Vocabulary Expansion test with rdfs:subPropertyOf
- Test 0074: Vocabulary Expansion test with owl:equivalentProperty
- Test 0081: Simple `@itemprop-reverse` (experimental)
- Test 0082: `@itemprop-reverse` with `@itemscope` value (experimental)
- Test 0084: `@itemprop-reverse` with `@itemprop` (experimental)

---
[GitHub] any23 pull request #129: ANY23-409 allow multiple microdata itemtype values
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/129 ANY23-409 allow multiple microdata itemtype values This PR addresses the failed `Test 0056: token property and multiple @itemtypes from different vocabularies` microdata test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-409 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/129.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #129 commit 8b951d8e06ed5ad941ec4ba452532bb93d04a057 Author: Hans Date: 2018-10-24T21:36:12Z ANY23-409 allow multiple microdata itemtype values ---
[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/124

## Update

Merging ANY23-408 into master and turning on the microdata "strict" flag resulted in a reduction of failed tests from 17 to 12. Failed tests are now as follows:

- Test 0056: token property and multiple `@itemtype`s from different vocabularies
- Test 0062: `@itemref` to single id
- Test 0063: `@itemref` generates property values
- Test 0064: `@itemref` to single id with different types
- Test 0065: `@itemref` to multiple ids
- Test 0066: `@itemref` with chaining
- Test 0067: Shared `@itemref`
- Test 0073: Vocabulary Expansion test with rdfs:subPropertyOf
- Test 0074: Vocabulary Expansion test with owl:equivalentProperty
- Test 0081: Simple `@itemprop-reverse` (experimental)
- Test 0082: `@itemprop-reverse` with `@itemscope` value (experimental)
- Test 0084: `@itemprop-reverse` with `@itemprop` (experimental)

---
[GitHub] any23 pull request #128: ANY23-408 Use document IRI as default namespace in ...
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/128 ANY23-408 Use document IRI as default namespace in microdata strict mode Currently, we just drop predicates that don't have a namespace in strict mode. This commit will align strict mode with the actual spec. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-408 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/128.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #128 commit a58d59e35da537b69820baf0cb6423fb3facea02 Author: Hans Date: 2018-10-24T19:16:57Z ANY23-408 Use document IRI as default namespace in microdata strict mode ---
[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/124

## Update

Merging ANY23-407 into master resulted in a reduction of failed tests from 18 to 17. Failed tests are now as follows:

- Test 0002: Item with no itemtype and 2 elements with equivalent itemprop
- Test 0003: Item with itemprop having two properties
- Test 0052: token property no `@itemtype`
- Test 0053: token property empty `@itemtype`
- Test 0054: token property and relative `@itemtype`
- Test 0056: token property and multiple `@itemtype`s from different vocabularies
- Test 0062: `@itemref` to single id
- Test 0063: `@itemref` generates property values
- Test 0064: `@itemref` to single id with different types
- Test 0065: `@itemref` to multiple ids
- Test 0066: `@itemref` with chaining
- Test 0067: Shared `@itemref`
- Test 0073: Vocabulary Expansion test with rdfs:subPropertyOf
- Test 0074: Vocabulary Expansion test with owl:equivalentProperty
- Test 0081: Simple `@itemprop-reverse` (experimental)
- Test 0082: `@itemprop-reverse` with `@itemscope` value (experimental)
- Test 0084: `@itemprop-reverse` with `@itemprop` (experimental)

---
[GitHub] any23 pull request #127: ANY23-407 allow microdata itemids from relative url...
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/127 ANY23-407 allow microdata itemids from relative urls You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-407 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/127.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #127 commit 9bab662c46107350417f61e1f7cbd3058809edf1 Author: Hans Date: 2018-10-24T17:00:51Z ANY23-407 allow microdata itemids from relative urls ---
[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/124

## Update

Merging ANY23-405 into master resulted in a reduction of failed tests from 28 to 18. Failed tests are now as follows:

- Test 0002: Item with no itemtype and 2 elements with equivalent itemprop
- Test 0003: Item with itemprop having two properties
- Test 0051: relative URL as itemid
- Test 0052: token property no `@itemtype`
- Test 0053: token property empty `@itemtype`
- Test 0054: token property and relative `@itemtype`
- Test 0056: token property and multiple `@itemtype`s from different vocabularies
- Test 0062: `@itemref` to single id
- Test 0063: `@itemref` generates property values
- Test 0064: `@itemref` to single id with different types
- Test 0065: `@itemref` to multiple ids
- Test 0066: `@itemref` with chaining
- Test 0067: Shared `@itemref`
- Test 0073: Vocabulary Expansion test with rdfs:subPropertyOf
- Test 0074: Vocabulary Expansion test with owl:equivalentProperty
- Test 0081: Simple `@itemprop-reverse` (experimental)
- Test 0082: `@itemprop-reverse` with `@itemscope` value (experimental)
- Test 0084: `@itemprop-reverse` with `@itemprop` (experimental)

---
[GitHub] any23 pull request #126: ANY23-405 Parse microdata property values correctly
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/126 ANY23-405 Parse microdata property values correctly See http://w3c.github.io/microdata-rdf/#dfn-property-values You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-405 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/126.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #126 commit f23c25cc23938aa27551426d38dd0139fd30b9f4 Author: Hans Date: 2018-10-24T15:35:10Z ANY23-405 Parse microdata property values correctly ---
[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/124

## Update

Merging ANY23-404 into master resulted in a reduction of failed tests from 30 to 28. Failed tests are now as follows:

- Test 0002: Item with no itemtype and 2 elements with equivalent itemprop
- Test 0003: Item with itemprop having two properties
- Test 0046: Use of time with `@datetime` xsd:time
- Test 0047: Use of time with `@datetime` xsd:dateTime
- Test 0048: Use of time with `@datetime` xsd:duration
- Test 0049: Use of time with `@datetime` invalid
- Test 0051: relative URL as itemid
- Test 0052: token property no `@itemtype`
- Test 0053: token property empty `@itemtype`
- Test 0054: token property and relative `@itemtype`
- Test 0056: token property and multiple `@itemtype`s from different vocabularies
- Test 0062: `@itemref` to single id
- Test 0063: `@itemref` generates property values
- Test 0064: `@itemref` to single id with different types
- Test 0065: `@itemref` to multiple ids
- Test 0066: `@itemref` with chaining
- Test 0067: Shared `@itemref`
- Test 0073: Vocabulary Expansion test with rdfs:subPropertyOf
- Test 0074: Vocabulary Expansion test with owl:equivalentProperty
- Test 0075: Use of data and xsd:float
- Test 0076: Use of data and xsd:integer
- Test 0077: Use of data and string
- Test 0078: Use of meter and xsd:double
- Test 0079: Use of meter and xsd:integer
- Test 0080: Use of meter and xsd:string
- Test 0081: Simple `@itemprop-reverse` (experimental)
- Test 0082: `@itemprop-reverse` with `@itemscope` value (experimental)
- Test 0084: `@itemprop-reverse` with `@itemprop` (experimental)

---
[GitHub] any23 pull request #125: ANY23-404 hardcode default microdata registry
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/125 ANY23-404 hardcode default microdata registry This PR should ensure that our microdata extractor is compliant with the standard default microdata registry in terms of vocabulary expansion and property URI generation. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-404 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/125.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #125 commit 6b1469152ccd30f93b0686a73bd1ba02955d6411 Author: Hans Date: 2018-10-24T00:37:37Z ANY23-404 hardcode default microdata registry ---
[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/124 @lewismc Yes and no. While we certainly don't need to address all of these test failures before the next release, I want to make sure that property URI generation works as expected for all namespaces in the default registry, at least. That should be a quick fix. ---
[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/121 Now that ANY23-396 has been implemented in #122 and merged into master, can we close this PR? @lewismc ? @jgrzebyta ? I don't have the required permissions to close issues myself. ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 Alright, as far as I'm concerned, all my cleanup is done here. @lewismc if you have no further comments, would you prefer I squash my commits before merging, or just merge everything in as-is? ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 On second thought, I'm thinking about making `TripleWriter` a class in `core` rather than an interface in `api`. It would have the same effect as before, except it would allow more freedom of implementation from the api perspective. ---
[GitHub] any23 pull request #124: ANY23-67 test against online microdata test-suite
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/124

ANY23-67 test against online microdata test-suite

I created a microdata unit test which tests against the latest online [microdata test suite](http://w3c.github.io/microdata-rdf/tests/). Currently, 30 out of 84 total tests are failing!

*Note: the tests are relaxed such that the expected model is only required to be a subset of the actual model, and not necessarily the other way around. Requiring strict isomorphism, on the other hand, causes 83 out of 84 tests to fail.*

**The 30 failing tests are as follows:**

- Test 0002: Item with no itemtype and 2 elements with equivalent itemprop
- Test 0003: Item with itemprop having two properties
- Test 0046: Use of time with `@datetime` xsd:time
- Test 0047: Use of time with `@datetime` xsd:dateTime
- Test 0048: Use of time with `@datetime` xsd:duration
- Test 0049: Use of time with `@datetime` invalid
- Test 0051: relative URL as itemid
- Test 0052: token property no `@itemtype`
- Test 0053: token property empty `@itemtype`
- Test 0054: token property and relative `@itemtype`
- Test 0056: token property and multiple `@itemtypes` from different vocabularies
- Test 0062: `@itemref` to single id
- Test 0063: `@itemref` generates property values
- Test 0064: `@itemref` to single id with different types
- Test 0065: `@itemref` to multiple ids
- Test 0066: `@itemref` with chaining
- Test 0067: Shared `@itemref`
- Test 0070: Property URI generation (default) 3
- Test 0071: Vocabulary Expansion test with schema:additionalType
- Test 0073: Vocabulary Expansion test with rdfs:subPropertyOf
- Test 0074: Vocabulary Expansion test with owl:equivalentProperty
- Test 0075: Use of data and xsd:float
- Test 0076: Use of data and xsd:integer
- Test 0077: Use of data and string
- Test 0078: Use of meter and xsd:double
- Test 0079: Use of meter and xsd:integer
- Test 0080: Use of meter and xsd:string
- Test 0081: Simple `@itemprop-reverse` (experimental)
- Test 0082: `@itemprop-reverse` with `@itemscope` value (experimental)
- Test 0084: `@itemprop-reverse` with `@itemprop` (experimental)

For more details on expected vs. actual statements, run the `MicrodataExtractorTest.runOnlineTests()` test.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-67 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/124.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #124 commit 2e3451ccaa36234a9dbd60ce783bf20501fc70c4 Author: Hans Date: 2018-10-23T02:49:46Z ANY23-67 test against online microdata test-suite ---
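Editor's sketch of the difference between the relaxed and strict comparisons mentioned in the note above. This is an illustration under simplifying assumptions, not the code in `MicrodataExtractorTest`: triples are modelled as plain strings and blank-node matching is ignored, whereas a real check would compare RDF models. All names here are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class RelaxedModelComparisonSketch {

    /** Relaxed check: every expected statement must appear in the actual output. */
    static boolean relaxedMatch(Set<String> expected, Set<String> actual) {
        return actual.containsAll(expected);
    }

    /** Strict check: both sets must contain exactly the same statements. */
    static boolean strictMatch(Set<String> expected, Set<String> actual) {
        return expected.equals(actual);
    }

    public static void main(String[] args) {
        Set<String> expected = new HashSet<>(Arrays.asList(
                "<doc> <http://schema.org/name> \"Any23\""));
        // The extractor emits the expected triple plus an extra one of its own.
        Set<String> actual = new HashSet<>(Arrays.asList(
                "<doc> <http://schema.org/name> \"Any23\"",
                "<doc> <http://example.org/extra> \"metadata\""));
        System.out.println("relaxed: " + relaxedMatch(expected, actual));
        System.out.println("strict:  " + strictMatch(expected, actual));
    }
}
```

Any extractor that emits additional statements beyond the expected ones passes the relaxed check and fails the strict one, which is consistent with the large gap between the two failure counts reported above.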
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 @lewismc If you don't have any more comments, I'm ready to merge this in. One question: would you prefer I squash the commits before merging, or not? ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 @lewismc the reason I ask is that my **cleanup item 4** allows me to specify a new method in `TripleWriter` which accepts a group as a `Resource` rather than as an `IRI`. That's what I've done in my latest cleanup commit. I'll leave this PR open for at least another day before merging to master just in case anyone comes up with any further comments or concerns. ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 @lewismc One quick question for you before I finish up here: according to RDF specifications, graph names are allowed to be blank nodes, but it appears that Any23 only supports graph names that are IRIs. (Whereas RDF4J supports graph names that are blank nodes *or* IRIs. It appears Any23 silently drops any parsed graph names that are blank nodes rather than IRIs.) Is there a reason for this? Any historical context you can give me on why Any23 opted to not support BNode graph names? Should we lean towards supporting this in the future? ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 @lewismc, great to hear! After I do a bit of last-minute cleanup, I will merge this PR.

Cleanup item 1: I'm renaming `TripleFormat.ExtendedCapabilities` to `TripleFormat.FineCapabilities`, as I think the first name is slightly misleading.

Cleanup item 2: For now, I'm removing the newly added support for configuring a writer's charset via `Settings` (although we could take another look at doing this in a future issue), for 3 reasons:

1. Some XML-based writers hard-code an "encoding=utf8" declaration which might then conflict with a user-supplied charset and produce an invalid document.
2. If the user sets a writer's charset to US-ASCII or similar, that could create a problem if the writer doesn't support escaping non-ASCII characters. (To my knowledge, only the `NTriplesWriter` and `NQuadsWriter` support this.)
3. The default charset for every existing writer is already UTF-8, and I can't think of a good reason to support anything else.

---
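Editor's sketch of the escaping problem raised in reason 2 above. It is not taken from Any23 or RDF4J. It assumes a writer that falls back to N-Triples-style `\uXXXX` and `\UXXXXXXXX` escapes for any code point the target charset cannot encode, and the class and method names are hypothetical. A writer without such a fallback would have to emit replacement characters or fail.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class AsciiEscapeSketch {

    /**
     * Replaces every code point that the charset cannot encode with an
     * N-Triples-style numeric escape. Escaping of backslashes and quotes,
     * which a real writer also needs, is deliberately left out.
     */
    static String escapeUnencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint);
            String symbol = text.substring(i, i + length);
            if (encoder.canEncode(symbol)) {
                out.append(symbol);
            } else if (codePoint <= 0xFFFF) {
                out.append(String.format("\\u%04X", codePoint));
            } else {
                out.append(String.format("\\U%08X", codePoint));
            }
            i += length;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String literal = "caf\u00e9";
        System.out.println(escapeUnencodable(literal, StandardCharsets.US_ASCII));
        System.out.println(escapeUnencodable(literal, StandardCharsets.UTF_8));
    }
}
```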
[GitHub] any23 pull request #122: ANY23-396 allow mapping/filtering TripleHandlers in...
Github user HansBrende commented on a diff in the pull request: https://github.com/apache/any23/pull/122#discussion_r226702275

--- Diff: api/src/main/java/org/apache/any23/configuration/Setting.java ---
@@ -118,7 +128,22 @@ private Type getValueType() {
         }
     }
 
-    protected abstract V checkedValue(Setting original, V newValue) throws Exception;
+    /**
+     * Subclasses may override this method to check that new settings for this key are valid,
+     * and/or to decorate new setting values, using, for example, {@link Collections#unmodifiableList(List)}.
+     * The default implementation of this method throws a {@link NullPointerException} if the new value is null and the initial value was non-null.
+     *
+     * @param initial the setting containing the initial value for this key, or null if the setting has not yet been initialized
+     * @param newValue the new value for this setting
+     * @return the new value for this setting
+     * @throws Exception if the new value for this setting was invalid
+     */
+    protected V checkedValue(Setting initial, V newValue) throws Exception {
+        if (newValue == null && initial != null && initial.value != null) {
+            throw new NullPointerException();
+        }
+        return newValue;
+    }
--- End diff --

Actually, we should not allow keys to decorate values. Consider the following scenario: user copies the value from one setting into another setting. Now the key is decorating a value that has *already been decorated*. This could lead to an unfortunate chain of, e.g.,

```
Collections.unmodifiableList(Collections.unmodifiableList(Collections.unmodifiableList(... )))
```

Therefore, any decorating should happen *before* the setting is created, and if the value is not appropriately decorated, the key should throw an exception in the value check. Also we should change this method's signature to:

```
protected void checkValue(Setting initial, V newValue) throws Exception
```
---
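Editor's sketch of the validate-only design proposed in the comment above. It is a simplified stand-in rather than the actual `Setting` class: `KeySketch`, `PositiveIntKeySketch`, and `validated` are hypothetical names, and the initial value is passed directly instead of as a `Setting`. The point it illustrates is that a `void` check cannot return a wrapped value, so copying a value from one setting to another can never stack decorators.

```java
abstract class KeySketch<V> {
    private final String name;

    KeySketch(String name) {
        this.name = name;
    }

    /** Validates only. Because the method returns nothing, it cannot decorate the value. */
    protected void checkValue(V initial, V newValue) throws Exception {
        if (newValue == null && initial != null) {
            throw new NullPointerException(name + " may not be reset to null");
        }
    }

    /** Runs the check and hands back the very same instance that was passed in. */
    final V validated(V initial, V newValue) {
        try {
            checkValue(initial, newValue);
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid value for " + name, e);
        }
        return newValue;
    }
}

final class PositiveIntKeySketch extends KeySketch<Integer> {
    PositiveIntKeySketch(String name) {
        super(name);
    }

    @Override
    protected void checkValue(Integer initial, Integer newValue) throws Exception {
        super.checkValue(initial, newValue);
        if (newValue != null && newValue <= 0) {
            throw new IllegalArgumentException("must be positive: " + newValue);
        }
    }

    public static void main(String[] args) {
        PositiveIntKeySketch key = new PositiveIntKeySketch("limit");
        System.out.println(key.validated(1, 5));
        try {
            key.validated(1, -1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```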
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 @lewismc I just added some unit tests & javadoc for the `Settings` API. Let me know your thoughts! ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 @lewismc do you have any concerns or questions regarding my latest commit? Would love to hear your thoughts. ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 **Implementation note:** I considered using the existing `Configuration` interface to construct `TripleWriter` instances, but it seemed rather limited, in that settings are only validated when they are first used, rather than *failing fast*, and they are all stored as strings rather than the actual parsed objects they represent. This is good for settings imported from a config file or loaded from the command line, but not very easy, type-safe, or performant for programmatic configuration. So instead, I created `Settings`, which could be considered a type-safe version of `Configuration`, or a *parsed* configuration. In the future, we could add the ability to create a `Settings` object *from* a `Configuration` object, given a set of supported settings and a configuration parser. In a future PR, I'm planning to implement a similar concept for `Rover`, so that a `Settings` object can be parsed from the command line for each writer. E.g., instead of having, simply: ``` --format mycustomdecorator,notrivial,turtle ``` we could do something like: ``` --format mycustomdecorator,notrivial;alwayssuppresscsstriples=true,turtle;prettyprint=true ``` ---
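To make the proposed `--format` syntax concrete, here is a minimal sketch of how such an argument could be split into writer ids and per-writer settings. The class and method names (`FormatSpecParser`, `WriterSpec`, `parse`) are hypothetical and not part of Any23 or this PR, and the `;key=value` syntax is only the one floated in the comment above. Values are kept as strings here; a real implementation would presumably convert them into typed `Settings` entries and reject unsupported keys.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for the proposed --format syntax; not part of Any23.
class FormatSpecParser {

    // One entry of the --format list: a writer id plus its key=value settings.
    static final class WriterSpec {
        final String id;
        final Map<String, String> settings;

        WriterSpec(String id, Map<String, String> settings) {
            this.id = id;
            this.settings = settings;
        }
    }

    // Splits "id1,id2;key=value,id3;key=value" into writer specs, preserving order.
    static List<WriterSpec> parse(String formatArg) {
        List<WriterSpec> specs = new ArrayList<>();
        for (String entry : formatArg.split(",")) {
            String[] parts = entry.trim().split(";");
            String id = parts[0].trim();
            if (id.isEmpty()) {
                throw new IllegalArgumentException("empty writer id in: " + formatArg);
            }
            Map<String, String> settings = new LinkedHashMap<>();
            for (int i = 1; i < parts.length; i++) {
                int eq = parts[i].indexOf('=');
                if (eq <= 0) {
                    // Fail fast on malformed settings instead of when they are first used.
                    throw new IllegalArgumentException("expected key=value but got: " + parts[i]);
                }
                settings.put(parts[i].substring(0, eq).trim(), parts[i].substring(eq + 1).trim());
            }
            specs.add(new WriterSpec(id, settings));
        }
        return specs;
    }

    public static void main(String[] args) {
        String arg = "mycustomdecorator,notrivial;alwayssuppresscsstriples=true,turtle;prettyprint=true";
        for (WriterSpec spec : parse(arg)) {
            System.out.println(spec.id + " " + spec.settings);
        }
    }
}
```

Rejecting malformed entries at parse time follows the fail-fast goal stated above, rather than deferring validation until a setting is first read.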
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 @lewismc I've implemented a few new things here for the new `TripleWriterFactory` (what I used to call `FormatWriterFactory`). Most important of these, in my opinion, being `Settings`, which allows you to configure writers. (Analogous to rdf4j's `RioSetting` api, but with several improvements.) The `Settings` capability will be able to replace the existing solution for ANY23-388 (PR #117 ). Also, we'll finally be able to allow users to turn off pretty printing if they so choose, or any other configuration option they desire. (E.g., when we upgrade to rdf4j 2.4.0, we can add a "hierarchical" settings option for the new hierarchical JSON-LD printing ability.) Then there's the new `TripleFormat` class, analogous to rdf4j's `RDFFormat` class with a few improvements (one being a "characteristics" flag which allows a much broader range of boolean characteristics to be specified than the 2 in `RDFFormat`.) I'm also deprecating the `FormatWriter` interface (which is nearly useless as it stands--and could be replaced in the future with a simple `AnnotatingDelegatingWriter`) in favor of the new `TripleWriter` interface (which extends `FormatWriter` for backwards compatibility, but introduces methods that are more useful). Let me know what you think! ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 @lewismc One possibility that I'm considering right now is using this opportunity to define our own `TripleFormat` class analogous to rdf4j's `RDFFormat` (such that any `TripleFormat` could be converted to a `RDFFormat` if desired), and then changing the method signature of `RDFFormat getFormat()` to `TripleFormat getFormat()`. The reason being: shouldn't all return types of methods (aside from the ubiquitous `IRI`, `BNode`, etc.) in new interfaces (e.g. `FormatWriterFactory`) be, preferably, part of our own API, rather than RDF4J's? Having our own `TripleFormat` class would give us more control over our own API. For example, suppose we were to add the following default method to the `TripleHandler` interface: ```java default void handleComment(String comment, ExtractionContext context) { //default implementation = do nothing } ``` And then we wanted to add a `supportsComments` flag to the format returned by `FormatWriterFactory.getFormat()` (which we could set to `true` for, e.g., the `TurtleWriter`). Well, if we're using RDF4J's `RDFFormat` class, we could log an issue in RDF4J asking them to add that additional parameter, but we're pretty much at their mercy. However, if we had our own `TripleFormat` class, we could add an additional `TripleFormat` constructor with a `boolean supportsComments` parameter (and a default value of `false`). What do you think about this? (The only reason I've hesitated so far about merging this PR is that once new interfaces are introduced as part of the core API, I'd prefer to never change them again--so I want to get it right the first time!) ---
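The compatibility argument above can be sketched in a few lines. Everything here is a simplified, hypothetical stand-in (`CommentAwareHandler`, `SketchTripleFormat`); the real `TripleHandler` interface has more methods and the eventual `TripleFormat` class may look quite different. The sketch only shows why a default method and an overloaded constructor let the API grow without breaking existing code.

```java
// Hypothetical, simplified stand-in for a triple handler; not the Any23 interface.
interface CommentAwareHandler {
    void receiveTriple(String s, String p, String o);

    // Added later as a default method, so existing implementations keep compiling.
    default void handleComment(String comment) {
        // default implementation = do nothing
    }
}

// Sketch of a project-owned format descriptor that can gain new flags over time.
final class SketchTripleFormat {
    private final String name;
    private final String mimeType;
    private final boolean supportsComments;

    // Original constructor: callers that predate the flag get the default (false).
    SketchTripleFormat(String name, String mimeType) {
        this(name, mimeType, false);
    }

    // Newer constructor carrying the additional flag.
    SketchTripleFormat(String name, String mimeType, boolean supportsComments) {
        this.name = name;
        this.mimeType = mimeType;
        this.supportsComments = supportsComments;
    }

    String getName() {
        return name;
    }

    String getMimeType() {
        return mimeType;
    }

    boolean supportsComments() {
        return supportsComments;
    }

    public static void main(String[] args) {
        SketchTripleFormat turtle = new SketchTripleFormat("Turtle", "text/turtle", true);
        SketchTripleFormat nquads = new SketchTripleFormat("N-Quads", "application/n-quads");
        System.out.println(turtle.getName() + " supports comments: " + turtle.supportsComments());
        System.out.println(nquads.getName() + " supports comments: " + nquads.supportsComments());
    }
}
```

With a third-party format class, adding such a flag would need an upstream change; with a class owned by the project it is one extra constructor.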
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 I've taken another look at the `RDFFormat` class, and it turns out that we don't really need the new method: `FileFormat getFormat()` because any `FileFormat` can be converted to an `RDFFormat` by constructing it with a `null` standard URI, and setting both "supports namespaces" and "supports contexts" to `false`. This should be applicable to any writer, even those that don't print out a standardized RDF format. E.g., in the `URIListWriter` class, "supports namespaces" and "supports contexts" are clearly false, since the class only writes out subjects; it does not write out predicates, objects, namespaces, or contexts. Therefore, I think I'm going to drop the new `FileFormat getFormat()` method and retain the `RDFFormat getRdfFormat()` method. ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 > @HansBrende. Sorry, I did not follow your latest commits. My proof of concept was done for different approach so I thought it would be useless in your one. I had thought I would need to write the new one. But if you reused so I am happy of that. You have still my +1. @jgrzebyta glad to hear I still have your +1! Yes, although I used your original unit tests, I did have to modify the way they were implemented. Here are the implementation changes I made: 1. `ExtractorsFlowTest` diff: https://www.diffchecker.com/pPGAQxE6 2. `PeopleExtractorFactory` diff: https://www.diffchecker.com/Mn7XTZOB 3. `PeopleExtractor` diff: https://www.diffchecker.com/x4du9RqE ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 @jgrzebyta thanks for your +1. Is the improved state of the javadoc to your liking? > I will add my unit in a separate ticket. I'm confused: are you referring to your existing PR #121 ? This PR is meant to be an alternative to that one. In my last commit, I have also added your [`ExtractorsFlowTest`](https://github.com/apache/any23/blob/f95f23865c0a7088e4ab1cbe507b8457fc90dda5/cli/src/test/java/org/apache/any23/cli/ExtractorsFlowTest.java) proof-of-concept to this PR to clarify that this PR provides at least as much functionality as #121 does. The only difference being: this PR uses the new `DelegatingWriterFactory` to accomplish the same behavior previously provided by `ModelExtractor` in #121. So if you approve this PR, my assumption would be that you prefer it over #121, and that #121 would be discarded. Any additional comments or concerns? Do I still have your +1? ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 I've added a proof-of-concept unit test, which deprecates the `--notrivial` flag and instead makes that the identifier for a `DelegatingWriterFactory`. Now you can simply specify: ```shell --format notrivial,nquads ``` @lewismc any additional comments or concerns? @jgrzebyta can you please verify whether or not this PR will satisfy your use-case for [ANY23-396](https://issues.apache.org/jira/browse/ANY23-396)? Any additional comments or concerns? ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 For the time being, I've opted for the third option for (4) (making the public methods I added to `WriterFactoryRegistry` become `private static` methods in `Rover`) so that I don't have to deal with that naming issue in this PR. If we want to add extra utility methods in `WriterFactoryRegistry`, that can be the subject of a different JIRA issue. However, I did have to fix a couple of synchronization issues in `WriterFactoryRegistry` to accomplish this: I noticed that iterating through the list of writer factories returned by `WriterFactoryRegistry.getWriters()` could potentially throw a `ConcurrentModificationException` even though that method was marked `synchronized` (because, unless I am mistaken, the underlying list implementation *can* be modified after access to the list is given to a caller and the method returns). To fix this problem, I changed the implementation of the backing list of writers from `ArrayList` to `CopyOnWriteArrayList`, which guarantees thread safety for iterators. Since writes to `CopyOnWriteArrayList` are relatively expensive, I also changed the logic around a bit to use *batch writing*, i.e., registering all `WriterFactory` instances at once in a `registerAll()` method, rather than through consecutive invocations of the `register()` method. Similar issues existed for the methods to retrieve identifiers and MIME types, which I fixed in the same manner. With this last commit, I am now satisfied, personally, with my implementation of ANY23-396. Anything else, @lewismc @jgrzebyta ? ---
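The iteration hazard described above can be reproduced deterministically in a single thread, since the fail-fast check in `ArrayList` iterators does not care which thread made the modification. The sketch below is only an illustration with plain strings standing in for writer factories; `RegistryIterationSketch` and the ids are hypothetical and the code is not taken from `WriterFactoryRegistry`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class RegistryIterationSketch {

    // Iterates over the list and registers one more element mid-iteration.
    // Returns true if the iterator failed with ConcurrentModificationException.
    static boolean failsWhenModifiedDuringIteration(List<String> writers) {
        try {
            for (String id : writers) {
                if (id.equals("turtle")) {
                    writers.add("late-registration");
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("turtle", "nquads", "json");

        // ArrayList iterators are fail-fast: the structural change is detected on the next step.
        List<String> plain = new ArrayList<>(ids);
        System.out.println("ArrayList failed: " + failsWhenModifiedDuringIteration(plain));

        // CopyOnWriteArrayList iterators work on a snapshot of the backing array.
        // addAll registers the whole batch with a single array copy.
        List<String> cow = new CopyOnWriteArrayList<>();
        cow.addAll(ids);
        System.out.println("CopyOnWriteArrayList failed: " + failsWhenModifiedDuringIteration(cow));
    }
}
```

Because every individual `add` on a `CopyOnWriteArrayList` copies the whole backing array, registering a batch through one `addAll` call is the cheaper pattern, which matches the batch-writing choice described in the comment.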
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 A third option for (4) is to simply defer this decision to another day by removing the methods I added to `WriterFactoryRegistry` and adding them directly to `Rover` as private methods. This option is also tempting. ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 Another (more drastic) option for (4) would be to deprecate `getWriters()` and `getWriterByIdentifier(String id)`, and create the replacement methods `getWriterFactories()` and `getWriterFactoryByIdentifier(String id)` (or simply, `getWriterFactory(String id)`.) Then we would be free to call writer instances "writers", and could leave the method names how they currently stand, namely: `getWriter(id, output)` and `getDefaultWriter(output)` ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 Thank you, @lewismc . The only item of concern left from my perspective is *naming*. Should any of the new `public` interfaces/methods I have created be named differently, or are they adequately descriptive as they currently stand? This decision should be made now, as there is no going back. Here follow the names of all the new `public` methods/interfaces I have created in this PR: 1. `public interface `**`FormatWriterFactory`** > Is this descriptive enough? It does specify a `FileFormat getFormat()` method, returning the format which will be written to the output stream, so the name does still make sense even though we now return a `TripleHandler` rather than a `FormatWriter` from the `getTripleWriter(OutputStream)` method. On the other hand, we could also call it `ContentWriterFactory` in line with the existing `ContentExtractor` interface (although I'm not sure if that would make it any more descriptive). Another possibility would be `OutputStreamWriterFactory`. 2. `public interface`**`DelegatingWriterFactory`** > Alternatives include `CompositeWriterFactory` or `FilterWriterFactory` (similar to `java.io.FilterOutputStream`). 3. `TripleHandler`**`getTripleWriter(Output)`** (specified in the `BaseWriterFactory` interface) > Alternatives include `getTripleHandler` or simply `getWriter`. I chose `getTripleWriter` over `getWriter` because it seemed more descriptive, and to avoid confusion with the `java.io.Writer` class. 4. `TripleHandler`**`getWriter(id, output)`** and `TripleHandler`**`getDefaultWriter(OutputStream)`** (specified in `WriterFactoryRegistry`). > This one is complicated by the fact that `WriterFactoryRegistry` already uses the term "writer" to refer to *`WriterFactory`* instances (e.g. `List<WriterFactory> getWriters()` and `WriterFactory getWriterByIdentifier(String id)`).
An easy alternative would be to take a hint from the existing, now-deprecated method `FormatWriter getWriterInstanceByIdentifier(id, output)` and use "**writerInstance**" to refer to a triple handler, i.e., `TripleHandler getWriterInstance(id, output)` and `getDefaultWriterInstance(OutputStream)`. Alternatively, we could use `getTripleWriter(id, output)` and `getDefaultTripleWriter(OutputStream)`. Any suggestions, or better names that I haven't thought of, @lewismc ? @jgrzebyta ? ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 As per part 4 of my last comment, in my most recent commit I've allowed `FormatWriterFactory` and `DelegatingWriterFactory` to extend the same (package-private) base interface specifying a single generic method: ```java interface BaseWriterFactory<Output> extends WriterFactory { TripleHandler getTripleWriter(Output o); } ``` I could have added this method directly to `WriterFactory` (with a default implementation of throwing `UnsupportedOperationException`), but since all instances of this interface *must* be instances of either `FormatWriterFactory` or `DelegatingWriterFactory` (since the interface is package-private), and all interaction with this method will be done by casting to one of these two interfaces, adding generic arguments to `WriterFactory` itself would have only added unnecessary verbosity (e.g., always having to specify `WriterFactory<?>` instead of `WriterFactory` to avoid rawtypes warnings). @lewismc any comments? ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 My last commit reflects the notes of interest I mentioned in my last comment. 1. Since `WriterFactory.getMimeType()` is redundant and I had to deprecate it anyway to make this PR work, I've simply opted to *not* un-deprecate it in the extending `FormatWriterFactory`. To retrieve the MIME type of a `FormatWriterFactory` instance, simply call `getFormat().getDefaultMIMEType()`. However, to keep new implementations of `FormatWriterFactory` backwards compatible with the older behavior, I've simply added the following default implementation of `getMimeType()` in the `FormatWriterFactory` interface: ```java @Override @Deprecated default String getMimeType() { return getFormat().getDefaultMIMEType(); } ``` 2. Since not all implementations of `FormatWriterFactory` print RDF triples (case in point: `URIListWriterFactory`), the deprecation of `WriterFactory.getRdfFormat()` presents us with the perfect opportunity to make the return type of `getRdfFormat()` more generic in `FormatWriterFactory` (namely, using `FileFormat`, the superclass of `RDFFormat`, instead of `RDFFormat`). To accomplish this, I've simply opted to *not* un-deprecate the `getRdfFormat()` method in the `FormatWriterFactory` interface, and instead, add the following method: ```java FileFormat getFormat(); ``` To keep everything backwards compatible with the previous behavior, I've added the following default implementation of `getRdfFormat()` to the `FormatWriterFactory` interface: ```java @Override @Deprecated default RDFFormat getRdfFormat() { FileFormat f = getFormat(); if (f instanceof RDFFormat) { return (RDFFormat)f; } else { throw new UnsupportedOperationException("This class does not print RDF triples."); } } ``` Now the `URIListWriterFactory` can utilize the method `getFormat()`, instead of its previous behavior of throwing a `RuntimeException`. 
To that effect, I've opted to return the following `FileFormat` from `URIListWriterFactory.getFormat()`: ```java private static final FileFormat FORMAT = new FileFormat("PLAINTEXT", "text/plain", StandardCharsets.UTF_8, "txt"); @Override public FileFormat getFormat() { return FORMAT; } ``` 3. Since the `FormatWriterFactory` interface is now not only tasked with `RDFFormat`s, but also arbitrary `FileFormat`s, deprecating the `WriterFactory.getRdfWriter(OutputStream)` method presents us with the perfect opportunity to choose a more appropriate name for this method in the subinterface `FormatWriterFactory`. To this effect, I've opted to simply *not* un-deprecate the `FormatWriterFactory.getRdfWriter(OutputStream)` method, and instead choose a more appropriate name. The name I've provisionally opted for is: ```java FormatWriter getFormatWriter(OutputStream); ``` To keep everything backwards compatible, I've added the following default implementation of `FormatWriterFactory.getRdfWriter(OutputStream)`: ```java @Override @Deprecated default FormatWriter getRdfWriter(OutputStream os) { return getFormatWriter(os); } ``` 4. Finally, I have one further question for discussion: We could use this deprecation opportunity to further genericize the `FormatWriterFactory.getFormatWriter(OutputStream)` method, replacing: ```java FormatWriter getFormatWriter(OutputStream); ``` with: ```java TripleHandler getWriter(OutputStream); ``` which would allow `FormatWriterFactory` implementations to return arbitrary `TripleHandler`s instead of forcing them to return the more specific (but arguably *not* more useful) `FormatWriter` implementations. Where behavior specific to `FormatWriter` is actually needed, e.g. `FormatWriter.isAnnotated()` (a method which is actually *never* used anywhere in Any23), a check could be added as follows: ```java boolean isAnnotated(TripleHandler writer) { return writer instanceof FormatWriter ? 
((FormatWriter)writer).isAnnotated() : false; } ``` One additional benefit of doing this would be that `DelegatingWriterFactory` and `FormatWriterFactory` could both then extend some base interface as follows: ``` interface BaseWriterFactory<Output> { TripleHandler getWriter(Output); } interface FormatWriterFactory extends BaseWriterFactory<OutputStream> { ... } interface DelegatingWriterFactory extends BaseWriterFactory<TripleHandler> { ... } ``` @lewismc Any comments? ---
[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/122 While we're deprecating things anyway, there are a few notes of interest which we should mull over before any merges into master happen here: 1. The method `WriterFactory.getMimeType()` appears to be redundant, as there also exists `WriterFactory.getRdfFormat().getDefaultMIMEType()`. 2. Also note the presence of `FileFormat`, the superclass of `RDFFormat`, which we could possibly use to make the `FormatWriterFactory` interface more generic (possibly helpful for the `URIListWriterFactory`, and other writer factories which similarly do not print RDF triples as output). 3. The method `WriterFactory.getRdfFormat()` is never actually used anywhere in the Any23 project. 4. `JSONWriterFactory` and `URIListWriterFactory` both throw `RuntimeException` in the `getRdfFormat()` method (the former questionably so, since there exists the `RDFFormat.RDFJSON` file format). ---
[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/121 @jgrzebyta please check out [this PR](https://github.com/apache/any23/pull/122), which is an implementation of what I've just described. It seems like a much simpler and less error-prone way to produce a domain-specific RDF graph. Eager to know your thoughts! ---
[GitHub] any23 pull request #122: ANY23-396 allow mapping/filtering TripleHandlers in...
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/122 ANY23-396 allow mapping/filtering TripleHandlers in Rover Here is one possible alternative to the existing PR for ANY23-396. **Pros:** 1. Fully backwards compatible 2. Extends `WriterFactory` with new `DelegatingWriterFactory` interface, which, rather than writing a `TripleHandler` to an output stream, writes a `TripleHandler` to another `TripleHandler`. This will allow users to produce a final domain-specific RDF graph of their choosing in Rover by providing mapping/filtering `DelegatingWriterFactory` implementations. 3. The `--format` flag in Rover now represents a list of WriterFactory ids, rather than a single WriterFactory id. Each id in the list is composed with the one previous to it to construct the final `TripleHandler`. All writers in the list, except the last, are required to implement `DelegatingWriterFactory`. **Cons:** 1. This solution requires deprecating 3 methods in the `WriterFactory` interface (and then un-deprecating them in the extending `FormatWriterFactory` interface). However, this drawback does not affect backwards compatibility. ## ALTERNATIVE In order to avoid the single "con" I have listed, the alternative to this solution would be, rather than extending the `WriterFactory` interface with `DelegatingWriterFactory`, to keep these two interfaces completely separate and define a new `DelegatingWriterFactoryRegistry` (analogous to the `WriterFactoryRegistry`) with a different `ServiceLoader` in order to load `DelegatingWriterFactory` implementations. @jgrzebyta @lewismc Thoughts? 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-396 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/122.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #122 commit cb293e22b9352652d91474cd3e35233e75dc9fb9 Author: Hans Date: 2018-09-14T15:29:33Z ANY23-396 allow mapping/filtering TripleHandlers in Rover ---
[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/121 @jgrzebyta One easy solution to my above comment that I can think of right off the bat is as follows: First, we could extend the WriterFactory interface as follows (or similar): ``` interface DelegatingWriterFactory extends WriterFactory { TripleHandler getWriter(TripleHandler delegate); } ``` Second, in the rover `--format` flag (which actually accepts a WriterFactory *id*, not necessarily a format name), we could simply allow a comma-separated *list* of WriterFactory ids rather than a single id. Then, to construct the final writer, we'd compose each writer in the list with the previous one, i.e.: ``` Collections.reverse(listOfIds); tripleHandler = getWriterFactoryForId(listOfIds.get(0)).getRdfWriter(outputStream); for (String id : listOfIds.subList(1, listOfIds.size())) { tripleHandler = ((DelegatingWriterFactory)getWriterFactoryForId(id)).getWriter(tripleHandler); } ``` This is just one initial idea, but food for thought. It also seems more in line with your concept of a "flow". What do you think? ---
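The composition loop sketched in the comment above can be made runnable with a few stand-in types. All names below (`WriterChainSketch`, `Handler`, `SinkFactory`, `DelegatingFactory`, and the ids `droptype` and `lines`) are hypothetical simplifications, not the Any23 interfaces; triples are plain strings and the output stream is a `StringBuilder`. The sketch also checks factory types explicitly instead of relying on a bare cast, which is an added assumption about how errors should surface.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WriterChainSketch {

    // Hypothetical minimal handler: one callback per triple.
    interface Handler {
        void receiveTriple(String s, String p, String o);
    }

    // Marker supertype, standing in for WriterFactory.
    interface Factory { }

    // Terminal factory: writes to an output.
    interface SinkFactory extends Factory {
        Handler getWriter(StringBuilder out);
    }

    // Intermediate factory: wraps another handler.
    interface DelegatingFactory extends Factory {
        Handler getWriter(Handler delegate);
    }

    // Builds the chain from right to left: the last id is the sink,
    // every earlier id must be a delegating factory wrapping what follows it.
    static Handler compose(List<String> ids, Map<String, Factory> registry, StringBuilder out) {
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("at least one writer id is required");
        }
        Factory last = lookup(registry, ids.get(ids.size() - 1));
        if (!(last instanceof SinkFactory)) {
            throw new IllegalArgumentException("last writer must write to an output");
        }
        Handler handler = ((SinkFactory) last).getWriter(out);
        for (int i = ids.size() - 2; i >= 0; i--) {
            Factory f = lookup(registry, ids.get(i));
            if (!(f instanceof DelegatingFactory)) {
                throw new IllegalArgumentException(ids.get(i) + " cannot delegate to another writer");
            }
            handler = ((DelegatingFactory) f).getWriter(handler);
        }
        return handler;
    }

    private static Factory lookup(Map<String, Factory> registry, String id) {
        Factory f = registry.get(id);
        if (f == null) {
            throw new IllegalArgumentException("unknown writer id: " + id);
        }
        return f;
    }

    public static void main(String[] args) {
        Map<String, Factory> registry = new HashMap<>();
        // Sink that prints one triple per line.
        registry.put("lines", (SinkFactory) sink -> (s, p, o) ->
                sink.append(s).append(' ').append(p).append(' ').append(o).append('\n'));
        // Filter that drops rdf:type statements and forwards everything else.
        registry.put("droptype", (DelegatingFactory) delegate -> (s, p, o) -> {
            if (!p.equals("rdf:type")) {
                delegate.receiveTriple(s, p, o);
            }
        });

        StringBuilder output = new StringBuilder();
        Handler chain = compose(Arrays.asList("droptype", "lines"), registry, output);
        chain.receiveTriple("ex:a", "rdf:type", "ex:Thing");
        chain.receiveTriple("ex:a", "ex:name", "A");
        System.out.print(output);
    }
}
```

Walking the list from the end avoids mutating the caller's list with `Collections.reverse`, but the resulting chain is the same: the first id on the command line sees each triple first.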
[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/121 @jgrzebyta But as far as rover goes, you're right: we currently don't have support for using an arbitrary triple handler. Looks like it expects an RDFFormat and then finds a triple handler based on that. I wonder if it would be possible to allow a more flexible way to specify triple handlers as rover arguments to fix this problem? While I don't think that creating a `ModelExtractor` as currently defined in this PR is the way to go, I do think that rover needs to be improved in this respect. I will think on this. ---
[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow
Github user HansBrende commented on a diff in the pull request: https://github.com/apache/any23/pull/121#discussion_r217159830 --- Diff: core/src/main/java/org/apache/any23/writer/BufferedTripleHandler.java --- @@ -0,0 +1,161 @@ +package org.apache.any23.writer; + +import com.google.common.base.Throwables; +import org.apache.any23.extractor.ExtractionContext; +import org.eclipse.rdf4j.model.IRI; +import org.eclipse.rdf4j.model.Model; +import org.eclipse.rdf4j.model.Resource; +import org.eclipse.rdf4j.model.Value; +import org.eclipse.rdf4j.model.impl.LinkedHashModelFactory; +import org.eclipse.rdf4j.model.impl.TreeModelFactory; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Map; +import java.util.Stack; +import java.util.TreeMap; + +/** + * Collects all statements until end document. + * + * All statements are kept within {@link Model}. + * + * @author Jacek Grzebyta (jgrzeb...@apache.org) + */ +public class BufferedTripleHandler implements TripleHandler { + +private static final Logger log = LoggerFactory.getLogger(BufferedTripleHandler.class); +private TripleHandler underlying; +private static boolean isDocumentFinish = false; + +private static class ContextHandler { +ContextHandler(ExtractionContext ctx, Model m) { +extractionContext = ctx; +extractionModel = m; +} +ExtractionContext extractionContext; +Model extractionModel; +} + +private static class WorkflowContext { +WorkflowContext(TripleHandler underlying) { +this.rootHandler = underlying; +} + + +Stack extractors = new Stack<>(); +Map modelMap = new TreeMap<>(); +IRI documentIRI = null; +TripleHandler rootHandler ; +} + +public BufferedTripleHandler(TripleHandler underlying) { +this.underlying = underlying; + +// hide model in the thread +WorkflowContext wc = new WorkflowContext(underlying); +BufferedTripleHandler.workflowContext.set(wc); +} + +private static final ThreadLocal workflowContext = new ThreadLocal<>(); + +/** + * Returns model which contains all other models. 
+ * @return + */ +public static Model getModel() { +return BufferedTripleHandler.workflowContext.get().modelMap.values().stream() +.map(ch -> ch.extractionModel) +.reduce(new LinkedHashModelFactory().createEmptyModel(), (mf, exm) -> { +mf.addAll(exm); +return mf; +}); +} + +@Override +public void startDocument(IRI documentIRI) throws TripleHandlerException { +BufferedTripleHandler.workflowContext.get().documentIRI = documentIRI; +} + +@Override +public void openContext(ExtractionContext context) throws TripleHandlerException { +// +} + +@Override +public void receiveTriple(Resource s, IRI p, Value o, IRI g, ExtractionContext context) throws TripleHandlerException { +getModelForContext(context).add(s,p,o,g); +} + +@Override +public void receiveNamespace(String prefix, String uri, ExtractionContext context) throws TripleHandlerException { +getModelForContext(context).setNamespace(prefix, uri); +} + +@Override +public void closeContext(ExtractionContext context) throws TripleHandlerException { +// +} + +@Override +public void endDocument(IRI documentIRI) throws TripleHandlerException { +BufferedTripleHandler.isDocumentFinish = true; +} + +@Override +public void setContentLength(long contentLength) { +underlying.setContentLength(contentLength); +} + +@Override +public void close() throws TripleHandlerException { +underlying.close(); +} + +/** + * Releases content of the model into underlying writer. + */ +public static void releaseModel() throws TripleHandlerException { +if(!BufferedTripleHandler.isDocumentFinish) { +throw new RuntimeException("Before releasing document should be finished."); +} + +WorkflowContext workflowContext = BufferedTripleHandler.workflowContext.get(); + +String lastExtractor = ((Stack) workflowContext.extractors).peek(); --- End diff -- @jgrzebyta IMHO, it would be vastly more straightforward to simply have the user extend the [`Composite
[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/121 Another thought: Using the TripleHandler interface (as intended) to transform triples, rather than a separate ModelExtractor, has the added advantage that the triples might not necessarily need to be stored in memory during the transformation process. The user could implement either a "collecting" triple handler which stores statements in memory prior to transforming them, or a "streaming" triple handler for transformation-on-the-fly (e.g., if mapping some predicate A to some other predicate B), or some combination of these two concepts. The "collecting" ability could be easily supplemented with a `ModelWriter` or equivalent, as in [ANY23-397](https://issues.apache.org/jira/browse/ANY23-397). But adding a separate "ModelExtractor" concept only muddles this already-existing ability to transform triples with TripleHandlers by introducing a redundant construct of more limited abstraction power than what already exists. So for me: -1 for ANY23-396 +1 for ANY23-397 @lewismc any thoughts? ---
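The difference between a "streaming" and a "collecting" handler can be shown with a simplified two-method handler interface. The types below (`TransformSketch.Handler`, `PredicateMapper`, `TypedSubjectsOnly`, `Recorder`) are hypothetical and much smaller than the real `TripleHandler`; the whole-document rule used by the collecting handler (keep only subjects that have an `rdf:type` statement) is just an example of a transformation that cannot be done on the fly.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TransformSketch {

    // Hypothetical minimal handler; the real TripleHandler has more callbacks.
    interface Handler {
        void receiveTriple(String s, String p, String o);
        void endDocument();
    }

    // Streaming: maps predicate A to predicate B on the fly and stores nothing.
    static final class PredicateMapper implements Handler {
        private final String from;
        private final String to;
        private final Handler delegate;

        PredicateMapper(String from, String to, Handler delegate) {
            this.from = from;
            this.to = to;
            this.delegate = delegate;
        }

        @Override
        public void receiveTriple(String s, String p, String o) {
            delegate.receiveTriple(s, p.equals(from) ? to : p, o);
        }

        @Override
        public void endDocument() {
            delegate.endDocument();
        }
    }

    // Collecting: buffers the document, then forwards only triples whose
    // subject has an rdf:type statement somewhere in that document.
    static final class TypedSubjectsOnly implements Handler {
        private final Handler delegate;
        private final List<String[]> buffer = new ArrayList<>();
        private final Set<String> typedSubjects = new HashSet<>();

        TypedSubjectsOnly(Handler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void receiveTriple(String s, String p, String o) {
            buffer.add(new String[] {s, p, o});
            if (p.equals("rdf:type")) {
                typedSubjects.add(s);
            }
        }

        @Override
        public void endDocument() {
            for (String[] t : buffer) {
                if (typedSubjects.contains(t[0])) {
                    delegate.receiveTriple(t[0], t[1], t[2]);
                }
            }
            buffer.clear();
            typedSubjects.clear();
            delegate.endDocument();
        }
    }

    // Terminal handler that records what it receives, for demonstration.
    static final class Recorder implements Handler {
        final List<String> lines = new ArrayList<>();

        @Override
        public void receiveTriple(String s, String p, String o) {
            lines.add(s + " " + p + " " + o);
        }

        @Override
        public void endDocument() {
            // nothing to flush in this sketch
        }
    }

    public static void main(String[] args) {
        Recorder sink = new Recorder();
        Handler chain = new PredicateMapper("ex:a", "ex:b", new TypedSubjectsOnly(sink));
        chain.receiveTriple("s1", "ex:a", "o1");
        chain.receiveTriple("s2", "ex:a", "o2");
        chain.receiveTriple("s1", "rdf:type", "ex:T");
        chain.endDocument();
        sink.lines.forEach(System.out::println);
    }
}
```

Only the collecting handler's memory use grows with the document; the streaming mapper stays constant, which is the advantage the comment attributes to transforming triples in handlers rather than in a separate model-based extractor.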
[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/121 Aside from the comments I've made on this PR, I'm still not convinced that having a ModelExtractor is a good idea in the first place. Why not just create a ModelWriter (as in ANY23-397) or an equivalent "collecting" TripleHandler, and then allow the end user to transform the collected statements however they wish? Having a ModelExtractor creates additional questions & complexities: in what order are the extractors executed? (Certainly the ModelExtractors would have to be executed last in order to have access to all previously collected statements.) What if multiple ModelExtractors are declared? Which ones have higher precedence in the extraction order? I'm not sure that having a dedicated ModelExtractor is worth the trouble of dealing with these complexities, when a user could accomplish the same thing by simply transforming the statements collected by a ModelWriter or equivalent, or defining their own filtering and/or mapping TripleHandler. ---
[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow
Github user HansBrende commented on a diff in the pull request: https://github.com/apache/any23/pull/121#discussion_r217044600 --- Diff: api/src/main/resources/default-configuration.properties --- @@ -76,3 +76,6 @@ any23.extraction.csv.comment=# # A confidence threshold for the OpenIE extractions # Any extractions below this value will not be processed. any23.extraction.openie.confidence.threshold=0.5 + +# Allows to enable(on)/disable(off) the workflow feature +any23.extraction.workflows=off --- End diff -- No extra flag should be needed for this. ---
[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow
Github user HansBrende commented on a diff in the pull request: https://github.com/apache/any23/pull/121#discussion_r217044167 --- Diff: cli/src/main/java/org/apache/any23/cli/Rover.java --- @@ -172,6 +174,8 @@ protected void configure() { defaultns); } + extractionParameters.setFlag(ExtractionParameters.EXTRACTION_WORKFLOWS_FLAG, workflow); --- End diff -- We should not need a separate flag to enable certain extractors. If an extractor is contained within the extractor group we are using, then that should be, on its own, enough to enable itself. ---
[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow
Github user HansBrende commented on a diff in the pull request: https://github.com/apache/any23/pull/121#discussion_r217042421 --- Diff: core/src/main/java/org/apache/any23/writer/BufferedTripleHandler.java --- @@ -0,0 +1,161 @@ +package org.apache.any23.writer; + +import com.google.common.base.Throwables; +import org.apache.any23.extractor.ExtractionContext; +import org.eclipse.rdf4j.model.IRI; +import org.eclipse.rdf4j.model.Model; +import org.eclipse.rdf4j.model.Resource; +import org.eclipse.rdf4j.model.Value; +import org.eclipse.rdf4j.model.impl.LinkedHashModelFactory; +import org.eclipse.rdf4j.model.impl.TreeModelFactory; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Map; +import java.util.Stack; +import java.util.TreeMap; + +/** + * Collects all statements until end document. + * + * All statements are kept within {@link Model}. + * + * @author Jacek Grzebyta (jgrzeb...@apache.org) + */ +public class BufferedTripleHandler implements TripleHandler { + +private static final Logger log = LoggerFactory.getLogger(BufferedTripleHandler.class); +private TripleHandler underlying; +private static boolean isDocumentFinish = false; + +private static class ContextHandler { +ContextHandler(ExtractionContext ctx, Model m) { +extractionContext = ctx; +extractionModel = m; +} +ExtractionContext extractionContext; +Model extractionModel; +} + +private static class WorkflowContext { +WorkflowContext(TripleHandler underlying) { +this.rootHandler = underlying; +} + + +Stack extractors = new Stack<>(); +Map modelMap = new TreeMap<>(); +IRI documentIRI = null; +TripleHandler rootHandler ; +} + +public BufferedTripleHandler(TripleHandler underlying) { +this.underlying = underlying; + +// hide model in the thread +WorkflowContext wc = new WorkflowContext(underlying); +BufferedTripleHandler.workflowContext.set(wc); +} + +private static final ThreadLocal workflowContext = new ThreadLocal<>(); + +/** + * Returns model which contains all other models. 
+ * @return + */ +public static Model getModel() { +return BufferedTripleHandler.workflowContext.get().modelMap.values().stream() +.map(ch -> ch.extractionModel) +.reduce(new LinkedHashModelFactory().createEmptyModel(), (mf, exm) -> { +mf.addAll(exm); +return mf; +}); +} + +@Override +public void startDocument(IRI documentIRI) throws TripleHandlerException { +BufferedTripleHandler.workflowContext.get().documentIRI = documentIRI; +} + +@Override +public void openContext(ExtractionContext context) throws TripleHandlerException { +// +} + +@Override +public void receiveTriple(Resource s, IRI p, Value o, IRI g, ExtractionContext context) throws TripleHandlerException { +getModelForContext(context).add(s,p,o,g); +} + +@Override +public void receiveNamespace(String prefix, String uri, ExtractionContext context) throws TripleHandlerException { +getModelForContext(context).setNamespace(prefix, uri); +} + +@Override +public void closeContext(ExtractionContext context) throws TripleHandlerException { +// +} + +@Override +public void endDocument(IRI documentIRI) throws TripleHandlerException { +BufferedTripleHandler.isDocumentFinish = true; +} + +@Override +public void setContentLength(long contentLength) { +underlying.setContentLength(contentLength); +} + +@Override +public void close() throws TripleHandlerException { +underlying.close(); +} + +/** + * Releases content of the model into underlying writer. + */ +public static void releaseModel() throws TripleHandlerException { +if(!BufferedTripleHandler.isDocumentFinish) { +throw new RuntimeException("Before releasing document should be finished."); +} + +WorkflowContext workflowContext = BufferedTripleHandler.workflowContext.get(); + +String lastExtractor = ((Stack) workflowContext.extractors).peek(); --- End diff -- Feels hacky... what if not all of the triples came from the same extractor? ---
[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow
Github user HansBrende commented on a diff in the pull request: https://github.com/apache/any23/pull/121#discussion_r217041561 --- Diff: core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java --- @@ -483,6 +488,14 @@ private SingleExtractionReport runExtractor( documentReport.getDocument(), extractionResult ); +} else if (extractor instanceof ModelExtractor) { +final ModelExtractor modelExtractor = (ModelExtractor) extractor; +final Model singleModel = BufferedTripleHandler.getModel(); --- End diff -- Should not be static. ---
[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow
Github user HansBrende commented on a diff in the pull request: https://github.com/apache/any23/pull/121#discussion_r217041300 --- Diff: core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java --- @@ -295,6 +294,12 @@ public SingleDocumentExtractionReport run(ExtractionParameters extractionParamet } finally { try { output.endDocument(documentIRI); + + // in case of workflow flag release data from model +if (extractionParameters.getFlag(ExtractionParameters.EXTRACTION_WORKFLOWS_FLAG)) { +BufferedTripleHandler.releaseModel(); --- End diff -- This should not be a static call. ---
[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow
Github user HansBrende commented on a diff in the pull request: https://github.com/apache/any23/pull/121#discussion_r217037875 --- Diff: core/src/main/java/org/apache/any23/writer/BufferedTripleHandler.java --- @@ -0,0 +1,161 @@ +package org.apache.any23.writer; + +import com.google.common.base.Throwables; +import org.apache.any23.extractor.ExtractionContext; +import org.eclipse.rdf4j.model.IRI; +import org.eclipse.rdf4j.model.Model; +import org.eclipse.rdf4j.model.Resource; +import org.eclipse.rdf4j.model.Value; +import org.eclipse.rdf4j.model.impl.LinkedHashModelFactory; +import org.eclipse.rdf4j.model.impl.TreeModelFactory; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Map; +import java.util.Stack; +import java.util.TreeMap; + +/** + * Collects all statements until end document. + * + * All statements are kept within {@link Model}. + * + * @author Jacek Grzebyta (jgrzeb...@apache.org) + */ +public class BufferedTripleHandler implements TripleHandler { + +private static final Logger log = LoggerFactory.getLogger(BufferedTripleHandler.class); +private TripleHandler underlying; +private static boolean isDocumentFinish = false; + +private static class ContextHandler { +ContextHandler(ExtractionContext ctx, Model m) { +extractionContext = ctx; +extractionModel = m; +} +ExtractionContext extractionContext; +Model extractionModel; +} + +private static class WorkflowContext { +WorkflowContext(TripleHandler underlying) { +this.rootHandler = underlying; +} + + +Stack extractors = new Stack<>(); +Map modelMap = new TreeMap<>(); +IRI documentIRI = null; +TripleHandler rootHandler ; +} + +public BufferedTripleHandler(TripleHandler underlying) { +this.underlying = underlying; + +// hide model in the thread +WorkflowContext wc = new WorkflowContext(underlying); +BufferedTripleHandler.workflowContext.set(wc); +} + +private static final ThreadLocal workflowContext = new ThreadLocal<>(); --- End diff -- Model should not be static, unless there is 
a very good reason for doing so? ---
[GitHub] any23 issue #120: ANY23-393 Any23 master to build under JDK 10.X
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/120 @lewismc +1. Just out of curiosity, what is the new `javax.xml.bind:jaxb-api` dependency for? ---
[GitHub] any23 issue #118: ANY23-390 implement ICal, JCal, and XCal extractors
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/118 @lewismc Glad you like it! Merged to master. Please let me know if you discover any issues. ---
[GitHub] any23 pull request #118: ANY23-390 implement ICal, JCal, and XCal extractors
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/118 ANY23-390 implement ICal, JCal, and XCal extractors This is my first stab at implementing the ical, jcal, and xcal extractors. @lewismc Any input? You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-390 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/118.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #118 commit 54a92960ac2fda9510041b6886eb7259a9b1220b Author: Hans Date: 2018-08-21T16:37:35Z ANY23-390 implement ICal, JCal, and XCal extractors ---
[GitHub] any23 issue #116: Any23 388
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/116 @larsgsvensson it looks like you may have rebased in the wrong direction? One problem is that your master branch is 3 commits ahead of the apache/master branch. So here's what I would do (before making any more commits):
```bash
git checkout master
git reset --hard HEAD~3
git pull https://github.com/apache/any23.git master
git push -f origin master
git checkout ANY23-388
git reset --hard HEAD~5
git rebase master
git push -f origin ANY23-388
```
After you do that, then make your changes to the ANY23-388 branch. Then:
```bash
git add .
git commit -m "ANY23-388 [message]"
git push origin ANY23-388
```
Then we should be good to go. ---
[GitHub] any23 pull request #116: Any23 388
Github user HansBrende commented on a diff in the pull request: https://github.com/apache/any23/pull/116#discussion_r210745076 --- Diff: core/src/main/java/org/apache/any23/writer/RDFWriterTripleHandler.java --- @@ -35,7 +35,7 @@ */ public abstract class RDFWriterTripleHandler implements FormatWriter, TripleHandler { -private final RDFWriter writer; +protected final RDFWriter writer; --- End diff -- @jgrzebyta I don't see any reason to disallow protected access to the `writer` if we allow protected access to the constructor. Subclasses could bypass the `private` modifier anyway by simply saving a reference to the writer in the constructor. ---
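To illustrate the point about `private` offering no real protection here, a small sketch with hypothetical stand-in classes (`SinkWriter`, `BaseHandler`, `SubHandler`; these are not the actual `RDFWriterTripleHandler` or rdf4j `RDFWriter`). Assuming the base class takes the writer through a protected constructor, a subclass can simply keep its own reference:

```java
// Hypothetical stand-ins; not the actual Any23 or rdf4j classes.
interface SinkWriter {
    String name();
}

abstract class BaseHandler {
    private final SinkWriter writer; // private in the base class...

    protected BaseHandler(SinkWriter writer) { // ...but handed in via a protected constructor
        this.writer = writer;
    }

    String baseWriterName() {
        return writer.name();
    }
}

final class SubHandler extends BaseHandler {
    private final SinkWriter sameWriter; // the subclass keeps its own reference

    SubHandler(SinkWriter writer) {
        super(writer);
        this.sameWriter = writer; // same object the base class holds privately
    }

    SinkWriter writer() {
        return sameWriter;
    }
}
```

The subclass ends up holding exactly the object the base class tried to hide, so making the field `protected` only removes the need for the workaround.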
[GitHub] any23 issue #116: Any23 388
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/116 @larsgsvensson Thanks for the pull request! It looks like some of your commits reverse recent changes made to the master branch. You might want to just start over with a clean pull from master. ---
[GitHub] any23 issue #104: Any23 295: Implement ability to use librdfa
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/104 Ok, I added some whitespace to a package-info.java file, should be showing up now. ---
[GitHub] any23 issue #104: Any23 295: Implement ability to use librdfa
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/104 Ok, apparently due to some quirkiness of the way mirroring works, new branches in git-wip will not show up on github until an actual change is made to the branch. ---
[GitHub] any23 issue #104: Any23 295: Implement ability to use librdfa
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/104 It is showing up here: https://git-wip-us.apache.org/repos/asf?p=any23.git But not here: https://github.com/apache/any23/branches ---
[GitHub] any23 issue #104: Any23 295: Implement ability to use librdfa
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/104 @lewismc I attempted to create a new branch using:
```
git checkout -b ANY23-295
git push canonical ANY23-295
```
to which git responded with:
```
To https://git-wip-us.apache.org/repos/asf/any23.git
 * [new branch]      ANY23-295 -> ANY23-295
```
However, it isn't showing up under "Branches". Not sure why. It's showing up under my own "Branches" when I pushed to `origin`. ---
[GitHub] any23 issue #104: Any23 295: Implement ability to use librdfa
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/104 @JulioCCBUcuenca Alright, sounds great then! ---
[GitHub] any23 issue #104: Any23 295: Implement ability to use librdfa
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/104 @JulioCCBUcuenca Great to hear it passes the semargl test suites! However, `AbstractRDFaExtractorTestCase` contains only a fraction of the tests we run on RDFa. You should also test against the [RDFaExtractorTest](https://github.com/apache/any23/blob/master/core/src/test/java/org/apache/any23/extractor/rdfa/RDFaExtractorTest.java) and [RDFa11ExtractorTest](https://github.com/apache/any23/blob/master/core/src/test/java/org/apache/any23/extractor/rdfa/RDFa11ExtractorTest.java) classes. ---
[GitHub] any23 issue #104: Any23 295: Implement ability to use librdfa
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/104 `Current XML parsers can use both lang or xml:lang, but since librdfa uses an old library for parsing XML it generates an error since it cannot identify the language.` That seems worrisome to me... considering then, that most html pages will break the librdfa parser. @lewismc should we not test more thoroughly *before* merging to master? Maybe a separate branch instead? Also, I would think that all of our current rdfa parsing tests should be duplicated for the librdfa parser to ensure that it is *at least as stable as our current rdfa parser*. TBH, I'd be in favor of adding this into version 2.4 rather than 2.3 so we have more time to thoroughly test the module. ---
[GitHub] any23 pull request #115: ANY23-385 improve encoding detection
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/115 ANY23-385 improve encoding detection
1. Increase default sniff limit for text charset detection from 12000 bytes to 65536 bytes
2. Include results of xml declaration encoding detection
3. Include results of html meta charset encoding detection
mvn clean test -> all tests passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-385 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/115.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #115 commit 22b3047d55f5e5b8fcba9c912424c9ed45313163 Author: Hans Date: 2018-08-05T23:39:01Z ANY23-385 improve encoding detection ---
[GitHub] any23 pull request #114: ANY23-383 allow all unicode space chars in JSON-LD
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/114 ANY23-383 allow all unicode space chars in JSON-LD mvn clean test -> all tests passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-383 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/114.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #114 commit 0df8cdba68fea0c6dcf819759627759c7597f0cb Author: Hans Date: 2018-08-04T05:47:16Z ANY23-383 allow all unicode space characters in JSON-LD ---
[GitHub] any23 pull request #113: ANY23-382 don't kill extraction on fatal json parsi...
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/113 ANY23-382 don't kill extraction on fatal json parsing errors mvn clean test -> all tests passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-382 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/113.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #113 commit 837f92b9167d7944dbc88a965d6e17cf22b375e0 Author: Hans Date: 2018-08-03T21:06:15Z ANY23-382 don't kill extraction on fatal json parsing errors ---
[GitHub] any23 pull request #112: ANY23-381 escape illegal characters in JSON-LD stri...
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/112 ANY23-381 escape illegal characters in JSON-LD strings mvn clean test -> all tests passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-381 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/112.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #112 commit 817e744af90d8f3c9bf419e5c395c421e0c3924a Author: Hans Date: 2018-08-02T21:33:36Z ANY23-381 fix illegal unescaped characters in JSON-LD ---
[GitHub] any23 pull request #110: ANY23-380 disallow duplicate attribute keys
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/110 ANY23-380 disallow duplicate attribute keys I disallowed duplicate attribute keys in html to avoid `org.xml.sax.SAXParseException`s. Along the way, I also cleaned up some annoying or unnecessary logging/console output produced by our massive suite of test cases. Also cleaned up some javadoc/miscellaneous items. mvn clean test -> all tests passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-380 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/110.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #110 commit 4e3011a4d80545f04563f427687f4fa74e17103f Author: Hans Date: 2018-08-01T21:06:55Z ANY23-380 disallow duplicate attribute keys commit 159aeb489473f600213142a746d39a49e3d3548b Author: Hans Date: 2018-08-02T17:46:44Z cleaned up annoying logging/console output commit 0291f588d04859053ef4eb8845686bad824b4461 Author: Hans Date: 2018-08-02T18:01:19Z added license and javadoc ---
[GitHub] any23 pull request #109: ANY23-379 remove invalid XML characters from docume...
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/109 ANY23-379 remove invalid XML characters from document mvn clean test -> all tests passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-379 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/109.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #109 commit 36fba681e4295b65faf61b251308ecd8d2aa6771 Author: Hans Date: 2018-08-01T18:46:14Z ANY23-379 remove invalid XML characters from document ---
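For reference, a minimal sketch of what "removing invalid XML characters" can look like, assuming the filter follows the `Char` production of XML 1.0. The class and method names (`XmlCharFilter`, `stripInvalid`) are hypothetical, and this is not necessarily how the pull request implements it.

```java
final class XmlCharFilter {
    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point outside the Char production dropped. */
    static String stripInvalid(String in) {
        StringBuilder out = new StringBuilder(in.length());
        in.codePoints()
                .filter(XmlCharFilter::isValidXmlChar)
                .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Working on code points rather than `char` values means valid supplementary characters are kept, while unpaired surrogates are dropped along with control characters such as U+0000.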
[GitHub] any23 pull request #108: ANY23-378 clean commas in JSON-LD
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/108 ANY23-378 clean commas in JSON-LD Remove trailing commas from objects and arrays. Also replace semicolons with commas (compare to gson's `JsonReader.setLenient()`). mvn clean test -> all tests passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-378 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/108.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #108 commit aae21370e70715f82f7cc868b9a298f1178d0f80 Author: Hans Date: 2018-08-01T16:25:21Z ANY23-378 clean commas in JSON-LD ---
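A rough sketch of the two clean-up steps named above (dropping trailing commas before `}` or `]`, and treating `;` as `,`), under the assumption that both should apply only outside string literals. `JsonCommaCleaner` is a hypothetical name and this is not the code from the pull request.

```java
final class JsonCommaCleaner {
    static String clean(String json) {
        StringBuilder out = new StringBuilder(json.length());
        boolean inString = false;
        boolean escaped = false;
        int pendingComma = -1; // position in 'out' of a comma not yet followed by a value

        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                // Copy string contents untouched, tracking escapes to find the closing quote.
                out.append(c);
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            if (c == ';') {
                c = ','; // lenient separator
            }
            if (c == ',') {
                pendingComma = out.length();
                out.append(c);
            } else if (c == '}' || c == ']') {
                if (pendingComma >= 0) {
                    out.deleteCharAt(pendingComma); // trailing comma: drop it
                }
                pendingComma = -1;
                out.append(c);
            } else {
                if (!Character.isWhitespace(c)) {
                    pendingComma = -1; // a value follows, so the comma is legitimate
                }
                if (c == '"') {
                    inString = true;
                }
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, `{"a":1,}` becomes `{"a":1}`, while a semicolon or comma inside a quoted string is left alone.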
[GitHub] any23 pull request #107: ANY23-377 don't replace empty strings with 'Null'
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/107 ANY23-377 don't replace empty strings with 'Null' mvn clean test -> all tests passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-377 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/107.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #107 commit a07d1f058fcdc2d994dcd220759310737fe68965 Author: Hans Date: 2018-07-31T21:37:25Z ANY23-377 don't replace empty strings with 'Null' ---
[GitHub] any23 pull request #106: ANY23-376 fix IllegalArgumentException in microdata...
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/106 ANY23-376 fix IllegalArgumentException in microdata extractor mvn clean test -> all tests passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-376 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/106.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #106 commit 6173637bb801da62b07b69be64fa2c75f8d54904 Author: Hans Date: 2018-07-31T20:35:55Z ANY23-376 fix IllegalArgumentException in microdata extractor ---
[GitHub] any23 pull request #105: ANY23-374 fix schemeless microdata urls
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/105 ANY23-374 fix schemeless microdata urls Fixes microdata itemtype urls that are lacking a scheme by using a default scheme of "http". mvn clean test -> all tests passed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-374 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/105.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #105 commit d283d70ceb692cacb1f31659ee5d5c987822028f Author: Hans Date: 2018-07-31T17:21:26Z ANY23-374 fix schemeless microdata urls ---
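One way the "default scheme of http" rule could be sketched, using only `java.net.URI`. The helper name `withDefaultScheme` is hypothetical, and the handling of protocol-relative values (`//host/path`) is an assumption rather than something stated in the pull request.

```java
import java.net.URI;

final class SchemeDefaults {
    /** Returns the itemtype unchanged if it has a scheme, else prefixes "http". */
    static String withDefaultScheme(String itemtype) {
        String s = itemtype.trim();
        URI uri = URI.create(s); // throws IllegalArgumentException on malformed input
        if (uri.getScheme() != null) {
            return s;
        }
        // Assumption: protocol-relative references only lack the "http:" part.
        if (s.startsWith("//")) {
            return "http:" + s;
        }
        return "http://" + s;
    }
}
```

A caveat of this sketch: a value such as `localhost:8080/x` parses with `localhost` as its scheme and is therefore returned unchanged.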
[GitHub] any23 issue #102: ANY23-367 update 'latest.stable.released' property
Github user HansBrende commented on the issue: https://github.com/apache/any23/pull/102 @lewismc Any comments or am I good to merge this? ---
[GitHub] any23 pull request #103: ANY23-369 Resolved overlapping dependencies
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/103 ANY23-369 Resolved overlapping dependencies Taking a hint from the [`tika-parsers` pom](https://github.com/apache/tika/blob/master/tika-parsers/pom.xml) and the [`rdf4j-rio-jsonld` pom](https://github.com/eclipse/rdf4j/blob/master/rio/jsonld/pom.xml), I excluded the following libraries from the project:
- `stax:stax-api`
- `org.apache.httpcomponents:fluent-hc`
- `org.apache.httpcomponents:httpcore-nio`
- `org.apache.httpcomponents:httpcore-osgi`
- `org.apache.httpcomponents:httpclient-osgi`
mvn clean test -> all tests passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-369 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/103.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #103 commit 0259c695cc9e75fd0156018976391bab04d4d3c1 Author: Hans Date: 2018-07-18T19:11:32Z ANY23-369 Resolved overlapping dependencies ---
[GitHub] any23 pull request #102: ANY23-367 update 'latest.stable.released' property
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/102 ANY23-367 update 'latest.stable.released' property @lewismc anything else I need to do here to ensure this refactor works properly? You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-367 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/102.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #102 commit 950631873fec6e931859ea22b3beb91577164b25 Author: Hans Date: 2018-07-16T14:38:21Z ANY23-367 update 'latest.stable.released' property ---
[GitHub] any23 pull request #101: ANY23-366 resolved additional build warnings
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/101 ANY23-366 resolved additional build warnings
1. Excluded `commons-logging` from dependencies to ensure `jcl-over-slf4j` works as expected
2. Changed deprecated 'name' tag to 'id' in `appassembler-maven-plugin`
You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-366 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/101.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #101 commit 8cd464be02e701b8e5d05d6f12bc3e22a6f0b0b4 Author: Hans Date: 2018-07-12T19:38:09Z ANY23-366 excluded commons-logging from dependencies commit 8d7b4fd67e26bc9ab07af0e26bc002c35b0c6176 Author: Hans Date: 2018-07-12T19:41:46Z ANY23-366 changed 'name' to 'id' in appassembler-maven-plugin ---
[GitHub] any23 pull request #100: ANY23-365 resolved additional warnings
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/100 ANY23-365 resolved additional warnings Resolved the following warnings:
- Annotation `Author.class` is not retained for reflective access (in `o.a.a.cli.PluginVerifier`)
- `o.a.a.cli.PluginVerifier` uses unchecked or unsafe operations
- `sun.security.validator.ValidatorException` is internal proprietary API and may be removed in a future release (in `o.a.a.servlet.WebResponder`)
mvn clean test -> all tests passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-365 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/100.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #100 commit 3f87cf3a8ca51650376d7f111613fe0c1eda74d5 Author: Hans Date: 2018-07-11T20:53:05Z ANY23-365 resolved additional warnings ---
[GitHub] any23 pull request #99: ANY23-364 resolved POI deprecation warnings
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/99 ANY23-364 resolved POI deprecation warnings mvn clean test -> all tests passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-364 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/99.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #99 commit 5a2613b848b317c54381bcc8d7b23ca1e27e3725 Author: Hans Date: 2018-07-11T20:10:46Z ANY23-364 resolved POI deprecation warnings ---
[GitHub] any23 pull request #98: ANY23-363 updated httpclient/httpcore to 4.5.6/4.4.1...
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/98 ANY23-363 updated httpclient/httpcore to 4.5.6/4.4.10 mvn clean test -> all tests passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-363 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/98.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #98 commit 40619343dd8876dc447ea49ae952af33899b008f Author: Hans Date: 2018-07-11T18:28:11Z ANY23-363 updated httpclient/httpcore to 4.5.6/4.4.10 ---
[GitHub] any23 pull request #97: ANY23-362 resolved rdf4j deprecation warnings
GitHub user HansBrende opened a pull request: https://github.com/apache/any23/pull/97 ANY23-362 resolved rdf4j deprecation warnings
1. resolved rdf4j deprecation warnings
2. refactored for code style and improved singleton pattern in `RDFParserFactory`
3. updated no-arg constructors in `RDFXMLExtractor` and `TriXExtractor` to match their javadoc specs, along with the behavior of all the other RDF extractor classes' no-arg constructors
mvn clean test -> all tests passed @lewismc any comments before I merge this in? You can merge this pull request into a Git repository by running: $ git pull https://github.com/HansBrende/any23 ANY23-362 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/any23/pull/97.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #97 commit 59091ef08aadc30e64abdff1a4b17cf81c2b6bbd Author: Hans Date: 2018-07-11T16:13:30Z ANY23-362 resolved rdf4j deprecation warnings ---