[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...

2018-11-14 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/132
  
No objections from me! 
+1.


---


[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...

2018-11-13 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/132
  
See also: https://issues.apache.org/jira/browse/CXF-7899


---


[GitHub] any23 issue #131: ANY23-418 improve TikaEncodingDetector

2018-11-11 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/131
  
@lewismc I've simplified the code a lot so it should be a whole lot easier 
to see what's going on now.

Also, I improved the UTF-8 detector by reverse engineering jchardet's 
methodology for UTF-8 detection, and created a UTF-8 state machine which does 
the same thing as jchardet (in a much more human-readable manner), plus fixed 
two bugs in jchardet's UTF-8 detector along the way (possibly due to the lack 
of human-readability in the original source code). 
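
For readers unfamiliar with the idea, a UTF-8 state machine of this general kind can be sketched as follows. This is a minimal illustration only, not the code from this PR and not jchardet's logic: the class name `Utf8Validator` and its method are hypothetical, and the sketch only checks well-formedness as defined by RFC 3629 (rejecting overlong forms, surrogates, and code points above U+10FFFF).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: a byte-at-a-time UTF-8 well-formedness check. */
final class Utf8Validator {
    private Utf8Validator() {}

    /** Returns true if the bytes form well-formed UTF-8 (RFC 3629). */
    static boolean isWellFormed(byte[] bytes) {
        int remaining = 0; // continuation bytes still expected
        int lower = 0x80;  // allowed range for the next continuation byte
        int upper = 0xBF;
        for (byte raw : bytes) {
            int b = raw & 0xFF;
            if (remaining == 0) {
                if (b <= 0x7F) {
                    continue;                     // ASCII
                } else if (b >= 0xC2 && b <= 0xDF) {
                    remaining = 1;                // 2-byte sequence
                } else if (b >= 0xE0 && b <= 0xEF) {
                    remaining = 2;                // 3-byte sequence
                    if (b == 0xE0) lower = 0xA0;  // reject overlong forms
                    if (b == 0xED) upper = 0x9F;  // reject UTF-16 surrogates
                } else if (b >= 0xF0 && b <= 0xF4) {
                    remaining = 3;                // 4-byte sequence
                    if (b == 0xF0) lower = 0x90;  // reject overlong forms
                    if (b == 0xF4) upper = 0x8F;  // reject > U+10FFFF
                } else {
                    return false;                 // invalid lead byte
                }
            } else {
                if (b < lower || b > upper) {
                    return false;                 // invalid continuation byte
                }
                lower = 0x80;                     // later continuations use the full range
                upper = 0xBF;
                remaining--;
            }
        }
        return remaining == 0;                    // false if input ends mid-sequence
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isWellFormed(utf8));
        System.out.println(isWellFormed(latin1));
    }
}
```

A real detector would presumably also track whether any multi-byte sequence was seen at all, since pure ASCII passes this check trivially.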

I started looking into jchardet because, according to 
[TIKA-2038](https://issues.apache.org/jira/browse/TIKA-2038), using it to 
detect UTF-8 before anything else increased the accuracy of charset detection 
from ~72% to ~96%. 

Our encoding detector should now be at least as accurate.

Any thoughts on the methodology, as compared to what we had before?


---


[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...

2018-11-09 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/132
  
Created an issue about this in Tika: 
https://issues.apache.org/jira/projects/TIKA/issues/TIKA-2778


---


[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...

2018-11-09 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/132
  
ALSO: it appears that `javax.activation:activation:1.1.1` has been replaced 
by `com.sun.activation:javax.activation:1.2.0` and 
`javax.activation:javax.activation-api:1.2.0`. However, I'm a bit fuzzy on how 
this works because it appears that the `javax.activation-api` sources are a 
subset of the `javax.activation` sources (i.e., `javax.activation:1.2.0` does 
not *depend* on `javax.activation-api:1.2.0`, but rather simply copies the 
source files... I think.)

In any case, `org.glassfish.jaxb:jaxb-runtime:2.3.1` *depends on* 
`javax.activation:javax.activation-api:1.2.0`, but does *not* depend on 
`com.sun.activation:javax.activation:1.2.0`. That leads me to believe that if we 
include both `jaxb-runtime:2.3.1` and `javax.activation:1.2.0`, we might have 
to exclude `javax.activation-api` from `jaxb-runtime` to avoid duplicate 
classes?

Cf. 
https://stackoverflow.com/questions/46493613/what-is-the-replacement-for-javax-activation-package-in-java-9

Cf. 
https://stackoverflow.com/questions/52921879/migration-to-jdk-11-has-error-occure-java-lang-noclassdeffounderror-javax-acti

Cf. 
https://stackoverflow.com/questions/48204141/replacements-for-deprecated-jpms-modules-with-java-ee-apis

Cf. https://mvnrepository.com/artifact/org.glassfish.jaxb/jaxb-runtime/2.3.1


---


[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...

2018-11-09 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/132
  
ALSO: it appears that, as of `org.glassfish.jaxb:jaxb-runtime` version 
2.3.1, `jaxb-core` is no longer required (as it has been merged into 
`jaxb-runtime`).


---


[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...

2018-11-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/132
  
Indeed, when I navigate to 
https://javalibs.com/bom/com.sun.xml.bind/jaxb-impl
I see:
> This artifact has been retired! New location is: 
org.glassfish.jaxb:jaxb-runtime


---


[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...

2018-11-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/132
  
@lewismc @ansell of interest may be 
[TIKA-2743](https://issues.apache.org/jira/projects/TIKA/issues/TIKA-2743), 
entitled, "Replace com.sun.xml.bind:jaxb-impl and jaxb-core by 
org.glassfish.jaxb:jaxb-runtime and jaxb-core", which states:

> com.sun.xml.bind:* is actually the old name and is currently a 
repackaging of org.glassfish.jaxb:*. probably kept as a retro compatibility


---


[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...

2018-11-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/132
  
@ansell Well, this is what I see when I run `mvn dependency:tree` on the 
current state of the Any23 trunk:

```
[INFO] |  +- org.apache.tika:tika-core:jar:1.19.1:compile
[INFO] |  +- org.apache.tika:tika-parsers:jar:1.19.1:compile
[INFO] |  |  +- org.glassfish.jaxb:jaxb-core:jar:2.3.0.1:compile
[INFO] |  |  |  +- javax.xml.bind:jaxb-api:jar:2.3.0:compile
[INFO] |  |  |  +- org.glassfish.jaxb:txw2:jar:2.3.0.1:compile
[INFO] |  |  |  \- com.sun.istack:istack-commons-runtime:jar:3.0.5:compile
[INFO] |  |  +- org.glassfish.jaxb:jaxb-runtime:jar:2.3.0.1:compile
[INFO] |  |  |  +- org.jvnet.staxex:stax-ex:jar:1.7.8:compile
[INFO] |  |  |  \- com.sun.xml.fastinfoset:FastInfoset:jar:1.2.13:compile
[INFO] |  |  +- javax.activation:activation:jar:1.1.1:compile
```




---


[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...

2018-11-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/132
  
@ansell Right. And I know that Tika (as of 1.19.0) already pulls in all but 
one of these dependencies (I think). The only difference being, Tika uses the 
jaxb-core module from glassfish rather than `com.sun.xml.bind`. So, should we 
exclude the Tika dependencies and use these? Or just use Tika's, with the 
addition of `jaxws-api`? (Basically, I'm just trying to avoid overlapping class 
names).


---


[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...

2018-11-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/132
  
@lewismc when I try to use `sudo` to, for example, read a text file using
```
sudo cat /opt/tomcat9/BUILDING.txt
```
It gives me a 
```
sudo: PAM authentication error: User not known to the underlying authentication module
```
Any ideas?


---


[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...

2018-11-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/132
  
@lewismc Also, now that they've granted me access, how do I actually access 
the VM? https://issues.apache.org/jira/browse/INFRA-17224


---


[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...

2018-11-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/132
  
@lewismc I have zero experience with Tomcat, so I'm just as in-the-dark as 
you in this regard. Did you get an error message or stacktrace you could post 
here?

I'd recommend removing all the newly added dependencies except for 
`javax.xml.ws:jaxws-api`, since that's the only one (I think) that Tika doesn't 
already pull in, and see how far that gets us. (Because otherwise we might need 
to add additional exclusions to our tika-parsers dependency to avoid 
conflicting class names if anyone's using the `maven-shade-plugin` or similar).


---


[GitHub] any23 issue #131: ANY23-418 improve TikaEncodingDetector

2018-11-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/131
  
@lewismc I've added some additional unit tests which test against the main 
issues we've been having with encoding detection.

Unfortunately, the only real way to comprehensively test this is to compare 
against millions of webpages "in the wild", but I am confident that it 
represents a huge improvement over what we have *now*, based on our past 
problems with encoding detection, plus discussions over in Tika regarding the 
various issues *they've* been having with encoding detection.

Compare to the original version of this file 
[here](https://github.com/apache/any23/blob/bd607c1cc8c63225f9678ec967c73daa474b45aa/encoding/src/main/java/org/apache/any23/encoding/TikaEncodingDetector.java).

Since that time, I've made a couple changes to the algorithm to fix up 
problems we've encountered along the way, but those tweaks weren't as 
comprehensive as this one is.

Ideally, I'd like to compare this more comprehensive solution against our 
original solution across millions of webpages, but I'm not yet sure how to 
proceed in that regard.


---


[GitHub] any23 issue #132: ANY23-419 Add J2EE dependencies such that service runs und...

2018-11-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/132
  
@lewismc FYI, I believe that some of these dependencies are already pulled 
in by Tika. 

Running `mvn dependency:tree` on the service module, I see: 

- `javax.xml.bind:jaxb-api:jar:2.3.0` is pulled in by tika-parsers
- `org.glassfish.jaxb:jaxb-core:jar:2.3.0.1` is pulled in by tika-parsers
- `javax.activation:activation:jar:1.1.1` is pulled in by tika-parsers

leaving (I believe) the only library not already pulled in by Tika to be 
the `jaxws-api`.

Out of curiosity though, where are these libraries actually used by the 
service? Are they required by jetty?


---


[GitHub] any23 issue #131: ANY23-418 improve TikaEncodingDetector

2018-11-06 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/131
  
@lewismc any thoughts about this?


---


[GitHub] any23 pull request #131: ANY23-418 improve TikaEncodingDetector

2018-11-06 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/131

ANY23-418 improve TikaEncodingDetector

Improves TikaEncodingDetector by:

1. Not second-guessing UTF-8 if there is *any* indication that a stream is 
UTF-8-encoded. We can't afford false positives from obscure, obsolete charsets 
such as IBM500 (See 
[TIKA-2771](https://issues.apache.org/jira/browse/TIKA-2771)).
2. Taking entire stream into account rather than a prefix (this shouldn't 
be a huge memory issue, as we are already holding the entire stream in memory 
to pass to each extractor, and extractors such as RDFa already parse the entire 
content into a DOM before generating the triples. If we want to make Any23 
"streaming"-capable in the future to reduce memory requirements, we can look 
into that, but for now, since we're not, we may as well use that to our 
advantage to be more accurate in charset detection.)
3. Taking [TIKA-2771](https://issues.apache.org/jira/browse/TIKA-2771), 
[TIKA-2038](https://issues.apache.org/jira/browse/TIKA-2038), and 
[TIKA-539](https://issues.apache.org/jira/browse/TIKA-539) into account.
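
The "UTF-8 first, over the whole stream" idea in points 1 and 2 can be illustrated with a short sketch. It is not the detector from this PR: the class `Utf8First`, its `choose` method, and the caller-supplied `fallback` charset are all hypothetical, and the sketch simply strict-decodes the entire content with the JDK's `CharsetDecoder` and only falls back when that fails.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: prefer UTF-8 whenever the whole content decodes cleanly. */
final class Utf8First {
    private Utf8First() {}

    /**
     * Returns UTF-8 if every byte of the content is part of a well-formed
     * UTF-8 sequence; otherwise returns the fallback guess (assumed to come
     * from some statistical detector, which is outside this sketch).
     */
    static Charset choose(byte[] content, Charset fallback) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(content));               // decodes the entire buffer, not a prefix
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(choose(utf8, StandardCharsets.ISO_8859_1));
        System.out.println(choose(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

Because the content is already held in memory, checking the whole buffer costs one extra pass and avoids being misled by an ASCII-only prefix.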

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-418

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/131.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #131


commit d64dac9dfe0752c45d3ff9fbca37bbe447e5c55b
Author: Hans 
Date:   2018-11-06T21:27:00Z

ANY23-418 improve TikaEncodingDetector




---


[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite

2018-10-25 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/124
  
I don't think we need to worry about these remaining 5 tests before the 2.3 
release. So I'm going to add code to ignore them for now.


---


[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite

2018-10-25 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/124
  
## Update
Merging ANY23-410 and ANY23-411 into master resulted in a reduction of 
failed tests from 11 to 5.

Failed tests are now as follows:
-
Test 0073: Vocabulary Expansion test with rdfs:subPropertyOf
Test 0074: Vocabulary Expansion test with owl:equivalentProperty
Test 0081: Simple `@itemprop-reverse` (experimental)
Test 0082: `@itemprop-reverse` with `@itemscope` value (experimental)
Test 0084: `@itemprop-reverse` with `@itemprop` (experimental)


---


[GitHub] any23 pull request #130: ANY23-410 fix microdata itemrefs

2018-10-25 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/130

ANY23-410 fix microdata itemrefs

This fixes the regression introduced in version 2.2 causing Any23 to ignore 
itemrefs.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-410

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/130.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #130


commit 13d04c7426b10c7bf982dacfc5cfd2bee2385b0e
Author: Hans 
Date:   2018-10-25T22:45:09Z

ANY23-410 fix microdata itemrefs




---


[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite

2018-10-24 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/124
  
## Update
Merging ANY23-309 into master resulted in a reduction of failed tests from 
12 to 11.

Failed tests are now as follows:
-
Test 0062: `@itemref` to single id
Test 0063: `@itemref` generates property values
Test 0064: `@itemref` to single id with different types
Test 0065: `@itemref` to multiple ids
Test 0066: `@itemref` with chaining
Test 0067: Shared `@itemref`
Test 0073: Vocabulary Expansion test with rdfs:subPropertyOf
Test 0074: Vocabulary Expansion test with owl:equivalentProperty
Test 0081: Simple `@itemprop-reverse` (experimental)
Test 0082: `@itemprop-reverse` with `@itemscope` value (experimental)
Test 0084: `@itemprop-reverse` with `@itemprop` (experimental)


---


[GitHub] any23 pull request #129: ANY23-409 allow multiple microdata itemtype values

2018-10-24 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/129

ANY23-409 allow multiple microdata itemtype values

This PR addresses the failed `Test 0056: token property and multiple 
@itemtypes from different vocabularies` microdata test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-409

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/129.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #129


commit 8b951d8e06ed5ad941ec4ba452532bb93d04a057
Author: Hans 
Date:   2018-10-24T21:36:12Z

ANY23-409 allow multiple microdata itemtype values




---


[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite

2018-10-24 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/124
  
## Update
Merging ANY23-408 into master and turning on the microdata "strict" flag 
resulted in a reduction of failed tests from 17 to 12.

Failed tests are now as follows:
-
Test 0056: token property and multiple `@itemtype`s from different 
vocabularies
Test 0062: `@itemref` to single id
Test 0063: `@itemref` generates property values
Test 0064: `@itemref` to single id with different types
Test 0065: `@itemref` to multiple ids
Test 0066: `@itemref` with chaining
Test 0067: Shared `@itemref`
Test 0073: Vocabulary Expansion test with rdfs:subPropertyOf
Test 0074: Vocabulary Expansion test with owl:equivalentProperty
Test 0081: Simple `@itemprop-reverse` (experimental)
Test 0082: `@itemprop-reverse` with `@itemscope` value (experimental)
Test 0084: `@itemprop-reverse` with `@itemprop` (experimental)


---


[GitHub] any23 pull request #128: ANY23-408 Use document IRI as default namespace in ...

2018-10-24 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/128

ANY23-408 Use document IRI as default namespace in microdata strict mode

Currently, we just drop predicates that don't have a namespace in strict 
mode. This commit will align strict mode with the actual spec.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-408

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/128.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #128


commit a58d59e35da537b69820baf0cb6423fb3facea02
Author: Hans 
Date:   2018-10-24T19:16:57Z

ANY23-408 Use document IRI as default namespace in microdata strict mode




---


[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite

2018-10-24 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/124
  
## Update
Merging ANY23-407 into master resulted in a reduction of failed tests from 
18 to 17.

Failed tests are now as follows:
-
Test 0002: Item with no itemtype and 2 elements with equivalent itemprop
Test 0003: Item with itemprop having two properties
Test 0052: token property no `@itemtype`
Test 0053: token property empty `@itemtype`
Test 0054: token property and relative `@itemtype`
Test 0056: token property and multiple `@itemtype`s from different 
vocabularies
Test 0062: `@itemref` to single id
Test 0063: `@itemref` generates property values
Test 0064: `@itemref` to single id with different types
Test 0065: `@itemref` to multiple ids
Test 0066: `@itemref` with chaining
Test 0067: Shared `@itemref`
Test 0073: Vocabulary Expansion test with rdfs:subPropertyOf
Test 0074: Vocabulary Expansion test with owl:equivalentProperty
Test 0081: Simple `@itemprop-reverse` (experimental)
Test 0082: `@itemprop-reverse` with `@itemscope` value (experimental)
Test 0084: `@itemprop-reverse` with `@itemprop` (experimental)


---


[GitHub] any23 pull request #127: ANY23-407 allow microdata itemids from relative url...

2018-10-24 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/127

ANY23-407 allow microdata itemids from relative urls



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-407

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/127.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #127


commit 9bab662c46107350417f61e1f7cbd3058809edf1
Author: Hans 
Date:   2018-10-24T17:00:51Z

ANY23-407 allow microdata itemids from relative urls




---


[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite

2018-10-24 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/124
  
## Update
Merging ANY23-405 into master resulted in a reduction of failed tests from 
28 to 18.

Failed tests are now as follows:
-
Test 0002: Item with no itemtype and 2 elements with equivalent itemprop
Test 0003: Item with itemprop having two properties
Test 0051: relative URL as itemid
Test 0052: token property no `@itemtype`
Test 0053: token property empty `@itemtype`
Test 0054: token property and relative `@itemtype`
Test 0056: token property and multiple `@itemtype`s from different 
vocabularies
Test 0062: `@itemref` to single id
Test 0063: `@itemref` generates property values
Test 0064: `@itemref` to single id with different types
Test 0065: `@itemref` to multiple ids
Test 0066: `@itemref` with chaining
Test 0067: Shared `@itemref`
Test 0073: Vocabulary Expansion test with rdfs:subPropertyOf
Test 0074: Vocabulary Expansion test with owl:equivalentProperty
Test 0081: Simple `@itemprop-reverse` (experimental)
Test 0082: `@itemprop-reverse` with `@itemscope` value (experimental)
Test 0084: `@itemprop-reverse` with `@itemprop` (experimental)


---


[GitHub] any23 pull request #126: ANY23-405 Parse microdata property values correctly

2018-10-24 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/126

ANY23-405 Parse microdata property values correctly

See http://w3c.github.io/microdata-rdf/#dfn-property-values

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-405

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/126.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #126


commit f23c25cc23938aa27551426d38dd0139fd30b9f4
Author: Hans 
Date:   2018-10-24T15:35:10Z

ANY23-405 Parse microdata property values correctly




---


[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite

2018-10-23 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/124
  
## Update 
Merging ANY23-404 into master resulted in a reduction of failed tests from 
30 to 28.

Failed tests are now as follows:
Test 0002: Item with no itemtype and 2 elements with equivalent itemprop
Test 0003: Item with itemprop having two properties
Test 0046: Use of time with `@datetime` xsd:time
Test 0047: Use of time with `@datetime` xsd:dateTime
Test 0048: Use of time with `@datetime` xsd:duration
Test 0049: Use of time with `@datetime` invalid
Test 0051: relative URL as itemid
Test 0052: token property no `@itemtype`
Test 0053: token property empty `@itemtype`
Test 0054: token property and relative `@itemtype`
Test 0056: token property and multiple `@itemtype`s from different 
vocabularies
Test 0062: `@itemref` to single id
Test 0063: `@itemref` generates property values
Test 0064: `@itemref` to single id with different types
Test 0065: `@itemref` to multiple ids
Test 0066: `@itemref` with chaining
Test 0067: Shared `@itemref`
Test 0073: Vocabulary Expansion test with rdfs:subPropertyOf
Test 0074: Vocabulary Expansion test with owl:equivalentProperty
Test 0075: Use of data and xsd:float
Test 0076: Use of data and xsd:integer
Test 0077: Use of data and string
Test 0078: Use of meter and xsd:double
Test 0079: Use of meter and xsd:integer
Test 0080: Use of meter and xsd:string
Test 0081: Simple `@itemprop-reverse` (experimental)
Test 0082: `@itemprop-reverse` with `@itemscope` value (experimental)
Test 0084: `@itemprop-reverse` with `@itemprop` (experimental)


---


[GitHub] any23 pull request #125: ANY23-404 hardcode default microdata registry

2018-10-23 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/125

ANY23-404 hardcode default microdata registry

This PR should ensure that our microdata extractor is compliant with the 
standard default microdata registry in terms of vocabulary expansion and 
property URI generation.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-404

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/125.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #125


commit 6b1469152ccd30f93b0686a73bd1ba02955d6411
Author: Hans 
Date:   2018-10-24T00:37:37Z

ANY23-404 hardcode default microdata registry




---


[GitHub] any23 issue #124: ANY23-67 test against online microdata test-suite

2018-10-23 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/124
  
@lewismc Yes and no. While we certainly don't need to address all of these 
test failures before the next release, I want to make sure that property URI 
generation works as expected for all namespaces in the default registry, at 
least. That should be a quick fix.


---


[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow

2018-10-23 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/121
  
Now that ANY23-396 has been implemented in #122 and merged into master, can 
we close this PR? @lewismc ? @jgrzebyta ? I don't have the required permissions 
to close issues myself.


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-10-23 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
Alright, as far as I'm concerned, all my cleanup is done here.

@lewismc if you have no further comments, would you prefer I squash my 
commits before merging, or just merge everything in as-is?


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-10-23 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
On second thought, I'm thinking about making `TripleWriter` a class in 
`core` rather than an interface in `api`. It would have the same effect as 
before, except it would allow more freedom of implementation from the api 
perspective.


---


[GitHub] any23 pull request #124: ANY23-67 test against online microdata test-suite

2018-10-22 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/124

ANY23-67 test against online microdata test-suite

I created a microdata unit test which tests against the latest online 
[microdata test suite](http://w3c.github.io/microdata-rdf/tests/).

Currently, 30 out of 84 total tests are failing!

*Note: the tests are relaxed such that the expected model is only required 
to be a subset of the actual model, and not necessarily the other way around. 
Requiring strict isomorphism, on the other hand, causes 83 out of 84 tests to 
fail.*
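
The difference between the relaxed check and strict isomorphism can be sketched as follows. This is an illustration under simplifying assumptions, not the test code from this PR: statements are modeled as plain strings, blank-node matching is ignored entirely, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: relaxed (subset) versus strict (equality) model comparison. */
final class RelaxedModelCheck {
    private RelaxedModelCheck() {}

    /** Relaxed: passes if every expected statement appears in the actual output. */
    static boolean expectedIsSubset(Set<String> expected, Set<String> actual) {
        return actual.containsAll(expected);
    }

    /** Strict: passes only if both models contain exactly the same statements. */
    static boolean exactMatch(Set<String> expected, Set<String> actual) {
        return expected.equals(actual);
    }

    public static void main(String[] args) {
        Set<String> expected = new HashSet<>(Arrays.asList(
                "<http://example.org/a> <http://schema.org/name> \"A\""));
        // The extractor output also carries an extra statement beyond the expected ones.
        Set<String> actual = new HashSet<>(expected);
        actual.add("<http://example.org/doc> <http://example.org/extra> \"metadata\"");

        System.out.println(expectedIsSubset(expected, actual));
        System.out.println(exactMatch(expected, actual));
    }
}
```

Any extra statement in the actual output breaks the strict comparison while leaving the subset check unaffected.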

**The 30 failing tests are as follows:**
Test 0002: Item with no itemtype and 2 elements with equivalent itemprop
Test 0003: Item with itemprop having two properties
Test 0046: Use of time with `@datetime` xsd:time
Test 0047: Use of time with `@datetime` xsd:dateTime
Test 0048: Use of time with `@datetime` xsd:duration
Test 0049: Use of time with `@datetime` invalid
Test 0051: relative URL as itemid
Test 0052: token property no `@itemtype`
Test 0053: token property empty `@itemtype`
Test 0054: token property and relative `@itemtype`
Test 0056: token property and multiple `@itemtype`s from different 
vocabularies
Test 0062: `@itemref` to single id
Test 0063: `@itemref` generates property values
Test 0064: `@itemref` to single id with different types
Test 0065: `@itemref` to multiple ids
Test 0066: `@itemref` with chaining
Test 0067: Shared `@itemref`
Test 0070: Property URI generation (default) 3
Test 0071: Vocabulary Expansion test with schema:additionalType
Test 0073: Vocabulary Expansion test with rdfs:subPropertyOf
Test 0074: Vocabulary Expansion test with owl:equivalentProperty
Test 0075: Use of data and xsd:float
Test 0076: Use of data and xsd:integer
Test 0077: Use of data and string
Test 0078: Use of meter and xsd:double
Test 0079: Use of meter and xsd:integer
Test 0080: Use of meter and xsd:string
Test 0081: Simple `@itemprop-reverse` (experimental)
Test 0082: `@itemprop-reverse` with `@itemscope` value (experimental)
Test 0084: `@itemprop-reverse` with `@itemprop` (experimental)

For more details on expected vs. actual statements, run the 
`MicrodataExtractorTest.runOnlineTests()` test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-67

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/124.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #124


commit 2e3451ccaa36234a9dbd60ce783bf20501fc70c4
Author: Hans 
Date:   2018-10-23T02:49:46Z

ANY23-67 test against online microdata test-suite




---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-10-22 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
@lewismc If you don't have any more comments, I'm ready to merge this in.

One question: would you prefer I squash the commits before merging, or not?


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-10-21 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
@lewismc the reason I ask is that my **cleanup item 4** allows me to 
specify a new method in `TripleWriter` which accepts a group as a `Resource` 
rather than as an `IRI`. That's what I've done in my latest cleanup commit. 

I'll leave this PR open for at least another day before merging to master 
just in case anyone comes up with any further comments or concerns.


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-10-21 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
@lewismc One quick question for you before I finish up here: according to 
RDF specifications, graph names are allowed to be blank nodes, but it appears 
that Any23 only supports graph names that are IRIs. (Whereas RDF4J supports 
graph names that are blank nodes *or* IRIs. It appears Any23 silently drops any 
parsed graph names that are blank nodes rather than IRIs.)

Is there a reason for this? Any historical context you can give me on why 
Any23 opted to not support BNode graph names? Should we lean towards supporting 
this in the future?


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-10-19 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
@lewismc , great to hear!

After I do a bit of last-minute cleanup, I will merge this PR.

Cleanup item 1: I'm renaming `TripleFormat.ExtendedCapabilities` to 
`TripleFormat.FineCapabilities`, as I think the first name is slightly 
misleading. 

Cleanup item 2: For now, I'm removing the newly added support for 
configuring a writer's charset via `Settings` (although we could take another 
look at doing this in a future issue), for 3 reasons:  

1. Some XML-based writers hard-code an "encoding=utf8" declaration which 
might then conflict with a user-supplied charset and produce an invalid document.
2. If the user sets a writer's charset to US-ASCII or similar, that could 
create a problem if the writer doesn't support escaping non-ascii characters. 
(To my knowledge, only the `NTriplesWriter` and `NQuadsWriter` support this.)
3. The default charset for every existing writer is already UTF-8, and I 
can't think of a good reason to support anything else.
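
To make reason 2 concrete, here is a rough sketch of what "escaping non-ASCII characters" involves when a restricted charset is requested. It is not code from `NTriplesWriter` or `NQuadsWriter`: the class `UnencodableEscaper` is hypothetical, and it only covers the `\uXXXX` / `\UXXXXXXXX` escapes for characters the target charset cannot encode, not the other escapes a full writer needs.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: escape only the characters a target charset cannot represent. */
final class UnencodableEscaper {
    private UnencodableEscaper() {}

    static String escape(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint); // 2 for supplementary characters
            CharSequence chars = text.subSequence(i, i + length);
            if (encoder.canEncode(chars)) {
                out.append(chars);                                   // representable as-is
            } else if (codePoint <= 0xFFFF) {
                out.append(String.format("\\u%04X", codePoint));     // 4-digit escape
            } else {
                out.append(String.format("\\U%08X", codePoint));     // 8-digit escape
            }
            i += length;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("caf\u00e9", StandardCharsets.US_ASCII));
        System.out.println(escape("caf\u00e9", StandardCharsets.UTF_8));
    }
}
```

A format without an escape syntax of this kind has no way to represent such characters in US-ASCII, which is the problem described in reason 2.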


---


[GitHub] any23 pull request #122: ANY23-396 allow mapping/filtering TripleHandlers in...

2018-10-19 Thread HansBrende
Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/122#discussion_r226702275
  
--- Diff: api/src/main/java/org/apache/any23/configuration/Setting.java ---
@@ -118,7 +128,22 @@ private Type getValueType() {
 }
 }
 
-protected abstract V checkedValue(Setting original, V newValue) 
throws Exception;
+/**
+ * Subclasses may override this method to check that new settings 
for this key are valid,
+ * and/or to decorate new setting values, using, for example, 
{@link Collections#unmodifiableList(List)}.
+ * The default implementation of this method throws a {@link 
NullPointerException} if the new value is null and the initial value was 
non-null.
+ *
+ * @param initial the setting containing the initial value for 
this key, or null if the setting has not yet been initialized
+ * @param newValue the new value for this setting
+ * @return the new value for this setting
+ * @throws Exception if the new value for this setting was invalid
+ */
+protected V checkedValue(Setting initial, V newValue) throws 
Exception {
+if (newValue == null && initial != null && initial.value != 
null) {
+throw new NullPointerException();
+}
+return newValue;
+}
--- End diff --

Actually, we should not allow keys to decorate values. Consider the 
following scenario: user copies the value from one setting into another 
setting. Now the key is decorating a value that has *already been decorated*. 
This could lead to an unfortunate chain of, e.g., 
```
Collections.unmodifiableList(Collections.unmodifiableList(Collections.unmodifiableList(...)))
```

Therefore, any decorating should happen *before* the setting is created, 
and if the value is not appropriately decorated, the key should throw an 
exception in the value check. 

Also we should change this method's signature to:
```
protected void checkValue(Setting initial, V newValue) throws Exception
```
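
As a rough illustration of the check-only contract proposed above, here is a self-contained sketch. `KeySketch`, `ListKeySketch`, and `accept` are hypothetical names and do not mirror the actual `Setting` class. The point it shows is that when the check returns `void`, the key has no way to wrap the value, so the caller's (already decorated) instance is stored unchanged.

```java
import java.util.List;

// Hypothetical key type, for illustration only.
abstract class KeySketch<V> {

    // Validation only: throws if the value is unacceptable and returns nothing,
    // so the key cannot decorate the value a second time.
    protected void checkValue(V initial, V newValue) throws Exception {
        if (newValue == null && initial != null) {
            throw new NullPointerException("value must not be null");
        }
    }

    // Returns the caller's value untouched once the check has passed.
    final V accept(V initial, V newValue) {
        try {
            checkValue(initial, newValue);
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid setting value", e);
        }
        return newValue;
    }
}

// A concrete key for list-valued settings that inherits the default check.
final class ListKeySketch extends KeySketch<List<String>> {
}
```

Copying a value from one setting into another then reuses the same object instead of adding another wrapper layer.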


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-10-18 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
@lewismc I just added some unit tests & javadoc for the `Settings` API. Let 
me know your thoughts!


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-10-13 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
@lewismc do you have any concerns or questions regarding my latest commit? 
Would love to hear your thoughts.


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-10-11 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
**Implementation note:** I considered using the existing `Configuration` 
interface to construct `TripleWriter` instances, but it seemed rather limited, 
in that settings are only validated when they are first used, rather than 
*failing fast*, and they are all stored as strings rather than the actual 
parsed objects they represent. This is good for settings imported from a config 
file or loaded from the command line, but not very easy, type-safe, or 
performant for programmatic configuration.

So instead, I created `Settings`, which could be considered a type-safe 
version of `Configuration`, or a *parsed* configuration. In the future, we 
could add the ability to create a `Settings` object *from* a `Configuration` 
object, given a set of supported settings and a configuration parser. In a 
future PR, I'm planning to implement a similar concept for `Rover`, so that a 
`Settings` object can be parsed from the command line for each writer. E.g., 
instead of having, simply:

```
--format mycustomdecorator,notrivial,turtle
```
we could do something like:
```
--format mycustomdecorator,notrivial;alwayssuppresscsstriples=true,turtle;prettyprint=true
```
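
To give a feel for how such an argument might be split up, here is a minimal sketch. It assumes the syntax shown above (commas between writers, semicolons before each `key=value` pair); `FormatSpecSketch` and `WriterSpec` are hypothetical names, and this is not how Rover parses its arguments today. Values are kept as raw strings here; turning them into typed `Settings` would be a separate step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for the proposed --format syntax, for illustration only.
final class FormatSpecSketch {

    // One entry of the list: a writer id plus its raw key=value settings.
    static final class WriterSpec {
        final String id;
        final Map<String, String> settings = new LinkedHashMap<>();

        WriterSpec(String id) {
            this.id = id;
        }
    }

    static List<WriterSpec> parse(String formatArg) {
        List<WriterSpec> specs = new ArrayList<>();
        for (String entry : formatArg.split(",")) {
            String[] parts = entry.trim().split(";");
            if (parts[0].isEmpty()) {
                throw new IllegalArgumentException("empty writer id in: " + formatArg);
            }
            WriterSpec spec = new WriterSpec(parts[0]);
            for (int i = 1; i < parts.length; i++) {
                int eq = parts[i].indexOf('=');
                if (eq <= 0) {
                    // Fail fast on a malformed setting rather than at first use.
                    throw new IllegalArgumentException("expected key=value but got: " + parts[i]);
                }
                spec.settings.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
            }
            specs.add(spec);
        }
        return specs;
    }
}
```

For the second command line above, this yields three entries: `mycustomdecorator` with no settings, `notrivial` with `alwayssuppresscsstriples=true`, and `turtle` with `prettyprint=true`.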


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-10-10 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
@lewismc I've implemented a few new things here for the new 
`TripleWriterFactory` (what I used to call `FormatWriterFactory`). The most 
important of these, in my opinion, is `Settings`, which allows you to 
configure writers (analogous to rdf4j's `RioSetting` API, but with several 
improvements). The `Settings` capability will be able to replace the existing 
solution for ANY23-388 (PR #117). Also, we'll finally be able to allow users 
to turn off pretty printing if they so choose, or to set any other configuration 
option they desire. (E.g., when we upgrade to rdf4j 2.4.0, we can add a 
"hierarchical" settings option for the new hierarchical JSON-LD printing 
ability.)

Then there's the new `TripleFormat` class, analogous to rdf4j's `RDFFormat` 
class with a few improvements (one being a "characteristics" flag which allows 
a much broader range of boolean characteristics to be specified than the 2 in 
`RDFFormat`.)

I'm also deprecating the `FormatWriter` interface (which is nearly useless 
as it stands--and could be replaced in the future with a simple 
`AnnotatingDelegatingWriter`) in favor of the new `TripleWriter` interface 
(which extends `FormatWriter` for backwards compatibility, but introduces 
methods that are more useful).

Let me know what you think!


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-10-03 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
@lewismc One possibility that I'm considering right now is using this 
opportunity to define our own `TripleFormat` class analogous to rdf4j's 
`RDFFormat` (such that any `TripleFormat` could be converted to a `RDFFormat` 
if desired), and then changing the method signature of `RDFFormat getFormat()` 
to `TripleFormat getFormat()`.

The reason being: shouldn't all return types of methods (aside from the 
ubiquitous `IRI`, `BNode`, etc.) in new interfaces (e.g. `FormatWriterFactory`) 
be, preferably, part of our own API, rather than RDF4J's? Having our own 
`TripleFormat` class would give us more control over our own API. For example, 
suppose we were to add the following default method to the `TripleHandler` 
interface:

```java
default void handleComment(String comment, ExtractionContext context) {
    // default implementation = do nothing
}
```
And then we wanted to add a `supportsComments` flag to the format returned 
by `FormatWriterFactory.getFormat()` (which we could set to `true` for, e.g., 
the `TurtleWriter`). Well, if we're using RDF4J's `RDFFormat` class, we could 
log an issue in RDF4J asking them to add that additional parameter, but we're 
pretty much at their mercy. However, if we had our own `TripleFormat` class, we 
could add an additional `TripleFormat` constructor with a `boolean 
supportsComments` parameter (and a default value of `false`).
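
A minimal sketch of that idea, assuming nothing about the eventual class beyond what is described here: `TripleFormatSketch` and its accessors are hypothetical names, and the two constructors show how the extra flag can default to `false` so that existing callers are unaffected.

```java
import java.util.Objects;

// Hypothetical format descriptor, for illustration only.
final class TripleFormatSketch {
    private final String name;
    private final String mimeType;
    private final boolean supportsComments;

    // Existing callers keep using this constructor; comments default to unsupported.
    TripleFormatSketch(String name, String mimeType) {
        this(name, mimeType, false);
    }

    // Added later without breaking the two-argument constructor.
    TripleFormatSketch(String name, String mimeType, boolean supportsComments) {
        this.name = Objects.requireNonNull(name);
        this.mimeType = Objects.requireNonNull(mimeType);
        this.supportsComments = supportsComments;
    }

    String name() {
        return name;
    }

    String mimeType() {
        return mimeType;
    }

    boolean supportsComments() {
        return supportsComments;
    }
}
```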

What do you think about this? 

(The only reason I've hesitated so far about merging this PR is that once 
new interfaces are introduced as part of the core API, I'd prefer to never 
change them again--so I want to get it right the first time!)


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-10-02 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
I've taken another look at the `RDFFormat` class, and it turns out that we 
don't really need the new method: `FileFormat getFormat()` because any 
`RDFFormat` can be converted from a `FileFormat` by constructing it with a 
`null` standard URI, and setting both "supports namespaces" and "supports 
contexts" to `false`. This should be applicable to any writer, even those that 
don't print out a standardized RDF format. E.g., in the `URIListWriter` class, 
"supports namespaces" and "supports contexts" are clearly false since the class 
only writes out subjects; but does not write out predicates, objects, 
namespaces, or contexts.

Therefore, I think I'm going to drop the new `FileFormat getFormat()` 
method and retain the `RDFFormat getRdfFormat()` method.


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-09-26 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
> @HansBrende.  Sorry, I did not follow your latest commits. My proof of
concept was done for different approach so I thought it would be useless in
your one. I had thought I would need to write the new one. But if you
reused so I am happy of that. You have still my +1.

@jgrzebyta glad to hear I still have your +1! Yes, although I used your 
original unit tests, I did have to modify the way they were implemented. Here 
are the implementation changes I made:
1. `ExtractorsFlowTest` diff: https://www.diffchecker.com/pPGAQxE6
2. `PeopleExtractorFactory` diff: https://www.diffchecker.com/Mn7XTZOB
3. `PeopleExtractor` diff: https://www.diffchecker.com/x4du9RqE



---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-09-26 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
@jgrzebyta thanks for your +1. Is the improved state of the javadoc to your 
liking?

> I will add my unit in a separate ticket.

I'm confused: are you referring to your existing PR #121 ? This PR is meant 
to be an alternative to that one. In my last commit, I have also added your 
[`ExtractorsFlowTest`](https://github.com/apache/any23/blob/f95f23865c0a7088e4ab1cbe507b8457fc90dda5/cli/src/test/java/org/apache/any23/cli/ExtractorsFlowTest.java)
 proof-of-concept to this PR to clarify that this PR provides at least as much 
functionality as #121 does. The only difference being: this PR uses the new 
`DelegatingWriterFactory` to accomplish the same behavior previously provided 
by `ModelExtractor` in #121. So if you approve this PR, my assumption would be 
that you prefer it over #121, and that #121 would be discarded.

Any additional comments or concerns? Do I still have your +1?



---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-09-18 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
I've added a proof-of-concept unit test, which deprecates the `--notrivial` 
flag and instead makes that the identifier for a `DelegatingWriterFactory`. Now 
you can simply specify:
```shell
--format notrivial,nquads
```

@lewismc any additional comments or concerns?

@jgrzebyta can you please verify whether or not this PR will satisfy your 
use-case for [ANY23-396](https://issues.apache.org/jira/browse/ANY23-396)? Any 
additional comments or concerns?



---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-09-17 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
For the time being, I've opted for the third option for (4) (making the 
public methods I added to `WriterFactoryRegistry` become `private static` 
methods in `Rover`) so that I don't have to deal with that naming issue in this 
PR. If we want to add extra utility methods in `WriterFactoryRegistry`, that 
can be the subject of a different JIRA issue.

However, I did have to fix a couple of synchronization issues in 
`WriterFactoryRegistry` to accomplish this: I noticed that iterating through 
the list of writer factories returned by `WriterFactoryRegistry.getWriters()` 
could potentially throw a `ConcurrentModificationException` even though that 
method was marked `synchronized` (because, unless I am mistaken, the underlying 
list implementation *can* be modified after access to the list is given to a 
caller and the method returns). To fix this problem, I changed the 
implementation of the backing list of writers from `ArrayList` to 
`CopyOnWriteArrayList`, which guarantees thread safety for iterators. Since 
writes to `CopyOnWriteArrayList` are relatively expensive, I also changed the 
logic around a bit to use *batch writing*, i.e., registering all 
`WriterFactory` instances at once in a `registerAll()` method, rather than 
through consecutive invocations of the `register()` method. Similar issues 
existed for the methods to retrieve identifiers and MIME types, which I fixed 
in the same manner.
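
The difference between the two list implementations can be reproduced with the JDK alone. The following is a minimal single-threaded sketch; `RegistryIterationSketch` and its methods are hypothetical and use plain strings in place of `WriterFactory` instances. It modifies a list while an iterator over it is still in use, which is the situation a second thread registering a writer would create.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for the registry's backing list, for illustration only.
final class RegistryIterationSketch {

    // Starts iterating, registers one more entry, then continues iterating.
    // Returns true if the iterator failed with ConcurrentModificationException.
    static boolean failsWhenModifiedDuringIteration(List<String> writers) {
        Iterator<String> it = writers.iterator();
        it.next();
        writers.add("late-registration");
        try {
            it.next();
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    static List<String> arrayListRegistry() {
        return new ArrayList<>(Arrays.asList("turtle", "nquads"));
    }

    static List<String> copyOnWriteRegistry() {
        List<String> writers = new CopyOnWriteArrayList<>();
        // A single addAll copies the backing array once, whereas repeated
        // add() calls would copy it once per element.
        writers.addAll(Arrays.asList("turtle", "nquads"));
        return writers;
    }
}
```

With the `ArrayList` the second `next()` fails, while the `CopyOnWriteArrayList` iterator keeps reading the snapshot it was created from.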

With this last commit, I am now satisfied, personally, with my 
implementation of ANY23-396.

Anything else, @lewismc @jgrzebyta ?


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-09-17 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
A third option for (4) is to simply defer this decision to another day by 
removing the methods I added to `WriterFactoryRegistry` and adding them 
directly to `Rover` as private methods. This option is also tempting.


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-09-17 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
Another (more drastic) option for (4) would be to deprecate `getWriters()` 
and `getWriterByIdentifier(String id)`, and create the replacement methods 
`getWriterFactories()` and `getWriterFactoryByIdentifier(String id)` (or 
simply, `getWriterFactory(String id)`.)

Then we would be free to call writer instances "writers", and could leave 
the method names how they currently stand, namely: `getWriter(id, output)` and 
`getDefaultWriter(output)`


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-09-16 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
Thank you, @lewismc .

The only item of concern left from my perspective is *naming*. Should any 
of the new `public` interfaces/methods I have created be named differently, or 
are they adequately descriptive as they currently stand? This decision should 
be made now, as there is no going back.

Here follow the names of all the new `public` methods/interfaces I have 
created in this PR:
1. `public interface `**`FormatWriterFactory`**
> Is this descriptive enough? It does specify a `FileFormat getFormat()` 
method, returning the format which will be written to the output stream, so the 
name does still make sense even though we now return a `TripleHandler` rather 
than a `FormatWriter` from the `getTripleWriter(OutputStream)` method. On the 
other hand, we could also call it `ContentWriterFactory` in line with the 
existing `ContentExtractor` interface (although I'm not sure if that would make 
it any more descriptive). Another possibility would be 
`OutputStreamWriterFactory`.
2. `public interface `**`DelegatingWriterFactory`**
> Alternatives include `CompositeWriterFactory` or `FilterWriterFactory` 
(similar to `java.io.FilterOutputStream`).
3. `TripleHandler`**`getTripleWriter(Output)`** (specified in the 
`BaseWriterFactory` interface)
> Alternatives include `getTripleHandler` or simply `getWriter`. I chose 
`getTripleWriter` over `getWriter` because it seemed more descriptive, and to 
avoid confusion with the `java.io.Writer` class.
4. `TripleHandler`**`getWriter(id, output)`** and 
`TripleHandler`**`getDefaultWriter(OutputStream)`** (specified in 
`WriterFactoryRegistry`).
> This one is confused by the fact that `WriterFactoryRegistry` already 
uses the term "writer" to refer to *`WriterFactory`* instances (e.g. 
`List<WriterFactory> getWriters()` and `WriterFactory 
getWriterByIdentifier(String id)`). An easy alternative would be to take a hint 
from the existing, now-deprecated method `FormatWriter 
getWriterInstanceByIdentifier(id, output)` and use "**writerInstance**" to 
refer to a triple handler, i.e., `TripleHandler getWriterInstance(id, output)` 
and `getDefaultWriterInstance(OutputStream)`. Alternatively, we could use 
`getTripleWriter(id, output)` and `getDefaultTripleWriter(OutputStream)`.

Any suggestions, or better names that I haven't thought of, @lewismc ? 
@jgrzebyta ?


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-09-15 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
As per part 4 of my last comment, in my most recent commit I've allowed 
`FormatWriterFactory` and `DelegatingWriterFactory` to extend the same 
(package-private) base interface specifying a single generic method:

```java
interface BaseWriterFactory<Output> extends WriterFactory {
    TripleHandler getTripleWriter(Output o);
}
```

I could have added this method directly to `WriterFactory` (with a default 
implementation of throwing `UnsupportedOperationException`), but since all 
instances of this interface *must* be instances of either `FormatWriterFactory` 
or `DelegatingWriterFactory` (since the interface is package-private), and all 
interaction with this method will be done by casting to one of these two 
interfaces, adding generic arguments to `WriterFactory` itself would have only 
added unnecessary verbosity (e.g., always having to specify `WriterFactory<?>` 
instead of `WriterFactory` to avoid rawtypes warnings).

@lewismc any comments?


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-09-14 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
My last commit reflects the notes of interest I mentioned in my last 
comment.
1. Since `WriterFactory.getMimeType()` is redundant and I had to deprecate 
it anyway to make this PR work, I've simply opted to *not* un-deprecate it in 
the extending `FormatWriterFactory`. To retrieve the MIME type of a 
`FormatWriterFactory` instance, simply call `getFormat().getDefaultMIMEType()`. 
However, to keep new implementations of `FormatWriterFactory` backwards 
compatible with the older behavior, I've simply added the following default 
implementation of `getMimeType()` in the `FormatWriterFactory` interface:

```java
@Override
@Deprecated
default String getMimeType() {
    return getFormat().getDefaultMIMEType();
}
```

2. Since not all implementations of `FormatWriterFactory` print RDF triples 
(case in point: `URIListWriterFactory`), the deprecation of 
`WriterFactory.getRdfFormat()` presents us with the perfect opportunity to make 
the return type of `getRdfFormat()` more generic in `FormatWriterFactory` 
(namely, using `FileFormat`, the superclass of `RDFFormat`, instead of 
`RDFFormat`). To accomplish this, I've simply opted to *not* un-deprecate the 
`getRdfFormat()` method in the `FormatWriterFactory` interface, and instead, 
add the following method:

```java
FileFormat getFormat();
```

To keep everything backwards compatible with the previous behavior, I've 
added the following default implementation of `getRdfFormat()` to the 
`FormatWriterFactory` interface:

```java
@Override
@Deprecated
default RDFFormat getRdfFormat() {
    FileFormat f = getFormat();
    if (f instanceof RDFFormat) {
        return (RDFFormat) f;
    } else {
        throw new UnsupportedOperationException("This class does not print RDF triples.");
    }
}
```
Now the `URIListWriterFactory` can utilize the method `getFormat()`, 
instead of its previous behavior of throwing a `RuntimeException`. To that 
effect, I've opted to return the following `FileFormat` from 
`URIListWriterFactory.getFormat()`:

```java
private static final FileFormat FORMAT = new FileFormat("PLAINTEXT", "text/plain",
        StandardCharsets.UTF_8, "txt");

@Override
public FileFormat getFormat() {
    return FORMAT;
}
```

3. Since the `FormatWriterFactory` interface is now not only tasked with 
`RDFFormat`s, but also arbitrary `FileFormat`s, deprecating the 
`WriterFactory.getRdfWriter(OutputStream)` method presents us with the perfect 
opportunity to choose a more appropriate name for this method in the 
subinterface `FormatWriterFactory`. To this effect, I've opted to simply *not* 
un-deprecate the `FormatWriterFactory.getRdfWriter(OutputStream)` method, and 
instead choose a more appropriate name. The name I've provisionally opted for 
is:

```java
FormatWriter getFormatWriter(OutputStream);
```
To keep everything backwards compatible, I've added the following default 
implementation of `FormatWriterFactory.getRdfWriter(OutputStream)`:

```java
@Override
@Deprecated
default FormatWriter getRdfWriter(OutputStream os) {
    return getFormatWriter(os);
}
```

4. Finally, I have one further question for discussion:
We could use this deprecation opportunity to further genericize the 
`FormatWriterFactory.getFormatWriter(OutputStream)` method, replacing:
```java
FormatWriter getFormatWriter(OutputStream);
```
with:
```java
TripleHandler getWriter(OutputStream);
```
which would allow `FormatWriterFactory` implementations to return arbitrary 
`TripleHandler`s instead of forcing them to return the more specific (but 
arguably *not* more useful) `FormatWriter` implementations. Where behavior 
specific to `FormatWriter` is actually needed, e.g. 
`FormatWriter.isAnnotated()` (a method which is actually *never* used anywhere 
in Any23), a check could be added as follows: 
```java
boolean isAnnotated(TripleHandler writer) {
    return writer instanceof FormatWriter && ((FormatWriter) writer).isAnnotated();
}
```

One additional benefit of doing this would be that 
`DelegatingWriterFactory` and `FormatWriterFactory` could both then extend some 
base interface as follows:
```
interface BaseWriterFactory<Output> {
    TripleHandler getWriter(Output output);
}
interface FormatWriterFactory extends BaseWriterFactory<OutputStream> {
    ...
}
interface DelegatingWriterFactory extends BaseWriterFactory<TripleHandler> {
    ...
}
```

@lewismc Any comments?


---


[GitHub] any23 issue #122: ANY23-396 allow mapping/filtering TripleHandlers in Rover

2018-09-14 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/122
  
While we're deprecating things anyway, there are a few notes of interest 
which we should mull over before any merges into master happen here:

1. The method `WriterFactory.getMimeType()` appears to be redundant, as 
there also exists the `WriterFactory.getRdfFormat().getDefaultMIMEType()`.

2. Also note the presence of `FileFormat`, the superclass of `RDFFormat`, 
which we could possibly use to make the `FormatWriterFactory` interface more 
generic (possibly helpful for the `URIListWriterFactory`, and other writer 
factories which similarly do not print RDF triples as output).

3. The method `WriterFactory.getRdfFormat()` is never actually used 
anywhere in the Any23 project.

4. `JSONWriterFactory` and `URIListWriterFactory` both throw 
`RuntimeException` in the `getRdfFormat()` method (the former questionably so, 
since there exists the `RDFFormat.RDFJSON` file format).


---


[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow

2018-09-14 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/121
  
@jgrzebyta please check out [this 
PR](https://github.com/apache/any23/pull/122), which is an implementation of 
what I've just described.

It seems like a much simpler and less error-prone way to produce a 
domain-specific RDF graph.

Eager to know your thoughts!


---


[GitHub] any23 pull request #122: ANY23-396 allow mapping/filtering TripleHandlers in...

2018-09-14 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/122

ANY23-396 allow mapping/filtering TripleHandlers in Rover

Here is one possible alternative to the existing PR for ANY23-396.

**Pros:**

1. Fully backwards compatible
2. Extends `WriterFactory` with new `DelegatingWriterFactory` interface, 
which, rather than writing a `TripleHandler` to an output stream, writes a 
`TripleHandler` to another `TripleHandler`. This will allow users to produce a 
final domain-specific RDF graph of their choosing in Rover by implementing 
mapping/filtering `DelegatingWriterFactory` implementations. 
3. The `--format` flag in Rover now represents a list of WriterFactory ids, 
rather than a single WriterFactory id. Each id in the list is composed with the 
one previous to it to construct the final `TripleHandler`. All writers in the 
list, except the last, are required to implement `DelegatingWriterFactory`.

**Cons:** 
1. This solution requires deprecating 3 methods in the `WriterFactory` 
interface (and then un-deprecating them in the extending `FormatWriterFactory` 
interface.) However, this drawback does not affect backwards compatibility. 

## ALTERNATIVE

In order to avoid the single "con" I have listed, the alternative to this 
solution would be, rather than extending the `WriterFactory` interface with 
`DelegatingWriterFactory`, to keep these two interfaces completely separate and 
define a new `DelegatingWriterFactoryRegistry` (analogous to the 
`WriterFactoryRegistry`) with a different `ServiceLoader` in order to load 
`DelegatingWriterFactory` implementations.

@jgrzebyta @lewismc Thoughts? 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-396

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/122.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #122


commit cb293e22b9352652d91474cd3e35233e75dc9fb9
Author: Hans 
Date:   2018-09-14T15:29:33Z

ANY23-396 allow mapping/filtering TripleHandlers in Rover




---


[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow

2018-09-12 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/121
  
@jgrzebyta One easy solution to my above comment that I can think of right 
off the bat is as follows:

First, we could extend the WriterFactory interface as follows (or similar):

```
interface DelegatingWriterFactory extends WriterFactory {
    TripleHandler getWriter(TripleHandler delegate);
}
```

Second, in the rover `--format` flag (which actually accepts a 
WriterFactory *id*, not necessarily a format name), we could simply allow a 
comma-separated *list* of WriterFactory ids rather than a single id. Then, to 
construct the final writer, we'd compose each writer in the list with the 
previous one, i.e.:

```
Collections.reverse(listOfIds);

tripleHandler = getWriterFactoryForId(listOfIds.get(0)).getRdfWriter(outputStream);

for (String id : listOfIds.subList(1, listOfIds.size())) {
    tripleHandler = ((DelegatingWriterFactory) getWriterFactoryForId(id)).getWriter(tripleHandler);
}
```
```
This is just one initial idea, but food for thought. It also seems more in 
line with your concept of a "flow".
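
The same back-to-front composition can be tried out without any Any23 types. In this minimal sketch everything is a hypothetical stand-in: `Handler` plays the role of `TripleHandler`, and a `UnaryOperator<Handler>` plays the role of a delegating writer factory. It only illustrates the ordering: the last entry is the terminal writer and each earlier entry wraps whatever follows it.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-ins for TripleHandler and delegating factories.
final class WriterChainSketch {

    // Receives one statement at a time.
    interface Handler {
        void receive(String subject, String predicate, String object);
    }

    // Wraps the terminal handler with each delegating stage, last stage first,
    // so that the first stage in the list sees every statement first.
    static Handler compose(List<UnaryOperator<Handler>> delegating, Handler terminal) {
        Handler handler = terminal;
        for (int i = delegating.size() - 1; i >= 0; i--) {
            handler = delegating.get(i).apply(handler);
        }
        return handler;
    }

    // A filtering stage: drops statements with the given predicate.
    static UnaryOperator<Handler> dropPredicate(String unwanted) {
        return next -> (s, p, o) -> {
            if (!unwanted.equals(p)) {
                next.receive(s, p, o);
            }
        };
    }

    // A mapping stage: rewrites one predicate to another.
    static UnaryOperator<Handler> renamePredicate(String from, String to) {
        return next -> (s, p, o) -> next.receive(s, from.equals(p) ? to : p, o);
    }
}
```

Because the stages nest, their order on the command line matters: renaming `a` to `b` and then dropping `b` discards the statement, whereas dropping `b` first and renaming afterwards lets it through.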

What do you think?



---


[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow

2018-09-12 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/121
  
@jgrzebyta But as far as rover goes, you're right: we currently don't have 
support for using an arbitrary triple handler. Looks like it expects an 
RDFFormat and then finds a triple handler based on that. 

I wonder if it would be possible to allow a more flexible way to specify 
triple handlers as rover arguments to fix this problem? While I don't think 
that creating a `ModelExtractor` as currently defined in this PR is the way to 
go, I do think that rover needs to be improved in this respect.

I will think on this.


---


[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

2018-09-12 Thread HansBrende
Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/121#discussion_r217159830
  
--- Diff: 
core/src/main/java/org/apache/any23/writer/BufferedTripleHandler.java ---
@@ -0,0 +1,161 @@
+package org.apache.any23.writer;
+
+import com.google.common.base.Throwables;
+import org.apache.any23.extractor.ExtractionContext;
+import org.eclipse.rdf4j.model.IRI;
+import org.eclipse.rdf4j.model.Model;
+import org.eclipse.rdf4j.model.Resource;
+import org.eclipse.rdf4j.model.Value;
+import org.eclipse.rdf4j.model.impl.LinkedHashModelFactory;
+import org.eclipse.rdf4j.model.impl.TreeModelFactory;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Map;
+import java.util.Stack;
+import java.util.TreeMap;
+
+/**
+ * Collects all statements until end document.
+ *
+ * All statements are kept within {@link Model}.
+ *
+ * @author Jacek Grzebyta (jgrzeb...@apache.org)
+ */
+public class BufferedTripleHandler implements TripleHandler {
+
+private static final Logger log = 
LoggerFactory.getLogger(BufferedTripleHandler.class);
+private TripleHandler underlying;
+private static boolean isDocumentFinish = false;
+
+private static class ContextHandler {
+ContextHandler(ExtractionContext ctx, Model m) {
+extractionContext = ctx;
+extractionModel = m;
+}
+ExtractionContext extractionContext;
+Model extractionModel;
+}
+
+private static class WorkflowContext {
+WorkflowContext(TripleHandler underlying) {
+this.rootHandler = underlying;
+}
+
+
+Stack extractors = new Stack<>();
+Map modelMap = new TreeMap<>();
+IRI documentIRI = null;
+TripleHandler rootHandler ;
+}
+
+public BufferedTripleHandler(TripleHandler underlying) {
+this.underlying = underlying;
+
+// hide model in the thread
+WorkflowContext wc = new WorkflowContext(underlying);
+BufferedTripleHandler.workflowContext.set(wc);
+}
+
+private static final ThreadLocal workflowContext = 
new ThreadLocal<>();
+
+/**
+ * Returns model which contains all other models.
+ * @return
+ */
+public static Model getModel() {
+return 
BufferedTripleHandler.workflowContext.get().modelMap.values().stream()
+.map(ch -> ch.extractionModel)
+.reduce(new LinkedHashModelFactory().createEmptyModel(), 
(mf, exm) -> {
+mf.addAll(exm);
+return mf;
+});
+}
+
+@Override
+public void startDocument(IRI documentIRI) throws 
TripleHandlerException {
+BufferedTripleHandler.workflowContext.get().documentIRI = 
documentIRI;
+}
+
+@Override
+public void openContext(ExtractionContext context) throws 
TripleHandlerException {
+//
+}
+
+@Override
+public void receiveTriple(Resource s, IRI p, Value o, IRI g, 
ExtractionContext context) throws TripleHandlerException {
+getModelForContext(context).add(s,p,o,g);
+}
+
+@Override
+public void receiveNamespace(String prefix, String uri, 
ExtractionContext context) throws TripleHandlerException {
+getModelForContext(context).setNamespace(prefix, uri);
+}
+
+@Override
+public void closeContext(ExtractionContext context) throws 
TripleHandlerException {
+//
+}
+
+@Override
+public void endDocument(IRI documentIRI) throws TripleHandlerException 
{
+BufferedTripleHandler.isDocumentFinish = true;
+}
+
+@Override
+public void setContentLength(long contentLength) {
+underlying.setContentLength(contentLength);
+}
+
+@Override
+public void close() throws TripleHandlerException {
+underlying.close();
+}
+
+/**
+ * Releases content of the model into underlying writer.
+ */
+public static void releaseModel() throws TripleHandlerException {
+if(!BufferedTripleHandler.isDocumentFinish) {
+throw new RuntimeException("Before releasing document should 
be finished.");
+}
+
+WorkflowContext workflowContext = 
BufferedTripleHandler.workflowContext.get();
+
+String lastExtractor = ((Stack) 
workflowContext.extractors).peek();
--- End diff --

@jgrzebyta IMHO, it would be vastly more straightforward to simply have the 
user extend the 
[`Composite

[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow

2018-09-12 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/121
  
Another thought:

Using the TripleHandler interface (as intended) to transform triples, 
rather than a separate ModelExtractor, has the added advantage that the triples 
might not necessarily need to be stored in memory during the transformation 
process. The user could implement either a "collecting" triple handler which 
stores statements in memory prior to transforming them, or a "streaming" triple 
handler for transformation-on-the-fly (e.g., if mapping some predicate A to 
some other predicate B), or some combination of these two concepts. The 
"collecting" ability could be easily supplemented with a `ModelWriter` or 
equivalent, as in [ANY23-397](https://issues.apache.org/jira/browse/ANY23-397).
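
As a rough illustration of the "streaming" case, here is a minimal sketch. It assumes a pared-down, hypothetical `SimpleTripleHandler` interface with plain strings standing in for Any23's real `TripleHandler` and RDF4J values; `PredicateMappingHandler` and `RecordingHandler` are likewise made-up names. The point is only that a predicate mapping can forward each triple as it arrives, without buffering:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a triple handler interface
// (NOT the real org.apache.any23.writer.TripleHandler).
interface SimpleTripleHandler {
    void receiveTriple(String s, String p, String o);
}

// "Streaming" transformation: rewrites predicate `from` to predicate `to`
// on the fly and forwards every triple immediately; nothing is stored.
class PredicateMappingHandler implements SimpleTripleHandler {
    private final SimpleTripleHandler underlying;
    private final String from;
    private final String to;

    PredicateMappingHandler(SimpleTripleHandler underlying, String from, String to) {
        this.underlying = underlying;
        this.from = from;
        this.to = to;
    }

    @Override
    public void receiveTriple(String s, String p, String o) {
        underlying.receiveTriple(s, from.equals(p) ? to : p, o);
    }
}

// Terminal handler that records what it receives, for demonstration only.
class RecordingHandler implements SimpleTripleHandler {
    final List<String> lines = new ArrayList<>();

    @Override
    public void receiveTriple(String s, String p, String o) {
        lines.add(s + " " + p + " " + o);
    }
}
```

Wrapping a `RecordingHandler` in a `PredicateMappingHandler` and feeding it triples leaves the mapped triples in `lines`, in arrival order.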

But adding a separate "ModelExtractor" concept only muddles this 
already-existing ability to transform triples with TripleHandlers by 
introducing a redundant construct of more limited abstraction power than what 
already exists.

So for me:
-1 for ANY23-396
+1 for ANY23-397

@lewismc any thoughts?


---


[GitHub] any23 issue #121: ANY23-396 Add ability to run extractors in flow

2018-09-12 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/121
  
Aside from the comments I've made on this PR, I'm still not convinced that 
having a ModelExtractor is a good idea in the first place. Why not just create 
a ModelWriter (as in ANY23-397) or an equivalent "collecting" TripleHandler, 
and then allow the end user to transform the collected statements however they 
wish?

Having a ModelExtractor creates additional questions & complexities: in 
what order are the extractors executed? (Certainly the ModelExtractors would 
have to be executed last in order to have access to all previously collected 
statements.) What if multiple ModelExtractors are declared? Which ones have 
higher precedence in the extraction order?

I'm not sure that having a dedicated ModelExtractor is worth the trouble of 
dealing with these complexities, when a user could accomplish the same thing by 
simply transforming the statements collected by a ModelWriter or equivalent, or 
defining their own filtering and/or mapping TripleHandler.
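
A minimal sketch of the "collect, then transform" alternative follows. The `Triple`, `TripleSink`, `CollectingSink`, and `CollectThenTransform` names are hypothetical simplifications, not Any23 or RDF4J types. The handler only buffers, and the caller decides what to do with the collected statements afterwards, so no extractor ordering questions arise:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, simplified triple type (stand-in for an RDF4J statement).
final class Triple {
    final String s;
    final String p;
    final String o;

    Triple(String s, String p, String o) {
        this.s = s;
        this.p = p;
        this.o = o;
    }
}

// Hypothetical, simplified handler interface (stand-in for TripleHandler).
interface TripleSink {
    void receiveTriple(String s, String p, String o);
}

// "Collecting" handler: buffers every triple so that the caller can
// post-process the complete set once extraction has finished.
class CollectingSink implements TripleSink {
    private final List<Triple> collected = new ArrayList<>();

    @Override
    public void receiveTriple(String s, String p, String o) {
        collected.add(new Triple(s, p, o));
    }

    List<Triple> getCollected() {
        return collected;
    }
}

class CollectThenTransform {
    // User-defined transformation applied after collection:
    // keep only the triples that carry the given predicate.
    static List<Triple> keepPredicate(List<Triple> in, String predicate) {
        return in.stream()
                .filter(t -> t.p.equals(predicate))
                .collect(Collectors.toList());
    }
}
```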


---


[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

2018-09-12 Thread HansBrende
Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/121#discussion_r217044600
  
--- Diff: api/src/main/resources/default-configuration.properties ---
@@ -76,3 +76,6 @@ any23.extraction.csv.comment=#
 # A confidence threshold for the OpenIE extractions
 # Any extractions below this value will not be processed.
 any23.extraction.openie.confidence.threshold=0.5
+
+# Allows to enable(on)/disable(off) the workflow feature
+any23.extraction.workflows=off
--- End diff --

No extra flag should be needed for this.


---


[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

2018-09-12 Thread HansBrende
Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/121#discussion_r217044167
  
--- Diff: cli/src/main/java/org/apache/any23/cli/Rover.java ---
@@ -172,6 +174,8 @@ protected void configure() {
  defaultns);
 }
 
+
extractionParameters.setFlag(ExtractionParameters.EXTRACTION_WORKFLOWS_FLAG, 
workflow);
--- End diff --

We should not need a separate flag to enable certain extractors. If an 
extractor is contained within the extractor group we are using, then that 
should be, on its own, enough to enable itself.


---


[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

2018-09-12 Thread HansBrende
Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/121#discussion_r217042421
  
--- Diff: 
core/src/main/java/org/apache/any23/writer/BufferedTripleHandler.java ---
@@ -0,0 +1,161 @@
+package org.apache.any23.writer;
+
+import com.google.common.base.Throwables;
+import org.apache.any23.extractor.ExtractionContext;
+import org.eclipse.rdf4j.model.IRI;
+import org.eclipse.rdf4j.model.Model;
+import org.eclipse.rdf4j.model.Resource;
+import org.eclipse.rdf4j.model.Value;
+import org.eclipse.rdf4j.model.impl.LinkedHashModelFactory;
+import org.eclipse.rdf4j.model.impl.TreeModelFactory;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Map;
+import java.util.Stack;
+import java.util.TreeMap;
+
+/**
+ * Collects all statements until end document.
+ *
+ * All statements are kept within {@link Model}.
+ *
+ * @author Jacek Grzebyta (jgrzeb...@apache.org)
+ */
+public class BufferedTripleHandler implements TripleHandler {
+
+private static final Logger log = 
LoggerFactory.getLogger(BufferedTripleHandler.class);
+private TripleHandler underlying;
+private static boolean isDocumentFinish = false;
+
+private static class ContextHandler {
+ContextHandler(ExtractionContext ctx, Model m) {
+extractionContext = ctx;
+extractionModel = m;
+}
+ExtractionContext extractionContext;
+Model extractionModel;
+}
+
+private static class WorkflowContext {
+WorkflowContext(TripleHandler underlying) {
+this.rootHandler = underlying;
+}
+
+
+Stack extractors = new Stack<>();
+Map modelMap = new TreeMap<>();
+IRI documentIRI = null;
+TripleHandler rootHandler ;
+}
+
+public BufferedTripleHandler(TripleHandler underlying) {
+this.underlying = underlying;
+
+// hide model in the thread
+WorkflowContext wc = new WorkflowContext(underlying);
+BufferedTripleHandler.workflowContext.set(wc);
+}
+
+private static final ThreadLocal<WorkflowContext> workflowContext = 
new ThreadLocal<>();
+
+/**
+ * Returns model which contains all other models.
+ * @return
+ */
+public static Model getModel() {
+return 
BufferedTripleHandler.workflowContext.get().modelMap.values().stream()
+.map(ch -> ch.extractionModel)
+.reduce(new LinkedHashModelFactory().createEmptyModel(), 
(mf, exm) -> {
+mf.addAll(exm);
+return mf;
+});
+}
+
+@Override
+public void startDocument(IRI documentIRI) throws 
TripleHandlerException {
+BufferedTripleHandler.workflowContext.get().documentIRI = 
documentIRI;
+}
+
+@Override
+public void openContext(ExtractionContext context) throws 
TripleHandlerException {
+//
+}
+
+@Override
+public void receiveTriple(Resource s, IRI p, Value o, IRI g, 
ExtractionContext context) throws TripleHandlerException {
+getModelForContext(context).add(s,p,o,g);
+}
+
+@Override
+public void receiveNamespace(String prefix, String uri, 
ExtractionContext context) throws TripleHandlerException {
+getModelForContext(context).setNamespace(prefix, uri);
+}
+
+@Override
+public void closeContext(ExtractionContext context) throws 
TripleHandlerException {
+//
+}
+
+@Override
+public void endDocument(IRI documentIRI) throws TripleHandlerException 
{
+BufferedTripleHandler.isDocumentFinish = true;
+}
+
+@Override
+public void setContentLength(long contentLength) {
+underlying.setContentLength(contentLength);
+}
+
+@Override
+public void close() throws TripleHandlerException {
+underlying.close();
+}
+
+/**
+ * Releases content of the model into underlying writer.
+ */
+public static void releaseModel() throws TripleHandlerException {
+if(!BufferedTripleHandler.isDocumentFinish) {
+throw new RuntimeException("Before releasing document should 
be finished.");
+}
+
+WorkflowContext workflowContext = 
BufferedTripleHandler.workflowContext.get();
+
+String lastExtractor = ((Stack<String>) 
workflowContext.extractors).peek();
--- End diff --

Feels hacky... what if not all of the triples came from the same extractor?


---


[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

2018-09-12 Thread HansBrende
Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/121#discussion_r217041561
  
--- Diff: 
core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java ---
@@ -483,6 +488,14 @@ private SingleExtractionReport runExtractor(
 documentReport.getDocument(),
 extractionResult
 );
+} else if (extractor instanceof ModelExtractor) {
+final ModelExtractor modelExtractor = (ModelExtractor) 
extractor;
+final Model singleModel = BufferedTripleHandler.getModel();
--- End diff --

Should not be static.


---


[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

2018-09-12 Thread HansBrende
Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/121#discussion_r217041300
  
--- Diff: 
core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java ---
@@ -295,6 +294,12 @@ public SingleDocumentExtractionReport 
run(ExtractionParameters extractionParamet
 } finally {
try {
output.endDocument(documentIRI);
+
+   // in case of workflow flag release data from model
+if 
(extractionParameters.getFlag(ExtractionParameters.EXTRACTION_WORKFLOWS_FLAG)) {
+BufferedTripleHandler.releaseModel();
--- End diff --

This should not be a static call.


---


[GitHub] any23 pull request #121: ANY23-396 Add ability to run extractors in flow

2018-09-12 Thread HansBrende
Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/121#discussion_r217037875
  
--- Diff: 
core/src/main/java/org/apache/any23/writer/BufferedTripleHandler.java ---
@@ -0,0 +1,161 @@
+package org.apache.any23.writer;
+
+import com.google.common.base.Throwables;
+import org.apache.any23.extractor.ExtractionContext;
+import org.eclipse.rdf4j.model.IRI;
+import org.eclipse.rdf4j.model.Model;
+import org.eclipse.rdf4j.model.Resource;
+import org.eclipse.rdf4j.model.Value;
+import org.eclipse.rdf4j.model.impl.LinkedHashModelFactory;
+import org.eclipse.rdf4j.model.impl.TreeModelFactory;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Map;
+import java.util.Stack;
+import java.util.TreeMap;
+
+/**
+ * Collects all statements until end document.
+ *
+ * All statements are kept within {@link Model}.
+ *
+ * @author Jacek Grzebyta (jgrzeb...@apache.org)
+ */
+public class BufferedTripleHandler implements TripleHandler {
+
+private static final Logger log = 
LoggerFactory.getLogger(BufferedTripleHandler.class);
+private TripleHandler underlying;
+private static boolean isDocumentFinish = false;
+
+private static class ContextHandler {
+ContextHandler(ExtractionContext ctx, Model m) {
+extractionContext = ctx;
+extractionModel = m;
+}
+ExtractionContext extractionContext;
+Model extractionModel;
+}
+
+private static class WorkflowContext {
+WorkflowContext(TripleHandler underlying) {
+this.rootHandler = underlying;
+}
+
+
+Stack extractors = new Stack<>();
+Map modelMap = new TreeMap<>();
+IRI documentIRI = null;
+TripleHandler rootHandler ;
+}
+
+public BufferedTripleHandler(TripleHandler underlying) {
+this.underlying = underlying;
+
+// hide model in the thread
+WorkflowContext wc = new WorkflowContext(underlying);
+BufferedTripleHandler.workflowContext.set(wc);
+}
+
+private static final ThreadLocal<WorkflowContext> workflowContext = 
new ThreadLocal<>();
--- End diff --

Model should not be static, unless there is a very good reason for doing so?
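
To illustrate the concern, here is a sketch with a hypothetical `InstanceBufferingHandler` (not the PR's code). With the buffer and the "finished" flag held in instance fields, two handlers stay independent even on the same thread, whereas a static flag such as `isDocumentFinish` in the diff above is shared by every handler in the JVM:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical buffering handler whose state lives in the instance,
// not in a static field or a ThreadLocal.
class InstanceBufferingHandler {
    private final List<String> buffer = new ArrayList<>();
    private boolean documentFinished = false;

    void receiveTriple(String s, String p, String o) {
        buffer.add(s + " " + p + " " + o);
    }

    void endDocument() {
        documentFinished = true;
    }

    // Instance method: releases only this handler's triples.
    List<String> releaseModel() {
        if (!documentFinished) {
            throw new IllegalStateException("Document must be finished before releasing.");
        }
        return new ArrayList<>(buffer);
    }
}
```

Finishing one handler's document has no effect on another handler, which would not hold if the flag were static.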


---


[GitHub] any23 issue #120: ANY23-393 Any23 master to build under JDK 10.X

2018-08-29 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/120
  
@lewismc  +1. Just out of curiosity, what is the new 
`javax.xml.bind:jaxb-api` dependency for?


---


[GitHub] any23 issue #118: ANY23-390 implement ICal, JCal, and XCal extractors

2018-08-27 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/118
  
@lewismc Glad you like it! Merged to master. Please let me know if you 
discover any issues.


---


[GitHub] any23 pull request #118: ANY23-390 implement ICal, JCal, and XCal extractors

2018-08-21 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/118

ANY23-390 implement ICal, JCal, and XCal extractors

This is my first stab at implementing the ical, jcal, and xcal extractors.

@lewismc Any input?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-390

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/118.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #118


commit 54a92960ac2fda9510041b6886eb7259a9b1220b
Author: Hans 
Date:   2018-08-21T16:37:35Z

ANY23-390 implement ICal, JCal, and XCal extractors




---


[GitHub] any23 issue #116: Any23 388

2018-08-17 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/116
  
@larsgsvensson it looks like you may have rebased in the wrong direction? 
One problem is that your master branch is 3 commits ahead of the apache/master 
branch. So here's what I would do (before making any more commits):

``` bash
git checkout master
git reset --hard HEAD~3
git pull https://github.com/apache/any23.git master
git push -f origin master

git checkout ANY23-388
git reset --hard HEAD~5
git rebase master
git push -f origin ANY23-388
```

After you do that, then make your changes to the ANY23-388 branch. Then:

```bash
git add .
git commit -m "ANY23-388 [message]"
git push origin ANY23-388
```

Then we should be good to go.



---


[GitHub] any23 pull request #116: Any23 388

2018-08-16 Thread HansBrende
Github user HansBrende commented on a diff in the pull request:

https://github.com/apache/any23/pull/116#discussion_r210745076
  
--- Diff: 
core/src/main/java/org/apache/any23/writer/RDFWriterTripleHandler.java ---
@@ -35,7 +35,7 @@
  */
 public abstract class RDFWriterTripleHandler implements FormatWriter, 
TripleHandler {
 
-private final RDFWriter writer;
+protected final RDFWriter writer;
 
--- End diff --

@jgrzebyta I don't see any reason to disallow protected access to the 
`writer` if we allow protected access to the constructor. Subclasses could 
bypass the `private` modifier anyway by simply saving a reference to the writer 
in the constructor.
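
A small sketch of that bypass, using hypothetical `DemoWriter`, `BaseHandler`, and `SubHandler` classes as stand-ins for `RDFWriter`, `RDFWriterTripleHandler`, and a subclass:

```java
// Hypothetical stand-in for RDFWriter.
class DemoWriter {
    final String name;

    DemoWriter(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for RDFWriterTripleHandler: the field is private,
// but the writer is handed in through a protected constructor.
abstract class BaseHandler {
    private final DemoWriter writer;

    protected BaseHandler(DemoWriter writer) {
        this.writer = writer;
    }

    String baseWriterName() {
        return writer.name;
    }
}

// The subclass keeps its own reference to the writer it passes up, so the
// private modifier on the base field does not actually hide the object.
class SubHandler extends BaseHandler {
    private final DemoWriter myWriter;

    SubHandler(DemoWriter writer) {
        super(writer);
        this.myWriter = writer;
    }

    DemoWriter writer() {
        return myWriter;
    }
}
```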


---


[GitHub] any23 issue #116: Any23 388

2018-08-16 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/116
  
@larsgsvensson Thanks for the pull request! It looks like some of your 
commits reverse recent changes made to the master branch. You might want to 
just start over with a clean pull from master.


---


[GitHub] any23 issue #104: Any23 295: Implement ability to use librdfa

2018-08-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/104
  
OK, I added some whitespace to a package-info.java file, so the branch 
should be showing up now.


---


[GitHub] any23 issue #104: Any23 295: Implement ability to use librdfa

2018-08-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/104
  
Ok, apparently due to some quirkiness of the way mirroring works, new 
branches in git-wip will not show up on github until an actual change is made 
to the branch.


---


[GitHub] any23 issue #104: Any23 295: Implement ability to use librdfa

2018-08-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/104
  
It is showing up here: https://git-wip-us.apache.org/repos/asf?p=any23.git

But not here: https://github.com/apache/any23/branches


---


[GitHub] any23 issue #104: Any23 295: Implement ability to use librdfa

2018-08-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/104
  
@lewismc I attempted to create a new branch using:

```
git checkout -b ANY23-295
git push canonical ANY23-295
```
to which git responded with:
```
To https://git-wip-us.apache.org/repos/asf/any23.git
 * [new branch]ANY23-295 -> ANY23-295
```
However, it isn't showing up under "Branches". Not sure why. It did show 
up under my own "Branches" when I pushed to `origin`.


---


[GitHub] any23 issue #104: Any23 295: Implement ability to use librdfa

2018-08-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/104
  
@JulioCCBUcuenca Alright, sounds great then!


---


[GitHub] any23 issue #104: Any23 295: Implement ability to use librdfa

2018-08-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/104
  
@JulioCCBUcuenca Great to hear it passes the semargl test suites! However, 
`AbstractRDFaExtractorTestCase` contains only a fraction of the tests we run on 
RDFa. You should also test against the 
[RDFaExtractorTest](https://github.com/apache/any23/blob/master/core/src/test/java/org/apache/any23/extractor/rdfa/RDFaExtractorTest.java)
 and 
[RDFa11ExtractorTest](https://github.com/apache/any23/blob/master/core/src/test/java/org/apache/any23/extractor/rdfa/RDFa11ExtractorTest.java)
 classes.


---


[GitHub] any23 issue #104: Any23 295: Implement ability to use librdfa

2018-08-08 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/104
  
`Current XML parsers can use both lang or xml:lang, but since librdfa uses 
an old library for parsing XML it generates an error since it cannot identify 
the language.`

That seems worrisome to me, considering that most HTML pages would then 
break the librdfa parser. 

@lewismc should we not test more thoroughly *before* merging to master? 
Maybe a separate branch instead?

Also, I would think that all of our current rdfa parsing tests should be 
duplicated for the librdfa parser to ensure that it is *at least as stable as 
our current rdfa parser*.  TBH, I'd be in favor of adding this into version 2.4 
rather than 2.3 so we have more time to thoroughly test the module.


---


[GitHub] any23 pull request #115: ANY23-385 improve encoding detection

2018-08-05 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/115

ANY23-385 improve encoding detection

1. Increase default sniff limit for text charset detection from 12000 bytes 
to 65536 bytes
2. Include results of xml declaration encoding detection
3. Include results of html meta charset encoding detection
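
For readers unfamiliar with items 2 and 3, the sketch below shows one way declared encodings can be read from the head of a document. It is an assumption-laden illustration, not the code in this pull request: `DeclaredCharsetSniffer`, its regular expressions, and the XML-before-meta precedence are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: looks for an encoding declared inside the document.
class DeclaredCharsetSniffer {
    // Matches encoding="..." inside an XML declaration at the start of the input.
    private static final Pattern XML_DECL = Pattern.compile(
            "^\\s*<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._:-]+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Matches <meta charset="..."> as well as the older form
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // `head` is the first chunk of the document (e.g. up to the sniff limit).
    static Optional<String> sniff(byte[] head) {
        // Decode as ISO-8859-1 so every byte maps to exactly one char; the
        // declarations themselves are ASCII in ASCII-compatible encodings.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher xml = XML_DECL.matcher(text);
        if (xml.find()) {
            return Optional.of(xml.group(1));
        }
        Matcher meta = META_CHARSET.matcher(text);
        if (meta.find()) {
            return Optional.of(meta.group(1));
        }
        return Optional.empty();
    }
}
```

A declared name found this way is only a hint; it would still need to be weighed against statistical detection, since declarations can be wrong.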

mvn clean test -> all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-385

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/115.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #115


commit 22b3047d55f5e5b8fcba9c912424c9ed45313163
Author: Hans 
Date:   2018-08-05T23:39:01Z

ANY23-385 improve encoding detection




---


[GitHub] any23 pull request #114: ANY23-383 allow all unicode space chars in JSON-LD

2018-08-04 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/114

ANY23-383 allow all unicode space chars in JSON-LD

mvn clean test -> all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-383

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/114.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #114


commit 0df8cdba68fea0c6dcf819759627759c7597f0cb
Author: Hans 
Date:   2018-08-04T05:47:16Z

ANY23-383 allow all unicode space characters in JSON-LD




---


[GitHub] any23 pull request #113: ANY23-382 don't kill extraction on fatal json parsi...

2018-08-03 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/113

ANY23-382 don't kill extraction on fatal json parsing errors

mvn clean test -> all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-382

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/113.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #113


commit 837f92b9167d7944dbc88a965d6e17cf22b375e0
Author: Hans 
Date:   2018-08-03T21:06:15Z

ANY23-382 don't kill extraction on fatal json parsing errors




---


[GitHub] any23 pull request #112: ANY23-381 escape illegal characters in JSON-LD stri...

2018-08-02 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/112

ANY23-381 escape illegal characters in JSON-LD strings

mvn clean test -> all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-381

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/112.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #112


commit 817e744af90d8f3c9bf419e5c395c421e0c3924a
Author: Hans 
Date:   2018-08-02T21:33:36Z

ANY23-381 fix illegal unescaped characters in JSON-LD




---


[GitHub] any23 pull request #110: ANY23-380 disallow duplicate attribute keys

2018-08-02 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/110

ANY23-380 disallow duplicate attribute keys

I disallowed duplicate attribute keys in html to avoid 
`org.xml.sax.SAXParseException`s.

Along the way, I also cleaned up some annoying or unnecessary 
logging/console output produced by our massive suite of test cases.

Also cleaned up some javadoc/miscellaneous items.

mvn clean test -> all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-380

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/110.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #110


commit 4e3011a4d80545f04563f427687f4fa74e17103f
Author: Hans 
Date:   2018-08-01T21:06:55Z

ANY23-380 disallow duplicate attribute keys

commit 159aeb489473f600213142a746d39a49e3d3548b
Author: Hans 
Date:   2018-08-02T17:46:44Z

cleaned up annoying logging/console output

commit 0291f588d04859053ef4eb8845686bad824b4461
Author: Hans 
Date:   2018-08-02T18:01:19Z

added license and javadoc




---


[GitHub] any23 pull request #109: ANY23-379 remove invalid XML characters from docume...

2018-08-01 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/109

ANY23-379 remove invalid XML characters from document

mvn clean test -> all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-379

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/109.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #109


commit 36fba681e4295b65faf61b251308ecd8d2aa6771
Author: Hans 
Date:   2018-08-01T18:46:14Z

ANY23-379 remove invalid XML characters from document




---


[GitHub] any23 pull request #108: ANY23-378 clean commas in JSON-LD

2018-08-01 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/108

ANY23-378 clean commas in JSON-LD

Remove trailing commas from objects and arrays. Also replace semicolons 
with commas (compare to gson's `JsonReader.setLenient()`). 
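
The cleaning described above can be sketched as a single pass over the text. This is an illustrative approximation under my own assumptions, not the code in this pull request: `JsonCommaCleaner` is a hypothetical name, and the sketch only skips whitespace when deciding whether a separator is trailing.

```java
// Hypothetical helper illustrating the comma cleanup.
class JsonCommaCleaner {
    static String clean(String json) {
        StringBuilder out = new StringBuilder(json.length());
        boolean inString = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                // Inside a string literal everything is copied verbatim.
                out.append(c);
                if (c == '\\' && i + 1 < json.length()) {
                    out.append(json.charAt(++i)); // copy the escaped character
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            if (c == '"') {
                inString = true;
                out.append(c);
            } else if (c == ',' || c == ';') {
                // Look past whitespace for the next significant character.
                int j = i + 1;
                while (j < json.length() && Character.isWhitespace(json.charAt(j))) {
                    j++;
                }
                boolean trailing = j < json.length()
                        && (json.charAt(j) == '}' || json.charAt(j) == ']');
                if (!trailing) {
                    out.append(','); // semicolons are emitted as commas
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Separators inside string values are left untouched, which is why the scanner tracks string state instead of using a plain search-and-replace.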

mvn clean test -> all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-378

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/108.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #108


commit aae21370e70715f82f7cc868b9a298f1178d0f80
Author: Hans 
Date:   2018-08-01T16:25:21Z

ANY23-378 clean commas in JSON-LD




---


[GitHub] any23 pull request #107: ANY23-377 don't replace empty strings with 'Null'

2018-07-31 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/107

ANY23-377 don't replace empty strings with 'Null'

mvn clean test -> all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-377

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/107.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #107


commit a07d1f058fcdc2d994dcd220759310737fe68965
Author: Hans 
Date:   2018-07-31T21:37:25Z

ANY23-377 don't replace empty strings with 'Null'




---


[GitHub] any23 pull request #106: ANY23-376 fix IllegalArgumentException in microdata...

2018-07-31 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/106

ANY23-376 fix IllegalArgumentException in microdata extractor

mvn clean test -> all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-376

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/106.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #106


commit 6173637bb801da62b07b69be64fa2c75f8d54904
Author: Hans 
Date:   2018-07-31T20:35:55Z

ANY23-376 fix IllegalArgumentException in microdata extractor




---


[GitHub] any23 pull request #105: ANY23-374 fix schemeless microdata urls

2018-07-31 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/105

ANY23-374 fix schemeless microdata urls

Fixes microdata itemtype urls that are lacking a scheme by using a default 
scheme of "http".
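
As a rough sketch of the idea (not the code in this pull request; `ItemTypeUrls` and `withDefaultScheme` are hypothetical names, and the handling of protocol-relative `//host/path` values is my own assumption):

```java
import java.net.URI;

// Hypothetical helper illustrating the default-scheme fix.
class ItemTypeUrls {
    // Returns the itemtype unchanged if it already has a scheme; otherwise
    // prefixes it with the default scheme "http".
    static String withDefaultScheme(String itemType) {
        String trimmed = itemType.trim();
        // URI.create throws IllegalArgumentException on malformed input.
        if (URI.create(trimmed).getScheme() != null) {
            return trimmed;
        }
        if (trimmed.startsWith("//")) {
            return "http:" + trimmed; // protocol-relative form
        }
        return "http://" + trimmed;
    }
}
```

Note that a value such as `localhost:8080/x` would be parsed by `java.net.URI` as having the scheme `localhost`, so a real implementation needs more care than this sketch shows.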

mvn clean test -> all tests passed.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-374

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/105.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #105


commit d283d70ceb692cacb1f31659ee5d5c987822028f
Author: Hans 
Date:   2018-07-31T17:21:26Z

ANY23-374 fix schemeless microdata urls




---


[GitHub] any23 issue #102: ANY23-367 update 'latest.stable.released' property

2018-07-18 Thread HansBrende
Github user HansBrende commented on the issue:

https://github.com/apache/any23/pull/102
  
@lewismc Any comments or am I good to merge this?


---


[GitHub] any23 pull request #103: ANY23-369 Resolved overlapping dependencies

2018-07-18 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/103

ANY23-369 Resolved overlapping dependencies

Taking a hint from the [`tika-parsers` 
pom](https://github.com/apache/tika/blob/master/tika-parsers/pom.xml) and the 
[`rdf4j-rio-jsonld` 
pom](https://github.com/eclipse/rdf4j/blob/master/rio/jsonld/pom.xml), I 
excluded the following libraries from the project:
- `stax:stax-api`
- `org.apache.httpcomponents:fluent-hc`
- `org.apache.httpcomponents:httpcore-nio`
- `org.apache.httpcomponents:httpcore-osgi`
- `org.apache.httpcomponents:httpclient-osgi`

mvn clean test -> all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-369

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/103.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #103


commit 0259c695cc9e75fd0156018976391bab04d4d3c1
Author: Hans 
Date:   2018-07-18T19:11:32Z

ANY23-369 Resolved overlapping dependencies




---


[GitHub] any23 pull request #102: ANY23-367 update 'latest.stable.released' property

2018-07-16 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/102

ANY23-367 update 'latest.stable.released' property

@lewismc anything else I need to do here to ensure this refactor works 
properly?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-367

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/102.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #102


commit 950631873fec6e931859ea22b3beb91577164b25
Author: Hans 
Date:   2018-07-16T14:38:21Z

ANY23-367 update 'latest.stable.released' property




---


[GitHub] any23 pull request #101: ANY23-366 resolved additional build warnings

2018-07-12 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/101

ANY23-366 resolved additional build warnings

1. Excluded `commons-logging` from dependencies to ensure `jcl-over-slf4j` 
works as expected
2. Changed deprecated 'name' tag to 'id' in `appassembler-maven-plugin`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-366

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/101.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #101


commit 8cd464be02e701b8e5d05d6f12bc3e22a6f0b0b4
Author: Hans 
Date:   2018-07-12T19:38:09Z

ANY23-366 excluded commons-logging from dependencies

commit 8d7b4fd67e26bc9ab07af0e26bc002c35b0c6176
Author: Hans 
Date:   2018-07-12T19:41:46Z

ANY23-366 changed 'name' to 'id' in appassembler-maven-plugin




---


[GitHub] any23 pull request #100: ANY23-365 resolved additional warnings

2018-07-11 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/100

ANY23-365 resolved additional warnings

Resolved the following warnings:
- Annotation `Author.class` is not retained for reflective access (in 
`o.a.a.cli.PluginVerifier`)
- `o.a.a.cli.PluginVerifier` uses unchecked or unsafe operations
- `sun.security.validator.ValidatorException` is internal proprietary API 
and may be removed in a future release (in `o.a.a.servlet.WebResponder`)
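
For context on the first warning: javac reports it when code reflectively queries an annotation type whose retention is not `RUNTIME`. The sketch below shows the general mechanism with a hypothetical `PluginAuthor` annotation; it is not necessarily how this pull request resolved the warning.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation; RUNTIME retention keeps it visible to reflection.
// Without @Retention(RUNTIME), getAnnotation would always return null here.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PluginAuthor {
    String name();
}

@PluginAuthor(name = "example")
class AnnotatedPlugin {
}

class RetentionDemo {
    // Reads the annotation reflectively; returns null if it is absent.
    static String authorOf(Class<?> type) {
        PluginAuthor author = type.getAnnotation(PluginAuthor.class);
        return author == null ? null : author.name();
    }
}
```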

mvn clean test -> all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-365

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/100.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #100


commit 3f87cf3a8ca51650376d7f111613fe0c1eda74d5
Author: Hans 
Date:   2018-07-11T20:53:05Z

ANY23-365 resolved additional warnings




---


[GitHub] any23 pull request #99: ANY23-364 resolved POI deprecation warnings

2018-07-11 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/99

ANY23-364 resolved POI deprecation warnings

mvn clean test -> all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-364

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/99.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #99


commit 5a2613b848b317c54381bcc8d7b23ca1e27e3725
Author: Hans 
Date:   2018-07-11T20:10:46Z

ANY23-364 resolved POI deprecation warnings




---


[GitHub] any23 pull request #98: ANY23-363 updated httpclient/httpcore to 4.5.6/4.4.1...

2018-07-11 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/98

ANY23-363 updated httpclient/httpcore to 4.5.6/4.4.10

mvn clean test -> all tests passed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-363

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/98.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #98


commit 40619343dd8876dc447ea49ae952af33899b008f
Author: Hans 
Date:   2018-07-11T18:28:11Z

ANY23-363 updated httpclient/httpcore to 4.5.6/4.4.10




---


[GitHub] any23 pull request #97: ANY23-362 resolved rdf4j deprecation warnings

2018-07-11 Thread HansBrende
GitHub user HansBrende opened a pull request:

https://github.com/apache/any23/pull/97

ANY23-362 resolved rdf4j deprecation warnings

1. resolved rdf4j deprecation warnings
2. refactored for code style and improved singleton pattern in 
`RDFParserFactory`
3. updated no-arg constructors in `RDFXMLExtractor` and `TriXExtractor` to 
match their javadoc specs, along with the behavior of all the other RDF extractor 
classes' no-arg constructors
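
As an illustration of item 2, one common way to tighten a singleton in Java is the 
initialization-on-demand holder idiom, sketched below. This is an assumption about 
the general technique only; `ParserRegistrySketch` is a hypothetical class, and the 
actual change made to `RDFParserFactory` may look different.

```java
public final class ParserRegistrySketch {

    // Private constructor prevents instantiation from outside the class.
    private ParserRegistrySketch() { }

    // The JVM initializes Holder lazily and thread-safely on first access,
    // so no explicit synchronization or null check is needed.
    private static final class Holder {
        static final ParserRegistrySketch INSTANCE = new ParserRegistrySketch();
    }

    public static ParserRegistrySketch getInstance() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        // Both calls return the same object.
        System.out.println(getInstance() == getInstance()); // true
    }
}
```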

mvn clean test -> all tests passed

@lewismc any comments before I merge this in?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HansBrende/any23 ANY23-362

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/97.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #97


commit 59091ef08aadc30e64abdff1a4b17cf81c2b6bbd
Author: Hans 
Date:   2018-07-11T16:13:30Z

ANY23-362 resolved rdf4j deprecation warnings




---

