[jira] [Commented] (ANY23-554) Avoid using carriage return to detect windows-1252 charset if content type has been identified from metadata

2022-01-05 Thread Hans Brende (Jira)


[ 
https://issues.apache.org/jira/browse/ANY23-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469136#comment-17469136
 ] 

Hans Brende commented on ANY23-554:
---

All that being said, sounds like there is a problem with those tests... I will 
investigate further once I get a chance.

> Avoid using carriage return to detect windows-1252 charset if content type 
> has been identified from metadata
> 
>
> Key: ANY23-554
> URL: https://issues.apache.org/jira/browse/ANY23-554
> Project: Apache Any23
>  Issue Type: Task
>Reporter: Peter Ansell
>Priority: Major
>
> Two encoding detection tests are failing on Windows and Windows Subsystem for 
> Linux because of a condition that overrides a meta tag with a heuristic. The 
> heuristic is unlikely to be correct in its current form: carriage returns are 
> present in many documents produced on Windows, and those documents may 
> legitimately follow ISO-8859-1.
> If someone has put in a meta tag declaring ISO-8859-1, we shouldn't use the 
> presence of carriage return characters to override it with an incompatible 
> Windows-specific code page, windows-1252.
> The relevant code is:
> https://github.com/apache/any23/blob/any23-2.6/encoding/src/main/java/org/apache/any23/encoding/EncodingUtils.java#L62-L69
> The tests that are failing on Windows and WSL2 are:
> [INFO] Results:
> [INFO]
> [ERROR] Failures:
> [ERROR]   TikaEncodingDetectorTest.testISO8859HTML:58->assertEncoding:128
> Unexpected encoding expected:<[ISO-8859-1]> but was:<[windows-1252]>
> [ERROR]   TikaEncodingDetectorTest.testISO8859XHTML:63->assertEncoding:128
> Unexpected encoding expected:<[ISO-8859-1]> but was:<[windows-1252]>
> [INFO]
> [ERROR] Tests run: 12, Failures: 2, Errors: 0, Skipped: 0
> [INFO]
> [INFO] 
> 
> [INFO] Reactor Summary for Apache Any23 2.6:
> [INFO]
> [INFO] Apache Any23 ... SUCCESS [01:57 min]
> [INFO] Apache Any23 :: Base API ... SUCCESS [ 56.016 s]
> [INFO] Apache Any23 :: Test Resources . SUCCESS [  1.068 s]
> [INFO] Apache Any23 :: CSV Utilities .. SUCCESS [  2.759 s]
> [INFO] Apache Any23 :: Mime Type Detection  SUCCESS [01:10 min]
> [INFO] Apache Any23 :: Encoding Detection . FAILURE [  4.160 s]
> [INFO] Apache Any23 :: Core ... SKIPPED
> [INFO] Apache Any23 :: CLI  SKIPPED
> [INFO] 
> 
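
The behaviour requested in the quoted description can be sketched roughly as 
follows. This is a minimal illustration, not the actual EncodingUtils code: 
the class name, the method name `resolve`, and its parameters are all 
hypothetical, and the thread has not settled on this policy.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the code in EncodingUtils.java.
final class CharsetPrioritySketch {
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    /**
     * @param declared          charset taken from a meta tag or other metadata, or null if none
     * @param detected          charset guessed from the bytes themselves
     * @param sawCarriageReturn whether the content contains a '\r' byte
     */
    static Charset resolve(Charset declared, Charset detected, boolean sawCarriageReturn) {
        if (declared != null) {
            // A declared label is not overridden on the strength of '\r' alone.
            return declared;
        }
        if (StandardCharsets.ISO_8859_1.equals(detected) && sawCarriageReturn) {
            // No label available: fall back to the carriage-return heuristic.
            return WINDOWS_1252;
        }
        return detected;
    }

    public static void main(String[] args) {
        // Label present: the label is kept.
        System.out.println(resolve(StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1, true));
        // No label: the heuristic applies.
        System.out.println(resolve(null, StandardCharsets.ISO_8859_1, true));
    }
}
```

Whether a label should always win is exactly what is being debated in this 
issue, so the first branch is the part most likely to change.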



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ANY23-554) Avoid using carriage return to detect windows-1252 charset if content type has been identified from metadata

2022-01-05 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende reassigned ANY23-554:
-

Assignee: Hans Brende

> Avoid using carriage return to detect windows-1252 charset if content type 
> has been identified from metadata
> 
>
> Key: ANY23-554
> URL: https://issues.apache.org/jira/browse/ANY23-554
> Project: Apache Any23
>  Issue Type: Task
>Reporter: Peter Ansell
>Assignee: Hans Brende
>Priority: Major
>





[jira] [Commented] (ANY23-554) Avoid using carriage return to detect windows-1252 charset if content type has been identified from metadata

2022-01-05 Thread Hans Brende (Jira)


[ 
https://issues.apache.org/jira/browse/ANY23-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469135#comment-17469135
 ] 

Hans Brende commented on ANY23-554:
---

A couple of thoughts here:

1. ISO-8859-1 and windows-1252 are not actually incompatible; they are 
effectively synonyms, since the WHATWG Encoding Standard treats "iso-8859-1" 
as a label for windows-1252: 
https://encoding.spec.whatwg.org/#ref-for-windows-1252%E2%91%A0

2. It is very common for Windows-1252 text to be mislabeled as ISO-8859-1 (see 
https://en.wikipedia.org/wiki/Windows-1252 )

3. As mentioned in the comment in the linked code, the \r heuristic was 
copied from Tika's implementation, so it has solid precedent.

4. Labels are also heuristics, so the question is which heuristic should rank 
higher. The charset label should win sometimes, but not always, because 
mislabeled content is so prevalent on the web. For example, we'd definitely 
want to give a byte-order mark higher priority than a label, *especially* in 
HTML markup, since it is actually illegal to declare any meta encoding 
*except* UTF-8 in an HTML document! So one could say that a document whose 
meta tag declares anything other than UTF-8 is *already* malformed. (See 
WHATWG: https://html.spec.whatwg.org/#charset).
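
Points 1 and 2 above can be checked with a few lines of Java. The sketch below 
is only an illustration (the class and method names are made up for this 
example). It relies on the fact that Java's built-in ISO-8859-1 decoder, 
unlike the WHATWG label mapping, maps bytes 0x80-0x9F to C1 control 
characters, while windows-1252 maps most of them to printable punctuation.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: shows where the two charsets agree and where they differ.
final class Latin1VersusCp1252 {
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] curlyQuote = {(byte) 0x93}; // left double quotation mark in windows-1252
        byte[] crlf = {0x0D, 0x0A};        // carriage return + line feed

        System.out.printf("0x93 as windows-1252: U+%04X%n",
                (int) decode(curlyQuote, cp1252).charAt(0));
        System.out.printf("0x93 as ISO-8859-1:   U+%04X%n",
                (int) decode(curlyQuote, StandardCharsets.ISO_8859_1).charAt(0));
        System.out.println("CR LF decodes identically: "
                + decode(crlf, cp1252).equals(decode(crlf, StandardCharsets.ISO_8859_1)));
    }
}
```

A carriage return decodes the same way under both charsets, so the choice 
only changes the result for bytes in the 0x80-0x9F range.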

> Avoid using carriage return to detect windows-1252 charset if content type 
> has been identified from metadata
> 
>
> Key: ANY23-554
> URL: https://issues.apache.org/jira/browse/ANY23-554
> Project: Apache Any23
>  Issue Type: Task
>Reporter: Peter Ansell
>Priority: Major
>





[jira] [Updated] (ANY23-441) TikaEncodingDetector: guessEncoding may throws an ArrayIndexOutOfBoundsException

2020-03-29 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-441:
--
Fix Version/s: (was: 2.5)
   2.4

> TikaEncodingDetector: guessEncoding may throws an 
> ArrayIndexOutOfBoundsException
> 
>
> Key: ANY23-441
> URL: https://issues.apache.org/jira/browse/ANY23-441
> Project: Apache Any23
>  Issue Type: Bug
>  Components: encoding
>Affects Versions: 2.3
>Reporter: Anthony Pessy
>Priority: Major
> Fix For: 2.4
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Using `TikaEncodingDetector.guessEncoding` may result in an 
> `ArrayIndexOutOfBoundsException`.
>  
> The following snippet:
> {noformat}
> String encoding = new TikaEncodingDetector().guessEncoding(
>     new URL("https://www.streetpadel.com/overgrip-head-pro-grip-dz-negro-p-17233.html").openStream());
> System.out.println(encoding);
> {noformat}
> Will result in the following exception:
> {noformat}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 32768
>     at org.jsoup.parser.CharacterReader.consume(CharacterReader.java:100)
>     at org.jsoup.parser.TokeniserState$34.read(TokeniserState.java:556)
>     at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
>     at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:64)
>     at org.jsoup.parser.HtmlTreeBuilder.parseFragment(HtmlTreeBuilder.java:126)
>     at org.jsoup.parser.Parser.parseFragment(Parser.java:140)
>     at org.apache.any23.encoding.TikaEncodingDetector.parseFragment(TikaEncodingDetector.java:184)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:95)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:159)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:58)
> {noformat}
> Whereas the expected result is `ISO-8859-15`
> Note the bunch of HTML at the bottom of the page after the `` tag.
>  
> Replacing:
> {code:java}
> ParseErrorList htmlErrors = ParseErrorList.tracking(Integer.MAX_VALUE);
> {code}
> By:
> {code:java}
> ParseErrorList htmlErrors = ParseErrorList.tracking(100);
> {code}
>  
> Will fix the issue. Not quite sure why, maybe at one point the errors are too 
> far and the reader cannot reset far enough...
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ANY23-441) TikaEncodingDetector: guessEncoding may throws an ArrayIndexOutOfBoundsException

2020-03-29 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-441.
---
  Assignee: Hans Brende
Resolution: Fixed

> TikaEncodingDetector: guessEncoding may throws an 
> ArrayIndexOutOfBoundsException
> 
>
> Key: ANY23-441
> URL: https://issues.apache.org/jira/browse/ANY23-441
> Project: Apache Any23
>  Issue Type: Bug
>  Components: encoding
>Affects Versions: 2.3
>Reporter: Anthony Pessy
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>





[jira] [Resolved] (ANY23-428) RDFa parse issue if vocab not defined with trailing slash

2020-03-29 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-428.
---
  Assignee: Hans Brende
Resolution: Fixed

> RDFa parse issue if vocab not defined with trailing slash
> -
>
> Key: ANY23-428
> URL: https://issues.apache.org/jira/browse/ANY23-428
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 2.3
>Reporter: David Cockbill
>Assignee: Hans Brende
>Priority: Minor
> Fix For: 2.4
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If a RDFa vocab URL is missing a trailing forward slash, then the properties 
> are not expanded correctly.
> For example:
>  
> {code:java}
> https://schema.org" typeof="BreadcrumbList">
> {code}
> rather than:
>  
> {code:java}
> https://schema.org/" typeof="BreadcrumbList">
> {code}
> produces properties that look (in nTriples) as follows:
>  
>  
> {code:java}
>   
>  .
> _:n0  
>  .
> _:n1  
>  .
> {code}
>  
>  
> I'm sure the intention should be to join the properties and vocab with a 
> forward slash.
>  
>  
>  
>  
>  
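
The effect described in the quoted report follows from how RDFa resolves a 
term against a default vocabulary: the vocabulary IRI and the term are simply 
concatenated. The sketch below illustrates that, together with the lenient 
joining the reporter suggests. It is an assumption-laden example, not the fix 
that was applied in Any23: the class and method names are hypothetical, and 
`itemListElement` is used only as a sample schema.org property.

```java
// Hypothetical sketch of vocab + term expansion; not Any23's RDFa extractor.
final class VocabExpansionSketch {
    /** Plain concatenation of the vocabulary IRI and the term. */
    static String expand(String vocab, String term) {
        return vocab + term;
    }

    /** Lenient variant: insert '/' when the vocabulary ends in neither '/' nor '#'. */
    static String expandLenient(String vocab, String term) {
        boolean hasSeparator = vocab.endsWith("/") || vocab.endsWith("#");
        return (hasSeparator ? vocab : vocab + "/") + term;
    }

    public static void main(String[] args) {
        // Without the trailing slash the host name and the term run together.
        System.out.println(expand("https://schema.org", "itemListElement"));
        // The lenient variant yields the same IRI for both spellings of the vocab.
        System.out.println(expandLenient("https://schema.org", "itemListElement"));
        System.out.println(expandLenient("https://schema.org/", "itemListElement"));
    }
}
```

The lenient variant treats a trailing `#` as a valid separator too, since 
many vocabularies use hash namespaces.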





[jira] [Resolved] (ANY23-449) Fix the online microdata test failure

2020-03-29 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-449.
---
Resolution: Fixed

> Fix the online microdata test failure
> -
>
> Key: ANY23-449
> URL: https://issues.apache.org/jira/browse/ANY23-449
> Project: Apache Any23
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Test failure: https://builds.apache.org/job/Any23-trunk/1678/
> The test output reads:
> [ERROR] Failures:
> [ERROR]   MicrodataExtractorTest.runOnlineTests:273 1 failures out of 82 
> total tests
> Test 0026: Web Schemas TF: Schema.org tests: test 11 (format md)
> Test 0026: Web Schemas TF: Schema.org tests: test 11 (format md)
> https://w3c.github.io/microdata-rdf/tests/sdo_eg_md_11.html ==> 
> https://w3c.github.io/microdata-rdf/tests/sdo_eg_md_11.ttl
> EXPECT: _:0 http://schema.org/author 
> http://w3c.github.io/author/jd_salinger.html
> ...34 statements in common...
> ACTUAL: _:1 http://schema.org/author 
> https://w3c.github.io/author/jd_salinger.html
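
The EXPECT and ACTUAL statements quoted above differ only in the URI scheme 
of the object (http versus https). A small check along these lines makes that 
explicit. It is a sketch for illustration only: the class and method names 
are hypothetical, and it is not how the test was fixed.

```java
import java.net.URI;

// Illustrative helper: true when two absolute URIs differ in their scheme and nothing else.
final class SchemeDiffSketch {
    static boolean differsOnlyInScheme(String first, String second) {
        URI a = URI.create(first);
        URI b = URI.create(second);
        return !a.getScheme().equals(b.getScheme())
                && a.getSchemeSpecificPart().equals(b.getSchemeSpecificPart());
    }

    public static void main(String[] args) {
        System.out.println(differsOnlyInScheme(
                "http://w3c.github.io/author/jd_salinger.html",
                "https://w3c.github.io/author/jd_salinger.html"));
    }
}
```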





[jira] [Created] (ANY23-449) Fix the online microdata test failure

2020-03-29 Thread Hans Brende (Jira)
Hans Brende created ANY23-449:
-

 Summary: Fix the online microdata test failure
 Key: ANY23-449
 URL: https://issues.apache.org/jira/browse/ANY23-449
 Project: Apache Any23
  Issue Type: Bug
  Components: core
Affects Versions: 2.3
Reporter: Hans Brende
Assignee: Hans Brende
 Fix For: 2.4


Test failure: https://builds.apache.org/job/Any23-trunk/1678/

The test output reads:

[ERROR] Failures:
[ERROR]   MicrodataExtractorTest.runOnlineTests:273 1 failures out of 82 total 
tests
Test 0026: Web Schemas TF: Schema.org tests: test 11 (format md)


Test 0026: Web Schemas TF: Schema.org tests: test 11 (format md)
https://w3c.github.io/microdata-rdf/tests/sdo_eg_md_11.html ==> 
https://w3c.github.io/microdata-rdf/tests/sdo_eg_md_11.ttl
EXPECT: _:0 http://schema.org/author 
http://w3c.github.io/author/jd_salinger.html
...34 statements in common...
ACTUAL: _:1 http://schema.org/author 
https://w3c.github.io/author/jd_salinger.html







[jira] [Resolved] (ANY23-446) Fix bugs in Jsoup

2020-03-29 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-446.
---
Resolution: Fixed

> Fix bugs in Jsoup
> -
>
> Key: ANY23-446
> URL: https://issues.apache.org/jira/browse/ANY23-446
> Project: Apache Any23
>  Issue Type: Bug
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jsoup is giving us some issues in our encoding detection module, namely:
> https://github.com/jhy/jsoup/issues/1251  (which caused ANY23-441)
> and 
> https://github.com/jhy/jsoup/issues/1250  (which is going to make our 
> encoding detector blow up anytime we're detecting, e.g., UTF-16.)
> The latter issue is more serious than the former due to the potential 
> frequency of the errors.
> There is one pull request open in jsoup for the first issue which fixes it, 
> but unfortunately Jonathan Hedley (creator of jsoup) has not been active over 
> the past few months and I doubt it'll get merged anytime soon.
> I propose that we temporarily repackage a couple jsoup classes in our 
> encoding detection module and add some quick fixes. When the jsoup library 
> gets updated, we can potentially remove the repackaged classes again.
> One bonus advantage: this will allow us to implement a streaming approach to 
> encoding detection instead of our current strategy of building the entire DOM 
> to extract the plaintext (which is really overkill on memory usage).





[jira] [Commented] (ANY23-446) Fix bugs in Jsoup

2020-03-29 Thread Hans Brende (Jira)


[ 
https://issues.apache.org/jira/browse/ANY23-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070428#comment-17070428
 ] 

Hans Brende commented on ANY23-446:
---

Update: both these bugs are fixed in the newly released jsoup v1.13.1.

> Fix bugs in Jsoup
> -
>
> Key: ANY23-446
> URL: https://issues.apache.org/jira/browse/ANY23-446
> Project: Apache Any23
>  Issue Type: Bug
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>





[jira] [Created] (ANY23-446) Fix bugs in Jsoup

2019-09-29 Thread Hans Brende (Jira)
Hans Brende created ANY23-446:
-

 Summary: Fix bugs in Jsoup
 Key: ANY23-446
 URL: https://issues.apache.org/jira/browse/ANY23-446
 Project: Apache Any23
  Issue Type: Bug
Affects Versions: 2.3
Reporter: Hans Brende
Assignee: Hans Brende
 Fix For: 2.4


Jsoup is giving us some issues in our encoding detection module, namely:

https://github.com/jhy/jsoup/issues/1251  (which caused ANY23-441)

and 

https://github.com/jhy/jsoup/issues/1250  (which is going to make our encoding 
detector blow up anytime we're detecting, e.g., UTF-16.)

The latter issue is more serious than the former due to the potential frequency 
of the errors.

There is one pull request open in jsoup for the first issue which fixes it, but 
unfortunately Jonathan Hedley (creator of jsoup) has not been active over the 
past few months and I doubt it'll get merged anytime soon.

I propose that we temporarily repackage a couple jsoup classes in our encoding 
detection module and add some quick fixes. When the jsoup library gets updated, 
we can potentially remove the repackaged classes again.

One bonus advantage: this will allow us to implement a streaming approach to 
encoding detection instead of our current strategy of building the entire DOM 
to extract the plaintext (which is really overkill on memory usage).







[jira] [Commented] (ANY23-430) Microdata and HTML's attribute case

2019-09-28 Thread Hans Brende (Jira)


[ 
https://issues.apache.org/jira/browse/ANY23-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16940120#comment-16940120
 ] 

Hans Brende commented on ANY23-430:
---

[~panthony] I've reviewed that site and tested it with Any23, and I can't seem 
to reproduce your issue. It appears to me that Any23's microdata extractor is 
picking up all the relevant attributes that Google's structured data tool does, 
lowercased or not (and some of those attributes are indeed camelcase, but Any23 
is able to pick those up just fine).

Would you mind double-checking what the issue is here? 

> Microdata and HTML's attribute case
> ---
>
> Key: ANY23-430
> URL: https://issues.apache.org/jira/browse/ANY23-430
> Project: Apache Any23
>  Issue Type: Bug
>  Components: microdata
>Affects Versions: 2.3
>Reporter: Anthony Pessy
>Priority: Major
> Fix For: 2.4
>
>
> Using the Microdata parser, I noticed that it found fewer attributes than 
> Google's testing tool.
> For example, with the following page:
> [https://search.google.com/structured-data/testing-tool/u/0/#url=https%3A%2F%2Fwww.home24.de%2Fprodukt%2Fpendelleuchte-newtown-i-stahl-1-1077]
>  
> While investigating, I noticed that the markup used `itemProp` whereas the 
> parser expects `itemprop`.
>  
> Because HTML attribute names are expected to be case-insensitive, I believe 
> the case should not prevent the parser from working.
>  
>  
>  
>  
>  
>  
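
For reference, the case-insensitive matching the reporter asks for can be 
sketched as below. This is only an illustration of the idea, not Any23's 
microdata extractor: the class and method names are hypothetical, and the 
comment above indicates that Any23 already picks up camel-case attributes.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: attribute lookup that ignores the case of attribute names.
final class AttributeLookupSketch {
    /** Copies the attributes into a map whose keys compare case-insensitively. */
    static Map<String, String> caseInsensitive(Map<String, String> attributes) {
        Map<String, String> result = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        result.putAll(attributes);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new TreeMap<>();
        raw.put("itemProp", "name"); // camel-case spelling, as in the reported markup
        Map<String, String> attrs = caseInsensitive(raw);
        // Both spellings resolve to the same attribute value.
        System.out.println(attrs.get("itemprop"));
        System.out.println(attrs.get("itemProp"));
    }
}
```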





[jira] [Updated] (ANY23-445) Review spotbugs issues

2019-09-25 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-445:
--
Fix Version/s: (was: 2.5)
   2.4

> Review spotbugs issues
> --
>
> Key: ANY23-445
> URL: https://issues.apache.org/jira/browse/ANY23-445
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 2.4
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.4
>
>
> Post ANY23-444 we can now run SpotBugs. It currently flags the following 
> issues:
> {code}
> [INFO] --- spotbugs-maven-plugin:3.1.12.2:check (default-cli) @ 
> apache-any23-api ---
> [INFO] BugInstance size is 131
> [INFO] Error size is 0
> [INFO] Total bugs: 131
> [ERROR] 
> org.apache.any23.configuration.DefaultConfiguration.loadDefaultProperties() 
> may fail to clean up java.io.InputStream 
> [org.apache.any23.configuration.DefaultConfiguration, 
> org.apache.any23.configuration.DefaultConfiguration, 
> org.apache.any23.configuration.DefaultConfiguration] Obligation to clean up 
> resource created at DefaultConfiguration.java:[line 78] is not dischargedPath 
> continues at DefaultConfiguration.java:[line 81]Path continues at 
> DefaultConfiguration.java:[line 82] OBL_UNSATISFIED_OBLIGATION
> [ERROR] Redundant nullcheck of value, which is known to be non-null in 
> org.apache.any23.configuration.DefaultConfiguration.getFlagProperty(String) 
> [org.apache.any23.configuration.DefaultConfiguration] Redundant null check at 
> DefaultConfiguration.java:[line 132] RCN_REDUNDANT_NULLCHECK_OF_NONNULL_VALUE
> [ERROR] Found reliance on default encoding in 
> org.apache.any23.extractor.ExtractionException.printStackTrace(PrintStream): 
> new java.io.PrintWriter(OutputStream) 
> [org.apache.any23.extractor.ExtractionException] At 
> ExtractionException.java:[line 47] DM_DEFAULT_ENCODING
> [ERROR] String is incompatible with expected argument type 
> org.apache.any23.mime.MIMEType in 
> org.apache.any23.extractor.ExtractorGroup.supportsAllContentTypes(ExtractorFactory)
>  [org.apache.any23.extractor.ExtractorGroup] At ExtractorGroup.java:[line 82] 
> GC_UNRELATED_TYPES
> [ERROR] org.apache.any23.mime.MIMEType defines compareTo(MIMEType) and uses 
> Object.equals() [org.apache.any23.mime.MIMEType] At MIMEType.java:[line 134] 
> EQ_COMPARETO_USE_OBJECT_EQUALS
> [ERROR] Possible null pointer dereference in 
> org.apache.any23.plugin.Any23PluginManager.loadJARDir(File) due to return 
> value of called method [org.apache.any23.plugin.Any23PluginManager, 
> org.apache.any23.plugin.Any23PluginManager] Dereferenced at 
> Any23PluginManager.java:[line 194]Known null at Any23PluginManager.java:[line 
> 194] NP_NULL_ON_SOME_PATH_FROM_RETURN_VALUE
> [ERROR] The field name org.apache.any23.vocab.HCard.Address doesn't start 
> with a lower case letter [org.apache.any23.vocab.HCard] In HCard.java 
> NM_FIELD_NAMING_CONVENTION
> [ERROR] The field name org.apache.any23.vocab.HCard.Card doesn't start with a 
> lower case letter [org.apache.any23.vocab.HCard] In HCard.java 
> NM_FIELD_NAMING_CONVENTION
> [ERROR] The field name org.apache.any23.vocab.HCard.Geo doesn't start with a 
> lower case letter [org.apache.any23.vocab.HCard] In HCard.java 
> NM_FIELD_NAMING_CONVENTION
> [ERROR] Unread public/protected field: org.apache.any23.vocab.HCard.Address 
> [org.apache.any23.vocab.HCard] At HCard.java:[line 40] 
> URF_UNREAD_PUBLIC_OR_PROTECTED_FIELD
> [ERROR] Unread public/protected field: org.apache.any23.vocab.HCard.Card 
> [org.apache.any23.vocab.HCard] At HCard.java:[line 39] 
> URF_UNREAD_PUBLIC_OR_PROTECTED_FIELD
> [ERROR] Unread public/protected field: org.apache.any23.vocab.HCard.Geo 
> [org.apache.any23.vocab.HCard] At HCard.java:[line 41] 
> URF_UNREAD_PUBLIC_OR_PROTECTED_FIELD
> [ERROR] Unread public/protected field: 
> org.apache.any23.vocab.HCard.additional_name [org.apache.any23.vocab.HCard] 
> At HCard.java:[line 47] URF_UNREAD_PUBLIC_OR_PROTECTED_FIELD
> [ERROR] Unread public/protected field: org.apache.any23.vocab.HCard.adr 
> [org.apache.any23.vocab.HCard] At HCard.java:[line 70] 
> URF_UNREAD_PUBLIC_OR_PROTECTED_FIELD
> [ERROR] Unread public/protected field: org.apache.any23.vocab.HCard.altitude 
> [org.apache.any23.vocab.HCard] At HCard.java:[line 81] 
> URF_UNREAD_PUBLIC_OR_PROTECTED_FIELD
> [ERROR] Unread public/protected field: 
> org.apache.any23.vocab.HCard.anniversary [org.apache.any23.vocab.HCard] At 
> HCard.java:[line 68] URF_UNREAD_PUBLIC_OR_PROTECTED_FIELD
> [ERROR] Unread public/protected field: org.apache.any23.vocab.HCard.bday 
> [org.apache.any23.vocab.HCard] At HCard.java:[line 60] 
> URF_UNREAD_PUBLIC_OR_PROTECTED_FIELD
> [ERROR] Unread public/protected field: org.apache.any23.vocab.HCard.category 
> [org.apache.any23.vocab.HCard] At HCard.java:[line 57] 
> URF_UNREAD_PUBL
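
Two of the patterns flagged above, OBL_UNSATISFIED_OBLIGATION and 
DM_DEFAULT_ENCODING, have well-known general remedies, sketched below. This is 
a generic illustration under stated assumptions, not the change made to 
DefaultConfiguration or ExtractionException: the class and method names are 
hypothetical, and UTF-8 is an assumed choice of charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical sketch of two common SpotBugs remedies.
final class SpotBugsFixSketch {
    /** OBL_UNSATISFIED_OBLIGATION: try-with-resources closes the stream on every path. */
    static Properties load(InputStream source) {
        Properties properties = new Properties();
        try (InputStream in = source) {
            properties.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    /** DM_DEFAULT_ENCODING: name the charset instead of relying on the platform default. */
    static PrintWriter utf8Writer(OutputStream out) {
        return new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8), true);
    }

    public static void main(String[] args) {
        byte[] config = "any23.example.flag=on".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(load(new ByteArrayInputStream(config)).getProperty("any23.example.flag"));
        utf8Writer(System.out).println("written as UTF-8");
    }
}
```

The property key in `main` is made up for the example and is not a real Any23 
configuration key.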

[jira] [Resolved] (ANY23-439) Replace commons-lang with commons-lang3

2019-09-25 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-439.
---
  Assignee: Hans Brende  (was: Lewis John McGibbney)
Resolution: Fixed

> Replace commons-lang with commons-lang3
> ---
>
> Key: ANY23-439
> URL: https://issues.apache.org/jira/browse/ANY23-439
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: core
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ANY23-437) Upgrade snakeyaml to v1.24

2019-09-25 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-437:
--
Fix Version/s: (was: 2.5)
   2.4

> Upgrade snakeyaml to v1.24
> --
>
> Key: ANY23-437
> URL: https://issues.apache.org/jira/browse/ANY23-437
> Project: Apache Any23
>  Issue Type: Task
>Reporter: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>






[jira] [Updated] (ANY23-439) Replace commons-lang with commons-lang3

2019-09-25 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-439:
--
Fix Version/s: (was: 2.5)
   2.4

> Replace commons-lang with commons-lang3
> ---
>
> Key: ANY23-439
> URL: https://issues.apache.org/jira/browse/ANY23-439
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: core
>Reporter: Hans Brende
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.4
>
>






[jira] [Updated] (ANY23-436) Upgrade commons-csv to v1.7

2019-09-25 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-436:
--
Fix Version/s: (was: 2.5)
   2.4

> Upgrade commons-csv to v1.7
> ---
>
> Key: ANY23-436
> URL: https://issues.apache.org/jira/browse/ANY23-436
> Project: Apache Any23
>  Issue Type: Task
>Reporter: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>






[jira] [Updated] (ANY23-430) Microdata and HTML's attribute case

2019-09-25 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-430:
--
Fix Version/s: (was: 2.5)
   2.4

> Microdata and HTML's attribute case
> ---
>
> Key: ANY23-430
> URL: https://issues.apache.org/jira/browse/ANY23-430
> Project: Apache Any23
>  Issue Type: Bug
>  Components: microdata
>Affects Versions: 2.3
>Reporter: Anthony Pessy
>Priority: Major
> Fix For: 2.4
>
>
> Using the Microdata parser, I noticed that it found fewer attributes than 
> Google's testing tool.
> For example, with the following page:
> [https://search.google.com/structured-data/testing-tool/u/0/#url=https%3A%2F%2Fwww.home24.de%2Fprodukt%2Fpendelleuchte-newtown-i-stahl-1-1077]
>
> While investigating, I noticed that the markup was `itemProp` whereas the 
> parser expects `itemprop`.
>
> Because HTML attributes are expected to be case-insensitive, I believe the 
> case should not prevent the parser from working.
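As a minimal sketch of the case-insensitive handling requested above (not Any23's actual fix), attribute names could be lower-cased before lookup. `AttributeCase` and `normalize` are hypothetical names used only for illustration, and a plain `Map` stands in for the parser's attribute model.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class AttributeCase {
    /** Returns a copy of the attributes with every name lower-cased. */
    static Map<String, String> normalize(Map<String, String> attributes) {
        Map<String, String> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            // Locale.ROOT avoids locale-specific surprises (e.g. the Turkish dotless i).
            // putIfAbsent keeps the first value when two names differ only by case.
            normalized.putIfAbsent(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return normalized;
    }
}
```

With this in place, markup written as `itemProp` and markup written as `itemprop` would both be found under the key `itemprop`.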





[jira] [Commented] (ANY23-441) TikaEncodingDetector: guessEncoding may throws an ArrayIndexOutOfBoundsException

2019-09-25 Thread Hans Brende (Jira)


[ 
https://issues.apache.org/jira/browse/ANY23-441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938195#comment-16938195
 ] 

Hans Brende commented on ANY23-441:
---

FYI to anyone who's paying attention, this is the result of a jsoup bug which 
can be viewed here: [https://github.com/jhy/jsoup/issues/1251]

> TikaEncodingDetector: guessEncoding may throws an 
> ArrayIndexOutOfBoundsException
> 
>
> Key: ANY23-441
> URL: https://issues.apache.org/jira/browse/ANY23-441
> Project: Apache Any23
>  Issue Type: Bug
>  Components: encoding
>Affects Versions: 2.3
>Reporter: Anthony Pessy
>Priority: Major
> Fix For: 2.5
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Using `TikaEncodingDetector.guessEncoding` may result in an 
> `ArrayIndexOutOfBoundsException`.
>  
> The following snippet:
> {noformat}
> String encoding = new TikaEncodingDetector().guessEncoding(
>     new URL("https://www.streetpadel.com/overgrip-head-pro-grip-dz-negro-p-17233.html").openStream());
> System.out.println(encoding);
> {noformat}
> Will result in the following exception:
> {noformat}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 32768
>     at org.jsoup.parser.CharacterReader.consume(CharacterReader.java:100)
>     at org.jsoup.parser.TokeniserState$34.read(TokeniserState.java:556)
>     at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
>     at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:64)
>     at org.jsoup.parser.HtmlTreeBuilder.parseFragment(HtmlTreeBuilder.java:126)
>     at org.jsoup.parser.Parser.parseFragment(Parser.java:140)
>     at org.apache.any23.encoding.TikaEncodingDetector.parseFragment(TikaEncodingDetector.java:184)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:95)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:159)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:58)
> {noformat}
> Whereas the expected result is `ISO-8859-15`
> Note the bunch of HTML at the bottom of the page after the `` tag.
>  
> Replacing:
> {code:java}
> ParseErrorList htmlErrors = ParseErrorList.tracking(Integer.MAX_VALUE);
> {code}
> By:
> {code:java}
> ParseErrorList htmlErrors = ParseErrorList.tracking(100);
> {code}
>  
> Will fix the issue. Not quite sure why, maybe at one point the errors are too 
> far and the reader cannot reset far enough...
>  
>  





[jira] [Updated] (ANY23-443) Improve efficiency of RDFa Extractor

2019-09-25 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-443:
--
Fix Version/s: (was: 2.5)
   2.4

> Improve efficiency of RDFa Extractor
> 
>
> Key: ANY23-443
> URL: https://issues.apache.org/jira/browse/ANY23-443
> Project: Apache Any23
>  Issue Type: Improvement
>Reporter: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Our RDFa Extractor is terribly inefficient. 
> 1st, we parse the html "tag soup" input stream into a DOM using Jsoup
> 2nd, we transform the DOM back into an input stream, containing strictly 
> valid XML to avoid errors in the underlying semargl parser
> 3rd, the underlying semargl parser resurrects this input stream as XML and 
> hands off XML streaming events to its underlying XmlSink. 
> 4th, semargl's XmlSink hands its own RDF events back to RDF4J, which in turn 
> hands them back to Any23. 
> I propose cutting out all these intermediate steps by simply walking the 
> original jsoup DOM and handing our own XML events directly to semargl's 
> XmlSink, which we will configure to give RDF events directly back to Any23. 
> This will also allow us to get rid of most (or possibly all) of the various 
> HTML-to-XML "fixups" we had to implement to prevent extraction failures.
> 
> *TL;DR:*
>  
> {{Jsoup → InputStream → RDF4J → XMLReader → RdfaParser → RDF4J → Any23}} 
> *becomes*
> {{Jsoup → RdfaParser → Any23}} 





[jira] [Resolved] (ANY23-443) Improve efficiency of RDFa Extractor

2019-09-25 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-443.
---
  Assignee: Hans Brende
Resolution: Fixed

> Improve efficiency of RDFa Extractor
> 
>
> Key: ANY23-443
> URL: https://issues.apache.org/jira/browse/ANY23-443
> Project: Apache Any23
>  Issue Type: Improvement
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Our RDFa Extractor is terribly inefficient. 
> 1st, we parse the html "tag soup" input stream into a DOM using Jsoup
> 2nd, we transform the DOM back into an input stream, containing strictly 
> valid XML to avoid errors in the underlying semargl parser
> 3rd, the underlying semargl parser resurrects this input stream as XML and 
> hands off XML streaming events to its underlying XmlSink. 
> 4th, semargl's XmlSink hands its own RDF events back to RDF4J, which in turn 
> hands them back to Any23. 
> I propose cutting out all these intermediate steps by simply walking the 
> original jsoup DOM and handing our own XML events directly to semargl's 
> XmlSink, which we will configure to give RDF events directly back to Any23. 
> This will also allow us to get rid of most (or possibly all) of the various 
> HTML-to-XML "fixups" we had to implement to prevent extraction failures.
> 
> *TL;DR:*
>  
> {{Jsoup → InputStream → RDF4J → XMLReader → RdfaParser → RDF4J → Any23}} 
> *becomes*
> {{Jsoup → RdfaParser → Any23}} 





[jira] [Resolved] (ANY23-433) Upgrade rdf4j to v3.0.0

2019-09-25 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-433.
---
Resolution: Fixed

> Upgrade rdf4j to v3.0.0
> ---
>
> Key: ANY23-433
> URL: https://issues.apache.org/jira/browse/ANY23-433
> Project: Apache Any23
>  Issue Type: Task
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> jsonld-java should be upgraded to v0.12.4 at the same time.
> jackson should be upgraded to v2.9.9 at the same time.
> This upgrade will allow the removal of a hacky workaround from the json-ld 
> html extractor.





[jira] [Assigned] (ANY23-433) Upgrade rdf4j to v3.0.0

2019-09-23 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende reassigned ANY23-433:
-

Assignee: Hans Brende

> Upgrade rdf4j to v3.0.0
> ---
>
> Key: ANY23-433
> URL: https://issues.apache.org/jira/browse/ANY23-433
> Project: Apache Any23
>  Issue Type: Task
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>
> jsonld-java should be upgraded to v0.12.4 at the same time.
> jackson should be upgraded to v2.9.9 at the same time.
> This upgrade will allow the removal of a hacky workaround from the json-ld 
> html extractor.





[jira] [Updated] (ANY23-443) Improve efficiency of RDFa Extractor

2019-09-14 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-443:
--
Description: 
Our RDFa Extractor is terribly inefficient. 

1st, we parse the html "tag soup" input stream into a DOM using Jsoup
2nd, we transform the DOM back into an input stream, containing strictly valid 
XML to avoid errors in the underlying semargl parser
3rd, the underlying semargl parser resurrects this input stream as XML and 
hands off XML streaming events to its underlying XmlSink. 
4th, semargl's XmlSink hands its own RDF events back to RDF4J, which in turn 
hands them back to Any23. 

I propose cutting out all these intermediate steps by simply walking the 
original jsoup DOM and handing our own XML events directly to semargl's 
XmlSink, which we will configure to give RDF events directly back to Any23. 

This will also allow us to get rid of most (or possibly all) of the various 
HTML-to-XML "fixups" we had to implement to prevent extraction failures.



*TL;DR:*
 
{{Jsoup → InputStream → RDF4J → XMLReader → RdfaParser → RDF4J → Any23}} 

*becomes*

{{Jsoup → RdfaParser → Any23}} 


  was:
Our RDFa Extractor is terribly inefficient. 

1st, we parse the html "tag soup" input stream into a DOM using Jsoup
2nd, we transform the DOM back into an input stream, containing strictly valid 
XML to avoid errors in the underlying semargl parser
3rd, the underlying semargl parser resurrects this input stream as XML and 
hands off XML streaming events to its underlying XmlSink. 
4th, semargl's XmlSink hands its own RDF events back to RDF4J, which in turn 
hands them back to Any23. 

I propose cutting out all these intermediate steps by simply walking the 
original jsoup DOM and handing our own XML events directly to semargl's 
XmlSink, which we will configure to give RDF events directly back to Any23. 

This will also allow us to get rid of most (or possibly all) of the various 
HTML-to-XML "fixups" we had to implement to prevent extraction failures.


> Improve efficiency of RDFa Extractor
> 
>
> Key: ANY23-443
> URL: https://issues.apache.org/jira/browse/ANY23-443
> Project: Apache Any23
>  Issue Type: Improvement
>Reporter: Hans Brende
>Priority: Major
>
> Our RDFa Extractor is terribly inefficient. 
> 1st, we parse the html "tag soup" input stream into a DOM using Jsoup
> 2nd, we transform the DOM back into an input stream, containing strictly 
> valid XML to avoid errors in the underlying semargl parser
> 3rd, the underlying semargl parser resurrects this input stream as XML and 
> hands off XML streaming events to its underlying XmlSink. 
> 4th, semargl's XmlSink hands its own RDF events back to RDF4J, which in turn 
> hands them back to Any23. 
> I propose cutting out all these intermediate steps by simply walking the 
> original jsoup DOM and handing our own XML events directly to semargl's 
> XmlSink, which we will configure to give RDF events directly back to Any23. 
> This will also allow us to get rid of most (or possibly all) of the various 
> HTML-to-XML "fixups" we had to implement to prevent extraction failures.
> 
> *TL;DR:*
>  
> {{Jsoup → InputStream → RDF4J → XMLReader → RdfaParser → RDF4J → Any23}} 
> *becomes*
> {{Jsoup → RdfaParser → Any23}} 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ANY23-443) Improve efficiency of RDFa Extractor

2019-09-14 Thread Hans Brende (Jira)
Hans Brende created ANY23-443:
-

 Summary: Improve efficiency of RDFa Extractor
 Key: ANY23-443
 URL: https://issues.apache.org/jira/browse/ANY23-443
 Project: Apache Any23
  Issue Type: Improvement
Reporter: Hans Brende


Our RDFa Extractor is terribly inefficient. 

1st, we parse the html "tag soup" input stream into a DOM using Jsoup
2nd, we transform the DOM back into an input stream, containing strictly valid 
XML to avoid errors in the underlying semargl parser
3rd, the underlying semargl parser resurrects this input stream as XML and 
hands off XML streaming events to its underlying XmlSink. 
4th, semargl's XmlSink hands its own RDF events back to RDF4J, which in turn 
hands them back to Any23. 

I propose cutting out all these intermediate steps by simply walking the 
original jsoup DOM and handing our own XML events directly to semargl's 
XmlSink, which we will configure to give RDF events directly back to Any23. 

This will also allow us to get rid of most (or possibly all) of the various 
HTML-to-XML "fixups" we had to implement to prevent extraction failures.
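
The proposed shortcut of walking a DOM and emitting XML events directly can be sketched with JDK classes only. This is an illustration under assumptions, not the Any23 implementation: `org.w3c.dom` stands in for the jsoup DOM, a SAX `ContentHandler` stands in for semargl's XmlSink, namespaces are ignored, and `DomToSaxWalker` is a hypothetical name.

```java
import org.w3c.dom.Attr;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

final class DomToSaxWalker {
    /** Recursively walks a DOM subtree and reports it to the sink as SAX events. */
    static void walk(Node node, ContentHandler sink) throws SAXException {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: {
                sink.startDocument();
                walkChildren(node, sink);
                sink.endDocument();
                break;
            }
            case Node.ELEMENT_NODE: {
                // Copy the DOM attributes into a SAX attribute list (no namespaces).
                AttributesImpl attrs = new AttributesImpl();
                NamedNodeMap map = node.getAttributes();
                for (int i = 0; i < map.getLength(); i++) {
                    Attr a = (Attr) map.item(i);
                    attrs.addAttribute("", a.getName(), a.getName(), "CDATA", a.getValue());
                }
                String name = node.getNodeName();
                sink.startElement("", name, name, attrs);
                walkChildren(node, sink);
                sink.endElement("", name, name);
                break;
            }
            case Node.TEXT_NODE: {
                char[] text = node.getNodeValue().toCharArray();
                sink.characters(text, 0, text.length);
                break;
            }
            default:
                // Comments, CDATA sections, etc. are skipped in this sketch.
                break;
        }
    }

    private static void walkChildren(Node parent, ContentHandler sink) throws SAXException {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, sink);
        }
    }
}
```

The point of the design is that no serialized stream sits between the parsed tree and the consumer of XML events, so nothing has to be re-parsed.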





[jira] [Resolved] (ANY23-432) Upgrade owlapi to v5.1.11

2019-09-14 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-432.
---
  Assignee: Hans Brende
Resolution: Fixed

> Upgrade owlapi to v5.1.11
> -
>
> Key: ANY23-432
> URL: https://issues.apache.org/jira/browse/ANY23-432
> Project: Apache Any23
>  Issue Type: Task
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>
> This release contains a variety of bugfixes, including, e.g., 
> https://github.com/owlcs/owlapi/issues/813





[jira] [Updated] (ANY23-330) Clean up Any23PluginManager

2019-09-14 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-330:
--
Attachment: (was: Assembly-UnityScript.dll)

> Clean up Any23PluginManager
> ---
>
> Key: ANY23-330
> URL: https://issues.apache.org/jira/browse/ANY23-330
> Project: Apache Any23
>  Issue Type: Bug
>  Components: Plugin Management
>Affects Versions: 2.1
>Reporter: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>
> I've been peeking at the Any23PluginManager class. There are a few issues:
> 1. {{getPlugins(Class)}}, {{getTools()}}, {{getExtractors()}}, and 
> {{getApplicableTools()}} never throw any exceptions, yet they all declare: 
> *{{throws IOException}}*.
> 2. {{configureExtractors(File...)}}, {{configureExtractors(ExtractorGroup)}}, 
> and {{getApplicableExtractors(ExtractorRegistry, File...)}} all throw 
> {{ServiceConfigurationError}}, but instead declare: *{{throws IOException, 
> IllegalAccessException, InstantiationException}}* (none of which are ever 
> thrown).
> 3. {{getApplicableExtractors(ExtractorRegistry, File...)}} never uses the 
> {{ExtractorRegistry}} argument. Behavior is identical to 
> {{configureExtractors(File...)}}. Behavior does not match javadoc.
> 4. {{configureExtractors(ExtractorGroup)}} never uses the {{ExtractorGroup}} 
> argument (but deleting this parameter would create confusion with the 
> variadic {{configureExtractors(File...)}} method). Behavior does not match 
> javadoc.
> I'd argue that some of these methods are completely useless and should be 
> removed. None of them are being used in the OpenIE dynamic jar loading 
> example in the web service except {{getExtractors()}}.
> Note: after these issues are resolved, we may have to revisit ANY23-333.
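
To illustrate points 1 and 2, here is a small sketch that assumes the plugin lookup is backed by `java.util.ServiceLoader` (the issue only mentions `ServiceConfigurationError`, so this is an assumption). `PluginLookup` is a hypothetical name and not the Any23 class. Because `ServiceConfigurationError` is an unchecked `java.lang.Error`, the method needs no `throws` clause at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

final class PluginLookup {
    /**
     * Loads all registered providers of the given service type.
     * No checked exception is declared: a broken provider entry surfaces as
     * ServiceConfigurationError, which is unchecked.
     */
    static <T> List<T> getPlugins(Class<T> type) {
        List<T> plugins = new ArrayList<>();
        for (T plugin : ServiceLoader.load(type)) {
            plugins.add(plugin);
        }
        return plugins;
    }
}
```

Callers of such a method would not be forced to catch `IOException`, `IllegalAccessException`, or `InstantiationException` that can never occur.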





[jira] [Resolved] (ANY23-442) Move HTML preprocessing logic from BaseRDFExtractor to semargl Extractors

2019-09-14 Thread Hans Brende (Jira)


 [ 
https://issues.apache.org/jira/browse/ANY23-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-442.
---
Fix Version/s: 2.4
 Assignee: Hans Brende
   Resolution: Fixed

> Move HTML preprocessing logic from BaseRDFExtractor to semargl Extractors
> -
>
> Key: ANY23-442
> URL: https://issues.apache.org/jira/browse/ANY23-442
> Project: Apache Any23
>  Issue Type: Improvement
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Cf. https://github.com/apache/any23/pull/104#issuecomment-531068423





[jira] [Created] (ANY23-442) Move HTML preprocessing logic from BaseRDFExtractor to semargl Extractors

2019-09-12 Thread Hans Brende (Jira)
Hans Brende created ANY23-442:
-

 Summary: Move HTML preprocessing logic from BaseRDFExtractor to 
semargl Extractors
 Key: ANY23-442
 URL: https://issues.apache.org/jira/browse/ANY23-442
 Project: Apache Any23
  Issue Type: Improvement
Reporter: Hans Brende


Cf. https://github.com/apache/any23/pull/104#issuecomment-531068423





[jira] [Commented] (ANY23-330) Clean up Any23PluginManager

2019-06-22 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870321#comment-16870321
 ] 

Hans Brende commented on ANY23-330:
---

[~lewismc] it appears someone attached an irrelevant DLL file to this issue, 
but when I try to remove it, it says "Only users with administrative privileges 
to remove an issue can remove attachments."

Can you help?

> Clean up Any23PluginManager
> ---
>
> Key: ANY23-330
> URL: https://issues.apache.org/jira/browse/ANY23-330
> Project: Apache Any23
>  Issue Type: Bug
>  Components: Plugin Management
>Affects Versions: 2.1
>Reporter: Hans Brende
>Priority: Major
> Fix For: 2.4
>
> Attachments: Assembly-UnityScript.dll
>
>
> I've been peeking at the Any23PluginManager class. There are a few issues:
> 1. {{getPlugins(Class)}}, {{getTools()}}, {{getExtractors()}}, and 
> {{getApplicableTools()}} never throw any exceptions, yet they all declare: 
> *{{throws IOException}}*.
> 2. {{configureExtractors(File...)}}, {{configureExtractors(ExtractorGroup)}}, 
> and {{getApplicableExtractors(ExtractorRegistry, File...)}} all throw 
> {{ServiceConfigurationError}}, but instead declare: *{{throws IOException, 
> IllegalAccessException, InstantiationException}}* (none of which are ever 
> thrown).
> 3. {{getApplicableExtractors(ExtractorRegistry, File...)}} never uses the 
> {{ExtractorRegistry}} argument. Behavior is identical to 
> {{configureExtractors(File...)}}. Behavior does not match javadoc.
> 4. {{configureExtractors(ExtractorGroup)}} never uses the {{ExtractorGroup}} 
> argument (but deleting this parameter would create confusion with the 
> variadic {{configureExtractors(File...)}} method). Behavior does not match 
> javadoc.
> I'd argue that some of these methods are completely useless and should be 
> removed. None of them are being used in the OpenIE dynamic jar loading 
> example in the web service except {{getExtractors()}}.
> Note: after these issues are resolved, we may have to revisit ANY23-333.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ANY23-439) Replace commons-lang with commons-lang3

2019-06-15 Thread Hans Brende (JIRA)
Hans Brende created ANY23-439:
-

 Summary: Replace commons-lang with commons-lang3
 Key: ANY23-439
 URL: https://issues.apache.org/jira/browse/ANY23-439
 Project: Apache Any23
  Issue Type: Improvement
  Components: core
Reporter: Hans Brende
 Fix For: 2.4








[jira] [Updated] (ANY23-435) Upgrade httpclient to v4.5.9

2019-06-15 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-435:
--
Description: Also, httpclient-cache and httpmime.  (was: Including 
httpclient, httpcore, httpmime, etc.)

> Upgrade httpclient to v4.5.9
> 
>
> Key: ANY23-435
> URL: https://issues.apache.org/jira/browse/ANY23-435
> Project: Apache Any23
>  Issue Type: Task
>Reporter: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>
> Also, httpclient-cache and httpmime.





[jira] [Updated] (ANY23-435) Upgrade httpclient to v4.5.9

2019-06-15 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-435:
--
Summary: Upgrade httpclient to v4.5.9  (was: Upgrade httpcomponents 
libraries to v4.5.9, v4.4.11)

> Upgrade httpclient to v4.5.9
> 
>
> Key: ANY23-435
> URL: https://issues.apache.org/jira/browse/ANY23-435
> Project: Apache Any23
>  Issue Type: Task
>Reporter: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>
> Including httpclient, httpcore, httpmime, etc.





[jira] [Created] (ANY23-438) Upgrade slf4j-api to v1.7.26

2019-06-15 Thread Hans Brende (JIRA)
Hans Brende created ANY23-438:
-

 Summary: Upgrade slf4j-api to v1.7.26
 Key: ANY23-438
 URL: https://issues.apache.org/jira/browse/ANY23-438
 Project: Apache Any23
  Issue Type: Task
Reporter: Hans Brende
 Fix For: 2.4


Also, jcl-over-slf4j and jul-to-slf4j. 





[jira] [Created] (ANY23-437) Upgrade snakeyaml to v1.24

2019-06-15 Thread Hans Brende (JIRA)
Hans Brende created ANY23-437:
-

 Summary: Upgrade snakeyaml to v1.24
 Key: ANY23-437
 URL: https://issues.apache.org/jira/browse/ANY23-437
 Project: Apache Any23
  Issue Type: Task
Reporter: Hans Brende
 Fix For: 2.4








[jira] [Created] (ANY23-436) Upgrade commons-csv to v1.7

2019-06-15 Thread Hans Brende (JIRA)
Hans Brende created ANY23-436:
-

 Summary: Upgrade commons-csv to v1.7
 Key: ANY23-436
 URL: https://issues.apache.org/jira/browse/ANY23-436
 Project: Apache Any23
  Issue Type: Task
Reporter: Hans Brende
 Fix For: 2.4








[jira] [Created] (ANY23-435) Upgrade httpcomponents libraries to v4.5.9, v4.4.11

2019-06-15 Thread Hans Brende (JIRA)
Hans Brende created ANY23-435:
-

 Summary: Upgrade httpcomponents libraries to v4.5.9, v4.4.11
 Key: ANY23-435
 URL: https://issues.apache.org/jira/browse/ANY23-435
 Project: Apache Any23
  Issue Type: Task
Reporter: Hans Brende
 Fix For: 2.4


Including httpclient, httpcore, httpmime, etc.





[jira] [Created] (ANY23-434) Upgrade tika to v1.21

2019-06-15 Thread Hans Brende (JIRA)
Hans Brende created ANY23-434:
-

 Summary: Upgrade tika to v1.21
 Key: ANY23-434
 URL: https://issues.apache.org/jira/browse/ANY23-434
 Project: Apache Any23
  Issue Type: Task
Reporter: Hans Brende
 Fix For: 2.4


POI should be upgraded to v4.1.0 at the same time.
Commons-codec should be upgraded to v1.12 at the same time.





[jira] [Created] (ANY23-433) Upgrade rdf4j to v2.5.2

2019-06-15 Thread Hans Brende (JIRA)
Hans Brende created ANY23-433:
-

 Summary: Upgrade rdf4j to v2.5.2
 Key: ANY23-433
 URL: https://issues.apache.org/jira/browse/ANY23-433
 Project: Apache Any23
  Issue Type: Task
Reporter: Hans Brende
 Fix For: 2.4


jsonld-java should be upgraded to v0.12.4 at the same time.
jackson should be upgraded to v2.9.9 at the same time.

This upgrade will allow the removal of a hacky workaround from the json-ld html 
extractor.





[jira] [Updated] (ANY23-431) Upgrade jsoup to v1.12.1

2019-06-15 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-431:
--
Description: This release contains a variety of bugfixes, including 
[#1009|https://github.com/jhy/jsoup/issues/1009], which will allow us to remove 
the JsoupUtils class, which was only present as a workaround for this bug.  
(was: This release contains a variety of bugfixes, 
including[#1009|https://github.com/jhy/jsoup/issues/1009], which will allow us 
to remove the JsoupUtils class, which was only present as a workaround for this 
bug.)

> Upgrade jsoup to v1.12.1
> 
>
> Key: ANY23-431
> URL: https://issues.apache.org/jira/browse/ANY23-431
> Project: Apache Any23
>  Issue Type: Task
>Affects Versions: 2.3
>Reporter: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>
> This release contains a variety of bugfixes, including 
> [#1009|https://github.com/jhy/jsoup/issues/1009], which will allow us to 
> remove the JsoupUtils class, which was only present as a workaround for this 
> bug.





[jira] [Updated] (ANY23-431) Upgrade jsoup to v1.12.1

2019-06-15 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-431:
--
Description: This release contains a variety of bugfixes, 
including[#1009|https://github.com/jhy/jsoup/issues/1009], which will allow us 
to remove the JsoupUtils class, which was only present as a workaround for this 
bug.  (was: This release contains a variety of bugfixes, including 
[#1009](https://github.com/jhy/jsoup/issues/1009), which will allow us to 
remove the JsoupUtils class, which was only present as a workaround for this 
bug.)

> Upgrade jsoup to v1.12.1
> 
>
> Key: ANY23-431
> URL: https://issues.apache.org/jira/browse/ANY23-431
> Project: Apache Any23
>  Issue Type: Task
>Affects Versions: 2.3
>Reporter: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>
> This release contains a variety of bugfixes, 
> including[#1009|https://github.com/jhy/jsoup/issues/1009], which will allow 
> us to remove the JsoupUtils class, which was only present as a workaround for 
> this bug.





[jira] [Created] (ANY23-432) Upgrade owlapi to v5.1.11

2019-06-15 Thread Hans Brende (JIRA)
Hans Brende created ANY23-432:
-

 Summary: Upgrade owlapi to v5.1.11
 Key: ANY23-432
 URL: https://issues.apache.org/jira/browse/ANY23-432
 Project: Apache Any23
  Issue Type: Task
Reporter: Hans Brende
 Fix For: 2.4


This release contains a variety of bugfixes, including, e.g., 
https://github.com/owlcs/owlapi/issues/813





[jira] [Created] (ANY23-431) Upgrade jsoup to v1.12.1

2019-06-15 Thread Hans Brende (JIRA)
Hans Brende created ANY23-431:
-

 Summary: Upgrade jsoup to v1.12.1
 Key: ANY23-431
 URL: https://issues.apache.org/jira/browse/ANY23-431
 Project: Apache Any23
  Issue Type: Task
Affects Versions: 2.3
Reporter: Hans Brende
 Fix For: 2.4


This release contains a variety of bugfixes, including 
[#1009](https://github.com/jhy/jsoup/issues/1009), which will allow us to 
remove the JsoupUtils class, which was only present as a workaround for this 
bug.





[jira] [Closed] (ANY23-425) iCal, jCal, xCal extractors aren't listed in META-INF/services

2019-02-10 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende closed ANY23-425.
-
Resolution: Fixed

> iCal, jCal, xCal extractors aren't listed in META-INF/services
> --
>
> Key: ANY23-425
> URL: https://issues.apache.org/jira/browse/ANY23-425
> Project: Apache Any23
>  Issue Type: Bug
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>






[jira] [Reopened] (ANY23-425) iCal, jCal, xCal extractors aren't listed in META-INF/services

2019-02-10 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende reopened ANY23-425:
---

> iCal, jCal, xCal extractors aren't listed in META-INF/services
> --
>
> Key: ANY23-425
> URL: https://issues.apache.org/jira/browse/ANY23-425
> Project: Apache Any23
>  Issue Type: Bug
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>






[jira] [Resolved] (ANY23-425) iCal, jCal, xCal extractors aren't listed in META-INF/services

2019-02-10 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-425.
---
Resolution: Fixed

> iCal, jCal, xCal extractors aren't listed in META-INF/services
> --
>
> Key: ANY23-425
> URL: https://issues.apache.org/jira/browse/ANY23-425
> Project: Apache Any23
>  Issue Type: Bug
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>






[jira] [Created] (ANY23-425) iCal, jCal, xCal extractors aren't listed in META-INF/services

2019-02-10 Thread Hans Brende (JIRA)
Hans Brende created ANY23-425:
-

 Summary: iCal, jCal, xCal extractors aren't listed in 
META-INF/services
 Key: ANY23-425
 URL: https://issues.apache.org/jira/browse/ANY23-425
 Project: Apache Any23
  Issue Type: Bug
Affects Versions: 2.3
Reporter: Hans Brende
Assignee: Hans Brende
 Fix For: 2.3








[jira] [Resolved] (ANY23-415) NTriplesExtractor tries all text/plain files, causing numerous fatal issues

2019-02-09 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-415.
---
Resolution: Fixed

> NTriplesExtractor tries all text/plain files, causing numerous fatal issues
> ---
>
> Key: ANY23-415
> URL: https://issues.apache.org/jira/browse/ANY23-415
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Minor
> Fix For: 2.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Since the NTriplesExtractorFactory includes a content type of "text/plain", 
> every plain text file is processed by the NTriplesExtractor, which in turn 
> causes huge numbers of completely unnecessary fatal issues to be sent to the 
> extraction report.
> In my crawls, this mostly occurs for the "humans.txt" files encountered.
> While this isn't a hugely serious bug, it is quite irritating, as it 
> really clutters up my logs.
>  
> Note: the NQuadsExtractorFactory (which can parse all the same documents as 
> NTriples) does *not* include a content type of "text/plain".
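
The effect described above can be sketched as a simple content-type dispatch. This is a hypothetical illustration, not Any23's selection logic: `MimeDispatch`, the extractor names, and the type sets passed in are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class MimeDispatch {
    /** Returns the names of all extractors whose declared MIME types contain the detected type. */
    static List<String> applicable(Map<String, Set<String>> supportedTypes, String detectedType) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : supportedTypes.entrySet()) {
            // An extractor that declares "text/plain" is selected for every plain text file.
            if (e.getValue().contains(detectedType)) {
                names.add(e.getKey());
            }
        }
        return names;
    }
}
```

Under this model, an extractor that lists only its specific type (for example `application/n-quads`) is never offered a `humans.txt` file, while one that also lists `text/plain` is offered all of them.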





[jira] [Assigned] (ANY23-415) NTriplesExtractor tries all text/plain files, causing numerous fatal issues

2019-02-09 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende reassigned ANY23-415:
-

Assignee: Hans Brende

> NTriplesExtractor tries all text/plain files, causing numerous fatal issues
> ---
>
> Key: ANY23-415
> URL: https://issues.apache.org/jira/browse/ANY23-415
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Minor
> Fix For: 2.3
>
>
> Since the NTriplesExtractorFactory includes a content type of "text/plain", 
> every plain text file is processed by the NTriplesExtractor, which in turn 
> causes huge numbers of completely unnecessary fatal issues to be sent to the 
> extraction report.
> In my crawls, this mostly occurs for the "humans.txt" files encountered.
> While this isn't a hugely serious bug, it is quite irritating, as it 
> really clutters up my logs.
>  
> Note: the NQuadsExtractorFactory (which can parse all the same documents as 
> NTriples) does *not* include a content type of "text/plain".





[jira] [Updated] (ANY23-415) NTriplesExtractor tries all text/plain files, causing numerous fatal issues

2019-02-09 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-415:
--
Fix Version/s: (was: 2.4)
   2.3

> NTriplesExtractor tries all text/plain files, causing numerous fatal issues
> ---
>
> Key: ANY23-415
> URL: https://issues.apache.org/jira/browse/ANY23-415
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 2.3
>Reporter: Hans Brende
>Priority: Minor
> Fix For: 2.3
>
>
> Since the NTriplesExtractorFactory includes a content type of "text/plain", 
> this causes every plain text file to be processed by the NTriplesExtractor, 
> which in turn causes huge numbers of completely unnecessary fatal issues 
> being sent to the extraction report.
> In my crawls, this mostly occurs for all the "humans.txt" files encountered.
> While this isn't a hugely serious bug, it is quite irritating as it does 
> really clutter up my logs.
>  
> Note: the NQuadsExtractorFactory (which can parse all the same documents as 
> NTriples) does *not* include a content type of "text/plain".





[jira] [Updated] (ANY23-370) Jenkins: IllegalStateException: checksum mismatch

2019-02-06 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-370:
--
Priority: Major  (was: Critical)

> Jenkins: IllegalStateException: checksum mismatch
> -
>
> Key: ANY23-370
> URL: https://issues.apache.org/jira/browse/ANY23-370
> Project: Apache Any23
>  Issue Type: Bug
>  Components: CIS (Jenkins)
>Affects Versions: 2.3
>Reporter: Hans Brende
>Priority: Major
> Fix For: 2.4
>
>
> Jenkins builds (e.g., 
> [#1597|https://builds.apache.org/job/Any23-trunk/1597/]) have been 
> sporadically timing out after 120 minutes during the *archival process*, even 
> though the actual *build process* completes successfully in less than half an 
> hour. I believe that this is due to the following error:
> {noformat}
> ERROR: Failed to archive \{ 
> org.apache.any23/apache-any23-service/2.3-SNAPSHOT/apache-any23-service-2.3-SNAPSHOT.pom=pom.xml,
>  
> org.apache.any23/apache-any23-service/2.3-20180718.234115-48/apache-any23-service-2.3-20180718.234115-48.war=target/apache-any23-service-2.3-SNAPSHOT.war,
>  
> org.apache.any23/apache-any23-service/2.3-20180718.234115-48/apache-any23-service-2.3-20180718.234115-48-without-deps.war=target/apache-any23-service-2.3-SNAPSHOT-without-deps.war,
>  
> org.apache.any23/apache-any23-service/2.3-20180718.234115-48/apache-any23-service-2.3-20180718.234115-48-with-deps.tar.gz=target/apache-any23-service-2.3-SNAPSHOT-with-deps.tar.gz,
>  
> org.apache.any23/apache-any23-service/2.3-20180718.234115-48/apache-any23-service-2.3-20180718.234115-48-with-deps.zip=target/apache-any23-service-2.3-SNAPSHOT-with-deps.zip,
>  
> org.apache.any23/apache-any23-service/2.3-20180718.234115-48/apache-any23-service-2.3-20180718.234115-48-without-deps.tar.gz=target/apache-any23-service-2.3-SNAPSHOT-without-deps.tar.gz,
>  
> org.apache.any23/apache-any23-service/2.3-20180718.234115-48/apache-any23-service-2.3-20180718.234115-48-without-deps.zip=target/apache-any23-service-2.3-SNAPSHOT-without-deps.zip,
>  
> org.apache.any23/apache-any23-service/2.3-20180718.234115-48/apache-any23-service-2.3-20180718.234115-48-server-embedded.tar.gz=target/apache-any23-service-2.3-SNAPSHOT-server-embedded.tar.gz,
>  
> org.apache.any23/apache-any23-service/2.3-20180718.234115-48/apache-any23-service-2.3-20180718.234115-48-server-embedded.zip=target/apache-any23-service-2.3-SNAPSHOT-server-embedded.zip
> } due to internal error; falling back to full archiving {noformat}
> {noformat}java.lang.IllegalStateException: End of stream while reading number
>  at jsync.protocol.BaseReader.readLong(BaseReader.java:40)
>  at jsync.protocol.BaseReader.readInt(BaseReader.java:26)
>  at jsync.protocol.ChangeStreamReader.next(ChangeStreamReader.java:54)
>  at jsync.protocol.ChangeInputStream.next(ChangeInputStream.java:27)
>  at jsync.protocol.ChangeInputStream.read(ChangeInputStream.java:71)
>  at com.cloudbees.jenkins.plugins.jsync.archiver.MD5DigestingInputStream.read(MD5DigestingInputStream.java:39)
>  at com.google.common.io.LimitInputStream.read(LimitInputStream.java:79)
>  at java.io.FilterInputStream.read(FilterInputStream.java:107)
>  at com.google.common.io.ByteStreams.copy(ByteStreams.java:193)
>  at jsync.protocol.FileSequenceReader.read(FileSequenceReader.java:35)
>  at com.cloudbees.jenkins.plugins.jsync.archiver.JSyncArtifactManager.remoteSync(JSyncArtifactManager.java:158)
>  at com.cloudbees.jenkins.plugins.jsync.archiver.JSyncArtifactManager.archive(JSyncArtifactManager.java:76)
>  at hudson.maven.MavenBuild$ProxyImpl.performArchiving(MavenBuild.java:512)
>  at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:881)
>  at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
>  at hudson.model.Run.execute(Run.java:1794)
>  at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543)
>  at hudson.model.ResourceController.execute(ResourceController.java:97)
>  at hudson.model.Executor.run(Executor.java:429){noformat}
> Then, at some point during the *"full archiving"* process, we get the 
> following message:
> {noformat}
> Build timed out (after 120 minutes). Marking the build as aborted.
> {noformat}
> In order to mitigate this timeout error, I've temporarily increased the 
> timeout to 180 minutes. But this should not be necessary: we should try to 
> fix the underlying issue.
> Here's another similar error message from [build 
> #1599|https://builds.apache.org/job/Any23-trunk/1599]:
> {noformat}
> ERROR: Failed to archive {...} due to internal error; falling back to full 
> archiving
> java.lang.IllegalStateException: checksum mismatch after transfer (900817822 
> vs. 1339188126); 
> /x1/jenkins/jenkins-home/jobs/Any23-trunk/modules/org.apache.any23$apach

[jira] [Updated] (ANY23-424) Update dependencies

2019-02-06 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-424:
--
Description: 
Update owlapi to v5.1.9
Update rdf4j to v2.4.4
Update tika to v1.20
Update jsonld-java to v0.12.3
Update mockito to v2.24.0
Update jackson to v2.9.8
Update biweekly to v0.6.3
Update poi to v4.0.1
Update httpclient to v4.5.7
Update httpcore to v4.4.11
Update jetty to v9.4.14.v20181114

  was:
Update owlapi to v5.1.9
Update rdf4j to v2.4.3
Update tika to v1.20
Update jsonld-java to v0.12.3
Update mockito to v2.24.0
Update jackson to v2.9.8
Update biweekly to v0.6.3
Update poi to v4.0.1
Update httpclient to v4.5.7
Update httpcore to v4.4.11
Update jetty to v9.4.14.v20181114


> Update dependencies
> ---
>
> Key: ANY23-424
> URL: https://issues.apache.org/jira/browse/ANY23-424
> Project: Apache Any23
>  Issue Type: Improvement
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> Update owlapi to v5.1.9
> Update rdf4j to v2.4.4
> Update tika to v1.20
> Update jsonld-java to v0.12.3
> Update mockito to v2.24.0
> Update jackson to v2.9.8
> Update biweekly to v0.6.3
> Update poi to v4.0.1
> Update httpclient to v4.5.7
> Update httpcore to v4.4.11
> Update jetty to v9.4.14.v20181114





[jira] [Resolved] (ANY23-424) Update dependencies

2019-02-06 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-424.
---
Resolution: Fixed

> Update dependencies
> ---
>
> Key: ANY23-424
> URL: https://issues.apache.org/jira/browse/ANY23-424
> Project: Apache Any23
>  Issue Type: Improvement
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> Update owlapi to v5.1.9
> Update rdf4j to v2.4.4
> Update tika to v1.20
> Update jsonld-java to v0.12.3
> Update mockito to v2.24.0
> Update jackson to v2.9.8
> Update biweekly to v0.6.3
> Update poi to v4.0.1
> Update httpclient to v4.5.7
> Update httpcore to v4.4.11
> Update jetty to v9.4.14.v20181114





[jira] [Resolved] (ANY23-418) Take another look at encoding detection

2019-02-06 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-418.
---
Resolution: Fixed

> Take another look at encoding detection
> ---
>
> Key: ANY23-418
> URL: https://issues.apache.org/jira/browse/ANY23-418
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: encoding
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> In order to address various shortcomings of Tika encoding detection, I've had 
> to modify the TikaEncodingDetector several times. Cf. ANY23-385 and 
> ANY23-411. In the former, I placed a much greater weight on detected charsets 
> declared in html meta elements & xml declarations. In the latter, I placed a 
> much greater weight on charsets returned from HTTP Content-Type headers.
> However, after taking a look at TIKA-539, I'm thinking I should reduce this 
> added weight (for at least html meta elements), and perhaps ignore it 
> altogether (unless it happens to match UTF-8, since it seems that incorrect 
> declarations usually declare something *other than* UTF-8, when the correct 
> charset should be UTF-8).
> Something like > 90% of all webpages use UTF-8 encoding, and all of our 
> encoding detection errors to date have revolved around *something other than 
> UTF-8* being detected when the correct encoding was actually UTF-8, not the 
> other way around.
> Therefore, what I propose is the following: 
> (1) In the absence of a Content-Type header, any declared hints that the 
> charset is UTF-8 should add to the weight for UTF-8, while any declared hints 
> that the charset is not UTF-8 should be ignored. 
> (2) In the presence of a Content-Type header, any other declared hints should 
> be ignored, unless they match UTF-8 and do not match the Content-Type header, 
> in which case all hints, including the Content-Type header, should be ignored.
>  EDIT: The above 2 points are a simplification of what I've actually 
> implemented (specifically, I don't necessarily ignore non-UTF-8 hints). See 
> PR 131 for details.
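
The two proposed rules can be sketched as follows. This is a minimal illustration of points (1) and (2) as literally stated above, not the logic that was actually implemented (which, per the EDIT note, differs; see PR 131). The class name, method names, and the idea of returning "hints to honor" are assumptions made for the example.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical sketch of the proposal; not the TikaEncodingDetector code.
final class CharsetHintSketch {

    static boolean isUtf8(String charset) {
        return charset != null && charset.trim().equalsIgnoreCase("UTF-8");
    }

    /**
     * Returns the declared charsets that should still influence detection.
     *
     * @param contentTypeCharset charset from the HTTP Content-Type header, if any
     * @param declaredHints      charsets declared in the document (meta elements, XML declarations)
     */
    static List<String> hintsToHonor(Optional<String> contentTypeCharset, List<String> declaredHints) {
        List<String> utf8Hints = declaredHints.stream()
                .filter(CharsetHintSketch::isUtf8)
                .collect(Collectors.toList());

        if (!contentTypeCharset.isPresent()) {
            // Rule (1): without a header, only UTF-8 hints add weight; others are ignored.
            return utf8Hints;
        }

        String header = contentTypeCharset.get();
        if (!utf8Hints.isEmpty() && !isUtf8(header)) {
            // Rule (2), exception: a UTF-8 hint contradicts the header, so ignore all hints.
            return Collections.emptyList();
        }

        // Rule (2), default: the header wins and other declared hints are ignored.
        return Collections.singletonList(header);
    }

    public static void main(String[] args) {
        System.out.println(hintsToHonor(Optional.empty(),
                java.util.Arrays.asList("ISO-8859-1", "UTF-8")));
        System.out.println(hintsToHonor(Optional.of("ISO-8859-1"),
                java.util.Arrays.asList("UTF-8")));
    }
}
```

When the returned list is empty, detection would fall back entirely on content-based heuristics, which is the intended outcome when a UTF-8 declaration and the header disagree.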





[jira] [Assigned] (ANY23-424) Update dependencies

2019-02-03 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende reassigned ANY23-424:
-

Assignee: Hans Brende

> Update dependencies
> ---
>
> Key: ANY23-424
> URL: https://issues.apache.org/jira/browse/ANY23-424
> Project: Apache Any23
>  Issue Type: Improvement
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> Update owlapi to v5.1.9
> Update rdf4j to v2.4.3
> Update tika to v1.20
> Update jsonld-java to v0.12.3
> Update mockito to v2.24.0
> Update jackson to v2.9.8
> Update biweekly to v0.6.3
> Update poi to v4.0.1
> Update httpclient to v4.5.7
> Update httpcore to v4.4.11
> Update jetty to v9.4.14.v20181114





[jira] [Assigned] (ANY23-418) Take another look at encoding detection

2019-02-03 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende reassigned ANY23-418:
-

Assignee: Hans Brende

> Take another look at encoding detection
> ---
>
> Key: ANY23-418
> URL: https://issues.apache.org/jira/browse/ANY23-418
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: encoding
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> In order to address various shortcomings of Tika encoding detection, I've had 
> to modify the TikaEncodingDetector several times. Cf. ANY23-385 and 
> ANY23-411. In the former, I placed a much greater weight on detected charsets 
> declared in html meta elements & xml declarations. In the latter, I placed a 
> much greater weight on charsets returned from HTTP Content-Type headers.
> However, after taking a look at TIKA-539, I'm thinking I should reduce this 
> added weight (for at least html meta elements), and perhaps ignore it 
> altogether (unless it happens to match UTF-8, since it seems that incorrect 
> declarations usually declare something *other than* UTF-8, when the correct 
> charset should be UTF-8).
> Something like > 90% of all webpages use UTF-8 encoding, and all of our 
> encoding detection errors to date have revolved around *something other than 
> UTF-8* being detected when the correct encoding was actually UTF-8, not the 
> other way around.
> Therefore, what I propose is the following: 
> (1) In the absence of a Content-Type header, any declared hints that the 
> charset is UTF-8 should add to the weight for UTF-8, while any declared hints 
> that the charset is not UTF-8 should be ignored. 
> (2) In the presence of a Content-Type header, any other declared hints should 
> be ignored, unless they match UTF-8 and do not match the Content-Type header, 
> in which case all hints, including the Content-Type header, should be ignored.
>  EDIT: The above 2 points are a simplification of what I've actually 
> implemented (specifically, I don't necessarily ignore non-UTF-8 hints). See 
> PR 131 for details.





[jira] [Created] (ANY23-424) Update dependencies

2019-02-03 Thread Hans Brende (JIRA)
Hans Brende created ANY23-424:
-

 Summary: Update dependencies
 Key: ANY23-424
 URL: https://issues.apache.org/jira/browse/ANY23-424
 Project: Apache Any23
  Issue Type: Improvement
Affects Versions: 2.3
Reporter: Hans Brende
 Fix For: 2.3


Update owlapi to v5.1.9
Update rdf4j to v2.4.3
Update tika to v1.20
Update jsonld-java to v0.12.3
Update mockito to v2.24.0
Update jackson to v2.9.8
Update biweekly to v0.6.3
Update poi to v4.0.1
Update httpclient to v4.5.7
Update httpcore to v4.4.11
Update jetty to v9.4.14.v20181114





[jira] [Updated] (ANY23-418) Take another look at encoding detection

2019-02-03 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-418:
--
Fix Version/s: (was: 2.4)
   2.3

> Take another look at encoding detection
> ---
>
> Key: ANY23-418
> URL: https://issues.apache.org/jira/browse/ANY23-418
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: encoding
>Affects Versions: 2.3
>Reporter: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> In order to address various shortcomings of Tika encoding detection, I've had 
> to modify the TikaEncodingDetector several times. Cf. ANY23-385 and 
> ANY23-411. In the former, I placed a much greater weight on detected charsets 
> declared in html meta elements & xml declarations. In the latter, I placed a 
> much greater weight on charsets returned from HTTP Content-Type headers.
> However, after taking a look at TIKA-539, I'm thinking I should reduce this 
> added weight (for at least html meta elements), and perhaps ignore it 
> altogether (unless it happens to match UTF-8, since it seems that incorrect 
> declarations usually declare something *other than* UTF-8, when the correct 
> charset should be UTF-8).
> Something like > 90% of all webpages use UTF-8 encoding, and all of our 
> encoding detection errors to date have revolved around *something other than 
> UTF-8* being detected when the correct encoding was actually UTF-8, not the 
> other way around.
> Therefore, what I propose is the following: 
> (1) In the absence of a Content-Type header, any declared hints that the 
> charset is UTF-8 should add to the weight for UTF-8, while any declared hints 
> that the charset is not UTF-8 should be ignored. 
> (2) In the presence of a Content-Type header, any other declared hints should 
> be ignored, unless they match UTF-8 and do not match the Content-Type header, 
> in which case all hints, including the Content-Type header, should be ignored.
>  EDIT: The above 2 points are a simplification of what I've actually 
> implemented (specifically, I don't necessarily ignore non-UTF-8 hints). See 
> PR 131 for details.





[jira] [Resolved] (ANY23-416) NTriplesExtractor does not recognize "application/n-triples" mimetype

2018-11-22 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-416.
---
Resolution: Fixed
  Assignee: Hans Brende

> NTriplesExtractor does not recognize "application/n-triples" mimetype
> -
>
> Key: ANY23-416
> URL: https://issues.apache.org/jira/browse/ANY23-416
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> The standard mimetype for n-triples, which is "application/n-triples", is not 
> contained in the list of mimetypes in the NTriplesExtractorFactory!





[jira] [Resolved] (ANY23-420) Handle Json+ld extraction failure

2018-11-22 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-420.
---
Resolution: Fixed

> Handle Json+ld extraction failure
> -
>
> Key: ANY23-420
> URL: https://issues.apache.org/jira/browse/ANY23-420
> Project: Apache Any23
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.3
>Reporter: dhirajforyou
>Assignee: Hans Brende
>Priority: Major
>  Labels: json-ld
> Fix For: 2.3
>
>
>  
> Adding an "applicationCategory" property to a JSON-LD block causes the Any23 
> extractor to fail.
> File referenced: 
> test/resources/html/html-embedded-jsonld-extractor-multiple.html
> Extra block added:
> 
>     {
>       "applicationCategory": "Sample, Category, Link",
>       "@context": "http://schema.org",
>       "@type": "SoftwareApplication",
>       "name": "Android Data",
>       "datePublished": "November 18, 2018"
>     }
>  
>  
> [applicationCategory|https://schema.org/applicationCategory] accepts text and 
> URL values, but the text value above produces the following error:
> Apache Any23 FAILURE
> Execution terminated with errors: Illegal character in path at index 7: 
> Sample, Category, Link
>  
>  
>  





[jira] [Updated] (ANY23-420) Handle Json+ld extraction failure

2018-11-22 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-420:
--
Fix Version/s: 2.3

> Handle Json+ld extraction failure
> -
>
> Key: ANY23-420
> URL: https://issues.apache.org/jira/browse/ANY23-420
> Project: Apache Any23
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.3
>Reporter: dhirajforyou
>Assignee: Hans Brende
>Priority: Major
>  Labels: json-ld
> Fix For: 2.3
>
>
>  
> Adding an "applicationCategory" property to a JSON-LD block causes the Any23 
> extractor to fail.
> File referenced: 
> test/resources/html/html-embedded-jsonld-extractor-multiple.html
> Extra block added:
> 
>     {
>       "applicationCategory": "Sample, Category, Link",
>       "@context": "http://schema.org",
>       "@type": "SoftwareApplication",
>       "name": "Android Data",
>       "datePublished": "November 18, 2018"
>     }
>  
>  
> [applicationCategory|https://schema.org/applicationCategory] accepts text and 
> URL values, but the text value above produces the following error:
> Apache Any23 FAILURE
> Execution terminated with errors: Illegal character in path at index 7: 
> Sample, Category, Link
>  
>  
>  





[jira] [Assigned] (ANY23-420) Handle Json+ld extraction failure

2018-11-22 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende reassigned ANY23-420:
-

Assignee: Hans Brende

> Handle Json+ld extraction failure
> -
>
> Key: ANY23-420
> URL: https://issues.apache.org/jira/browse/ANY23-420
> Project: Apache Any23
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.3
>Reporter: dhirajforyou
>Assignee: Hans Brende
>Priority: Major
>  Labels: json-ld
>
>  
> Adding an "applicationCategory" property to a JSON-LD block causes the Any23 
> extractor to fail.
> File referenced: 
> test/resources/html/html-embedded-jsonld-extractor-multiple.html
> Extra block added:
> 
>     {
>       "applicationCategory": "Sample, Category, Link",
>       "@context": "http://schema.org",
>       "@type": "SoftwareApplication",
>       "name": "Android Data",
>       "datePublished": "November 18, 2018"
>     }
>  
>  
> [applicationCategory|https://schema.org/applicationCategory] accepts text and 
> URL values, but the text value above produces the following error:
> Apache Any23 FAILURE
> Execution terminated with errors: Illegal character in path at index 7: 
> Sample, Category, Link
>  
>  
>  





[jira] [Commented] (ANY23-420) Handle Json+ld extraction failure

2018-11-22 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695933#comment-16695933
 ] 

Hans Brende commented on ANY23-420:
---

[~dhirajforyou] good point, Any23's error handling here leaves something to be 
desired. We can improve that on our side of things.

> Handle Json+ld extraction failure
> -
>
> Key: ANY23-420
> URL: https://issues.apache.org/jira/browse/ANY23-420
> Project: Apache Any23
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.3
>Reporter: dhirajforyou
>Priority: Major
>  Labels: json-ld
>
>  
> Adding an "applicationCategory" property to a JSON-LD block causes the Any23 
> extractor to fail.
> File referenced: 
> test/resources/html/html-embedded-jsonld-extractor-multiple.html
> Extra block added:
> 
>     {
>       "applicationCategory": "Sample, Category, Link",
>       "@context": "http://schema.org",
>       "@type": "SoftwareApplication",
>       "name": "Android Data",
>       "datePublished": "November 18, 2018"
>     }
>  
>  
> [applicationCategory|https://schema.org/applicationCategory] accepts text and 
> URL values, but the text value above produces the following error:
> Apache Any23 FAILURE
> Execution terminated with errors: Illegal character in path at index 7: 
> Sample, Category, Link
>  
>  
>  





[jira] [Commented] (ANY23-420) Handle Json+ld extraction failure

2018-11-21 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695191#comment-16695191
 ] 

Hans Brende commented on ANY23-420:
---

[~p_ansell] this issue appears to be up your alley.

> Handle Json+ld extraction failure
> -
>
> Key: ANY23-420
> URL: https://issues.apache.org/jira/browse/ANY23-420
> Project: Apache Any23
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.3
>Reporter: dhirajforyou
>Priority: Major
>  Labels: json-ld
>
>  
> Adding an "applicationCategory" property to a JSON-LD block causes the Any23 
> extractor to fail.
> File referenced: 
> test/resources/html/html-embedded-jsonld-extractor-multiple.html
> Extra block added:
> 
>     {
>       "applicationCategory": "Sample, Category, Link",
>       "@context": "http://schema.org",
>       "@type": "SoftwareApplication",
>       "name": "Android Data",
>       "datePublished": "November 18, 2018"
>     }
>  
>  
> [applicationCategory|https://schema.org/applicationCategory] accepts text and 
> URL values, but the text value above produces the following error:
> Apache Any23 FAILURE
> Execution terminated with errors: Illegal character in path at index 7: 
> Sample, Category, Link
>  
>  
>  





[jira] [Commented] (ANY23-420) Handle Json+ld extraction failure

2018-11-21 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695189#comment-16695189
 ] 

Hans Brende commented on ANY23-420:
---

Note: full stack trace appears as follows:

java.lang.IllegalArgumentException: Illegal character in path at index 7: Sample, Category, Link
    at java.net.URI.create(URI.java:852)
    at java.net.URI.resolve(URI.java:1036)
    at com.github.jsonldjava.utils.JsonLdUrl.resolve(JsonLdUrl.java:274)
    at com.github.jsonldjava.core.Context.expandIri(Context.java:539)
    at com.github.jsonldjava.core.Context.expandValue(Context.java:1099)
    at com.github.jsonldjava.core.JsonLdApi.expand(JsonLdApi.java:1007)
    at com.github.jsonldjava.core.JsonLdApi.expand(JsonLdApi.java:847)
    at com.github.jsonldjava.core.JsonLdApi.expand(JsonLdApi.java:1025)
    at com.github.jsonldjava.core.JsonLdProcessor.expand(JsonLdProcessor.java:146)
    at com.github.jsonldjava.core.JsonLdProcessor.toRDF(JsonLdProcessor.java:503)
    at org.eclipse.rdf4j.rio.jsonld.JSONLDParser.parse(JSONLDParser.java:71)
    at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:226)
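
The exception at the top of this trace comes from java.net.URI.create, so it can be reproduced with the JDK alone. The sketch below is a minimal illustration under that assumption; the class and method names are hypothetical, and it does not call jsonld-java or Any23.

```java
import java.net.URI;

// Hypothetical reproduction of the innermost frame of the trace above.
final class JsonLdUriFailureDemo {

    /** Returns the exception message, or null if the value parses as a URI. */
    static String parseFailure(String value) {
        try {
            URI.create(value);
            return null;
        } catch (IllegalArgumentException e) {
            // URI.create wraps the underlying URISyntaxException and reuses its message.
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The space after "Sample," (index 7) is not a legal URI path character.
        System.out.println(parseFailure("Sample, Category, Link"));
        // A well-formed URL value for the same property parses without error.
        System.out.println(parseFailure("https://schema.org/applicationCategory"));
    }
}
```

This suggests the text value is being treated as an IRI to resolve rather than as a plain literal, which is consistent with the expandIri and JsonLdUrl.resolve frames in the trace.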

> Handle Json+ld extraction failure
> -
>
> Key: ANY23-420
> URL: https://issues.apache.org/jira/browse/ANY23-420
> Project: Apache Any23
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.3
>Reporter: dhirajforyou
>Priority: Major
>  Labels: json-ld
>
>  
> Adding an "applicationCategory" property to a JSON-LD block causes the Any23 
> extractor to fail.
> File referenced: 
> test/resources/html/html-embedded-jsonld-extractor-multiple.html
> Extra block added:
> 
>     {
>       "applicationCategory": "Sample, Category, Link",
>       "@context": "http://schema.org",
>       "@type": "SoftwareApplication",
>       "name": "Android Data",
>       "datePublished": "November 18, 2018"
>     }
>  
>  
> [applicationCategory|https://schema.org/applicationCategory] accepts text and 
> URL values, but the text value above produces the following error:
> Apache Any23 FAILURE
> Execution terminated with errors: Illegal character in path at index 7: 
> Sample, Category, Link
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ANY23-420) Handle Json+ld extraction failure

2018-11-21 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695172#comment-16695172
 ] 

Hans Brende commented on ANY23-420:
---

[~dhirajforyou] This appears to be an issue with jsonld-java, a dependency of 
Any23. Please log this issue here instead: 
https://github.com/jsonld-java/jsonld-java/issues

> Handle Json+ld extraction failure
> -
>
> Key: ANY23-420
> URL: https://issues.apache.org/jira/browse/ANY23-420
> Project: Apache Any23
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.3
>Reporter: dhirajforyou
>Priority: Major
>  Labels: json-ld
>
>  
> Adding an "applicationCategory" property to a JSON-LD block causes the Any23 
> extractor to fail.
> File referenced: 
> test/resources/html/html-embedded-jsonld-extractor-multiple.html
> Extra block added:
> 
>     {
>       "applicationCategory": "Sample, Category, Link",
>       "@context": "http://schema.org",
>       "@type": "SoftwareApplication",
>       "name": "Android Data",
>       "datePublished": "November 18, 2018"
>     }
>  
>  
> [applicationCategory|https://schema.org/applicationCategory] accepts text and 
> URL values, but the text value above produces the following error:
> Apache Any23 FAILURE
> Execution terminated with errors: Illegal character in path at index 7: 
> Sample, Category, Link
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ANY23-395) any23.org 500 Internal Server Error

2018-11-07 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678820#comment-16678820
 ] 

Hans Brende commented on ANY23-395:
---

[~lewismc] Now I'm getting a 404 when I try to navigate to any23.org. Any idea 
what the problem is?

> any23.org 500 Internal Server Error
> ---
>
> Key: ANY23-395
> URL: https://issues.apache.org/jira/browse/ANY23-395
> Project: Apache Any23
>  Issue Type: Bug
>  Components: site
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Lewis John McGibbney
>Priority: Major
>
> When navigating to the site:
> http://any23-vm2.apache.org/any23/?format=best&uri=https%3A%2F%2Fwikipedia.org%2Fwiki%2FRSS&validation-mode=none
> I get a 500 internal server error. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ANY23-418) Take another look at encoding detection

2018-11-06 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-418:
--
Description: 
In order to address various shortcomings of Tika encoding detection, I've had 
to modify the TikaEncodingDetector several times. Cf. ANY23-385 and ANY23-411. 
In the former, I placed a much greater weight on detected charsets declared in 
html meta elements & xml declarations. In the latter, I placed a much greater 
weight on charsets returned from HTTP Content-Type headers.

However, after taking a look at TIKA-539, I'm thinking I should reduce this 
added weight (for at least html meta elements), and perhaps ignore it 
altogether (unless it happens to match UTF-8, since it seems that incorrect 
declarations usually declare something *other than* UTF-8, when the correct 
charset should be UTF-8).

Something like > 90% of all webpages use UTF-8 encoding, and all of our 
encoding detection errors to date have revolved around *something other than 
UTF-8* being detected when the correct encoding was actually UTF-8, not the 
other way around.

Therefore, what I propose is the following: 

(1) In the absence of a Content-Type header, any declared hints that the 
charset is UTF-8 should add to the weight for UTF-8, while any declared hints 
that the charset is not UTF-8 should be ignored. 

(2) In the presence of a Content-Type header, any other declared hints should 
be ignored, unless they match UTF-8 and do not match the Content-Type header, 
in which case all hints, including the Content-Type header, should be ignored.

 EDIT: The above 2 points are a simplification of what I've actually 
implemented (specifically, I don't necessarily ignore non-UTF-8 hints). See PR 
131 for details.

  was:
In order to address various shortcomings of Tika encoding detection, I've had 
to modify the TikaEncodingDetector several times. Cf. ANY23-385 and ANY23-411. 
In the former, I placed a much greater weight on detected charsets declared in 
html meta elements & xml declarations. In the latter, I placed a much greater 
weight on charsets returned from HTTP Content-Type headers.

However, after taking a look at TIKA-539, I'm thinking I should reduce this 
added weight (for at least html meta elements), and perhaps ignore it 
altogether (unless it happens to match UTF-8, since it seems that incorrect 
declarations usually declare something *other than* UTF-8, when the correct 
charset should be UTF-8).

Something like > 90% of all webpages use UTF-8 encoding, and all of our 
encoding detection errors to date have revolved around *something other than 
UTF-8* being detected when the correct encoding was actually UTF-8, not the 
other way around.

Therefore, what I propose is the following: 

(1) In the absence of a Content-Type header, any declared hints that the 
charset is UTF-8 should add to the weight for UTF-8, while any declared hints 
that the charset is not UTF-8 should be ignored. 

(2) In the presence of a Content-Type header, any other declared hints should 
be ignored, unless they match UTF-8 and do not match the Content-Type header, 
in which case all hints, including the Content-Type header, should be ignored.

 


> Take another look at encoding detection
> ---
>
> Key: ANY23-418
> URL: https://issues.apache.org/jira/browse/ANY23-418
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: encoding
>Affects Versions: 2.3
>Reporter: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> In order to address various shortcomings of Tika encoding detection, I've had 
> to modify the TikaEncodingDetector several times. Cf. ANY23-385 and 
> ANY23-411. In the former, I placed a much greater weight on detected charsets 
> declared in html meta elements & xml declarations. In the latter, I placed a 
> much greater weight on charsets returned from HTTP Content-Type headers.
> However, after taking a look at TIKA-539, I'm thinking I should reduce this 
> added weight (for at least html meta elements), and perhaps ignore it 
> altogether (unless it happens to match UTF-8, since it seems that incorrect 
> declarations usually declare something *other than* UTF-8, when the correct 
> charset should be UTF-8).
> Something like > 90% of all webpages use UTF-8 encoding, and all of our 
> encoding detection errors to date have revolved around *something other than 
> UTF-8* being detected when the correct encoding was actually UTF-8, not the 
> other way around.
> Therefore, what I propose is the following: 
> (1) In the absence of a Content-Type header, any declared hints that the 
> charset is UTF-8 should add to the weight for UTF-8, while any declared hints 
> that the charset is not UTF-8 should be ignored. 
> (2) In the presence of a Content-Type header, any other declared hints sh

[jira] [Created] (ANY23-418) Take another look at encoding detection

2018-11-01 Thread Hans Brende (JIRA)
Hans Brende created ANY23-418:
-

 Summary: Take another look at encoding detection
 Key: ANY23-418
 URL: https://issues.apache.org/jira/browse/ANY23-418
 Project: Apache Any23
  Issue Type: Improvement
  Components: encoding
Affects Versions: 2.3
Reporter: Hans Brende
 Fix For: 2.3


In order to address various shortcomings of Tika encoding detection, I've had 
to modify the TikaEncodingDetector several times. Cf. ANY23-385 and ANY23-411. 
In the former, I placed a much greater weight on detected charsets declared in 
html meta elements & xml declarations. In the latter, I placed a much greater 
weight on charsets returned from HTTP Content-Type headers.

However, after taking a look at TIKA-539, I'm thinking I should reduce this 
added weight (for at least html meta elements), and perhaps ignore it 
altogether (unless it happens to match UTF-8, since it seems that incorrect 
declarations usually declare something *other than* UTF-8, when the correct 
charset should be UTF-8).

Something like > 90% of all webpages use UTF-8 encoding, and all of our 
encoding detection errors to date have revolved around *something other than 
UTF-8* being detected when the correct encoding was actually UTF-8, not the 
other way around.

Therefore, what I propose is the following: 

(1) In the absence of a Content-Type header, any declared hints that the 
charset is UTF-8 should add to the weight for UTF-8, while any declared hints 
that the charset is not UTF-8 should be ignored. 

(2) In the presence of a Content-Type header, any other declared hints should 
be ignored, unless they match UTF-8 and do not match the Content-Type header, 
in which case all hints, including the Content-Type header, should be ignored.
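
The two rules above can be sketched roughly as follows. This is only an illustration of the proposal as worded here, not the actual TikaEncodingDetector change; the class name {{CharsetHintPolicy}} and the method {{effectiveHints}} are hypothetical, and "hints" is assumed to mean charsets declared in html meta elements or xml declarations.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: decides which declared charsets may still influence detection.
final class CharsetHintPolicy {

    /**
     * @param contentType charset from the HTTP Content-Type header, if any
     * @param declared    charsets declared inside the document (meta elements, xml declaration)
     * @return the hints that should add weight; an empty list means "trust statistics only"
     */
    static List<Charset> effectiveHints(Optional<Charset> contentType, List<Charset> declared) {
        boolean declaresUtf8 = declared.contains(StandardCharsets.UTF_8);

        if (!contentType.isPresent()) {
            // Rule (1): without a header, only UTF-8 declarations count.
            return declaresUtf8
                    ? Collections.singletonList(StandardCharsets.UTF_8)
                    : Collections.<Charset>emptyList();
        }

        // Rule (2): a UTF-8 declaration that contradicts the header voids every hint.
        if (declaresUtf8 && !contentType.get().equals(StandardCharsets.UTF_8)) {
            return Collections.emptyList();
        }
        // Otherwise the header wins and the other declarations are ignored.
        return Collections.singletonList(contentType.get());
    }
}
```

Returning a list of surviving hints (rather than a single charset) keeps the sketch independent of how the detector weighs those hints against its statistical guess.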

 





[jira] [Commented] (ANY23-395) any23.org 500 Internal Server Error

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672239#comment-16672239
 ] 

Hans Brende commented on ANY23-395:
---

[~lewismc] Looks like they've restarted the VM! However, the page still looks 
like it has the 2018-08-28 version.

> any23.org 500 Internal Server Error
> ---
>
> Key: ANY23-395
> URL: https://issues.apache.org/jira/browse/ANY23-395
> Project: Apache Any23
>  Issue Type: Bug
>  Components: site
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Lewis John McGibbney
>Priority: Major
>
> When navigating to the site:
> http://any23-vm2.apache.org/any23/?format=best&uri=https%3A%2F%2Fwikipedia.org%2Fwiki%2FRSS&validation-mode=none
> I get a 500 internal server error. 





[jira] [Updated] (ANY23-411) Use Content-Type to help determine encoding

2018-11-01 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-411:
--
Description: 
Incredibly enough, it seems that our encoding detector does not take the 
Content-Type header into account at all when trying to guess a document's 
charset encoding!

This has caused a problem for me with the page: 
http://w3c.github.io/microdata-rdf/tests/0065.html

Even though the Content-Type header is set to "text/html; charset=utf-8", we're 
guessing the charset to be: "IBM500", which in turn renders the page into 
complete gibberish. 

This must be a bug in Tika, because even when I set the declared encoding of 
the charset detector to UTF-8, IBM500 is still the most confident result.

Cf. https://issues.apache.org/jira/browse/TIKA-2771

  was:
Incredibly enough, it seems that our encoding detector does not take the 
Content-Type header into account at all when trying to guess a document's 
charset encoding!

This has caused a problem for me with the page: 
http://w3c.github.io/microdata-rdf/tests/0065.html

Even though the Content-Type header is set to "text/html; charset=utf-8", we're 
guessing the charset to be: "IBM500", which in turn renders the page into 
complete gibberish. 


> Use Content-Type to help determine encoding
> ---
>
> Key: ANY23-411
> URL: https://issues.apache.org/jira/browse/ANY23-411
> Project: Apache Any23
>  Issue Type: Bug
>  Components: encoding
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> Incredibly enough, it seems that our encoding detector does not take the 
> Content-Type header into account at all when trying to guess a document's 
> charset encoding!
> This has caused a problem for me with the page: 
> http://w3c.github.io/microdata-rdf/tests/0065.html
> Even though the Content-Type header is set to "text/html; charset=utf-8", 
> we're guessing the charset to be: "IBM500", which in turn renders the page 
> into complete gibberish. 
> This must be a bug in Tika, because even when I set the declared encoding of 
> the charset detector to UTF-8, IBM500 is still the most confident result.
> Cf. https://issues.apache.org/jira/browse/TIKA-2771
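
For illustration, pulling the charset out of a header value such as "text/html; charset=utf-8" could look roughly like the sketch below. It is not the Any23 or Tika code; the class name {{ContentTypeCharset}} is hypothetical, and the parsing is deliberately simplified (it does not handle every quoting rule for media-type parameters).

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper: extracts the charset parameter from a Content-Type header value.
final class ContentTypeCharset {

    static Optional<Charset> parse(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // Parameters follow the media type, separated by ';'.
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: treat as "no usable hint".
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

A detector could then treat the returned charset as a strongly weighted hint instead of relying on byte statistics alone.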





[jira] [Updated] (ANY23-417) Inherent problems with mimetype detection

2018-11-01 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-417:
--
Description: 
N-Triples is a subset of Turtle, and it is also a subset of N-Quads. Turtle is 
a subset of TriG.

But when we are performing mimetype detection on a plain text file, we only 
sniff the first few kilobytes of data. Therefore, something we initially detect 
as N-Triples may in fact be a Turtle, TriG, or N-Quads document. Something we 
initially detect as Turtle may in fact be a TriG document.

Therefore, if we detect that the document is Turtle, in the absence of a 
declared Content-Type, we should probably assume that it is actually TriG, just in 
case.

If we can only detect that the document is N-Triples, that presents a problem, 
because it could also be either Turtle or N-Quads. Which do we choose?

Another problem I see is that we are detecting both N3 and Turtle in two 
separate steps. However, as I understand it, for the purposes of RDF, N3 is 
essentially a synonym for Turtle. So it doesn't really make sense to use two 
different detection steps for this. It appears that our N3 detection step is 
actually detecting N-Triples, which is not at all the same thing.

(Indeed, in {{org.eclipse.rdf4j.rio.n3.N3ParserFactory}}'s implementation of 
{{getParser()}} we see: {{return new TurtleParser()}})



  was:
N-Triples is a subset of Turtle, and it is also a subset of N-Quads. Turtle is 
a subset of TriG.

But when we are performing mimetype detection on a plain text file, we only 
sniff the first few kilobytes of data. Therefore, something we initially detect 
as N-Triples may in fact be a Turtle, TriG, or N-Quads document. Something we 
initially detect as Turtle may in fact be a TriG document.

Therefore, if we detect that the document is Turtle, in the absence of a 
declared Content-Type, we should probably assume that it is actually TriG, just in 
case.

If we can only detect that the document is N-Triples, that presents a problem, 
because it could also be either Turtle or N-Quads. Which do we choose?

Another problem I see is that we are detecting both N3 and Turtle in two 
separate steps. However, as I understand it, for the purposes of RDF, N3 is 
essentially a synonym for Turtle. So it doesn't really make sense to use two 
different detection steps for this. It appears that our N3 detection step is 
actually detecting N-Triples, which is not at all the same thing.




> Inherent problems with mimetype detection
> -
>
> Key: ANY23-417
> URL: https://issues.apache.org/jira/browse/ANY23-417
> Project: Apache Any23
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 2.3
>Reporter: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> N-Triples is a subset of Turtle, and it is also a subset of N-Quads. Turtle 
> is a subset of TriG.
> But when we are performing mimetype detection on a plain text file, we only 
> sniff the first few kilobytes of data. Therefore, something we initially 
> detect as N-Triples may in fact be a Turtle, TriG, or N-Quads document. 
> Something we initially detect as Turtle may in fact be a TriG document.
> Therefore, if we detect that the document is Turtle, in the absence of a 
> declared Content-Type, we should probably assume that it is actually TriG, just 
> in case.
> If we can only detect that the document is N-Triples, that presents a 
> problem, because it could also be either Turtle or N-Quads. Which do we 
> choose?
> Another problem I see is that we are detecting both N3 and Turtle in two 
> separate steps. However, as I understand it, for the purposes of RDF, N3 is 
> essentially a synonym for Turtle. So it doesn't really make sense to use two 
> different detection steps for this. It appears that our N3 detection step is 
> actually detecting N-Triples, which is not at all the same thing.
> (Indeed, in {{org.eclipse.rdf4j.rio.n3.N3ParserFactory}}'s implementation of 
> {{getParser()}} we see: {{return new TurtleParser()}})





[jira] [Created] (ANY23-417) Inherent problems with mimetype detection

2018-11-01 Thread Hans Brende (JIRA)
Hans Brende created ANY23-417:
-

 Summary: Inherent problems with mimetype detection
 Key: ANY23-417
 URL: https://issues.apache.org/jira/browse/ANY23-417
 Project: Apache Any23
  Issue Type: Bug
  Components: mime
Affects Versions: 2.3
Reporter: Hans Brende
 Fix For: 2.3


N-Triples is a subset of Turtle, and it is also a subset of N-Quads. Turtle is 
a subset of TriG.

But when we are performing mimetype detection on a plain text file, we only 
sniff the first few kilobytes of data. Therefore, something we initially detect 
as N-Triples may in fact be a Turtle, TriG, or N-Quads document. Something we 
initially detect as Turtle may in fact be a TriG document.

Therefore, if we detect that the document is Turtle, in the absence of a 
declared Content-Type, we should probably assume that it is actually TriG, just in 
case.

If we can only detect that the document is N-Triples, that presents a problem, 
because it could also be either Turtle or N-Quads. Which do we choose?

Another problem I see is that we are detecting both N3 and Turtle in two 
separate steps. However, as I understand it, for the purposes of RDF, N3 is 
essentially a synonym for Turtle. So it doesn't really make sense to use two 
different detection steps for this. It appears that our N3 detection step is 
actually detecting N-Triples, which is not at all the same thing.
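
To make the subset relations concrete, here is a small sketch of which formats a sniffed result could still turn out to be. The enum {{RdfSyntax}} and the class {{SyntaxWidening}} are hypothetical names, not Any23 types, and the sketch deliberately does not answer the open question of which candidate to pick for N-Triples.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical enum of the formats discussed above.
enum RdfSyntax { N_TRIPLES, TURTLE, N_QUADS, TRIG }

final class SyntaxWidening {

    /**
     * Given what the first few kilobytes looked like, returns every format
     * the whole document might still be.
     */
    static Set<RdfSyntax> candidates(RdfSyntax sniffed) {
        switch (sniffed) {
            case N_TRIPLES:
                // N-Triples is a subset of Turtle (hence of TriG) and of N-Quads.
                return EnumSet.of(RdfSyntax.N_TRIPLES, RdfSyntax.TURTLE,
                        RdfSyntax.N_QUADS, RdfSyntax.TRIG);
            case TURTLE:
                // Turtle is a subset of TriG.
                return EnumSet.of(RdfSyntax.TURTLE, RdfSyntax.TRIG);
            default:
                // Nothing wider is known for N-Quads or TriG in this sketch.
                return EnumSet.of(sniffed);
        }
    }
}
```

For a sniffed Turtle document the candidate set is {TURTLE, TRIG}, so choosing TriG is always safe; for N-Triples the set contains both TriG and N-Quads, and neither of those two covers the other, which is exactly the problem described above.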







[jira] [Commented] (ANY23-395) any23.org 500 Internal Server Error

2018-11-01 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671617#comment-16671617
 ] 

Hans Brende commented on ANY23-395:
---

[~lewismc] Any update on this?

> any23.org 500 Internal Server Error
> ---
>
> Key: ANY23-395
> URL: https://issues.apache.org/jira/browse/ANY23-395
> Project: Apache Any23
>  Issue Type: Bug
>  Components: site
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Lewis John McGibbney
>Priority: Major
>
> When navigating to the site:
> http://any23-vm2.apache.org/any23/?format=best&uri=https%3A%2F%2Fwikipedia.org%2Fwiki%2FRSS&validation-mode=none
> I get a 500 internal server error. 





[jira] [Created] (ANY23-416) NTriplesExtractor does not recognize "application/n-triples" mimetype

2018-10-31 Thread Hans Brende (JIRA)
Hans Brende created ANY23-416:
-

 Summary: NTriplesExtractor does not recognize 
"application/n-triples" mimetype
 Key: ANY23-416
 URL: https://issues.apache.org/jira/browse/ANY23-416
 Project: Apache Any23
  Issue Type: Bug
  Components: extractors
Affects Versions: 2.3
Reporter: Hans Brende
 Fix For: 2.3


The standard mimetype for n-triples, which is "application/n-triples", is not 
contained in the list of mimetypes in the NTriplesExtractorFactory!





[jira] [Updated] (ANY23-415) NTriplesExtractor tries all text/plain files, causing numerous fatal issues

2018-10-31 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-415:
--
Description: 
Since the NTriplesExtractorFactory includes a content type of "text/plain", 
this causes every plain text file to be processed by the NTriplesExtractor, 
which in turn causes huge numbers of completely unnecessary fatal issues to be 
sent to the extraction report.

In my crawls, this mostly occurs for all the "humans.txt" files encountered.

While this isn't a hugely serious bug, it is quite irritating as it does really 
clutter up my logs.

 
Note: the NQuadsExtractorFactory (which can parse all the same documents as 
NTriples) does *not* include a content type of "text/plain".

  was:
Since the NTriplesExtractorFactory includes a content type of "text/plain", 
this causes every plain text file to be processed by the NTriplesExtractor, 
which in turn causes huge numbers of completely unnecessary fatal issues to be 
sent to the extraction report. 

In my crawls, this mostly occurs for all the "humans.txt" files encountered.

While this isn't a hugely serious bug, it is quite irritating as it does really 
clutter up my logs.


> NTriplesExtractor tries all text/plain files, causing numerous fatal issues
> ---
>
> Key: ANY23-415
> URL: https://issues.apache.org/jira/browse/ANY23-415
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 2.3
>Reporter: Hans Brende
>Priority: Minor
> Fix For: 2.3
>
>
> Since the NTriplesExtractorFactory includes a content type of "text/plain", 
> this causes every plain text file to be processed by the NTriplesExtractor, 
> which in turn causes huge numbers of completely unnecessary fatal issues 
> to be sent to the extraction report.
> In my crawls, this mostly occurs for all the "humans.txt" files encountered.
> While this isn't a hugely serious bug, it is quite irritating as it does 
> really clutter up my logs.
>  
> Note: the NQuadsExtractorFactory (which can parse all the same documents as 
> NTriples) does *not* include a content type of "text/plain".





[jira] [Commented] (ANY23-395) any23.org 500 Internal Server Error

2018-10-31 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670701#comment-16670701
 ] 

Hans Brende commented on ANY23-395:
---

[~lewismc] Were you able to update the VM? Navigating to any23.org, at the 
bottom of the page I still see: "Apache Any23 v.2.3-SNAPSHOT (2018-*08-28* 
14:12:59+)", but we need the 2018-10-31 version.

> any23.org 500 Internal Server Error
> ---
>
> Key: ANY23-395
> URL: https://issues.apache.org/jira/browse/ANY23-395
> Project: Apache Any23
>  Issue Type: Bug
>  Components: site
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Lewis John McGibbney
>Priority: Major
>
> When navigating to the site:
> http://any23-vm2.apache.org/any23/?format=best&uri=https%3A%2F%2Fwikipedia.org%2Fwiki%2FRSS&validation-mode=none
> I get a 500 internal server error. 





[jira] [Created] (ANY23-415) NTriplesExtractor tries all text/plain files, causing numerous fatal issues

2018-10-31 Thread Hans Brende (JIRA)
Hans Brende created ANY23-415:
-

 Summary: NTriplesExtractor tries all text/plain files, causing 
numerous fatal issues
 Key: ANY23-415
 URL: https://issues.apache.org/jira/browse/ANY23-415
 Project: Apache Any23
  Issue Type: Bug
  Components: extractors
Affects Versions: 2.3
Reporter: Hans Brende
 Fix For: 2.3


Since the NTriplesExtractorFactory includes a content type of "text/plain", 
this causes every plain text file to be processed by the NTriplesExtractor, 
which in turn causes huge numbers of completely unnecessary fatal issues to be 
sent to the extraction report. 

In my crawls, this mostly occurs for all the "humans.txt" files encountered.

While this isn't a hugely serious bug, it is quite irritating as it does really 
clutter up my logs.





[jira] [Commented] (ANY23-395) any23.org 500 Internal Server Error

2018-10-31 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670578#comment-16670578
 ] 

Hans Brende commented on ANY23-395:
---

[~lewismc] Great! Let me know when the update is live!

> any23.org 500 Internal Server Error
> ---
>
> Key: ANY23-395
> URL: https://issues.apache.org/jira/browse/ANY23-395
> Project: Apache Any23
>  Issue Type: Bug
>  Components: site
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Lewis John McGibbney
>Priority: Major
>
> When navigating to the site:
> http://any23-vm2.apache.org/any23/?format=best&uri=https%3A%2F%2Fwikipedia.org%2Fwiki%2FRSS&validation-mode=none
> I get a 500 internal server error. 





[jira] [Comment Edited] (ANY23-395) any23.org 500 Internal Server Error

2018-10-31 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670100#comment-16670100
 ] 

Hans Brende edited comment on ANY23-395 at 10/31/18 1:21 PM:
-

[~lewismc] As mentioned over in INFRA-16986, I've uploaded my ssh key for VM 
access.

Can you update the VM with my latest commit? I think it'll fix the 500 error at 
least.


was (Author: hansbrende):
[~lewismc] As mentioned over in INFRA-16986, I'm uploaded my ssh key for VM 
access.

Can you update the VM with my latest commit? I think it'll fix the 500 error at 
least.

> any23.org 500 Internal Server Error
> ---
>
> Key: ANY23-395
> URL: https://issues.apache.org/jira/browse/ANY23-395
> Project: Apache Any23
>  Issue Type: Bug
>  Components: site
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Lewis John McGibbney
>Priority: Major
>
> When navigating to the site:
> http://any23-vm2.apache.org/any23/?format=best&uri=https%3A%2F%2Fwikipedia.org%2Fwiki%2FRSS&validation-mode=none
> I get a 500 internal server error. 





[jira] [Commented] (ANY23-395) any23.org 500 Internal Server Error

2018-10-31 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670100#comment-16670100
 ] 

Hans Brende commented on ANY23-395:
---

[~lewismc] As mentioned over in INFRA-16986, I'm uploaded my ssh key for VM 
access.

Can you update the VM with my latest commit? I think it'll fix the 500 error at 
least.

> any23.org 500 Internal Server Error
> ---
>
> Key: ANY23-395
> URL: https://issues.apache.org/jira/browse/ANY23-395
> Project: Apache Any23
>  Issue Type: Bug
>  Components: site
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Lewis John McGibbney
>Priority: Major
>
> When navigating to the site:
> http://any23-vm2.apache.org/any23/?format=best&uri=https%3A%2F%2Fwikipedia.org%2Fwiki%2FRSS&validation-mode=none
> I get a 500 internal server error. 





[jira] [Commented] (ANY23-413) CSV Extractor attempts to be too smart

2018-10-30 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669508#comment-16669508
 ] 

Hans Brende commented on ANY23-413:
---

Also see: 
https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164

> CSV Extractor attempts to be too smart
> --
>
> Key: ANY23-413
> URL: https://issues.apache.org/jira/browse/ANY23-413
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 2.3
>Reporter: Hans Brende
>Priority: Minor
> Fix For: 2.3
>
>
> Currently, our CSV extractor tries to figure out what the datatype of each 
> cell is simply by attempting to parse a float or integer from the cell and 
> falling back on "string".
> This is problematic because cells that look like numbers may not, in fact, be 
> numbers.
> Consider a column of version numbers, such as:
> 4
> 4.1
> 4.1.1
> etc.
> Currently our csv extractor will assign the following datatypes to this 
> column:
> 4 -> integer
> 4.1 -> float
> 4.1.1 -> string
> We could improve this guessing ability by parsing the entire column before 
> assigning a datatype, and then using the least-specific datatype encountered. 
> However, this solution would also be problematic because then we'd have to 
> hold the entire table in memory before generating any triples. And it still 
> wouldn't guarantee correctness.
> Without structured data telling us what the original datatype was, I don't 
> think assigning any datatypes other than "string" to string values is 
> worthwhile.
> Another problem is that the extractor strips leading and trailing whitespace 
> from all values, including string values. While this behavior probably 
> wouldn't present a problem for most use-cases, it does mean that the 
> algorithm is lossy.
> Cf. ANY23-218





[jira] [Commented] (ANY23-413) CSV Extractor attempts to be too smart

2018-10-30 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669492#comment-16669492
 ] 

Hans Brende commented on ANY23-413:
---

One possible fix for this would be to:

(1) Buffer a few kilobytes worth of input before writing out the first triple. 
(Which we're already doing to detect the column-separator character).
(2) Use the *least* specific possible datatype found in the buffered input for 
a column as the *most* specific datatype we will assign to items in that column.
(3) If we don't get enough representative samples for a column in the few 
kilobytes that we buffer to be reasonably confident in our choice of datatype, 
fall back to string.
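
A rough sketch of steps (2) and (3), assuming the buffered cells of one column are already available as a list. The names {{ColumnTypeGuesser}}, {{typeOf}}, {{columnType}} and the {{minSamples}} threshold are hypothetical, and the number parsing is simplified ({{Double.parseDouble}} also accepts values such as "NaN", which a real extractor would probably want to exclude).

```java
import java.util.List;

// Hypothetical helper: picks one datatype per column from a buffered sample.
final class ColumnTypeGuesser {

    // Declared from most specific to least specific.
    enum Type { INTEGER, FLOAT, STRING }

    // Most specific type that a single cell can be parsed as.
    static Type typeOf(String cell) {
        String s = cell.trim();
        try {
            Long.parseLong(s);
            return Type.INTEGER;
        } catch (NumberFormatException ignored) {
            // not an integer, try float next
        }
        try {
            Double.parseDouble(s);
            return Type.FLOAT;
        } catch (NumberFormatException ignored) {
            return Type.STRING;
        }
    }

    // Least specific type seen in the sample; STRING if the sample is too small.
    static Type columnType(List<String> sample, int minSamples) {
        if (sample.size() < minSamples) {
            return Type.STRING; // step (3): not enough evidence
        }
        Type least = Type.INTEGER;
        for (String cell : sample) {
            Type t = typeOf(cell);
            if (t.compareTo(least) > 0) {
                least = t; // step (2): widen to the least specific type found
            }
        }
        return least;
    }
}
```

With the version-number column from the issue description ("4", "4.1", "4.1.1"), every cell would then be typed as string, since "4.1.1" parses as neither an integer nor a float.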

> CSV Extractor attempts to be too smart
> --
>
> Key: ANY23-413
> URL: https://issues.apache.org/jira/browse/ANY23-413
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 2.3
>Reporter: Hans Brende
>Priority: Minor
> Fix For: 2.3
>
>
> Currently, our CSV extractor tries to figure out what the datatype of each 
> cell is simply by attempting to parse a float or integer from the cell and 
> falling back on "string".
> This is problematic because cells that look like numbers may not, in fact, be 
> numbers.
> Consider a column of version numbers, such as:
> 4
> 4.1
> 4.1.1
> etc.
> Currently our csv extractor will assign the following datatypes to this 
> column:
> 4 -> integer
> 4.1 -> float
> 4.1.1 -> string
> We could improve this guessing ability by parsing the entire column before 
> assigning a datatype, and then using the least-specific datatype encountered. 
> However, this solution would also be problematic because then we'd have to 
> hold the entire table in memory before generating any triples. And it still 
> wouldn't guarantee correctness.
> Without structured data telling us what the original datatype was, I don't 
> think assigning any datatypes other than "string" to string values is 
> worthwhile.
> Another problem is that the extractor strips leading and trailing whitespace 
> from all values, including string values. While this behavior probably 
> wouldn't present a problem for most use-cases, it does mean that the 
> algorithm is lossy.
> Cf. ANY23-218





[jira] [Resolved] (ANY23-414) Support reverse itemprops in microdata

2018-10-30 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-414.
---
Resolution: Fixed
  Assignee: Hans Brende

> Support reverse itemprops in microdata
> --
>
> Key: ANY23-414
> URL: https://issues.apache.org/jira/browse/ANY23-414
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: microdata
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> Microdata has an experimental feature called reverse itemprops. For details, 
> see: http://w3c.github.io/microdata-rdf/#reverse-itemprop
> Although it's still "experimental", we may as well support it.





[jira] [Created] (ANY23-414) Support reverse itemprops in microdata

2018-10-30 Thread Hans Brende (JIRA)
Hans Brende created ANY23-414:
-

 Summary: Support reverse itemprops in microdata
 Key: ANY23-414
 URL: https://issues.apache.org/jira/browse/ANY23-414
 Project: Apache Any23
  Issue Type: Improvement
  Components: microdata
Affects Versions: 2.3
Reporter: Hans Brende
 Fix For: 2.3


Microdata has an experimental feature called reverse itemprops. For details, 
see: http://w3c.github.io/microdata-rdf/#reverse-itemprop

Although it's still "experimental", we may as well support it.





[jira] [Updated] (ANY23-240) Option to process html tags as spaces in Microdata

2018-10-29 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-240:
--
Affects Version/s: 2.3

> Option to process html tags as spaces in Microdata
> --
>
> Key: ANY23-240
> URL: https://issues.apache.org/jira/browse/ANY23-240
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: extractors, microdata
>Affects Versions: 2.3
>Reporter: Andrey Kutuzov
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> When extracting Microdata from html pages, any23 silently drops all html tags 
> inside predicates' values. See, for example, 
> http://schema.org/Recipe/ingredients at http://kuking.net/3_2070.htm.
> The problem is that on this page (and many others) ingredients are separated 
> from each other only with '' tag. After any23 drops it, the content 
> becomes mixed and unintelligible. At the same time, Google Structured Data 
> Testing Tool separates them properly with spaces.
> Is it possible to implement this behavior (replacing  tags with spaces) 
> in any23 as an option?
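
As a rough illustration of the requested behavior, tags inside a property value could be replaced with spaces before the text is emitted. This is only a naive regex sketch, not how Any23 implements the option; the class name {{TagsToSpaces}} is hypothetical, and a real implementation would walk the parsed DOM rather than match tags textually.

```java
// Hypothetical helper: turns markup inside a value into single spaces.
final class TagsToSpaces {

    static String textWithSpaces(String htmlFragment) {
        return htmlFragment
                .replaceAll("<[^>]*>", " ") // every tag becomes one space
                .replaceAll("\\s+", " ")    // collapse runs of whitespace
                .trim();
    }
}
```

For a value where two ingredients are separated only by a tag, this keeps them apart as separate words instead of running them together.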





[jira] [Resolved] (ANY23-240) Option to process html tags as spaces in Microdata

2018-10-29 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-240.
---
Resolution: Fixed
  Assignee: Hans Brende  (was: Andrey Kutuzov)

> Option to process html tags as spaces in Microdata
> --
>
> Key: ANY23-240
> URL: https://issues.apache.org/jira/browse/ANY23-240
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: extractors, microdata
>Reporter: Andrey Kutuzov
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> When extracting Microdata from html pages, any23 silently drops all html tags 
> inside predicates' values. See, for example, 
> http://schema.org/Recipe/ingredients at http://kuking.net/3_2070.htm.
> The problem is that on this page (and many others) ingredients are separated 
> from each other only with '' tag. After any23 drops it, the content 
> becomes mixed and unintelligible. At the same time, Google Structured Data 
> Testing Tool separates them properly with spaces.
> Is it possible to implement this behavior (replacing  tags with spaces) 
> in any23 as an option?





[jira] [Updated] (ANY23-332) Plugin-specific properties shouldn't be declared in default-configuration.properties

2018-10-29 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-332:
--
Summary: Plugin-specific properties shouldn't be declared in 
default-configuration.properties  (was: Plugin-specific properties shouldn't be 
declared in default-configuration.properties?)

> Plugin-specific properties shouldn't be declared in 
> default-configuration.properties
> 
>
> Key: ANY23-332
> URL: https://issues.apache.org/jira/browse/ANY23-332
> Project: Apache Any23
>  Issue Type: Improvement
>Affects Versions: 2.1
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Minor
> Fix For: 2.3
>
>
> I noticed that one of the properties in the 
> {{default-configuration.properties}} file is called 
> "{{any23.extraction.openie.confidence.threshold}}". 
> However, given that OpenIE is a dynamically-loaded plugin (not part of the 
> core module), it doesn't make sense to me to have OpenIE-specific properties 
> declared in the default-configuration file in the *api* module (it also shows 
> up in IntelliJ as being the only "unused" property in the config file).
> It might make more sense to have a separate OpenIE-specific configuration 
> file declared in the OpenIE plugin jar, and then that file would be appended 
> to the api default-configuration file when doing OpenIE stuff.
> Thoughts?
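
To make the "appended" idea concrete, here is a rough sketch using only java.util.Properties. It is not how Any23 loads its configuration. The class and method names are hypothetical, as are the `any23.core.example` key and both property values; only the OpenIE property name comes from the description above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class LayeredConfig {
    // Parses properties from text; a real plugin would read a resource from its own jar.
    static Properties load(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    // Returns a new Properties holding the base entries plus the plugin entries.
    // Plugin entries win on key collisions; the inputs are left unmodified.
    static Properties append(Properties base, Properties plugin) {
        Properties merged = new Properties();
        merged.putAll(base);
        merged.putAll(plugin);
        return merged;
    }

    public static void main(String[] args) {
        Properties core = load("any23.core.example=1\n"); // hypothetical core key
        Properties openie = load("any23.extraction.openie.confidence.threshold=0.5\n");
        Properties merged = append(core, openie);
        System.out.println(merged.getProperty("any23.extraction.openie.confidence.threshold"));
    }
}
```

The point of the layering is that the core defaults never mention the plugin key; it only appears once the plugin's own file is merged in.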





[jira] [Updated] (ANY23-402) Deprecate JSONWriter, JSONWriterFactory

2018-10-29 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-402:
--
Summary: Deprecate JSONWriter, JSONWriterFactory  (was: Deprecate 
JSONWriter, JSONWriterFactory?)

> Deprecate JSONWriter, JSONWriterFactory
> ---
>
> Key: ANY23-402
> URL: https://issues.apache.org/jira/browse/ANY23-402
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Trivial
> Fix For: 2.3
>
>
> Our JSONWriter class, so far as I can tell, does not conform to any 
> specification for writing out triples as JSON. It neither writes JSON-LD nor 
> RDF/JSON (which is superseded by JSON-LD). 
> Now that we have the JSONLDWriter, should we deprecate the JSONWriter class?
> (Also note that we do not have a corresponding JSONExtractor. Cf. ANY23-129)





[jira] [Created] (ANY23-413) CSV Extractor attempts to be too smart

2018-10-29 Thread Hans Brende (JIRA)
Hans Brende created ANY23-413:
-

 Summary: CSV Extractor attempts to be too smart
 Key: ANY23-413
 URL: https://issues.apache.org/jira/browse/ANY23-413
 Project: Apache Any23
  Issue Type: Bug
  Components: extractors
Affects Versions: 2.3
Reporter: Hans Brende
 Fix For: 2.3


Currently, our CSV extractor tries to figure out what the datatype of each cell 
is simply by attempting to parse a float or integer from the cell and falling 
back on "string".

This is problematic because cells that look like numbers may not, in fact, be 
numbers.

Consider a column of version numbers, such as:
4
4.1
4.1.1
etc.

Currently our csv extractor will assign the following datatypes to this column:
4 -> integer
4.1 -> float
4.1.1 -> string

We could improve this guessing ability by parsing the entire column before 
assigning a datatype, and then using the least-specific datatype encountered. 
However, this solution would also be problematic because then we'd have to hold 
the entire table in memory before generating any triples. And it still wouldn't 
guarantee correctness.
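
For illustration, a small sketch of both strategies. This is not the actual extractor code: the class, enum and method names are hypothetical, and it assumes the guessing is done with Integer.parseInt and Float.parseFloat, which may differ from what the extractor really calls.

```java
import java.util.Arrays;
import java.util.List;

final class CsvTypeGuess {
    // Ordered from most specific to least specific.
    enum Type { INTEGER, FLOAT, STRING }

    // Per-cell guessing: try integer, then float, then fall back on string.
    static Type guessCell(String cell) {
        String s = cell.trim();
        try {
            Integer.parseInt(s);
            return Type.INTEGER;
        } catch (NumberFormatException ignored) {
            // not an integer, try float next
        }
        try {
            Float.parseFloat(s);
            return Type.FLOAT;
        } catch (NumberFormatException ignored) {
            // not a float either
        }
        return Type.STRING;
    }

    // Per-column guessing: the least specific type seen in any cell.
    // Needs the whole column up front; an empty column yields INTEGER here.
    static Type guessColumn(List<String> cells) {
        Type widest = Type.INTEGER;
        for (String cell : cells) {
            Type t = guessCell(cell);
            if (t.ordinal() > widest.ordinal()) {
                widest = t;
            }
        }
        return widest;
    }

    public static void main(String[] args) {
        List<String> versions = Arrays.asList("4", "4.1", "4.1.1");
        for (String v : versions) {
            System.out.println(v + " -> " + guessCell(v));
        }
        System.out.println("column -> " + guessColumn(versions));
    }
}
```

Per-cell guessing gives three different types for the version column, while per-column guessing settles on STRING, but only after every cell has been read. A column containing only "4" and "5" would still come out as INTEGER even if the values were version numbers, which is the remaining correctness problem.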

Without structured data telling us what the original datatype was, I don't 
think assigning any datatypes other than "string" to string values is 
worthwhile.

Another problem is that the extractor strips leading and trailing whitespace 
from all values, including string values. While this behavior probably wouldn't 
present a problem for most use-cases, it does mean that the algorithm is lossy.

Cf. ANY23-218







[jira] [Resolved] (ANY23-38) Use a single logging tool: slf4j

2018-10-29 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-38?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-38.
--
Resolution: Fixed

> Use a single logging tool: slf4j
> 
>
> Key: ANY23-38
> URL: https://issues.apache.org/jira/browse/ANY23-38
> Project: Apache Any23
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Lewis John McGibbney
>Assignee: Peter Ansell
>Priority: Minor
> Fix For: 2.3
>
>
> Using LogUtil is a convoluted method for logging; we should remove it and 
> clean up the logging in Any23 trunk.
> Update 2018: we now use slf4j in combination with slf4j-log4j12, and have 
> removed all other logging frameworks (with the exception of 
> java.util.logging) from the classpath (see ANY23-356 and ANY23-366). All 
> that's left to do is to refactor o.a.a.util.LogUtils class to remove all 
> references to JUL. 





[jira] [Resolved] (ANY23-332) Plugin-specific properties shouldn't be declared in default-configuration.properties?

2018-10-29 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-332.
---
Resolution: Fixed
  Assignee: Hans Brende

> Plugin-specific properties shouldn't be declared in 
> default-configuration.properties?
> -
>
> Key: ANY23-332
> URL: https://issues.apache.org/jira/browse/ANY23-332
> Project: Apache Any23
>  Issue Type: Improvement
>Affects Versions: 2.1
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Minor
> Fix For: 2.3
>
>
> I noticed that one of the properties in the 
> {{default-configuration.properties}} file is called 
> "{{any23.extraction.openie.confidence.threshold}}". 
> However, given that OpenIE is a dynamically-loaded plugin (not part of the 
> core module), it doesn't make sense to me to have OpenIE-specific properties 
> declared in the default-configuration file in the *api* module (it also shows 
> up in IntelliJ as being the only "unused" property in the config file).
> It might make more sense to have a separate OpenIE-specific configuration 
> file declared in the OpenIE plugin jar, and then that file would be appended 
> to the api default-configuration file when doing OpenIE stuff.
> Thoughts?





[jira] [Resolved] (ANY23-406) Cannot suppress Tika warnings

2018-10-29 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-406.
---
   Resolution: Fixed
 Assignee: Hans Brende
Fix Version/s: 2.3

> Cannot suppress Tika warnings
> -
>
> Key: ANY23-406
> URL: https://issues.apache.org/jira/browse/ANY23-406
> Project: Apache Any23
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.2
> Environment: Eclipse/Macbook Pro/Java 8 
>Reporter: David Cockbill
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> I am writing a Java application; and have pulled in Any23 2.2 from Maven 
> Central. I am having issues with the suppression of the following Tika 
> warnings:
>  
> {code:java}
> org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> TIFFImageWriter not loaded. tiff files will not be processed
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> {code}
> Now, I understand I can just pull in these dependencies, but that then 
> introduces licensing issues.
>  
> I have also searched around for other solutions. I have seen many issue 
> reports that claim the fix is to quash the warnings using a tika-config.xml 
> file. The suggested fixes are slightly ambiguous to me.
> I was unsure whether to use 'tika.config' or 'tika.config.file' for the System 
> Property for the config file. I was also unsure whether to set either of the 
> following:
>  * 
>  * 
> Regardless, I have tried all combinations to no avail. When constructing an 
> Any23 instance, I still get the warnings.
> I then had a quick look through the code and the Any23 class has a Mime Type 
> Detector:
>  
> {code:java}
> private MIMETypeDetector mimeTypeDetector = new TikaMIMETypeDetector( new 
> WhiteSpacesPurifier() );
> {code}
>  
> This in turn constructs a TikaConfig from the following resource:
>  
> {code:java}
> public static final String RESOURCE_NAME = 
> "/org/apache/any23/mime/tika-config.xml";
> {code}
>  
> This configuration file certainly does not include the aforementioned methods 
> of suppressing the warnings; and as far as I can determine will override any 
> configuration file that I could inject.
> Stepping through the code, the warnings appear while constructing the 
> TikaMIMETypeDetector.
> So, am I missing something; or is there no way to suppress these warnings?
> When looking for solutions to this I have seen the same warnings in the 
> traces for other bugs. I assume these authors are maybe not worried about 
> these?
> Thanks for any advice, and I apologise if I have not selected the correct 
> components etc. for the issue.
>  
>  
>  
>  
>  
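
One generic workaround, sketched here under stated assumptions and not to be read as the fix that was applied in Any23: the output format above (source class and method on one line, then "WARNING: ...") suggests the messages pass through java.util.logging, so a handler filter keyed on the reported source class can drop them. The class and field names below are hypothetical.

```java
import java.util.logging.Filter;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class QuietTikaInit {
    // Source class prefix taken from the first line of the warning output.
    static final String SOURCE_PREFIX = "org.apache.tika.config.InitializableProblemHandler";

    // Rejects records whose source class starts with the prefix; passes everything else.
    static final Filter DROP_INIT_WARNINGS = record -> {
        String source = record.getSourceClassName();
        return source == null || !source.startsWith(SOURCE_PREFIX);
    };

    // Installs the filter on every root handler. Note: setFilter replaces any
    // filter already set on a handler, so call this before constructing Any23.
    static void install() {
        for (Handler handler : Logger.getLogger("").getHandlers()) {
            handler.setFilter(DROP_INIT_WARNINGS);
        }
    }
}
```

This only hides the messages at the logging layer. It does nothing if the warnings are routed through a different logging backend.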





[jira] [Resolved] (ANY23-402) Deprecate JSONWriter, JSONWriterFactory?

2018-10-29 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-402.
---
   Resolution: Fixed
 Assignee: Hans Brende
Fix Version/s: 2.3

> Deprecate JSONWriter, JSONWriterFactory?
> 
>
> Key: ANY23-402
> URL: https://issues.apache.org/jira/browse/ANY23-402
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Trivial
> Fix For: 2.3
>
>
> Our JSONWriter class, so far as I can tell, does not conform to any 
> specification for writing out triples as JSON. It neither writes JSON-LD nor 
> RDF/JSON (which is superseded by JSON-LD). 
> Now that we have the JSONLDWriter, should we deprecate the JSONWriter class?
> (Also note that we do not have a corresponding JSONExtractor. Cf. ANY23-129)





[jira] [Updated] (ANY23-402) Deprecate JSONWriter, JSONWriterFactory?

2018-10-29 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-402:
--
Description: 
Our JSONWriter class, so far as I can tell, does not conform to any 
specification for writing out triples as JSON. It neither writes JSON-LD nor 
RDF/JSON (which is superseded by JSON-LD). 

Now that we have the JSONLDWriter, should we deprecate the JSONWriter class?

(Also note that we do not have a corresponding JSONExtractor. Cf. ANY23-129)

  was:
Our JSONWriter class, so far as I can tell, does not conform to any 
specification for writing out triples as JSON. It neither writes JSON-LD nor 
RDF/JSON (which is superseded by JSON-LD). 

Now that we have the JSONLDWriter, should we deprecate the JSONWriter class?


> Deprecate JSONWriter, JSONWriterFactory?
> 
>
> Key: ANY23-402
> URL: https://issues.apache.org/jira/browse/ANY23-402
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 2.3
>Reporter: Hans Brende
>Priority: Trivial
>
> Our JSONWriter class, so far as I can tell, does not conform to any 
> specification for writing out triples as JSON. It neither writes JSON-LD nor 
> RDF/JSON (which is superseded by JSON-LD). 
> Now that we have the JSONLDWriter, should we deprecate the JSONWriter class?
> (Also note that we do not have a corresponding JSONExtractor. Cf. ANY23-129)





[jira] [Updated] (ANY23-392) Lunching maven-jetty-plugin: Problem accessing /apache-any23-service/resources/form.html

2018-10-29 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-392:
--
Fix Version/s: 2.3

> Lunching maven-jetty-plugin: Problem accessing 
> /apache-any23-service/resources/form.html
> 
>
> Key: ANY23-392
> URL: https://issues.apache.org/jira/browse/ANY23-392
> Project: Apache Any23
>  Issue Type: Bug
>  Components: service
>Affects Versions: 2.3
>Reporter: Jacek Grzebyta
>Assignee: Jacek Grzebyta
>Priority: Major
> Fix For: 2.3
>
>
> When using {{maven-jetty-plugin}}, the web container starts properly, but when I 
> try to access the root web page, [http://localhost:8080/apache-any23-service/], I 
> get this error:
>  
> {code:java}
> HTTP ERROR 404
> Problem accessing /apache-any23-service/resources/form.html. Reason:
> Not Found
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ANY23-154) Not able to extract microdata in few test cases

2018-10-28 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-154.
---
Resolution: Fixed
  Assignee: Hans Brende

> Not able to extract microdata in few test cases
> ---
>
> Key: ANY23-154
> URL: https://issues.apache.org/jira/browse/ANY23-154
> Project: Apache Any23
>  Issue Type: Bug
>  Components: microdata
>Affects Versions: 0.7.0
> Environment: Windows 7 32bit
> JDK 1.6.0_38
> Intel Core 2 duo and 4GB RAM
>Reporter: Kunal P
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
> Attachments: XOYRVIbK.part, neeraj.nowfloats.com.htm
>
>
> We are using the Apache Any23 API to extract microdata from a given web page 
> as part of an internal project.
> We have some test cases where the API is not able to parse the microdata, e.g. 
> www.neeraj.nowfloats.com (the web page does not follow schema.org standards 
> strictly).
> Here is a snippet of the HTML code:
>  itemtype="http://schema.org/Offer";>
>   
> 
> The markup implies that this item is a child of some parent microdata item, 
> since it carries both itemscope and itemprop in the same tag. 
> However, the given  tag has no parent microdata item.
> The method used for extracting ItemScopes is as follows,
> import org.apache.any23.extractor.microdata.ItemScope;
> import org.apache.any23.extractor.microdata.MicrodataParser;
> import org.apache.any23.extractor.microdata.MicrodataParserReport;
> // getDomDocument(...) is the reporter's own helper that parses the HTML string into a DOM
> Document dom = getDomDocument(html);
> MicrodataParserReport report = MicrodataParser.getMicrodata(dom);
> ItemScope[] items = report.getDetectedItemScopes();
> Here, items does not contain any ItemScope for the test case above. 
> In such a scenario, how can we extract microdata from the page using the Any23 API?
> Is there any way to relax the criterion that itemprop and itemscope must not 
> appear in the same tag, so that we can get the data from the web page?
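
The situation described above (itemscope plus itemprop on one tag, with no enclosing item) can be detected with plain DOM calls. The following is only a sketch: it does not use or reproduce MicrodataParser, it ignores itemref, it needs well-formed XHTML input, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class OrphanItemFinder {
    // Elements that carry both itemscope and itemprop but have no itemscope ancestor.
    static List<Element> findOrphanItems(Document dom) {
        List<Element> orphans = new ArrayList<>();
        NodeList all = dom.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute("itemscope") && e.hasAttribute("itemprop")
                    && !hasItemscopeAncestor(e)) {
                orphans.add(e);
            }
        }
        return orphans;
    }

    // Walks up the element ancestors looking for an itemscope attribute.
    static boolean hasItemscopeAncestor(Element e) {
        for (Node p = e.getParentNode(); p instanceof Element; p = p.getParentNode()) {
            if (((Element) p).hasAttribute("itemscope")) {
                return true;
            }
        }
        return false;
    }

    // Parses well-formed XHTML text; boolean attributes must be written as itemscope="".
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + e.getMessage(), e);
        }
    }
}
```

A caller could treat each orphan as a top-level item, which is one possible reading of "relaxing" the criterion.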





[jira] [Resolved] (ANY23-215) Forward slashes in URL's should not be escaped in RDF output

2018-10-26 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-215.
---
Resolution: Fixed

> Forward slashes in URL's should not be escaped in RDF output 
> -
>
> Key: ANY23-215
> URL: https://issues.apache.org/jira/browse/ANY23-215
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: service
>Affects Versions: 1.0
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.3
>
>
> Some sample output showing the escaped forward slashes '\/'
> [
>   {
> "type": "uri",
> "value": "http:\/\/any23.org\/tmp\/"
>   },
>   "http:\/\/www.w3.org\/1999\/xhtml\/vocab#icon",
>   {
> "type": "uri",
> "value": "http:\/\/bits.wikimedia.org\/favicon\/wikipedia.ico"
>   },
>   null
> ],
> We should ensure that they are unescaped when we print them to output. 
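
For reference, '\/' is a legal JSON escape for '/', so the output above is valid JSON, just noisier than needed. A minimal sketch of removing only that escape from already serialized JSON text follows. It is not the JSONWriter code, and the class and method names are hypothetical.

```java
final class JsonSlashes {
    // Rewrites the two-character escape \/ to a plain /, leaving every other
    // escape (including an escaped backslash) exactly as it was.
    static String unescapeSlashes(String json) {
        StringBuilder out = new StringBuilder(json.length());
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (c == '\\' && i + 1 < json.length()) {
                char next = json.charAt(i + 1);
                if (next == '/') {
                    out.append('/');
                } else {
                    out.append(c).append(next);
                }
                i++; // consume the escaped character as well
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal below is the text http:\/\/any23.org\/tmp\/
        System.out.println(unescapeSlashes("http:\\/\\/any23.org\\/tmp\\/"));
    }
}
```

Escapes are consumed in pairs, so an escaped backslash followed by a slash is left untouched instead of being misread.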





[jira] [Commented] (ANY23-215) Forward slashes in URL's should not be escaped in RDF output

2018-10-26 Thread Hans Brende (JIRA)


[ 
https://issues.apache.org/jira/browse/ANY23-215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665828#comment-16665828
 ] 

Hans Brende commented on ANY23-215:
---

[~lewismc] As it appears this issue has been fixed, and we're planning on 
deprecating the JSONWriter class anyway (see ANY23-402), I'm going to mark this 
issue as "resolved". But please re-open if you see fit.

> Forward slashes in URL's should not be escaped in RDF output 
> -
>
> Key: ANY23-215
> URL: https://issues.apache.org/jira/browse/ANY23-215
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: service
>Affects Versions: 1.0
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 2.3
>
>
> Some sample output showing the escaped forward slashes '\/'
> [
>   {
> "type": "uri",
> "value": "http:\/\/any23.org\/tmp\/"
>   },
>   "http:\/\/www.w3.org\/1999\/xhtml\/vocab#icon",
>   {
> "type": "uri",
> "value": "http:\/\/bits.wikimedia.org\/favicon\/wikipedia.ico"
>   },
>   null
> ],
> We should ensure that they are unescaped when we print them to output. 





[jira] [Resolved] (ANY23-401) Upgrade to Tika 1.19.1

2018-10-26 Thread Hans Brende (JIRA)


 [ 
https://issues.apache.org/jira/browse/ANY23-401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende resolved ANY23-401.
---
Resolution: Fixed
  Assignee: Hans Brende

> Upgrade to Tika 1.19.1
> --
>
> Key: ANY23-401
> URL: https://issues.apache.org/jira/browse/ANY23-401
> Project: Apache Any23
>  Issue Type: Improvement
>  Components: CLI, core, encoding, mime
>Affects Versions: 2.3
>Reporter: Hans Brende
>Assignee: Hans Brende
>Priority: Major
> Fix For: 2.3
>
>
> Tika 1.19 supports Java 11, so after upgrading, we'll be able to remove the 
> newly added jaxb dependencies.
> We'll also need to upgrade POI to version 4.0.0 and commons-compress to 
> version 1.18 to match the Tika versions.
> Also, we should now be able to remove the commons-logging exclusions we have 
> under our tika dependency declarations (see 
> https://issues.apache.org/jira/browse/TIKA-2690 ).
> For a more complete list of improvements, see: 
> http://tika.apache.org/1.19/index.html
> Tika 1.19.1 has just been released as well, which adds two critical bug fixes.




