Hans Brende created ANY23-446:
---------------------------------
Summary: Fix bugs in Jsoup
Key: ANY23-446
URL: https://issues.apache.org/jira/browse/ANY23-446
Project: Apache Any23
Issue Type: Bug
Affects Versions: 2.3
Reporter: Hans Brende
Assignee: Hans Brende
Fix For: 2.4
Jsoup is giving us some issues in our encoding detection module, namely:
https://github.com/jhy/jsoup/issues/1251 (which caused ANY23-441)
and
https://github.com/jhy/jsoup/issues/1250 (which is going to make our encoding
detector blow up anytime we're detecting, e.g., UTF-16.)
The latter issue is more serious than the former due to the potential frequency
of the errors.
There is an open pull request in jsoup that fixes the first issue, but
unfortunately Jonathan Hedley (creator of jsoup) has not been active over the
past few months and I doubt it'll get merged anytime soon.
I propose that we temporarily repackage a couple of jsoup classes in our encoding
detection module and add some quick fixes. Once the jsoup library is updated
with the fixes, we can remove the repackaged classes again.
One bonus advantage: this will allow us to implement a streaming approach to
encoding detection instead of our current strategy of building the entire DOM
just to extract the plaintext (which is really overkill in terms of memory usage).
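
To illustrate the streaming idea, here is a minimal sketch that pulls plaintext
out of markup in a single pass without building a DOM. It is only an
illustration under simplifying assumptions: the class name StreamingTextSketch,
the method extractText, and the maxChars cap are hypothetical, nothing here
uses jsoup or Any23 code, and the naive tag handling ignores comments, scripts,
and '>' characters inside attribute values. A real implementation would
presumably hook into the repackaged jsoup tokenizer instead.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch: not jsoup or Any23 code.
public class StreamingTextSketch {

    /**
     * Reads markup from the reader one character at a time and keeps only
     * the characters outside of tags. Stops after maxChars characters have
     * been collected, so memory use is bounded regardless of document size.
     */
    static String extractText(Reader html, int maxChars) throws IOException {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        int c;
        while (out.length() < maxChars && (c = html.read()) != -1) {
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
                // Separate text from adjacent elements with a single space.
                if (out.length() > 0
                        && !Character.isWhitespace(out.charAt(out.length() - 1))) {
                    out.append(' ');
                }
            } else if (!inTag) {
                out.append((char) c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = extractText(new StringReader("<p>Hello <b>world</b></p>"), 1024);
        System.out.println(text.trim()); // prints "Hello world"
    }
}
```

Because the loop stops once maxChars characters are collected, a detector that
only needs a sample of the text never has to hold the whole document in memory.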