[ https://issues.apache.org/jira/browse/ANY23-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070510#comment-17070510 ]
Hudson commented on ANY23-446:
------------------------------
UNSTABLE: Integrated in Jenkins build Any23-trunk #1679 (See
[https://builds.apache.org/job/Any23-trunk/1679/])
ANY23-446 update jsoup to v1.13.1 (hans: rev
f7967b641ac590972aa77ff6c83d2b6977afbaa3)
* (edit) pom.xml
> Fix bugs in Jsoup
> -----------------
>
> Key: ANY23-446
> URL: https://issues.apache.org/jira/browse/ANY23-446
> Project: Apache Any23
> Issue Type: Bug
> Affects Versions: 2.3
> Reporter: Hans Brende
> Assignee: Hans Brende
> Priority: Major
> Fix For: 2.4
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> jsoup is causing two problems in our encoding detection module:
> https://github.com/jhy/jsoup/issues/1251 (which caused ANY23-441)
> and
> https://github.com/jhy/jsoup/issues/1250 (which will make our encoding
> detector fail whenever it detects, e.g., UTF-16).
> The latter issue is the more serious of the two because of how often the
> errors could occur.
> A pull request that fixes the first issue is open in jsoup, but
> unfortunately Jonathan Hedley (creator of jsoup) has not been active over
> the past few months, and I doubt it will be merged anytime soon.
> I propose that we temporarily repackage a couple of jsoup classes in our
> encoding detection module and add some quick fixes. Once the jsoup library
> is updated, we can potentially remove the repackaged classes again.
> One added benefit: this will allow us to implement a streaming approach to
> encoding detection instead of our current strategy of building the entire DOM
> just to extract the plaintext (which uses far more memory than necessary).
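
To illustrate the streaming idea in the last paragraph of the description, here is a minimal sketch in plain Java. It is not jsoup's API and not the code committed to Any23. The class name StreamingTextSketch and the method extractText are hypothetical, and the tag handling is deliberately naive: it ignores comments, scripts, entities, and '>' characters inside attribute values. The only point is that plaintext can be pulled from markup one character at a time, without holding a DOM in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch: extracts text outside of <...> tags from a character
// stream. Memory use is constant apart from whatever the caller collects.
class StreamingTextSketch {

    // Copies every character that is not inside a tag from 'in' to 'out'.
    // Assumption: a tag starts at '<' and ends at the next '>'.
    static void extractText(Reader in, Appendable out) throws IOException {
        boolean inTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (inTag) {
                if (c == '>') {
                    inTag = false; // end of tag, resume copying text
                }
            } else if (c == '<') {
                inTag = true;      // start of tag, stop copying text
            } else {
                out.append((char) c);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder text = new StringBuilder();
        extractText(new StringReader("<p>Hello, <b>world</b>!</p>"), text);
        System.out.println(text); // prints "Hello, world!"
    }
}
```

In a real detector the Appendable would presumably be replaced by whatever consumes the text for charset statistics, so that nothing beyond the current character needs to be buffered.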
--
This message was sent by Atlassian Jira
(v8.3.4#803005)