[ 
https://issues.apache.org/jira/browse/ANY23-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070510#comment-17070510
 ] 

Hudson commented on ANY23-446:
------------------------------

UNSTABLE: Integrated in Jenkins build Any23-trunk #1679
(See https://builds.apache.org/job/Any23-trunk/1679/)
ANY23-446 update jsoup to v1.13.1
(hans: rev f7967b641ac590972aa77ff6c83d2b6977afbaa3)
* (edit) pom.xml


> Fix bugs in Jsoup
> -----------------
>
>                 Key: ANY23-446
>                 URL: https://issues.apache.org/jira/browse/ANY23-446
>             Project: Apache Any23
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Assignee: Hans Brende
>            Priority: Major
>             Fix For: 2.4
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jsoup is giving us some issues in our encoding detection module, namely:
> https://github.com/jhy/jsoup/issues/1251  (which caused ANY23-441)
> and 
> https://github.com/jhy/jsoup/issues/1250  (which will make our encoding 
> detector blow up any time we detect, e.g., UTF-16).
> The latter issue is more serious than the former due to the potential 
> frequency of the errors.
> An open pull request in jsoup fixes the first issue, but unfortunately 
> Jonathan Hedley (creator of jsoup) has not been active over the past few 
> months, and I doubt it will be merged anytime soon.
> I propose that we temporarily repackage a couple of jsoup classes in our 
> encoding detection module and add some quick fixes. Once the jsoup library 
> is updated, we can potentially remove the repackaged classes again.
> One bonus advantage: this will allow us to implement a streaming approach to 
> encoding detection instead of our current strategy of building the entire DOM 
> to extract the plaintext (which is really overkill on memory usage).
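
To illustrate the streaming idea in the description above, here is a minimal
sketch that pulls plaintext from a character stream without building a DOM.
It is not the Any23 or jsoup implementation: the class name
StreamingTextSketch, the method extractText, and the maxChars cap are all
hypothetical, and the tag handling is deliberately naive (it ignores
comments, scripts, character entities, and '>' inside attribute values).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical illustration only; not part of Any23 or jsoup.
public class StreamingTextSketch {

    /**
     * Reads the input one character at a time and keeps only the characters
     * that appear outside of <...> markup. Stops at end of input or once
     * maxChars characters of text have been collected.
     */
    static String extractText(Reader in, int maxChars) throws IOException {
        StringBuilder text = new StringBuilder();
        boolean inTag = false;
        int c;
        while (text.length() < maxChars && (c = in.read()) != -1) {
            if (c == '<') {
                inTag = true;            // entering markup, stop collecting
            } else if (c == '>' && inTag) {
                inTag = false;           // leaving markup, resume collecting
            } else if (!inTag) {
                text.append((char) c);   // plain text outside of any tag
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Hello, <b>world</b>!</p></body></html>";
        System.out.println(extractText(new StringReader(html), 1024));
    }
}
```

Because only the output buffer is held in memory, and it is capped by
maxChars, the cost no longer grows with the size of the parsed tree. A real
detector would need a proper tokenizer in place of the naive tag check.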



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
