Hans Brende created ANY23-446:
---------------------------------

             Summary: Fix bugs in Jsoup
                 Key: ANY23-446
                 URL: https://issues.apache.org/jira/browse/ANY23-446
             Project: Apache Any23
          Issue Type: Bug
    Affects Versions: 2.3
            Reporter: Hans Brende
            Assignee: Hans Brende
             Fix For: 2.4


Jsoup is giving us some issues in our encoding detection module, namely:

https://github.com/jhy/jsoup/issues/1251  (which caused ANY23-441)

and 

https://github.com/jhy/jsoup/issues/1250  (which is going to make our encoding 
detector blow up any time it has to detect, e.g., UTF-16)

The latter issue is the more serious of the two, because it is likely to 
trigger errors far more often.

A pull request that fixes the first issue is open in jsoup, but unfortunately 
Jonathan Hedley (the creator of jsoup) has not been active over the past few 
months, and I doubt it will get merged anytime soon.

I propose that we temporarily repackage a couple of jsoup classes in our 
encoding detection module and add some quick fixes. Once the jsoup library 
itself is updated with the fixes, we should be able to remove the repackaged 
classes again.

One bonus advantage: this will allow us to implement a streaming approach to 
encoding detection instead of our current strategy of building the entire DOM 
to extract the plaintext (which is really overkill on memory usage).
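
To illustrate the streaming idea, here is a minimal sketch that pulls plaintext 
out of markup one character at a time, without ever building a DOM. It is not 
jsoup or Any23 code: the class name StreamingTextSketch, the method 
extractText, and the maxChars cap are all hypothetical, and the tag skipping is 
deliberately naive (it ignores comments, scripts, entities, and '>' inside 
attribute values). The only point is that memory use is bounded by the size of 
the text sample rather than by the size of the document.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical sketch for illustration; not part of jsoup or Any23. */
final class StreamingTextSketch {

    /**
     * Reads characters from {@code in} until {@code maxChars} characters of
     * text have been collected or the input ends, skipping everything between
     * '<' and '>'. No DOM is built, so memory is bounded by maxChars.
     */
    static String extractText(Reader in, int maxChars) {
        StringBuilder text = new StringBuilder();
        boolean inTag = false;
        try {
            int c;
            while (text.length() < maxChars && (c = in.read()) != -1) {
                if (c == '<') {
                    inTag = true;              // start of a tag: stop collecting
                } else if (c == '>' && inTag) {
                    inTag = false;             // end of the tag: resume collecting
                } else if (!inTag) {
                    text.append((char) c);     // plain text outside any tag
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

For example, calling extractText with a StringReader over 
"<p>Hello <b>world</b></p>" and a cap of 1024 returns "Hello world"; with a cap 
of 5 it stops reading early and returns "Hello". A real implementation would 
presumably drive the repackaged jsoup tokenizer instead of this hand-rolled 
loop.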





--
This message was sent by Atlassian Jira
(v8.3.4#803005)
