[ 
https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350529#comment-16350529
 ] 

Tim Allison commented on TIKA-1599:
-----------------------------------

>DOM could lead to higher memory usage

Yes, that's my concern, especially because our CommonCrawl docs are truncated 
at 1MB, so we aren't going to see major memory problems in that corpus.

I've kicked off a fresh full run of Tika 1.17 against the corpus, and I've 
updated my jsoup code on my personal fork. Once the 1.17 run finishes, I'll 
kick off the jsoup fork against the HTML files.

Unrelated topic: does anyone have a shareable example of an HTML file with a 
base64-encoded (or otherwise encoded) file embedded in it? I don't think we're 
currently handling these, and it would be nice to do that.
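
For anyone unsure what I mean by an embedded file, here is a minimal sketch, 
assuming the file is embedded as a base64 `data:` URI in an attribute. The 
class name, the sample markup, and the regex are all made up for illustration; 
this is not Tika or jsoup code, and a real handler would have to cope with 
other encodings, whitespace inside the payload, and much larger payloads.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: finds base64 data: URIs in an HTML string
// and decodes their payloads. Not part of Tika or jsoup.
class DataUriSketch {

    // Assumed shape: data:<type>/<subtype>;base64,<payload>
    private static final Pattern DATA_URI =
        Pattern.compile("data:([\\w.+-]+/[\\w.+-]+);base64,([A-Za-z0-9+/=]+)");

    // Made-up sample: a link whose href carries a small text file inline.
    static final String SAMPLE_HTML =
        "<html><body>"
        + "<a href=\"data:text/plain;base64,SGVsbG8sIFRpa2Eh\">embedded</a>"
        + "</body></html>";

    // Returns the decoded bytes of every base64 data: URI found in the input.
    static List<byte[]> extractEmbedded(String html) {
        List<byte[]> embedded = new ArrayList<>();
        Matcher m = DATA_URI.matcher(html);
        while (m.find()) {
            embedded.add(Base64.getDecoder().decode(m.group(2)));
        }
        return embedded;
    }

    public static void main(String[] args) {
        for (byte[] bytes : extractEmbedded(SAMPLE_HTML)) {
            System.out.println(new String(bytes, StandardCharsets.UTF_8));
        }
    }
}
```

Each decoded payload is what I'd expect to end up being handed off as an 
embedded document, though how that should be wired in is an open question.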

> Switch from TagSoup to JSoup
> ----------------------------
>
>                 Key: TIKA-1599
>                 URL: https://issues.apache.org/jira/browse/TIKA-1599
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.7, 1.8
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>            Priority: Minor
>         Attachments: TIKA-1599-crazy-files.tar.gz, 
> tagsoup_vs_jsoup_reports.zip
>
>
> There are several Tika issues related to how TagSoup cleans up HTML 
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be 
> under active development.
> On the other hand I know of several projects that are now using 
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only 
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
