RE: HTML parsing, script tags,

Markus Jelsma Fri, 30 Jun 2017 08:20:16 -0700

TagSoup is notorious for being utterly unmaintained, but it can be forced to do 
what, at least, I needed:


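        // Imports this snippet relies on:
        //   org.ccil.cowan.tagsoup.Schema
        //   org.ccil.cowan.tagsoup.HTMLSchema
        //   org.apache.tika.parser.ParseContext
        //   org.apache.tika.parser.html.HtmlMapper
        //   org.apache.tika.parser.html.IdentityHtmlMapper

        // We'll change the schema to allow tables inside anchors!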
        Schema schema = new HTMLSchema();
                
        // Have meta reported everywhere, also in the body
        schema.elementType("meta", HTMLSchema.M_EMPTY, 65535, 0);

        // https://issues.apache.org/jira/browse/TIKA-985
        String[] html5Elements = { "article", "aside", "audio", "bdi",
          "command", "datalist", "details", "embed", "summary", "figure",
          "figcaption", "footer", "header", "hgroup", "keygen", "mark",
          "meter", "nav", "output", "progress", "section", "source", "time",
          "track", "video", "figurecaption" };

        for (String html5Element : html5Elements) {
          schema.elementType(html5Element, HTMLSchema.M_ANY, 255, 0);
        }
        
        schema.elementType("a", HTMLSchema.M_ANY, 65535, 0);
        
        // Set up a parse context
        ParseContext context = new ParseContext();
        context.set(Schema.class, schema);
        context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

The changed HTMLSchema and the use of IdentityHtmlMapper make it possible to 
return elements that the default TagSoup setup cannot.
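
For anyone wondering what the model and memberOf arguments to elementType are doing, here is a rough, self-contained sketch of the idea. It assumes TagSoup decides whether a parent may contain a child by AND-ing the parent's content model with the child's memberOf mask; that is my reading of the library, so treat it as an assumption. The class ContainmentSketch, its GROUP_* bits and the allows() helper are hypothetical stand-ins and not TagSoup's own Schema or HTMLModels constants.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a TagSoup-style schema. None of these names come from
// TagSoup itself; the GROUP_* bits are made up for illustration.
class ContainmentSketch {
    static final int GROUP_INLINE = 1 << 0;
    static final int GROUP_BLOCK = 1 << 1;
    static final int GROUP_HEAD = 1 << 2;
    static final int ANY = 0xFFFFFFFF; // content model that accepts every group
    static final int EMPTY = 0;        // content model that accepts nothing

    // model: groups this element may contain; memberOf: groups it belongs to.
    private static final class Type {
        final int model;
        final int memberOf;

        Type(int model, int memberOf) {
            this.model = model;
            this.memberOf = memberOf;
        }
    }

    private final Map<String, Type> types = new HashMap<>();

    // Registers or overrides an element type, similar in spirit to the calls above.
    void elementType(String name, int model, int memberOf) {
        types.put(name, new Type(model, memberOf));
    }

    // Assumed containment rule: parent model and child membership share a bit.
    boolean allows(String parent, String child) {
        return (types.get(parent).model & types.get(child).memberOf) != 0;
    }

    // A deliberately tiny "default" schema (table rows are not modelled).
    static ContainmentSketch defaults() {
        ContainmentSketch s = new ContainmentSketch();
        s.elementType("div", GROUP_BLOCK | GROUP_INLINE, GROUP_BLOCK);
        s.elementType("a", GROUP_INLINE, GROUP_INLINE);
        s.elementType("table", EMPTY, GROUP_BLOCK);
        s.elementType("meta", EMPTY, GROUP_HEAD);
        return s;
    }

    public static void main(String[] args) {
        ContainmentSketch s = defaults();
        System.out.println(s.allows("a", "table"));  // false
        System.out.println(s.allows("div", "meta")); // false

        // Widen the masks the way the snippet above does for "a" and "meta".
        s.elementType("a", ANY, 0xFFFF);
        s.elementType("meta", EMPTY, 0xFFFF);
        System.out.println(s.allows("a", "table"));  // true
        System.out.println(s.allows("div", "meta")); // true
    }
}
```

In this toy model, widening the masks is what lets the anchor accept a table and lets meta be accepted inside body-level elements, which is the effect the schema changes are after.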

Regards,
Markus
 
 
-----Original message-----
> From: Allison, Timothy B. <talli...@mitre.org>
> Sent: Friday 30th June 2017 17:13
> To: user@tika.apache.org
> Subject: RE: HTML parsing, script tags, 
> 
> Wait, TagSoup is not returning the start element events in the same order as 
> the HTML?  I don’t think we can fix that or your other points, but would 
> you be willing to share triggering documents and open an issue for each 
> problem? 
> We should include those issues in our ongoing conversation about swapping out 
> the underlying html parser for something more modern. 
> Sorry Tika isn’t working for you on this, and thank you! 
> 
> From: Jim Idle [mailto:ji...@proofpoint.com] 
> Sent: Friday, June 30, 2017 1:23 AM
> To: user@tika.apache.org
> Subject: RE: HTML parsing, script tags,
> 
> Well I got a long way with the Tika wrapper around tag soup but then while 
> chasing down a bug I realized that I was not getting the startElement events 
> in the order that they are seen in the HTML file. It also ignores <!doctype> 
> and unknown elements. 
> I can’t see any way to change that, and as knowing the structure of the 
> document is very important, I will have to stop using Tika for HTML, I 
> guess, and go back to validator.nu. 
> Just posting this here for posterity really. 
> Jim 
> 
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: Wednesday, June 28, 2017 23:06
> To: user@tika.apache.org
> Subject: Re: HTML parsing, script tags,
> 
> Hi Jim, 
> On Jun 28, 2017, at 12:07am, Jim Idle <ji...@proofpoint.com 
> <mailto:ji...@proofpoint.com>> wrote: 
> So right now it looks like the HTML parser only sends through script tags if 
> they have a src attribute. Is this likely to change or should I use another 
> parser for HTML? I could submit a patch for this of course. 
> You can use a custom mapper if you want to alter which tags get passed 
> through. 
> E.g. check out IdentityHtmlMapper in Tika for a mapper that passes through 
> everything. 
> Also, does anyone have an opinion on whether the underlying TagSoup stuff is 
> tolerant of HTML (in a similar manner to browsers, which will try to render 
> anything) or is expecting well-formed HTML? I can go look at the TagSoup 
> stuff directly of course, but just wondered if anyone has experience of 
> using Tika to parse HTML. 
> TagSoup (and JSoup and NekoHTML) are all Java libraries that try to fix up 
> broken HTML, with varying degrees of success, depending on the way that HTML 
> is broken. 
> — Ken 
> -------------------------- 
> Ken Krugler 
> 1 530-210-6378 
> http://www.scaleunlimited.com 
> custom big data solutions & training 
> Hadoop, Cascading, Cassandra & Solr
