[ 
https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491804#comment-16491804
 ] 

ASF GitHub Bot commented on TIKA-2100:
--------------------------------------

tballison commented on issue #238: TIKA-2100 extract content language from html lang attribute
URL: https://github.com/apache/tika/pull/238#issuecomment-392283950
 
 
   So... go forth!
   
   Side note: It’d be fun to see counts for elements w a lang attr in our
   corpus.
   
   On Sat, May 26, 2018 at 3:41 PM Tim Allison <talli...@apache.org> wrote:
   
   > If :lang is special, we should treat it specially :).  If there are other
   > attrs that go on the html entity, we can add them later? Onward!
   >
   > On Sat, May 26, 2018 at 3:26 PM Chris Mattmann <notificati...@github.com>
   > wrote:
   >
   >> @tballison <https://github.com/tballison> I hear you and am open to
   >> alternatives. What is a better way to do this? I think missing the lang
   >> attribute is a pretty bad thing and have seen it in the past. It feels like
   >> HTMLParser as a parser should contribute it (and I don't think you object
   >> to that) perhaps via Metadata.set and then what should we do to propagate
   >> it?
   >>
   >
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Html Parser does not keep the html tag attributes
> -------------------------------------------------
>
>                 Key: TIKA-2100
>                 URL: https://issues.apache.org/jira/browse/TIKA-2100
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> When parsing a very simple HTML document such as
> <!DOCTYPE html>
> <html lang="en">
> <head>
> <title>Page Title</title>
> </head>
> <body>
> <h1 align="left">My First Heading</h1>
> <p>My first paragraph.</p>
> </body>
> </html>
> you won't be able to access the html tag's attributes (here lang="en") in the
> ContentHandler:
> * In the method startElement(String ns, String localName, String name,
>   Attributes atts), atts is empty.
> * Moreover, it seems that the html tag's attributes are not passed through
>   the HtmlMapper.mapSafeAttribute method either.
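
For illustration only, here is a minimal sketch of the behaviour the report expects: a handler whose startElement sees lang="en" on the html element. It is not Tika code. It assumes the JDK's built-in SAX parser in place of Tika's HtmlParser, drops the DOCTYPE, and treats the sample page as well-formed XML. The names LangAttributeSketch, LangHandler and extractLang are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class LangAttributeSketch {

    /** Hypothetical handler: remembers the lang attribute of the html element. */
    static class LangHandler extends DefaultHandler {
        String lang; // stays null if no lang attribute reaches the handler

        @Override
        public void startElement(String ns, String localName, String name,
                                 Attributes atts) {
            // The JDK parser is not namespace-aware by default, so the
            // element name arrives in the qualified-name argument.
            if ("html".equalsIgnoreCase(name)) {
                lang = atts.getValue("lang");
            }
        }
    }

    /** Parses well-formed XHTML with the JDK SAX parser (an assumption, not Tika). */
    public static String extractLang(String xhtml) throws Exception {
        LangHandler handler = new LangHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.lang;
    }

    public static void main(String[] args) throws Exception {
        // The sample page from the report, without the DOCTYPE line.
        String page = "<html lang=\"en\">"
                + "<head><title>Page Title</title></head>"
                + "<body><h1 align=\"left\">My First Heading</h1>"
                + "<p>My first paragraph.</p></body>"
                + "</html>";
        System.out.println(extractLang(page));
    }
}
```

With a plain SAX parser the attribute is delivered to startElement. The report is that the same kind of handler, when driven by Tika's HtmlParser, receives an empty atts for the html element.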



