Hi Jaya,

> On May 24, 2018, at 1:34 PM, Johnson, Jaya <jaya.john...@moodys.com> wrote:
> 
> No I don’t care about that – I have something like
> List of underwriters….some text
> <TABLE>
> <TR><TD>…..
> Etc etc
> <TABLE>
>   <>
> I want to get all of that. So can I look for, say, all TABLE tags and the 
> content in them, plus a few words before each TABLE tag? I can do the parsing etc.

In that case you should be able to use your own content handler (which will get 
a stream of SAX events), and process the elements as they come in. E.g. 
something like...

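        // Needs java.io.*, org.apache.tika.metadata.Metadata,
        // org.apache.tika.parser.* and org.xml.sax.ContentHandler.
        File document = new File(target);
        Parser parser = new AutoDetectParser();
        // MyTableAwareContentHandler is a class you write yourself.
        ContentHandler handler = new MyTableAwareContentHandler();
        Metadata metadata = new Metadata();
        try (InputStream stream = new FileInputStream(document)) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }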

where ContentHandler is org.xml.sax.ContentHandler.
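
A minimal sketch of what such a handler could look like follows. The class name matches the placeholder used above, but everything inside it (CapturedTable, getTables(), the 200-character window for the preceding text) is an assumption made for illustration, not something Tika provides. It depends only on the SAX classes in the JDK, and it assumes the events arrive with lower-case table/tr/th/td element names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records each top-level table plus the text just before it. */
class MyTableAwareContentHandler extends DefaultHandler {

    /** One captured table: the text seen just before it, and its rows of cell text. */
    static class CapturedTable {
        final String precedingText;
        final List<List<String>> rows = new ArrayList<>();
        CapturedTable(String precedingText) { this.precedingText = precedingText; }
    }

    // Arbitrary choice: how much text before a table to remember.
    private static final int MAX_PRECEDING_CHARS = 200;

    private final List<CapturedTable> tables = new ArrayList<>();
    private final StringBuilder recentText = new StringBuilder();
    private final StringBuilder cellText = new StringBuilder();
    private CapturedTable current;
    private int tableDepth = 0;   // > 1 means we are inside a nested table
    private boolean inCell = false;

    List<CapturedTable> getTables() { return tables; }

    // Works whether or not the SAX source is namespace-aware.
    private static String elementName(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return name.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = elementName(localName, qName);
        if (name.equals("table")) {
            if (tableDepth == 0) {
                // New top-level table: snapshot the text that came before it.
                current = new CapturedTable(recentText.toString().trim());
                tables.add(current);
                recentText.setLength(0);
            }
            tableDepth++;
        } else if (tableDepth == 1 && name.equals("tr")) {
            current.rows.add(new ArrayList<>());
        } else if (tableDepth == 1 && (name.equals("td") || name.equals("th"))) {
            inCell = true;
            cellText.setLength(0);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = elementName(localName, qName);
        if (name.equals("table")) {
            if (tableDepth > 0) tableDepth--;
            if (tableDepth == 0) current = null;
        } else if (tableDepth == 1 && inCell && (name.equals("td") || name.equals("th"))) {
            // Tolerate a cell that appears without an enclosing row.
            if (current.rows.isEmpty()) current.rows.add(new ArrayList<>());
            current.rows.get(current.rows.size() - 1).add(cellText.toString().trim());
            inCell = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (tableDepth == 0) {
            // Outside any table: keep only the most recent characters.
            recentText.append(ch, start, length);
            int excess = recentText.length() - MAX_PRECEDING_CHARS;
            if (excess > 0) recentText.delete(0, excess);
        } else if (inCell) {
            cellText.append(ch, start, length);
        }
    }
}
```

After parser.parse(...) returns, handler.getTables() would hold one entry per top-level table, each with its precedingText and rows. In this sketch the text of a nested table is folded into the enclosing cell rather than captured separately.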

— Ken

>  
> Thanks.
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: Thursday, May 24, 2018 4:09 PM
> To: user@tika.apache.org
> Subject: Re: Extract HTML objects using TIKA
>  
> Hi Jaya,
>  
> On May 24, 2018, at 12:42 PM, Johnson, Jaya <jaya.john...@moodys.com> wrote:
>  
>  
> I was wondering if it is possible to extract all tables from an HTML 
> document using Tika. Is there anything out of the box, or would one have to 
> write something?
>  
> Tika will call the content handler you provide with the standard set of table 
> elements. From DefaultHtmlMapper.java:
>  
>         ….
>         put("TABLE", "table");
>         put("THEAD", "thead");
>         put("TBODY", "tbody");
>         put("TR", "tr");
>         put("TH", "th");
>         put("TD", "td");
>         ….
>  
> But often when people ask about extracting tables, they’re actually 
> interested in getting structured data (column names, data types, etc). And 
> that’s something Tika doesn’t automagically do for you.
>  
> It would be interesting to create such a thing (similar to what we did for 
> Boilerpipe) for use with Tika. E.g. see https://github.com/seagatesoft/sde
>  
> — Ken
>  
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra
>  

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378
