Hi Jaya,

> On May 24, 2018, at 1:34 PM, Johnson, Jaya <jaya.john...@moodys.com> wrote:
>
> No I don’t care about that – I have something like
> List of underwriters….some text
> <TABLE>
> <TR><TD>…..
> Etc etc
> <TABLE>
> <>
> I want to get all of that – so can I look for say all table tags content in
> them and then say a few words before the tag TABLE. I can do the parsing etc.

In that case you should be able to use your own content handler (which will get a stream of SAX events), and process the elements as they come in.

E.g. something like...

    File document = new File(target);
    Parser parser = new AutoDetectParser();
    // MyTableAwareContentHandler is your own org.xml.sax.ContentHandler implementation
    ContentHandler handler = new MyTableAwareContentHandler();
    Metadata metadata = new Metadata();
    // try-with-resources closes the stream once parsing finishes
    try (InputStream stream = new FileInputStream(document)) {
        parser.parse(stream, handler, metadata, new ParseContext());
    }

where ContentHandler is org.xml.sax.ContentHandler.

-- Ken

> Thanks.
>
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: Thursday, May 24, 2018 4:09 PM
> To: user@tika.apache.org
> Subject: Re: Extract HTML objects using TIKA
>
> Hi Jaya,
>
> On May 24, 2018, at 12:42 PM, Johnson, Jaya <jaya.john...@moodys.com> wrote:
>
> I was wondering if it was possible to extract all tables from an HTML
> document using TIKA is there anything out of the box or would one have to
> write something.
>
> Tika will call the content handler you provide with the standard set of table
> elements. From DefaultHtmlMapper.java:
>
> ….
> put("TABLE", "table");
> put("THEAD", "thead");
> put("TBODY", "tbody");
> put("TR", "tr");
> put("TH", "th");
> put("TD", "td");
> ….
>
> But often when people ask about extracting tables, they’re actually
> interested in getting structured data (column names, data types, etc). And
> that’s something Tika doesn’t automagically do for you.
>
> It would be interesting to create such a thing (similar to what we did for
> Boilerpipe) for use with Tika. E.g.
> see https://github.com/seagatesoft/sde
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378
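
For reference, here is a minimal sketch of what a handler like MyTableAwareContentHandler might look like. It assumes the handler receives XHTML-style SAX events with table/tr/td/th element names, as the DefaultHtmlMapper entries quoted above suggest. The class body, the TableHit holder, the lastWords() helper and the extract() method are all hypothetical illustrations, not part of Tika. To stay runnable without extra dependencies, the demo feeds the handler from the JDK's own SAX parser on a well-formed string instead of going through Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records each table's cell text plus a few words seen just before it.
class MyTableAwareContentHandler extends DefaultHandler {
    /** One extracted table: the words that preceded it and its rows of cell text. */
    static final class TableHit {
        final String before;
        final List<List<String>> rows = new ArrayList<>();
        TableHit(String before) { this.before = before; }
    }

    private static final int WORDS_BEFORE = 5; // arbitrary choice for this sketch

    private final List<TableHit> hits = new ArrayList<>();
    private final StringBuilder outside = new StringBuilder(); // text seen outside any table
    private final StringBuilder cell = new StringBuilder();    // text of the cell being read
    private TableHit current; // non-null while inside a table
    private int depth;        // nesting depth of table elements

    List<TableHit> getHits() { return hits; }

    // Pick whichever name the parser filled in (namespace-aware or not), lowercased.
    private static String name(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase(Locale.ROOT);
    }

    // Keep only the last n whitespace-separated words of the text.
    private static String lastWords(String text, int n) {
        String[] words = text.trim().split("\\s+");
        int from = Math.max(0, words.length - n);
        return String.join(" ", java.util.Arrays.copyOfRange(words, from, words.length));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        switch (name(localName, qName)) {
            case "table":
                if (depth++ == 0) { // outermost table: start a new hit
                    current = new TableHit(lastWords(outside.toString(), WORDS_BEFORE));
                    outside.setLength(0);
                }
                break;
            case "tr":
                if (current != null) current.rows.add(new ArrayList<>());
                break;
            case "td":
            case "th":
                cell.setLength(0);
                break;
            default:
                break;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (name(localName, qName)) {
            case "table":
                if (--depth == 0) { hits.add(current); current = null; }
                break;
            case "td":
            case "th":
                if (current != null && !current.rows.isEmpty()) {
                    current.rows.get(current.rows.size() - 1).add(cell.toString().trim());
                }
                break;
            default:
                break;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) cell.append(ch, start, length);
        else outside.append(ch, start, length);
    }

    // Demo driver only: parses a well-formed XHTML string with the JDK SAX parser.
    static List<TableHit> extract(String xhtml) throws Exception {
        MyTableAwareContentHandler handler = new MyTableAwareContentHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getHits();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>List of underwriters</p>"
                + "<table><tr><td>A</td><td>B</td></tr></table></body></html>";
        for (TableHit hit : extract(xhtml)) {
            System.out.println(hit.before + " -> " + hit.rows);
        }
    }
}
```

With Tika, the idea would be to pass an instance of this class as the handler argument to parser.parse(...) and read getHits() afterwards. Nested tables are flattened into the outermost table's rows here, which may or may not be what you want.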