Regarding TIKA-1329, I found the tika-site in the Subversion repository, and I ran:
svn checkout https://svn.apache.org/repos/asf/tika/site/publish/1.11/.

Since this isn’t part of the main tika/trunk repository, I was wondering if I should still follow the same protocol and svn commit my changes to the site folder. In case I shouldn’t, I’ve attached my changes to the usage examples page of the website below. I basically added how to parse documents with embedded docs using the RecursiveParserWrapper class, and how to serialize the returned Metadata list to JSON, with some description.

Thanks,
Joey

Title: Apache Tika – Tika API Usage Examples

Apache Tika API Usage Examples

This page provides a number of examples on how to use the various Tika APIs. All of the examples shown are also available in the Tika Example module in SVN.

Parsing

Tika provides a number of different ways to parse a file. These provide different levels of control, flexibility, and complexity.

Parsing using the Tika Facade

The Tika facade provides a number of very quick and easy ways to have your content parsed by Tika, and to return the resulting plain text:

public String parseToStringExample() throws IOException, SAXException, TikaException {
    Tika tika = new Tika();
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) {
        return tika.parseToString(stream);
    }
}

Parsing using the Auto-Detect Parser

For more control, you can call the Tika Parsers directly. Most likely, you'll want to start out using the Auto-Detect Parser, which automatically figures out what kind of content you have, then calls the appropriate parser for you.

public String parseExample() throws IOException, SAXException, TikaException {
     AutoDetectParser parser = new AutoDetectParser();
     BodyContentHandler handler = new BodyContentHandler();
     Metadata metadata = new Metadata();
     try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) {
         parser.parse(stream, handler, metadata);
         return handler.toString();
     }
}

Parsing using the Recursive Parser Wrapper

When you want to parse embedded documents, you can extract content from both the enclosing document and all embedded ones by passing the parser into the ParseContext instance.

public String parseEmbeddedExample() throws IOException, SAXException, TikaException {
     AutoDetectParser parser = new AutoDetectParser();
     BodyContentHandler handler = new BodyContentHandler();
     Metadata metadata = new Metadata();
     ParseContext context = new ParseContext();
     context.set(Parser.class, parser);
     try (InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx")) {
         parser.parse(stream, handler, metadata);
         return handler.toString();
    }
}

Alternatively, you can use the RecursiveParserWrapper, which handles passing the parser into the ParseContext for you. This wrapper class returns a list of Metadata objects, where the first element holds the metadata and content for the container document, and each remaining element holds the metadata and content for one embedded document.

public List<Metadata> recursiveParserWrapperExample() throws IOException, SAXException, TikaException {
     Parser p = new AutoDetectParser();
     ContentHandlerFactory factory = new BasicContentHandlerFactory(
          BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);

     RecursiveParserWrapper wrapper = new RecursiveParserWrapper(p, factory);
     Metadata metadata = new Metadata();
     metadata.set(Metadata.RESOURCE_NAME_KEY, "test_recursive_embedded.docx");
     ParseContext context = new ParseContext();
     try (InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx")) {
         wrapper.parse(stream, new DefaultHandler(), metadata, context);
     }
     return wrapper.getMetadata();
}

The JsonMetadataList class can serialize the metadata list to JSON, and deserialize it back into a list.

public String serializedRecursiveParserWrapperExample() throws IOException, SAXException, TikaException {
     List<Metadata> metadataList = recursiveParserWrapperExample();
     StringWriter writer = new StringWriter();
     JsonMetadataList.toJson(metadataList, writer);
     return writer.toString();
}

Picking different output formats

With Tika, you can get the textual content of your files returned in a number of different formats. These can be plain text, HTML, XHTML, the XHTML of one part of the file, etc. This is controlled by the ContentHandler you supply to the Parser.

Parsing to Plain Text

By using the BodyContentHandler, you can request that Tika return only the content of the document's body as a plain-text string.

public String parseToPlainText() throws IOException, SAXException, TikaException {
    BodyContentHandler handler = new BodyContentHandler();
 
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}

Parsing to XHTML

By using the ToXMLContentHandler, you can get the XHTML content of the whole document as a string.

public String parseToHTML() throws IOException, SAXException, TikaException {
    ContentHandler handler = new ToXMLContentHandler();
 
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}

If you just want the body of the XHTML document, without the header, you can chain together a BodyContentHandler and a ToXMLContentHandler as shown:

public String parseBodyToHTML() throws IOException, SAXException, TikaException {
    ContentHandler handler = new BodyContentHandler(
            new ToXMLContentHandler());
 
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}

Fetching just certain bits of the XHTML

It is possible to execute XPath queries on the parse results, to fetch only certain bits of the XHTML.

public String parseOnePartToHTML() throws IOException, SAXException, TikaException {
    // Only get things under html -> body -> div (class=header)
    XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
    Matcher divContentMatcher = xhtmlParser.parse("/xhtml:html/xhtml:body/xhtml:div/descendant::node()");
    ContentHandler handler = new MatchingContentHandler(
            new ToXMLContentHandler(), divContentMatcher);
 
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test2.doc")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}

Custom Content Handlers

The textual output of parsing a file with Tika is returned via the SAX ContentHandler you pass to the parse method. It is possible to customise your parsing by supplying your own ContentHandler which does special things.

Extract Phone Numbers from Content into the Metadata

By using the PhoneExtractingContentHandler, you can have any phone numbers found in the textual content of the document extracted and placed into the Metadata object for you.
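A sketch of what this might look like, following the pattern of the earlier examples. The resource name used here is a placeholder, and the "phonenumbers" metadata key is my understanding of where the handler stores its results:

public List<String> extractPhoneNumbers() throws IOException, SAXException, TikaException {
    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    // Wrap a regular handler; any phone numbers found in the text are
    // added to the Metadata object under the "phonenumbers" key.
    PhoneExtractingContentHandler handler =
            new PhoneExtractingContentHandler(new BodyContentHandler(), metadata);
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("testPhoneNumbers.doc")) {
        parser.parse(stream, handler, metadata);
    }
    return Arrays.asList(metadata.getValues("phonenumbers"));
}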

Streaming the plain text in chunks

Sometimes, you want to chunk the resulting text up: perhaps to output as you go, minimising memory use; perhaps to output to HDFS files; or for any other reason! With a small custom content handler, you can do that.

private static final int MAXIMUM_TEXT_CHUNK_SIZE = 100 * 1024 * 1024;  // tune the chunk size to suit your needs

public List<String> parseToPlainTextChunks() throws IOException, SAXException, TikaException {
    final List<String> chunks = new ArrayList<>();
    chunks.add("");
    ContentHandlerDecorator handler = new ContentHandlerDecorator() {
        @Override
        public void characters(char[] ch, int start, int length) {
            String lastChunk = chunks.get(chunks.size() - 1);
            String thisStr = new String(ch, start, length);
 
            if (lastChunk.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) {
                chunks.add(thisStr);
            } else {
                chunks.set(chunks.size() - 1, lastChunk + thisStr);
            }
        }
    };
 
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test2.doc")) {
        parser.parse(stream, handler, metadata);
        return chunks;
    }
}

Translation

Tika provides a pluggable Translation system, which allows you to send the results of parsing off to an external system or program to have the text translated into another language.

Translation using the Microsoft Translation API

In order to use the Microsoft Translation API, you need to sign up for a Microsoft account, get an API key, then pass the key to Tika before translating.

public String microsoftTranslateToFrench(String text) {
    MicrosoftTranslator translator = new MicrosoftTranslator();
    // Change the id and secret! See http://msdn.microsoft.com/en-us/library/hh454950.aspx.
    translator.setId("dummy-id");
    translator.setSecret("dummy-secret");
    try {
        return translator.translate(text, "fr");
    } catch (Exception e) {
        return "Error while translating.";
    }
}

Language Identification

Tika provides support for identifying the language of text through the LanguageIdentifier class.

public String identifyLanguage(String text) {
    LanguageIdentifier identifier = new LanguageIdentifier(text);
    return identifier.getLanguage();
}

Additional Examples

A number of other examples are also available, including all of the examples from the Tika In Action book. These can all be found in the Tika Example module in SVN.


 
On Dec 18, 2015, at 5:33 AM, Allison, Timothy B. <talli...@mitre.org> wrote:

Y, I think we could use some updates there, but I think the key part of what I haven't gotten around to doing is [0].

If I understand that correctly, we have to update some stuff on the site branch.

[0] https://issues.apache.org/jira/browse/TIKA-1329?focusedCommentId=14295800&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14295800

-----Original Message-----
From: Joey Hong [mailto:jxih...@gmail.com]
Sent: Thursday, December 17, 2015 1:31 PM
To: dev@tika.apache.org
Subject: Re: looking to contribute

Thanks for the advice! I’ll start with some documentation and tests and move to harder tasks from there.

Regarding the JIRA instance for TIKA-1329, would the documentation for the RecursiveParserWrapper go with the RecursiveMetadata page on the wiki?

Thanks,
Joey

On Dec 17, 2015, at 5:32 AM, Allison, Timothy B. <talli...@mitre.org> wrote:

Speaking of the docs/examples, TIKA-1329 is still open because I haven't gotten around to documenting it.

Y, if you'd like a report of exceptions, let me know.  IIRC, it would be great if we could improve on XML detection (we're currently over detecting), and there's plenty of work to do on html parsing TIKA-1599.

I also have probably a full grad student semester worth of curation project ideas on the test corpus.  Not glamorous, but very useful for the community.

Then there's the eval code itself...that still needs to make it into shape to be added.

I agree with Nick though, start small on documentation/examples.

Cheers,

             Tim

-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Wednesday, December 16, 2015 4:23 PM
To: dev@tika.apache.org
Subject: Re: looking to contribute

On Wed, 16 Dec 2015, Joey Hong wrote:
My name is Joey. I am a college freshman with programming experience
looking to get into the world of open-source. I was hoping to
contribute to the Tika project, and was wondering if there were any
tasks that a beginner like me could tackle. I am willing to do
anything, whether it be fixing a minor bug, or adding test suites or documentation.

On the docs / examples side, we have a few examples on the website, but probably not enough! One thing might be to look through those, identify gaps with your fresh eyes, and work on those. We also have instructions for some more complicated integrations on the wiki, maybe try some of those and feed back on which ones aren't clear enough?

If you want to try more coding, Tim quite often runs Tika against some large filesets, and has a nifty tool to report on what breaks. He can hopefully point you at the most recent report! Maybe have a look through that, identify a few common failures from unidentified or common exceptions, and try to fix one or two of those?

Nick

