Regarding TIKA-1329, I found the tika-site source in Subversion and checked it out with: svn checkout https://svn.apache.org/repos/asf/tika/site/publish/1.11/. Since this isn’t part of the main tika/trunk repository, I was wondering whether I should still follow the same protocol and svn commit my changes to the site folder. In case I shouldn’t, I’ve attached my changes to the usage examples page of the website below. I added a description of how to parse documents with embedded documents using the RecursiveParserWrapper class, and how to serialize the returned Metadata list to JSON.

Thanks,
Joey
Apache Tika API Usage Examples

This page provides a number of examples of how to use the various Tika APIs. All of the examples shown are also available in the Tika Example module in SVN.
Parsing

Tika provides a number of different ways to parse a file. These provide different levels of control, flexibility, and complexity.

Parsing using the Tika Facade

The Tika facade provides a number of very quick and easy ways to have your content parsed by Tika and to get back the resulting plain text.
Parsing using the Auto-Detect Parser

For more control, you can call the Tika Parsers directly. Most likely, you'll want to start out using the Auto-Detect Parser, which automatically figures out what kind of content you have, then calls the appropriate parser for you.
Parsing using the Recursive Parser Wrapper

When you want to parse embedded documents, you can extract content from both the enclosing document and all embedded ones by passing the parser into the ParseContext instance.
Alternatively, you can use the RecursiveParserWrapper, which handles passing the parser into ParseContext. This wrapper class returns a list of Metadata objects, where the first element holds the metadata and content for the container document, and each remaining element holds the metadata and content for one embedded document.
The JsonMetadataList class can serialize the metadata list into JSON, and deserialize the JSON back into a list.
Picking different output formats

With Tika, you can get the textual content of your files returned in a number of different formats. These can be plain text, HTML, XHTML, the XHTML of just one part of the file, and so on. This is controlled by the ContentHandler you supply to the Parser.

Parsing to Plain Text

By using the BodyContentHandler, you can request that Tika return only the content of the document's body as a plain-text string.
Parsing to XHTML

By using the ToXMLContentHandler, you can get the XHTML content of the whole document as a string.
If you just want the body of the XHTML document, without the header, you can chain together a BodyContentHandler and a ToXMLContentHandler by passing the ToXMLContentHandler to the BodyContentHandler constructor.
Fetching just certain bits of the XHTML

It is possible to execute XPath queries on the parse results, to fetch only certain bits of the XHTML.
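As a rough illustration of the idea, the sketch below runs an XPath query over a small XHTML string using only the JDK's javax.xml.xpath API. This is not Tika's own XPath support. The class name XhtmlXPathSketch, the method selectText, and the sample markup are all invented for the example, and it assumes the XHTML has already been obtained as a string (for instance from a ToXMLContentHandler).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: JDK-only stand-in for querying XHTML parse output.
final class XhtmlXPathSketch {

    // Made-up sample standing in for the XHTML produced by a parse.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
        + "<head><title>Report</title></head>"
        + "<body><h1>Summary</h1>"
        + "<p>First paragraph.</p><p>Second paragraph.</p></body>"
        + "</html>";

    // Matches body paragraphs by local name, so no namespace mapping is needed.
    static final String BODY_PARAGRAPHS =
        "/*[local-name()='html']/*[local-name()='body']/*[local-name()='p']";

    // Returns the text content of every node matched by the expression.
    static List<String> selectText(String xhtml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions to keep the sketch short.
            throw new IllegalStateException("XPath query failed", e);
        }
    }

    public static void main(String[] args) {
        for (String text : selectText(SAMPLE, BODY_PARAGRAPHS)) {
            System.out.println(text);
        }
    }
}
```

This approach builds a full DOM before querying, so it suits small documents. A streaming matcher would be a better fit for large ones.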
Custom Content Handlers

The textual output of parsing a file with Tika is returned via the SAX ContentHandler you pass to the parse method. It is possible to customise your parsing by supplying your own ContentHandler which does special things.

Extract Phone Numbers from Content into the Metadata

By using the PhoneExtractingContentHandler, you can have any phone numbers found in the textual content of the document extracted and placed into the Metadata object for you.

Streaming the plain text in chunks

Sometimes you want to chunk the resulting text up, perhaps to output it as you go and minimise memory use, perhaps to write it to HDFS files, or for any other reason. With a small custom content handler, you can do that.
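A minimal sketch of such a chunking handler follows, written against the plain JDK SAX API so that it runs without Tika on the classpath. The class name ChunkingHandlerSketch, the chunk helper, and the sample XML are assumptions made for illustration. It collects chunks in a list, where a real handler might write each chunk out as soon as it is full.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: splits incoming character data into fixed-size chunks.
final class ChunkingHandlerSketch extends DefaultHandler {
    private final int maxChunkSize;
    private final List<String> chunks = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();

    ChunkingHandlerSketch(int maxChunkSize) {
        if (maxChunkSize <= 0) {
            throw new IllegalArgumentException("maxChunkSize must be positive");
        }
        this.maxChunkSize = maxChunkSize;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        int offset = start;
        int remaining = length;
        while (remaining > 0) {
            // Fill the current chunk, then start a new one when it is full.
            int take = Math.min(maxChunkSize - current.length(), remaining);
            current.append(ch, offset, take);
            offset += take;
            remaining -= take;
            if (current.length() == maxChunkSize) {
                flush();
            }
        }
    }

    @Override
    public void endDocument() {
        // Emit whatever is left as a final, possibly shorter, chunk.
        flush();
    }

    private void flush() {
        if (current.length() > 0) {
            chunks.add(current.toString());
            current.setLength(0);
        }
    }

    List<String> getChunks() {
        return chunks;
    }

    // Demo helper: feeds an XML string through the JDK SAX parser.
    static List<String> chunk(String xml, int maxChunkSize) {
        ChunkingHandlerSketch handler = new ChunkingHandlerSketch(maxChunkSize);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
        return handler.getChunks();
    }

    public static void main(String[] args) {
        System.out.println(chunk("<doc><p>abcdefghij</p><p>klm</p></doc>", 4));
    }
}
```

Because the class is an ordinary SAX ContentHandler, the same pattern should carry over to a handler passed to a Tika parse call, with flush changed to write each chunk to its destination instead of keeping it in memory.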
Translation

Tika provides a pluggable Translation system, which allows you to send the results of parsing off to an external system or program to have the text translated into another language.

Translation using the Microsoft Translation API

In order to use the Microsoft Translation API, you need to sign up for a Microsoft account, get an API key, then pass the key to Tika before translating.
Language Identification

Tika provides support for identifying the language of text, through the LanguageIdentifier class.
Additional Examples

A number of other examples are also available, including all of the examples from the Tika In Action book. These can all be found in the Tika Example module in SVN.