Author: jukka
Date: Tue Sep 16 13:33:31 2008
New Revision: 696042

URL: http://svn.apache.org/viewvc?rev=696042&view=rev
Log:
TIKA-157: List all the document formats supported by Tika
More format documentation

Modified:
    incubator/tika/trunk/src/site/apt/formats.apt

Modified: incubator/tika/trunk/src/site/apt/formats.apt
URL: http://svn.apache.org/viewvc/incubator/tika/trunk/src/site/apt/formats.apt?rev=696042&r1=696041&r2=696042&view=diff
==============================================================================
--- incubator/tika/trunk/src/site/apt/formats.apt (original)
+++ incubator/tika/trunk/src/site/apt/formats.apt Tue Sep 16 13:33:31 2008
@@ -22,13 +22,39 @@
    This page lists all the document formats supported by Apache Tika.
 
  [bzip2 compression (application/x-bzip)]
-   TODO
+   Tika uses an adapted version of the bzip2 parsing code from
+   {{{http://ant.apache.org/}Apache Ant}} to decompress bzip2 streams.
+   The bzip2 code is originally based on work by Keiron Liddle from
+   Aftex Software. Support for bzip2 compression was added in Tika 0.2.
+
+   The bzip2 parser decompresses the incoming stream and passes the
+   resulting stream to a configured delegate parser. If the
+   <<<RESOURCE_NAME_KEY>>> metadata property is set to a file name
+   that matches the common patterns <<<*.{tbz2,tbz}>>> or <<<*.{bz2,bz}>>>,
+   then name is replaced with <<<*.tar>>> or <<<*>>> respectively before
+   passing the decompressed stream for further parsing.
+
+   Bzip2 compression is automatically detected based on a magic header
+   or glob patterns.
 
  [Extensible Markup Language (application/xml)]
    TODO
 
  [gzip compression (application/x-gzip)]
-   TODO
+   Tika uses Java's built-in gzip support to decompress gzip streams.
+   Support for gzip compression was added in Tika 0.2.
+
+   The gzip parser simply uses the
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/java/util/zip/GZIPInputStream.html}GZIPInputStream}}
+   class to decompress the incoming stream. The resulting stream is
+   passed to a configured delegate parser. If the
+   <<<RESOURCE_NAME_KEY>>> metadata property is set to a file name
+   that matches the common patterns <<<*.tgz2>>> or <<<*{.gz,-gz}>>>,
+   then name is replaced with <<<*.tar>>> or <<<*>>> respectively before
+   passing the decompressed stream for further parsing.
+
+   Gzip compression is automatically detected based on a magic header
+   or glob patterns.
 
  [HyperText Markup Language (text/html)]
    TODO
@@ -61,6 +87,9 @@
    upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
    for the current status of this issue.
 
+   Microsoft Word documents are automatically detected based on a magic
+   header or a glob pattern.
+
    For an example of parsing Microsoft Word files, see the
    {{{xref-test/org/apache/tika/parser/microsoft/WordParserTest.html}WordParserTest}}
    test case.
@@ -88,6 +117,9 @@
    upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
    for the current status of this issue.
 
+   Microsoft Excel spreadsheets are automatically detected based on a magic
+   header or a glob pattern.
+
    For an example of parsing Microsoft Excel files, see the
    {{{xref-test/org/apache/tika/parser/microsoft/ExcelParserTest.html}ExcelParserTest}}
    test case.
@@ -112,6 +144,9 @@
    upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
    for the current status of this issue.
 
+   Microsoft PowerPoint presentations are automatically detected based on
+   a magic header or a glob pattern.
+
    For an example of parsing Microsoft PowerPoint files, see the
    {{{xref-test/org/apache/tika/parser/microsoft/PowerPointParserTest.html}PowerPointParserTest}}
    test case.
@@ -131,6 +166,9 @@
    Generic Microsoft Office document properties like title, author,
    and keywords are returned as metadata properties.
 
+   Microsoft Visio diagrams are automatically detected based on a magic
+   header or a glob pattern.
+
  [Microsoft Outlook (application/vnd.ms-outlook)]
    Tika uses the {{{http://poi.apache.org/hsmf/}HSMF}} API in
    {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
@@ -140,6 +178,9 @@
    the From, To, Cc, and Bcc addresses (formatted for display) along
    with the body text of text/plain messages.
 
+   Microsoft Outlook messages are automatically detected based on a magic
+   header or a glob pattern.
+
    For an example of parsing Microsoft Outlook files, see the
    {{{xref-test/org/apache/tika/parser/microsoft/OutlookParserTest.html}OutlookParserTest}}
    test case.
@@ -174,7 +215,9 @@
    simplify encoding detection.
 
  [Portable Document Format (application/pdf)]
-   TODO
+   Tika uses the {{{http://www.pdfbox.org}PDFBox}} library to parse
+   Portable Document Format (PDF) documents. Support for PDF was added
+   in Tika 0.1.
 
  [Rich Text Format (application/rtf)]
    Tika uses Java's built-in Swing library to parse Rich Text Format (RTF)
@@ -186,7 +229,10 @@
    Document metadata extraction is currently not supported.
 
  [tar archive (application/x-tar)]
-   TODO
+   Tika uses an adapted version of the tar parsing code from
+   {{{http://ant.apache.org/}Apache Ant}} to parse tar archives.
+   The tar code is originally based on work by Timothy Gerard Endres.
+   Support for tar archives was added in Tika 0.2.
 
  [ZIP archive (application/zip)]
    TODO
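
The decompress-then-delegate behaviour documented in this change for the bzip2 and gzip parsers can be illustrated with a rough, standalone sketch. This is not Tika's parser code: the class name `GzipDelegateSketch` and the helpers `decompressedName`, `gunzip`, and `gzip` are hypothetical, and only `GZIPInputStream` and `GZIPOutputStream` come from the JDK. The suffix lists follow the patterns quoted in the diff; the plain `.tgz` suffix is an added assumption that the diff itself does not list.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipDelegateSketch {

    // Suffixes that indicate a compressed tar archive. ".tgz2" is copied from
    // the documentation text; ".tgz" is an assumption added for illustration.
    private static final String[] TAR_SUFFIXES = {".tbz2", ".tbz", ".tgz2", ".tgz"};

    // Suffixes that are simply stripped from the resource name.
    private static final String[] PLAIN_SUFFIXES = {".bz2", ".bz", ".gz", "-gz"};

    /** Hypothetical helper: derives the resource name of the decompressed stream. */
    static String decompressedName(String name) {
        for (String suffix : TAR_SUFFIXES) {
            if (name.endsWith(suffix)) {
                return name.substring(0, name.length() - suffix.length()) + ".tar";
            }
        }
        for (String suffix : PLAIN_SUFFIXES) {
            if (name.endsWith(suffix)) {
                return name.substring(0, name.length() - suffix.length());
            }
        }
        return name; // no known compression suffix, keep the name unchanged
    }

    /** Decompresses gzip data with the JDK's GZIPInputStream. */
    static byte[] gunzip(byte[] compressed) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Compresses data so that the sketch has something to decompress. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("hello".getBytes(StandardCharsets.UTF_8));
        String text = new String(gunzip(compressed), StandardCharsets.UTF_8);
        // A real delegate parser would receive the decompressed stream together
        // with the rewritten resource name; here both are just printed.
        System.out.println(decompressedName("notes.txt.gz") + ": " + text);
    }
}
```

In this sketch the name rewriting is kept separate from decompression, so the same mapping could serve a bzip2 stream decoded by another library.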