formats.apt site.xml

jukka Sun, 14 Sep 2008 09:54:55 -0700

Author: jukka
Date: Sun Sep 14 09:53:55 2008
New Revision: 695253

URL: http://svn.apache.org/viewvc?rev=695253&view=rev
Log:
TIKA-157: List all the document formats supported by Tika


First draft of the list of supported formats.

Added:
    incubator/tika/trunk/src/site/apt/formats.apt
Modified:
    incubator/tika/trunk/src/site/site.xml

Added: incubator/tika/trunk/src/site/apt/formats.apt
URL: http://svn.apache.org/viewvc/incubator/tika/trunk/src/site/apt/formats.apt?rev=695253&view=auto
==============================================================================
--- incubator/tika/trunk/src/site/apt/formats.apt (added)
+++ incubator/tika/trunk/src/site/apt/formats.apt Sun Sep 14 09:53:55 2008
@@ -0,0 +1,192 @@
+                       --------------------------
+                       Supported Document Formats
+                       --------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Supported Document Formats
+
+   This page lists all the document formats supported by Apache Tika.
+
+   [bzip2 compression (application/x-bzip)]
+    TODO
+
+   [Extensible Markup Language (application/xml)]
+    TODO
+
+   [gzip compression (application/x-gzip)]
+    TODO
+
+   [HyperText Markup Language (text/html)]
+    TODO
+
+   [Images (image/*)]
+    TODO
+
+   [Java class files]
+    TODO
+
+   [Java jar archives]
+    TODO
+
+   [Microsoft Word (application/msword)]
+    Tika uses the {{{http://poi.apache.org/hwpf/}HWPF}} API in
+    {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
+    Word documents. Support for Microsoft Word was added in Tika 0.1.
+
+    The Word parser in Tika simply uses the POI
+    {{{http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html}WordExtractor}}
+    class to extract text paragraphs from Word documents. Support for more
+    complex content structures is not yet implemented; see
+    {{{https://issues.apache.org/jira/browse/TIKA-123}TIKA-123}} for this
+    issue.
+
+    Generic Microsoft Office document properties like title, author, and
+    keywords are returned as metadata properties.
+
+    Support for the new XML-based Word 2007 format is pending a POI
+    upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
+    for the current status of this issue.
+
+    For an example of parsing Microsoft Word files, see the
+    {{{xref-test/org/apache/tika/parser/microsoft/WordParserTest.html}WordParserTest}}
+    test case.
+
+   [Microsoft Excel (application/vnd.ms-excel)]
+    Tika uses the {{{http://poi.apache.org/hssf/}HSSF}} API in
+    {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
+    Excel spreadsheets. Support for Microsoft Excel was added in Tika 0.1.
+
+    The Excel parser in Tika uses the HSSF event model and is able to recreate
+    much of the document structure, including all (non-empty) worksheets and
+    their table structures. Formula results are extracted as stored in the
+    Excel file, and cell links are exposed as XHTML links. These features
+    were added in Tika 0.2.
+
+    Cell comments and formatting are currently not supported. See
+    {{{https://issues.apache.org/jira/browse/TIKA-148}TIKA-148}} and
+    {{{https://issues.apache.org/jira/browse/TIKA-103}TIKA-103}} for the
+    respective issues.
+
+    Generic Microsoft Office document properties like title, author, and
+    keywords are returned as metadata properties.
+
+    Support for the new XML-based Excel 2007 format is pending a POI
+    upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
+    for the current status of this issue.
+
+    For an example of parsing Microsoft Excel files, see the
+    {{{xref-test/org/apache/tika/parser/microsoft/ExcelParserTest.html}ExcelParserTest}}
+    test case.
+
+   [Microsoft PowerPoint (application/vnd.ms-powerpoint)]
+    Tika uses the {{{http://poi.apache.org/hslf/}HSLF}} API in
+    {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
+    PowerPoint presentations. Support for Microsoft PowerPoint was added
+    in Tika 0.1.
+
+    The PowerPoint parser in Tika simply uses the POI
+    {{{http://poi.apache.org/apidocs/org/apache/poi/hslf/extractor/PowerPointExtractor.html}PowerPointExtractor}}
+    class to extract all text as a single paragraph from a PowerPoint document.
+    Support for more complex content structures is not yet implemented; see
+    {{{https://issues.apache.org/jira/browse/TIKA-123}TIKA-123}} for this
+    issue.
+
+    Generic Microsoft Office document properties like title, author, and
+    keywords are returned as metadata properties.
+
+    Support for the new XML-based PowerPoint 2007 format is pending a POI
+    upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
+    for the current status of this issue.
+
+    For an example of parsing Microsoft PowerPoint files, see the
+    {{{xref-test/org/apache/tika/parser/microsoft/PowerPointParserTest.html}PowerPointParserTest}}
+    test case.
+
+   [Microsoft Visio (application/vnd.visio)]
+    Tika uses the {{{http://poi.apache.org/hdgf/}HDGF}} API in
+    {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
+    Visio diagrams. Support for Microsoft Visio was added in Tika 0.2.
+
+    The Visio parser in Tika simply uses the POI
+    {{{http://poi.apache.org/apidocs/org/apache/poi/hdgf/extractor/VisioTextExtractor.html}VisioTextExtractor}}
+    class to extract all text entries from Visio documents.
+    Support for more complex content structures is not yet implemented; see
+    {{{https://issues.apache.org/jira/browse/TIKA-123}TIKA-123}} for this
+    issue.
+
+    Generic Microsoft Office document properties like title, author, and
+    keywords are returned as metadata properties.
+
+   [Microsoft Outlook (application/vnd.ms-outlook)]
+    Tika uses the {{{http://poi.apache.org/hsmf/}HSMF}} API in
+    {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
+    Outlook messages. Support for Microsoft Outlook was added in Tika 0.2.
+
+    The Outlook parser in Tika extracts the subject of the message and
+    the From, To, Cc, and Bcc addresses (formatted for display) along
+    with the body text of text/plain messages.
+
+    For an example of parsing Microsoft Outlook files, see the
+    {{{xref-test/org/apache/tika/parser/microsoft/OutlookParserTest.html}OutlookParserTest}}
+    test case.
+
+   [MP3 Audio (audio/mp3)]
+    TODO
+
+   [OpenDocument (application/vnd.oasis.opendocument.*)]
+    TODO
+
+   [Plain text (text/plain)]
+    Tika uses the
+    {{{http://www.icu-project.org/}International Components for Unicode}}
+    Java library (ICU4J) to parse plain text. Support for plain text was added
+    in Tika 0.1.
+
+    Extracting text content from plain text files is a relatively complex
+    task because the character encoding of the text file is often unknown
+    to the parser.
+
+    The text parser in Tika uses the ICU4J
+    {{{http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html}CharsetDetector}}
+    class to automatically detect the character encoding of any text input.
+    As an added benefit, the ICU4J library is in some cases also able to
+    detect the language in which the text is written.
+
+    The character encoding and language of the plain text document are
+    returned as the <<<Metadata.CONTENT_ENCODING>>> and <<<Metadata.LANGUAGE>>>
+    metadata properties. If the (declared) content encoding of a text document
+    is already known to the client application, then it can be supplied as the
+    <<<Metadata.CONTENT_ENCODING>>> metadata property to the parser to
+    simplify encoding detection.
+
+   [Portable Document Format (application/pdf)]
+    TODO
+
+   [Rich Text Format (application/rtf)]
+    Tika uses Java's built-in Swing library to parse Rich Text Format (RTF)
+    documents. Support for RTF was added in Tika 0.1.
+
+    The RTF parser in Tika uses the Swing
+    {{{http://java.sun.com/j2se/1.5.0/docs/api/javax/swing/text/rtf/RTFEditorKit.html}RTFEditorKit}}
+    class to extract all text from an RTF document as a single paragraph.
+    Document metadata extraction is currently not supported.
+
+   [tar archive (application/x-tar)]
+    TODO
+
+   [ZIP archive (application/zip)]
+    TODO

Modified: incubator/tika/trunk/src/site/site.xml
URL: http://svn.apache.org/viewvc/incubator/tika/trunk/src/site/site.xml?rev=695253&r1=695252&r2=695253&view=diff
==============================================================================
--- incubator/tika/trunk/src/site/site.xml (original)
+++ incubator/tika/trunk/src/site/site.xml Sun Sep 14 09:53:55 2008
@@ -39,7 +39,8 @@
       <item name="Introduction" href="index.html"/>
       <item name="Download" href="download.html"/>
       <item name="Documentation" href="documentation.html"/>
+      <item name="Supported Formats" href="formats.html"/>
     </menu>
     <menu ref="reports"/>
   </body>
-</project>
\ No newline at end of file
+</project>
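
The Rich Text Format entry in formats.apt says the RTF parser relies on the
Swing RTFEditorKit class to pull all text out of a document. The following is
a minimal sketch of that approach using only the JDK. It is not the Tika
parser itself; the class name RtfTextSketch, the method extractText, and the
sample document are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical illustration class; not part of Tika.
public class RtfTextSketch {

    // Reads an RTF byte stream into a Swing document model and
    // returns the document's plain text content.
    static String extractText(InputStream rtf)
            throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document document = kit.createDefaultDocument();
        kit.read(rtf, document, 0);
        return document.getText(0, document.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Small hand-written RTF sample (an assumption for this sketch).
        String sample =
            "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\\f0 Hello, RTF!}";
        InputStream in = new ByteArrayInputStream(
            sample.getBytes(StandardCharsets.US_ASCII));
        System.out.println(extractText(in).trim());
    }
}
```

The Swing document model exposes title and author fields less directly, if at
all, which may be why the entry notes that metadata extraction is not
supported.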

svn commit: r695253 - in /incubator/tika/trunk/src/site: apt/formats.apt site.xml
