Added: dev/tika/CHANGES-1.26.txt ============================================================================== --- dev/tika/CHANGES-1.26.txt (added) +++ dev/tika/CHANGES-1.26.txt Wed Mar 24 15:03:10 2021 @@ -0,0 +1,2693 @@ +Release 1.26 - 03/24/2021 + + * Fix thread safety bug in OpenOffice parser (TIKA-3334). + + * The "writeLimit" header now pertains to the combined characters + written per container document (and embedded documents) in the /rmeta + endpoint in tika-server (TIKA-3325); it no longer functions only + per container or embedded document. + + * Extract more embedded files in PDFs by recursively processing the + embedded file tree (TIKA-3332). + + * Allow for case insensitive headers for configuration of the PDFParser + and the TesseractOCRParser in tika-server via Subhajit Das (TIKA-3320). + + * Improve detection and parsing of XPS files (TIKA-3316). + + * General dependency upgrades (TIKA-3244). + + * Great optimization in ForkParser (TIKA-3237). + + * Fix parsing of emails attached to other emails in PST files (TIKA-3004). + + * MP3 parser should output the xmpDM:duration metadata as seconds not + milliseconds, consistent with the other Audio and Video parsers (TIKA-3318). + + * MP4 parser check if any of the Compatible Brands match when identifying + the subtype (TIKA-3310). + +Release 1.25 - 11/25/2020 + + * Fix inconsistent license in xmpcore (TIKA-3204). + + * General upgrades including some dependencies with + recently found security vulnerabilities (TIKA-3119). + + * Add detection and a parser for flat ODF files (TIKA-3159). + + * Add extraction of macros from ODF files (TIKA-3161). + + * Add mime detection for hprof and hprof text files (TIKA-3144). 
+ + * Add TextSignature and TextProfileSignature to tika-eval (TIKA-3145 and TIKA-3146) + + * Create a metadata filter to trigger tika-eval stats post parsing (TIKA-3140) + + * Add a configurable metadata-filter for the RecursiveParserWrapper (TIKA-3137) + + * Add status endpoint to tika-server (TIKA-3129). + + * Remove whitelist/blacklist terminology (TIKA-3120) + + * Add detection for parquet files (TIKA-3115). + + * Add detection and parsing for bplist (TIKA-3104). + + * Enable metadata value filtering for RecursiveParserWrapper (TIKA-3137) + + * Add a basic parser for plist files based on com.googlecode.plist:dd-plist (TIKA-3104). + + * Read hyperlinked images from ODT files (TIKA-3156). + + * Updated GrobidRESTParser to use new API location (TIKA-3191). + + * Add FileProfiler to tika-eval (TIKA-3216). + + * Add status endpoint to tika-server (TIKA-3129). + + * Improved handling of zip files with STORED entries with + data descriptor (TIKA-3196). + + * Add parsers for XLZ, IDML and MIF (TIKA-2976, TIKA-3188 and TIKA-3189). + + * Add the beginnings of a format-aware fuzzing module (TIKA-3083). + + * Add wrapper for Linux 'file' command for mime detection (TIKA-3215). + + * Added ability to skip parsing of embedded files in Tika Server (TIKA-3227). + +Release 1.24.1 - 4/17/2020 + + * Add detection and a parser for flat ODF files (TIKA-3159). + + * Add extraction of macros from ODF files (TIKA-3161). + + * Add mime detection for hprof and hprof text files (TIKA-3144). + + * Add TextSignature and TextProfileSignature to tika-eval (TIKA-3145 and TIKA-3146) + + * Create a metadata filter to trigger tika-eval stats post parsing (TIKA-3140) + + * Add a configurable metadata-filter for the RecursiveParserWrapper (TIKA-3137) + + * Add status endpoint to tika-server (TIKA-3129). + + * Remove whitelist/blacklist terminology (TIKA-3120) + + * Add detection for parquet files (TIKA-3115). + + * Add detection and parsing for bplist (TIKA-3104). 
+ + * Enable metadata value filtering + + * Add a basic parser for plist files based on com.googlecode.plist:dd-plist (TIKA-3104). + +Release 1.24.1 - 4/17/2020 + + * Allow gzip compression of input and output streams for tika-server (TIKA-3073). + +Release 1.24 - 3/11/2020 + + * Add scripts to run tika-server as a service via Eric Pugh, + and add these scripts and jar as a new artifact in the release (TIKA-3010). + + * Upgrade Drew Noakes' metadata-extractor (TIKA-2952). + + * Enable optional extraction of structural tags in PDFs (alpha-grade) (TIKA-3026). + + * Tika app's --extract mode now outputs to STDOUT (TIKA-3035). + + * Add an optional Preflight parser for PDFs (TIKA-3055). + + * Improve detection of some zip-based formats (TIKA-3057). + + * Upgrade metadata-extractor to 2.13.0 (TIKA-2952). + + * Upgrade to POI 4.1.2 (TIKA-3047). + + * Extract XMP from PSD files (TIKA-3050). + + * Added XMLProfiler as an optional parser to profile XFA and XMP + in PDFs (TIKA-3045). + + * Extract inline images that rely on the DCT filter from PDFs (TIKA-3041). + + * Upgrade to PDFBox 2.0.19 (TIKA-3033). + + * Fix bug in ASM parser configuration (TIKA-2992). + + * Upgrade to java-libpst 0.9.3 (TIKA-2546). + + * Fixed XLIFF12Parser failures with ToXMLHandler (TIKA-3014). + +Release 1.23 - 12/02/2019 + + * NOTE: The PDFParser now relies on OCRDPI to render page images when + users configure OCR on rendered page images. This will have the effect + of increasing rendered image size (TIKA-2624). + + * NOTE: tika-server no longer returns 415 for file types for which there + is no parser. + + * Fix bug in AUTO OCR strategy in the PDFParser (TIKA-3002). + + * Fix incorrect height and width metadata extraction from JPEG images (TIKA-2630). + + * Upgrade to POI 4.1.1 (TIKA-2851). + + * Upgrade to PDFBox 2.0.17 (TIKA-2951). + + * Ensure that the PDFParser respects custom configuration of Tesseract + from tika-config.xml via Eric Pugh (TIKA-2970). 
+ + * Add parser for XLIFF v1.2 files (TIKA-2975). + + * Add mime type detection support for WebAssembly (TIKA-2894), + HEIF / HEIC images (TIKA-2942), Digilite FDF (TIKA-2988); + and xml-root detection for XFDF (TIKA-2990) and XDP (TIKA-2989). + + * Add an XLZ Parser (TIKA-2976). + + * Fix deadlock with ForkParser when InputStream throws IOException (TIKA-2892). + +Release 1.22 - 07/29/2019 + + * NOTE: tika-server no longer hard-codes the HtmlParser to handle + XML files (TIKA-2910). Users must now configure that behavior + via a tika-config.xml file. + + * NOTE: Known regression: PDFBOX-4587 -- PDF passwords with codepoints + between 0xF000 and 0XF0000 will cause an exception. + + * Add parser for HWP v5 files via SooMyung Lee (soomyung) and + JinSup Kim (ddoleye) (TIKA-2909). + + * Fix order of closing streams to avoid "Failed to close temporary resource" + exception in TesseractOCRParser (TIKA-2908). + + * Improve AutoDetectReader performance by caching encoding + detector (TIKA-1568). + + * Prevent RTFParser from outputting illegal tag combinations (TIKA-2889). + + * Fix RereadableInputStream to release all resources (TIKA-2903). + + * Implement custom language identifier in the tika-eval module based on + OpenNLP's language detector; add 18 languages and add common words + lists for all 121 languages (TIKA-2790). + + * Fix NPE in MimeTypesReader.releaseParser() via Eamonn Saunders (TIKA-2896). + + * Fix RTFParser to extract more content (TIKA-2883). + + * Add clientSubmitTime to the metadata extracted from PST files (TIKA-2898). + + * Improve StreamingZipContainerDetector for xltx, xltm and + several other file formats (TIKA-2886). + +Release 1.21 - 05/14/2019 + + * Add optional AUTO mode to OCR'ing of PDFs. If tesseract is installed + and on the path, and this option is selected programmatically + or via TikaConfig(), the PDFParser will use heuristics to decide + whether or not to run OCR per page on PDFs. 
(TIKA-2749) + + * The ZipContainerDetector's default behavior was changed to run + streaming detection up to its markLimit. Users can get the + legacy behavior (spool-to-file/rely-on-underlying-file-in-TikaInputStream) + by setting markLimit=-1. The POIFSContainerDetector requires an underlying file; + it will try to spool the file to disk; if the file's length is > markLimit, + it will not attempt detection; set markLimit to -1 for legacy behavior (TIKA-2849). + + * Upgrade PDFBox to 2.0.14 (TIKA-2834). + + * Add CSV detection and replace TXTParser with TextAndCSVParser; + users can turn off CSV detection by excluding the TextAndCSVParser + and adding back the TXTParser via tika-config (TIKA-2833). + + * Add a CSVParser. CSV detection is currently based solely on filename + and/or information conveyed via Metadata (TIKA-2826). + + * General upgrades: asm, bouncycastle, commons-codec, commons-lang3, cxf, + guava, h2, httpcomponents, jackcess, junrar, Lucene, mime4j, opennlp, parso, + sqlite-jdbc (provided), zstd-jni (provided) (TIKA-2824) + + * Bundle xerces2 with tika-parsers (TIKA-2802). + + * Upgrade jaxb to 2.3.2 (TIKA-2819). + + * Upgrade jackson to 2.9.8 (TIKA-2717). + + * Update tika-eval's common tokens lists (TIKA-2822). + + * Handle bad tags in tika-eval more robustly (TIKA-2810). + + * Add reports for tags in tika-eval (TIKA-2809). + + * Extract text from SDT element within textboxes in .docx files (TIKA-2807). + + * Try to handle truncated OOXML files more robustly (TIKA-2765). + +Release 1.20 - 12/17/2018 + + * Upgrade to POI 4.0.1 (TIKA-2751). + + * Integrate/parameterize new angles handling in + PDFBox (TIKA-2779). + + * Upgrade to PDFBox 2.0.13 (TIKA-2788). + + * Prevent content within <style/> and <script/> elements + to be written in the ToTextContentHandler (TIKA-2550). + + * Switch child to parent communication to a shared memory-mapped + file in tika-server's -spawnChild mode. 
+ + * Fix bug in tika-server when run in legacy mode (not -spawnChild) + that caused it to return 503 on documents submitted after + it hit an OutOfMemoryError (TIKA-2776). + + * Upgrade jaxb-runtime and javax.activation (TIKA-2778). + + * tika-app in batch mode now requires an interrupt or + kill signal to the parent process to stop the parent + and the child processes (TIKA-2780). + + * Bulk upgrade of dependencies (TIKA-2775). + + * Improve language id efficiency in tika-eval (TIKA-2777). + + * Upgrade sqlite "provided" dependency to 3.25.2 (TIKA-2773). + + * Remove duplication of notes in PPT slides (TIKA-2735) + + * Use -javaHome or $JAVA_HOME (if they exist) when + spawning child in tika-server's -spawnChild mode. + + * Fixed closing of styles around Hyperlinks in Word Parser + Contributed by Ronan O'Sullivan (TIKA-2599). + +Release 1.19.1 - 10/4/2018 + + * Update PDFBox to 2.0.12, jempbox to 1.8.16 + and jbig2 to 3.0.2 (TIKA-2745). + + * Fix regression in parser for MP3 files (TIKA-2730). + + * Updated Python Dependency Check for TesseractOCR (TIKA-2740). + + * Improve SAXParser robustness (TIKA-2727). + + * Remove dependency on slf4j-log4j12 by upgrading jmatio (TIKA-2742). + + * Replace com.sun.xml.bind:jaxb-impl and jaxb-core with + org.glassfish.jaxb:jaxb-runtime and jaxb-core (TIKA-2743) + +Release 1.19 - 9/14/2018 + + * Require Java 8 (TIKA-2679). + + * Enable building with Java 11 (TIKA-2668) + + * Add an option to make tika-server robust against infinite loops, + OOMs, and memory leaks (TIKA-2725). + + * Allow configuration of the Tesseract parser via the standard + tika-config.xml options (TIKA-2705). + + * Improve handling of empty cells across table-based + formats (TIKA-2479). + + * Add a Standards compliant HTML encoding detector + via Gerard Bouchar (TIKA-2673). + + * Improved XML parsing -- limited default entity expansions to 20. + To raise this limit, add -Djdk.xml.entityExpansionLimit=XXX to + your commandline. 
+ + * Mime magic improvements for Olympus RAW (TIKA-2658), interpreted + server-side languages via HTTP (TIKA-2648), MHTML (TIKA-2723) + + * Add absolute timeout to ForkParser rather than testing + for active (TIKA-2656). + + * Make the RecursiveParserWrapper work with the ForkParser (TIKA-2655). + + * Allow the ForkParser to specify a directory containing tika-app.jar + for use by the ForkServer. This allows users to keep most of the + parser dependencies out of their code; and it allows for an easy + addition of optional jars for Parser dependencies, + such as the xerial sqlite jar (TIKA-2653). + + * Use a pool for SAXParsers and DOMBuilders rather than creating + a new parser/builder for every parse. + For better performance, set XMLReaderUtils.setPoolSize() to the + number of threads you're using with Tika (TIKA-2645). + + * Add the RecursiveParserWrapperHandler to improve the RecursiveParserWrapper + API slightly (TIKA-2644). + + * Upgraded to Commons-Compress 1.18 (TIKA-2707). + + * Upgraded to Apache POI 4.0.0 (TIKA-2552). + + * Upgraded to Apache PDFBox 2.0.11 (TIKA-2681). + + * Upgraded to deeplearning4j 1.0.0-beta2 (TIKA-2672). + + * Upgraded jmatio to 1.4 (TIKA-2667) + + * Upgraded Apache Lucene to 7.4.0 in tika-eval and tika-examples (TIKA-2695). + + * Upgraded junrar to 1.0.1 (TIKA-2664). + + * Numerous other upgrades (TIKA-2692). + + * Excluded Spring as a transitive dependency (TIKA-2721). + +Release 1.18 - 4/20/2018 + + * Upgrade jackson to 2.9.5 (TIKA-2634). + + * Add support for brotli (TIKA-2621). + + * Upgrade PDFBox to 2.0.9 and include new jbig2-imageio + from org.apache.pdfbox (TIKA-2579 and TIKA-2607). + + * Support for TIFF images in PDF files (TIKA-2338) + + * Detection of full encrypted 7z files (TIKA-2568) + + * Various new mimes and typo fixes in tika-mimetypes.xml + via Andreas Meier (TIKA-2527). 
+ + * Revert to listenForAllRecords=false in ExcelExtractor + via Grigoriy Alekseev (TIKA-2590) + + * Add workaround to identify TIFFs that might confuse + commons-compress's tar detection via Daniel Schmidt + (TIKA-2591) + + * Ignore non-IANA supported charsets in HTML meta-headers + during charset detection in HTMLEncodingDetector + via Andreas Meier (TIKA-2592) + + * Add detection and parsing of zstd (if user provides + com.github.luben:zstd-jni) via Andreas Meier (TIKA-2576) + + * Allow for RFC822 detection for files starting with "dkim-" + and/or "x-" via Andreas Meier (TIKA-2578 and TIKA-2587) + + * Extract xlsx files embedded in OLE objects within PPT and PPTX + via Brian McColgan (TIKA-2588). + + * Extract files embedded in HTML and javascript inside HTML + that are stored in the Data URI scheme (TIKA-2563). + + * Extract text from grouped text boxes in PPT (TIKA-2569). + + * Extract language metadata item from PDF files via Matt Sheppard (TIKA-2559) + + * RFC822 with multipart/mixed, first text element should be treated + as the main body of the email, not an attachment (TIKA-2547). + + * Swap out com.tdunning:json for com.github.openjson:openjson to avoid + jar conflicts (TIKA-2556). + + * No longer hardcode HtmlParser for XML files in tika-server (TIKA-2551). + + * Require Java 8 (TIKA-2553). + + * Add a parser for XPS (TIKA-2524). + + * Mime magic for Dolby Digital AC3 and EAC3 files + + * Fixed bug where TesseractOCRParser ignores configured ImageMagickPath, + and set rotation script to ignore Python warnings (TIKA-2509) + + * Upgrade geo-apis to 3.0.1 (TIKA-2535) + + * Mime definition and magic improvements for text-based programming + and config formats (TIKA-2554, TIKA-2567, TIKA-1141) + + * Added local Docker image build using dockerfile-maven-plugin to allow + images to be built from source (TIKA-1518). 
+ + * Support for SAS7BDAT data files (TIKA-2462) + + * Handle .epub files using .htm rather than .html extensions for the + embedded contents (TIKA-1288) + + * Mime magic for ACES Images (TIKA-2628) and DPX Images (TIKA-2629) + + * For sparse XLSX and XLSB files, always output missing cells to + the left of filled ones (matching XLS), and optionally output + missing rows on all 3 formats if requested via the + OfficeParserContext (TIKA-2479) + +Release 1.17 - 12/8/2017 + + ***NOTE: THIS IS THE LAST VERSION OF TIKA THAT WILL RUN + ON Java 7. The next versions will require Java 8*** + + * Fix thread-safety in ChmExtractor (TIKA-2519). + + * Upgrade cxf to 3.0.16 (TIKA-2516). + + * Allow users to configure maxMainMemoryBytes for PDFs via shrike (PR-213). + + * Extract underline and strikethrough in docx (TIKA-2347 and TIKA-2512). + + * Cache TikaConfig in EmbeddedDocumentUtil for better performance + in documents with large number of attachments (TIKA-2511). + + * Extract media files from ooxml (TIKA-2510). + + * Standardize the way the Image and Video captioning + dockers and extraction work (TIKA-2400, GitHub-208) + + * Upgrade to xmpcore 5.1.3 (TIKA-2034). + + * Upgrade to metadata-extractor 2.10.1 (TIKA-2486). + + * Upgrade to OpenNLP 1.8.3 (TIKA-2502). + + * Upgrade to Jackson 2.9.2 (TIKA-2501). + + * Catch potential NPE in getting InputStream for attachments + in PST file (TIKA-2488). + + * Upgrade to PDFBox 2.0.8 (TIKA-2489). + + * Allow configuration of markLimit in EncodingDetectors + via tika-config.xml (TIKA-2485). + + * RFC822Parser now selects the best alternative for + multipart/alternative body components. This aligns with the + behavior of the OutlookParser (TIKA-2478). Users can select + legacy behavior via the "extractAllAlternatives" parameter + in the RFC822 parser definition in tika-config.xml. + + * Narrow mime detection for ms-owner files and add detection + for .nls files (TIKA-2469). 
+ + * Fix bug in CharsetDetector that led to different detected charsets + depending on whether user setText with a byte[] or an InputStream + via Sean Story (TIKA-2475). + + * Remove JAXB for easier use with Java 9 via Robert Munteanu (TIKA-2466). + + * Upgrade to POI 3.17 (TIKA-2429). + + * Enabling extraction of standard references from text (TIKA-2449). + + * Load external custom mimetypes XML from system property + tika.custom-mimetypes (TIKA-2460). + + * Extract number of tiffs in a multi-page tiff (TIKA-2451). + + * Fix detection of emails extracted from mbox (TIKA-2456). + + * Add OverrideDetector and allow PSTParser to specify body content type + as text or html -- to avoid incorrect auto-detection of + rfc/mbox, etc. (TIKA-2454) + + * AutoDetectParser throws ZeroByteFileException for zero-byte files after + detection on the file extension (TIKA-2450). + + * Extract phonetic runs in docx with experimental SAX parser (TIKA-2448). + + * Extract phonetic runs from xls and allow users to turn off extraction + of phonetic runs in both xls and xlsx (TIKA-2440). + + * OOXML locale should be set by POI's LocaleUtil not Locale.getDefault(). + Fix unit tests to be robust against different locales in OOXML + and ExcelParser (TIKA-2438). + + * Upgrade to PDFBox 2.0.7 (TIKA-2431). + + * Tika now has support for automatic image captioning, that + combines Computer Vision and Natural Language Processing to + automatically generate a readable caption for an image + (TIKA-2262, TIKA-2355, TIKA-2402, Gh-198, Gh-196, Gh-189). + + * Add TestCorruptedFiles to allow devs to test parsers against + corrupted input files (TIKA-2430). 
+ + * Correct Mimetype definition for Windows batch files (CMD and BAT) + which are the same (TIKA-2445) + + * PSDParser memory use improvements (TIKA-2447) + + * Add underline extraction from Word documents (doc/docx) via Stuart Hendren + as well as strikethrough extraction in docx (TIKA-2347, GitHub-173) + + * Corrected Tesseract OCR rotation.py script and made it a configurable + option via Peter Weiss (TIKA-2385) + +Release 1.16 - 7/7/2017 + + * Exclude jj2000 from edu.ucar grib to avoid potential + license conflicts with ASL 2.0 + + * Add Age recognition using Ensemble model for Linear regression + and Apache OpenNLP Maximum Entropy. Tika can now detect age from + text (TIKA-1988). + + * Add Tika Deep Learning support for the VGG16 model for + Very Deep Convolutional Networks for Large-Scale Image Recognition. + Now Tika supports both Inception v3/v4 and VGG16 based image + recognition (TIKA-2298). + + * Extract macros from PPT (TIKA-2089). + + * Extract absolute path for last saved location when available + in .xlsx and .xlsb (TIKA-2335). + + * Rename SentimentParser to SentimentAnalysisParser to + prevent conflict with dependency (TIKA-2368). + + * tika-app now extracts inline images in PDFs by + default, and it includes a warning to users that this is not the + default behavior elsewhere in Tika (TIKA-2374). + + * Allow configurability of warnings for problems during + parser initialization (TIKA-2389). + + * Upgrade to Jackcess 2.1.8 (TIKA-2380). + + * Upgrade to POI 3.17-beta1 (TIKA-2336). + + * Remove non-ASL-2.0-compatible org.json (TIKA-1804). + + * Allow extraction of <script> elements in HTML as embedded "MACRO". + Users must turn this on via TikaConfig (TIKA-2391). + + * Allow users to turn off extraction of headers and footers + from .doc, .docx, .xls, .xlsx, .xlsb (TIKA-2362) + + * Extract text from charts in .docx, .pptx, .xlsx and .xlsb + (TIKA-2254). + + * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb + (TIKA-1945).
+ + * Fix bug in tika-server that led to an attempt to close the + input stream twice (TIKA-2384). + + * Enable base32 encoding of digests and enable BouncyCastle implementations + of digest algorithms (TIKA-2386). + + * Add snap builds to codebase (TIKA-2401) + + * Canonical Mimetype of WAVE audio changed to match RFC 2361 defined + version, audio/vnd.wave, older audio/x-wav remains as an alias + + * Upgrade "provided" xerial to 3.19.3 (TIKA-2412). + + * Upgrade Gson to 2.8.1 (TIKA-2414). + + * Upgrade mime4j to 0.8.1 (TIKA-2413). + + * Mime magic improvements for GraphViz (TIKA-2422), HTML files which + claim to be XML but aren't quite valid XML (TIKA-2419) and QuickTime + / MP4 (TIKA-2418) + +Release 1.15 - 05/23/2017 + + * Tika now has a module for Deep Learning powered by the + DL4J toolkit. The initial included model is for InceptionV3 + and so using this module, natively in Java, Tika can use + Deep learning for metadata/text extraction from Images using + the power of the Inception model (Github-165). + + * A new parser for sentiment analysis using a categorical + (multi-class, angry, sad, neutral, like, love) and binary + (positive/negative) was added leveraging the USC data + science work (TIKA-2016). + + * Tika now has the ability to automatically detect objects in videos, + using OpenCV and Tensorflow (TIKA-2322). + + * Change default behavior to parse embedded documents even if the user + forgets to specify a Parser.class in the ParseContext (TIKA-2096). + Users who wish to parse only the container document should set + an EmptyParser as the Parser.class in the ParseContext. + + * Change default behavior of Office Parsers to _not_ extract + Macros. User needs to setExtractMacros to "true" (TIKA-2302). + + * Added tika-eval module (TIKA-1332). + + * Unified logging across Tika: SLF4J as logging API, Apache Log4j as + implementation with JCL and JUL bridges in standalone tools like + tika-app, tika-batch and tika-server (TIKA-2245).
+ + * Add parser for XLSB files (TIKA-1195). + + * Add parsers for EMF/WMF files (TIKA-2246/TIKA-2247). + + * Add parsers for WordPerfect and QuattroPro (.qpw) files. + Contributed by Pascal Essiembre (TIKA-1946 and TIKA-2228). + + * Add experimental SAX parser for .pptx files. To select this parser, + set useSAXPptxExtractor(true) on OfficeParserConfig (TIKA-2210). + + * Add experimental SAX parser for .docx files. To select this parser, + set useSAXDocxExtractor(true) on OfficeParserConfig (TIKA-1321, TIKA-2191). + + * Add mime detection and parser for Word 2006ML format (TIKA-2179). + + * Bug fix for WordPerfect via Pascal Essiembre (TIKA-2352). + + * Added "text-main" equivalent option to tika-server via + /tika/main (TIKA-2343). + + * Enabled configuration of the EncodingDetector used by + parsers that extend AbstractEncodingDetectorParser (TIKA-2273). + + * Prevent easily preventable OOMs for both detection and parsing + of some compression formats (TIKA-2330). + + * Extract images and thumbnails from ODT via Sam Bayer (TIKA-2295). + + * Fix potential NPE in FeedParser via Julien Nioche (TIKA-2269). + + * Official mime types for BMP, EMF and WMF have been registered with + IANA, so switch to these (image/bmp image/emf image/wmf) (TIKA-2250) + + * Be more parsimonious with BufferedInputStreams via Josh Hight + (TIKA-2244). + + * Enable handling of hyphenated language codes in TesseractOCRParser + via Graham Russell (TIKA-2231). + + * Improve style tags in ODT (TIKA-2242). + + * Add container detection for embedded MSEquation files (TIKA-2238). + + * Add parsing of JBIG2 and extraction of JBIG2 from PDFs when + required dependencies are added to class path by user. + Contributed by Pascal Essiembre (TIKA-2232). + + * Mime magic for the OneNote family (.one / .onetoc / .onepkg), no parser + (TIKA-2224). + + * Add configurability of "preserve-interword-spacing" to + TesseractOCRParser (TIKA-2190). 
+ + * Upgrade to PDFBox 2.0.6 and JempBox 1.8.13 (TIKA-2209/TIKA-2236/TIKA-2361). + + * Refactor MockParser to consolidate service loading + and mime types into tika-core/src/test (TIKA-2195). + + * Enabled extraction of embedded objects from headers, footers, + footnotes, endnotes and comments in legacy .docx parser (TIKA-2192). + + * Allow extraction of PDActions (including Javascript) from + PDFs (TIKA-2090). This is turned off by default. Users + must setExtractActions(true) on the PDFParserConfig. + + * Change default behavior in experimental .docx parser to ignore + deleted text to align with .doc (TIKA-2187). + + * Upgrade to POI 3.16 (TIKA-2116, TIKA-2181, TIKA-2329). + + * Allow configuration of timeout for ForkParser (TIKA-2170). + + * Add extraction of .jpx inline images from PDFs when required + dependencies are added by user to class path (TIKA-2175). + + * Add .jpx, .jp2, .ppm to formats handled by Tesseract (TIKA-2174). + + * Upgrade SQLite "provided" dependency to 3.16.1 (TIKA-2334). + + * Update Apache CXF version to 3.0.12 (TIKA-2292). + + * Add Lingo24 Language Detector (TIKA-2297). + + * Further mime magic for WebVTT (TIKA-1772) + + * Extend support for increased PSM options up to 13 for modern + versions of Tesseract (TIKA-2357). + + * Prevent potential resource leak by closing TrueTypeFont + via Cameron Rollheiser (TIKA-2370). + +Release 1.14 - 10/19/2016 + + * Extract all headers from MSG/RFC822 (TIKA-2122). + + * Upgrade metadata-extractor to 2.9.1 (TIKA-2113). + + * Extract PDF DocInfo metadata into separate keys to prevent + overwriting by XMP metadata (TIKA-2057). + + * Re-enable fileUrl for tika-server (TIKA-2081). If you choose + to use this feature, beware of the security vulnerabilities! + See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271 + + * Add Tesseract's hOCR output format as an option, via Eric Pugh + (TIKA-2093) + + * Extract macros from MSOffice files (TIKA-2069).
+ + * Maintain passed-in mime in TXTParser (TIKA-2047). + + * Upgrade to POI 3.15 (TIKA-2013). + + * Upgrade to PDFBox 2.0.3 (TIKA-2051). + + * Fix hyperlinks with formatting in DOC and DOCX (TIKA-1255 + and TIKA-2078) + + * Tika now is integrated with the Tensorflow library from Google + and it can use its Inception v3 image classification model to + identify objects in images (TIKA-1993). + + * Parser configuration is now type-safe and parameters for parsers + can have assigned types (TIKA-1508, TIKA-1986). + + * Prevent OOM/permanent hang on some corrupt CHM files (TIKA-2040). + + * Upgrade ICU4J charset detection components to fix multithreading + bug (TIKA-2041). + + * Upgrade to Jackcess 2.1.4 (TIKA-2039). + + * Maintain more significant digits in cells of "General" format + in XLS and XLSX (TIKA-2025). + + * Avoid mark/reset issues when extracting or detecting embedded resources + in RFC822 emails (TIKA-2037). + + * Improving accuracy of Tesseract for better extraction of numeric + and alphanumeric text from images (TIKA-2021, TIKA-2031). + + * Improve extraction of embedded documents from PPT, PPTX and XLSX + (TIKA-2026). + + * Add parser for applefile (AppleSingle) (TIKA-2022). + + * Add mime types, mime magic and/or globs for: + * Endnote Import File (TIKA-2011) + * DJVU files (TIKA-2009) + * MS Owner File (TIKA-2008) + * Windows Media Metafile (TIKA-2004) + * iCal and vCalendar (TIKA-2006) + * MBOX (TIKA-2042) + * Stata DTA (TIKA-2064) + + * Add configurable maximum threshold for number of events extracted + from the XMP Media Management Schema in JempboxExtractor (TIKA-1999). + + * Integrate TesseractOCR with full page image rendering for PDFs (TIKA-1994). + + * Add mime detection via Nick C and parser for DBF files (TIKA-1513). + + * Add mime detection and parsers for MSOffice 2003 XML Word + and Excel formats (TIKA-1958). + + * Extract hyperlinks from PPT, PPTX, XLSX (TIKA-1454).
+ + * Upgrade to Commons Compress 1.12 (supports progress on TIKA-1358) + +Release 1.13 - 05/08/2016 + + * Upgrade to PDFBox 2.0.1 (TIKA-1285/TIKA-1959). + MAJOR CHANGES in PDFParser: + * The classic sequential parser is no longer available. + * Tiff files are no longer extracted by default. See + https://pdfbox.apache.org/2.0/dependencies.html#optional-components + for optional components to process Tiff files. + * Some truncated/corrupted files that had some content extracted + with 1.8.x may have no content extracted in 2.0.x (see TIKA-1912). + + * The MIT-NLP Information Extraction (MITIE) Named Entity + Recognition (NER) system is now supported in Tika + (TIKA-1913, GitHub-108). + + * Tika now supports the use of the Yandex translation + service (TIKA-1943, GitHub-106). + + * Tika now uses NER to extract scientific measurements + from text using either GROBID Quantities which uses + conditional random fields and NLTK which uses regular + expressions (TIKA-1917, GitHub-104). + + * Fixed JournalParser to handle null responses from + GROBID and to log a message (TIKA-1925). + + * Refactored Language Detector into tika-langdetect module, + added default N-Gram implementation, Optimaize Lang + Detector and MIT Text.jl implementation + (TIKA-1872, TIKA-1696, TIKA-1723). + + * Extract metadata from MP4 videos whether or not the + PooledTimeSeries parser is available via Aditya Dhulipala + (TIKA-1844). + + * Fix NPE when trying to get embedded image identifier in + WordParser (TIKA-1956). + + * Improvements to MIME database for detection of Scientific + and other formats present in the TREC-DD-Polar dataset + (TIKA-1881, GitHub-85, TIKA-1883, TIKA-1884, TIKA-1886, + TIKA-1882). + + * LinkContentHandler now extracts links from script tags + via Joseph Naegele (TIKA-1937). + + * Handle per page IOExceptions more robustly in PDFParser (TIKA-1948). + + * Upgrade commons-compress to 1.11 (TIKA-1949). + + * Add detection for embedded MSChart.Graph files (TIKA-1033).
+ + * Fix NPE in Sqlite parser from Nick C (TIKA-1927). + + * Fix NPE in Open Document parser from Nick C (TIKA-1916). + + * Upgrade mp4parser's isoparser to 1.1.7 (TIKA-1924 and TIKA-1931). + + * Upgrade BouncyCastle to 1.54 (TIKA-1923). + + * Upgrade Jackcess to 2.1.3 (TIKA-1922). + + * Upgrade Drew Noakes' metadata-extractor to 2.8.1 (TIKA-1921). + + * Upgrade Gson in tika-serialization to 2.6.2 (TIKA-1920). + + * Upgrade commons-cli in tika-batch to 1.3.1 (TIKA-1919). + + * Add XMPMM support to PDFParser and JpegParser via Jempbox (TIKA-1894). + + * Move serialization of TikaConfig to tika-core and enable dumping + of the config file via tika-app (TIKA-1657). + + * Tika now incorporates the Natural Language Toolkit (NLTK) from the + Python community as an option for Named Entity Recognition (TIKA-1876). + + * Add support for XFA extraction via Pascal Essiembre (TIKA-1857). + + * Upgrade to sqlite-jdbc 3.8.11.2 (TIKA-1861). NOTE: this dependency + is still <scope>provided</scope>. You need to include this dependency + in order to parse sqlite files. + + * Upgrade to POI 3.15-beta1 (TIKA-1895). + + * Upgrade to Jackson 2.7.1 (TIKA-1869). + + * Upgrade to Apache SIS 0.6 (TIKA-1878). + + * RichTextContentHandler moved from the Server package to Core (TIKA-1870). + + * Added ZeroSizeFileDetector to support application/x-zerovalue via + Adesh Gupta (TIKA-1885). + + * Addition of types information to Grobid quantities parser via + Can Menekse (TIKA-1965). + +Release 1.12 - 01/24/2016 + + * Support for iFrames and element link extraction is provided in + the link Content Handler (TIKA-1835). + + * Slide notes are now linked to the slide XHTML in the PPT output + (TIKA-1840). + + * JSON tests in Tika server were updated to remove impossible casts + (Github-73). + + * Fix bug in GeoTopicParser where NER is reused instead of instantiated + with each request (TIKA-1834).
+ + * Upgrade rome to 1.5.1 && Downgrade Rome dependency to 0.9 to avoid + nasty NPE (TIKA-1820, TIKA-1516) + + * The NamedEntityParser was enhanced to generate text content + in addition to metadata (TIKA-1815, TIKA-1816). + + * A significant speed-up is made to the GeoTopicParser by + using the new REST server capabilities from Lucene Geo + Gazetteer (TIKA-1803). + + * A parser to compute motion properties in Videos, e.g., + Histogram of Oriented Gradients and Histogram of Optical Flows + using the Pooled Time Series algorithm, was added (TIKA-1798). + + * Provide NamedEntityParser which exposes Named Entity Recognition + from OpenNLP and Stanford NER providers (TIKA-1787, GitHub-61, + GitHub-62). + + * Allow XHTMLContentHandler to pass attributes of html element + via Markus Jelsma (TIKA-1782). + + * Fix regression with spacing in PPT via Andreas Beeker (TIKA-1777). + + * Tika Facade parse methods for Path and File added which take a + Metadata object, to mirror the existing InputStream one (GitHub-60) + + * GeoParser fix for loading the NER model from a jar file (TIKA-1791) + + +Release 1.11 - 10/18/2015 + + * Java7 API support for allowing java.nio.file.Path as method arguments + was added to Tika and to ParsingReader, TikaFileTypeDetector, and to + Tika Config (TIKA-1745, TIKA-1746, TIKA-1751). + + * MIME support was added for WebVTT: The Web Video Text Tracks Format + files (TIKA-1772). + + * MIME magic improved to ensure emails detected as message/rfc822 + (TIKA-1771). + + * Upgrade to Jackcess Encrypt 2.1.1 to avoid binary incompatibility + with Bouncy Castle (TIKA-1736). + + * Make div and other markup more consistent between PPT and + PPTX (TIKA-1755). + + * Parse multiple authors from MSOffice's semi-colon delimited + author field (TIKA-1765). + + * Include CTAKESConfig.properties within tika-parsers resources + by default (TIKA-1741). 
+ + * Prevent infinite recursion when processing inline images + in PDF files by limiting extraction of duplicate images + within the same page (TIKA-1742). + + * Upgrade to POI 3.13-final (via Andreas Beeker) (TIKA-1707). + + * Upgraded tika-batch to use Path throughout (TIKA-1747 and + TIKA-1754). + + * Upgraded to Path in TikaInputStream (via Yaniv Kunda) (TIKA-1744). + + * Changed default content handler type for "/rmeta" in tika-server + to "xml" to align with "-J" option in tika-app. + Clients can now specify handler types via PathParam. (TIKA-1716). + + * The fantastic GROBID (or Grobid) GeneRation Of BIbliographic Data + for machine learning from PDF files is now integrated as a + Tika parser (TIKA-1699, TIKA-1712). + + * The ability to specify the Tesseract Config Path was added + to the OCR Parser (TIKA-1703). + + * Upgraded to ASM 5.0.4 (TIKA-1705). + + * Corrected Tika Config XML detector definition explicit loading + of MimeTypes (TIKA-1708) + + * In Tika Parsers, Batch, Server, App and Examples, use Apache + Commons IO instead of inlined ex-Commons classes, and the Java 7 + Standard Charset definitions (TIKA-1710) + + * Upgraded to Commons Compress 1.10, which enables zlib compressed + archives support (TIKA-1718) + + +Release 1.10 - 8/1/2015 + + * Tika Config XML can now be used to create composite detectors, + and exclude detectors that DefaultDetector would otherwise + have used. This brings support in-line with Parsers. (TIKA-1702) + + * Reverted to legacy sort order of parsers that was + mistakenly reversed in Tika 1.9 (TIKA-1689). + + * Upgrade to POI 3.13-beta1 (TIKA-1667). + + * Upgrade to PDFBox 1.8.10 (TIKA-1588). + + * MimeTypes now tries to find a registered type with and + without parameters (TIKA-1692). + + * Added more robust error handling for encoding detection + of .MSG files (TIKA-1238). + + * Fixed bug in Tika's use of the Jackcess parser that + prevented reading of v97 Access files (TIKA-1681).
+ + * Upgrade xerial.org's sqlite-jdbc to 3.8.10.1. NOTE: + as of Tika 1.9, this jar is "provided." Make sure + to upgrade your provided jar! (TIKA-1687). + + * Add header/footer extraction to xls (via Aeham Abushwashi) + (TIKA-1400). + + * Drop the source file name from the embedded file path in + RecursiveParserWrapper's "X-TIKA:embedded_resource_path" + (TIKA-1673). + + * Upgraded to Java 7 (TIKA-1536). + + * Non-standards compliant emails are now correctly detected + as message/rfc822 (TIKA-1602). + + * Added parser for MS Access files via Jackcess. Many thanks + to Health Market Science, Brian O'Neill and James Ahlborn + for relicensing Jackcess to Apache v2! (TIKA-1601) + + * GDALParser now correctly sets "nitf" as a supported + MediaType (TIKA-1664). + + * Added DigestingParser to calculate digest hashes + and record them in metadata. Integrated with + tika-app and tika-server (TIKA-1663). + + * Fixed ZipContainerDetector to detect all IPA files + (TIKA-1659). + + +Release 1.9 - 6/6/2015 + + * The ability to use the cTAKES clinical text + knowledge extraction system for biomedical data is + now included as a Tika parser (TIKA-1645, TIKA-1642). + + * Tika-server allows a user to specify the Tika config + from the command line (TIKA-1652, TIKA-1426). + + * Matlab file detection has been improved (TIKA-1634). + + * The EXIFTool was added as an External parser + (TIKA-1639). + + * If FFMPEG is installed and on the PATH, it is a + usable Parser in Tika now (TIKA-1510). + + * Fixes have been applied to the ExternalParser to make + it functional (TIKA-1638). + + * Tika service loading can now be more verbose with the + org.apache.tika.service.error.warn system property (TIKA-1636). + + * Tika Server now allows for metadata extraction from remote + URLs and in addition it outputs the detected language as a + metadata field (TIKA-1625). + + * OUTPUT_FILE_TOKEN not being replaced in ExternalParser + contributed by Pascal Essiembre (TIKA-1620). 
+ + * Tika REST server now supports language identification + (TIKA-1622). + + * All of the example code from the Tika in Action book has + been donated to Tika and added to tika-examples (TIKA-1562). + + * Tika server now logs errors determining ContentDisposition + (TIKA-1621). + + * An algorithm for using Byte Histogram frequencies to construct + a Neural Network and to perform MIME detection was added + (TIKA-1582). + + * A Bayesian algorithm for MIME detection by probabilistic + means was added (TIKA-1517). + + * Tika now incorporates the Apache Spatial Information + System capability of parsing Geographic ISO 19139 + files (TIKA-443). It can also detect those files as + well. + + * Update the MimeTypes code to support inheritance + (TIKA-1535). + + * Provide ability to parse and identify Global Change + Master Directory Interchange Format (GCMD DIF) + scientific data files (TIKA-1532). + + * Improvements to detect CBOR files by extension (TIKA-1610). + + * Change xerial.org's sqlite-jdbc jar to "provided" (TIKA-1511). + Users will now need to add sqlite-jdbc to their classpath for + the Sqlite3Parser to work. + + * ExternalParser.check now catches (suppresses) SecurityException + and returns false, so it's OK to run Tika with a security policy + that does not allow execution of external processes (TIKA-1628). + +Release 1.8 - 4/13/2015 + + * Fix null pointer when processing ODT footer styles (TIKA-1600). + + * Upgrade to com.drewnoakes' metadata-extractor to 2.0 and + add parser for webp metadata (TIKA-1594). + + * Duration extracted from MP3s with no ID3 tags (TIKA-1589). + + * Upgraded to PDFBox 1.8.9 (TIKA-1575). + + * Tika now supports the IsaTab data standard for bioinformatics + both in terms of MIME identification and in terms of parsing + (TIKA-1580). + + * Tika server can now enable CORS requests with the command line + "--cors" or "-C" option (TIKA-1586). + + * Update jhighlight dependency to avoid using LGPL license. 
Thanks to + @kkrugler for his great contribution (TIKA-1581). + + * Updated HDF and NetCDF parsers to output file version in + metadata (TIKA-1578 and TIKA-1579). + + * Upgraded to POI 3.12-beta1 (TIKA-1531). + + * Added tika-batch module for directory to directory batch + processing. This is a new, experimental capability, and the API will + likely change in future releases (TIKA-1330). + + * Translator.translate() Exceptions are now restricted to + TikaException and IOException (TIKA-1416). + + * Tika now supports MIME detection for Microsoft Enhanced + Metafiles (EMF) (TIKA-1554). + + * Tika has improved delineation in XML and HTML MIME detection + (TIKA-1365). + + * Upgraded the Drew Noakes metadata-extractor to version 2.7.2 + (TIKA-1576). + + * Added basic style support for ODF documents, contributed by + Axel Dörfler (TIKA-1063). + + * Move Tika server resources and writers to separate + org.apache.tika.server.resource and writer packages (TIKA-1564). + + * Upgrade UCAR dependencies to 4.5.5 (TIKA-1571). + + * Fix Paths in Tika server welcome page (TIKA-1567). + + * Fixed infinite recursion while parsing some PDFs (TIKA-1038). + + * XHTMLContentHandler now properly passes along body attributes, + contributed by Markus Jelsma (TIKA-995). + + * TikaCLI option --compare-file-magic to report mime types known to + the file(1) tool but not known / fully known to Tika. + + * MediaTypeRegistry support for returning known child types. + + * Support for excluding certain Parsers from being + used by DefaultParser via the Tika Config file, using the new + parser-exclude tag (TIKA-1558). + + * Detect Global Change Master Directory (GCMD) Directory + Interchange Format (DIF) files (TIKA-1561). + + * Tika's JAX-RS server can now return stacktraces for + parse exceptions (TIKA-1323). + + * Added MockParser for testing handling of exceptions, errors + and hangs in code that uses parsers (TIKA-1553). + + * The ForkParser service was removed from Activator; this rolls back TIKA-1354.
+ + * Increased the speed of language identification by + a factor of two -- contributed by Toke Eskildsen (TIKA-1549). + + * Added parser for Sqlite3 db files. Some users will need to + exclude the dependency on xerial.org's sqlite-jdbc because + it contains native libs (TIKA-1511). + + * Use POST instead of PUT for tika-server form methods + (TIKA-1547). + + * A basic wrapper around the UNIX file command was + added to extract Strings. In addition, a parser to + handle Strings parsing from octet-streams using Latin1 + charsets was added (TIKA-1541, TIKA-1483). + + * Add test files and detection mechanism for Gridded + Binary (GRIB) files (TIKA-1539). + + * The RAR parser was updated to handle Chinese characters + using the functionality provided by allowing encoding to + be used within ZipArchiveInputStream (TIKA-936). + + * Fix out of memory error in surefire plugin (TIKA-1537). + + * Build a parser to extract data from GRIB formats (TIKA-1423). + + * Upgrade to Commons Compress 1.9 (TIKA-1534). + + * Include media duration in metadata parsed by MP4Parser (TIKA-1530). + + * Support password protected 7zip files (using a PasswordProvider, + in keeping with the other password supporting formats) (TIKA-1521). + + * Password protected Zip files should not trigger an exception (TIKA-1028). + +Release 1.7 - 1/9/2015 + + * Fixed resource leak in OutlookPSTParser that caused TikaException + when invoked via AutoDetectParser on Windows (TIKA-1506). + + * HTML tags are properly stripped from content by FeedParser + (TIKA-1500). + + * Tika Server support for selecting a single metadata key; + wrapped MetadataEP into MetadataResource (TIKA-1499). + + * Tika Server support for JSON and XMP views of metadata (TIKA-1497). + + * Tika Parent uses dependency management to keep duplicate + dependencies in different modules the same version (TIKA-1384). + + * Upgraded slf4j to version 1.7.7 (TIKA-1496).
+ + * Tika Server support for RecursiveParserWrapper's JSON output + (endpoint=rmeta) equivalent to (TIKA-1451's) -J option + in tika-app (TIKA-1498). + + * Tika Server support for providing the password for files on a + per-request basis through the Password http header (TIKA-1494). + + * Simple support for the BPG (Better Portable Graphics) image format + (TIKA-1491, TIKA-1495). + + * Prevent exceptions from being thrown for some malformed + mp3 files (TIKA-1218). + + * Reformat pom.xml files to use two spaces per indent (TIKA-1475). + + * Fix warning of slf4j logger on Tika Server startup (TIKA-1472). + + * Tika CLI and GUI now have option to view JSON rendering of output + of RecursiveParserWrapper (TIKA-1451). + + * Tika now integrates the Geospatial Data Abstraction Library + (GDAL) for parsing hundreds of geospatial formats (TIKA-605, + TIKA-1503). + + * ExternalParsers can now use Regexs to specify dynamic keys + (TIKA-1441). + + * Thread safety issues in ImageMetadataExtractor were resolved + (TIKA-1369). + + * The ForkParser service is now registered in Activator + (TIKA-1354). + + * The Rome Library was upgraded to version 1.5 (TIKA-1435). + + * Add markup for files embedded in PDFs (TIKA-1427). + + * Extract files embedded in annotations in PDFS (TIKA-1433). + + * Upgrade to PDFBox 1.8.8 (TIKA-1419, TIKA-1442). + + * Add RecursiveParserWrapper (aka Jukka's and Nick's) + RecursiveMetadataParser (TIKA-1329) + + * Add example for how to dump TikaConfig to XML (TIKA-1418). + + * Allow users to specify a tika config file for tika-app (TIKA-1426). + + * PackageParser includes the last-modified date from the archive + in the metadata, when handling embedded entries (TIKA-1246) + + * Created a new Tesseract OCR Parser to extract text from images. + Requires installation of Tesseract before use (TIKA-93). 
+ + * Basic parser for older Excel formats, such as Excel 4, 5 and 95, + which can get simple text, and metadata for Excel 5+95 (TIKA-1490) + + +Release 1.6 - 08/31/2014 + + * Parse output should indicate which Parser was actually used + (TIKA-674). + + * Use the forbidden-apis Maven plugin to check for unsafe Java + operations (TIKA-1387). + + * Created an ExternalTranslator class to interface with command + line Translators (TIKA-1385). + + * Created a MosesTranslator as a subclass of ExternalTranslator + that calls the Moses Decoder machine translation program (TIKA-1385). + + * Created the tika-example module. It will have examples of how to + use the main Tika interfaces (TIKA-1390). + + * Upgraded to Commons Compress 1.8.1 (TIKA-1275). + + * Upgraded to POI 3.11-beta1 (TIKA-1380). + + * Tika now extracts SDTCell content from tables in .docx files (TIKA-1317). + + * Tika now supports detection of the Persian/Farsi language. + (TIKA-1337) + + * The Tika Detector interface is now exposed through the JAX-RS + server (TIKA-1336). + + * Tika now has support for parsing binary Matlab files as part of + our larger effort to increase the number of scientific data formats + supported. (TIKA-1327) + + * The Tika Server URLs for the unpacker resources have been changed, + to bring them under a common prefix (TIKA-1324). The mapping is + /unpacker/{id} -> /unpack/{id} + /all/{id} -> /unpack/all/{id} + + * Added module and core Tika interface for translating text between + languages and added a default implementation that calls Microsoft's + translate service (TIKA-1319) + + * Added a Translator implementation that calls Lingo24's Premium + Machine Translation API (TIKA-1381) + + * Made RTFParser's list handling slightly more robust against corrupt + list metadata (TIKA-1305) + + * Fixed bug in CLI json output (TIKA-1291/TIKA-1310) + + * Added ability to turn off image extraction from PDFs (TIKA-1294).
+ Users must now turn on this capability via the PDFParserConfig. + + * Upgrade to PDFBox 1.8.6 (TIKA-1290, TIKA-1231, TIKA-1233, TIKA-1352) + + * Zip Container Detection for DWFX and XPS formats, which are OPC + based (TIKA-1204, TIKA-1221) + + * Added a user facing welcome page to the Tika Server, which + says what it is, and a very brief summary of what is available. + (TIKA-1269) + + * Added Tika Server endpoints to list the available mime types, + Parsers and Detectors, similar to the --list-<foo> methods on + the Tika CLI App (TIKA-1270) + + * Improvements to NetCDF and HDF parsing to mimic the output of + ncdump and extract text dimensions and spatial and variable + information from scientific data files (TIKA-1265) + + * Extract attachments from RTF files (TIKA-1010) + + * Support Outlook Personal Folders File Format *.pst (TIKA-623) + + * Added mime entries for additional Ogg based formats (TIKA-1259) + + * Updated the Ogg Vorbis plugin to v0.4, which adds detection for a wider + range of Ogg formats, and parsers for more Ogg Audio ones (TIKA-1113) + + * PDF: Images in PDF documents can now be extracted as embedded resources. + (TIKA-1268) + + * Fixed RuntimeException thrown for certain Word Documents (TIKA-1251). + + * CLI: TikaCLI now has another option: --list-parser-details-apt, which outputs + the list of supported parsers in APT format. This is used to generate the list + on the formats page (TIKA-411). + +Release 1.5 - 02/04/2014 + + * Fixed bug in handling of embedded file processing in PDFs (TIKA-1228). + + * Added SourceCodeParser to support java, Groovy, C++ files (TIKA-1224). + + * Updated Tika Server to support multipart/form-data payloads (TIKA-1198). + + * Updated Tika Server to CXF 2.7.8 (TIKA-1197). + + * Updated Tika Server to accept requests over wildcard addresses (TIKA-1196). + + * Added option to use alternate NonSequentialPDFParser (TIKA-1201). + + * Content from PDF AcroForms is now extracted (TIKA-973). 
+ + * Fixed invalid asterisks from master slide in PPT (TIKA-1171). + + * Added test cases to confirm handling of auto-date in PPT and PPTX (TIKA-817). + + * Text from tables in PPT files is once again extracted correctly (TIKA-1076). + + * Text is extracted from text boxes in XLSX (TIKA-1100). + + * Tika no longer hangs when processing Excel files with custom fraction format (TIKA-1132). + + * Disconcerting stacktrace from missing beans no longer printed for some DOCX files (TIKA-792). + + * Upgraded POI to 3.10-beta2 (TIKA-1173). + + * Upgraded PDFBox to 1.8.4 (TIKA-1230). + + * Made HtmlEncodingDetector more flexible in finding meta + header charset (TIKA-1001). + + * Added sanitized test HTML file for local file test (TIKA-1139). + + * Fixed bug that prevented attachments within a PDF from being processed + if the PDF itself was an attachment (TIKA-1124). + + * Text from paragraph-level structured document tags in DOCX files is now extracted (TIKA-1130). + + * RTF: Fixed ArrayIndexOutOfBoundsException when parsing list override (TIKA-1192). + + * CLI: TikaCLI now escapes invalid filename characters as hex + characters (TIKA-1078). + +Release 1.4 - 06/15/2013 + + * Removed a test HTML file with a poorly chosen GPL text in it (TIKA-1129). + + * Improvements to tika-server to allow it to produce text/html and + text/xml content (TIKA-1126, TIKA-1127). + + * Improvements were made to the Compressor Parser to handle gzipped files + that require the decompressConcatenated option set to true (TIKA-1096). + + * Addressed a typographic error that was preventing detection of + awk files (TIKA-1081). + + * Added a new end-point to Tika's JAX-RS REST server that only detects + the media-type based on a small portion of the document submitted + (TIKA-1047). + + * RTF: Ordered and unordered lists are now extracted (TIKA-1062).
+ + * MP3: Audio duration is now extracted (TIKA-991) + + * Java .class files: upgraded from ASM 3.1 to ASM 4.1 for parsing + the Java bytecodes (TIKA-1053). + + * Mime Types: Definitions extended to optionally include Link (URL) and + UTI, along with details for several common formats (TIKA-1012 / TIKA-1083) + + * Exceptions when parsing OLE10 embedded documents, when parsing + summary information from Office documents, and when saving + embedded documents in TikaCLI are now logged instead + of aborting extraction (TIKA-1074) + + * MS Word: line tabular character is now replaced with newline + (TIKA-1128) + + * XML: ElementMetadataHandlers can now optionally accept duplicate + and empty values (TIKA-1133) + +Release 1.3 - 01/19/2013 + + * Mimetype definitions added for more common programming languages, + including common extensions, but not magic patterns. (TIKA-1055) + + * MS Word: When a Word (.doc) document contains embedded files or + links to external documents, Tika now places a <div + class="embedded" id="_XXX"/> placeholder into the XHTML so you can + see where in the main text the embedded document occurred + (TIKA-956, TIKA-1019). Embedded Wordpad/RTF documents are now + recognized (TIKA-982). + + * PDF: Text from pop-up annotations is now extracted (TIKA-981). + Text from bookmarks is now extracted (TIKA-1035). + + * PKCS7: Detached signatures no longer throw NullPointerException + (TIKA-986). + + * iWork: The chart name for charts embedded in numbers documents is + now extracted (TIKA-918). + + * CLI: TikaCLI -m now handles multi-valued metadata keys correctly + (previously it only printed the first value). (TIKA-920) + + * MS Word (.docx): When a Word (.docx) document contains embedded + files, Tika now places a <div class="embedded" id="XXX"/> into the + XHTML so you can see where in the main text the embedded document + occurred.
The id (rId) is included in the Metadata of each + embedded document as the new Metadata.EMBEDDED_RELATIONSHIP_ID + key, and TikaCLI prepends the rId (if present) onto the filename + it extracts (TIKA-989). Fixed NullPointerException when style is + null (TIKA-1006). Text inside text boxes is now extracted + (TIKA-1005). + + * RTF: Page, word, character count and creation date metadata are + now extracted for RTF documents (TIKA-999). + + * MS PowerPoint (.pptx): When a PowerPoint (.pptx) document contains + embedded files, Tika now places a <div class="embedded" id="XXX"/> into the + XHTML so you can see where in the main text the embedded document + occurred. The id (rId) is included in the Metadata of each + embedded document as the new Metadata.EMBEDDED_RELATIONSHIP_ID + key, and TikaCLI prepends the rId (if present) onto the filename + it extracts (TIKA-997, TIKA-1032). + + * MS PowerPoint (.ppt): When a PowerPoint (.ppt) document contains + embedded files, Tika now places a <div class="embedded" id="XXX"/> into the + XHTML so you can see where in the main text the embedded document + occurred (TIKA-1025). Text from the master slide is now extracted + (TIKA-712). + + * MHTML: fixed Null charset name exception when a mime part has an + unrecognized charset (TIKA-1011). + + * MP3: if an ID3 tag was encoded in UTF-16 with only the BOM then on + certain JVMs this would incorrectly extract the BOM as the tag's + value (TIKA-1024). + + * ZIP: placeholders (<div class="embedded" id="<entry name>"/>) are + now left in the XHTML so you can see where each archive member + appears (TIKA-1036). TikaCLI would hit FileNotFoundException when + extracting files that were under sub-directories from a ZIP + archive, because it failed to create the parent directories first + (TIKA-1031). 
+ + * XML: a space character is now added before each element + (TIKA-1048) + +Release 1.2 - 07/10/2012 +--------------------------------- + + * Tika's JAX-RS based Network server now is based on Apache CXF, + which is available in Maven Central and now allows the server + module to be packaged and included in our release + (TIKA-593, TIKA-901). + + * Tika: parseToString now lets you specify the max string length + per-call, in addition to per-Tika-instance. (TIKA-870) + + * Tika now has the ability to detect FITS (Flexible Image Transport System) + files (TIKA-874). + + * Images: Fixed file handle leak in ImageParser. (TIKA-875) + + * iWork: Comments in Pages files are now extracted (TIKA-907). + Headers, footers and footnotes in Pages files are now extracted + (TIKA-906). Don't throw NullPointerException on password + protected iWork files, even though we can't parse their contents + yet (TIKA-903). Text extracted from Keynote text boxes and bullet + points no longer runs together (TIKA-910). Also extract text for + Pages documents created in layout mode (TIKA-904). Table names + are now extracted in Numbers documents (TIKA-924). Content added + to master slides is also extracted (TIKA-923). + + * Archive and compression formats: The Commons Compress dependency was + upgraded from 1.3 to 1.4.1. With this change Tika can now parse also + Unix dump archives and documents compressed using the XZ and Pack200 + compression formats. (TIKA-932) + + * KML: Tika now has basic support for Keyhole Markup Language documents + (KML and KMZ) used by tools like Google Earth. See also + http://www.opengeospatial.org/standards/kml/. (TIKA-941) + + * CLI: You can now use the TIKA_PASSWORD environment variable or the + --password=X command line option to specify the password that Tika CLI + should use for opening encrypted documents (TIKA-943).
+ + * Character encodings: Tika's character encoding detection mechanism was + improved by adding integration to the juniversalchardet library that + implements Mozilla's universal charset detection algorithm. The slower + ICU4J algorithms are still used as a fallback thanks to their wider + coverage of custom character encodings. (TIKA-322, TIKA-471) + + * Charset parameter: Related to the character encoding improvements + mentioned above, Tika now returns the detected character encoding as + a "charset" parameter of the content type metadata field for text/plain + and text/html documents. For example, instead of just "text/plain", the + returned content type will be something like "text/plain; charset=UTF-8" + for a UTF-8 encoded text document. Character encoding information is still + present also in the content encoding metadata field for backwards + compatibility, but that field should be considered deprecated. (TIKA-431) + + * Extraction of embedded resources from OLE2 Office Documents, where + the resource isn't another office document, has been fixed (TIKA-948) + +Release 1.1 - 3/7/2012 +--------------------------------- + + * Link Extraction: The rel attribute is now extracted from + links per the LinkContentHandler. (TIKA-824) + + * MP3: Fixed handling of UTF-16 (two byte) ID3v2 tags (previously + the last character in a UTF-16 tag could be corrupted) (TIKA-793) + + * Performance: Loading of the default media type registry is now + significantly faster. (TIKA-780) + + * PDF: Allow controlling whether overlapping duplicated text should + be removed. Disabling this (the default) can give big + speedups to text extraction and may work around cases where + non-duplicated characters were incorrectly removed (TIKA-767). + Allow controlling whether text tokens should be sorted by their x/y + position before extracting text (TIKA-612); this is necessary for + certain PDFs.
Fixed cases where too many </p> tags appear in the + XHTML output, causing NPE when opening some PDFs with the GUI + (TIKA-778). + + * RTF: Fixed case where a font change would result in processing + bytes in the wrong font's charset, producing bogus text output + (TIKA-777). Don't output whitespace in ignored group states, + avoiding excessive whitespace output (TIKA-781). Binary embedded + content (using \bin control word) is now skipped correctly; + previously it could cause the parser to incorrectly extract binary + content as text (TIKA-782). + + * CLI: New TikaCLI option "--list-detectors", which displays the + mimetype detectors that are available, similar to the existing + "--list-parsers" option for parsers. (TIKA-785). + + * Detectors: The order of detectors, as supplied via the service + registry loader, is now controlled. User supplied detectors are + preferred, then Tika detectors (such as the container aware ones), + and finally the core Tika MimeTypes is used as a backup. This + allows for specific, detailed detectors to take preference over + the default mime magic + filename detector. (TIKA-786) + + * Microsoft Project (MPP): Filetype detection has been fixed, + and basic metadata (but no text) is now extracted. (TIKA-789) + + * Outlook: fixed NullPointerException in TikaGUI when messages with + embedded RTF or HTML content were filtered (TIKA-801). + + * Ogg Vorbis and FLAC: Parser added for Ogg Vorbis and FLAC audio + files, which extract audio metadata and tags (TIKA-747) + + * MP4: Improved mime magic detection for MP4 based formats (including + QuickTime, MP4 Video and Audio, and 3GPP) (TIKA-851) + + * MP4: Basic metadata extracting parser for MP4 files added, which includes + limited audio and video metadata, along with the iTunes media metadata + (such as Artist and Title) (TIKA-852) + + * Document Passwords: A new ParseContext object, PasswordProvider, + has been added.
This provides a way to supply the password for + a document during processing. Currently, only password protected + PDFs and Microsoft OOXML Files are supported. (TIKA-850) + +Release 1.0 - 11/4/2011 +--------------------------------- + +The most notable changes in Tika 1.0 over previous releases are: + + * API: All methods, classes and interfaces that were marked as + deprecated in Tika 0.10 have been removed to clean up the API + (TIKA-703). You may need to adjust and recompile client code + accordingly. The declared OSGi package versions are now 1.0, and + will thus not resolve for client bundles that still refer to 0.x + versions (TIKA-565). + + * Configuration: The context class loader of the current thread is + no longer used as the default for loading configured parser and + detector classes. You can still pass an explicit class loader + to the configuration mechanism to get the previous behaviour. + (TIKA-565) + + * OSGi: The tika-core bundle will now automatically pick up and use + any available Parser and Detector services when deployed to an OSGi + environment. The tika-parsers bundle provides such services + for all the supported file formats for which the upstream parser library + is available. If you don't want to track all the parser libraries as + separate OSGi bundles, you can use the tika-bundle bundle that packages + tika-parsers together with all its upstream dependencies. (TIKA-565) + + * RTF: Hyperlinks in RTF documents are now extracted as an <a + href=...>...</a> element (TIKA-632). The RTF parser is also now + more robust when encountering too many closing {'s vs. opening {'s + (TIKA-733). + + * MS Word: From Word (.doc) documents we now extract optional hyphen + as Unicode zero-width space (U+200B), and non-breaking hyphen as + Unicode non-breaking hyphen (U+2011). (TIKA-711) + + * Outlook: Tika can now also process attachments in Outlook messages.
+ (TIKA-396) + + * MS Office: Performance of extracting embedded office docs was improved. + (TIKA-753) + + * PDF: The PDF parser now extracts paragraphs within each page + (TIKA-742) and can now optionally extract text from PDF + annotations (TIKA-738). There's also an option to enable (the + default) or disable auto-space insertion (TIKA-724). + + * Language detection: Tika can now detect Belarusian, Catalan, + Esperanto, Galician, Lithuanian (TIKA-582), Romanian, Slovak, + Slovenian, and Ukrainian (TIKA-681). + + * Java: Tika no longer ships retrotranslated Java 1.4 binaries along + with the normal ones that work with Java 5 and higher. (TIKA-744) + + * OpenOffice documents: header/footer text is now extracted for text, + presentation and spreadsheet documents (TIKA-736) + +Tika 1.0 relies on the following set of major dependencies (generated using +mvn dependency:tree from tika-parsers): + + org.apache.tika:tika-parsers:bundle:1.0 + +- org.apache.tika:tika-core:jar:1.0:compile + +- edu.ucar:netcdf:jar:4.2-min:compile + | \- org.slf4j:slf4j-api:jar:1.5.6:compile + +- org.apache.james:apache-mime4j-core:jar:0.7:compile + +- org.apache.james:apache-mime4j-dom:jar:0.7:compile + +- org.apache.commons:commons-compress:jar:1.3:compile + +- commons-codec:commons-codec:jar:1.5:compile + +- org.apache.pdfbox:pdfbox:jar:1.6.0:compile + | +- org.apache.pdfbox:fontbox:jar:1.6.0:compile + | +- org.apache.pdfbox:jempbox:jar:1.6.0:compile + | \- commons-logging:commons-logging:jar:1.1.1:compile + +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile + +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile + +- org.apache.poi:poi:jar:3.8-beta4:compile + +- org.apache.poi:poi-scratchpad:jar:3.8-beta4:compile + +- org.apache.poi:poi-ooxml:jar:3.8-beta4:compile + | +- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta4:compile + | | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile + | \- dom4j:dom4j:jar:1.6.1:compile + +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile + 
+- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile + +- asm:asm:jar:3.1:compile + +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile + +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile + +- rome:rome:jar:0.9:compile + \- jdom:jdom:jar:1.0:compile + +The following people have contributed to Tika 1.0 by submitting or commenting +on the issues resolved in this release: + +Andrzej Bialecki +Antoni Mylka +Benson Margulies +Chris A. Mattmann +Cristian Vat +Dave Meikle +David Smiley +Dennis Adler +Erik Hetzner +Ingo Renner +Jeremias Maerki +Jeremy Anderson +Jeroen van Vianen +John Bartak +Jukka Zitting +Julien Nioche +Ken Krugler +Mark Butler +Maxim Valyanskiy +Michael Bryant +Michael McCandless +Nick Burch +Pablo Queixalos +Uwe Schindler +Žygimantas Medelis + + +See http://s.apache.org/Zk6 for more details on these contributions. + + +Release 0.10 - 09/25/2011 +------------------------- + +The most notable changes in Tika 0.10 over previous releases are: + + * A parser for CHM help files was added. (TIKA-245) + + * TIKA-698: Invalid characters are now replaced with the Unicode + replacement character (U+FFFD), whereas before such characters were + replaced with spaces, so you may need to change your processing of + Tika's output to now handle U+FFFD. + + * The RTF parser was rewritten to perform its own direct shallow + parse of the RTF content, instead of using RTFEditorKit from + javax.swing. This fixes several issues in the old parser, + including doubling of Unicode characters in certain cases + (TIKA-683), exceptions on mal-formed RTF docs (TIKA-666), and + missing text from some elements (header/footer, hyperlinks, + footnotes, text inside pictures). + + * Handling of temporary files within Tika was much improved + (TIKA-701, TIKA-654, TIKA-645, TIKA-153) + + * The Tika GUI got a facelift and some extra features (TIKA-635) + + * The apache-mime4j dependency of the email message parser was upgraded + from version 0.6 to 0.7 (TIKA-716). 
The parser also now accepts a + MimeConfig object in the ParseContext as configuration (TIKA-640). + +Tika 0.10 relies on the following set of major dependencies (generated using +mvn dependency:tree from tika-parsers): + + org.apache.tika:tika-parsers:bundle:0.10 + +- org.apache.tika:tika-core:jar:0.10:compile + +- edu.ucar:netcdf:jar:4.2-min:compile + | \- org.slf4j:slf4j-api:jar:1.5.6:compile + +- org.apache.james:apache-mime4j-core:jar:0.7:compile + +- org.apache.james:apache-mime4j-dom:jar:0.7:compile + +- org.apache.commons:commons-compress:jar:1.1:compile + +- commons-codec:commons-codec:jar:1.4:compile + +- org.apache.pdfbox:pdfbox:jar:1.6.0:compile + | +- org.apache.pdfbox:fontbox:jar:1.6.0:compile + | +- org.apache.pdfbox:jempbox:jar:1.6.0:compile + | \- commons-logging:commons-logging:jar:1.1.1:compile + +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile + +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile + +- org.apache.poi:poi:jar:3.8-beta4:compile + +- org.apache.poi:poi-scratchpad:jar:3.8-beta4:compile + +- org.apache.poi:poi-ooxml:jar:3.8-beta4:compile + | +- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta4:compile + | | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile + | \- dom4j:dom4j:jar:1.6.1:compile + +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile + +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile + +- asm:asm:jar:3.1:compile + +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile + +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile + +- rome:rome:jar:0.9:compile + \- jdom:jdom:jar:1.0:compile + +The following people have contributed to Tika 0.10 by submitting or commenting +on the issues resolved in this release: + + Alain Viret + Alex Ott + Alexander Chow + Andreas Kemkes + Andrew Khoury + Babak Farhang + Benjamin Douglas + Benson Margulies + Chris A. 
Mattmann + chris hudson + Chris Lott + Cristian Vat + Curt Arnold + Cynthia L Wong + Dave Brosius + David Benson + Enrico Donelli + Erik Hetzner + Erna de Groot + Gabriele Columbro + Gavin + Geoff Jarrad + Gregory Kanevsky + gunter rombauts + Henning Gross + Henri Bergius + Ingo Renner + Ingo Wiarda + Izaak Alpert + Jan Høydahl + Jens Wilmer + Jeremy Anderson + Joseph Vychtrle + Joshua Turner + Jukka Zitting + Julien Nioche + Karl Heinz Marbaise + Ken Krugler + Kostya Gribov + Luciano Leggieri + Mads Hansen + Mark Butler + Matt Sheppard + Maxim Valyanskiy + Michael McCandless + Michael Pisula + Murad Shahid + Nick Burch + Oleg Tikhonov + Pablo Queixalos + Paul Jakubik + Raimund Merkert + Rajiv Kumar + Robert Trickey + Sami Siren + samraj + Selva Ganesan + Sjoerd Smeets + Stephen Duncan Jr + Tran Nam Quang + Uwe Schindler + Vitaliy Filippov + +See http://s.apache.org/vR for more details on these contributions. + + +Release 0.9 - 02/13/2011 +------------------------ + +The most notable changes in Tika 0.9 over previous releases are: + + * A critical bug that prevented metadata from printing to the + command line when the underlying Parser didn't generate + XHTML output was fixed. (TIKA-596) + + * The 0.8 version of Tika included a NetCDF jar file that pulled + in tremendous amounts of redundant dependencies. This has + been addressed in Tika 0.9 by republishing a minimal NetCDF + jar and changing Tika to depend on that. (TIKA-556) + + * MIME detection for iWork and OpenXML documents has been + improved. (TIKA-533, TIKA-562, TIKA-588) + + * A critical backwards-incompatible bug in PDF parsing that + was introduced in Tika 0.8 has been fixed. (TIKA-548) + + * Support for forked parsing in separate processes was added. + (TIKA-416) + + * Tika's language identifier now supports the Lithuanian + language.
(TIKA-582) + +Tika 0.9 relies on the following set of major dependencies (generated using +mvn dependency:tree from tika-parsers): + + org.apache.tika:tika-parsers:bundle:0.9 + +- org.apache.tika:tika-core:jar:0.9:compile + +- edu.ucar:netcdf:jar:4.2-min:compile + | \- org.slf4j:slf4j-api:jar:1.5.6:compile + +- commons-httpclient:commons-httpclient:jar:3.1:compile + | +- commons-logging:commons-logging:jar:1.1.1:compile (version managed from 1.0.4) + | \- commons-codec:commons-codec:jar:1.2:compile + +- org.apache.james:apache-mime4j:jar:0.6:compile + +- org.apache.commons:commons-compress:jar:1.1:compile + +- org.apache.pdfbox:pdfbox:jar:1.4.0:compile + | +- org.apache.pdfbox:fontbox:jar:1.4.0:compile + | \- org.apache.pdfbox:jempbox:jar:1.4.0:compile + +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile + +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile + +- org.apache.poi:poi:jar:3.7:compile + +- org.apache.poi:poi-scratchpad:jar:3.7:compile + +- org.apache.poi:poi-ooxml:jar:3.7:compile + | +- org.apache.poi:poi-ooxml-schemas:jar:3.7:compile + | | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile + | \- dom4j:dom4j:jar:1.6.1:compile + +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile + +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:compile + +- asm:asm:jar:3.1:compile + +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile + +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile + +- rome:rome:jar:0.9:compile + \- jdom:jdom:jar:1.0:compile + +The following people have contributed to Tika 0.9 by submitting or commenting +on the issues resolved in this release: + + Alex Skochin + Alexander Chow + Antoine L. + Antoni Mylka + Benjamin Douglas + Benson Margulies + Chris A. 
Mattmann + Cristian Vat + Cyriel Vringer + David Benson + Erik Hetzner + Gabriel Miklos + Geoff Jarrad + Jukka Zitting + Ken Krugler + Kostya Gribov + Leszek Piotrowicz + Martijn van Groningen + Maxim Valyanskiy + Michel Tremblay + Nick Burch + paul + Paul Pearcy + Peter van Raamsdonk + Piotr Bartosiewicz + Reinhard Schwab + Scott Severtson + Shinsuke Sugaya + Staffan Olsson + Steve Kearns + Tom Klonikowski + Žygimantas Medelis + +See http://s.apache.org/qi for more details on these contributions. + + +Release 0.8 - 11/07/2010 +------------------------ + +The most notable changes in Tika 0.8 over previous releases are: + + * Language identification is now dynamically configurable, + managed via a config file loaded from the classpath. (TIKA-490) + + * Tika now supports parsing Feeds by wrapping the underlying + Rome library. (TIKA-466) + + * A quick-start guide for Tika parsing was contributed. (TIKA-464) + + * An approach for plumbing through XHTML attributes was added. (TIKA-379) + + * Media type hierarchy information is now taken into account when + selecting the best parser for a given input document. (TIKA-298) + + * Support for parsing common scientific data formats including netCDF + and HDF4/5 was added (TIKA-400 and TIKA-399). + + * Unit tests for Windows have been fixed, allowing TestParsers + to complete.
(TIKA-398) + +Tika 0.8 relies on the following set of major dependencies (generated using +mvn dependency:tree from tika-parsers): + + org.apache.tika:tika-parsers:bundle:0.8 + +- org.apache.tika:tika-core:jar:0.8:compile + +- edu.ucar:netcdf:jar:4.2:compile + | \- org.slf4j:slf4j-api:jar:1.5.6:compile + +- commons-httpclient:commons-httpclient:jar:3.1:compile + | +- commons-logging:commons-logging:jar:1.1.1:compile (version managed from 1.0.4) + | \- commons-codec:commons-codec:jar:1.2:compile + +- org.apache.commons:commons-compress:jar:1.1:compile + +- org.apache.pdfbox:pdfbox:jar:1.3.1:compile + | +- org.apache.pdfbox:fontbox:jar:1.3.1:compile + | \- org.apache.pdfbox:jempbox:jar:1.3.1:compile + +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile + +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile + +- org.apache.poi:poi:jar:3.7:compile + +- org.apache.poi:poi-scratchpad:jar:3.7:compile + +- org.apache.poi:poi-ooxml:jar:3.7:compile + | +- org.apache.poi:poi-ooxml-schemas:jar:3.7:compile + | | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile + | \- dom4j:dom4j:jar:1.6.1:compile + +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile + +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:compile + +- asm:asm:jar:3.1:compile + +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile + +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile + +- rome:rome:jar:0.9:compile + \- jdom:jdom:jar:1.0:compile + +The following people have contributed to Tika 0.8 by submitting or commenting +on the issues resolved in this release: + + Łukasz Wiktor + Adam Wilmer + Alex Baranau + Alex Ott + André Ricardo + Andrey Barhatov + Andrey Sidorenko + Antoni Mylka + Arturo Beltran + Attila Király + Brad Greenlee + Bruno Dumon + Chris A.
Mattmann + Chris Bamford + Christophe Gourmelon + Dave Meikle + David Weekly + Dmitry Kuzmenko + Erik Hetzner + Geoff Jarrad + Gerd Bremer + Grant Ingersoll + Jan Høydahl + Jean-Philippe Ricard + Jeremias Maerki + Joao Garcia + Jukka Zitting + Julien Nioche + Ken Krugler + Liam O'Boyle + Mads Hansen + Marcel May + Markus Goldbach + Martijn van Groningen + Maxim Valyanskiy + Mike Hays + Miroslav Pokorny + Nick Burch + Otis Gospodnetic + Peter van Raamsdonk + Peter Wolanin + peter_lena...@ibi.com + Piotr Bartosiewicz + Radek + Rajiv Kumar + Reinhard Schwab + rick cameron + Robert Muir + Sanjeev Rao + Simon Tyler + Sjoerd Smeets + Slavomir Varchula + Staffan Olsson + Tom De Leu + Uwe Schindler + Victor Kazakov + +See http://s.apache.org/ab0 for more details on these contributions. + + +Release 0.7 - 3/31/2010 +----------------------- + +The most notable changes in Tika 0.7 over previous releases are: + + * MP3 file parsing was improved, including Channel and SampleRate + extraction and ID3v2 support (TIKA-368, TIKA-372). Further, audio + parsing mime detection was also improved for the MIDI format. (TIKA-199) + + * Tika no longer relies on X11 for its RTF parsing functionality. (TIKA-386) + + * A thread-safety bug in the AutoDetectParser was discovered and + addressed. (TIKA-374) + + * Upgrade to PDFBox 1.0.0. The new PDFBox version improves PDF parsing + performance and fixes a number of text extraction issues. (TIKA-380) + +The following people have contributed to Tika 0.7 by submitting or commenting +on the issues resolved in this release: + + Adam Rauch + Benson Margulies + Brett S. + Chris A. Mattmann + Daan de Wit + Dave Meikle + Durville + Ingo Renner + Jukka Zitting + Ken Krugler + Kenny Neal + Markus Goldbach + Maxim Valyanskiy + Nick Burch + Sami Siren + Uwe Schindler + +See http://tinyurl.com/yklopby for more details on these contributions.
+ + +Release 0.6 - 01/20/2010 +------------------------ + +The most notable changes in Tika 0.6 over the previous release are: + + * Mime-type detection for HTML (and all types) has been improved, so that malformed + HTML files and those HTML files that require a bit more observed content + before the type is properly detected are now correctly identified by + the AutoDetectParser. (TIKA-327, TIKA-357, TIKA-366, TIKA-367) + + * Tika now has an additional OSGi bundle packaging that includes all the + required parser libraries. This bundle package makes it easy to use all + Tika features in an OSGi environment. (TIKA-340, TIKA-342) + + * The Apache POI dependency used for parsing Microsoft Office file formats + has been upgraded to version 3.6. The most visible improvement in this + version is the notably reduced ooxml jar file size. The tika-app jar size + is now down to 15MB from the 25MB in Tika 0.5. (TIKA-353) + + * Handling of character encoding information in input metadata and HTML + <meta> tags has been improved. When no applicable encoding information is + available, the encoding is detected by looking at the input data. + (TIKA-332, TIKA-334, TIKA-335, TIKA-341) + + * Some document types like Excel spreadsheets contain content like + numbers or formulas whose exact text format depends on the current locale. + So far Tika has used the platform default locale in such cases, but + clients can now explicitly specify the locale by passing a Locale instance + in the parse context. (TIKA-125) + + * The default text output encoding of the tika-app jar is now UTF-8 + when running on Mac OS X. This is because the default encoding used + by Java is not compatible with the console application in Mac OS X. + On all other platforms the text output from tika-app still uses + the platform default encoding. (TIKA-324) + + * A flash video (video/x-flv) parser has been added.
(TIKA-328) + + * Handling of Number and Date cell formatting within Microsoft Excel + documents has been added. This includes currencies, percentages and + scientific formats. (TIKA-103) + +The following people have contributed to Tika 0.6 by submitting or commenting +on the issues resolved in this release: + + Andrzej Bialecki + Bertrand Delacretaz + Chris A. Mattmann + Dave Meikle + Erik Hetzner + Felix Meschberger + Jukka Zitting + Julien Nioche + Ken Krugler + Luke Nezda + Maxim Valyanskiy + Niall Pemberton + Peter Wolanin + Piotr B. + Sami Siren + Yuan-Fang Li + +See http://tinyurl.com/yc3dk67 for more details on these contributions. + + +Release 0.5 - 11/14/2009 +------------------------ + +The most notable changes in Tika 0.5 over the previous release are: + + * Improved RDF/OWL mime detection using both MIME magic as well as + pattern matching (TIKA-309) + + * An org.apache.tika.Tika facade class has been added to simplify common + text extraction and type detection use cases. (TIKA-269) + + * A new parse context argument was added to the Parser.parse() method. + This context map can be used to pass things like a delegate parser or + other settings to the parsing process. The previous parse() method + signature has been deprecated and will be removed in Tika 1.0. (TIKA-275) + + * A simple ngram-based language detection mechanism has been added along + with predefined language profiles for 18 languages. (TIKA-209) + + * The media type registry in Tika was synchronized with the MIME type + configuration in the Apache HTTP Server. Tika now knows about 1274 + different media types and can detect 672 of those using 927 file + extension and 280 magic byte patterns. (TIKA-285) + + * Tika now uses the Apache PDFBox version 0.8.0-incubating for parsing PDF + documents. This version is notably better than the 0.7.3 release used + earlier.
(TIKA-158) + +The following people have contributed to Tika 0.5 by submitting or commenting +on the issues resolved in this release: + + Alex Baranov + Bart Hanssens + Benson Margulies + Chris A. Mattmann + Daan de Wit + Erik Hetzner + Frank Hellwig + Jeff Cadow + Joachim Zittmayr + Jukka Zitting + Julien Nioche + Ken Krugler + Maxim Valyanskiy + MRIT64 + Paul Borgermans + Piotr B. + Robert Newson + Sascha Szott + Ted Dunning + Thilo Goetz + Uwe Schindler + Yuan-Fang Li + +See http://tinyurl.com/yl9prwp for more details on these contributions. + + +Release 0.4 - 07/14/2009 +------------------------ + +The most notable changes in Tika 0.4 over the previous release are: + + * Tika has been split into three different components for increased + modularity. The tika-core component contains the key interfaces and + core functionality of Tika, tika-parsers contains all the adapters + to external parser libraries, and tika-app bundles everything together + in a single executable jar file. (TIKA-219) + + * All three Tika components are packaged as OSGi bundles. (TIKA-228) + + * Tika now uses the new Commons Compress library for improved support + of compression and packaging formats like gzip, bzip2, tar, cpio, + ar, zip and jar. (TIKA-204) + + * The memory use of parsing Excel sheets with lots of numbers + has been considerably reduced. (TIKA-211) + + * The AutoDetectParser now has basic protection against "zip bomb" + attacks, where a specially crafted input document can expand to + a practically infinite amount of output text. (TIKA-216) + + * The ParsingReader class can now use a thread pool or a more complex + execution model (java.util.concurrent.Executor) for the background + parsing task. (TIKA-215) + + * Automatic type detection of text- and XML-based documents has been + improved. (TIKA-225) + + * Charset detection functionality from the ICU4J library was inlined + in Tika to avoid the dependency on the large ICU4J jar.
(TIKA-229) + + * Composite parsers like the AutoDetectParser now make sure that any + RuntimeExceptions, IOExceptions or SAXExceptions unrelated to the given + document stream or content handler are converted to TikaExceptions + before being passed to the client. (TIKA-198, TIKA-237) + +The following people have contributed to Tika 0.4 by submitting or commenting +on the issues resolved in this release: + + Chris A. Mattmann + Daan de Wit + Dave Meikle + David Weekly + Jeremias Maerki + Jonathan Koren + Jukka Zitting + Karl Heinz Marbaise + Keith R. Bennett + Maxim Valyanskiy + Niall Pemberton + Robert Burrell Donkin + Sami Siren + Siddharth Gargate + Uwe Schindler + +See http://tinyurl.com/mgv9o3 for more details on these contributions. + + +Release 0.3 - 03/09/2009 +------------------------ + +The most notable changes in Tika 0.3 over the previous release are: + + * Tika now supports mime type glob patterns specified using + standard JDK 1.4 (and beyond) regular expression syntax via the + isregex attribute on the glob tag. See: + + http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html + + for more information. (TIKA-194) + + * Tika now supports the Office Open XML format used by + Microsoft Office 2007. (TIKA-152) + + * All the metadata keys for Microsoft Office document properties are now + included as constants in the MSOffice interface. Clients should use + these constants instead of the raw string values to refer to specific + metadata items. (TIKA-186) + + * Automatic detection of document types in Tika has been improved. + For example Tika can now detect plain text just by looking at the first + few bytes of the document. (TIKA-154) + + * Tika now disables the loading of all external entities in XML files + that it parses as input documents. This improves security and avoids + problems with potentially broken references. (TIKA-185) + + * Tika now replaces all invalid XML characters in the extracted text + content with spaces.
This prevents problems when output from Tika + is processed with XML tools. (TIKA-180) + + * The Tika CLI now correctly flushes its buffers when invoked with the + --text argument. This prevents the end of the text output from being + lost. (TIKA-179) + + * Embedded text in MIDI files is now extracted. For example many karaoke + files contain song lyrics embedded as MIDI text. + + * The text content of Microsoft Outlook message files no longer appears as + multiple copies in the extracted text. (TIKA-197) + + * The ParsingReader class now makes most document metadata available + already before any of the extracted text is consumed. This makes it + easier for example to construct Lucene Document instances that contain + both extracted text and metadata. (TIKA-203) + +See http://tinyurl.com/tika-0-3-changes for a list of all changes in Tika 0.3. + +The following people have contributed to Tika 0.3 by submitting or commenting +on the issues resolved in this release: + + Andrzej Rusin + Chris A. Mattmann + Dave Meikle + Georger Araújo + Guillermo Arribas + Jonathan Koren + Jukka Zitting + Karl Heinz Marbaise + Kumar Raja Jana + Paul Borgermans + Peter Becker + Sébastien Michel + Uwe Schindler + +See http://tinyurl.com/tika-0-3-contributions for more details on +these contributions. + + +Release 0.2 - 12/04/2008 +------------------------ +
[... 253 lines stripped ...]