[RESULT][VOTE] Apache Tika 1.10 Release Candidate #1
Hi Everyone,

Thanks everyone for their votes. The VOTE to release Tika 1.10 RC #1 has passed with the following tally:

+1:
Dave Meikle*
Sergey Beryozkin*
Tim Allison*
Konstantin Gribov*
Chris Mattmann*
Oleg Tikhonov*
Ken Krugler*
Tyler Palsulich*
Hong-Thai Nguyen*

±0: None

-1: None

* = PMC Member

I'll push out the release now.

Cheers,
Dave
[jira] [Updated] (TIKA-776) ExifTool Embedder
[ https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-776: - Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release ExifTool Embedder - Key: TIKA-776 URL: https://issues.apache.org/jira/browse/TIKA-776 Project: Tika Issue Type: New Feature Components: metadata Affects Versions: 1.0 Environment: ExifTool is required (http://www.sno.phy.queensu.ca/~phil/exiftool/) Reporter: Ray Gauss II Labels: embed, exiftool, patch Fix For: 1.11 Attachments: tika-parsers-exiftool-embed-patch.txt This patch adds an ExifTool ExternalEmbedder which builds upon the work in issue TIKA-774 and TIKA-775. In the tika-parsers an ExiftoolExternalEmbedder is added which extends ExternalEmbedder to programmatically create an Embedder which calls the ExifTool command line to embed tika metadata into a file stream and an ExiftoolExternalEmbedderTest unit test is added which embeds several IPTC and XMP fields then parses the resulting file stream to verify the operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1435) Update rome dependency to 1.5
[ https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1435: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Update rome dependency to 1.5 - Key: TIKA-1435 URL: https://issues.apache.org/jira/browse/TIKA-1435 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Johannes Mockenhaupt Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.11 Attachments: netcdf-deps-changes.diff Rome 1.5 has been released to Sonatype (https://github.com/rometools/rome/issues/183). Though the website (http://rometools.github.io/rome/) is blissfully ignorant of that. The update is mostly maintenance, adopting slf4j and generics as well as moving the namespace from _com.sun.syndication_ to _com.rometools_. PR upcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1106) CLAVIN Integration
[ https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1106: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release CLAVIN Integration -- Key: TIKA-1106 URL: https://issues.apache.org/jira/browse/TIKA-1106 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.3 Environment: All Reporter: Adam Estrada Assignee: Chris A. Mattmann Priority: Minor Labels: entity, geospatial, new-parser Fix For: 1.11 I've been evaluating CLAVIN as a way to extract location information from unstructured text. It seems like meshing it with Tika in some way would make a lot of sense. From CLAVIN website... {quote} CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source software package for document geotagging and geoparsing that employs context-based geographic entity resolution. It combines a variety of open source tools with natural language processing techniques to extract location names from unstructured text documents and resolve them against gazetteer records. Importantly, CLAVIN does not simply look up location names; rather, it uses intelligent heuristics in an attempt to identify precisely which Springfield (for example) was intended by the author, based on the context of the document. CLAVIN also employs fuzzy search to handle incorrectly-spelled location names, and it recognizes alternative names (e.g., Ivory Coast and Côte d'Ivoire) as referring to the same geographic entity. By enriching text documents with structured geo data, CLAVIN enables hierarchical geospatial search and advanced geospatial analytics on unstructured data. {quote} There was only one other instance of the word clavin mentioned in the ASF jira site so I thought it was definitely worth posting here. https://github.com/Berico-Technologies/CLAVIN -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-987) Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
[ https://issues.apache.org/jira/browse/TIKA-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-987: Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted Key: TIKA-987 URL: https://issues.apache.org/jira/browse/TIKA-987 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Fix For: 1.11 Attachments: picture.doc, picture_3.doc
I have two Word docs, both containing the same drawing, but one has text added. In the first case (picture.doc) the extraction is correct: it contains only an embedded image.wmf, and when I view the image it's correct. In the second case (picture_3.doc) the picture is extracted as "image" (no extension) and is 0 bytes, and there is an invalid character (mapped to the Unicode replacement char) inserted before the image:
{noformat}
<title/>
</head>
<body><p>�<img src="embedded:image1" alt="image1"/></p>
<p/>
<p/>
<p>vehicle </p>
{noformat}
(The text "vehicle" is extracted correctly, though.)
I dug a bit, and with the second doc there is an embedded {SHAPE * MERGEFORMAT} field, which we invoke WordExtractor.handleSpecialCharacterRuns on, and somehow it extracts the 0-byte no-extension image as well as the invalid character. With the first doc there is no field (at least not one that's handled with handleSpecialCharacterRuns...). Otherwise I'm not sure how to fix... it could be something is going wrong in how POI parses the Pictures from PictureSource.
[jira] [Updated] (TIKA-1672) Integrate tika-java7 component
[ https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1672: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Integrate tika-java7 component -- Key: TIKA-1672 URL: https://issues.apache.org/jira/browse/TIKA-1672 Project: Tika Issue Type: Improvement Reporter: Tyler Palsulich Fix For: 1.11 Code requiring Java 7 doesn't need to be in a separate module now that TIKA-1536 (upgrade to Java 7) is done. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1379) error in Tika().detect for xml files with xades signature
[ https://issues.apache.org/jira/browse/TIKA-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1379: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release error in Tika().detect for xml files with xades signature - Key: TIKA-1379 URL: https://issues.apache.org/jira/browse/TIKA-1379 Project: Tika Issue Type: Bug Components: detector Affects Versions: 1.4 Reporter: Alessandro De Angelis Labels: new-parser Fix For: 1.11 we tried to get the mime type of an xml file with xades signature embedded. the result is text/html and not the expected text/xml or application/xml. here is an example of the xml file: {code} VERBALI ad_cod=D69017 batch_id=0 cds_cod=D69 data_app=2013-09-23 VERBALE Id=1 tipologia=Verbale esame VERB_NUM00094853 0003 2/VERB_NUM DATA_APP2013-09-23/DATA_APP DATA_ESA2013-09-23/DATA_ESA AD_CODD69017/AD_COD ADFILOSOFIA DELLA SCIENZA/AD CDS_CODD69/CDS_COD CDSTEATRO E ARTI VISIVE/CDS TIPO_ESA/TIPO_ESA MAT1233456/MAT NOMEPAOLINO/NOME COGNOMEPAPERINO/COGNOME VOTO23.0/VOTO VOTODECOD23/VOTODECOD CAUSALE/CAUSALE TIPO_MODULO/TIPO_MODULO IMG_PATH/IMG_PATH AA_SES_ID2012/AA_SES_ID AD_CFU6.0/AD_CFU NOTA/NOTA ATENEO9/ATENEO ATENEO_DESجامعة البندقية - TEST/ATENEO_DES TIPO_DOCUMENTOVerbale_3/TIPO_DOCUMENTO TITOLARE_PROCEDIMENTOQUI QUO QUA/TITOLARE_PROCEDIMENTO AD_STU_CODD69017/AD_STU_COD AD_STUFILOSOFIA DELLA SCIENZA/AD_STU CDS_STU_CODD69/CDS_STU_COD CDS_STUTEATRO E ARTI VISIVE/CDS_STU DOCENTEQUI QUO QUA/DOCENTE DATA_DOCUMENTO26-09-2013 09:55:53 CEST(+0200)/DATA_DOCUMENTO SOFTWARE_DI_CREAZIONE NOME3/NOME VERSIONE11.09.03/VERSIONE /SOFTWARE_DI_CREAZIONE /VERBALEds:Signature xmlns:ds=http://www.w3.org/2000/09/xmldsig#; Id=sig08744308748201048377 ds:SignedInfo ds:CanonicalizationMethod Algorithm=http://www.w3.org/2006/12/xml-c14n11;/ds:CanonicalizationMethod ds:SignatureMethod Algorithm=http://www.w3.org/2001/04/xmldsig-more#rsa-sha256;/ds:SignatureMethod ds:Reference URI= ds:Transforms ds:Transform 
Algorithm=http://www.w3.org/2002/06/xmldsig-filter2; dsig-xpath:XPath xmlns:dsig-xpath=http://www.w3.org/2002/06/xmldsig-filter2; Filter=subtract/descendant::ds:Signature/dsig-xpath:XPath /ds:Transform ds:Transform Algorithm=http://www.w3.org/TR/1999/REC-xslt-19991116; xsl:stylesheet xmlns:kion=http://www.kion.it/webesse3/multilingua; xmlns:xsl=http://www.w3.org/1999/XSL/Transform; exclude-result-prefixes=kion version=1.0 kion:ml module=FirmaDigitale target=kion/kion:ml xsl:output method=xml/xsl:output xsl:variable name=mostra_ad_figlie select=1/xsl:variable xsl:variable name=verbale_root select=/VERBALI/VERBALE/xsl:variable xsl:variable name=sostituzione_root select=/VERBALI/VERBALE/SOSTITUZIONE_DOCUMENTO/xsl:variable xsl:variable name=RAGG_ROOT select=/VERBALI/VERBALE/RAGGRUPPAMENTO/xsl:variable xsl:variable name=COMM_ROOT select=/VERBALI/VERBALE/COMMISSIONE/xsl:variable xsl:template match=/ html head meta content=text/html;charset=UTF-8 http-equiv=Content-Type/meta xsl:choose xsl:when test=$sostituzione_root titleDichiarazione conformità Verbale Esame/title /xsl:when xsl:otherwise titleVerbalizzazione esame/title /xsl:otherwise /xsl:choose style type=text/css td {font-family: Arial; font-size:10pt;} div {font-family: Arial; font-size:10pt;} pre {font-family: Arial; font-size:10pt;} /style /head body table xsl:choose xsl:when test=$sostituzione_root trtd align=center colspan=2bigstrongxsl:value-of select=$verbale_root/ATENEO_DES/xsl:value-of/strong/bigbr/br/td/tr trtd align=center
[jira] [Updated] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE
[ https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1308: Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Support in memory parse mode(don't create temp file): to support run Tika in GAE Key: TIKA-1308 URL: https://issues.apache.org/jira/browse/TIKA-1308 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: jefferyyuan Labels: gae Fix For: 1.11
I am trying to use Tika in GAE and wrote a simple servlet to extract metadata from a JPEG:
{code}
String urlStr = req.getParameter("imageUrl");
byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
Metadata metadata = new Metadata();
BodyContentHandler ch = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(bais, ch, metadata, new ParseContext());
bais.close();
{code}
This fails with the exception:
{code}
Caused by: java.lang.SecurityException: Unable to create temporary file
 at java.io.File.createTempFile(File.java:1986)
 at org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
 at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
 at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
{code}
I checked the code: org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, Metadata, ParseContext) creates a temp file from the input stream. I can understand why Tika creates a temp file from the stream: so that Tika can parse it multiple times. But as GAE and other cloud servers are getting more popular, is it possible to avoid creating the temp file? Instead we could copy the original stream to a byte-array stream, so Tika can still parse it multiple times.
This will limit the file size, as Tika keeps the whole file in memory, but it can make Tika work in GAE and maybe other cloud servers. We could add a parameter to parser.parse to indicate whether to parse in memory only.
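The byte-array approach proposed in TIKA-1308 can be sketched with the JDK alone. This is a minimal illustration, not Tika code: the class name InMemoryRereadSketch and its methods are hypothetical, and it assumes the whole input fits in memory (the size limit the reporter already points out).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Tika): buffers a stream once so it can be re-read without a temp file. */
public class InMemoryRereadSketch {
    private final byte[] data;

    /** Reads the whole stream into memory; the caller stays responsible for closing it. */
    public InMemoryRereadSketch(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            this.data = out.toByteArray();
        } catch (IOException e) {
            // Wrapped only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
    }

    /** Returns a fresh stream over the buffered bytes; may be called any number of times. */
    public ByteArrayInputStream newStream() {
        return new ByteArrayInputStream(data);
    }

    /** Number of bytes held in memory. */
    public int size() {
        return data.length;
    }

    public static void main(String[] args) {
        byte[] sample = "not a real JPEG".getBytes(StandardCharsets.UTF_8);
        InMemoryRereadSketch buffered = new InMemoryRereadSketch(new ByteArrayInputStream(sample));
        // Two independent passes over the same bytes, with no temporary file involved.
        System.out.println(buffered.newStream().read() == buffered.newStream().read());
        System.out.println(buffered.size());
    }
}
```

Each call to newStream() starts at the first byte, which is the property a parser needs when it wants to read the same content more than once.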
[jira] [Updated] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment
[ https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-894: - Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Add webapp mode for Tika Server, simplifies deployment -- Key: TIKA-894 URL: https://issues.apache.org/jira/browse/TIKA-894 Project: Tika Issue Type: Improvement Components: packaging Affects Versions: 1.1, 1.2 Reporter: Chris Wilson Labels: maven, newbie, patch Fix For: 1.11 Attachments: tika-server-webapp.patch For use in production services, Tika Server should really be deployed as a WAR file, under a reliable servlet container that knows how to run as a system service, for example Tomcat or JBoss. This is especially important on Windows, where I wasted an entire day trying to make TikaServerCli run as some kind of a service. Maven makes building a webapp pretty trivial. With the attached patch applied, mvn war:war should work. It seems to run fine in Tomcat, which makes Windows deployment much simpler. Just install Tomcat and drop the WAR file into tomcat's webapps directory and you're away. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1108) Represent individual slides in pptx
[ https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1108: Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Represent individual slides in pptx Key: TIKA-1108 URL: https://issues.apache.org/jira/browse/TIKA-1108 Project: Tika Issue Type: Improvement Components: parser Reporter: Daniel Bonniot de Ruisselet Fix For: 1.11
When parsing ppt, Tika produces for each slide: <div class="slide">
However, for pptx these seem to be missing; all the text is directly under body.
[jira] [Updated] (TIKA-1688) Tika Version in Metadata
[ https://issues.apache.org/jira/browse/TIKA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1688: Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Tika Version in Metadata Key: TIKA-1688 URL: https://issues.apache.org/jira/browse/TIKA-1688 Project: Tika Issue Type: Improvement Reporter: Paul Ramirez Priority: Minor Fix For: 1.11
Could this be added as X-Tika:version? That way, downstream there would be traceability from an extraction to the Tika version that performed it.
[jira] [Updated] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL
[ https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1696: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Language Identification with Text Processing Toolkit from MITLL --- Key: TIKA-1696 URL: https://issues.apache.org/jira/browse/TIKA-1696 Project: Tika Issue Type: New Feature Components: languageidentifier Reporter: Paul Ramirez Fix For: 1.11 The aim here is to extend the methods for language identification within text. MIT Lincoln Labs has an open source library [1] written in Julia. Having spoken with the MITLL guys there is a possibility that there is a scala version of this library which would make it easier to package in with Tika. At this point I'm not quite sure how many languages this library supports by default but it can be extended when provided some training data. [1] https://github.com/mit-nlp/Text.jl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1616) Tika Parser for GIBS Metadata
[ https://issues.apache.org/jira/browse/TIKA-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1616: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Tika Parser for GIBS Metadata - Key: TIKA-1616 URL: https://issues.apache.org/jira/browse/TIKA-1616 Project: Tika Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.11 [GIBS|https://earthdata.nasa.gov/about-eosdis/science-system-description/eosdis-components/global-imagery-browse-services-gibs] metadata currently consists of simple stuff in the WMTS GetCapabilities request (e.g. http://map1.vis.earthdata.nasa.gov/wmts-arctic/1.0.0/WMTSCapabilities.xml) which includes available layers, extents, time ranges, map projections, color maps, etc. We will eventually have more detailed visualization metadata available in ECHO/CMR which will include linkages to data products, provenance, etc. Some investigation and a Tika parser would be excellent to extract and assimilate GIBS Metadata. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1366) Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse
[ https://issues.apache.org/jira/browse/TIKA-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1366: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse Key: TIKA-1366 URL: https://issues.apache.org/jira/browse/TIKA-1366 Project: Tika Issue Type: Improvement Components: server Reporter: Sergey Beryozkin Priority: Minor Fix For: 1.11 Some of Tika Server services will benefit from optionally supporting JAX-RS 2.0 AsyncResponse -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-891: - Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Use POST in addition to PUT on method calls in tika-server -- Key: TIKA-891 URL: https://issues.apache.org/jira/browse/TIKA-891 Project: Tika Issue Type: Improvement Components: general Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Trivial Labels: newbie Fix For: 1.11 Per Jukka's email: http://s.apache.org/uR It would be a better use of REST/HTTP verbs to use POST to put content to a resource where we don't intend to store that content (which is the implication of PUT). Max suggested adding: {code} @POST {code} annotations to the methods we are currently exposing using PUT to take care of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1425) Automatic batching of Microsoft service calls
[ https://issues.apache.org/jira/browse/TIKA-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1425: Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Automatic batching of Microsoft service calls Key: TIKA-1425 URL: https://issues.apache.org/jira/browse/TIKA-1425 Project: Tika Issue Type: Improvement Components: translation Affects Versions: 1.6 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.11
Right now when I use the following code I get the stack trace at the bottom of this description. This seems to be because the Request URI is too large to make the service request. We need to have a mechanism within the call to Tika.translate which will, on a service-by-service basis, determine the maximum Request URI which can be sent. I believe that this should be on the Tika side, as how else am I meant to know the maximum request size?
{code:title=translator.java|borderStyle=solid}
+Translator translate = new MicrosoftTranslator();
+((MicrosoftTranslator) translate).setId(...);
+((MicrosoftTranslator) translate).setSecret(...);
 for (java.util.Map.Entry<Text, Parse> entry : parseResult) {
 Parse parse = entry.getValue();
 LOG.info("-\nUrl\n---\n");
@@ -201,7 +207,7 @@
 System.out.print(parse.getData().toString());
 if (dumpText) {
 LOG.info("-\nParseText\n-\n");
-System.out.print(parse.getText());
+System.out.print(translate.translate(parse.getText(), "fr"));
 }
{code}
{code:title=stacktrace.log|borderStyle=solid}
Exception in thread "main" java.lang.Exception: [microsoft-translator-api] Error retrieving translation : Server returned HTTP response code: 414 for URL: http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=to=frtext=%D0%A4%D0... ...
at com.memetix.mst.MicrosoftTranslatorAPI.retrieveString(MicrosoftTranslatorAPI.java:202) at com.memetix.mst.translate.Translate.execute(Translate.java:61) at com.memetix.mst.translate.Translate.execute(Translate.java:76) at org.apache.tika.language.translate.MicrosoftTranslator.translate(MicrosoftTranslator.java:104) at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:210) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:228) Caused by: java.io.IOException: Server returned HTTP response code: 414 for URL: http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=to=frtext=%D0%A4%D0%BE%D1%80%D1%83%D0%B... ... at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1675) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1673) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1671) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1244) at com.memetix.mst.MicrosoftTranslatorAPI.retrieveResponse(MicrosoftTranslatorAPI.java:178) at com.memetix.mst.MicrosoftTranslatorAPI.retrieveString(MicrosoftTranslatorAPI.java:199) ... 6 more Caused by: java.io.IOException: Server returned HTTP response code: 414 for URL: http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=to=frtext=%D0%A4%D0%BE... ... 
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468) at com.memetix.mst.MicrosoftTranslatorAPI.retrieveResponse(MicrosoftTranslatorAPI.java:177) ... 7 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
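One way to picture the batching requested in TIKA-1425 is to split the text into chunks whose URL-encoded form stays under a per-service limit, then translate each chunk separately. The sketch below is an assumption-laden illustration rather than the Tika or microsoft-translator-java-api implementation: the class TranslateBatchingSketch, its methods, and the limit used in main are all hypothetical, and the service's real maximum is not stated in the issue.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Tika): splits text so each chunk fits an encoded-length budget. */
public class TranslateBatchingSketch {

    /** Length of the text after form-style URL encoding as UTF-8 (spaces become '+'). */
    public static int encodedLength(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8").length();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /**
     * Splits text on whitespace into chunks whose encoded length is at most maxEncodedLength.
     * A single word that exceeds the limit on its own is emitted as an oversized chunk.
     */
    public static List<String> split(String text, int maxEncodedLength) {
        List<String> chunks = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            String candidate = current.length() == 0 ? word : current + " " + word;
            if (current.length() > 0 && encodedLength(candidate) > maxEncodedLength) {
                // Adding this word would exceed the budget: close the chunk and start a new one.
                chunks.add(current.toString());
                current = new StringBuilder(word);
            } else {
                current = new StringBuilder(candidate);
            }
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 40 is an arbitrary illustrative budget, not the translation service's actual limit.
        for (String chunk : split("Форум разработчиков Apache Tika и Nutch", 40)) {
            System.out.println(encodedLength(chunk) + " " + chunk);
        }
    }
}
```

Measuring the encoded length matters here because non-ASCII text grows when percent-encoded, which matches the Cyrillic text visible in the failing URL above. A real implementation would also need to account for the rest of the request URI and, ideally, split on sentence boundaries so translation quality does not suffer.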
[jira] [Updated] (TIKA-985) Support for HTML5 elements
[ https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-985: - Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Support for HTML5 elements -- Key: TIKA-985 URL: https://issues.apache.org/jira/browse/TIKA-985 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.2 Reporter: Markus Jelsma Fix For: 1.11 Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch, TIKA-985-1.3-3.patch, TIKA-985-1.5.patch TagSoup's schema.tssl does not include some HTML5 elements (e.g. article, section). This prevents some custom ContentHandlers from reading expected elements and/or attributes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1208) Migrate Any23 mime contributions to Tika
[ https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1208: Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Migrate Any23 mime contributions to Tika Key: TIKA-1208 URL: https://issues.apache.org/jira/browse/TIKA-1208 Project: Tika Issue Type: Sub-task Components: mime Reporter: Lewis John McGibbney Fix For: 1.11 Attachments: TIKA-1208.patch
We begin with one of the most obvious areas in which there is overlap. In short, the appeal of this package is the addition of detection for the following types:
- text/n3
- text/rdf+n3
- application/n3
- text/x-nquads
- text/rdf+nq
- text/nq
- application/nq
- text/turtle
- application/x-turtle
- application/turtle
- application/trix
Therefore, although both Tika and Any23 perform Mimetype-related tasks, there is a contribution to be made. This involves the transferral of code pertaining to pattern recognition, Mimetype XML definitions within tika-mimetypes.xml, and a Purifier implementation that removes any blank characters at the head of a file that might prevent its MIME Type detection.
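The Purifier idea mentioned in TIKA-1208 (dropping blank characters at the start of a file before MIME detection) can be illustrated as follows. This is a guess at the general technique, not the Any23 Purifier itself: the class LeadingBlankPurifierSketch is hypothetical, and treating only space, tab, CR and LF as "blank" is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch (not the Any23 Purifier): strips leading blank bytes from a file header. */
public class LeadingBlankPurifierSketch {

    /** Returns a copy of the header with leading ASCII blanks removed. */
    public static byte[] purify(byte[] header) {
        int start = 0;
        while (start < header.length && isBlank(header[start])) {
            start++;
        }
        return Arrays.copyOfRange(header, start, header.length);
    }

    /** Assumption: only space, tab, carriage return and line feed count as blank. */
    private static boolean isBlank(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    public static void main(String[] args) {
        byte[] raw = "\n\n  @prefix ex: <http://example.org/> .".getBytes(StandardCharsets.UTF_8);
        // After purification the content starts with the Turtle/N3 "@prefix" keyword,
        // so a magic-byte pattern anchored at offset 0 could match it.
        System.out.println(new String(purify(raw), StandardCharsets.UTF_8));
    }
}
```

Detection patterns are typically anchored at fixed offsets, so leading whitespace can shift an otherwise recognizable header out of position; stripping it first is the simplest way to keep such patterns usable.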
[jira] [Updated] (TIKA-1508) Add uniformity to parser parameter configuration
[ https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1508: Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Add uniformity to parser parameter configuration Key: TIKA-1508 URL: https://issues.apache.org/jira/browse/TIKA-1508 Project: Tika Issue Type: Improvement Reporter: Tim Allison Fix For: 1.11
We can currently configure parsers by the following means:
1) programmatically by direct calls to the parsers or their config objects
2) sending in a config object through the ParseContext
3) modifying .properties files for specific parsers (e.g. PDFParser)
Rather than scattering the landscape with .properties files for each parser, it would be great if we could specify parser parameters in the main config file, something along the lines of this:
{noformat}
<parser class="org.apache.tika.parser.audio.AudioParser">
  <params>
    <int name="someparam1">2</int>
    <str name="someOtherParam2">something or other</str>
  </params>
  <mime>audio/basic</mime>
  <mime>audio/x-aiff</mime>
  <mime>audio/x-wav</mime>
</parser>
{noformat}
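To make the TIKA-1508 proposal concrete, here is a rough sketch of how typed parameters could be read from such a parser element using the JDK's DOM API. It is an illustration only: the class ParserParamsSketch is hypothetical, the XML layout (a params element holding int and str children with a name attribute) is one reading of the proposal, and nothing here reflects how TikaConfig actually loads configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical sketch (not TikaConfig): reads typed parameters from a proposed parser element. */
public class ParserParamsSketch {

    /** Returns parameter name to value; int elements become Integer, everything else stays a String. */
    public static Map<String, Object> readParams(String parserXml) {
        Map<String, Object> params = new LinkedHashMap<String, Object>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(parserXml.getBytes(StandardCharsets.UTF_8)));
            NodeList paramsNodes = doc.getDocumentElement().getElementsByTagName("params");
            if (paramsNodes.getLength() == 0) {
                return params;
            }
            NodeList children = paramsNodes.item(0).getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                if (node.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace text nodes
                }
                Element element = (Element) node;
                String name = element.getAttribute("name");
                String value = element.getTextContent().trim();
                if ("int".equals(element.getTagName())) {
                    params.put(name, Integer.valueOf(value));
                } else {
                    params.put(name, value);
                }
            }
            return params;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read parser parameters", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<parser class=\"org.apache.tika.parser.audio.AudioParser\">"
                + "<params>"
                + "<int name=\"someparam1\">2</int>"
                + "<str name=\"someOtherParam2\">something or other</str>"
                + "</params>"
                + "<mime>audio/basic</mime>"
                + "</parser>";
        System.out.println(readParams(xml));
    }
}
```

Encoding the type in the element name keeps the config self-describing, so a loader could hand each parser a ready-typed map instead of every parser parsing its own .properties file.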
[jira] [Updated] (TIKA-1417) Create Extract Embedded Images from PDFs Example
[ https://issues.apache.org/jira/browse/TIKA-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1417: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Create Extract Embedded Images from PDFs Example Key: TIKA-1417 URL: https://issues.apache.org/jira/browse/TIKA-1417 Project: Tika Issue Type: Improvement Components: example Reporter: Tyler Palsulich Priority: Minor Fix For: 1.11 Users commonly want to turn on extraction of images embedded in PDFs (e.g. TIKA-1414). Tika has the capability, but it's not clear how to use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1516) Downgrade Rome dependency to 0.9 to avoid nasty NPE
[ https://issues.apache.org/jira/browse/TIKA-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1516: Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Downgrade Rome dependency to 0.9 to avoid nasty NPE Key: TIKA-1516 URL: https://issues.apache.org/jira/browse/TIKA-1516 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.11 Attachments: TIKA-1516.patch
As documented [in this thread|http://www.mail-archive.com/dev%40nutch.apache.org/msg15755.html], Nutch's [parse-tika|https://github.com/apache/nutch/blob/trunk/src/plugin/parse-tika/plugin.xml#L56] uses Rome 1.0; this is inherited directly from the Tika pom.xml for the [same dependency|https://github.com/apache/tika/blob/trunk/tika-parsers/pom.xml#L184]. A downgrade is required.
{code}
java.lang.Exception: java.lang.ExceptionInInitializerError
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.ExceptionInInitializerError
 at com.sun.syndication.io.SyndFeedInput.build(SyndFeedInput.java:136)
 at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:70)
 at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:105)
 at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
 at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:101)
 at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
 at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.NullPointerException
 at java.util.Properties$LineReader.readLine(Properties.java:418)
 at java.util.Properties.load0(Properties.java:337)
 at java.util.Properties.load(Properties.java:325)
 at com.sun.syndication.io.impl.PropertiesLoader.<init>(PropertiesLoader.java:74)
 at com.sun.syndication.io.impl.PropertiesLoader.getPropertiesLoader(PropertiesLoader.java:46)
 at com.sun.syndication.io.impl.PluginManager.<init>(PluginManager.java:54)
 at com.sun.syndication.io.impl.PluginManager.<init>(PluginManager.java:46)
 at com.sun.syndication.feed.synd.impl.Converters.<init>(Converters.java:40)
 at com.sun.syndication.feed.synd.SyndFeedImpl.<clinit>(SyndFeedImpl.java:59)
 ... 16 more
{code}
[jira] [Updated] (TIKA-774) ExifTool Parser
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-774: Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release ExifTool Parser Key: TIKA-774 URL: https://issues.apache.org/jira/browse/TIKA-774 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.0 Environment: Requires ExifTool to be installed (http://www.sno.phy.queensu.ca/~phil/exiftool/) Reporter: Ray Gauss II Labels: features, new-parser, newbie, patch Fix For: 1.11 Attachments: testJPEG_IPTC_EXT.jpg, tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt
Adds an external parser that calls ExifTool to extract extended metadata fields from images and other content types.
In the core project:
An ExifTool interface is added which contains Property objects that define the metadata fields available.
An additional Property constructor is added for the internalTextBag type.
In the parsers project:
An ExiftoolMetadataExtractor is added which does the work of calling ExifTool on the command line and mapping the response to tika metadata fields. This extractor could be called instead of or in addition to the existing ImageMetadataExtractor and JempboxExtractor under TiffParser and/or JpegParser, but those have not been changed at this time.
An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool metadata fields to existing tika and Drew Noakes metadata fields if enabled.
An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag implementations in XML files.
An ExifToolParserTest is added which tests several expected XMP and IPTC metadata values in testJPEG_IPTC_EXT.jpg.
[jira] [Updated] (TIKA-1609) Leverage Google's LibPhonenumber for enhanced phone number extraction and metadata modeling
[ https://issues.apache.org/jira/browse/TIKA-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1609: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Leverage Google's LibPhonenumber for enhanced phone number extraction and metadata modeling --- Key: TIKA-1609 URL: https://issues.apache.org/jira/browse/TIKA-1609 Project: Tika Issue Type: New Feature Components: core Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.11 Google's Libphonenumber can provide us with comprehensive support for modeling phone number metadata properly in Tika. During the development of this patch I realized three things, namely * This is not a parser as such, as phone numbers are not mapped to any particular Mimetype * In addition, there can be many phone numbers per document, so this is most likely a Content Handler of sorts * Tika's Metadata support is currently too restrictive to allow us to persist many complex objects e.g. String, Object. We need to expand Metadata support over and above String, String[]. https://github.com/googlei18n/libphonenumber/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
[ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-819: - Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Make Option to Exclude Embedded Files' Text for Text Content Key: TIKA-819 URL: https://issues.apache.org/jira/browse/TIKA-819 Project: Tika Issue Type: New Feature Components: general Affects Versions: 1.0 Environment: Windows-7 + JDK 1.6 u26 Reporter: Albert L. Fix For: 1.11 It would be nice to be able to disable text content from embedded files. For example, if I have a DOCX with an embedded PPTX, then I would like the option to disable text from the PPTX from showing up when asking for the text content from DOCX. In other words, it would be nice to have the option to get text content *only* from the DOCX instead of the DOCX+PPTX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1343) Create a Tika Translator implementation that uses JoshuaDecoder
[ https://issues.apache.org/jira/browse/TIKA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1343: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Create a Tika Translator implementation that uses JoshuaDecoder --- Key: TIKA-1343 URL: https://issues.apache.org/jira/browse/TIKA-1343 Project: Tika Issue Type: New Feature Components: translation Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.11 The Joshua Decoder toolkit is a BSD licensed Java-based statistical machine translation system hosted at Github: http://joshua-decoder.org/ Joshua takes in corpuses and trains models that can then be used to do language translation. Currently there is support for e.g., Spanish-English, Indian dialects-English, Chinese-English, and a few others. https://github.com/joshua-decoder/joshua/ It would be nice to build a Tika Translator on top of Joshua. There are of course several issues with this: * the models are huge - so we'll need a separate package or Maven module, maybe tika-translate-joshua or something to release the models and we'll need to build the models. I just went through the process of building the Spanish-English one, and it still needs to be rebuilt b/c I did it wrong, but it took over a day * there is a configuration for Joshua, and so we need some way of passing that config into the Translator. Not sure of the best way to do this. * Joshua isn't in the Central repository. I've started a discussion on the Joshua lists about this: https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0 Anyhoo, I've got a working patch right now with hard-coded stuff, and a manual install into my Maven repo for brave souls out there that want to try it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1697) Parser Implementation for AkomaNtoso Legal XML Documents
[ https://issues.apache.org/jira/browse/TIKA-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1697: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Parser Implementation for AkomaNtoso Legal XML Documents Key: TIKA-1697 URL: https://issues.apache.org/jira/browse/TIKA-1697 Project: Tika Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.11 [AkomaNtoso|http://www.akomantoso.org/] is an established OASIS Legal Document XML standard and used pervasively within parliaments and other legislative arenas. This issue should utilize the [akomantoso-lib|https://github.com/kohsah/akomantoso-lib] to parse and populate Metadata for AkomaNtoso .xml and .akn documents. I'll send a PR for this soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1395) Create embedded image extraction example
[ https://issues.apache.org/jira/browse/TIKA-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1395: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Create embedded image extraction example Key: TIKA-1395 URL: https://issues.apache.org/jira/browse/TIKA-1395 Project: Tika Issue Type: Sub-task Components: example Reporter: Tyler Palsulich Priority: Minor Fix For: 1.11 Create an example of how to do embedded image extraction and parsing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1390) Create tika-example module
[ https://issues.apache.org/jira/browse/TIKA-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1390: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Create tika-example module -- Key: TIKA-1390 URL: https://issues.apache.org/jira/browse/TIKA-1390 Project: Tika Issue Type: Bug Components: example Reporter: Tyler Palsulich Fix For: 1.11 This issue will track the initial creation of the tika-example module. Subtasks will be used for the first few examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1540) New Tika plugin for image based feature extraction using computer vision techniques
[ https://issues.apache.org/jira/browse/TIKA-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1540: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release New Tika plugin for image based feature extraction using computer vision techniques --- Key: TIKA-1540 URL: https://issues.apache.org/jira/browse/TIKA-1540 Project: Tika Issue Type: New Feature Environment: cross platform Reporter: Aashish Chaudhary Assignee: Lewis John McGibbney Labels: gsoc2015 Fix For: 1.11 Attachments: TIKA-vision.achaudhary.150209.patch.txt This will be a web-service client based parser to perform image feature extraction using Computer Vision techniques. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1513: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Add mime detection and parsing for dbf files Key: TIKA-1513 URL: https://issues.apache.org/jira/browse/TIKA-1513 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.11 I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. Let's add dbf parsing to Tika. Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-715: - Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Some parsers produce non-well-formed XHTML SAX events - Key: TIKA-715 URL: https://issues.apache.org/jira/browse/TIKA-715 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.10 Reporter: Michael McCandless Labels: newbie Fix For: 1.11 Attachments: TIKA-715.patch With TIKA-683 I committed simple, commented out code to SafeContentHandler, to verify that the SAX events produced by the parser have valid (matched) tags. I.e., each startElement(foo) is matched by the closing endElement(foo). I only did a basic nesting test, plus checking that <p> is never embedded inside another <p>; we could strengthen this further to check that all tags only appear in valid parents... I was able to use this to fix issues with the new RTF parser (TIKA-683), but I was surprised that some other parsers failed the new asserts. It could be these are relatively minor offenses (e.g. closing a table w/o closing the tr) and we need not do anything here... but I think it'd be cleaner if all our parsers produced matched, well-formed XHTML events. I haven't looked into any of these... it could be they are easy to fix. Failures: {noformat} testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest) Time elapsed: 0.032 sec ERROR! 
java.lang.AssertionError: end tag=body with no startElement at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224) at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275) at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129) at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158) testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest) Time elapsed: 0.116 sec ERROR! java.lang.AssertionError: mismatched elements open=tr close=table at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226) at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275) at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252) at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287) at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136) at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938) at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648) at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140) 
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205) at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522) at javax.xml.parsers.SAXParser.parse(SAXParser.java:395) at javax.xml.parsers.SAXParser.parse(SAXParser.java:198) at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190) at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49) testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.025 sec ERROR!
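For readers following TIKA-715, the matched-tag check it describes can be sketched with a simple stack of open element names. This is only an illustration under stated assumptions: TagBalanceChecker is a hypothetical class, not Tika's SafeContentHandler, and the exception messages merely imitate the wording of the assertion failures quoted above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration only; it is not part of Tika.
final class TagBalanceChecker {
    // Names of the elements that have been started but not yet ended.
    private final Deque<String> open = new ArrayDeque<>();

    void startElement(String name) {
        // Mirrors the "p is never embedded inside another p" rule from the issue.
        if ("p".equals(name) && open.contains("p")) {
            throw new IllegalStateException("p nested inside another p");
        }
        open.push(name);
    }

    void endElement(String name) {
        if (open.isEmpty()) {
            throw new IllegalStateException("end tag=" + name + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(name)) {
            throw new IllegalStateException(
                    "mismatched elements open=" + expected + " close=" + name);
        }
    }

    // True once every started element has been ended.
    boolean isBalanced() {
        return open.isEmpty();
    }
}
```

Feeding it startElement("table"), startElement("tr"), endElement("table") fails in the same way as the Keynote case, because the tr is still open when the table is closed.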
[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1607: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Introduce new arbitrary object key/values data structure for persistence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.11 Attachments: TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch I am currently working on implementing more comprehensive extraction and enhancement of the Tika support for phone number extraction and metadata modeling. Right now we utilize the String[] multivalued support available within Tika to persist phone numbers as {code} Metadata: String: String[] Metadata: phonenumbers: number1, number2, number3, ... {code} I would like to propose we extend multi-valued support outside of the String[] paradigm by implementing a more abstract Collection of Objects such that we could consider and implement the phone number use case as follows {code} Metadata: String: Object {code} Where Object could be a Collection<HashMap<String/Property, HashMap<String/Property, String/Int/Long>>> e.g. {code} Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), (LibPN-NumberType: International), (etc: etc)...), (+1292611054: (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) (etc)] {code} There are obvious backwards compatibility issues with this approach... additionally it is a fundamental change to the core Metadata API. I hope that the String, Object Mapping however is flexible enough to allow me to model Tika Metadata the way I want. Any comments folks? Thanks Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
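To make the String-to-Object proposal in TIKA-1607 a little more concrete, here is a rough sketch of what such a structure might look like. Everything in it is an assumption for illustration: ObjectMetadataSketch, its set/get methods and the phoneNumber helper are hypothetical names and do not reflect Tika's Metadata API or the attached patches. Only the sample numbers and the LibPN-* keys are taken from the issue text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container for illustration only; this is not Tika's Metadata class.
final class ObjectMetadataSketch {
    // Metadata name mapped to an arbitrary Object rather than String[].
    private final Map<String, Object> values = new LinkedHashMap<>();

    void set(String name, Object value) {
        values.put(name, value);
    }

    Object get(String name) {
        return values.get(name);
    }

    // Builds one phone number entry with its per-number attributes.
    static Map<String, String> phoneNumber(String number, String countryCode, String numberType) {
        Map<String, String> entry = new LinkedHashMap<>();
        entry.put("number", number);
        entry.put("LibPN-CountryCode", countryCode);
        entry.put("LibPN-NumberType", numberType);
        return entry;
    }

    public static void main(String[] args) {
        List<Map<String, String>> numbers = new ArrayList<>();
        numbers.add(phoneNumber("+162648743476", "US", "International"));
        numbers.add(phoneNumber("+1292611054", "UK", "International"));

        ObjectMetadataSketch metadata = new ObjectMetadataSketch();
        metadata.set("phonenumbers", numbers);
        System.out.println(metadata.get("phonenumbers"));
    }
}
```

The cost of this flexibility is the one the issue already names: callers have to cast the Object they get back, so existing String-based consumers would not work unchanged.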
[jira] [Updated] (TIKA-1456) Visual Sentiment API parser
[ https://issues.apache.org/jira/browse/TIKA-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1456: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Visual Sentiment API parser --- Key: TIKA-1456 URL: https://issues.apache.org/jira/browse/TIKA-1456 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Labels: gsoc2015 Fix For: 1.11 Integrate the Visual Sentibank API as a parser for images. We can use Aperture from CMU, it's released under the MIT license: https://github.com/d8w/aperture -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1640) Make ExternalParser support aliases for key names in extracted metadata
[ https://issues.apache.org/jira/browse/TIKA-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1640: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Make ExternalParser support aliases for key names in extracted metadata --- Key: TIKA-1640 URL: https://issues.apache.org/jira/browse/TIKA-1640 Project: Tika Issue Type: Improvement Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.11 Over in TIKA-1639, we were discussing the work outside of Tika that [~rgauss] did (per [~gagravarr]) on the EXIFTool parsing. I added support in TIKA-1639 for this, but one thing Ray's code-based work did that my config oriented work didn't is allow for renaming extracted metadata key names to better support having consistent metadata across parsers. Here's one way to do it: ExternalParser could have a config section like so: {code:xml} <aliases> <metadata key="foo" alias="bar"/> <metadata key="foo2" alias="bar2"/> </aliases> {code} Then this could be used to rename metadata keys. I'll implement that in this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
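As a rough idea of how the alias section proposed in TIKA-1640 might be applied once it has been read from the config, the renaming step could look something like the sketch below. MetadataAliasSketch and applyAliases are hypothetical names, and plain maps stand in for both the parsed aliases and the extracted metadata; none of this is ExternalParser's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika's ExternalParser.
final class MetadataAliasSketch {
    // Returns a copy of the extracted metadata with aliased keys renamed.
    // Keys without an alias are kept unchanged.
    static Map<String, String> applyAliases(Map<String, String> extracted,
                                            Map<String, String> aliases) {
        Map<String, String> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            String key = aliases.getOrDefault(entry.getKey(), entry.getKey());
            renamed.put(key, entry.getValue());
        }
        return renamed;
    }
}
```

With the aliases from the issue (foo to bar, foo2 to bar2), an extracted key foo would come out as bar, while any key not listed keeps its original name. If two keys map to the same alias, the later one wins in this sketch; a real implementation would need to decide how to handle that.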
[jira] [Updated] (TIKA-1465) Implement extraction of non-global variables from netCDF3 and netCDF4
[ https://issues.apache.org/jira/browse/TIKA-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1465: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Implement extraction of non-global variables from netCDF3 and netCDF4 - Key: TIKA-1465 URL: https://issues.apache.org/jira/browse/TIKA-1465 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.11 Speaking to Eric Nienhouse at the ongoing NSF funded Polar Cyberinfrastructure hackathon in NYC, we became aware that variable parameters contained within netCDF3 and netCDF4 are just as valuable (if not more valuable) as global attribute values. AFAIK, right now we only extract global attributes; however, we could extend the support to cater for the above observations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
[ https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1059: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Better Handling of InterruptedException in ExternalParser and ExternalEmbedder -- Key: TIKA-1059 URL: https://issues.apache.org/jira/browse/TIKA-1059 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.3 Reporter: Ray Gauss II Fix For: 1.11 The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch {{InterruptedException}} and ignore it. The methods should either call {{interrupt()}} on the current thread or re-throw the exception, possibly wrapped in a {{TikaException}}. See TIKA-775 for a previous discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
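The two options named in TIKA-1059 (restore the interrupt flag, or re-throw wrapped) can be sketched as follows. This is a generic illustration, not the ExternalParser or ExternalEmbedder code: InterruptHandlingSketch, BlockingStep and WrappedFailure are hypothetical names, with WrappedFailure standing in for TikaException so the example compiles without Tika on the classpath.

```java
// Hypothetical sketch for illustration only; not Tika code.
final class InterruptHandlingSketch {
    // Stand-in for TikaException so the sketch has no Tika dependency.
    static final class WrappedFailure extends Exception {
        WrappedFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // A blocking step, such as waiting for an external process to finish.
    interface BlockingStep {
        void run() throws InterruptedException;
    }

    // Option 1: restore the interrupt flag so callers can still observe it.
    // Returns false if the step was interrupted.
    static boolean runRestoringInterrupt(BlockingStep step) {
        try {
            step.run();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Option 2: restore the flag and re-throw, wrapped in a checked exception.
    static void runWrappingInterrupt(BlockingStep step) throws WrappedFailure {
        try {
            step.run();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new WrappedFailure("interrupted while waiting", e);
        }
    }
}
```

Either way the interruption is no longer swallowed: a thread pool or caller that requested cancellation can still see it, which is the behaviour the issue asks for.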
[jira] [Updated] (TIKA-1301) Establish TikaServer on Apache hosted VM
[ https://issues.apache.org/jira/browse/TIKA-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1301: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Establish TikaServer on Apache hosted VM Key: TIKA-1301 URL: https://issues.apache.org/jira/browse/TIKA-1301 Project: Tika Issue Type: Bug Components: server Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.11 Over in Any23, Infra recently provisioned us with a nice shiny new VM to run our service on http://any23.org I would like to do the same for Tika. I have some scripts on the Any23 VM which will pull stable nightly tika-server snapshots and deploy them to the VM. This is really nice for both devs and users alike. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1577: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release NetCDF Data Extraction -- Key: TIKA-1577 URL: https://issues.apache.org/jira/browse/TIKA-1577 Project: Tika Issue Type: Improvement Components: handler, parser Affects Versions: 1.7 Reporter: Ann Burgess Assignee: Ann Burgess Labels: features, handler Fix For: 1.11 Original Estimate: 504h Remaining Estimate: 504h A netCDF classic or 64-bit offset dataset is stored as a single file comprising two parts: - a header, containing all the information about dimensions, attributes, and variables except for the variable data; - a data part, comprising fixed-size data, containing the data for variables that don't have an unlimited dimension; and variable-size data, containing the data for variables that have an unlimited dimension. The NetCDFparser currently extracts the header part. -- text extracts file Dimensions and Variables -- metadata extracts Global Attributes We want the option to extract the data part of NetCDF files. Let's use the NetCDF test file for our dev testing: tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1318) Use of Deprecated Word6Extractor.getParagraphText() Method
[ https://issues.apache.org/jira/browse/TIKA-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1318: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Use of Deprecated Word6Extractor.getParagraphText() Method -- Key: TIKA-1318 URL: https://issues.apache.org/jira/browse/TIKA-1318 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Tyler Palsulich Priority: Minor Labels: deprecation Fix For: 1.11 org.apache.tika.parser.microsoft.WordExtractor.parseWord6() uses the deprecated Word6Extractor.getParagraphText() method. getParagraphText() is supposed to return a String[] with an element for each paragraph in the text. The replacement is getText(), which lets paragraph, cell, etc separation be implementation specific. I'm not sure, at this point, how the POI WordExtractor separates them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-988) We don't extract a placeholder for a Word document embedded in an Excel document
[ https://issues.apache.org/jira/browse/TIKA-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-988: - Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release We don't extract a placeholder for a Word document embedded in an Excel document Key: TIKA-988 URL: https://issues.apache.org/jira/browse/TIKA-988 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Fix For: 1.11 Attachments: bug31373.xls In TIKA-956 we fixed the Word parser so that at the point where an embedded document appears, we output a <div class="embedded" id="_XXX"/> tag. It would be nice to do this for documents embedded in Excel too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies
[ https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1367: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Tika documentation should list tika-parsers parser dependencies --- Key: TIKA-1367 URL: https://issues.apache.org/jira/browse/TIKA-1367 Project: Tika Issue Type: Improvement Components: documentation Reporter: Sergey Beryozkin Fix For: 1.11 The tika-parsers module has many strong transitive parser dependencies. Maven users of tika-parsers have to exclude all the transitive dependencies manually. Documenting the list of the existing transitive dependencies and keeping the list up to date will help developers exclude the libraries not needed for a given project. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1328) Translate Metadata and Content
[ https://issues.apache.org/jira/browse/TIKA-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1328: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Translate Metadata and Content -- Key: TIKA-1328 URL: https://issues.apache.org/jira/browse/TIKA-1328 Project: Tika Issue Type: New Feature Components: translation Reporter: Tyler Palsulich Fix For: 1.11 Right now, Translation is only done on Strings. Ideally, users would be able to turn on translation while parsing. I can think of a couple options: - Make a TranslateAutoDetectParser. Automatically detect the file type, parse it, then translate the content. - Make a Context switch. When true, translate the content regardless of the parser used. I'm not sure the best way to go about this method, but I prefer it over another Parser. Regardless, we need a black or white list for translation. I think black list would be the way to go -- which fields should not be translated (dates, versions, ...) Any ideas? Also, somewhat unrelated, does anyone know of any other open source translation libraries? If we were really lucky, it wouldn't depend on an online service. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1657) Allow easier dumping of TikaConfig file from tika-core
[ https://issues.apache.org/jira/browse/TIKA-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1657: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Allow easier dumping of TikaConfig file from tika-core -- Key: TIKA-1657 URL: https://issues.apache.org/jira/browse/TIKA-1657 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.11 In TIKA-1418, we added an example for how to dump the config file so that users could easily modify it. I think we should go further and make this an option at the tika-core level with hooks for tika-app and tika-server. I propose adding a main() to TikaConfig that will print the xml config file that Tika is currently using to stdout. I'd like to put this into core so that e.g. Solr's DIH users can get by without having to download tika-app separately. There's every chance that I've not accounted for issues with dynamic loading etc. Also, I'd be ok with only having this available in tika-app and tika-server if there are good reasons. Feedback? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1598) Parser Implementation for Streaming Video
[ https://issues.apache.org/jira/browse/TIKA-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1598: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Parser Implementation for Streaming Video - Key: TIKA-1598 URL: https://issues.apache.org/jira/browse/TIKA-1598 Project: Tika Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Labels: memex Fix For: 1.11 A number of us have been discussing a Tika implementation which could, for example, bind to a live multimedia stream and parse content from the stream until it finished. An excellent example would be watching Bonnie Scotland beating R. of Ireland in the upcoming European Championship Qualifying - Group D on Sat 13 Jun @ 17:00 GMT :) I located a JMF Wrapper for ffmpeg which 'may' enable us to do this http://sourceforge.net/projects/jffmpeg/ I am not sure... plus it is not licensed liberally enough for us to include so if there are other implementations then please post them here. I 'may' be able to have a crack at implementing this next week. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1674) Add example to show how to extract embedded files
[ https://issues.apache.org/jira/browse/TIKA-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1674: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release Add example to show how to extract embedded files - Key: TIKA-1674 URL: https://issues.apache.org/jira/browse/TIKA-1674 Project: Tika Issue Type: New Feature Reporter: Tim Allison Priority: Minor Fix For: 1.11 On tika-user, we received a question on how to extract embedded files. Let's add an example. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1505) chmparser breaks down when extracting from file of CHM format v3
[ https://issues.apache.org/jira/browse/TIKA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-1505: -- Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release chmparser breaks down when extracting from file of CHM format v3 Key: TIKA-1505 URL: https://issues.apache.org/jira/browse/TIKA-1505 Project: Tika Issue Type: Bug Reporter: Bin Hawking Fix For: 1.11 chmparser throws exception or returns faulty text when: 1. extracting from file of CHM format version 3 2. chm file with lzx reset interval 2 3. chm file with 5000 objects I am making the fix now. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-980) MicrodataContentHandler for Apache Tika
[ https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dave Meikle updated TIKA-980: - Fix Version/s: (was: 1.10) 1.11 * Pushed to 1.11 following 1.10 release MicrodataContentHandler for Apache Tika --- Key: TIKA-980 URL: https://issues.apache.org/jira/browse/TIKA-980 Project: Tika Issue Type: New Feature Components: parser Reporter: Markus Jelsma Assignee: Ken Krugler Fix For: 1.11 Attachments: TIKA-980-1.3-1.patch, TIKA-980-1.3-2.patch, TIKA-980-1.3-3.patch, TIKA-980-1.3-4.patch, TIKA-980-1.3-5.patch ContentHandler for Apache Tika capable of building a data structure containing Microdata item scopes and item properties. The Item* classes are borrowed from the Apache Any23 project and are slightly modified to accommodate this SAX-based extractor vs the original DOM-based extractor. The provided unit test outputs two item scopes about the Europe and NA ApacheCon events and each has a nested property. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment
[ https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662971#comment-14662971 ] Ian Williams commented on TIKA-894: --- I am out of the office until Mon 10 Aug 2015. Regards Ian Add webapp mode for Tika Server, simplifies deployment -- Key: TIKA-894 URL: https://issues.apache.org/jira/browse/TIKA-894 Project: Tika Issue Type: Improvement Components: packaging Affects Versions: 1.1, 1.2 Reporter: Chris Wilson Labels: maven, newbie, patch Fix For: 1.11 Attachments: tika-server-webapp.patch For use in production services, Tika Server should really be deployed as a WAR file, under a reliable servlet container that knows how to run as a system service, for example Tomcat or JBoss. This is especially important on Windows, where I wasted an entire day trying to make TikaServerCli run as some kind of a service. Maven makes building a webapp pretty trivial. With the attached patch applied, mvn war:war should work. It seems to run fine in Tomcat, which makes Windows deployment much simpler. Just install Tomcat and drop the WAR file into Tomcat's webapps directory and you're away. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[ANNOUNCE] Apache Tika 1.10 release
The Apache Tika project is pleased to announce the release of Apache Tika 1.10. The release contents have been pushed out to the main Apache release site and to the Central sync, so the releases should be available as soon as the mirrors get the syncs. Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Apache Tika 1.10 contains a number of improvements and bug fixes. Details can be found in the changes file: http://www.apache.org/dist/tika/CHANGES-1.10.txt Apache Tika is available in source form from the following download page: http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.10-src.zip Apache Tika is also available in binary form or for use using Maven 2 from the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/ In the initial 48 hours, the release may not be available on all mirrors. When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site: https://people.apache.org/keys/group/tika.asc For more information on Apache Tika, visit the project home page: http://tika.apache.org/ -- David Meikle, on behalf of the Apache Tika community
Re: [ANNOUNCE] Apache Tika 1.10 release
Thanks, Dave! On Sat, Aug 8, 2015, 7:01 AM David Meikle dmei...@apache.org wrote: The Apache Tika project is pleased to announce the release of Apache Tika 1.10. The release contents have been pushed out to the main Apache release site and to the Central sync, so the releases should be available as soon as the mirrors get the syncs. Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Apache Tika 1.10 contains a number of improvements and bug fixes. Details can be found in the changes file: http://www.apache.org/dist/tika/CHANGES-1.10.txt Apache Tika is available in source form from the following download page: http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.10-src.zip Apache Tika is also available in binary form or for use using Maven 2 from the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/ In the initial 48 hours, the release may not be available on all mirrors. When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site: https://people.apache.org/keys/group/tika.asc For more information on Apache Tika, visit the project home page: http://tika.apache.org/ -- David Meikle, on behalf of the Apache Tika community