[jira] [Updated] (TIKA-1688) Tika Version in Metadata
[ https://issues.apache.org/jira/browse/TIKA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1688: Fix Version/s: (was: 1.12) 1.13 > Tika Version in Metadata > > > Key: TIKA-1688 > URL: https://issues.apache.org/jira/browse/TIKA-1688 > Project: Tika > Issue Type: Improvement >Reporter: Paul Ramirez >Priority: Minor > Fix For: 1.13 > > > Could this be added as X-Tika:version so that, downstream, extraction results would be > traceable to the Tika version that produced them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
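A minimal sketch of the idea above, not code from the issue: after parsing, stamp the running tika-core version into the metadata under the proposed X-Tika:version key, reading the version from the jar manifest (which may be null when running from unpacked classes).
{code:java}
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class VersionStampExample {
    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(stream, new BodyContentHandler(), metadata, new ParseContext());
        }
        // Implementation-Version comes from the tika-core jar manifest; it may be
        // null outside a packaged jar, hence the fallback.
        String version = Tika.class.getPackage().getImplementationVersion();
        metadata.add("X-Tika:version", version == null ? "unknown" : version);
        System.out.println(metadata);
    }
}
{code}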
[jira] [Updated] (TIKA-1609) Leverage Google's LibPhonenumber for enhanced phone number extraction and metadata modeling
[ https://issues.apache.org/jira/browse/TIKA-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1609: Fix Version/s: (was: 1.12) 1.13 > Leverage Google's LibPhonenumber for enhanced phone number extraction and > metadata modeling > --- > > Key: TIKA-1609 > URL: https://issues.apache.org/jira/browse/TIKA-1609 > Project: Tika > Issue Type: New Feature > Components: core >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.13 > > > Google's Libphonenumber can provide us with comprehensive support for > modeling Phone number metadata properly in Tika. > During the development of this patch I realized a few things, namely > * This is not a parser as such, as phone numbers are not mapped to any > particular Mimetype > * In addition, there can be many phone numbers per document, so this is most > likely a Content Handler of sorts > * Tika's Metadata support is currently too restrictive to allow us to > persist many complex objects e.g. String, Object. We need to expand Metadata > support over and above String, String[]. > https://github.com/googlei18n/libphonenumber/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
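A rough sketch, not taken from the patch, of how Google's libphonenumber could feed phone numbers into Tika metadata; the "phonenumbers" key and the "US" default region are illustrative assumptions.
{code:java}
import com.google.i18n.phonenumbers.PhoneNumberMatch;
import com.google.i18n.phonenumbers.PhoneNumberUtil;
import org.apache.tika.metadata.Metadata;

public class PhoneNumberExtractionExample {
    public static void main(String[] args) {
        String text = "Call us at +1 650-253-0000 or +44 20 7031 3000.";
        PhoneNumberUtil util = PhoneNumberUtil.getInstance();
        Metadata metadata = new Metadata();
        // "US" is only the fallback region for numbers written without a country code.
        for (PhoneNumberMatch match : util.findNumbers(text, "US")) {
            metadata.add("phonenumbers",
                    util.format(match.number(), PhoneNumberUtil.PhoneNumberFormat.E164));
        }
        System.out.println(String.join(", ", metadata.getValues("phonenumbers")));
    }
}
{code}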
[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1607: Fix Version/s: (was: 1.12) 1.13 > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607v1_rough_rough.patch, > TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. I hope that > the Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1640) Make ExternalParser support aliases for key names in extracted metadata
[ https://issues.apache.org/jira/browse/TIKA-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1640: Fix Version/s: (was: 1.12) 1.13 > Make ExternalParser support aliases for key names in extracted metadata > --- > > Key: TIKA-1640 > URL: https://issues.apache.org/jira/browse/TIKA-1640 > Project: Tika > Issue Type: Improvement > Components: parser >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.13 > > > Over in TIKA-1639, we were discussing the work outside of Tika that [~rgauss] > did (per [~gagravarr]) on the EXIFTool parsing. I added support in TIKA-1639 > for this, but one thing Ray's code-based work did that my config oriented > work didn't is allow for renaming extracted metadata key names to better > support having consistent metadata across parsers. > Here's one way to do it: > ExternalParser could have a config section like so: > {code:xml} > > > > > {code} > Then this could be used to rename metadata keys. > I'll implement that in this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
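The XML snippet above did not survive the archive. For context, a hedged sketch of the existing Java-side ExternalParser configuration that such a rename/alias section would complement; the exiftool command line, the regular expressions, and the chosen target keys are illustrative assumptions, not Tika's shipped configuration.
{code:java}
import org.apache.tika.metadata.TIFF;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.external.ExternalParser;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class ExternalParserRenameSketch {
    public static ExternalParser exiftoolParser() {
        ExternalParser parser = new ExternalParser();
        parser.setSupportedTypes(Collections.singleton(MediaType.image("jpeg")));
        // ExternalParser replaces ${INPUT} with the path of the file being parsed.
        parser.setCommand("exiftool", ExternalParser.INPUT_FILE_TOKEN);
        // Each pattern's first capture group becomes the value; the map value is the
        // metadata key it is stored under -- renaming it here is the "alias" idea.
        Map<Pattern, String> patterns = new HashMap<>();
        patterns.put(Pattern.compile("Image Width\\s*:\\s*(\\d+)"), TIFF.IMAGE_WIDTH.getName());
        patterns.put(Pattern.compile("Image Height\\s*:\\s*(\\d+)"), TIFF.IMAGE_LENGTH.getName());
        parser.setMetadataExtractionPatterns(patterns);
        return parser;
    }
}
{code}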
[jira] [Updated] (TIKA-985) Support for HTML5 elements
[ https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-985: --- Fix Version/s: (was: 1.12) 1.13 > Support for HTML5 elements > -- > > Key: TIKA-985 > URL: https://issues.apache.org/jira/browse/TIKA-985 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.2 >Reporter: Markus Jelsma > Fix For: 1.13 > > Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch, > TIKA-985-1.3-3.patch, TIKA-985-1.5.patch > > > TagSoup's schema.tssl does not include some HTML5 elements (e.g. article, > section). This prevents some custom ContentHandlers from reading expected > elements and/or attributes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1706) Bring back commons-io to tika-core
[ https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1706: Fix Version/s: (was: 1.12) 1.13 > Bring back commons-io to tika-core > -- > > Key: TIKA-1706 > URL: https://issues.apache.org/jira/browse/TIKA-1706 > Project: Tika > Issue Type: Improvement > Components: core >Reporter: Yaniv Kunda >Priority: Minor > Fix For: 1.13 > > Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch > > > TIKA-249 inlined select commons-io classes in order to simplify the > dependency tree and save some space. > I believe these arguments are weaker nowadays due to the following concerns: > - Most of the non-core modules already use commons-io, and since tika-core is > usually not used by itself, commons-io is already included with it > - Since some modules use both tika-core and commons-io, it's not clear which > code should be used > - Having the inlined classes causes more maintenance and/or technology debt > (which in turn causes more maintenance) > - Newer commons-io code utilizes newer platform code, e.g. using Charset > objects instead of encoding names, being able to use StringBuilder instead of > StringBuffer, and so on. > I'll be happy to provide a patch to replace usages of the inlined classes > with commons-io classes if this is accepted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE
[ https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1308: Fix Version/s: (was: 1.12) 1.13 > Support in memory parse mode(don't create temp file): to support run Tika in > GAE > > > Key: TIKA-1308 > URL: https://issues.apache.org/jira/browse/TIKA-1308 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.5 >Reporter: jefferyyuan > Labels: gae > Fix For: 1.13 > > > I am trying to use Tika in GAE and write a simple servlet to extract meta > data info from jpeg: > {code} > String urlStr = req.getParameter("imageUrl"); > byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr)); > ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData); > Metadata metadata = new Metadata(); > BodyContentHandler ch = new BodyContentHandler(); > AutoDetectParser parser = new AutoDetectParser(); > parser.parse(bais, ch, metadata, new ParseContext()); > bais.close(); > {code} > This fails with exception: > {code} > Caused by: java.lang.SecurityException: Unable to create temporary file > at java.io.File.createTempFile(File.java:1986) > at > org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66) > at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533) > at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56) > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242 > {code} > Checked the code, in > org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, > Metadata, ParseContext), it creates a temp file from the input stream. > I can understand why Tika creates a temp file from the stream: so Tika can parse > it multiple times. > But as GAE and other cloud servers are getting more popular, is it possible > to avoid creating the temp file: instead we could copy the original stream to a > byte-array stream, so Tika can still parse it multiple times. > -- This will have a limit on the file size, as Tika keeps the whole file in > memory, but this would make Tika work in GAE and maybe other cloud servers. > We could add a parameter to parser.parse to indicate whether to do an in-memory parse > only. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1329) Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser
[ https://issues.apache.org/jira/browse/TIKA-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1329: Fix Version/s: (was: 1.12) 1.13 > Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser > --- > > Key: TIKA-1329 > URL: https://issues.apache.org/jira/browse/TIKA-1329 > Project: Tika > Issue Type: Sub-task > Components: parser >Reporter: Tim Allison >Priority: Minor > Fix For: 1.13 > > Attachments: TIKA-1329-site.patch, TIKA-1329v2.patch, > test_recursive_embedded.docx > > > Jukka and Nick have a great demo of parsing metadata recursively on the > [wiki|http://wiki.apache.org/tika/RecursiveMetadata]. For TIKA-1302, I'd > like to use something similar, and I think that others may find it useful for > tika-app and tika-server. > I took the code from the wiki and made some modifications. I'm not sure if > we should put this in parsers or in a new module for "examples." Given that > I think this would be useful for tika-app and tika-server, I'd prefer > parsers, but I'm open to any input...including "let's not." -- This message was sent by Atlassian JIRA (v6.3.4#6332)
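A usage sketch of the wrapper along the lines of what eventually shipped in Tika 1.x; the exact constructor and the handler-factory/metadata key names may differ from the attached patch.
{code:java}
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class RecursiveMetadataExample {
    public static void main(String[] args) throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
                new AutoDetectParser(),
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
        try (InputStream stream = Files.newInputStream(Paths.get("test_recursive_embedded.docx"))) {
            wrapper.parse(stream, new DefaultHandler(), new Metadata(), new ParseContext());
        }
        // One Metadata object per document: the container first, then each embedded file.
        List<Metadata> metadataList = wrapper.getMetadata();
        for (Metadata m : metadataList) {
            System.out.println(m.get(Metadata.RESOURCE_NAME_KEY) + " -> " + m.get(Metadata.CONTENT_TYPE));
        }
    }
}
{code}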
[jira] [Updated] (TIKA-1616) Tika Parser for GIBS Metadata
[ https://issues.apache.org/jira/browse/TIKA-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1616: Fix Version/s: (was: 1.12) 1.13 > Tika Parser for GIBS Metadata > - > > Key: TIKA-1616 > URL: https://issues.apache.org/jira/browse/TIKA-1616 > Project: Tika > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.13 > > > [GIBS|https://earthdata.nasa.gov/about-eosdis/science-system-description/eosdis-components/global-imagery-browse-services-gibs] > metadata currently consists of simple stuff in the WMTS GetCapabilities > request (e.g. > http://map1.vis.earthdata.nasa.gov/wmts-arctic/1.0.0/WMTSCapabilities.xml) > which includes available layers, extents, time ranges, map projections, color > maps, etc. We will eventually have more detailed visualization metadata > available in ECHO/CMR which will include linkages to data products, > provenance, etc. > Some investigation and a Tika parser would be excellent to extract and > assimilate GIBS Metadata. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-980) MicrodataContentHandler for Apache Tika
[ https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-980: --- Fix Version/s: (was: 1.12) 1.13 > MicrodataContentHandler for Apache Tika > --- > > Key: TIKA-980 > URL: https://issues.apache.org/jira/browse/TIKA-980 > Project: Tika > Issue Type: New Feature > Components: parser >Reporter: Markus Jelsma >Assignee: Ken Krugler > Fix For: 1.13 > > Attachments: TIKA-980-1.3-1.patch, TIKA-980-1.3-2.patch, > TIKA-980-1.3-3.patch, TIKA-980-1.3-4.patch, TIKA-980-1.3-5.patch > > > ContentHandler for Apache Tika capable of building a data structure > containing Microdata item scopes and item properties. The Item* classes are > borrowed from the Apache Any23 project and are slightly modified to > accommodate this SAX-based extractor vs the original DOM-based extractor. > The provided unit test outputs two item scopes about the Europe and NA > ApacheCon events and each has a nested property. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1435) Update rome dependency to 1.5
[ https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1435: Fix Version/s: (was: 1.12) 1.13 > Update rome dependency to 1.5 > - > > Key: TIKA-1435 > URL: https://issues.apache.org/jira/browse/TIKA-1435 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.6 >Reporter: Johannes Mockenhaupt >Assignee: Chris A. Mattmann >Priority: Minor > Fix For: 1.13 > > Attachments: netcdf-deps-changes.diff > > > Rome 1.5 has been released to Sonatype > (https://github.com/rometools/rome/issues/183). Though the website > (http://rometools.github.io/rome/) is blissfully ignorant of that. The update > is mostly maintenance, adopting slf4j and generics as well as moving the > namespace from _com.sun.syndication_ to _com.rometools_. PR upcoming. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-891: --- Fix Version/s: (was: 1.12) 1.13 > Use POST in addition to PUT on method calls in tika-server > -- > > Key: TIKA-891 > URL: https://issues.apache.org/jira/browse/TIKA-891 > Project: Tika > Issue Type: Improvement > Components: general >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann >Priority: Trivial > Labels: newbie > Fix For: 1.13 > > > Per Jukka's email: > http://s.apache.org/uR > It would be a better use of REST/HTTP "verbs" to use POST to put content to a > resource where we don't intend to store that content (which is the > implication of PUT). Max suggested adding: > {code} > @POST > {code} > annotations to the methods we are currently exposing using PUT to take care > of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
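A hedged sketch of the suggested change: expose the same operation under both verbs by duplicating the JAX-RS method, since a single resource method cannot carry two HTTP method designators. The resource path and method bodies are illustrative, not the actual tika-server code.
{code:java}
import javax.ws.rs.POST;
import javax.ws.rs.PUT;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import java.io.InputStream;

@Path("/tika")
public class TikaResourceSketch {

    // Existing behaviour: clients PUT the document body to the resource.
    @PUT
    @Produces("text/plain")
    public String extractTextPut(InputStream body) {
        return extract(body);
    }

    // New: the same operation via POST, which better matches the semantics
    // (the content is processed, not stored at the request URI).
    @POST
    @Produces("text/plain")
    public String extractTextPost(InputStream body) {
        return extract(body);
    }

    private String extract(InputStream body) {
        // ... run the stream through Tika and return the extracted text ...
        return "";
    }
}
{code}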
[jira] [Updated] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1577: Fix Version/s: (was: 1.12) 1.13 > NetCDF Data Extraction > -- > > Key: TIKA-1577 > URL: https://issues.apache.org/jira/browse/TIKA-1577 > Project: Tika > Issue Type: Improvement > Components: handler, parser >Affects Versions: 1.7 >Reporter: Ann Burgess >Assignee: Ann Burgess > Labels: features, handler > Fix For: 1.13 > > Original Estimate: 504h > Remaining Estimate: 504h > > A netCDF classic or 64-bit offset dataset is stored as a single file > comprising two parts: > - a header, containing all the information about dimensions, attributes, and > variables except for the variable data; > - a data part, comprising fixed-size data, containing the data for variables > that don't have an unlimited dimension; and variable-size data, containing > the data for variables that have an unlimited dimension. > The NetCDFparser currently extracts the "header part". > -- text extracts file Dimensions and Variables > -- metadata extracts Global Attributes > We want the option to extract the "data part" of NetCDF files. > Let's use the NetCDF test file for our dev testing: > tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment
[ https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-894: --- Fix Version/s: (was: 1.12) 1.13 > Add webapp mode for Tika Server, simplifies deployment > -- > > Key: TIKA-894 > URL: https://issues.apache.org/jira/browse/TIKA-894 > Project: Tika > Issue Type: Improvement > Components: packaging >Affects Versions: 1.1, 1.2 >Reporter: Chris Wilson > Labels: maven, newbie, patch > Fix For: 1.13 > > Attachments: tika-server-webapp.patch > > > For use in production services, Tika Server should really be deployed as a > WAR file, under a reliable servlet container that knows how to run as a > system service, for example Tomcat or JBoss. > This is especially important on Windows, where I wasted an entire day trying > to make TikaServerCli run as some kind of a service. > Maven makes building a webapp pretty trivial. With the attached patch > applied, "mvn war:war" should work. It seems to run fine in Tomcat, which > makes Windows deployment much simpler. Just install Tomcat and drop the WAR > file into tomcat's webapps directory and you're away. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1417) Create Extract Embedded Images from PDFs Example
[ https://issues.apache.org/jira/browse/TIKA-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1417: Fix Version/s: (was: 1.12) 1.13 > Create Extract Embedded Images from PDFs Example > > > Key: TIKA-1417 > URL: https://issues.apache.org/jira/browse/TIKA-1417 > Project: Tika > Issue Type: Improvement > Components: example >Reporter: Tyler Palsulich >Priority: Minor > Fix For: 1.13 > > > Users commonly want to "turn on" extraction of images embedded in PDFs (e.g. > TIKA-1414). Tika has the capability, but it's not clear how to use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
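A hedged sketch of how the capability can be switched on today: a PDFParserConfig with inline image extraction enabled is placed in the ParseContext, and the images are then routed to whatever EmbeddedDocumentExtractor is registered there (a file-saving extractor is sketched under TIKA-1674 further down).
{code:java}
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PdfInlineImagesExample {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();

        // "Turn on" extraction of images embedded in the PDF page content.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        context.set(PDFParserConfig.class, pdfConfig);

        // Embedded images go through the EmbeddedDocumentExtractor in the context;
        // the default ParsingEmbeddedDocumentExtractor simply re-parses them in place.
        context.set(Parser.class, parser);
        context.set(EmbeddedDocumentExtractor.class, new ParsingEmbeddedDocumentExtractor(context));

        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, new BodyContentHandler(-1), metadata, context);
        }
    }
}
{code}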
[jira] [Updated] (TIKA-1395) Create embedded image extraction example
[ https://issues.apache.org/jira/browse/TIKA-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1395: Fix Version/s: (was: 1.12) 1.13 > Create embedded image extraction example > > > Key: TIKA-1395 > URL: https://issues.apache.org/jira/browse/TIKA-1395 > Project: Tika > Issue Type: Sub-task > Components: example >Reporter: Tyler Palsulich >Priority: Minor > Fix For: 1.13 > > > Create an example of how to turn do embedded image extraction and parsing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1425) Automatic batching of Microsoft service calls
[ https://issues.apache.org/jira/browse/TIKA-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1425: Fix Version/s: (was: 1.12) 1.13 > Automatic batching of Microsoft service calls > - > > Key: TIKA-1425 > URL: https://issues.apache.org/jira/browse/TIKA-1425 > Project: Tika > Issue Type: Improvement > Components: translation >Affects Versions: 1.6 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.13 > > > Right now when I use the following code I get the stack trace at the bottom > of this description. This seems to be because the Request URI is too large to > make the service request. We need to have a mechanism within the call to > Tika.translate which will, on a service-by-service basis, determine the > maximum Request URI which can be sent. I believe that this should be on the > Tika side as how else am I meant to know the maximum request size? > {code:title=translator.java|borderStyle=solid} > +Translator translate = new MicrosoftTranslator(); > +((MicrosoftTranslator) translate).setId("..."); > +((MicrosoftTranslator) translate).setSecret("..."); > for (java.util.Map.Entry entry : parseResult) { >Parse parse = entry.getValue(); >LOG.info("-\nUrl\n---\n"); > @@ -201,7 +207,7 @@ >System.out.print(parse.getData().toString()); >if (dumpText) { > LOG.info("-\nParseText\n-\n"); > -System.out.print(parse.getText()); > +System.out.print(translate.translate(parse.getText(), "fr")); >} > {code} > {code:title=stacktrace.log|borderStyle=solid} > Exception in thread "main" java.lang.Exception: [microsoft-translator-api] > Error retrieving translation : Server returned HTTP response code: 414 for > URL: > http://api.microsofttranslator.com/V2/Ajax.svc/Translate?&from=&to=fr&text=%D0%A4%D0... > ... > at > com.memetix.mst.MicrosoftTranslatorAPI.retrieveString(MicrosoftTranslatorAPI.java:202) > at com.memetix.mst.translate.Translate.execute(Translate.java:61) > at com.memetix.mst.translate.Translate.execute(Translate.java:76) > at > org.apache.tika.language.translate.MicrosoftTranslator.translate(MicrosoftTranslator.java:104) > at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:210) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:228) > Caused by: java.io.IOException: Server returned HTTP response code: 414 for > URL: > http://api.microsofttranslator.com/V2/Ajax.svc/Translate?&from=&to=fr&text=%D0%A4%D0%BE%D1%80%D1%83%D0%B... > ... > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1675) > at > sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1673) > at java.security.AccessController.doPrivileged(Native Method) > at > sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1671) > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1244) > at > com.memetix.mst.MicrosoftTranslatorAPI.retrieveResponse(MicrosoftTranslatorAPI.java:178) > at > com.memetix.mst.MicrosoftTranslatorAPI.retrieveString(MicrosoftTranslatorAPI.java:199) > ... 
6 more > Caused by: java.io.IOException: Server returned HTTP response code: 414 for > URL: > http://api.microsofttranslator.com/V2/Ajax.svc/Translate?&from=&to=fr&text=%D0%A4%D0%BE... > ... > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626) > at > java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468) > at > com.memetix.mst.MicrosoftTranslatorAPI.retrieveResponse(MicrosoftTranslatorAPI.java:177) > ... 7 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
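A hedged client-side workaround sketch that chunks the text below an assumed per-request budget before calling Translator#translate; the 5000-character figure is a placeholder rather than a documented Microsoft limit, and the real fix would live inside MicrosoftTranslator on the Tika side, as described above.
{code:java}
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.translate.Translator;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class BatchedTranslation {
    // Placeholder budget; the actual maximum request size would have to be
    // determined per service, which is the point of this issue.
    private static final int MAX_CHARS_PER_CALL = 5000;

    public static String translateInChunks(Translator translator, String text, String targetLanguage)
            throws TikaException, IOException {
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + MAX_CHARS_PER_CALL, text.length());
            // Prefer to break on whitespace so words are not split across requests.
            if (end < text.length()) {
                int lastSpace = text.lastIndexOf(' ', end);
                if (lastSpace > start) {
                    end = lastSpace;
                }
            }
            chunks.add(text.substring(start, end));
            start = end;
        }
        StringBuilder translated = new StringBuilder();
        for (String chunk : chunks) {
            translated.append(translator.translate(chunk, targetLanguage));
        }
        return translated.toString();
    }
}
{code}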
[jira] [Updated] (TIKA-1657) Allow easier XML serialization of TikaConfig
[ https://issues.apache.org/jira/browse/TIKA-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1657: Fix Version/s: (was: 1.12) 1.13 > Allow easier XML serialization of TikaConfig > > > Key: TIKA-1657 > URL: https://issues.apache.org/jira/browse/TIKA-1657 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 1.13 > > Attachments: TIKA-1558-blacklist-effective.xml, TIKA-1657v1.patch > > > In TIKA-1418, we added an example for how to dump the config file so that > users could easily modify it. I think we should go further and make this an > option at the tika-core level with hooks for tika-app and tika-server. I > propose adding a main() to TikaConfig that will print the xml config file > that Tika is currently using to stdout. > I'd like to put this into core so that e.g. Solr's DIH users can get by > without having to download tika-app separately. > There's every chance that I've not accounted for issues with dynamic loading > etc. Also, I'd be ok with only having this available in tika-app and > tika-server if there are good reasons. > Feedback? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1829) org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92) NPE
[ https://issues.apache.org/jira/browse/TIKA-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1829: Fix Version/s: (was: 1.12) 1.13 > org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92) > NPE > > > Key: TIKA-1829 > URL: https://issues.apache.org/jira/browse/TIKA-1829 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.11 > Environment: OSX 10.11 >Reporter: frank >Priority: Critical > Labels: easyfix > Fix For: 1.13 > > Attachments: TesseractOCRParser.java > > > Just need to add a check on parameter of context. > 2016-01-11 12:36:52.328 [http-nio-8080-exec-9] WARN > o.a.j.core.query.lucene.NodeIndexer - Exception while indexing binary property > java.lang.NullPointerException: null > at > org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92) > ~[tika-parsers-1.11.jar:1.11] > at > org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) > ~[tika-core-1.11.jar:1.11] > at > org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:87) > ~[tika-core-1.11.jar:1.11] > at > org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253) > ~[tika-core-1.11.jar:1.11] > at > org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) > ~[tika-core-1.11.jar:1.11] > at > org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253) > ~[tika-core-1.11.jar:1.11] > at > org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) > ~[tika-core-1.11.jar:1.11] > at > org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253) > ~[tika-core-1.11.jar:1.11] > at > org.apache.jackrabbit.core.query.lucene.NodeIndexer.isSupportedMediaType(NodeIndexer.java:934) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:448) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.query.lucene.NodeIndexer.addValue(NodeIndexer.java:338) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.query.lucene.NodeIndexer.createDoc(NodeIndexer.java:270) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:1246) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.query.lucene.SearchIndex.mergeAggregatedNodeIndexes(SearchIndex.java:1539) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:1247) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.query.lucene.SearchIndex.updateNodes(SearchIndex.java:667) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.SearchManager.onEvent(SearchManager.java:408) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.observation.EventConsumer.consumeEvents(EventConsumer.java:249) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.observation.ObservationDispatcher.dispatchEvents(ObservationDispatcher.java:225) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.observation.EventStateCollection.dispatch(EventStateCollection.java:475) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.state.SharedItemStateManager$Update.end(SharedItemStateManager.java:856) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > 
org.apache.jackrabbit.core.state.SharedItemStateManager.update(SharedItemStateManager.java:1537) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.state.LocalItemStateManager.update(LocalItemStateManager.java:400) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.state.XAItemStateManager.update(XAItemStateManager.java:354) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.state.LocalItemStateManager.update(LocalItemStateManager.java:375) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.version.VersionManagerImplBase$WriteOperation.save(VersionManagerImplBase.java:470) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.version.VersionManagerImplBase.checkoutCheckin(VersionManagerImplBase.java:215) > [jackrabbit-core-2.8.0.jar:2.8.0] > at > org.apache.jackrabbit.core.VersionManagerImpl.access$400(VersionManagerImpl.java:73) > [jackrabbit-core-2.8.0.jar:2.8.0] >
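A fragment sketching the proposed guard for TesseractOCRParser#getSupportedTypes; hasTesseract(...) and SUPPORTED_TYPES stand in for the parser's existing members and are not reproduced here.
{code:java}
// Sketch only: tolerate callers (e.g. Jackrabbit's NodeIndexer) that pass a null ParseContext.
public Set<MediaType> getSupportedTypes(ParseContext context) {
    TesseractOCRConfig config = (context == null)
            ? new TesseractOCRConfig()
            : context.get(TesseractOCRConfig.class, new TesseractOCRConfig());
    // Only advertise image types when the tesseract binary is actually available.
    return hasTesseract(config) ? SUPPORTED_TYPES : Collections.<MediaType>emptySet();
}
{code}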
[jira] [Updated] (TIKA-1436) improvement to PDFParser
[ https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1436: Fix Version/s: (was: 1.12) 1.13 > improvement to PDFParser > > > Key: TIKA-1436 > URL: https://issues.apache.org/jira/browse/TIKA-1436 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.6 >Reporter: Stefano Fornari > Labels: parser, pdf > Fix For: 1.13 > > Attachments: ste-20140927.patch > > > with regards to the thread "[PDFParser] - read limited number of characters" > on Mar 29, I would like to propose the attached patch. I noticed that in Tika > 1.6 there have been some work around a better handling of the > WriteLimitReachedException condition, but I believe it could be even > improved. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
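For context, a hedged sketch of the existing write-limit mechanism the patch builds on: the caller caps the number of extracted characters and treats the resulting SAXException as a normal "stop early" signal rather than a failure.
{code:java}
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.xml.sax.SAXException;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LimitedPdfTextExample {
    public static void main(String[] args) throws Exception {
        // Stop collecting text after roughly 10,000 characters.
        WriteOutContentHandler limited = new WriteOutContentHandler(10000);
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            new PDFParser().parse(stream, new BodyContentHandler(limited), metadata, new ParseContext());
        } catch (SAXException e) {
            // Reaching the limit surfaces as a SAXException; anything else is a real error.
            if (!limited.isWriteLimitReached(e)) {
                throw e;
            }
        }
        System.out.println(limited.toString());
    }
}
{code}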
[jira] [Updated] (TIKA-1801) Integrate MITIE Named Entity Recognition support
[ https://issues.apache.org/jira/browse/TIKA-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1801: Fix Version/s: (was: 1.12) 1.13 > Integrate MITIE Named Entity Recognition support > > > Key: TIKA-1801 > URL: https://issues.apache.org/jira/browse/TIKA-1801 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.13 > > > Add Named Entity Recognition (NER) support from MITIE, the library > from MIT-LL: > https://github.com/mit-nlp/MITIE -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-774) ExifTool Parser
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-774: --- Fix Version/s: (was: 1.12) 1.13 > ExifTool Parser > --- > > Key: TIKA-774 > URL: https://issues.apache.org/jira/browse/TIKA-774 > Project: Tika > Issue Type: New Feature > Components: parser >Affects Versions: 1.0 > Environment: Requires be installed > (http://www.sno.phy.queensu.ca/~phil/exiftool/) >Reporter: Ray Gauss II > Labels: features, new-parser, newbie, patch > Fix For: 1.13 > > Attachments: testJPEG_IPTC_EXT.jpg, > tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt > > > Adds an external parser that calls ExifTool to extract extended metadata > fields from images and other content types. > In the core project: > An ExifTool interface is added which contains Property objects that define > the metadata fields available. > An additional Property constructor for internalTextBag type. > In the parsers project: > An ExiftoolMetadataExtractor is added which does the work of calling ExifTool > on the command line and mapping the response to tika metadata fields. This > extractor could be called instead of or in addition to the existing > ImageMetadataExtractor and JempboxExtractor under TiffParser and/or > JpegParser but those have not been changed at this time. > An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor. > An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool > metadata fields to existing tika and Drew Noakes metadata fields if enabled. > An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag > implementations in XML files. > An ExifToolParserTest is added which tests several expected XMP and IPTC > metadata values in testJPEG_IPTC_EXT.jpg. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle
[ https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1276: Fix Version/s: (was: 1.12) 1.13 > Missing embedded dependencies in tika-bundle > > > Key: TIKA-1276 > URL: https://issues.apache.org/jira/browse/TIKA-1276 > Project: Tika > Issue Type: Bug > Components: packaging >Affects Versions: 1.5 > Environment: OSGI, Apache Felix via Apache Sling Launcher >Reporter: Rupert Westenthaler > Fix For: 1.13 > > Attachments: TIKA-1276_20140423_rwesten.diff, > TIKA-1276_20140428_2_rwesten.diff, TIKA-1276_20140428_3_rwesten.diff, > TIKA-1276_20140428_rwesten.diff > > > While updating from Tika 1.2 to 1.5 I noticed that the > `org.apache.tika:tika-bundle:1.5` module has some missing dependencies. > 1. `com.uwyn:jhighlight:1.0` is not embedded > Because of that installing the bundle results in the following exception > {code} > org.osgi.framework.BundleException: Unresolved constraint in bundle > org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement > [103.0] osgi.wiring.package; > (osgi.wiring.package=com.uwyn.jhighlight.renderer)) > org.osgi.framework.BundleException: Unresolved constraint in bundle > org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement > [103.0] osgi.wiring.package; > (osgi.wiring.package=com.uwyn.jhighlight.renderer) > at > org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962) > at org.apache.felix.framework.Felix.startBundle(Felix.java:2025) > at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279) > at > org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304) > at java.lang.Thread.run(Thread.java:744) > {code} > 2. `org.ow2.asm:asm:4.1` is not embedded because > `org.apache.tika:tika-core:1.5` uses `org.ow2.asm:asm-debug-all:4.1` and > therefore the `Embed-Dependency` directive `asm` does not match any > dependency. > Because of that one does get the following exception (after fixing (1)) > {code} > org.osgi.framework.BundleException: Unresolved constraint in bundle > org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement > [96.0] osgi.wiring.package; > (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0 > org.osgi.framework.BundleException: Unresolved constraint in bundle > org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement > [96.0] osgi.wiring.package; > (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0))) > at > org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962) > at org.apache.felix.framework.Felix.startBundle(Felix.java:2025) > at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279) > at > org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304) > at java.lang.Thread.run(Thread.java:744) > {code} > There are two possibilities to fix this: (a) change the `Embed-Dependency` to > `asm-debug-all` or (b) add a dependency on `org.ow2.asm:asm:4.1` to the > tika-bundle pom file. > 3. 
`edu.ucar:netcdf:4.2-min` is not embedded > Because of that one does get the following exception (after fixing (1) and > (2)) > {code} > org.osgi.framework.BundleException: Unresolved constraint in bundle > org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement > [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)) > org.osgi.framework.BundleException: Unresolved constraint in bundle > org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement > [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2) > at > org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962) > at org.apache.felix.framework.Felix.startBundle(Felix.java:2025) > at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279) > at > org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304) > at java.lang.Thread.run(Thread.java:744) > {code} > 4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime > After fixing the above issues the tika-bundle was started successfully. > However when extracting EXIF metadata from a jpeg image I got the following > exception. > {code} > java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException > at > com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112) > at > com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71) > at > org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91) >
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-715: --- Fix Version/s: (was: 1.12) 1.13 > Some parsers produce non-well-formed XHTML SAX events > - > > Key: TIKA-715 > URL: https://issues.apache.org/jira/browse/TIKA-715 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 0.10 >Reporter: Michael McCandless > Labels: newbie > Fix For: 1.13 > > Attachments: TIKA-715.patch > > > With TIKA-683 I committed simple, commented out code to > SafeContentHandler, to verify that the SAX events produced by the > parser have valid (matched) tags. Ie, each startElement("foo") is > matched by the closing endElement("foo"). > I only did a basic nesting test, plus checking that <body> is never > embedded inside another <body>; we could strengthen this further to check > that all tags only appear in valid parents... > I was able to use this to fix issues with the new RTF parser > (TIKA-683), but I was surprised that some other parsers failed the new > asserts. > It could be these are relatively minor offenses (eg closing a table > w/o closing the tr) and we need not do anything here... but I think > it'd be cleaner if all our parsers produced matched, well-formed XHTML > events. > I haven't looked into any of these... it could be they are easy to fix. > Failures: > {noformat} > testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest) > Time elapsed: 0.032 sec <<< ERROR! > java.lang.AssertionError: end tag=body with no startElement > at > org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224) > at > org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275) > at > org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129) > at > org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158) > testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest) Time elapsed: > 0.116 sec <<< ERROR! 
> java.lang.AssertionError: mismatched elements open=tr close=table > at > org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226) > at > org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275) > at > org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252) > at > org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287) > at > org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136) > at > org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136) > at > com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601) > at > com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782) > at > com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938) > at > com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648) > at > com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140) > at > com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511) > at > com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808) > at > com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737) > at > com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119) > at > com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205) > at > com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522) > at javax.xml.parsers.SAXParser.parse(SAXParser.java:395) > at javax.xml.parsers.SAXParser.parse(SAXParser.java:198) > at > org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190) > at > org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49) > testMultipart(org.apache.tika.parse
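A standalone sketch of the kind of check described above; the committed verification lives in SafeContentHandler, and this decorator only illustrates the matched startElement/endElement idea.
{code:java}
import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

import java.util.ArrayDeque;
import java.util.Deque;

/** Fails fast when a parser emits mismatched startElement/endElement events. */
public class TagBalanceCheckingHandler extends ContentHandlerDecorator {
    private final Deque<String> open = new ArrayDeque<String>();

    public TagBalanceCheckingHandler(ContentHandler handler) {
        super(handler);
    }

    @Override
    public void startElement(String uri, String localName, String name, Attributes atts)
            throws SAXException {
        open.push(localName);
        super.startElement(uri, localName, name, atts);
    }

    @Override
    public void endElement(String uri, String localName, String name) throws SAXException {
        if (open.isEmpty()) {
            throw new SAXException("end tag=" + localName + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(localName)) {
            throw new SAXException("mismatched elements open=" + expected + " close=" + localName);
        }
        super.endElement(uri, localName, name);
    }
}
{code}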
[jira] [Updated] (TIKA-1672) Integrate tika-java7 component
[ https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1672: Fix Version/s: (was: 1.12) 1.13 > Integrate tika-java7 component > -- > > Key: TIKA-1672 > URL: https://issues.apache.org/jira/browse/TIKA-1672 > Project: Tika > Issue Type: Improvement >Reporter: Tyler Palsulich > Fix For: 1.13 > > > Code requiring Java 7 doesn't need to be in a separate module now that > TIKA-1536 (upgrade to Java 7) is done. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1800) MediaType#parse does not decode escaped special characters
[ https://issues.apache.org/jira/browse/TIKA-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1800: Fix Version/s: (was: 1.12) 1.13 > MediaType#parse does not decode escaped special characters > -- > > Key: TIKA-1800 > URL: https://issues.apache.org/jira/browse/TIKA-1800 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 1.11 >Reporter: Roberto Benedetti > Fix For: 1.13 > > > Special characters in parameter value are escaped in canonical string > representation but they are not unescaped when the canonical string > representation is parsed. > {code:java} > MediaType mType = new MediaType(MediaType.APPLICATION_XML, "x-report", > "#report@"); > String cType = mType.toString(); // application/xml; x-report="#report\@" > assertEquals("application/xml; x-report=\"#report\\@\"", cType); // success > mType = MediaType.parse(cType); > String report = mType.getParameters().get("x-report"); // #report\@ > assertEquals("#report@", report); // failure > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-776) ExifTool Embedder
[ https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-776: --- Fix Version/s: (was: 1.12) 1.13 > ExifTool Embedder > - > > Key: TIKA-776 > URL: https://issues.apache.org/jira/browse/TIKA-776 > Project: Tika > Issue Type: New Feature > Components: metadata >Affects Versions: 1.0 > Environment: ExifTool is required > (http://www.sno.phy.queensu.ca/~phil/exiftool/) >Reporter: Ray Gauss II > Labels: embed, exiftool, patch > Fix For: 1.13 > > Attachments: tika-parsers-exiftool-embed-patch.txt > > > This patch adds an ExifTool ExternalEmbedder which builds upon the work in > issue TIKA-774 and TIKA-775. > In the tika-parsers an ExiftoolExternalEmbedder is added which extends > ExternalEmbedder to programmatically create an Embedder which calls the > ExifTool command line to embed tika metadata into a file stream and an > ExiftoolExternalEmbedderTest unit test is added which embeds several IPTC and > XMP fields then parses the resulting file stream to verify the operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1598) Parser Implementation for Streaming Video
[ https://issues.apache.org/jira/browse/TIKA-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1598: Fix Version/s: (was: 1.12) 1.13 > Parser Implementation for Streaming Video > - > > Key: TIKA-1598 > URL: https://issues.apache.org/jira/browse/TIKA-1598 > Project: Tika > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Labels: memex > Fix For: 1.13 > > > A number of us have been discussing a Tika implementation which could, for > example, bind to a live multimedia stream and parse content from the stream > until it finished. > An excellent example would be watching Bonnie Scotland beating R. of Ireland > in the upcoming European Championship Qualifying - Group D on Sat 13 Jun @ > 17:00 GMT :) > I located a JMF Wrapper for ffmpeg which 'may' enable us to do this > http://sourceforge.net/projects/jffmpeg/ > I am not sure... plus it is not licensed liberally enough for us to include > so if there are other implementations then please post them here. > I 'may' be able to have a crack at implementing this next week. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1318) Use of Deprecated Word6Extractor.getParagraphText() Method
[ https://issues.apache.org/jira/browse/TIKA-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1318: Fix Version/s: (was: 1.12) 1.13 > Use of Deprecated Word6Extractor.getParagraphText() Method > -- > > Key: TIKA-1318 > URL: https://issues.apache.org/jira/browse/TIKA-1318 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 >Reporter: Tyler Palsulich >Priority: Minor > Labels: deprecation > Fix For: 1.13 > > > org.apache.tika.parser.microsoft.WordExtractor.parseWord6() uses the > deprecated Word6Extractor.getParagraphText() method. getParagraphText() is > supposed to return a String[] with an element for each paragraph in the text. > The replacement is getText(), which lets paragraph, cell, etc separation be > implementation specific. I'm not sure, at this point, how the POI > WordExtractor separates them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1505) chmparser breaks down when extracting from file of CHM format v3
[ https://issues.apache.org/jira/browse/TIKA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1505: Fix Version/s: (was: 1.12) 1.13 > chmparser breaks down when extracting from file of CHM format v3 > > > Key: TIKA-1505 > URL: https://issues.apache.org/jira/browse/TIKA-1505 > Project: Tika > Issue Type: Bug >Reporter: Bin Hawking > Fix For: 1.13 > > > chmparser throws exception or returns faulty text when: > 1. extracting from file of CHM format version 3 > 2. chm file with lzx reset interval > 2 > 3. chm file with >5000 objects > I am making the fix now. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1456) Visual Sentiment API parser
[ https://issues.apache.org/jira/browse/TIKA-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1456: Fix Version/s: (was: 1.12) 1.13 > Visual Sentiment API parser > --- > > Key: TIKA-1456 > URL: https://issues.apache.org/jira/browse/TIKA-1456 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Labels: gsoc2015 > Fix For: 1.13 > > > Integrate the Visual Sentibank API as a parser for images. We can use > Aperture from CMU, it's released under the MIT license: > https://github.com/d8w/aperture -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1705) Update ASM dependency to 5.0.4
[ https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1705: Fix Version/s: (was: 1.12) 1.13 > Update ASM dependency to 5.0.4 > -- > > Key: TIKA-1705 > URL: https://issues.apache.org/jira/browse/TIKA-1705 > Project: Tika > Issue Type: Task >Affects Versions: 1.7 >Reporter: Uwe Schindler >Assignee: Dave Meikle > Fix For: 1.13 > > Attachments: TIKA-1705-2.patch, TIKA-1705.patch > > > Currently the Class file parser uses ASM 4.1. This older version cannot read > Java 8 / Java 9 class files (fails with Exception). > The upgrade to ASM 5.0.4 is very simple, just a Maven dependency change. The > code change is only to update the visitor version, so it gets new Java 8 > features like lambdas reported, but this is not really required, but should > be done for full support. > FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM > 5, too. > You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no > problem with Lucene using a newer version). Since ASM 4.x the upgrades are > easier (no visitor interfaces anymore, abstract classes instead), so it > does not break if you just replace the JAR file. So just see this as a > recommendation, not urgent! Solr/Lucene will also work without this patch > (it just replaces the shipped ASM by newer version in our packaging). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
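An illustration of the visitor-version bump mentioned above; this is not Tika's actual class-file visitor, just the ASM 5 idiom of passing Opcodes.ASM5 so that newer class-file constructs are reported instead of rejected.
{code:java}
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

import java.io.IOException;
import java.io.InputStream;

public class Asm5VisitorSketch {
    public static void listMethods(InputStream classFile) throws IOException {
        // Opcodes.ASM5 (rather than ASM4) opts the visitor into the ASM 5 API level.
        ClassVisitor visitor = new ClassVisitor(Opcodes.ASM5) {
            @Override
            public MethodVisitor visitMethod(int access, String name, String desc,
                                             String signature, String[] exceptions) {
                System.out.println(name + desc);
                return super.visitMethod(access, name, desc, signature, exceptions);
            }
        };
        new ClassReader(classFile).accept(visitor, ClassReader.SKIP_CODE);
    }
}
{code}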
[jira] [Updated] (TIKA-1674) Add example to show how to extract embedded files
[ https://issues.apache.org/jira/browse/TIKA-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1674: Fix Version/s: (was: 1.12) 1.13 > Add example to show how to extract embedded files > - > > Key: TIKA-1674 > URL: https://issues.apache.org/jira/browse/TIKA-1674 > Project: Tika > Issue Type: New Feature >Reporter: Tim Allison >Priority: Minor > Fix For: 1.13 > > > On tika-user, we received a question on how to extract embedded files. Let's > add an example. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
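A hedged sketch of what such an example could look like: a custom EmbeddedDocumentExtractor registered in the ParseContext that writes each embedded document to an output directory. The class names and output-path handling are illustrative, not an existing Tika example.
{code:java}
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractEmbeddedFilesExample {

    static class FileSavingExtractor implements EmbeddedDocumentExtractor {
        private final Path outputDir;
        private int count = 0;

        FileSavingExtractor(Path outputDir) {
            this.outputDir = outputDir;
        }

        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                                  Metadata metadata, boolean outputHtml) throws IOException {
            // Fall back to a counter when the container does not report a name.
            String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
            if (name == null || name.isEmpty()) {
                name = "embedded-" + (++count);
            }
            Files.copy(stream, outputDir.resolve(Paths.get(name).getFileName()));
        }
    }

    public static void main(String[] args) throws Exception {
        Path outputDir = Files.createDirectories(Paths.get("extracted"));
        ParseContext context = new ParseContext();
        context.set(EmbeddedDocumentExtractor.class, new FileSavingExtractor(outputDir));
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(stream, new BodyContentHandler(-1), new Metadata(), context);
        }
    }
}
{code}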
[jira] [Updated] (TIKA-1724) Create parser for .obo file format.
[ https://issues.apache.org/jira/browse/TIKA-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1724: Fix Version/s: (was: 1.12) 1.13 > Create parser for .obo file format. > --- > > Key: TIKA-1724 > URL: https://issues.apache.org/jira/browse/TIKA-1724 > Project: Tika > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.13 > > Attachments: TIKA-1724.patch, TIKA-1724.patch > > > This parser implementation caters for files of the [OBO Flat File Format > Guide, version 1.4|http://purl.obolibrary.org/obo/oboformat/spec.html] > MimeType. > The OBO format is the text file format used by OBO-Edit, the open source, > platform-independent application for viewing and editing ontologies. This > file format is used heavily within the clinical and biomedical fields as a > particular flat file serialization for ontologies. .obo files are 'typically' > accompanied by corresponding .owl serializations as this is also another file > format used pervasively within the clinical and biomedical fields. > I would sincerely appreciate code review. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1301) Establish TikaServer on Apache hosted VM
[ https://issues.apache.org/jira/browse/TIKA-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1301: Fix Version/s: (was: 1.12) 1.13 > Establish TikaServer on Apache hosted VM > > > Key: TIKA-1301 > URL: https://issues.apache.org/jira/browse/TIKA-1301 > Project: Tika > Issue Type: Bug > Components: server >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.13 > > > Over in Any23, Infra recently provisioned us with a nice shiny new VM to run > our service on > http://any23.org > I would like to do the same for Tika. I have some scripts on the Any23 VM > which will pull stable nightly tika-server snapshots and deploy them to the > VM. This is really nice for both dev's and users alike. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled
[ https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1815: Fix Version/s: (was: 1.12) 1.13 > Text content from parser is empty when NamedEntityParser is enabled > --- > > Key: TIKA-1815 > URL: https://issues.apache.org/jira/browse/TIKA-1815 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Thamme Gowda N >Assignee: Chris A. Mattmann > Labels: memex > Fix For: 1.13 > > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > When the NamedEntityParser is enabled, the Tika#parseToString() and other > parse() methods produces an empty string. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1738) ForkClient does not always delete temporary bootstrap jar
[ https://issues.apache.org/jira/browse/TIKA-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1738: Fix Version/s: (was: 1.12) 1.13 > ForkClient does not always delete temporary bootstrap jar > - > > Key: TIKA-1738 > URL: https://issues.apache.org/jira/browse/TIKA-1738 > Project: Tika > Issue Type: Bug > Components: core > Environment: Windows 10 >Reporter: Yaniv Kunda >Priority: Minor > Fix For: 1.13 > > Attachments: TIKA-1738.patch > > > ForkClient creates a new temporary bootstrap jar each time it's instantiated, > and tries to delete it in the {{close()}} method, after destroying the > process. > Possibly a Windows-specific behavior, the OS seems to still hold a handle to > the file for a bit after the process is destroyed, causing the delete() method to > do nothing. > This is recreated by simply running ForkParserTest on my machine. > In a long-running process, this could fill the temp folder with many bootstrap > jars that will never be deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
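A hedged sketch of one possible mitigation for this kind of Windows behaviour: fall back to deleteOnExit() when the immediate delete fails. Whether this is the right fix for ForkClient is what the attached patch addresses; the helper name here is illustrative.
{code:java}
import java.io.File;

public final class TempFileCleanup {
    private TempFileCleanup() {}

    /**
     * Tries to delete the file now; if the platform still holds a handle
     * (as observed on Windows shortly after Process#destroy), registers it
     * for deletion at JVM exit so it does not accumulate in the temp folder.
     */
    public static void deleteOrScheduleForExit(File file) {
        if (file != null && file.exists() && !file.delete()) {
            file.deleteOnExit();
        }
    }
}
{code}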
[jira] [Updated] (TIKA-1697) Parser Implementation for AkomaNtoso Legal XML Documents
[ https://issues.apache.org/jira/browse/TIKA-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1697: Fix Version/s: (was: 1.12) 1.13 > Parser Implementation for AkomaNtoso Legal XML Documents > > > Key: TIKA-1697 > URL: https://issues.apache.org/jira/browse/TIKA-1697 > Project: Tika > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.13 > > > [AkomaNtoso|http://www.akomantoso.org/] is an established OASIS Legal > Document XML standard and used pervasively within parliaments and other > legislative arenas. > This issue should utilize the > [akomantoso-lib|https://github.com/kohsah/akomantoso-lib] to parse and > populate Metadata for AkomaNtoso .xml and .akn documents. > I'll send a PR for this soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-988) We don't extract a placeholder for a Word document embedded in an Excel document
[ https://issues.apache.org/jira/browse/TIKA-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-988: --- Fix Version/s: (was: 1.12) 1.13 > We don't extract a placeholder for a Word document embedded in an Excel > document > > > Key: TIKA-988 > URL: https://issues.apache.org/jira/browse/TIKA-988 > Project: Tika > Issue Type: Improvement > Components: parser >Reporter: Michael McCandless > Fix For: 1.13 > > Attachments: bug31373.xls > > > In TIKA-956 we fixed the Word parser so that at the point where an embedded > document appears, we output a tag. > It would be nice to do this for documents embedded in Excel too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1518) Docker with Tika Server
[ https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1518: Fix Version/s: (was: 1.12) 1.13 > Docker with Tika Server > --- > > Key: TIKA-1518 > URL: https://issues.apache.org/jira/browse/TIKA-1518 > Project: Tika > Issue Type: New Feature >Reporter: Paul Ramirez > Fix For: 1.13 > > > This version should be able to demonstrate as many of Apache Tika's > capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to > show parsers which require installation of other dependencies. In addition, > this should help move TIKA-1301 forward and should leverage the suggestion > made by [~lewismc] of a script which can pull down the latest version of > Apache Tika. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1379) error in Tika().detect for xml files with xades signature
[ https://issues.apache.org/jira/browse/TIKA-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1379: Fix Version/s: (was: 1.12) 1.13 > error in Tika().detect for xml files with xades signature > - > > Key: TIKA-1379 > URL: https://issues.apache.org/jira/browse/TIKA-1379 > Project: Tika > Issue Type: Bug > Components: detector >Affects Versions: 1.4 >Reporter: Alessandro De Angelis > Labels: new-parser > Fix For: 1.13 > > > We tried to get the MIME type of an XML file with an embedded XAdES signature. > The result is "text/html" and not the expected "text/xml" or > "application/xml". > Here is an example of the XML file: > {code} > [sample omitted: a signed exam-record XML document containing XAdES/xmldsig signature markup and an embedded XSLT stylesheet; its markup was mangled in transit] > {code}
[jira] [Updated] (TIKA-1343) Create a Tika Translator implementation that uses JoshuaDecoder
[ https://issues.apache.org/jira/browse/TIKA-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1343: Fix Version/s: (was: 1.12) 1.13 > Create a Tika Translator implementation that uses JoshuaDecoder > --- > > Key: TIKA-1343 > URL: https://issues.apache.org/jira/browse/TIKA-1343 > Project: Tika > Issue Type: New Feature > Components: translation >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.13 > > > The Joshua Decoder toolkit is a BSD-licensed Java-based statistical machine > translation system hosted at Github: > http://joshua-decoder.org/ > Joshua takes in corpora and trains models that can then be used to do > language translation. Currently there is support for e.g., Spanish->English, > Indian dialects->English, Chinese->English, and a few others. > https://github.com/joshua-decoder/joshua/ > It would be nice to build a Tika Translator on top of Joshua. There are of > course several issues with this: > * the models are huge - so we'll need a separate package or Maven module, > maybe tika-translate-joshua or something to release the models and we'll need > to build the models. I just went through the process of building the > Spanish->English one, and it still needs to be rebuilt b/c I did it wrong, > but it took over a day > * there is a configuration for Joshua, and so we need some way of passing > that config into the Translator. Not sure of the best way to do this. > * Joshua isn't in the Central repository. I've started a discussion on the > Joshua lists about this: > https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0 > Anyhoo, I've got a working patch right now with hard-coded stuff, and a manual > install into my Maven repo for brave souls out there that want to try it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
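A rough, hypothetical skeleton of what a Joshua-backed implementation of Tika's Translator interface could look like. The class name, the JOSHUA environment variable, and the default source language are assumptions, and the actual decoder invocation is deliberately left out since it depends on how the models end up being packaged.
{code}
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.translate.Translator;

public class JoshuaTranslator implements Translator {

    private final String joshuaHome = System.getenv("JOSHUA");   // assumed config mechanism

    @Override
    public String translate(String text, String sourceLanguage, String targetLanguage)
            throws TikaException, IOException {
        if (!isAvailable()) {
            throw new TikaException("Joshua decoder / language pack not installed");
        }
        // Placeholder: feed the text to the decoder for the requested language pair
        // and return its output. Returned unchanged here until the invocation is wired in.
        return text;
    }

    @Override
    public String translate(String text, String targetLanguage)
            throws TikaException, IOException {
        return translate(text, "es", targetLanguage);             // assumed default source language
    }

    @Override
    public boolean isAvailable() {
        return joshuaHome != null;
    }
}
{code}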
[jira] [Updated] (TIKA-1808) Head section closed too eager
[ https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1808: Fix Version/s: (was: 1.12) 1.13 > Head section closed too eager > - > > Key: TIKA-1808 > URL: https://issues.apache.org/jira/browse/TIKA-1808 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.11 >Reporter: Markus Jelsma > Fix For: 1.13 > > > XHTMLContentHandler has some logic that closes the head section too early, or > this is a problem in TagSoup. In this [1] case an element appears in the > head, causing the head to be closed. Subsequent elements do not appear > in custom ContentHandlers so I cannot read the document's title or any other > meta tags. > It can be fixed by using a custom HTMLSchema in the ParseContext, e.g. > schema.elementType("div", HTMLSchema.M_EMPTY, 65535, 0); but this isn't > really an elegant solution. > [1] http://www.aljazeera.com/news/2015/05/150516182251747.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
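The workaround mentioned in the report, spelled out as a sketch: hand HtmlParser a customised TagSoup schema through the ParseContext. Only the helper method name is invented here; the schema tweak is the one quoted above.
{code}
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.ccil.cowan.tagsoup.HTMLSchema;
import org.ccil.cowan.tagsoup.Schema;

static Metadata parseWithLenientHead(InputStream stream) throws Exception {
    // Treat <div> as an empty element so a stray <div> inside <head>
    // no longer makes TagSoup close the head section early.
    HTMLSchema schema = new HTMLSchema();
    schema.elementType("div", HTMLSchema.M_EMPTY, 65535, 0);

    ParseContext context = new ParseContext();
    context.set(Schema.class, schema);          // HtmlParser reads the schema from the context

    Metadata metadata = new Metadata();
    new AutoDetectParser().parse(stream, new BodyContentHandler(-1), metadata, context);
    return metadata;                            // title and meta tags are now populated
}
{code}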
[jira] [Updated] (TIKA-1208) Migrate Any23 mime contributions to Tika
[ https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1208: Fix Version/s: (was: 1.12) 1.13 > Migrate Any23 mime contributions to Tika > > > Key: TIKA-1208 > URL: https://issues.apache.org/jira/browse/TIKA-1208 > Project: Tika > Issue Type: Sub-task > Components: mime >Reporter: Lewis John McGibbney > Fix For: 1.13 > > Attachments: TIKA-1208.patch > > > We begin with one of the most obvious areas in which there > is overlap. > In short, the appeal of this package is the addition of detection > for the following types: > - text/n3 > - text/rdf+n3 > - application/n3 > - text/x-nquads > - text/rdf+nq > - text/nq > - application/nq > - text/turtle > - application/x-turtle > - application/turtle > - application/trix > > Therefore, although both Tika and Any23 perform MIME type-related > tasks, there is a contribution to be made. This involves the transferral of > code pertaining to pattern recognition, MIME type XML definitions within > tika-mimetypes.xml and a Purifier implementation that removes any > blank characters at the start of a file that might > prevent its MIME type detection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1106) CLAVIN Integration
[ https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1106: Fix Version/s: (was: 1.12) 1.13 > CLAVIN Integration > -- > > Key: TIKA-1106 > URL: https://issues.apache.org/jira/browse/TIKA-1106 > Project: Tika > Issue Type: New Feature > Components: parser >Affects Versions: 1.3 > Environment: All >Reporter: Adam Estrada >Assignee: Chris A. Mattmann >Priority: Minor > Labels: entity, geospatial, new-parser > Fix For: 1.13 > > > I've been evaluating CLAVIN as a way to extract location information from > unstructured text. It seems like meshing it with Tika in some way would make > a lot of sense. From CLAVIN website... > {quote} > CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source > software package for document geotagging and geoparsing that employs > context-based geographic entity resolution. It combines a variety of open > source tools with natural language processing techniques to extract location > names from unstructured text documents and resolve them against gazetteer > records. Importantly, CLAVIN does not simply "look up" location names; > rather, it uses intelligent heuristics in an attempt to identify precisely > which "Springfield" (for example) was intended by the author, based on the > context of the document. CLAVIN also employs fuzzy search to handle > incorrectly-spelled location names, and it recognizes alternative names > (e.g., "Ivory Coast" and "Côte d'Ivoire") as referring to the same geographic > entity. By enriching text documents with structured geo data, CLAVIN enables > hierarchical geospatial search and advanced geospatial analytics on > unstructured data. > {quote} > There was only one other instance of the word "clavin" mentioned in the ASF > jira site so I thought it was definitely worth posting here. > https://github.com/Berico-Technologies/CLAVIN -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1540) New Tika plugin for image based feature extraction using computer vision techniques
[ https://issues.apache.org/jira/browse/TIKA-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1540: Fix Version/s: (was: 1.12) 1.13 > New Tika plugin for image based feature extraction using computer vision > techniques > --- > > Key: TIKA-1540 > URL: https://issues.apache.org/jira/browse/TIKA-1540 > Project: Tika > Issue Type: New Feature > Environment: cross platform >Reporter: Aashish Chaudhary >Assignee: Lewis John McGibbney > Labels: gsoc2015 > Fix For: 1.13 > > Attachments: TIKA-vision.achaudhary.150209.patch.txt > > > This will be a web-service client based parser to perform image feature > extraction using Computer Vision techniques. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1295) Make some Dublin Core items multi-valued
[ https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1295: Fix Version/s: (was: 1.12) 1.13 > Make some Dublin Core items multi-valued > > > Key: TIKA-1295 > URL: https://issues.apache.org/jira/browse/TIKA-1295 > Project: Tika > Issue Type: Bug > Components: metadata >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Fix For: 1.13 > > > According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, > dc:title, dc:description and dc:rights should allow multiple values because > of language alternatives. Unless anyone objects in the next few days, I'll > switch those to Property.toInternalTextBag() from Property.toInternalText(). > I'll also modify PDFParser to extract dc:rights. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
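A small sketch of what the proposed multi-valued behaviour looks like from the caller's side, assuming a text-bag property keyed on dc:rights; the variable name is illustrative and this is not the actual Tika property definition being changed.
{code}
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.Property;

public class MultiValuedDcExample {
    public static void main(String[] args) {
        // A text-bag property accepts several values, e.g. one rights statement per language.
        Property dcRights = Property.internalTextBag("dc:rights");

        Metadata metadata = new Metadata();
        metadata.add(dcRights, "All rights reserved");
        metadata.add(dcRights, "Tous droits réservés");

        for (String value : metadata.getValues(dcRights)) {
            System.out.println(value);   // both language alternatives survive
        }
    }
}
{code}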
[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag
[ https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-539: --- Fix Version/s: (was: 1.12) 1.13 > Encoding detection is too biased by encoding in meta tag > > > Key: TIKA-539 > URL: https://issues.apache.org/jira/browse/TIKA-539 > Project: Tika > Issue Type: Improvement > Components: metadata, parser >Affects Versions: 0.8, 0.9, 0.10 >Reporter: Reinhard Schwab >Assignee: Ken Krugler >Priority: Minor > Fix For: 1.13 > > Attachments: TIKA-539.patch, TIKA-539_2.patch > > > if the encoding in the meta tag is wrong, this encoding is detected, > even if there is the right encoding set in metadata before(which can be from > http response header). > test code to reproduce: > static String content = "\n" > + " content=\"application/xhtml+xml; charset=iso-8859-1\" />" > + "Über den Wolken\n"; > /** >* @param args >* @throws IOException >* @throws TikaException >* @throws SAXException >*/ > public static void main(String[] args) throws IOException, SAXException, > TikaException { > Metadata metadata = new Metadata(); > metadata.set(Metadata.CONTENT_TYPE, "text/html"); > metadata.set(Metadata.CONTENT_ENCODING, "UTF-8"); > System.out.println(metadata.get(Metadata.CONTENT_ENCODING)); > InputStream in = new > ByteArrayInputStream(content.getBytes("UTF-8")); > AutoDetectParser parser = new AutoDetectParser(); > BodyContentHandler h = new BodyContentHandler(1); > parser.parse(in, h, metadata, new ParseContext()); > System.out.print(h.toString()); > System.out.println(metadata.get(Metadata.CONTENT_ENCODING)); > } -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
[ https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1059: Fix Version/s: (was: 1.12) 1.13 > Better Handling of InterruptedException in ExternalParser and ExternalEmbedder > -- > > Key: TIKA-1059 > URL: https://issues.apache.org/jira/browse/TIKA-1059 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.3 >Reporter: Ray Gauss II > Fix For: 1.13 > > > The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch > {{InterruptedException}} and ignore it. > The methods should either call {{interrupt()}} on the current thread or > re-throw the exception, possibly wrapped in a {{TikaException}}. > See TIKA-775 for a previous discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
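One possible shape for the improved handling, shown as a sketch rather than the actual patch: restore the interrupt flag and surface the failure as a TikaException instead of silently swallowing it.
{code}
import org.apache.tika.exception.TikaException;

static int waitForExternalProcess(Process process) throws TikaException {
    try {
        return process.waitFor();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();   // keep the thread's interrupted status
        throw new TikaException("Interrupted while waiting for the external process", e);
    }
}
{code}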
[jira] [Updated] (TIKA-1328) Translate Metadata and Content
[ https://issues.apache.org/jira/browse/TIKA-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1328: Fix Version/s: (was: 1.12) 1.13 > Translate Metadata and Content > -- > > Key: TIKA-1328 > URL: https://issues.apache.org/jira/browse/TIKA-1328 > Project: Tika > Issue Type: New Feature > Components: translation >Reporter: Tyler Palsulich > Fix For: 1.13 > > > Right now, Translation is only done on Strings. Ideally, users would be able > to "turn on" translation while parsing. I can think of a couple options: > - Make a TranslateAutoDetectParser. Automatically detect the file type, parse > it, then translate the content. > - Make a Context switch. When true, translate the content regardless of the > parser used. I'm not sure the best way to go about this method, but I prefer > it over another Parser. > Regardless, we need a black or white list for translation. I think black list > would be the way to go -- which fields should not be translated (dates, > versions, ...) Any ideas? Also, somewhat unrelated, does anyone know of any > other open source translation libraries? If we were really lucky, it wouldn't > depend on an online service. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
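A hypothetical sketch of the second option above: a decorator that runs the wrapped parser normally and then translates the extracted text before handing it to the caller's handler. The class name, constructor, and the translate-everything policy (no black/white list yet) are assumptions, not an existing Tika API.
{code}
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.translate.Translator;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ParserDecorator;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class TranslatingParser extends ParserDecorator {

    private final Translator translator;
    private final String targetLanguage;

    public TranslatingParser(Parser parser, Translator translator, String targetLanguage) {
        super(parser);
        this.translator = translator;
        this.targetLanguage = targetLanguage;
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler, Metadata metadata,
                      ParseContext context) throws IOException, SAXException, TikaException {
        BodyContentHandler body = new BodyContentHandler(-1);
        super.parse(stream, body, metadata, context);             // normal extraction first

        String translated = translator.translate(body.toString(), targetLanguage);

        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.element("p", translated);                           // replay the translated text
        xhtml.endDocument();
    }
}
{code}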
[jira] [Updated] (TIKA-1709) Tika Server doesn't handle multi-part attachments or form-encoded inputs
[ https://issues.apache.org/jira/browse/TIKA-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1709: Fix Version/s: (was: 1.12) 1.13 > Tika Server doesn't handle multi-part attachments or form-encoded inputs > > > Key: TIKA-1709 > URL: https://issues.apache.org/jira/browse/TIKA-1709 > Project: Tika > Issue Type: Bug > Components: server > Environment: http://github.com/chrismattmann/tika-python/ Windows 7 > Ultimate >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.13 > > > Downstream in the Tika Python library, I noticed that Tika Server doesn't > handle multi-part attachments (e.g., in /rmeta) on Windows 7 Ultimate, such as > those encoded using curl -T. Tika Server returns a 415 indicating that > it can't properly determine what the MIME type is. > See: > https://github.com/kennethreitz/requests/issues/2725 > https://github.com/chrismattmann/tika-python/issues/58 > For more info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-987) Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
[ https://issues.apache.org/jira/browse/TIKA-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-987: --- Fix Version/s: (was: 1.12) 1.13 > Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted > > > Key: TIKA-987 > URL: https://issues.apache.org/jira/browse/TIKA-987 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Michael McCandless > Fix For: 1.13 > > Attachments: picture.doc, picture_3.doc > > > I have two Word docs, both containing the same drawing, but one has > text added. > In one case (picture.doc) the extraction is correct: it contains only > an embedded image.wmf; when I view the image it's correct. > In the second case (picture_3.doc) the picture is extracted as image > (no extension), and is 0 bytes, and there is an invalid character > (mapped to unicode replacement char) inserted before the image: > {noformat} > > > � > > > vehicle > > {noformat} > (Though, the text "vehicle" is extracted correctly). > I dug a bit, and with the 2nd doc there is an embedded {SHAPE * > MERGEFORMAT} field, which we invoke > WordExtractor.handleSpecialCharacterRuns on, and somehow it extracts > the 0-byte no-extension image as well as the invalid character. With > the first doc there is no field (at least not one that's handle with > handleSpecialCharacterRuns...). Otherwise I'm not sure how to > fix... it could be something is going wrong in how POI parses the > Pictures from PictureSource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1513) Add mime detection and parsing for dbf files
[ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1513: Fix Version/s: (was: 1.12) 1.13 > Add mime detection and parsing for dbf files > > > Key: TIKA-1513 > URL: https://issues.apache.org/jira/browse/TIKA-1513 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 1.13 > > > I just came across an Apache licensed dbf parser that is available on > [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom]. > Let's add dbf parsing to Tika. > Any other recommendations for alternate parsers? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1465) Implement extraction of non-global variables from netCDF3 and netCDF4
[ https://issues.apache.org/jira/browse/TIKA-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1465: Fix Version/s: (was: 1.12) 1.13 > Implement extraction of non-global variables from netCDF3 and netCDF4 > - > > Key: TIKA-1465 > URL: https://issues.apache.org/jira/browse/TIKA-1465 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.6 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.13 > > > Speaking to Eric Nienhouse at the ongoing NSF funded Polar > Cyberinfrastructure hackathon in NYC, we became aware that variables > parameters contained within netCDF3 and netCDF4 are just as valuable (if not > more valuable) as global attribute values. > AFAIK, right now we only extract global attributes however we could extend > the support to cater for the above observations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
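A rough sketch of the per-variable walk the extended parser would need, using the NetCDF-Java API that tika-parsers already depends on; the file path is a placeholder and how the values map onto Tika Metadata keys is left open.
{code}
import ucar.nc2.Attribute;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

static void dumpVariableMetadata(String path) throws java.io.IOException {
    try (NetcdfFile ncFile = NetcdfFile.open(path)) {
        for (Variable variable : ncFile.getVariables()) {
            // Non-global (per-variable) metadata the issue asks for: name, shape, attributes
            System.out.println(variable.getShortName() + " (" + variable.getDimensionsString() + ")");
            for (Attribute attribute : variable.getAttributes()) {
                // e.g. units, long_name, _FillValue
                System.out.println("  " + attribute.getShortName() + " = " + attribute.getStringValue());
            }
        }
    }
}
{code}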
[jira] [Updated] (TIKA-1220) Parser implementration for IFC files
[ https://issues.apache.org/jira/browse/TIKA-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1220: Fix Version/s: (was: 1.12) 1.13 > Parser implementration for IFC files > > > Key: TIKA-1220 > URL: https://issues.apache.org/jira/browse/TIKA-1220 > Project: Tika > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Labels: new-parser > Fix For: 1.13 > > Attachments: 2012-03-23-Duplex-Programming.ifc > > > The Industry Foundation Classes (IFC) [0] data model is intended to describe > building and construction industry data. For the sake of argument, it can be > considered as a more intelligent successor to the .dwg data models used > within CAD models. > I've tracked down a potential 3rd party library [1] which we may be able to > wrap and use within Tika; however, the provided software packages are licensed > under: http://creativecommons.org/licenses/by-nc-sa/3.0/de/ so I am currently > over on legal-discuss@ in an attempt to see if it is possible to wrap some > code and contribute it to tika-parsers. > When I get feedback from legal-discuss, and if this is a go-ahead, I'll need > to help the developers package the code as Maven artifact(s), then I will > progress with writing the implementation. > [0] http://en.wikipedia.org/wiki/Industry_Foundation_Classes > [1] http://www.ifctoolsproject.com/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies
[ https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1367: Fix Version/s: (was: 1.12) 1.13 > Tika documentation should list tika-parsers parser dependencies > --- > > Key: TIKA-1367 > URL: https://issues.apache.org/jira/browse/TIKA-1367 > Project: Tika > Issue Type: Improvement > Components: documentation >Reporter: Sergey Beryozkin > Fix For: 1.13 > > > The tika-parsers module has many strong transitive parser dependencies. Maven > users of tika-parsers have to exclude all the transitive dependencies > manually. Documenting the list of the existing transitive dependencies and > keeping the list up to date will help developers exclude the libraries not > needed for a given project. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL
[ https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1696: Fix Version/s: (was: 1.12) 1.13 > Language Identification with Text Processing Toolkit from MITLL > --- > > Key: TIKA-1696 > URL: https://issues.apache.org/jira/browse/TIKA-1696 > Project: Tika > Issue Type: New Feature > Components: languageidentifier >Reporter: Paul Ramirez > Fix For: 1.13 > > > The aim here is to extend the methods for language identification within > text. MIT Lincoln Labs has an open source library [1] written in Julia. > Having spoken with the MITLL guys there is a possibility that there is a > scala version of this library which would make it easier to package in with > Tika. > At this point I'm not quite sure how many languages this library supports by > default but it can be extended when provided some training data. > [1] https://github.com/mit-nlp/Text.jl -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1508) Add uniformity to parser parameter configuration
[ https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1508: Fix Version/s: (was: 1.12) 1.13 > Add uniformity to parser parameter configuration > > > Key: TIKA-1508 > URL: https://issues.apache.org/jira/browse/TIKA-1508 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison > Fix For: 1.13 > > > We can currently configure parsers by the following means: > 1) programmatically by direct calls to the parsers or their config objects > 2) sending in a config object through the ParseContext > 3) modifying .properties files for specific parsers (e.g. PDFParser) > Rather than scattering the landscape with .properties files for each parser, > it would be great if we could specify parser parameters in the main config > file, something along the lines of this: > {noformat} > > > 2 > something or other > > audio/basic > audio/x-aiff > audio/x-wav > > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content
[ https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-819: --- Fix Version/s: (was: 1.12) 1.13 > Make Option to Exclude Embedded Files' Text for Text Content > > > Key: TIKA-819 > URL: https://issues.apache.org/jira/browse/TIKA-819 > Project: Tika > Issue Type: New Feature > Components: general >Affects Versions: 1.0 > Environment: Windows-7 + JDK 1.6 u26 >Reporter: Albert L. > Fix For: 1.13 > > > It would be nice to be able to disable text content from embedded files. > For example, if I have a DOCX with an embedded PPTX, then I would like the > option to disable text from the PPTX from showing up when asking for the text > content from DOCX. In other words, it would be nice to have the option to > get text content *only* from the DOCX instead of the DOCX+PPTX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
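For what it's worth, a sketch of how container-only text can already be approximated with the existing DocumentSelector hook in the ParseContext; this illustrates the idea rather than the requested first-class option, and the method name is made up.
{code}
import java.io.InputStream;
import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

static String parseContainerOnly(InputStream stream) throws Exception {
    ParseContext context = new ParseContext();
    context.set(DocumentSelector.class, metadata -> false);   // reject every embedded document

    BodyContentHandler handler = new BodyContentHandler(-1);
    new AutoDetectParser().parse(stream, handler, new Metadata(), context);
    return handler.toString();                                 // DOCX text only, no embedded PPTX text
}
{code}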
[jira] [Updated] (TIKA-1108) Represent individual slides in pptx
[ https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1108: Fix Version/s: (was: 1.12) 1.13 > Represent individual slides in pptx > --- > > Key: TIKA-1108 > URL: https://issues.apache.org/jira/browse/TIKA-1108 > Project: Tika > Issue Type: Improvement > Components: parser >Reporter: Daniel Bonniot de Ruisselet > Fix For: 1.13 > > > When parsing ppt, tika produces for each slide: > > However for pptx these seem to be missing, all the text is directly under > . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1366) Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse
[ https://issues.apache.org/jira/browse/TIKA-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1366: Fix Version/s: (was: 1.12) 1.13 > Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse > > > Key: TIKA-1366 > URL: https://issues.apache.org/jira/browse/TIKA-1366 > Project: Tika > Issue Type: Improvement > Components: server >Reporter: Sergey Beryozkin >Priority: Minor > Fix For: 1.13 > > > Some of Tika Server services will benefit from optionally supporting JAX-RS > 2.0 AsyncResponse -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1390) Create tika-example module
[ https://issues.apache.org/jira/browse/TIKA-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1390: Fix Version/s: (was: 1.12) 1.13 > Create tika-example module > -- > > Key: TIKA-1390 > URL: https://issues.apache.org/jira/browse/TIKA-1390 > Project: Tika > Issue Type: Bug > Components: example >Reporter: Tyler Palsulich > Fix For: 1.13 > > > This issue will track the initial creation of the tika-example module. > Subtasks will be used for the first few examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1816) Lenient testing for NamedEntityParser
[ https://issues.apache.org/jira/browse/TIKA-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved TIKA-1816. - Resolution: Fixed -fixed > Lenient testing for NamedEntityParser > - > > Key: TIKA-1816 > URL: https://issues.apache.org/jira/browse/TIKA-1816 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Thamme Gowda N >Assignee: Tim Allison > Labels: memex > Fix For: 1.12 > > Attachments: TIKA-1816-proxy-fix.patch > > > NamedEntityParser has hard setup requirements, such as downloading NER models > from remote servers and adding them to the classpath. > These model files are huge and hence are not added to source control. > So, the tests are most likely to fail in various environments. > Make the best effort to set up the tests, but in the worst case skip tests > instead of failing the whole build process. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
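A common JUnit 4 pattern for this kind of leniency, shown as an illustrative sketch (the resource path and test name are made up): use an assumption so the test is skipped, rather than failed, when the model has not been downloaded.
{code}
import static org.junit.Assume.assumeTrue;

import org.junit.Test;

public class NamedEntityParserLenientTest {

    @Test
    public void testNerExtraction() throws Exception {
        // Skip (rather than fail) when the NER model is not on the classpath.
        assumeTrue("NER model not available; skipping test",
                getClass().getResource("/org/apache/tika/parser/ner/en-ner-person.bin") != null);
        // ... exercise the NamedEntityParser here ...
    }
}
{code}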
[jira] [Resolved] (TIKA-1840) No way to link slide notes to slide in PPT output.
[ https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved TIKA-1840. - Resolution: Fixed Committed in master and sync'ed to Github. Since we are rolling with 1.12 and this is super close, I figured we can merge it and improve iteratively. [~gagravarr]. Thanks Sam! {noformat} [chipotle:~/tmp/tika1.12] mattmann% git merge TIKA-1840 Updating efb645e..1bc6176 Fast-forward CHANGES.txt | 3 +++ tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java | 14 -- 2 files changed, 15 insertions(+), 2 deletions(-) [chipotle:~/tmp/tika1.12] mattmann% git push -u origin master Counting objects: 94, done. Delta compression using up to 4 threads. Compressing objects: 100% (21/21), done. Writing objects: 100% (29/29), 2.44 KiB | 0 bytes/s, done. Total 29 (delta 11), reused 0 (delta 0) To https://git-wip-us.apache.org/repos/asf/tika.git efb645e..1bc6176 master -> master Branch master set up to track remote branch master from origin. [chipotle:~/tmp/tika1.12] mattmann% {noformat} > No way to link slide notes to slide in PPT output. > -- > > Key: TIKA-1840 > URL: https://issues.apache.org/jira/browse/TIKA-1840 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.11 >Reporter: Sam H >Assignee: Chris A. Mattmann > Fix For: 1.12 > > > I'm integrating Apache Tika into my project, and I want to extract (text) > information from Powerpoint slides. Both PPT and PPTX > I've noticed when using PPT format, the slide notes are all aggregated at the > end of the XML output, and there is no way to identify which note belongs to > which slide. > I began looking at the code and found the following: > {code} > // TODO Find the Notes for this slide and extract inline > {code} > in > [HSLFExtractor.java|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java] > on line 140 > I would like to implement this part and contribute -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] tika pull request: fix for TIKA-1840 contributed by zetisam
Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/72 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (TIKA-1840) No way to link slide notes to slide in PPT output.
[ https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114576#comment-15114576 ] ASF GitHub Bot commented on TIKA-1840: -- Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/72 > No way to link slide notes to slide in PPT output. > -- > > Key: TIKA-1840 > URL: https://issues.apache.org/jira/browse/TIKA-1840 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.11 >Reporter: Sam H >Assignee: Chris A. Mattmann > Fix For: 1.12 > > > I'm integrating Apache Tika into my project, and I want to extract (text) > information from Powerpoint slides. Both PPT and PPTX > I've noticed when using PPT format, the slide notes are all aggregated at the > end of the XML output, and there is no way to identify which note belongs to > which slide. > I began looking at the code and found the following: > {code} > // TODO Find the Notes for this slide and extract inline > {code} > in > [HSLFExtractor.java|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java] > on line 140 > I would like to implement this part and contribute -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1840) No way to link slide notes to slide in PPT output.
[ https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1840: Fix Version/s: 1.12 > No way to link slide notes to slide in PPT output. > -- > > Key: TIKA-1840 > URL: https://issues.apache.org/jira/browse/TIKA-1840 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.11 >Reporter: Sam H >Assignee: Chris A. Mattmann > Fix For: 1.12 > > > I'm integrating Apache Tika into my project, and I want to extract (text) > information from Powerpoint slides. Both PPT and PPTX > I've noticed when using PPT format, the slide notes are all aggregated at the > end of the XML output, and there is no way to identify which note belongs to > which slide. > I began looking at the code and found the following: > {code} > // TODO Find the Notes for this slide and extract inline > {code} > in > [HSLFExtractor.java|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java] > on line 140 > I would like to implement this part and contribute -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (TIKA-1840) No way to link slide notes to slide in PPT output.
[ https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned TIKA-1840: --- Assignee: Chris A. Mattmann > No way to link slide notes to slide in PPT output. > -- > > Key: TIKA-1840 > URL: https://issues.apache.org/jira/browse/TIKA-1840 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.11 >Reporter: Sam H >Assignee: Chris A. Mattmann > > I'm integrating Apache Tika into my project, and I want to extract (text) > information from Powerpoint slides. Both PPT and PPTX > I've noticed when using PPT format, the slide notes are all aggregated at the > end of the XML output, and there is no way to identify which note belongs to > which slide. > I began looking at the code and found the following: > {code} > // TODO Find the Notes for this slide and extract inline > {code} > in > [HSLFExtractor.java|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java] > on line 140 > I would like to implement this part and contribute -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] tika pull request: Code Quality Fix for Findbugs Rule Impossible C...
Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/73 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---