[jira] [Updated] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1367:

Fix Version/s: (was: 1.15)
   1.16

> Tika documentation should list tika-parsers parser dependencies
> ---
>
> Key: TIKA-1367
> URL: https://issues.apache.org/jira/browse/TIKA-1367
> Project: Tika
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sergey Beryozkin
> Fix For: 1.16
>
>
> The tika-parsers module has many strong transitive parser dependencies. Maven 
> users of tika-parsers have to exclude all the transitive dependencies 
> manually. Documenting the list of the existing transitive dependencies and 
> keeping the list up to date will help developers exclude the libraries not 
> needed for a given project.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (TIKA-1436) improvement to PDFParser

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1436:

Fix Version/s: (was: 1.15)
   1.16

> improvement to PDFParser
> 
>
> Key: TIKA-1436
> URL: https://issues.apache.org/jira/browse/TIKA-1436
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Stefano Fornari
>  Labels: parser, pdf
> Fix For: 1.16
>
> Attachments: 
> 0001-Improvment-as-described-in-https-issues.apache.org-j.patch, 
> ste-20140927.patch
>
>
> With regard to the thread "[PDFParser] - read limited number of characters" 
> on Mar 29, I would like to propose the attached patch. I noticed that in Tika 
> 1.6 there has been some work on better handling of the 
> WriteLimitReachedException condition, but I believe it could be improved 
> further.
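
For readers unfamiliar with the write-limit idea discussed here, the following is a minimal, self-contained sketch. It is not the attached patch and does not use Tika's own classes: a hypothetical `LimitedWriter` accepts characters up to a limit, keeps what fits, and then signals that the limit was reached so the caller can stop and still use the partial text. All class and exception names are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.Writer;

// Hypothetical illustration only: not Tika's WriteLimitReachedException handling.
public class LimitedWriterDemo {

    /** Thrown once the configured character limit has been reached. */
    static class LimitReachedException extends IOException {
        LimitReachedException(int limit) {
            super("write limit of " + limit + " characters reached");
        }
    }

    /** Collects at most `limit` characters, then refuses further input. */
    static class LimitedWriter extends Writer {
        private final StringBuilder buffer = new StringBuilder();
        private final int limit;

        LimitedWriter(int limit) {
            this.limit = limit;
        }

        @Override
        public void write(char[] cbuf, int off, int len) throws IOException {
            int remaining = limit - buffer.length();
            if (len > remaining) {
                // Keep the part that still fits, then signal the limit.
                buffer.append(cbuf, off, remaining);
                throw new LimitReachedException(limit);
            }
            buffer.append(cbuf, off, len);
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }

        String text() {
            return buffer.toString();
        }
    }

    /** Feeds chunks to the writer and returns whatever was collected. */
    static String readUpTo(int limit, String... chunks) {
        LimitedWriter out = new LimitedWriter(limit);
        try {
            for (String chunk : chunks) {
                out.write(chunk);
            }
        } catch (LimitReachedException e) {
            // Expected: the caller only wanted the first `limit` characters.
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return out.text();
    }

    public static void main(String[] args) {
        System.out.println(readUpTo(10, "Hello, ", "world of PDFs"));
    }
}
```

The design point is that hitting the limit is treated as a normal outcome: the partial text stays available instead of being lost with the exception.
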





[jira] [Updated] (TIKA-1705) Update ASM dependency to 5.0.4

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1705:

Fix Version/s: (was: 1.15)
   1.16

> Update ASM dependency to 5.0.4
> --
>
> Key: TIKA-1705
> URL: https://issues.apache.org/jira/browse/TIKA-1705
> Project: Tika
>  Issue Type: Task
>Affects Versions: 1.7
>Reporter: Uwe Schindler
>Assignee: Dave Meikle
> Fix For: 1.16
>
> Attachments: TIKA-1705-2.patch, TIKA-1705.patch
>
>
> Currently the Class file parser uses ASM 4.1. This older version cannot read 
> Java 8 / Java 9 class files (it fails with an exception).
> The upgrade to ASM 5.0.4 is very simple, just a Maven dependency change. The 
> only code change is to update the visitor version so that new Java 8 
> features like lambdas get reported; this is not strictly required, but should 
> be done for full support.
> FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 
> 5, too.
> You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no 
> problem with Lucene using a newer version). Since ASM 4.x, updates are 
> easier (no visitor interfaces anymore, abstract classes instead), so nothing 
> breaks if you just replace the JAR file. So just see this as a 
> recommendation, not urgent! Solr/Lucene will also work without this patch 
> (it just replaces the shipped ASM with a newer version in our packaging).
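
As background for the version problem described above: every class file starts with the magic number 0xCAFEBABE followed by a minor and a major version, and Java 8 class files carry major version 52 (Java 7 uses 51, Java 9 uses 53). The sketch below reads that header using only the JDK. It does not use ASM, and the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;

// Illustrative sketch: reads the class-file header without ASM.
public class ClassVersionDemo {

    /** Returns the major version stored in a class file's first 8 bytes. */
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("too short for a class-file header");
        }
        // ByteBuffer is big-endian by default, matching the class-file format.
        ByteBuffer buf = ByteBuffer.wrap(classBytes);
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort();                 // minor version, ignored here
        return buf.getShort() & 0xFFFF; // major version as an unsigned value
    }

    public static void main(String[] args) {
        // Hand-built header of a Java 8 class file (major version 52).
        byte[] java8Header = {
            (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52
        };
        System.out.println("major version: " + majorVersion(java8Header));
    }
}
```
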





[jira] [Updated] (TIKA-987) Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-987:
---
Fix Version/s: (was: 1.15)
   1.16

> Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
> 
>
> Key: TIKA-987
> URL: https://issues.apache.org/jira/browse/TIKA-987
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.16
>
> Attachments: picture_3.doc, picture.doc
>
>
> I have two Word docs, both containing the same drawing, but one has
> text added.
> In one case (picture.doc) the extraction is correct: it contains only
> an embedded image.wmf; when I view the image it's correct.
> In the second case (picture_3.doc) the picture is extracted as image
> (no extension), and is 0 bytes, and there is an invalid character
> (mapped to unicode replacement char) inserted before the image:
> {noformat}
> 
> 
> �
> 
> 
> vehicle
> 
> {noformat}
> (Though, the text "vehicle" is extracted correctly).
> I dug a bit, and with the 2nd doc there is an embedded {SHAPE *
> MERGEFORMAT} field, which we invoke
> WordExtractor.handleSpecialCharacterRuns on, and somehow it extracts
> the 0-byte no-extension image as well as the invalid character.  With
> the first doc there is no field (at least not one that's handled with
> handleSpecialCharacterRuns...).  Otherwise I'm not sure how to
> fix... it could be something is going wrong in how POI parses the
> Pictures from PictureSource.





[jira] [Updated] (TIKA-1366) Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1366:

Fix Version/s: (was: 1.15)
   1.16

> Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse 
> 
>
> Key: TIKA-1366
> URL: https://issues.apache.org/jira/browse/TIKA-1366
> Project: Tika
>  Issue Type: Improvement
>  Components: server
>Reporter: Sergey Beryozkin
>Priority: Minor
> Fix For: 1.16
>
>
> Some of Tika Server services will benefit from optionally supporting JAX-RS 
> 2.0 AsyncResponse





[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1059:

Fix Version/s: (was: 1.15)
   1.16

> Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
> --
>
> Key: TIKA-1059
> URL: https://issues.apache.org/jira/browse/TIKA-1059
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Ray Gauss II
> Fix For: 1.16
>
>
> The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch 
> {{InterruptedException}} and ignore it.
> The methods should either call {{interrupt()}} on the current thread or 
> re-throw the exception, possibly wrapped in a {{TikaException}}.
> See TIKA-775 for a previous discussion.
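
The two options named above can be sketched with plain JDK code. This illustrates the general pattern and is not the ExternalParser source: `waitQuietly` restores the thread's interrupt flag, while `waitOrThrow` also rethrows, wrapped in a stand-in `WrappedException` (a hypothetical placeholder for `TikaException`, so the example compiles without Tika on the classpath).

```java
// Sketch of the two handling options; names are hypothetical.
public class InterruptHandlingDemo {

    /** Hypothetical stand-in for TikaException. */
    static class WrappedException extends Exception {
        WrappedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Option 1: restore the interrupt flag so callers can still see it. */
    static void waitQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Option 2: restore the flag and rethrow, wrapped. */
    static void waitOrThrow(long millis) throws WrappedException {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new WrappedException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt(); // simulate an interrupt request
        waitQuietly(1000);                  // sleep aborts immediately
        // Thread.interrupted() reads and clears the flag.
        System.out.println("flag preserved: " + Thread.interrupted());
    }
}
```

Either way the interruption stays visible to the caller, which is the point of the request: swallowing the exception silently hides a cancellation signal.
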





[jira] [Updated] (TIKA-776) ExifTool Embedder

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-776:
---
Fix Version/s: (was: 1.15)
   1.16

> ExifTool Embedder
> -
>
> Key: TIKA-776
> URL: https://issues.apache.org/jira/browse/TIKA-776
> Project: Tika
>  Issue Type: New Feature
>  Components: metadata
>Affects Versions: 1.0
> Environment: ExifTool is required 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>Assignee: Chris A. Mattmann
>  Labels: embed, exiftool, patch
> Fix For: 1.16
>
> Attachments: tika-parsers-exiftool-embed-patch.txt
>
>
> This patch adds an ExifTool ExternalEmbedder which builds upon the work in 
> issues TIKA-774 and TIKA-775.
> In tika-parsers, an ExiftoolExternalEmbedder is added which extends 
> ExternalEmbedder to programmatically create an Embedder that calls the 
> ExifTool command line to embed Tika metadata into a file stream. An 
> ExiftoolExternalEmbedderTest unit test is also added; it embeds several IPTC 
> and XMP fields, then parses the resulting file stream to verify the operation.





[jira] [Updated] (TIKA-1395) Create embedded image extraction example

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1395:

Fix Version/s: (was: 1.15)
   1.16

> Create embedded image extraction example
> 
>
> Key: TIKA-1395
> URL: https://issues.apache.org/jira/browse/TIKA-1395
> Project: Tika
>  Issue Type: Sub-task
>  Components: example
>Reporter: Tyler Palsulich
>Priority: Minor
> Fix For: 1.16
>
>
> Create an example of how to do embedded image extraction and parsing.





[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-891:
---
Fix Version/s: (was: 1.15)
   1.16

> Use POST in addition to PUT on method calls in tika-server
> --
>
> Key: TIKA-891
> URL: https://issues.apache.org/jira/browse/TIKA-891
> Project: Tika
>  Issue Type: Improvement
>  Components: general
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Trivial
>  Labels: newbie
> Fix For: 1.16
>
>
> Per Jukka's email:
> http://s.apache.org/uR
> It would be a better use of REST/HTTP "verbs" to use POST to put content to a 
> resource where we don't intend to store that content (which is the 
> implication of PUT). Max suggested adding:
> {code}
> @POST
> {code}
> annotations to the methods we are currently exposing using PUT to take care 
> of this. 





[jira] [Updated] (TIKA-1456) Visual Sentiment API parser

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1456:

Fix Version/s: (was: 1.15)
   1.16

> Visual Sentiment API parser
> ---
>
> Key: TIKA-1456
> URL: https://issues.apache.org/jira/browse/TIKA-1456
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>  Labels: gsoc, gsoc2016
> Fix For: 1.16
>
>
> Integrate the Visual Sentibank API as a parser for images. We can use 
> Aperture from CMU, it's released under the MIT license:
> https://github.com/d8w/aperture





[jira] [Updated] (TIKA-1688) Tika Version in Metadata

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1688:

Fix Version/s: (was: 1.15)
   1.16

> Tika Version in Metadata
> 
>
> Key: TIKA-1688
> URL: https://issues.apache.org/jira/browse/TIKA-1688
> Project: Tika
>  Issue Type: Improvement
>Reporter: Paul Ramirez
>Priority: Minor
> Fix For: 1.16
>
>
> Could this be added as X-Tika:version? That way, downstream consumers would 
> be able to trace an extraction back to the Tika version that produced it.





[jira] [Updated] (TIKA-1540) New Tika plugin for image based feature extraction using computer vision techniques

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1540:

Fix Version/s: (was: 1.15)
   1.16

> New Tika plugin for image based feature extraction using computer vision 
> techniques
> ---
>
> Key: TIKA-1540
> URL: https://issues.apache.org/jira/browse/TIKA-1540
> Project: Tika
>  Issue Type: New Feature
> Environment: cross platform
>Reporter: Aashish Chaudhary
>Assignee: Lewis John McGibbney
>  Labels: gsoc2015
> Fix For: 1.16
>
> Attachments: TIKA-vision.achaudhary.150209.patch.txt
>
>
> This will be a web-service client based parser to perform image feature 
> extraction using Computer Vision techniques. 





[jira] [Updated] (TIKA-1276) Missing embedded dependencies in tika-bundle

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1276:

Fix Version/s: (was: 1.15)
   1.16

> Missing embedded dependencies in tika-bundle
> 
>
> Key: TIKA-1276
> URL: https://issues.apache.org/jira/browse/TIKA-1276
> Project: Tika
>  Issue Type: Bug
>  Components: packaging
>Affects Versions: 1.5
> Environment: OSGI, Apache Felix via Apache Sling Launcher
>Reporter: Rupert Westenthaler
> Fix For: 1.16
>
> Attachments: TIKA-1276_20140423_rwesten.diff, 
> TIKA-1276_20140428_2_rwesten.diff, TIKA-1276_20140428_3_rwesten.diff, 
> TIKA-1276_20140428_rwesten.diff
>
>
> While updating from Tika 1.2 to 1.5 I noticed that the 
> `org.apache.tika:tika-bundle:1.5` module has some missing dependencies.
> 1. `com.uwyn:jhighlight:1.0` is not embedded
> Because of that, installing the bundle results in the following exception
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
> [103.0] osgi.wiring.package; 
> (osgi.wiring.package=com.uwyn.jhighlight.renderer))
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [103]: Unable to resolve 103.0: missing requirement 
> [103.0] osgi.wiring.package; 
> (osgi.wiring.package=com.uwyn.jhighlight.renderer)
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> 2. `org.ow2.asm:asm:4.1` is not embedded because 
> `org.apache.tika:tika-core:1.5` uses `org.ow2.asm-debug-all:asm:4.1` and 
> therefore the `Embed-Dependency` directive `asm` does not match any 
> dependency. 
> Because of that one gets the following exception (after fixing (1))
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; 
> (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; 
> (&(osgi.wiring.package=org.objectweb.asm)(version>=4.1.0)(!(version>=5.0.0)))
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> There are two possibilities to fix this: (a) change the `Embed-Dependency` to 
> `asm-debug-all`, or (b) add a dependency on `org.ow2.asm:asm:4.1` to the 
> tika-bundle pom file.
> 3. `edu.ucar:netcdf:4.2-min` is not embedded
> Because of that one gets the following exception (after fixing (1) and 
> (2))
> {code}
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2))
> org.osgi.framework.BundleException: Unresolved constraint in bundle 
> org.apache.tika.bundle [96]: Unable to resolve 96.0: missing requirement 
> [96.0] osgi.wiring.package; (osgi.wiring.package=ucar.ma2)
>   at 
> org.apache.felix.framework.Felix.resolveBundleRevision(Felix.java:3962)
>   at org.apache.felix.framework.Felix.startBundle(Felix.java:2025)
>   at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1279)
>   at 
> org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:304)
>   at java.lang.Thread.run(Thread.java:744)
> {code}
> 4. The `com.adobe.xmp:xmpcore:5.1.2` dependency is required at runtime
> After fixing the above issues the tika-bundle started successfully. 
> However, when extracting EXIF metadata from a JPEG image I got the following 
> exception.
> {code}
> java.lang.NoClassDefFoundError: com/adobe/xmp/XMPException
>   at 
> com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
>   at 
> com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
>   at 
> org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
>

[jira] [Updated] (TIKA-2338) Change Scope of Jai-ImageIO-Core dependency

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2338:

Fix Version/s: (was: 1.15)
   1.16

> Change Scope of Jai-ImageIO-Core dependency
> ---
>
> Key: TIKA-2338
> URL: https://issues.apache.org/jira/browse/TIKA-2338
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Luis Filipe Nassif
> Fix For: 1.16
>
>
> Looks like jai-imageio-core from GitHub 
> (https://github.com/jai-imageio/jai-imageio-core), which we depend on with 
> test scope, is Apache compatible.
> Note that it is a fork of the original JAI project, which is referenced by 
> PDFBox. The GitHub fork has extracted JPEG 2000 and other code with license 
> issues into a different project.
> Let's remove test scope from the jai-imageio-core dependency, so we will 
> provide support for TIFF and other image formats (except JPEG 2000) out of 
> the box.





[jira] [Updated] (TIKA-1724) Create parser for .obo file format.

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1724:

Fix Version/s: (was: 1.15)
   1.16

> Create parser for .obo file format.
> ---
>
> Key: TIKA-1724
> URL: https://issues.apache.org/jira/browse/TIKA-1724
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.16
>
> Attachments: TIKA-1724.patch, TIKA-1724.patch
>
>
> This parser implementation caters for files of the [OBO Flat File Format 
> Guide, version 1.4|http://purl.obolibrary.org/obo/oboformat/spec.html] 
> MimeType.
> The OBO format is the text file format used by OBO-Edit, the open source, 
> platform-independent application for viewing and editing ontologies. This 
> file format is used heavily within the clinical and biomedical fields as a 
> particular flat file serialization for ontologies. .obo files are 'typically' 
> accompanied by corresponding .owl serializations as this is also another file 
> format used pervasively within the clinical and biomedical fields.
> I would sincerely appreciate code review. Thanks.





[jira] [Updated] (TIKA-1301) Establish TikaServer on Apache hosted VM

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1301:

Fix Version/s: (was: 1.15)
   1.16

> Establish TikaServer on Apache hosted VM
> 
>
> Key: TIKA-1301
> URL: https://issues.apache.org/jira/browse/TIKA-1301
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.16
>
>
> Over in Any23, Infra recently provisioned us with a nice shiny new VM to run 
> our service on
> http://any23.org
> I would like to do the same for Tika. I have some scripts on the Any23 VM 
> which will pull stable nightly tika-server snapshots and deploy them to the 
> VM. This is really nice for both devs and users alike.





[jira] [Updated] (TIKA-2312) [Mp3Parser] expose fields form ID3TagsAndAudio

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2312:

Fix Version/s: (was: 1.15)
   1.16

> [Mp3Parser] expose fields form ID3TagsAndAudio 
> ---
>
> Key: TIKA-2312
> URL: https://issues.apache.org/jira/browse/TIKA-2312
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Łukasz Ozimek
>Priority: Trivial
>  Labels: beginner, easyfix
> Fix For: 1.16
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> Hi,
> First of all, this is my first issue in the ASF JIRA, so sorry for any mistakes.
> Currently I am working on some custom parsers for MP3 files. The reason I 
> would like to have access to the fields in this class is that the system whose 
> data I am transforming depends on the availability of particular versions of 
> ID3 tags, and this class would easily allow me to check that. 
> Moreover, in the current code base the Mp3Parser exposes the method 
> {code}
>  protected static ID3TagsAndAudio getAllTagHandlers(InputStream stream, 
> ContentHandler handler)
>throws IOException, SAXException, TikaException {
> }
> {code}
> and returns an object which has no accessible fields. That seems strange to me.
> Is there a reason for that?
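
What the reporter is asking for can be pictured as a result object with read-only accessors. The sketch below is generic JDK code with invented names (`TagsResult`, `getTagVersions`, `hasVersion`). It is not the Mp3Parser source and makes no claim about which fields `ID3TagsAndAudio` actually holds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical names throughout; this is not the Mp3Parser source.
public class TagHolderDemo {

    /** Result object whose contents are readable through accessors. */
    static final class TagsResult {
        private final List<String> tagVersions;
        private final float durationMillis;

        TagsResult(List<String> tagVersions, float durationMillis) {
            // Defensive copy so later changes by the caller do not leak in.
            this.tagVersions =
                Collections.unmodifiableList(new ArrayList<>(tagVersions));
            this.durationMillis = durationMillis;
        }

        public List<String> getTagVersions() {
            return tagVersions;
        }

        public float getDurationMillis() {
            return durationMillis;
        }

        /** Lets a caller check for one particular tag version. */
        public boolean hasVersion(String version) {
            return tagVersions.contains(version);
        }
    }

    public static void main(String[] args) {
        TagsResult result =
            new TagsResult(Arrays.asList("ID3v2.3", "ID3v1"), 1000f);
        System.out.println(result.hasVersion("ID3v2.3"));
    }
}
```
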





[jira] [Updated] (TIKA-1518) Docker with Tika Server

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1518:

Fix Version/s: (was: 1.15)
   1.16

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
> Fix For: 1.16
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Updated] (TIKA-1953) tika-server NullPointerException while processing rtfs

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1953:

Fix Version/s: (was: 1.15)
   1.16

> tika-server NullPointerException while processing rtfs
> --
>
> Key: TIKA-1953
> URL: https://issues.apache.org/jira/browse/TIKA-1953
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.12
> Environment: Python 2.7.11 :: Anaconda 4.0.0 (64-bit)
> Red Hat Enterprise Linux Server release 6.7 (Santiago)
> java version "1.7.0_95"
> OpenJDK Runtime Environment (rhel-2.6.4.0.el6_7-x86_64 u95-b00)
> OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
>Reporter: Ravi
>Assignee: Tim Allison
>  Labels: newbie, rtf, tika-python, tika-server, xmlContent,
> Fix For: 1.16
>
> Attachments: officeinstallations3.rtf
>
>
> It looks like the xmlContent=True flag causes the warning "tika.py: Warn: Tika 
> server returned status: 422".
> I start the Tika server and then run the following code in the Python 
> interpreter started from bash:
> import tika
> from tika import parser
> parsed = parser.from_file('/path/to/file.rtf', 'http://localhost:9003', xmlContent=True)
> I get: tika.py: Warn: Tika server returned status: 422
> Looking at the tika-server log I get the following dump.
> Note: The parser seems to work fine without the xmlContent=True flag set. I 
> get the right output, but setting this flag causes the NullPointerException 
> below.
> --
> Apr 15, 2016 2:36:55 PM org.apache.tika.server.resource.TikaResource 
> logRequest
> INFO: rmeta/xml (autodetecting type)
> Apr 15, 2016 2:36:55 PM org.apache.tika.server.resource.TikaResource parse
> WARNING: rmeta/xml: Text extraction failed
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.rtf.RTFParser@21f0dbb9
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> at 
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:177)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
> at 
> org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:281)
> at 
> org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:138)
> at 
> org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:119)
> at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
> at 
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
> at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
> at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
> at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
> at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
> at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
> at 
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
> at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
> at 
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:370)
> at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
> at 
> org.eclipse.j

[jira] [Updated] (TIKA-774) ExifTool Parser

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-774:
---
Fix Version/s: (was: 1.15)
   1.16

> ExifTool Parser
> ---
>
> Key: TIKA-774
> URL: https://issues.apache.org/jira/browse/TIKA-774
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.0
> Environment: Requires ExifTool to be installed 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>Assignee: Chris A. Mattmann
>  Labels: features, new-parser, newbie, patch
> Fix For: 1.16
>
> Attachments: testJPEG_IPTC_EXT.jpg, 
> tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt
>
>
> Adds an external parser that calls ExifTool to extract extended metadata 
> fields from images and other content types.
> In the core project:
> An ExifTool interface is added which contains Property objects that define 
> the metadata fields available.
> An additional Property constructor is added for the internalTextBag type.
> In the parsers project:
> An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
> on the command line and mapping the response to tika metadata fields.  This 
> extractor could be called instead of or in addition to the existing 
> ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
> JpegParser but those have not been changed at this time.
> An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
> An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
> metadata fields to existing tika and Drew Noakes metadata fields if enabled.
> An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
> implementations in XML files.
> An ExifToolParserTest is added which tests several expected XMP and IPTC 
> metadata values in testJPEG_IPTC_EXT.jpg.
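
The mapping step described above (turning ExifTool's command-line response into metadata fields) can be pictured with a small sketch. It assumes ExifTool's default text output of one `Tag Name : value` pair per line and collects repeated tags into multi-valued entries. The class and method names are invented here, and the attached patch may well work differently (for example, by parsing XML output).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the ExiftoolMetadataExtractor from the patch.
public class ExifOutputDemo {

    /** Parses "Tag Name : value" lines into an ordered multi-valued map. */
    static Map<String, List<String>> parse(String output) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            // Split on the first colon only: values such as dates contain colons.
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a tag line
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (name.isEmpty()) {
                continue;
            }
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "Keywords      : sunset\n"
                      + "Keywords      : beach\n"
                      + "Create Date   : 2012:01:02 03:04:05\n";
        System.out.println(parse(sample));
    }
}
```
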





[jira] [Updated] (TIKA-1425) Automatic batching of Microsoft service calls

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1425:

Fix Version/s: (was: 1.15)
   1.16

> Automatic batching of Microsoft service calls
> -
>
> Key: TIKA-1425
> URL: https://issues.apache.org/jira/browse/TIKA-1425
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Affects Versions: 1.6
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.16
>
>
> Right now when I use the following code I get the stack trace at the bottom 
> of this description. This seems to be because the Request URI is too large to 
> make the service request. We need a mechanism within the call to 
> Tika.translate which will, on a service-by-service basis, determine the 
> maximum Request URI which can be sent. I believe that this should be on the 
> Tika side, as how else am I meant to know the maximum request size?
> {code:title=translator.java|borderStyle=solid}
> +Translator translate = new MicrosoftTranslator();
> +((MicrosoftTranslator) translate).setId("...");
> +((MicrosoftTranslator) translate).setSecret("...");
>  for (java.util.Map.Entry entry : parseResult) {
>Parse parse = entry.getValue();
>LOG.info("-\nUrl\n---\n");
> @@ -201,7 +207,7 @@
>System.out.print(parse.getData().toString());
>if (dumpText) {
>  LOG.info("-\nParseText\n-\n");
> -System.out.print(parse.getText());
> +System.out.print(translate.translate(parse.getText(), "fr"));
>}
> {code}
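
One way to picture the batching requested above is a helper that splits the input into pieces under a per-service limit, translates each piece, and joins the results. This is a sketch under stated assumptions, not Tika's implementation: the limit used in `main` is arbitrary, `translateAll` and its `Function` parameter are hypothetical stand-ins for a real translator call, and a real limit would have to account for URL encoding, which can expand each character several times.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of request batching; not part of Tika.
public class BatchingDemo {

    /** Splits text into chunks of at most maxChars, breaking at spaces when possible. */
    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            if (end < text.length()) {
                // Look for the last space inside the window to avoid cutting a word.
                int space = text.lastIndexOf(' ', end - 1);
                if (space > start) {
                    end = space + 1; // keep the space with the preceding chunk
                }
            }
            chunks.add(text.substring(start, end));
            start = end;
        }
        return chunks;
    }

    /** Applies the translator to each chunk and concatenates the results. */
    static String translateAll(String text, int maxChars,
                               Function<String, String> translator) {
        StringBuilder out = new StringBuilder();
        for (String chunk : split(text, maxChars)) {
            out.append(translator.apply(chunk));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Upper-casing stands in for a real translation service call.
        System.out.println(translateAll("aaa bbb ccc", 5, String::toUpperCase));
    }
}
```

Splitting at spaces keeps words intact, but a production version would more likely split at sentence boundaries, since translating fragments in isolation can change the result.
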
> {code:title=stacktrace.log|borderStyle=solid}
> Exception in thread "main" java.lang.Exception: [microsoft-translator-api] 
> Error retrieving translation : Server returned HTTP response code: 414 for 
> URL: 
> http://api.microsofttranslator.com/V2/Ajax.svc/Translate?&from=&to=fr&text=%D0%A4%D0...
> ...
>   at 
> com.memetix.mst.MicrosoftTranslatorAPI.retrieveString(MicrosoftTranslatorAPI.java:202)
>   at com.memetix.mst.translate.Translate.execute(Translate.java:61)
>   at com.memetix.mst.translate.Translate.execute(Translate.java:76)
>   at 
> org.apache.tika.language.translate.MicrosoftTranslator.translate(MicrosoftTranslator.java:104)
>   at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:210)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:228)
> Caused by: java.io.IOException: Server returned HTTP response code: 414 for 
> URL: 
> http://api.microsofttranslator.com/V2/Ajax.svc/Translate?&from=&to=fr&text=%D0%A4%D0%BE%D1%80%D1%83%D0%B...
> ...
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1675)
>   at 
> sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1673)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1671)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1244)
>   at 
> com.memetix.mst.MicrosoftTranslatorAPI.retrieveResponse(MicrosoftTranslatorAPI.java:178)
>   at 
> com.memetix.mst.MicrosoftTranslatorAPI.retrieveString(MicrosoftTranslatorAPI.java:199)
>   ... 6 more
> Caused by: java.io.IOException: Server returned HTTP response code: 414 for 
> URL: 
> http://api.microsofttranslator.com/V2/Ajax.svc/Translate?&from=&to=fr&text=%D0%A4%D0%BE...
> ...
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626)
>   at 
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
>   at 
> com.memetix.mst.MicrosoftTranslatorAPI.retrieveResponse(MicrosoftTranslatorAPI.java:177)
>   ... 7 more
> {code}
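
A minimal sketch of the kind of batching asked for above, using only the JDK. The class name, method names and the `maxEncodedLength` parameter are hypothetical; the issue does not state the real limit for the Microsoft service, so the caller would have to supply it. The text is split on whitespace where possible, and the length is measured after URL encoding because that is what counts against the Request URI.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of Tika or of the Microsoft translator client. */
final class TranslationBatcher {

    /** Length of the text once URL-encoded as UTF-8. */
    static int encodedLength(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8").length();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /**
     * Splits text into chunks whose URL-encoded length is at most maxEncodedLength.
     * Chunks break after whitespace where possible; a single word that is too long
     * is broken between code points. Concatenating the chunks yields the input.
     */
    static List<String> split(String text, int maxEncodedLength) {
        if (maxEncodedLength < 12) {
            // One code point can encode to 4 bytes, i.e. 12 characters of "%XX".
            throw new IllegalArgumentException("maxEncodedLength must be at least 12");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentLength = 0;
        // The lookbehind keeps the trailing whitespace attached to each word.
        for (String word : text.split("(?<=\\s)")) {
            int wordLength = encodedLength(word);
            if (wordLength <= maxEncodedLength) {
                if (currentLength + wordLength > maxEncodedLength) {
                    chunks.add(current.toString());
                    current.setLength(0);
                    currentLength = 0;
                }
                current.append(word);
                currentLength += wordLength;
                continue;
            }
            // Oversized word: fall back to splitting between code points.
            for (int i = 0; i < word.length(); ) {
                int codePoint = word.codePointAt(i);
                String piece = new String(Character.toChars(codePoint));
                int pieceLength = encodedLength(piece);
                if (currentLength + pieceLength > maxEncodedLength) {
                    chunks.add(current.toString());
                    current.setLength(0);
                    currentLength = 0;
                }
                current.append(piece);
                currentLength += pieceLength;
                i += Character.charCount(codePoint);
            }
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Each chunk could then be passed to translate.translate(chunk, "fr") as in the snippet above and the results concatenated. Splitting mid-sentence may affect translation quality, so a sentence-aware splitter might be preferable in practice.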



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (TIKA-1417) Create Extract Embedded Images from PDFs Example

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1417:

Fix Version/s: (was: 1.15)
   1.16

> Create Extract Embedded Images from PDFs Example
> 
>
> Key: TIKA-1417
> URL: https://issues.apache.org/jira/browse/TIKA-1417
> Project: Tika
>  Issue Type: Improvement
>  Components: example
>Reporter: Tyler Palsulich
>Priority: Minor
> Fix For: 1.16
>
>
> Users commonly want to "turn on" extraction of images embedded in PDFs (e.g. 
> TIKA-1414). Tika has the capability, but it's not clear how to use it.





[jira] [Updated] (TIKA-1674) Add example to show how to extract embedded files

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1674:

Fix Version/s: (was: 1.15)
   1.16

> Add example to show how to extract embedded files
> -
>
> Key: TIKA-1674
> URL: https://issues.apache.org/jira/browse/TIKA-1674
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.16
>
>
> On tika-user, we received a question on how to extract embedded files.  Let's 
> add an example.





[jira] [Updated] (TIKA-2346) Allow Office format parsers to exclude parsing shapes

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2346:

Fix Version/s: (was: 1.15)
   1.16

> Allow Office format parsers to exclude parsing shapes
> -
>
> Key: TIKA-2346
> URL: https://issues.apache.org/jira/browse/TIKA-2346
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Nick Burch
> Fix For: 1.16
>
>
> The Office format parsers support including or excluding of deleted text and 
> moved text. It would be good to also support something similar for 
> shape-based text, though probably not for PPT / PPTX as that's almost all 
> shape-based!
> (This has been done hackily in the Alfresco fork of Tika at  
> https://github.com/Alfresco/tika/commit/32aca3fd96816ad49b869a82c9ba0f02265f8744
>  but would be good to do properly)





[jira] [Updated] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1308:

Fix Version/s: (was: 1.15)
   1.16

> Support in memory parse mode(don't create temp file): to support run Tika in 
> GAE
> 
>
> Key: TIKA-1308
> URL: https://issues.apache.org/jira/browse/TIKA-1308
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: jefferyyuan
>  Labels: gae
> Fix For: 1.16
>
>
> I am trying to use Tika in GAE and write a simple servlet to extract meta 
> data info from jpeg:
> {code}
> String urlStr = req.getParameter("imageUrl");
> byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
> ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
> Metadata metadata = new Metadata();
> BodyContentHandler ch = new BodyContentHandler();
> AutoDetectParser parser = new AutoDetectParser();
> parser.parse(bais, ch, metadata, new ParseContext());
> bais.close();
> {code}
> This fails with an exception:
> {code}
> Caused by: java.lang.SecurityException: Unable to create temporary file
>   at java.io.File.createTempFile(File.java:1986)
>   at 
> org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
>   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
>   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242
> {code}
> I checked the code: 
> org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, 
> Metadata, ParseContext) creates a temp file from the input stream.
> I can understand why Tika creates a temp file from the stream: so that Tika 
> can parse it multiple times.
> But as GAE and other cloud servers are getting more popular, is it possible 
> to avoid creating a temp file? Instead we could copy the original stream to a 
> byte array stream, so Tika can still parse it multiple times.
> -- This would put a limit on the file size, as Tika keeps the whole file in 
> memory, but it would make Tika work in GAE and maybe on other cloud servers.
> We could add a parameter to parser.parse to indicate whether to parse in 
> memory only.
>  
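
A minimal sketch of the byte-array idea proposed above, using only the JDK. The class name and the `maxBytes` cap are hypothetical and this is not how TikaInputStream is implemented; it only illustrates buffering a stream once in memory so that it can be re-read several times without a temp file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper: buffers a stream in memory so it can be read repeatedly. */
final class InMemoryStreamBuffer {
    private final byte[] data;

    InMemoryStreamBuffer(InputStream source, int maxBytes) {
        this.data = readAll(source, maxBytes);
    }

    /** Reads the whole stream, refusing inputs larger than maxBytes. */
    private static byte[] readAll(InputStream source, int maxBytes) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = source.read(chunk)) != -1) {
                if (out.size() + read > maxBytes) {
                    throw new IllegalArgumentException("input exceeds " + maxBytes + " bytes");
                }
                out.write(chunk, 0, read);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns a fresh stream positioned at the start of the buffered data. */
    ByteArrayInputStream newStream() {
        return new ByteArrayInputStream(data);
    }

    int size() {
        return data.length;
    }
}
```

The size cap is the trade-off mentioned in the description: the whole file lives in memory, so large inputs have to be rejected or handled some other way.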





[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---
Fix Version/s: (was: 1.15)
   1.16

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.10
>Reporter: Michael McCandless
>  Labels: newbie
> Fix For: 1.16
>
> Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags.  Ie, each startElement("foo") is
> matched by the closing endElement("foo").
> I only did basic nesting test, plus checking that  is never
> embedded inside another ; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (eg closing a table
> w/o closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
> Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>   at 
> org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
> 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>   at 
> org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>   at 
> org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>   at 
> org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
> testMultipart(org.apache.tika.parse
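
For illustration, the basic nesting check described in this issue can be sketched with a stack. This is a simplified stand-in with a hypothetical class name, not the actual SafeContentHandler code; the error messages only mirror the wording seen in the failures above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a matched-tag check; not the real SafeContentHandler. */
final class TagNestingChecker {
    private final Deque<String> open = new ArrayDeque<>();

    void startElement(String name) {
        open.push(name);
    }

    /** Every end tag must close the most recently opened element. */
    void endElement(String name) {
        if (open.isEmpty()) {
            throw new IllegalStateException("end tag=" + name + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(name)) {
            throw new IllegalStateException(
                    "mismatched elements open=" + expected + " close=" + name);
        }
    }

    /** At the end of the document no element may remain open. */
    void endDocument() {
        if (!open.isEmpty()) {
            throw new IllegalStateException("unclosed element=" + open.peek());
        }
    }
}
```

Closing a table without first closing its tr, as in the Keynote failure, trips the second check.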

[jira] [Updated] (TIKA-1328) Translate Metadata and Content

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1328:

Fix Version/s: (was: 1.15)
   1.16

> Translate Metadata and Content
> --
>
> Key: TIKA-1328
> URL: https://issues.apache.org/jira/browse/TIKA-1328
> Project: Tika
>  Issue Type: New Feature
>  Components: translation
>Reporter: Tyler Palsulich
> Fix For: 1.16
>
>
> Right now, Translation is only done on Strings. Ideally, users would be able 
> to "turn on" translation while parsing. I can think of a couple options:
> - Make a TranslateAutoDetectParser. Automatically detect the file type, parse 
> it, then translate the content.
> - Make a Context switch. When true, translate the content regardless of the 
> parser used. I'm not sure the best way to go about this method, but I prefer 
> it over another Parser.
> Regardless, we need a black or white list for translation. I think black list 
> would be the way to go -- which fields should not be translated (dates, 
> versions, ...) Any ideas? Also, somewhat unrelated, does anyone know of any 
> other open source translation libraries? If we were really lucky, it wouldn't 
> depend on an online service.
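
A minimal sketch of the black list idea, using plain JDK collections. The class and method names are hypothetical, and the translator is passed in as a function so that nothing here depends on a particular Tika Translator implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: translate metadata values except black-listed fields. */
final class MetadataTranslationSketch {

    /**
     * Returns a copy of the metadata in which every value has been translated,
     * except for fields named in doNotTranslate (dates, versions, ...).
     */
    static Map<String, String> translate(Map<String, String> metadata,
                                         Set<String> doNotTranslate,
                                         UnaryOperator<String> translator) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            boolean skip = doNotTranslate.contains(entry.getKey());
            result.put(entry.getKey(),
                    skip ? entry.getValue() : translator.apply(entry.getValue()));
        }
        return result;
    }
}
```

A white list would be the same loop with the condition inverted; the black list keeps newly added fields translated by default.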





[jira] [Updated] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-894:
---
Fix Version/s: (was: 1.15)
   1.16

> Add webapp mode for Tika Server, simplifies deployment
> --
>
> Key: TIKA-894
> URL: https://issues.apache.org/jira/browse/TIKA-894
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Affects Versions: 1.1, 1.2
>Reporter: Chris Wilson
>  Labels: maven, newbie, patch
> Fix For: 1.16
>
> Attachments: tika-server-webapp.patch
>
>
> For use in production services, Tika Server should really be deployed as a 
> WAR file, under a reliable servlet container that knows how to run as a 
> system service, for example Tomcat or JBoss.
> This is especially important on Windows, where I wasted an entire day trying 
> to make TikaServerCli run as some kind of a service. 
> Maven makes building a webapp pretty trivial. With the attached patch 
> applied, "mvn war:war" should work. It seems to run fine in Tomcat, which 
> makes Windows deployment much simpler. Just install Tomcat and drop the WAR 
> file into tomcat's webapps directory and you're away.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1607:

Fix Version/s: (was: 1.15)
   1.16

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Sub-task
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.16
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancing the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection of HashMaps, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope, 
> however, that the String-to-Object mapping is flexible enough to allow me to 
> model Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
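
To make the proposal concrete, here is a rough sketch of a String-to-Object store with a String view kept for backwards compatibility. The class and its methods are hypothetical and are not the Tika Metadata API or the contents of the attached patches.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a String -> Object metadata store. */
final class ObjectMetadataSketch {
    private final Map<String, Object> values = new LinkedHashMap<>();

    void set(String key, Object value) {
        values.put(key, value);
    }

    Object get(String key) {
        return values.get(key);
    }

    /** Typed access; returns null if the key is absent or holds another type. */
    <T> T get(String key, Class<T> type) {
        Object value = values.get(key);
        return type.isInstance(value) ? type.cast(value) : null;
    }

    /** String view for callers written against the String-only model. */
    String getString(String key) {
        Object value = values.get(key);
        return value == null ? null : value.toString();
    }
}
```

The phone number case would then store, under "phonenumbers", a List of Maps carrying entries such as LibPN-CountryCode and LibPN-NumberType for each number.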





[jira] [Updated] (TIKA-1840) No way to link slide notes to slide in PPT output.

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1840:

Fix Version/s: (was: 1.15)
   1.16

> No way to link slide notes to slide in PPT output.
> --
>
> Key: TIKA-1840
> URL: https://issues.apache.org/jira/browse/TIKA-1840
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Sam H
>Assignee: Chris A. Mattmann
> Fix For: 1.16
>
>
> I'm integrating Apache Tika into my project, and I want to extract (text) 
> information from PowerPoint slides, both PPT and PPTX.
> I've noticed that when using the PPT format, the slide notes are all 
> aggregated at the end of the XML output, and there is no way to identify 
> which note belongs to which slide.
> I began looking at the code and found the following:
> {code}
> // TODO Find the Notes for this slide and extract inline
> {code}
> in 
> [HSLFExtractor.java|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java]
>  on line 140 
> I would like to implement this part and contribute



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (TIKA-1318) Use of Deprecated Word6Extractor.getParagraphText() Method

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1318:

Fix Version/s: (was: 1.15)
   1.16

> Use of Deprecated Word6Extractor.getParagraphText() Method
> --
>
> Key: TIKA-1318
> URL: https://issues.apache.org/jira/browse/TIKA-1318
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Tyler Palsulich
>Priority: Minor
>  Labels: deprecation
> Fix For: 1.16
>
>
> org.apache.tika.parser.microsoft.WordExtractor.parseWord6() uses the 
> deprecated Word6Extractor.getParagraphText() method. getParagraphText() is 
> supposed to return a String[] with an element for each paragraph in the text. 
> The replacement is getText(), which lets paragraph, cell, etc separation be 
> implementation specific. I'm not sure, at this point, how the POI 
> WordExtractor separates them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (TIKA-988) We don't extract a placeholder for a Word document embedded in an Excel document

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-988:
---
Fix Version/s: (was: 1.15)
   1.16

> We don't extract a placeholder for a Word document embedded in an Excel 
> document
> 
>
> Key: TIKA-988
> URL: https://issues.apache.org/jira/browse/TIKA-988
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.16
>
> Attachments: bug31373.xls
>
>
> In TIKA-956 we fixed the Word parser so that at the point where an embedded 
> document appears, we output a  tag.
> It would be nice to do this for documents embedded in Excel too.





[jira] [Updated] (TIKA-1697) Parser Implementation for AkomaNtoso Legal XML Documents

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1697:

Fix Version/s: (was: 1.15)
   1.16

> Parser Implementation for AkomaNtoso Legal XML Documents
> 
>
> Key: TIKA-1697
> URL: https://issues.apache.org/jira/browse/TIKA-1697
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.16
>
>
> [AkomaNtoso|http://www.akomantoso.org/] is an established OASIS Legal 
> Document XML standard and used pervasively within parliaments and other 
> legislative arenas.
> This issue should utilize the 
> [akomantoso-lib|https://github.com/kohsah/akomantoso-lib] to parse and 
> populate Metadata for AkomaNtoso .xml and .akn documents.
> I'll send a PR for this soon.





[jira] [Updated] (TIKA-1208) Migrate Any23 mime contributions to Tika

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1208:

Fix Version/s: (was: 1.15)
   1.16

> Migrate Any23 mime contributions to Tika
> 
>
> Key: TIKA-1208
> URL: https://issues.apache.org/jira/browse/TIKA-1208
> Project: Tika
>  Issue Type: Sub-task
>  Components: mime
>Reporter: Lewis John McGibbney
> Fix For: 1.16
>
> Attachments: TIKA-1208.patch
>
>
> We begin with one of the most obvious areas in which there
> is overlap.
> In short, the appeal of this package is the addition of detection 
> for the following types:
>  - text/n3
>  - text/rdf+n3
>  - application/n3
>  - text/x-nquads
>  - text/rdf+nq
>  - text/nq
>  - application/nq
>  - text/turtle
>  - application/x-turtle
>  - application/turtle
>  - application/trix
>  
> Therefore, although both Tika and Any23 perform Mimetype-related
> tasks, there is a contribution to be made. This involves the transferral of
> code pertaining to pattern recognition, Mimetype XML definitions within 
> tika-mimetypes.xml and a Purifier implementation that removes any 
> blank characters at the head of a file that might 
> prevent its MIME type detection.  
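
The purification step can be sketched as follows. This is a simplified, hypothetical helper that works on a header byte array and treats space, tab, CR and LF as blank; it is not the Any23 Purifier implementation.

```java
import java.util.Arrays;

/** Hypothetical sketch: drop leading blank bytes before MIME type detection. */
final class LeadingBlankStripper {

    /** Returns the header without any leading space, tab, CR or LF bytes. */
    static byte[] stripLeadingBlanks(byte[] header) {
        int start = 0;
        while (start < header.length && isBlank(header[start])) {
            start++;
        }
        return Arrays.copyOfRange(header, start, header.length);
    }

    private static boolean isBlank(byte b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n';
    }
}
```

A magic-byte pattern anchored at offset 0 can then match content that would otherwise be hidden behind leading whitespace.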





[jira] [Updated] (TIKA-819) Make Option to Exclude Embedded Files' Text for Text Content

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-819:
---
Fix Version/s: (was: 1.15)
   1.16

> Make Option to Exclude Embedded Files' Text for Text Content
> 
>
> Key: TIKA-819
> URL: https://issues.apache.org/jira/browse/TIKA-819
> Project: Tika
>  Issue Type: New Feature
>  Components: general
>Affects Versions: 1.0
> Environment: Windows-7 + JDK 1.6 u26
>Reporter: Albert L.
> Fix For: 1.16
>
>
> It would be nice to be able to disable text content from embedded files.
> For example, if I have a DOCX with an embedded PPTX, then I would like the 
> option to disable text from the PPTX from showing up when asking for the text 
> content from DOCX.  In other words, it would be nice to have the option to 
> get text content *only* from the DOCX instead of the DOCX+PPTX.





[jira] [Updated] (TIKA-1709) Tika Server doesn't handle multi-part attachments or form-encoded inputs

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1709:

Fix Version/s: (was: 1.15)
   1.16

> Tika Server doesn't handle multi-part attachments or form-encoded inputs
> 
>
> Key: TIKA-1709
> URL: https://issues.apache.org/jira/browse/TIKA-1709
> Project: Tika
>  Issue Type: Bug
>  Components: server
> Environment: http://github.com/chrismattmann/tika-python/ Windows 7 
> Ultimate
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.16
>
>
> Downstream in the Tika Python library, I noticed that Tika Server doesn't 
> handle multi-part attachments (e.g. in /rmeta) on Windows 7 Ultimate, such as 
> those encoded using curl -T for example. Tika Server returns a 415 because 
> it can't properly determine what the mime type is.
> See: 
> https://github.com/kennethreitz/requests/issues/2725
> https://github.com/chrismattmann/tika-python/issues/58
> For more info.





[jira] [Updated] (TIKA-2340) Add explicit deps to tika-parsers which are currently used from transitive scope

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2340:

Fix Version/s: (was: 1.15)
   1.16

> Add explicit deps to tika-parsers which are currently used from transitive 
> scope
> 
>
> Key: TIKA-2340
> URL: https://issues.apache.org/jira/browse/TIKA-2340
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.14
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
> Fix For: 1.16
>
>






[jira] [Updated] (TIKA-1598) Parser Implementation for Streaming Video

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1598:

Fix Version/s: (was: 1.15)
   1.16

> Parser Implementation for Streaming Video
> -
>
> Key: TIKA-1598
> URL: https://issues.apache.org/jira/browse/TIKA-1598
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: memex
> Fix For: 1.16
>
>
> A number of us have been discussing a Tika implementation which could, for 
> example, bind to a live multimedia stream and parse content from the stream 
> until it finished.
> An excellent example would be watching Bonnie Scotland beating R. of Ireland 
> in the upcoming European Championship Qualifying - Group D on Sat 13 Jun @ 
> 17:00 GMT :)
> I located a JMF Wrapper for ffmpeg which 'may' enable us to do this
> http://sourceforge.net/projects/jffmpeg/
> I am not sure... plus it is not licensed liberally enough for us to include 
> so if there are other implementations then please post them here.
> I 'may' be able to have a crack at implementing this next week.





[jira] [Updated] (TIKA-1829) org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92) NPE

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1829:

Fix Version/s: (was: 1.15)
   1.16

> org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92)
>  NPE 
> 
>
> Key: TIKA-1829
> URL: https://issues.apache.org/jira/browse/TIKA-1829
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: OSX 10.11
>Reporter: frank
>Assignee: Tim Allison
>Priority: Critical
>  Labels: easyfix
> Fix For: 1.16
>
> Attachments: TesseractOCRParser.java
>
>
> Just need to add a check on parameter of context.
> 2016-01-11 12:36:52.328 [http-nio-8080-exec-9] WARN  
> o.a.j.core.query.lucene.NodeIndexer - Exception while indexing binary property
> java.lang.NullPointerException: null
>   at 
> org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:92)
>  ~[tika-parsers-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:87) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253)
>  ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253)
>  ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:95) 
> ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:253)
>  ~[tika-core-1.11.jar:1.11]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.isSupportedMediaType(NodeIndexer.java:934)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:448)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.addValue(NodeIndexer.java:338)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.NodeIndexer.createDoc(NodeIndexer.java:270)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:1246)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.mergeAggregatedNodeIndexes(SearchIndex.java:1539)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:1247)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.query.lucene.SearchIndex.updateNodes(SearchIndex.java:667)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.SearchManager.onEvent(SearchManager.java:408) 
> [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.observation.EventConsumer.consumeEvents(EventConsumer.java:249)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.observation.ObservationDispatcher.dispatchEvents(ObservationDispatcher.java:225)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.observation.EventStateCollection.dispatch(EventStateCollection.java:475)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.SharedItemStateManager$Update.end(SharedItemStateManager.java:856)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.SharedItemStateManager.update(SharedItemStateManager.java:1537)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.LocalItemStateManager.update(LocalItemStateManager.java:400)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.XAItemStateManager.update(XAItemStateManager.java:354)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.state.LocalItemStateManager.update(LocalItemStateManager.java:375)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.version.VersionManagerImplBase$WriteOperation.save(VersionManagerImplBase.java:470)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.version.VersionManagerImplBase.checkoutCheckin(VersionManagerImplBase.java:215)
>  [jackrabbit-core-2.8.0.jar:2.8.0]
>   at 
> org.apache.jackrabbit.core.VersionManagerImpl.access$400(VersionManagerImpl.java:73)
>  [jac

[jira] [Updated] (TIKA-1672) Integrate tika-java7 component

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1672:

Fix Version/s: (was: 1.15)
   1.16

> Integrate tika-java7 component
> --
>
> Key: TIKA-1672
> URL: https://issues.apache.org/jira/browse/TIKA-1672
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tyler Palsulich
> Fix For: 1.16
>
>
> Code requiring Java 7 doesn't need to be in a separate module now that 
> TIKA-1536 (upgrade to Java 7) is done.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (TIKA-1952) Access Date is getting modified while capturing the MetaData information using AutoDetectParser

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1952:

Fix Version/s: (was: 1.15)
   1.16

> Access Date is getting modified while capturing the MetaData information 
> using AutoDetectParser
> ---
>
> Key: TIKA-1952
> URL: https://issues.apache.org/jira/browse/TIKA-1952
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.12
> Environment: Windows
>Reporter: RameshKalidindi
>  Labels: features
> Fix For: 1.16
>
>
> I am developing a project that captures metadata (file name, author, file 
> extension, last modified date and access date) for each file in a folder 
> using Tika's AutoDetectParser. I can get the metadata for all files in a 
> given folder, but the access date of each file is overwritten with the 
> current date and time, because the program opens every file while 
> extracting its metadata.
> My question: is there any way to get the last access date of the file as it 
> was before parsing? Or can we stop the access date from changing when 
> AutoDetectParser reads the file? Please help me in this regard.
> Note: the access date is very important for my project, because our reports 
> are built on it.
> Thanks,
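
One possible workaround for the question above, sketched with the JDK's java.nio.file API only: capture the access time before the file is opened and write it back afterwards. The class and method names are hypothetical, the read loop merely stands in for the Tika parse call, and whether the access time changes at all depends on the operating system and file system settings.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributeView;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;

public class AccessTimeGuard {

    /** Reads the last access time from the file attributes, without opening the content. */
    static FileTime readAccessTime(Path file) throws IOException {
        return Files.readAttributes(file, BasicFileAttributes.class).lastAccessTime();
    }

    /** Writes back a previously captured access time and leaves the other timestamps alone. */
    static void restoreAccessTime(Path file, FileTime accessTime) throws IOException {
        BasicFileAttributeView view =
                Files.getFileAttributeView(file, BasicFileAttributeView.class);
        // Argument order is (lastModified, lastAccess, create); null means "do not change".
        view.setTimes(null, accessTime, null);
    }

    /** Captures the access time, reads the file, then restores the captured value. */
    static FileTime readPreservingAccessTime(Path file) throws IOException {
        FileTime before = readAccessTime(file);
        try (InputStream in = Files.newInputStream(file)) {
            // Placeholder for the real work, e.g. a parser.parse(...) call on this stream.
            while (in.read() != -1) { }
        } finally {
            restoreAccessTime(file, before);
        }
        return before;
    }
}
```

The value returned by readPreservingAccessTime is the access date as it was before parsing, so it can be used for reporting instead of the value read after extraction.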





[jira] [Updated] (TIKA-2298) To improve object recognition parser so that it may work without external RESTful service setup

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2298:

Fix Version/s: (was: 1.15)
   1.16

> To improve object recognition parser so that it may work without external 
> RESTful service setup
> ---
>
> Key: TIKA-2298
> URL: https://issues.apache.org/jira/browse/TIKA-2298
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Avtar Singh
>  Labels: ObjectRecognitionParser
> Fix For: 1.16
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> When ObjectRecognitionParser was built to do image recognition, there was no
> good support for it in Java frameworks.  All the popular neural network
> toolkits were in C++ or Python.  Since nothing ran within the JVM, we tried
> several ways to glue them to Tika (CLI, JNI, gRPC, REST).
> However, this is slowly changing. Deeplearning4j, the most widely known
> neural network library for the JVM, now supports importing models that are
> pre-trained in Python/C++ based kits [5].
> *Improvement:*
> It would be nice to have an implementation of ObjectRecogniser that
> doesn't require any external setup (such as installing native libraries or
> starting REST services). Reasons: it is easier to distribute and it cuts the
> IO time.





[jira] [Updated] (TIKA-1706) Bring back commons-io to tika-core

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1706:

Fix Version/s: (was: 1.15)
   1.16

> Bring back commons-io to tika-core
> --
>
> Key: TIKA-1706
> URL: https://issues.apache.org/jira/browse/TIKA-1706
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.16
>
> Attachments: TIKA-1706-1.patch, TIKA-1706-2.patch
>
>
> TIKA-249 inlined select commons-io classes in order to simplify the 
> dependency tree and save some space.
> I believe these arguments are weaker nowadays due to the following concerns:
> - Most of the non-core modules already use commons-io, and since tika-core is 
> usually not used by itself, commons-io is already included with it
> - Since some modules use both tika-core and commons-io, it's not clear which 
> code should be used
> - Having the inlined classes causes more maintenance and/or technical debt 
> (which in turn causes more maintenance)
> - Newer commons-io code utilizes newer platform code, e.g. using Charset 
> objects instead of encoding names, being able to use StringBuilder instead of 
> StringBuffer, and so on.
> I'll be happy to provide a patch to replace usages of the inlined classes 
> with commons-io classes if this is accepted.
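
To illustrate the last point in the list above, here is a small sketch of the platform features it mentions, using only the JDK. It does not show commons-io or tika-core code; the class and method names are made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetVersusName {

    /** Older style: the encoding is a name, so callers must handle a checked exception. */
    static String decodeByName(byte[] bytes, String encodingName)
            throws UnsupportedEncodingException {
        return new String(bytes, encodingName);
    }

    /** Newer style: the encoding is a Charset object, so no checked exception is needed. */
    static String decodeByCharset(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    /** StringBuilder has the same append API as StringBuffer but is not synchronized. */
    static String join(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "\u00dcber den Wolken".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeByCharset(bytes, StandardCharsets.UTF_8));
        System.out.println(join("tika", "-", "core"));
    }
}
```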





[jira] [Updated] (TIKA-1220) Parser implementation for IFC files

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1220:

Fix Version/s: (was: 1.15)
   1.16

> Parser implementation for IFC files
> 
>
> Key: TIKA-1220
> URL: https://issues.apache.org/jira/browse/TIKA-1220
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: new-parser
> Fix For: 1.16
>
> Attachments: 2012-03-23-Duplex-Programming.ifc
>
>
> The Industry Foundation Classes (IFC) [0] data model is intended to describe 
> building and construction industry data. For the sake of argument, it can be 
> considered as a more intelligent successor to the .dwg data models used 
> within CAD models.
> I've tracked down a potential 3rd party library [1] which we may be able to 
> wrap and use within Tika. However, the provided software packages are licensed 
> under http://creativecommons.org/licenses/by-nc-sa/3.0/de/, so I am currently 
> over on legal-discuss@ to see whether it is possible to wrap some 
> code and contribute it to tika-parsers.
> When I get feedback from legal-discuss, and if this is a go-ahead, I'll need 
> to help the developers package the code as a Maven artifact(s), then I will 
> progress with writing the implementation.  
> [0] http://en.wikipedia.org/wiki/Industry_Foundation_Classes
> [1] http://www.ifctoolsproject.com/





[jira] [Updated] (TIKA-1108) Represent individual slides in pptx

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1108:

Fix Version/s: (was: 1.15)
   1.16

> Represent individual slides in pptx
> ---
>
> Key: TIKA-1108
> URL: https://issues.apache.org/jira/browse/TIKA-1108
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Daniel Bonniot de Ruisselet
> Fix For: 1.16
>
>
> When parsing ppt, tika produces for each slide:
> 
> However for pptx these seem to be missing, all the text is directly under 
> .





[jira] [Resolved] (TIKA-2016) A parser that combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text.

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-2016.
-
Resolution: Fixed

this is fixed - thanks to [~thammegowda]!

> A parser that combines Apache OpenNLP and Apache Tika and provides facilities 
> for automatically deriving sentiment from text.
> -
>
> Key: TIKA-2016
> URL: https://issues.apache.org/jira/browse/TIKA-2016
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Anastasija Mensikova
>Assignee: Chris A. Mattmann
>  Labels: analysis, gsoc2016, memex, parser, sentiment
> Fix For: 1.16
>
>
> A new project that implements a parser that uses Apache OpenNLP and Apache 
> Tika to perform Sentiment Analysis.





[jira] [Updated] (TIKA-539) Encoding detection is too biased by encoding in meta tag

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-539:
---
Fix Version/s: (was: 1.15)
   1.16

> Encoding detection is too biased by encoding in meta tag
> 
>
> Key: TIKA-539
> URL: https://issues.apache.org/jira/browse/TIKA-539
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata, parser
>Affects Versions: 0.8, 0.9, 0.10
>Reporter: Reinhard Schwab
>Assignee: Ken Krugler
>Priority: Minor
> Fix For: 1.16
>
> Attachments: TIKA-539_2.patch, TIKA-539.patch
>
>
> If the encoding in the meta tag is wrong, this wrong encoding is detected,
> even if the right encoding was set in the metadata beforehand (for example 
> from the HTTP response header).
> Test code to reproduce:
> static String content = "<html><head>\n"
>         + "<meta http-equiv=\"content-type\" content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>         + "</head><body>Über den Wolken</body></html>\n";
>
> /**
>  * @param args
>  * @throws IOException
>  * @throws TikaException
>  * @throws SAXException
>  */
> public static void main(String[] args) throws IOException, SAXException,
>         TikaException {
>     // Declare the correct encoding up front, as an HTTP response header would.
>     Metadata metadata = new Metadata();
>     metadata.set(Metadata.CONTENT_TYPE, "text/html");
>     metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>     InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
>     AutoDetectParser parser = new AutoDetectParser();
>     BodyContentHandler h = new BodyContentHandler(1);
>     parser.parse(in, h, metadata, new ParseContext());
>     System.out.print(h.toString());
>     // Print the encoding again to see whether the meta tag overrode it.
>     System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
> }





[jira] [Updated] (TIKA-1465) Implement extraction of non-global variables from netCDF3 and netCDF4

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1465:

Fix Version/s: (was: 1.15)
   1.16

> Implement extraction of non-global variables from netCDF3 and netCDF4
> -
>
> Key: TIKA-1465
> URL: https://issues.apache.org/jira/browse/TIKA-1465
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.16
>
>
> Speaking to Eric Nienhouse at the ongoing NSF funded Polar 
> Cyberinfrastructure hackathon in NYC, we became aware that variable 
> parameters contained within netCDF3 and netCDF4 files are just as valuable 
> as (if not more valuable than) global attribute values. 
> AFAIK, right now we only extract global attributes; however, we could extend 
> the support to cover variable parameters as well.  





[jira] [Updated] (TIKA-2016) A parser that combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text.

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2016:

Fix Version/s: (was: 1.16)
   1.15

> A parser that combines Apache OpenNLP and Apache Tika and provides facilities 
> for automatically deriving sentiment from text.
> -
>
> Key: TIKA-2016
> URL: https://issues.apache.org/jira/browse/TIKA-2016
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Anastasija Mensikova
>Assignee: Chris A. Mattmann
>  Labels: analysis, gsoc2016, memex, parser, sentiment
> Fix For: 1.15
>
>
> A new project that implements a parser that uses Apache OpenNLP and Apache 
> Tika to perform Sentiment Analysis.





[jira] [Updated] (TIKA-1390) Create tika-example module

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1390:

Fix Version/s: (was: 1.15)
   1.16

> Create tika-example module
> --
>
> Key: TIKA-1390
> URL: https://issues.apache.org/jira/browse/TIKA-1390
> Project: Tika
>  Issue Type: Bug
>  Components: example
>Reporter: Tyler Palsulich
> Fix For: 1.16
>
>
> This issue will track the initial creation of the tika-example module. 
> Subtasks will be used for the first few examples.





[jira] [Updated] (TIKA-1454) Extracting as HTML loses links in xlsx, ppt, and pptx files

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1454:

Fix Version/s: (was: 1.15)
   1.16

> Extracting as HTML loses links in xlsx, ppt, and pptx files
> ---
>
> Key: TIKA-1454
> URL: https://issues.apache.org/jira/browse/TIKA-1454
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12
> Environment: RedHat EL5, EL6, EL7
>Reporter: Chris Bryant
>Assignee: Tim Allison
> Fix For: 1.16
>
> Attachments: testurl.ods, testurl.xlsx, urltest.odp, urltest.ppt, 
> urltest.pptx
>
>
> I am trying to convert documents to HTML, then looking through the HTML for 
> anchor tags to find links to external URLs.  This works fine when looking at 
> some document types, including PDFs, Open Document formats, Microsoft Word 
> formats .doc and .docx, and the older Microsoft Excel .xls format, but it 
> does not work for any Microsoft Powerpoint formats (.ppt or .pptx) and it 
> does not work for the newer Excel .xlsx format.  For the .ppt, .pptx, and 
> .xlsx formats, the text is extracted properly and formatted into HTML, but 
> the link is not converted to an anchor tag.
> I am running tika in --server --html mode.
> I included samples of .xlsx, .ppt, and .pptx files that do not properly 
> extract links, and also included samples of .ods and .odp files that do 
> extract links properly.
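
The link check described above (scan the HTML output for anchor tags that point to external URLs) could look roughly like the following. This is a regex-based sketch with a hypothetical class name, not the reporter's code, and a real HTML parser would be more robust against unusual markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorLinkScanner {

    // Matches href="..." or href='...' inside an <a ...> tag; group 2 is the URL.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the href values of anchor tags that start with http:// or https://. */
    static List<String> externalLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(2);
            if (url.startsWith("http://") || url.startsWith("https://")) {
                links.add(url);
            }
        }
        return links;
    }
}
```

With the behavior reported in this issue, a URL that appears only as plain text in the extracted HTML is not found by such a scan, because no anchor tag wraps it.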





[jira] [Updated] (TIKA-1616) Tika Parser for GIBS Metadata

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1616:

Fix Version/s: (was: 1.15)
   1.16

> Tika Parser for GIBS Metadata
> -
>
> Key: TIKA-1616
> URL: https://issues.apache.org/jira/browse/TIKA-1616
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.16
>
>
> [GIBS|https://earthdata.nasa.gov/about-eosdis/science-system-description/eosdis-components/global-imagery-browse-services-gibs]
>  metadata currently consists of simple stuff in the WMTS GetCapabilities 
> request (e.g. 
> http://map1.vis.earthdata.nasa.gov/wmts-arctic/1.0.0/WMTSCapabilities.xml) 
> which includes available layers, extents, time ranges, map projections, color 
> maps, etc. We will eventually have more detailed visualization metadata 
> available in ECHO/CMR which will include linkages to data products, 
> provenance, etc. 
> Some investigation and a Tika parser would be excellent to extract and 
> assimilate GIBS Metadata.





[jira] [Updated] (TIKA-2016) A parser that combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text.

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-2016:

Fix Version/s: (was: 1.15)
   1.16

> A parser that combines Apache OpenNLP and Apache Tika and provides facilities 
> for automatically deriving sentiment from text.
> -
>
> Key: TIKA-2016
> URL: https://issues.apache.org/jira/browse/TIKA-2016
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Anastasija Mensikova
>Assignee: Chris A. Mattmann
>  Labels: analysis, gsoc2016, memex, parser, sentiment
> Fix For: 1.16
>
>
> A new project that implements a parser that uses Apache OpenNLP and Apache 
> Tika to perform Sentiment Analysis.





[jira] [Updated] (TIKA-1505) chmparser breaks down when extracting from file of CHM format v3

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1505:

Fix Version/s: (was: 1.15)
   1.16

> chmparser breaks down when extracting from file of CHM format v3
> 
>
> Key: TIKA-1505
> URL: https://issues.apache.org/jira/browse/TIKA-1505
> Project: Tika
>  Issue Type: Bug
>Reporter: Bin Hawking
> Fix For: 1.16
>
>
> chmparser throws exception or returns faulty text when:
> 1. extracting from file of CHM format version 3
> 2. chm file with lzx reset interval > 2
> 3. chm file with >5000 objects
> I am making the fix now.





[jira] [Updated] (TIKA-1329) Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1329:

Fix Version/s: (was: 1.15)
   1.16

> Add RecursiveParserWrapper aka Jukka's (and Nick's) RecursiveMetadataParser
> ---
>
> Key: TIKA-1329
> URL: https://issues.apache.org/jira/browse/TIKA-1329
> Project: Tika
>  Issue Type: Sub-task
>  Components: parser
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.16
>
> Attachments: test_recursive_embedded.docx, TIKA-1329-site.patch, 
> TIKA-1329v2.patch
>
>
> Jukka and Nick have a great demo of parsing metadata recursively on the 
> [wiki|http://wiki.apache.org/tika/RecursiveMetadata].  For TIKA-1302, I'd 
> like to use something similar, and I think that others may find it useful for 
> tika-app and tika-server.
> I took the code from the wiki and made some modifications.  I'm not sure if 
> we should put this in parsers or in a new module for "examples."  Given that 
> I think this would be useful for tika-app and tika-server, I'd prefer 
> parsers, but I'm open to any input...including "let's not."





[jira] [Updated] (TIKA-1577) NetCDF Data Extraction

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1577:

Fix Version/s: (was: 1.15)
   1.16

> NetCDF Data Extraction
> --
>
> Key: TIKA-1577
> URL: https://issues.apache.org/jira/browse/TIKA-1577
> Project: Tika
>  Issue Type: Improvement
>  Components: handler, parser
>Affects Versions: 1.7
>Reporter: Ann Burgess
>Assignee: Ann Burgess
>  Labels: features, handler
> Fix For: 1.16
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> A netCDF classic or 64-bit offset dataset is stored as a single file 
> comprising two parts:
>  - a header, containing all the information about dimensions, attributes, and 
> variables except for the variable data;
>  - a data part, comprising fixed-size data, containing the data for variables 
> that don't have an unlimited dimension; and variable-size data, containing 
> the data for variables that have an unlimited dimension.
> The NetCDFParser currently extracts the "header part":
>  -- text extracts file Dimensions and Variables
>  -- metadata extracts Global Attributes
> We want the option to extract the "data part" of NetCDF files.  
> Let's use the NetCDF test file for our dev testing:  
> tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc
>  





[jira] [Updated] (TIKA-1379) error in Tika().detect for xml files with xades signature

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1379:

Fix Version/s: (was: 1.15)
   1.16

> error in Tika().detect for xml files with xades signature
> -
>
> Key: TIKA-1379
> URL: https://issues.apache.org/jira/browse/TIKA-1379
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.4
>Reporter: Alessandro De Angelis
>  Labels: new-parser
> Fix For: 1.16
>
>
> We tried to get the MIME type of an XML file with an embedded XAdES signature. 
> The result is "text/html" and not the expected "text/xml" or 
> "application/xml".
> Here is an example of the XML file:
> {code}
> 
> 
>   00094853 0003 2
>   2013-09-23
>   2013-09-23
>   D69017
>   FILOSOFIA DELLA SCIENZA
>   D69
>   TEATRO E ARTI VISIVE
>   
>   1233456
>   PAOLINO
>   PAPERINO
>   23.0
>   23
>   
>   
>   
>   2012
>   6.0
>   
>   9
>   جامعة البندقية - TEST
>   Verbale_3
>   QUI QUO QUA
>   D69017
>   FILOSOFIA DELLA SCIENZA
>   D69
>   TEATRO E ARTI VISIVE
>   QUI QUO QUA
> 26-09-2013 09:55:53 CEST(+0200)
> 
>   3
>   11.09.03
> 
> http://www.w3.org/2000/09/xmldsig#" 
> Id="sig08744308748201048377">
> 
>  Algorithm="http://www.w3.org/2006/12/xml-c14n11">
>  Algorithm="http://www.w3.org/2001/04/xmldsig-more#rsa-sha256">
> 
> 
> http://www.w3.org/2002/06/xmldsig-filter2">
>  xmlns:dsig-xpath="http://www.w3.org/2002/06/xmldsig-filter2" 
> Filter="subtract">/descendant::ds:Signature
> 
> http://www.w3.org/TR/1999/REC-xslt-19991116">
> http://www.kion.it/webesse3/multilingua" 
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
> exclude-result-prefixes="kion" version="1.0">
>   
>   
>   
>select="/VERBALI/VERBALE">
>select="/VERBALI/VERBALE/SOSTITUZIONE_DOCUMENTO">
>select="/VERBALI/VERBALE/RAGGRUPPAMENTO">
>select="/VERBALI/VERBALE/COMMISSIONE">
>   
>   
>   
>   
>http-equiv="Content-Type">
>
>test="$sostituzione_root">
>   Dichiarazione 
> conformità Verbale Esame
>   
>   
>   Verbalizzazione 
> esame
>   
>   
>   
>td  {font-family: Arial; font-size:10pt;} 
>div {font-family: Arial; font-size:10pt;}
>pre {font-family: Arial; font-size:10pt;} 
>   
>   
>   
>   
>
>test="$sostituzione_root">
>colspan="2"> select="$verbale_root/ATENEO_DES">
>colspan="2">DICHIARAZIONE DI 
> CONFORMITÀ
>colspan="2">Il sottoscritto  select="$verbale_root/TITOLARE_PROCEDIMENTO">, docente di 
> 
>  
>   
>   
>     
>   
>test="$sostituzione_root/MOTIVAZIONE">
>   
> PREMESSO CHE
>   
>  
>   
>  select="$sostituzione_root/MOTIVAZIONE">
>   
>  
>   
> 
>   
>   
>   
>   
> DICHIARA
>    
> 
>  

[jira] [Updated] (TIKA-980) MicrodataContentHandler for Apache Tika

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-980:
---
Fix Version/s: (was: 1.15)
   1.16

> MicrodataContentHandler for Apache Tika
> ---
>
> Key: TIKA-980
> URL: https://issues.apache.org/jira/browse/TIKA-980
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Ken Krugler
> Fix For: 1.16
>
> Attachments: TIKA-980-1.3-1.patch, TIKA-980-1.3-2.patch, 
> TIKA-980-1.3-3.patch, TIKA-980-1.3-4.patch, TIKA-980-1.3-5.patch
>
>
> ContentHandler for Apache Tika capable of building a data structure 
> containing Microdata item scopes and item properties. The Item* classes are 
> borrowed from the Apache Any23 project and are slightly modified to 
> accommodate this SAX-based extractor vs the original DOM-based extractor.
> The provided unit test outputs two item scopes about the Europe and NA 
> ApacheCon events and each has a nested property.





[jira] [Updated] (TIKA-1295) Make some Dublin Core items multi-valued

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1295:

Fix Version/s: (was: 1.15)
   1.16

> Make some Dublin Core items multi-valued
> 
>
> Key: TIKA-1295
> URL: https://issues.apache.org/jira/browse/TIKA-1295
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.16
>
>
> According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, 
> dc:title, dc:description and dc:rights should allow multiple values because 
> of language alternatives.  Unless anyone objects in the next few days, I'll 
> switch those to Property.toInternalTextBag() from Property.toInternalText().  
> I'll also modify PDFParser to extract dc:rights.





[jira] [Updated] (TIKA-1800) MediaType#parse does not decode escaped special characters

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1800:

Fix Version/s: (was: 1.15)
   1.16

> MediaType#parse does not decode escaped special characters
> --
>
> Key: TIKA-1800
> URL: https://issues.apache.org/jira/browse/TIKA-1800
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.11
>Reporter: Roberto Benedetti
> Fix For: 1.16
>
>
> Special characters in parameter value are escaped in canonical string 
> representation but they are not unescaped when the canonical string 
> representation is parsed.
> {code:java}
> MediaType mType = new MediaType(MediaType.APPLICATION_XML, "x-report", 
> "#report@");
> String cType = mType.toString(); // application/xml; x-report="#report\@"
> assertEquals("application/xml; x-report=\"#report\\@\"", cType); // success
> mType = MediaType.parse(cType);
> String report = mType.getParameters().get("x-report"); // #report\@
> assertEquals("#report@", report); // failure
> {code}
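
The missing step can be sketched as follows: strip the surrounding quotes from a parameter value, then drop each backslash and keep the character it escapes. This is a stand-alone illustration with a hypothetical class name, not the MediaType implementation or a proposed patch.

```java
public class QuotedParameterValue {

    /** Removes surrounding double quotes, if present, and resolves backslash escapes. */
    static String unescape(String raw) {
        String s = raw;
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            s = s.substring(1, s.length() - 1);
        }
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                // Skip the backslash and keep the escaped character as-is.
                out.append(s.charAt(++i));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Applied to the quoted value from the example above, unescape turns "#report\@" back into #report@, which is what the failing assertion expects.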





[jira] [Updated] (TIKA-1808) Head section closed too eager

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1808:

Fix Version/s: (was: 1.15)
   1.16

> Head section closed too eager
> -
>
> Key: TIKA-1808
> URL: https://issues.apache.org/jira/browse/TIKA-1808
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
>Reporter: Markus Jelsma
> Fix For: 1.16
>
>
> XHTMLContentHandler has some logic that closes the head section too early, or 
> this is a problem in TagSoup. In this [1] case a <div> element appears in the 
> head, causing the head to be closed. Subsequent <meta> elements do not appear 
> in custom ContentHandlers so I cannot read the document's title, or any other 
> meta tags.
> It can be fixed by using a custom HTMLSchema in the ParseContext, e.g. 
> schema.elementType("div", HTMLSchema.M_EMPTY, 65535, 0); but this isn't 
> really an elegant solution.
> [1] http://www.aljazeera.com/news/2015/05/150516182251747.html





[jira] [Updated] (TIKA-1609) Leverage Google's LibPhonenumber for enhanced phone number extraction and metadata modeling

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1609:

Fix Version/s: (was: 1.15)
   1.16

> Leverage Google's LibPhonenumber for enhanced phone number extraction and 
> metadata modeling
> ---
>
> Key: TIKA-1609
> URL: https://issues.apache.org/jira/browse/TIKA-1609
> Project: Tika
>  Issue Type: New Feature
>  Components: core
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.16
>
>
> Google's Libphonenumber can provide us with comprehensive support for 
> modeling Phone number metadata properly in Tika.
> During the development of this patch I realized a few things, namely
>  * This is not a parser as such, as phone numbers are not mapped to any 
> particular MIME type
>  * In addition, there can be many phone numbers per document, so this is most 
> likely a Content Handler of sorts
>  * Tika's Metadata support is currently too restrictive to allow us to 
> persist many complex objects e.g. String, Object. We need to expand Metadata 
> support over and above String, String[].
> https://github.com/googlei18n/libphonenumber/





[jira] [Updated] (TIKA-1106) CLAVIN Integration

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1106:

Fix Version/s: (was: 1.15)
   1.16

> CLAVIN Integration
> --
>
> Key: TIKA-1106
> URL: https://issues.apache.org/jira/browse/TIKA-1106
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.3
> Environment: All
>Reporter: Adam Estrada
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: entity, geospatial, new-parser
> Fix For: 1.16
>
>
> I've been evaluating CLAVIN as a way to extract location information from 
> unstructured text. It seems like meshing it with Tika in some way would make 
> a lot of sense. From the CLAVIN website:
> {quote}
> CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source 
> software package for document geotagging and geoparsing that employs 
> context-based geographic entity resolution. It combines a variety of open 
> source tools with natural language processing techniques to extract location 
> names from unstructured text documents and resolve them against gazetteer 
> records. Importantly, CLAVIN does not simply "look up" location names; 
> rather, it uses intelligent heuristics in an attempt to identify precisely 
> which "Springfield" (for example) was intended by the author, based on the 
> context of the document. CLAVIN also employs fuzzy search to handle 
> incorrectly-spelled location names, and it recognizes alternative names 
> (e.g., "Ivory Coast" and "Côte d'Ivoire") as referring to the same geographic 
> entity. By enriching text documents with structured geo data, CLAVIN enables 
> hierarchical geospatial search and advanced geospatial analytics on 
> unstructured data.
> {quote}
> There was only one other instance of the word "clavin" mentioned in the ASF 
> jira site so I thought it was definitely worth posting here.
> https://github.com/Berico-Technologies/CLAVIN





[jira] [Updated] (TIKA-1640) Make ExternalParser support aliases for key names in extracted metadata

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1640:

Fix Version/s: (was: 1.15)
   1.16

> Make ExternalParser support aliases for key names in extracted metadata
> ---
>
> Key: TIKA-1640
> URL: https://issues.apache.org/jira/browse/TIKA-1640
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.16
>
>
> Over in TIKA-1639, we were discussing the work outside of Tika that [~rgauss] 
> did (per [~gagravarr]) on the EXIFTool parsing. I added support in TIKA-1639 
> for this, but one thing Ray's code-based work did that my config-oriented 
> work didn't is allow for renaming extracted metadata key names to better 
> support having consistent metadata across parsers.
> Here's one way to do it:
> ExternalParser could have a config section like so:
> {code:xml}
> 
>   
>   
> 
> {code}
> Then this could be used to rename metadata keys.
> I'll implement that in this issue.
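
The renaming step described above can be sketched roughly as follows. This
is only an illustration of the idea, not the ExternalParser implementation:
the class MetadataKeyAliases, its rename method, and the key names used in
the usage note are all hypothetical, and the aliases are assumed to have
already been read from configuration into a plain map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Tika. */
final class MetadataKeyAliases {
    private MetadataKeyAliases() {}

    /**
     * Returns a copy of {@code extracted} in which every key that has an
     * entry in {@code aliases} is replaced by its alias. Keys without an
     * alias are kept unchanged, and insertion order is preserved.
     */
    static Map<String, String> rename(Map<String, String> extracted,
                                      Map<String, String> aliases) {
        Map<String, String> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            // Fall back to the original key when no alias is configured.
            String key = aliases.getOrDefault(entry.getKey(), entry.getKey());
            renamed.put(key, entry.getValue());
        }
        return renamed;
    }
}
```

For example, with an alias from "Image Width" to "tiff:ImageWidth" (both
names chosen for illustration), an extracted entry "Image Width" = "640"
would come back under the key "tiff:ImageWidth", while keys with no alias
pass through untouched.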





[jira] [Updated] (TIKA-1738) ForkClient does not always delete temporary bootstrap jar

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1738:

Fix Version/s: (was: 1.15)
   1.16

> ForkClient does not always delete temporary bootstrap jar
> -
>
> Key: TIKA-1738
> URL: https://issues.apache.org/jira/browse/TIKA-1738
> Project: Tika
>  Issue Type: Bug
>  Components: core
> Environment: Windows 10
>Reporter: Yaniv Kunda
>Priority: Minor
> Fix For: 1.16
>
> Attachments: TIKA-1738.patch
>
>
> ForkClient creates a new temporary bootstrap jar each time it's instantiated, 
> and tries to delete it in the {{close()}} method, after destroying the 
> process.
> Possibly as a Windows-specific behavior, the OS seems to still hold a handle 
> to the file for a short time after the process is destroyed, causing the 
> delete() method to do nothing.
> This can be reproduced by simply running ForkParserTest on my machine.
> In a long-running process, this could fill the temp folder with many bootstrap 
> jars that will never be deleted.
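
One possible way to cope with a briefly held file handle is to retry the
delete a few times and fall back to deleteOnExit(). The sketch below is an
assumption about how that could look, not the attached patch or the
ForkClient code: the class TempFileCleanup, the method deleteWithRetry, and
the attempt/delay parameters are hypothetical. It relies only on
java.io.File, whose delete() returns false rather than throwing when the
file cannot be removed.

```java
import java.io.File;

/** Hypothetical helper for illustration; not part of Tika. */
final class TempFileCleanup {
    private TempFileCleanup() {}

    /**
     * Tries to delete {@code file} up to {@code attempts} times, sleeping
     * {@code delayMillis} between attempts. Returns true once the file is
     * gone. If it still exists afterwards, registers it for deletion at
     * JVM exit and returns false.
     */
    static boolean deleteWithRetry(File file, int attempts, long delayMillis) {
        for (int i = 0; i < attempts; i++) {
            // File.delete() signals failure by returning false.
            if (file.delete() || !file.exists()) {
                return true;
            }
            if (i < attempts - 1) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt status and stop retrying.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        // Last resort: only takes effect when the JVM shuts down normally.
        file.deleteOnExit();
        return false;
    }
}
```

Note that deleteOnExit() does not help a long-running process until it
exits, so the retry loop is the part that addresses the accumulation of
bootstrap jars described above.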





[jira] [Updated] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-1815:

Fix Version/s: (was: 1.15)
   1.16

> Text content from parser is empty when NamedEntityParser is enabled
> ---
>
> Key: TIKA-1815
> URL: https://issues.apache.org/jira/browse/TIKA-1815
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.16
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When the NamedEntityParser is enabled, Tika#parseToString() and the other 
> parse() methods produce an empty string.





[jira] [Updated] (TIKA-985) Support for HTML5 elements

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-985:
---
Fix Version/s: (was: 1.15)
   1.16

> Support for HTML5 elements
> --
>
> Key: TIKA-985
> URL: https://issues.apache.org/jira/browse/TIKA-985
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.2
>Reporter: Markus Jelsma
> Fix For: 1.16
>
> Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch, 
> TIKA-985-1.3-3.patch, TIKA-985-1.5.patch
>
>
> TagSoup's schema.tssl does not include some HTML5 elements (e.g. article, 
> section). This prevents some custom ContentHandlers from reading expected 
> elements and/or attributes.





[jira] [Resolved] (TIKA-1815) Text content from parser is empty when NamedEntityParser is enabled

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1815.
-
   Resolution: Fixed
Fix Version/s: (was: 1.16)
   1.15

> Text content from parser is empty when NamedEntityParser is enabled
> ---
>
> Key: TIKA-1815
> URL: https://issues.apache.org/jira/browse/TIKA-1815
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Thamme Gowda
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.15
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When the NamedEntityParser is enabled, Tika#parseToString() and the other 
> parse() methods produce an empty string.





[jira] [Resolved] (TIKA-1106) CLAVIN Integration

2017-05-21 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-1106.
-
Resolution: Won't Fix

We already have the GeoTopicParser, so I'm going to close this one out.

> CLAVIN Integration
> --
>
> Key: TIKA-1106
> URL: https://issues.apache.org/jira/browse/TIKA-1106
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.3
> Environment: All
>Reporter: Adam Estrada
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: entity, geospatial, new-parser
> Fix For: 1.16
>
>





[jira] [Updated] (TIKA-1106) CLAVIN Integration

2017-05-21 Thread Adam Estrada (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Estrada updated TIKA-1106:
---

+1




> CLAVIN Integration
> --
>
> Key: TIKA-1106
> URL: https://issues.apache.org/jira/browse/TIKA-1106
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.3
> Environment: All
>Reporter: Adam Estrada
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: entity, geospatial, new-parser
> Fix For: 1.16
>
>





[jira] [Commented] (TIKA-2298) To improve object recognition parser so that it may work without external RESTful service setup

2017-05-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019113#comment-16019113
 ] 

ASF GitHub Bot commented on TIKA-2298:
--

chrismattmann commented on issue #159: Creation of TIKA-2298 contributed by 
asmehra95- Import of vgg16 via Deeplearning4j
URL: https://github.com/apache/tika/pull/159#issuecomment-302990927
 
 
   ping @asmehra95 any update?
 



> To improve object recognition parser so that it may work without external 
> RESTful service setup
> ---
>
> Key: TIKA-2298
> URL: https://issues.apache.org/jira/browse/TIKA-2298
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Avtar Singh
>  Labels: ObjectRecognitionParser
> Fix For: 1.16
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> When ObjectRecognitionParser was built to do image recognition, there wasn't
> good support for Java frameworks. All the popular neural networks were in
> C++ or Python. Since nothing ran within the JVM, we tried
> several ways to glue them to Tika (such as CLI, JNI, gRPC, and REST).
> However, this is slowly changing. Deeplearning4j, the most famous
> neural network library for the JVM, now supports importing models that are
> pre-trained in Python/C++ based kits [5].
> *Improvement:*
> It would be nice to have an implementation of ObjectRecogniser that
> doesn't require any external setup (like installation of native libraries or
> starting REST services). Reasons: it is easier to distribute, and it cuts
> the IO time.


