[jira] [Commented] (TIKA-1291) Invalid JSON output on CLI
[ https://issues.apache.org/jira/browse/TIKA-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992634#comment-13992634 ]

Steffen commented on TIKA-1291:
---
Thanks for your reply. Unfortunately I can't debug the problem here, but I sent you an email with the image file to ease debugging on your side.

Invalid JSON output on CLI
--
Key: TIKA-1291
URL: https://issues.apache.org/jira/browse/TIKA-1291
Project: Tika
Issue Type: Bug
Components: cli, metadata
Affects Versions: 1.4, 1.5
Reporter: Steffen

Getting the metadata from Tika via the CLI with the output format set to JSON sometimes produces invalid JSON. I only found float/array errors here in JIRA and thus created this ticket with a new case. In my case the file that led to invalid JSON output was a PNG file (that I unfortunately can't provide for testing):

{noformat}
{ Application Record Version:4,
Component 1:Y component: Quantization table 0, Sampling factors 2 horiz/2 vert,
Component 2:Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert,
Component 3:Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert,
Compression Type:Baseline,
Content-Length:113081,
Content-Type:image/jpeg,
Data Precision:8 bits,
IPTC-NAA record:24 bytes binary data,
Image Height:479 pixels,
Image Width:671 pixels,
Number of Components:3,
Resolution Units:inch,
Unknown tag (0x02f0):35,0,556,479,
X Resolution:220 dots,
Y Resolution:220 dots,
resourceName:18,
tiff:BitsPerSample:8,
tiff:ImageLength:479,
tiff:ImageWidth:671 }
{noformat}

The
{noformat}Unknown tag (0x02f0):35,0,556,479, {noformat}
is invalid JSON. It would be nice if Tika always produced valid JSON output. For other cases that might not be caught by the fix for this ticket, it would be nice to have a CLI argument/option that disables the output of certain (unknown?) fields or accepts a whitelist of field names to output. That way users can bridge the time until new Tika releases by being more specific on the shell. If that feature already exists, I apologize for not having found it, and a hint to the CLI option would be nice.

--
This message was sent by Atlassian JIRA (v6.2#6252)
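To illustrate the kind of output the report asks for, here is a minimal sketch that writes a metadata map as a JSON object in which every key and every value is a quoted, escaped string. It is not how Tika's CLI serializes metadata and not the fix for this ticket; the class name MetadataJson and its methods are hypothetical, and treating every value as a string is an assumption made for simplicity.

```java
import java.util.Map;

// Hypothetical helper, not part of Tika: serializes metadata as a JSON object
// whose keys and values are all emitted as quoted, escaped strings.
final class MetadataJson {

    private MetadataJson() {}

    // Builds {"key":"value",...} in the iteration order of the map.
    static String toJson(Map<String, String> metadata) {
        StringBuilder out = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (!first) {
                out.append(',');
            }
            first = false;
            out.append(quote(e.getKey())).append(':').append(quote(e.getValue()));
        }
        return out.append('}').toString();
    }

    // Wraps s in double quotes and escapes the characters JSON requires.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must be written as \\uXXXX.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }
}
```

With this approach a value such as 35,0,556,479 is written as one quoted string, so the commas inside it can no longer be mistaken for separators between entries.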
[jira] [Closed] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates
[ https://issues.apache.org/jira/browse/TIKA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison closed TIKA-1233.
-
Resolution: Fixed

After upgrade to PDFBOX-1.8.5, confirmed no longer any need for catch blocks for StringIndexOutOfBoundsException. Catch blocks removed in r1593983.

PDFBox can throw StringIndexOutOfBoundsException on some dates
--
Key: TIKA-1233
URL: https://issues.apache.org/jira/browse/TIKA-1233
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5
Reporter: Tim Allison
Priority: Trivial
Labels: easyfix
Fix For: 1.6

PDFBOX's date parser can throw a StringIndexOutOfBoundsException if a date string for parsing is empty or contains only spaces. A few of my test pdfs have this feature. Until PDFBOX-1803 is resolved, we can add an extra catch to prevent this from causing problems in TIKA

{noformat}
@@ -171,6 +171,9 @@
             addMetadata(metadata, TikaCoreProperties.CREATED, info.getCreationDate());
         } catch (IOException e) {
             // Invalid date format, just ignore
+        } catch (StringIndexOutOfBoundsException e){
+            //remove after PDFBOX-1883 is fixed
+            // Invalid date format, just ignore
         }
         try {
             Calendar modified = info.getModificationDate();
@@ -178,6 +181,9 @@
             addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
         } catch (IOException e) {
             // Invalid date format, just ignore
+        } catch (StringIndexOutOfBoundsException e){
+            //remove after PDFBOX-1883 is fixed
+            // Invalid date format, just ignore
         }
{noformat}
[jira] [Commented] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates
[ https://issues.apache.org/jira/browse/TIKA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995135#comment-13995135 ]

Tim Allison commented on TIKA-1233:
---
[~lfcnassif], please reopen if you are still finding problems on your test set with trunk.
[jira] [Closed] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception
[ https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison closed TIKA-1205.
-
Resolution: Won't Fix

I've decided that the added complexity in the code is not worth the benefit for me. If there is need in the community, please reopen.

Allow PDFParser to fallback to other parser if there is an exception
--
Key: TIKA-1205
URL: https://issues.apache.org/jira/browse/TIKA-1205
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
Fix For: 1.6

With TIKA-1201, there is now an option to use PDFBox's NonSequentialPDFParser instead of the traditional parser for parsing PDF files. Following the description in PDFBOX-1199, it would be useful to allow fallback to the classic parser if NonSequentialPDFParser throws an IOException.

For the sake of symmetry, I propose a boolean useParserFallbackOnException parameter. If this parameter is true and Tika's PDFParser is using the classic parser, Tika will fall back to the NonSequentialPDFParser if there is an IOException; if this parameter is true and Tika's PDFParser is using the NonSequentialPDFParser, it will fall back to the classic parser if there is an IOException.

Many thanks to Hong-Thai for championing the addition of the NonSequentialPDFParser capability in TIKA-1201, and many thanks to Timo for PDFBox's NonSequentialPDFParser (PDFBOX-1199)!
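The symmetric fallback proposed above can be sketched in plain Java as follows. This is only an illustration of the control flow, since the issue was closed as Won't Fix: the DocumentParser interface and the FallbackParsing class are hypothetical stand-ins rather than Tika or PDFBox types, and only the parameter name useParserFallbackOnException comes from the issue text.

```java
import java.io.IOException;

final class FallbackParsing {

    // Hypothetical stand-in for "parse this document"; not a Tika or PDFBox interface.
    @FunctionalInterface
    interface DocumentParser<T> {
        T parse(byte[] document) throws IOException;
    }

    private FallbackParsing() {}

    // Tries the primary parser; on IOException, retries with the fallback parser
    // only when useParserFallbackOnException is true.
    static <T> T parseWithFallback(byte[] document,
                                   DocumentParser<T> primary,
                                   DocumentParser<T> fallback,
                                   boolean useParserFallbackOnException) throws IOException {
        try {
            return primary.parse(document);
        } catch (IOException primaryFailure) {
            if (!useParserFallbackOnException) {
                throw primaryFailure;
            }
            try {
                return fallback.parse(document);
            } catch (IOException fallbackFailure) {
                // Keep the first failure attached so both errors can be diagnosed.
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }
}
```

Because the caller decides which parser is primary, the same method covers both directions described in the issue: classic first with the non-sequential parser as fallback, or the reverse.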
[jira] [Closed] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties
[ https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison closed TIKA-1283.
-
Resolution: Duplicate

I defer to the original design plan in TIKA-90.

Add thumbnail as possible metadata item to TikaCoreProperties
---
Key: TIKA-1283
URL: https://issues.apache.org/jira/browse/TIKA-1283
Project: Tika
Issue Type: Improvement
Components: metadata
Reporter: Tim Allison
Priority: Minor

TIKA-90 originally requested to add thumbnails to a document's metadata. I'd like to have a unified way of determining whether an embedded document/resource is a thumbnail or a regular attachment. With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling out more thumbnails than before. I propose adding tika:thumbnail to the metadata of each thumbnail image. The consumer can then determine what to do with the embedded resource based on the metadata.
[jira] [Commented] (TIKA-1204) DWFX files detection
[ https://issues.apache.org/jira/browse/TIKA-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992896#comment-13992896 ]

Nick Burch commented on TIKA-1204:
--
Any chance of a much smaller sample DWFX file? The one supplied is a little larger than we generally like for unit testing against.

DWFX files detection
--
Key: TIKA-1204
URL: https://issues.apache.org/jira/browse/TIKA-1204
Project: Tika
Issue Type: Improvement
Components: detector, mime
Affects Versions: 1.4
Reporter: Marco Quaranta
Priority: Minor
Attachments: General assembly filter.dwfx

DWFX are AutoCAD [Design web format|http://en.wikipedia.org/wiki/Design_Web_Format] files and follow [Open Packaging Conventions|http://en.wikipedia.org/wiki/Open_Packaging_Conventions]. Tika correctly detects these files as application/zip. It would be better if Tika could recognize the true mimetype: model/vnd.dwfx+xps. (y)

Please add logic to ZipContainerDetector so that DWFX files can be detected. We need a method behaving like detectOfficeOpenXML(OPCPackage pkg):

{noformat}
PackageRelationshipCollection core =
    pkg.getRelationshipsByType("http://schemas.autodesk.com/dwfx/2007/relationships/documentsequence");
if (core.size() != 1) {
    // Invalid DWFX Package received
    return null;
}
PackagePart corePart = pkg.getPart(core.getRelationship(0));
String coreType = corePart.getContentType();
return MediaType.parse(coreType);
{noformat}

Thank you,
Marco
[jira] [Comment Edited] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
[ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995298#comment-13995298 ]

Ray Gauss II edited comment on TIKA-1278 at 5/12/14 5:39 PM:
-
Hi [~talli...@apache.org],

I thought about adding to {{PDFParser.properties}} but decided against it since PDFBox could change the default values or change the properties' scale or use, and if we weren't aware of that change we'd be inadvertently overriding those defaults.

Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work well for most people.

We can certainly reconsider setting those defaults and/or adding other config if there are particular parameters people would find useful.

was (Author: rgauss):
Hi [~tallison],

I thought about adding to {{PDFParser.properties}} but decided against it since PDFBox could change the default values or change the properties' scale or use, and if we weren't aware of that change we'd be inadvertently overriding those defaults.

Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work well for most people.

We can certainly reconsider setting those defaults and/or adding other config if there are particular parameters people would find useful.

Expose PDF Avg Char and Spacing Tolerance Config Params
---
Key: TIKA-1278
URL: https://issues.apache.org/jira/browse/TIKA-1278
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Reporter: Ray Gauss II
Assignee: Ray Gauss II
Fix For: 1.6

{{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}.

Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior.
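The reasoning in the comment above (do not ship copies of the library's defaults, and only override a setting when the user asked for it) can be sketched roughly like this. Everything here is hypothetical: ToleranceConfig and Stripper are stand-ins and not the actual {{PDFParserConfig}} or PDFBox classes, and the default values in Stripper are invented for the example.

```java
// Hypothetical sketch: a config object that overrides a library setting only
// when the user explicitly supplied a value.
final class ToleranceConfig {

    // Stand-in for the library object being configured. Only the two settings
    // named in the issue are modeled; the default values are invented.
    static final class Stripper {
        float averageCharTolerance = 0.3f;
        float spacingTolerance = 0.5f;
    }

    // null means "not configured": the library default is left untouched.
    private Float averageCharTolerance;
    private Float spacingTolerance;

    void setAverageCharTolerance(float value) {
        this.averageCharTolerance = value;
    }

    void setSpacingTolerance(float value) {
        this.spacingTolerance = value;
    }

    // Applies only the values that were explicitly set, so a change to the
    // library's defaults is never silently overridden by this class.
    void configure(Stripper stripper) {
        if (averageCharTolerance != null) {
            stripper.averageCharTolerance = averageCharTolerance;
        }
        if (spacingTolerance != null) {
            stripper.spacingTolerance = spacingTolerance;
        }
    }
}
```

Keeping the fields nullable avoids hard-coding the library's current defaults in a second place, which is the risk the comment describes for a properties file.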
[jira] [Created] (TIKA-1295) Make some Dublin Core items multi-valued
Tim Allison created TIKA-1295:
-
Summary: Make some Dublin Core items multi-valued
Key: TIKA-1295
URL: https://issues.apache.org/jira/browse/TIKA-1295
Project: Tika
Issue Type: Bug
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
Fix For: 1.6

According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, dc:title, dc:description and dc:rights should allow multiple values because of language alternatives. Unless anyone objects in the next few days, I'll switch those to Property.toInternalTextBag() from Property.toInternalText(). I'll also modify PDFParser to extract dc:rights.
[jira] [Commented] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
[ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995309#comment-13995309 ]

Tim Allison commented on TIKA-1278:
---
Makes sense. Thank you!
[jira] [Created] (TIKA-1296) Add case insensitive matching for text/html mime type
Phil Lester created TIKA-1296:
-
Summary: Add case insensitive matching for text/html mime type
Key: TIKA-1296
URL: https://issues.apache.org/jira/browse/TIKA-1296
Project: Tika
Issue Type: Improvement
Components: mime
Affects Versions: 1.5
Reporter: Phil Lester

Currently in tika-mimetypes.xml, the mime type text/html (and possibly others) lists the same match element several times in different letter cases so that varying HTML writing styles are matched. As of version 1.5 of Tika, these matches can be made case insensitive using the stringignorecase type. This would allow consolidation of some matches and improve detection of poorly-formed HTML that would be rendered by most browsers regardless of case.

For example:

<match value="&lt;BODY" type="string" offset="0"/>
<match value="&lt;body" type="string" offset="0"/>

could become:

<match value="&lt;BODY" type="stringignorecase" offset="0"/>
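As a rough illustration of what a single case-insensitive match buys over two case-sensitive ones, the sketch below compares a byte prefix against a pattern while ignoring ASCII letter case. It is not Tika's magic-matching code; the MagicMatch class and its method are hypothetical, and restricting the comparison to ASCII letters is an assumption about what stringignorecase is meant to cover.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: case-insensitive magic matching
// restricted to ASCII letters.
final class MagicMatch {

    private MagicMatch() {}

    // Returns true if data contains pattern at the given offset, treating
    // ASCII letters as equal regardless of case.
    static boolean matchesIgnoreCase(byte[] data, int offset, String pattern) {
        byte[] expected = pattern.getBytes(StandardCharsets.US_ASCII);
        if (offset < 0 || offset > data.length - expected.length) {
            return false; // pattern would run past the end of the data
        }
        for (int i = 0; i < expected.length; i++) {
            if (toLowerAscii(data[offset + i]) != toLowerAscii(expected[i])) {
                return false;
            }
        }
        return true;
    }

    // Maps 'A'..'Z' to 'a'..'z' and leaves every other byte unchanged.
    private static int toLowerAscii(byte b) {
        return (b >= 'A' && b <= 'Z') ? b + ('a' - 'A') : b;
    }
}
```

One pattern such as <BODY then also covers <body and mixed-case variants like <Body, which the pair of case-sensitive matches would miss.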
[jira] [Updated] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1294:
--
Attachment: TIKA-1294.patch

All feedback welcome.

Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
---
Key: TIKA-1294
URL: https://issues.apache.org/jira/browse/TIKA-1294
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Trivial
Attachments: TIKA-1294.patch

TIKA-1268 added the capability to extract embedded images as regular embedded resources...a great feature! However, for some use cases, it might not be desirable to extract those types of embedded resources. I see two ways of allowing the client to choose whether or not to extract those images:

1) set a value in the metadata for the extracted images that identifies them as embedded PDXObjectImages vs regular image attachments. The client can then choose not to process embedded resources with a given metadata value.
2) allow the client to set a parameter in the PDFConfig object.

My initial proposal is to go with option 2, and I'll attach a patch shortly.
Re: NetCDF to Maven Central
Thank you for your response John. I'm cc'ing the Tika dev list so we can work towards an understanding of how best to move forward.

Best,
Annie

On Wed, May 7, 2014 at 4:31 PM, John Caron ca...@unidata.ucar.edu wrote:

Hi Annie:

We find it difficult to keep maven central updated, and are maintaining our own maven server here:

https://artifacts.unidata.ucar.edu/content/repositories/unidata-releases/edu/ucar/

Is that sufficient for your project?

John

On 5/5/2014 12:41 PM, Annie Burgess wrote:

Hi John,

My name is Annie Burgess, I work with Chris Mattmann at JPL and USC. I'm working on a project that requires the latest version (4.3) of NetCDF to be available on Maven Central. I've submitted a support request for this issue on the Unidata site, but thought I'd also contact you. Do you know if it's possible to get 4.3 on Maven anytime soon? Any information you can give is greatly appreciated.

Best,
Annie

--
Ann Bryant Burgess, PhD
Postdoctoral Fellow
Computer Science Department
University of Southern California
Viterbi School of Engineering
Los Angeles, CA

Alaska Science Center/USGS
Anchorage, AK
Cell: (585) 738-7549
Office: (907) 786-7059
Fax: (907) 786-7150
E-mail: anniebryant.burg...@gmail.com
Office Address: 4210 University Dr., Anchorage, AK 99508-4626
[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995491#comment-13995491 ]

Tim Allison commented on TIKA-1294:
---
Great. Just to make sure that I understand correctly...I think I was going to head this route at one point. Can your MediaTypeDisablingDocumentSelector tell the difference between a jpeg that was attached to a PDF (basic attachment) and one that was derived from a PDXObjectImage?
[jira] [Commented] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
[ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995231#comment-13995231 ]

Tim Allison commented on TIKA-1278:
---
[~rgauss], thank you for adding these params and making this more extensible. Should we add default values for the new params to PDFParser.properties (parsers/src/main/resources/o/a/t/pdf/PDFParser.properties) so that they are loaded with init()? Also, should we add other parameters to configure()?
Re: NetCDF to Maven Central
Thanks Annie and John,

John: I recall when I got the first NetCDF 4.2-min release up to Central through Sonatype you guys were interested in maintaining that account and thus I transferred the perms to you. Would you like me to take it over again? I think I have the cycles to publish the releases to Central via Sonatype OSS if you'd like -- or Annie can also do it.

Thanks and let me know!

Cheers,
Chris

-Original Message-
From: Annie Burgess anniebry...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org, anniebryant.burg...@gmail.com anniebryant.burg...@gmail.com
Date: Monday, May 12, 2014 4:14 PM
To: John Caron ca...@unidata.ucar.edu
Cc: support-net...@unidata.ucar.edu support-net...@unidata.ucar.edu, dev@tika.apache.org dev@tika.apache.org
Subject: Re: NetCDF to Maven Central

Thank you for your response John. I'm cc'ing the Tika dev list so we can work towards an understanding of how best to move forward.

Best,
Annie
[jira] [Commented] (TIKA-1296) Add case insensitive matching for text/html mime type
[ https://issues.apache.org/jira/browse/TIKA-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995874#comment-13995874 ]

Phil Lester commented on TIKA-1296:
---
Hi Ken,

I think the option was added by TIKA-1146. I can't think of any good reason not to change them all -- it seems preferable to take that approach as there is always the possibility of someone accidentally changing the case on one or more of the tags.

Thanks.