date:20140512

[jira] [Commented] (TIKA-1291) Invalid JSON output on CLI

2014-05-12 Thread Steffen (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992634#comment-13992634
 ] 

Steffen commented on TIKA-1291:
---

Thanks for your reply. Unfortunately I can't debug the problem here, but I sent 
you an email with the image file to ease debugging on your side.

 Invalid JSON output on CLI
 --

 Key: TIKA-1291
 URL: https://issues.apache.org/jira/browse/TIKA-1291
 Project: Tika
  Issue Type: Bug
  Components: cli, metadata
Affects Versions: 1.4, 1.5
Reporter: Steffen

 Getting the metadata via the CLI from Tika with the output format set to JSON 
 sometimes gives invalid JSON. I only found float/array errors here in JIRA and thus 
 created this ticket with a new case.
 In my case the file that led to invalid JSON output was a PNG file (that I 
 unfortunately can't provide for testing):
 {noformat}
 { Application Record Version:4, 
 Component 1:Y component: Quantization table 0, Sampling factors 2 horiz/2 
 vert, 
 Component 2:Cb component: Quantization table 1, Sampling factors 1 horiz/1 
 vert, 
 Component 3:Cr component: Quantization table 1, Sampling factors 1 horiz/1 
 vert, 
 Compression Type:Baseline, 
 Content-Length:113081, 
 Content-Type:image/jpeg, 
 Data Precision:8 bits, 
 IPTC-NAA record:24 bytes binary data, 
 Image Height:479 pixels, 
 Image Width:671 pixels, 
 Number of Components:3, 
 Resolution Units:inch, 
 Unknown tag (0x02f0):35,0,556,479, 
 X Resolution:220 dots, 
 Y Resolution:220 dots, 
 resourceName:18, 
 tiff:BitsPerSample:8, 
 tiff:ImageLength:479, 
 tiff:ImageWidth:671 }
 {noformat}
 The {noformat}Unknown tag (0x02f0):35,0,556,479, {noformat} is invalid JSON.
 It would be nice if Tika always produced valid JSON output. For other 
 cases that might not be caught by the fixes for this ticket, it would be nice to 
 have a CLI argument/option that disables the output of certain (unknown?) 
 fields or allows giving a whitelist of field names to output. That way users 
 can bridge the time until new Tika releases by being more specific on the 
 shell. If that feature already exists, I apologize for not having found it 
 directly, and a hint to the CLI option would be nice.
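
 One way to read the report above: a value that itself contains commas is only safe in JSON if it is emitted as a quoted, escaped string. The following is a minimal sketch of that idea, not Tika's actual serializer; the class and method names (JsonMetadataSketch, quote, toJson) are hypothetical and chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JsonMetadataSketch {

    // Escape a value and wrap it in double quotes so it forms a valid JSON string.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters need a \\uXXXX escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Serialize every key and every value as a quoted string (hypothetical helper).
    static String toJson(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append(quote(e.getKey())).append(':').append(quote(e.getValue()));
            first = false;
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Unknown tag (0x02f0)", "35,0,556,479");
        metadata.put("Image Width", "671 pixels");
        System.out.println(toJson(metadata));
    }
}
```

 Treating every value as a string is the conservative choice here: it gives up numeric typing, but it cannot produce an unparseable document.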



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Closed] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates

2014-05-12 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison closed TIKA-1233.
-

Resolution: Fixed

After upgrade to PDFBOX-1.8.5, confirmed no longer any need for catch blocks 
for StringIndexOutOfBoundsException.  Catch blocks removed in r1593983.

 PDFBox can throw StringIndexOutOfBoundsException on some dates
 --

 Key: TIKA-1233
 URL: https://issues.apache.org/jira/browse/TIKA-1233
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Tim Allison
Priority: Trivial
  Labels: easyfix
 Fix For: 1.6


 PDFBOX's date parser can throw a StringIndexOutOfBoundsException if a date 
 string for parsing is empty or contains only spaces.  A few of my test pdfs 
 have this feature.
 Until PDFBOX-1803 is resolved, we can add an extra catch to prevent this from 
 causing problems in TIKA
 {noformat}
 @@ -171,6 +171,9 @@
  addMetadata(metadata, TikaCoreProperties.CREATED, 
 info.getCreationDate());
  } catch (IOException e) {
  // Invalid date format, just ignore
 +} catch (StringIndexOutOfBoundsException e){
 +//remove after PDFBOX-1883 is fixed
 +// Invalid date format, just ignore
  }
  try {
  Calendar modified = info.getModificationDate();
 @@ -178,6 +181,9 @@
  addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
  } catch (IOException e) {
  // Invalid date format, just ignore
 +} catch (StringIndexOutOfBoundsException e){
 +//remove after PDFBOX-1883 is fixed
 +// Invalid date format, just ignore
  }
 {noformat}
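
 The failure mode described above can be reproduced without PDFBox. The sketch below uses a hypothetical fragileParse method as a stand-in for a date parser that indexes into its input without checking the length; it is not PDFBox's code. The wrapper safeParse applies the same extra catch as the patch.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

public class DateGuardSketch {

    // Hypothetical stand-in for a parser that assumes a "D:YYYY..." layout
    // and never checks the length of its input.
    static Calendar fragileParse(String date) {
        // substring throws StringIndexOutOfBoundsException when the string is too short.
        int year = Integer.parseInt(date.substring(2, 6));
        return new GregorianCalendar(year, Calendar.JANUARY, 1);
    }

    // Same idea as the patch: treat the exception as "no usable date".
    static Calendar safeParse(String date) {
        try {
            return fragileParse(date.trim());
        } catch (StringIndexOutOfBoundsException e) {
            // Empty or whitespace-only date string, just ignore.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(safeParse("   "));
        System.out.println(safeParse("D:20140512").get(Calendar.YEAR));
    }
}
```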




[jira] [Commented] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates

2014-05-12 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995135#comment-13995135
 ] 

Tim Allison commented on TIKA-1233:
---

[~lfcnassif], please reopen if you are still finding problems on your test set 
with trunk.


[jira] [Closed] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception

2014-05-12 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison closed TIKA-1205.
-

Resolution: Won't Fix

I've decided that the complication in code is not worth the benefit for me.  If 
there is need in the community, please reopen.

 Allow PDFParser to fallback to other parser if there is an exception
 

 Key: TIKA-1205
 URL: https://issues.apache.org/jira/browse/TIKA-1205
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
 Fix For: 1.6


 With TIKA-1201, there is now an option to use PDFBox's NonSequentialPDFParser 
 instead of the traditional parser for parsing PDF files.  Following the 
 description in PDFBOX-1199, it would be useful to allow fallback to the 
 classic parser if NonSequentialPDFParser throws an IOException.  For the sake 
 of symmetry, I propose a boolean useParserFallbackOnException parameter.  If 
 this parameter is true, and if Tika's PDFParser is using the classic parser, 
 Tika will fall back to the NonSequentialPDFParser if there is an IOException; 
 if this parameter is true and if Tika's PDFParser is using the 
 NonSequentialPDFParser, it will fall back to the classic parser if there is an 
 IOException.
 Many thanks to Hong-Thai for championing the addition of the 
 NonSequentialPDFParser capability in TIKA-1201, and many thanks to Timo for 
 PDFBox's NonSequentialPDFParser (PDFBOX-1199)!
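
 The symmetric fallback proposed above could look roughly like the following. This is a sketch under assumptions: the DocParser interface and the parseWithFallback method are hypothetical, and only the useParserFallbackOnException name comes from the proposal.

```java
import java.io.IOException;

public class ParserFallbackSketch {

    // Hypothetical minimal parser abstraction, standing in for the two PDF parsers.
    interface DocParser {
        String parse(byte[] data) throws IOException;
    }

    // Try the configured parser first; on IOException, optionally try the other one.
    static String parseWithFallback(DocParser primary, DocParser secondary,
                                    boolean useParserFallbackOnException,
                                    byte[] data) throws IOException {
        try {
            return primary.parse(data);
        } catch (IOException e) {
            if (!useParserFallbackOnException) {
                throw e;
            }
            // If the secondary parser also fails, its exception propagates.
            return secondary.parse(data);
        }
    }

    public static void main(String[] args) throws IOException {
        DocParser failing = d -> { throw new IOException("primary failed"); };
        DocParser working = d -> "extracted text";
        System.out.println(parseWithFallback(failing, working, true, new byte[0]));
    }
}
```

 Because the caller decides which parser is primary, the same method covers both directions described in the proposal.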




[jira] [Closed] (TIKA-1283) Add thumbnail as possible metadata item to TikaCoreProperties

2014-05-12 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison closed TIKA-1283.
-

Resolution: Duplicate

I defer to the original design plan in TIKA-90.

 Add thumbnail as possible metadata item to TikaCoreProperties
 ---

 Key: TIKA-1283
 URL: https://issues.apache.org/jira/browse/TIKA-1283
 Project: Tika
  Issue Type: Improvement
  Components: metadata
Reporter: Tim Allison
Priority: Minor

 TIKA-90 originally requested to add thumbnails to a document's metadata.
 I'd like to have a unified way of determining whether an embedded 
 document/resource is a thumbnail or a regular attachment.
 With the changes in TIKA-1223 (ooxml) and TIKA-1010 (rtf), we are now pulling 
 out more thumbnails than before.
 I propose adding tika:thumbnail to the metadata of each thumbnail image.  
 The consumer can then determine what to do with the embedded resource based 
 on the metadata.




[jira] [Commented] (TIKA-1204) DWFX files detection

2014-05-12 Thread Nick Burch (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992896#comment-13992896
 ] 

Nick Burch commented on TIKA-1204:
--

Any chance of a much smaller sample DWFX file? The one supplied is a little 
larger than we generally like for unit testing against.

 DWFX files detection
 

 Key: TIKA-1204
 URL: https://issues.apache.org/jira/browse/TIKA-1204
 Project: Tika
  Issue Type: Improvement
  Components: detector, mime
Affects Versions: 1.4
Reporter: Marco Quaranta
Priority: Minor
 Attachments: General assembly filter.dwfx


 DWFX files are AutoCAD [Design Web 
 Format|http://en.wikipedia.org/wiki/Design_Web_Format] files and follow the [Open 
 Packaging 
 Conventions|http://en.wikipedia.org/wiki/Open_Packaging_Conventions]. 
 Tika correctly detects these files as application/zip. 
 It would be better if Tika could recognize the true mimetype: 
 model/vnd.dwfx+xps.
 Please add logic to ZipContainerDetector so that it can 
 detect DWFX. We need a method that behaves like detectOfficeOpenXML(OPCPackage 
 pkg): 
 {noformat}
 PackageRelationshipCollection core = 
 pkg.getRelationshipsByType("http://schemas.autodesk.com/dwfx/2007/relationships/documentsequence");
 if (core.size() != 1) {
  // Invalid DWFX Package received
  return null;
 }
 PackagePart corePart = pkg.getPart(core.getRelationship(0));
 String coreType = corePart.getContentType();
 return MediaType.parse(coreType);
 {noformat}
 Thank you,
 Marco




[jira] [Comment Edited] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-05-12 Thread Ray Gauss II (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995298#comment-13995298
 ] 

Ray Gauss II edited comment on TIKA-1278 at 5/12/14 5:39 PM:
-

Hi [~talli...@apache.org],

I thought about adding to {{PDFParser.properties}} but decided against it since 
PDFBox could change the default values or change the properties' scale or use, 
and if we weren't aware of that change we'd be inadvertently overriding those 
defaults.

Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work 
well for most people.

We can certainly reconsider setting those defaults and/or adding other config 
if there are particular parameters people would find useful.


was (Author: rgauss):
Hi [~tallison],

I thought about adding to {{PDFParser.properties}} but decided against it since 
PDFBox could change the default values or change the properties' scale or use, 
and if we weren't aware of that change we'd be inadvertently overriding those 
defaults.

Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work 
well for most people.

We can certainly reconsider setting those defaults and/or adding other config 
if there are particular parameters people would find useful.

 Expose PDF Avg Char and Spacing Tolerance Config Params
 ---

 Key: TIKA-1278
 URL: https://issues.apache.org/jira/browse/TIKA-1278
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Ray Gauss II
Assignee: Ray Gauss II
 Fix For: 1.6


 {{PDFParserConfig}} should allow for override of PDFBox's 
 {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
 comment in {{PDF2XHTML}}.
 Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
 slightly to allow for extension of that config class and its configuration 
 behavior.
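
 One way to expose the two tolerances without overriding the library's own defaults is to keep them unset until a caller supplies a value. The sketch below illustrates that pattern only; ToleranceConfigSketch and its nested Stripper class are hypothetical stand-ins, and the default numbers are placeholders rather than confirmed PDFBox values.

```java
public class ToleranceConfigSketch {

    // Hypothetical stand-in for the text stripper, which owns its own defaults.
    static class Stripper {
        float averageCharTolerance = 0.3f; // placeholder default
        float spacingTolerance = 0.5f;     // placeholder default
    }

    // null means "not configured": the stripper's default is left untouched.
    private Float averageCharTolerance;
    private Float spacingTolerance;

    void setAverageCharTolerance(Float value) {
        this.averageCharTolerance = value;
    }

    void setSpacingTolerance(Float value) {
        this.spacingTolerance = value;
    }

    // Apply only the values that were explicitly set; subclasses could extend this.
    void configure(Stripper stripper) {
        if (averageCharTolerance != null) {
            stripper.averageCharTolerance = averageCharTolerance;
        }
        if (spacingTolerance != null) {
            stripper.spacingTolerance = spacingTolerance;
        }
    }

    public static void main(String[] args) {
        ToleranceConfigSketch config = new ToleranceConfigSketch();
        config.setSpacingTolerance(0.8f);
        Stripper stripper = new Stripper();
        config.configure(stripper);
        System.out.println(stripper.averageCharTolerance + " " + stripper.spacingTolerance);
    }
}
```

 With this shape, a change to the upstream defaults is picked up automatically unless a user has deliberately overridden the value.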




[jira] [Created] (TIKA-1295) Make some Dublin Core items multi-valued

2014-05-12 Thread Tim Allison (JIRA)

Tim Allison created TIKA-1295:
-

 Summary: Make some Dublin Core items multi-valued
 Key: TIKA-1295
 URL: https://issues.apache.org/jira/browse/TIKA-1295
 Project: Tika
  Issue Type: Bug
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.6


According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, 
dc:title, dc:description and dc:rights should allow multiple values because of 
language alternatives.  Unless anyone objects in the next few days, I'll switch 
those to Property.toInternalTextBag() from Property.toInternalText().  I'll 
also modify PDFParser to extract dc:rights.
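
To illustrate why a bag-style property matters for language alternatives, here is a small sketch of a metadata store where a key can hold several values. It does not use Tika's Metadata or Property classes; MultiValuedMetadataSketch and its methods are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValuedMetadataSketch {

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Single-valued behaviour: a later value replaces the earlier one.
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    // Bag behaviour: every value is kept, e.g. one title per language.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    List<String> getValues(String name) {
        List<String> found = values.get(name);
        if (found == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(found);
    }

    public static void main(String[] args) {
        MultiValuedMetadataSketch metadata = new MultiValuedMetadataSketch();
        metadata.add("dc:title", "Annual report");
        metadata.add("dc:title", "Jahresbericht");
        System.out.println(metadata.getValues("dc:title"));
    }
}
```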




[jira] [Commented] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-05-12 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995309#comment-13995309
 ] 

Tim Allison commented on TIKA-1278:
---

Makes sense.  Thank you!


[jira] [Created] (TIKA-1296) Add case insensitive matching for text/html mime type

2014-05-12 Thread Phil Lester (JIRA)

Phil Lester created TIKA-1296:
-

 Summary: Add case insensitive matching for text/html mime type
 Key: TIKA-1296
 URL: https://issues.apache.org/jira/browse/TIKA-1296
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.5
Reporter: Phil Lester


Currently in tika-mimetypes.xml for the mime type text/html (and possibly 
others) matches in a couple different cases are provided for the elements so 
that varying HTML writing styles are matched. As of version 1.5 of Tika the 
ability exists to make these case insensitive using the stringignorecase 
type. This would allow consolidation of some matches and improve detection of 
poorly-formed HTML that would be rendered by most browsers regardless of case.

For example:
  <match value="&lt;BODY" type="string" offset="0"/>
  <match value="&lt;body" type="string" offset="0"/>

could become:
  <match value="&lt;BODY" type="stringignorecase" offset="0"/>
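
The difference between the two match types can be sketched as a byte comparison with and without ASCII case folding. This is an illustration of the idea only, not Tika's magic-matching code; IgnoreCaseMagicSketch and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class IgnoreCaseMagicSketch {

    // Fold ASCII upper-case letters to lower case; leave other bytes alone.
    static int toLowerAscii(int c) {
        return (c >= 'A' && c <= 'Z') ? c + 32 : c;
    }

    // Compare the pattern against the data at a fixed offset.
    static boolean matches(byte[] data, int offset, String pattern, boolean ignoreCase) {
        byte[] expected = pattern.getBytes(StandardCharsets.US_ASCII);
        if (offset < 0 || offset + expected.length > data.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            int actual = data[offset + i] & 0xFF;
            int wanted = expected[i] & 0xFF;
            if (ignoreCase) {
                actual = toLowerAscii(actual);
                wanted = toLowerAscii(wanted);
            }
            if (actual != wanted) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] html = "<BoDy>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(matches(html, 0, "<BODY", false));
        System.out.println(matches(html, 0, "<BODY", true));
    }
}
```

A mixed-case tag such as <BoDy> is missed by both exact-case rules but caught by a single case-insensitive rule, which is the consolidation the ticket asks for.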





[jira] [Updated] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

2014-05-12 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1294:
--

Attachment: TIKA-1294.patch

All feedback welcome.

 Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
 ---

 Key: TIKA-1294
 URL: https://issues.apache.org/jira/browse/TIKA-1294
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Trivial
 Attachments: TIKA-1294.patch


 TIKA-1268 added the capability to extract embedded images as regular embedded 
 resources...a great feature!
 However, for some use cases, it might not be desirable to extract those types 
 of embedded resources.  I see two ways of allowing the client to choose 
 whether or not to extract those images:
 1) set a value in the metadata for the extracted images that identifies them 
 as embedded PDXObjectImages vs regular image attachments.  The client can 
 then choose not to process embedded resources with a given metadata value.
 2) allow the client to set a parameter in the PDFConfig object.
 My initial proposal is to go with option 2, and I'll attach a patch shortly.
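
 Option 2 amounts to a boolean switch that the extraction loop consults before emitting an inline image. The sketch below shows that shape under assumptions: the Config class, its extractInlineImages flag, and the EmbeddedResource type are hypothetical and are not taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InlineImageToggleSketch {

    // Hypothetical embedded resource: either a regular attachment or an inline image.
    static class EmbeddedResource {
        final String name;
        final boolean inlineImage;

        EmbeddedResource(String name, boolean inlineImage) {
            this.name = name;
            this.inlineImage = inlineImage;
        }
    }

    // Hypothetical parser configuration with a single switch.
    static class Config {
        boolean extractInlineImages = true;
    }

    // Regular attachments are always returned; inline images only if enabled.
    static List<String> extract(List<EmbeddedResource> resources, Config config) {
        List<String> extracted = new ArrayList<>();
        for (EmbeddedResource resource : resources) {
            if (resource.inlineImage && !config.extractInlineImages) {
                continue;
            }
            extracted.add(resource.name);
        }
        return extracted;
    }

    public static void main(String[] args) {
        List<EmbeddedResource> resources = Arrays.asList(
                new EmbeddedResource("attached.jpg", false),
                new EmbeddedResource("page1-image0.jpg", true));
        Config config = new Config();
        config.extractInlineImages = false;
        System.out.println(extract(resources, config));
    }
}
```

 The inlineImage field plays the role of the metadata marker from option 1, so the two options are not mutually exclusive.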




Re: NetCDF to Maven Central

2014-05-12 Thread Annie Burgess

Thank you for your response John.  I'm cc'ing the Tika dev list so we can
work towards an understanding of how best to move forward.

Best,
Annie


On Wed, May 7, 2014 at 4:31 PM, John Caron ca...@unidata.ucar.edu wrote:

 Hi Annie:

 We find it difficult to keep Maven Central updated, and are maintaining
 our own Maven server here:

 https://artifacts.unidata.ucar.edu/content/repositories/unidata-releases/edu/ucar/

 Is that sufficient for your project?

 John



 On 5/5/2014 12:41 PM, Annie Burgess wrote:

 Hi John,

 My name is Annie Burgess; I work with Chris Mattmann at JPL and USC.
 I'm working on a project that requires the latest version (4.3) of
 NetCDF to be available on Maven Central. I've submitted a support
 request for this issue on the Unidata site, but thought I'd also contact
 you.

 Do you know if it's possible to get 4.3 on Maven anytime soon?

 Any information you can give is greatly appreciated.

 Best,
 Annie


 --
 
 --
 Ann Bryant Burgess, PhD

 Postdoctoral Fellow
 Computer Science Department
 University of Southern California
 Viterbi School of Engineering
 Los Angeles, CA

 Alaska Science Center/USGS
 Anchorage, AK

 Cell: (585) 738-7549
 Office: (907) 786-7059
 Fax: (907) 786-7150
 E-mail: anniebryant.burg...@gmail.com

 Office Address: 4210 University Dr., Anchorage, AK 99508-4626
 
 ---




-- 
--
Ann Bryant Burgess, PhD

Postdoctoral Fellow
Computer Science Department
University of Southern California
Viterbi School of Engineering
Los Angeles, CA

Alaska Science Center/USGS
Anchorage, AK

Cell:  (585) 738-7549
Office:  (907) 786-7059
Fax:  (907) 786-7150
E-mail: anniebryant.burg...@gmail.com
Office Address: 4210 University Dr., Anchorage, AK 99508-4626
---

[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

2014-05-12 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995491#comment-13995491
 ] 

Tim Allison commented on TIKA-1294:
---

Great. Just to make sure that I understand correctly...I think I was going to 
head this route at one point.  Can your MediaTypeDisablingDocumentSelector tell 
the difference between a jpeg that was attached to a PDF (basic attachment) and 
one that was derived from a PDXObjectImage?


[jira] [Commented] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-05-12 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995231#comment-13995231
 ] 

Tim Allison commented on TIKA-1278:
---

[~rgauss], thank you for adding these params and making this more extensible.  
Should we add default values for the new params to PDFParser.properties 
(parsers/src/main/resources/o/a/t/pdf/PDFParser.properties) so that they are 
loaded with init()?  Also, should we add other parameters to configure()?


Re: NetCDF to Maven Central

2014-05-12 Thread Mattmann, Chris A (3980)

Thanks Annie and John,

John: I recall when I got the first NetCDF 4.2-min release up to Central
through Sonatype you guys were interested in maintaining that account and
thus I transferred the perms to you. Would you like me to take it over
again? I think I have the cycles to publish the releases to Central via
Sonatype OSS if you'd like -- or Annie can also do it.

Thanks and let me know!

Cheers,
Chris



-Original Message-
From: Annie Burgess anniebry...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org,
anniebryant.burg...@gmail.com anniebryant.burg...@gmail.com
Date: Monday, May 12, 2014 4:14 PM
To: John Caron ca...@unidata.ucar.edu
Cc: support-net...@unidata.ucar.edu support-net...@unidata.ucar.edu,
dev@tika.apache.org dev@tika.apache.org
Subject: Re: NetCDF to Maven Central

Thank you for your response John.  I'm cc'ing the Tika dev list so we can
work towards an understanding of how best to move forward.

Best,
Annie


On Wed, May 7, 2014 at 4:31 PM, John Caron ca...@unidata.ucar.edu wrote:

 Hi Annie:

 We find it difficult to keep Maven Central updated, and are maintaining
 our own Maven server here:

 https://artifacts.unidata.ucar.edu/content/repositories/unidata-releases/edu/ucar/

 Is that sufficient for your project?

 John



 On 5/5/2014 12:41 PM, Annie Burgess wrote:

 Hi John,

 My name is Annie Burgess; I work with Chris Mattmann at JPL and USC.
 I'm working on a project that requires the latest version (4.3) of
 NetCDF to be available on Maven Central. I've submitted a support
 request for this issue on the Unidata site, but thought I'd also
 contact you.

 Do you know if it's possible to get 4.3 on Maven anytime soon?

 Any information you can give is greatly appreciated.

 Best,
 Annie


 --
 
 --
 Ann Bryant Burgess, PhD

 Postdoctoral Fellow
 Computer Science Department
 University of Southern California
 Viterbi School of Engineering
 Los Angeles, CA

 Alaska Science Center/USGS
 Anchorage, AK

 Cell: (585) 738-7549
 Office: (907) 786-7059
 Fax: (907) 786-7150
 E-mail: anniebryant.burg...@gmail.com

 Office Address: 4210 University Dr., Anchorage, AK 99508-4626
 
 ---




-- 
--

Ann Bryant Burgess, PhD

Postdoctoral Fellow
Computer Science Department
University of Southern California
Viterbi School of Engineering
Los Angeles, CA

Alaska Science Center/USGS
Anchorage, AK

Cell:  (585) 738-7549
Office:  (907) 786-7059
Fax:  (907) 786-7150
E-mail: anniebryant.burg...@gmail.com
Office Address: 4210 University Dr., Anchorage, AK 99508-4626
--
-

[jira] [Commented] (TIKA-1296) Add case insensitive matching for text/html mime type

2014-05-12 Thread Phil Lester (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995874#comment-13995874
 ] 

Phil Lester commented on TIKA-1296:
---

Hi Ken,
I think the option was added by TIKA-1146. I can't think of any good reason not 
to change them all -- it seems preferable to take that approach as there is 
always the possibility of someone accidentally changing the case on one or more 
of the tags. Thanks.

