[jira] [Commented] (TIKA-1191) ForkParser / ClassLoaderProxy does not define package

2018-01-09 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319839#comment-16319839
 ] 

Nick Burch commented on TIKA-1191:
--

[~talli...@mitre.org] I'm minded to apply Ben Romberg's patch from pull 
request #215. Any thoughts, comments, or objections?

> ForkParser / ClassLoaderProxy does not define package
> -
>
> Key: TIKA-1191
> URL: https://issues.apache.org/jira/browse/TIKA-1191
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4, 1.5
>Reporter: Nicolas Belisle
> Attachments: ClassLoaderProxy.java.patch, Test.java, test.eml
>
>
> ForkParser will throw an exception in some cases:
> org.apache.tika.exception.TikaException: Invalid embedded resource
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:189)
>   at 
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.tika.fork.ForkServer.call(ForkServer.java:144)
>   at org.apache.tika.fork.ForkServer.processRequests(ForkServer.java:124)
>   at org.apache.tika.fork.ForkServer.main(ForkServer.java:69)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:136)
>   at 
> org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:499)
>   at 
> org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:60)
>   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:169)
>   at 
> org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:268)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.getTikaConfig(AbstractPOIFSExtractor.java:72)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.getDetector(AbstractPOIFSExtractor.java:79)
>   at 
> org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:176)
>   ... 10 more
> A patch will follow



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TIKA-2545) RereadableInputStream backing byte array not constructed properly

2018-01-09 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319773#comment-16319773
 ] 

Nick Burch commented on TIKA-2545:
--

Are you able to produce a short JUnit test that demonstrates the problem that 
your pull request (https://github.com/apache/tika/pull/217) fixes?

I'm also wondering whether we need to reset size to zero on a rewind, and a 
unit test seems a good way to check that too!

> RereadableInputStream backing byte array not constructed properly
> -
>
> Key: TIKA-2545
> URL: https://issues.apache.org/jira/browse/TIKA-2545
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Reporter: Eugene Hart
>Priority: Minor
>
> For original input streams smaller than the buffer size, the code should 
> create the ByteArrayInputStream with bounds determined by the size of the 
> original input, rather than passing in the entire buffer.





[jira] [Created] (TIKA-2545) RereadableInputStream backing byte array not constructed properly

2018-01-09 Thread Eugene Hart (JIRA)
Eugene Hart created TIKA-2545:
-

 Summary: RereadableInputStream backing byte array not constructed 
properly
 Key: TIKA-2545
 URL: https://issues.apache.org/jira/browse/TIKA-2545
 Project: Tika
  Issue Type: Bug
  Components: core
Reporter: Eugene Hart
Priority: Minor


For original input streams smaller than the buffer size, the code should 
create the ByteArrayInputStream with bounds determined by the size of the 
original input, rather than passing in the entire buffer.
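
A minimal sketch of the difference described above, using only java.io. This 
is not Tika's RereadableInputStream code; the class and method names 
(BoundedBufferDemo, fill, drain) are hypothetical and exist only for 
illustration. It assumes a fixed-size buffer that is only partly filled by a 
short input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BoundedBufferDemo {

    /** Reads up to buffer.length bytes from in; returns how many bytes were actually read. */
    public static int fill(InputStream in, byte[] buffer) throws IOException {
        int size = 0;
        int n;
        while (size < buffer.length
                && (n = in.read(buffer, size, buffer.length - size)) != -1) {
            size += n;
        }
        return size;
    }

    /** Counts the bytes that remain to be read from a stream. */
    public static int drain(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] original = {1, 2, 3};   // input shorter than the buffer
        byte[] buffer = new byte[16];  // hypothetical buffer size
        int size = fill(new ByteArrayInputStream(original), buffer);

        // Wrapping the whole array replays the unused tail of the buffer too.
        System.out.println(drain(new ByteArrayInputStream(buffer)));          // prints 16

        // Bounding the stream by the number of bytes read replays only the input.
        System.out.println(drain(new ByteArrayInputStream(buffer, 0, size))); // prints 3
    }
}
```

The three-argument ByteArrayInputStream(byte[], int, int) constructor is the 
standard way to expose only the filled part of a larger array.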





[jira] [Updated] (TIKA-2196) IllegalArgumentException on a valid Excel file

2018-01-09 Thread Vinay Kawade (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay Kawade updated TIKA-2196:
---
Attachment: 1.xls

Sample file with only one sheet and 2 cells populated for testing.

> IllegalArgumentException on a valid Excel file
> --
>
> Key: TIKA-2196
> URL: https://issues.apache.org/jira/browse/TIKA-2196
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: 1.xls, 2007 Experiment watch.xls
>
>
> On the attached Excel file, which opens fine in Excel, Tika throws the 
> following error:
> java.lang.IllegalArgumentException: Cannot format given Object as a Number
>   at java.text.DecimalFormat.format:-1
>   at org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat.format:67
>   at java.text.Format.format:-1
>   at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736
>   at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804
>   at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785
>   at 
> org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:405
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336
>   at 
> org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92
>   at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109
>   at 
> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179
>   at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
>   at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:177
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130





[jira] [Commented] (TIKA-2196) IllegalArgumentException on a valid Excel file

2018-01-09 Thread Vinay Kawade (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319016#comment-16319016
 ] 

Vinay Kawade commented on TIKA-2196:


This seems to happen when a cell is set to a custom format containing double 
quotes, for example: 
{code:java}
""ddd,mmm dd
or
"",  dd, 
{code}

As per https://bz.apache.org/bugzilla/show_bug.cgi?id=54786, 
the pair of double quotes is replaced by one single quote.
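
For reference, the exception message in the trace can be reproduced with the 
JDK alone. The sketch below only shows the JDK-level failure mode: a 
DecimalFormat that is handed a non-Number (here a Date) through 
Format.format(Object). Whether the quoted custom format is what steers POI 
into that path is the hypothesis above, not something this example proves. 
The class and method names are hypothetical.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.Format;
import java.util.Date;
import java.util.Locale;

public class NumberFormatGivenDateDemo {

    /** Formats value with the given Format, or reports the IllegalArgumentException it throws. */
    public static String tryFormat(Format format, Object value) {
        try {
            return format.format(value);
        } catch (IllegalArgumentException e) {
            return "IllegalArgumentException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A plain number format; Locale.ROOT keeps the digits locale-independent.
        Format numberFormat =
                new DecimalFormat("0", DecimalFormatSymbols.getInstance(Locale.ROOT));

        // A Number is formatted normally.
        System.out.println(tryFormat(numberFormat, 42));

        // A Date is not a Number, so DecimalFormat rejects it.
        System.out.println(tryFormat(numberFormat, new Date(0L)));
    }
}
```

The second call fails with the same "Cannot format given Object as a Number" 
message that appears at the top of the quoted stack trace.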


> IllegalArgumentException on a valid Excel file
> --
>
> Key: TIKA-2196
> URL: https://issues.apache.org/jira/browse/TIKA-2196
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
> Environment: Windows 7 x64, JVM 1.8.0_101
>Reporter: Seva Alekseyev
> Attachments: 2007 Experiment watch.xls
>
>
> On the attached Excel file, which opens fine in Excel, Tika throws the 
> following error:
> java.lang.IllegalArgumentException: Cannot format given Object as a Number
>   at java.text.DecimalFormat.format:-1
>   at org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat.format:67
>   at java.text.Format.format:-1
>   at org.apache.poi.ss.usermodel.DataFormatter.performDateFormatting:736
>   at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:804
>   at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents:785
>   at 
> org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell:143
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell:633
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord:405
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord:336
>   at 
> org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord:92
>   at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord:109
>   at 
> org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents:179
>   at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents:136
>   at 
> org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile:312
>   at org.apache.tika.parser.microsoft.ExcelExtractor.parse:169
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:177
>   at org.apache.tika.parser.microsoft.OfficeParser.parse:130


