[jira] [Comment Edited] (TIKA-3154) Exception while extracting msg files

2020-08-19 Thread Akash (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180782#comment-17180782
 ] 

Akash edited comment on TIKA-3154 at 8/19/20, 7:57 PM:
---

Tried with the config below. Still the same error. It seems the property is not considered.
{code:java}
/

  

  


  
5000
  

  

/ 
{code}

According to
https://github.com/apache/tika/blob/7f0394247c8f5a731b258adbd6683449bc5c757b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
there is no variable named byteArrayMaxOverride.


was (Author: akki1607):
Tried with the config below. Still the same error. It seems the property is not considered.
{code:java}
/

  

  


  
5000
  

  

/ 
{code}

> Exception while extracting msg files
> 
>
> Key: TIKA-3154
> URL: https://issues.apache.org/jira/browse/TIKA-3154
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.24.1
> Reporter: Akash
> Priority: Major
>
> While parsing an msg file that contains some HTML text, we get an 
> exception from Tika.
> Command: java -jar tika-app-1.24.1.jar html_code.msg
> Exception:
> {code:java}
> Aug 07, 2020 10:59:00 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@7fcf2fc1
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:209)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:496)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 1326748, but 100 is the maximum for this record type.
> If the file is not corrupt, please open an issue on bugzilla to request increasing the maximum allowable size for this record type.
> As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
>   at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:630)
>   at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:208)
>   at org.apache.poi.util.IOUtils.safelyAllocateCheck(IOUtils.java:610)
>   at org.apache.poi.util.IOUtils.safelyAllocate(IOUtils.java:596)
>   at org.apache.poi.hmef.attribute.MAPIRtfAttribute.<init>(MAPIRtfAttribute.java:49)
>   at org.apache.tika.parser.microsoft.OutlookExtractor.handleBodyChunks(OutlookExtractor.java:328)
>   at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:247)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (TIKA-3154) Exception while extracting msg files

2020-08-19 Thread Akash (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180782#comment-17180782
 ] 

Akash edited comment on TIKA-3154 at 8/19/20, 7:56 PM:
---

Tried with the config below. Still the same error. It seems the property is not considered.
{code:java}
/

  

  


  
5000
  

  

/ 
{code}


was (Author: akki1607):
Tried with the config below. It did not help.
{code:java}
/

  

  


  
5000
  

  

/ 
{code}



[jira] [Comment Edited] (TIKA-3154) Exception while extracting msg files

2020-08-17 Thread Akash (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178797#comment-17178797
 ] 

Akash edited comment on TIKA-3154 at 8/17/20, 7:33 AM:
---

[~tallison] Can we make this a configuration parameter rather than 
hard-coding it?

POI does provide an API to override this value.

_As a temporary workaround, consider setting a higher override value with 
IOUtils.setByteArrayMaxOverride()_

If that API can be invoked from Tika code, then we can set the value as required.
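
To make the request concrete, here is a minimal, self-contained sketch of the kind of guard the exception message describes: a per-record maximum that a global override can replace. It is a simplified model for illustration only, not POI's source. The class name AllocationGuard and its methods are hypothetical stand-ins; the real call named in the exception message is IOUtils.setByteArrayMaxOverride().

```java
/**
 * Hypothetical, simplified model of an allocation guard with a global
 * override. Illustration only; this is not POI's implementation.
 */
final class AllocationGuard {
    // A value <= 0 means "no override": the per-record limit applies.
    private static int maxOverride = -1;

    /** Stand-in for the override setter mentioned in the exception text. */
    static void setMaxOverride(int value) {
        maxOverride = value;
    }

    /** Allocates the array only if the length is within the active limit. */
    static byte[] safelyAllocate(long length, int recordMax) {
        int limit = maxOverride > 0 ? maxOverride : recordMax;
        if (length < 0 || length > limit) {
            throw new IllegalStateException("Tried to allocate an array of length "
                    + length + ", but " + limit
                    + " is the maximum for this record type.");
        }
        return new byte[(int) length];
    }

    public static void main(String[] args) {
        // Values taken from the stack trace: length 1326748, record maximum 100.
        try {
            safelyAllocate(1_326_748L, 100);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // With a sufficiently large override the same request succeeds.
        setMaxOverride(2_000_000);
        System.out.println("allocated " + safelyAllocate(1_326_748L, 100).length + " bytes");
    }
}
```

In this model the override only helps when it is at least as large as the requested length, so a value such as 5000 would still be rejected for a 1326748-byte request. Whether Tika exposes such a setting, and under which name, is the open question of this issue.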


was (Author: akki1607):
[~tallison] Can we make this a configuration parameter rather than 
hard-coding it?

POI does provide an API to override this value.

_As a temporary workaround, consider setting a higher override value with 
IOUtils.setByteArrayMaxOverride()_

If that API can be invoked from Tika code, then we set the value as required.



[jira] [Comment Edited] (TIKA-3154) Exception while extracting msg files

2020-08-17 Thread Akash (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178797#comment-17178797
 ] 

Akash edited comment on TIKA-3154 at 8/17/20, 7:33 AM:
---

[~tallison] Can we make this a configuration parameter rather than 
hard-coding it?

POI does provide an API to override this value. If that API can be invoked 
from Tika code, then we set the value as required.


was (Author: akki1607):
[~tallison] Can we make this a configuration parameter rather than 
hard-coding it?



[jira] [Comment Edited] (TIKA-3154) Exception while extracting msg files

2020-08-17 Thread Akash (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178797#comment-17178797
 ] 

Akash edited comment on TIKA-3154 at 8/17/20, 7:33 AM:
---

[~tallison] Can we make this a configuration parameter rather than 
hard-coding it?

POI does provide an API to override this value.

_As a temporary workaround, consider setting a higher override value with 
IOUtils.setByteArrayMaxOverride()_

If that API can be invoked from Tika code, then we set the value as required.


was (Author: akki1607):
[~tallison] Can we make this a configuration parameter rather than 
hard-coding it?

POI does provide an API to override this value. If that API can be invoked 
from Tika code, then we set the value as required.
