[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-08 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824874#comment-17824874
 ] 

Nick Burch commented on TIKA-4208:
--

How much heap size do you have allocated?

The error suggests that Tika managed to decode the string in the SAS data file, 
but ran out of memory passing the string through the content handler stack to 
plain text. Generally things break at the decode step if they're going to, 
rather than the output!

> OOM error in SAS7BDATParser
> ---
>
> Key: TIKA-4208
> URL: https://issues.apache.org/jira/browse/TIKA-4208
> Project: Tika
> Issue Type: Bug
> Affects Versions: 3.0.0-BETA
> Reporter: Gregory Lepore
> Priority: Minor
>
> For this ARC file:
> [https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-000/warc/NARA-PEOT-2004-20041019023240-02598-crawling008-c_NARA-PEOT-2004-20041019053819-01693-crawling007.archive.org.arc.gz]
> I'm getting an OOM error:
> Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>    at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
>    at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
>    at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
>    at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
>    at java.base/java.io.StringWriter.write(StringWriter.java:99)
>    at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:96)
>    at org.apache.tika.sax.ToXMLContentHandler.writeEscaped(ToXMLContentHandler.java:229)
>    at org.apache.tika.sax.ToXMLContentHandler.characters(ToXMLContentHandler.java:154)
>    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>    at org.apache.tika.parser.RecursiveParserWrapper$RecursivelySecureContentHandler.characters(RecursiveParserWrapper.java:370)
>    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
>    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
>    at org.apache.tika.sax.SafeContentHandler.access$101(SafeContentHandler.java:47)
>    at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>    at org.apache.tika.sax.SafeContentHandler$$Lambda$327/0x7f94a022d1a8.write(Unknown Source)
>    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
>    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>    at org.apache.tika.parser.sas.SAS7BDATParser.parse(SAS7BDATParser.java:146)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
>    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153)
>    at org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
>    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
>    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
>    at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:455)
> when extracting JSON with both the app and server version of 3.0.0 BETA.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-08 Thread Gregory Lepore (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824879#comment-17824879
 ] 

Gregory Lepore commented on TIKA-4208:
--

java -Xmx4G -Xms4G -jar ../tika.jar file.arc.gz 
 
works, but
 
java -Xmx4G -Xms4G -jar ../tika.jar -J file.arc.gz
 
throws the error, as do all higher values for Xmx and Xms (up to 32 GB each) 
when used in conjunction with JSON output.
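
For context on why raising -Xmx may not change the outcome: "Requested array size exceeds VM limit" concerns the length of a single array rather than the total heap. Java array lengths are ints, so one backing array (such as the one behind the StringBuffer in the trace) cannot hold more than Integer.MAX_VALUE elements, however large the heap is. The following is a minimal sketch of that bound, not code from Tika or the JDK; the class name, the doubling growth policy and the helper methods are assumptions made for illustration.

```java
class ArrayLimitSketch {
    // Array lengths are ints, so no single array can hold more elements than this.
    // (Real JVMs may cap slightly lower; that detail is not modelled here.)
    static final long MAX_ARRAY_LENGTH = Integer.MAX_VALUE;

    // Hypothetical growth policy: keep doubling until the buffer can hold `needed` chars.
    static long grownCapacity(long current, long needed) {
        long capacity = Math.max(current, 16);
        while (capacity < needed) {
            capacity *= 2;
        }
        return capacity;
    }

    // True if a buffer of this length could live in one Java array.
    static boolean fitsInOneArray(long length) {
        return length >= 0 && length <= MAX_ARRAY_LENGTH;
    }

    public static void main(String[] args) {
        long[] wanted = {1L << 20, 1L << 30, 3L << 30};
        for (long needed : wanted) {
            long capacity = grownCapacity(16, needed);
            System.out.println(needed + " chars -> capacity " + capacity
                    + ", fits in one array: " + fitsInOneArray(capacity));
        }
    }
}
```

Under these assumptions, a buffer that has to grow past roughly two billion characters fails at the array request itself, which would be consistent with the error persisting at 32 GB of heap.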



[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-09 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824965#comment-17824965
 ] 

Nick Burch commented on TIKA-4208:
--

I would expect the JSON output version to need a bit more memory, as 
we have to hold all the content in memory before outputting, instead of just 
streaming the text/HTML out as we go along. I wouldn't expect it to be 4 GB vs 
32 GB though!

Any ideas anyone? Is it possible we've got an extra layer (or 2?) of buffering 
above and beyond what we need for the {{-J}} option?
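
The difference between streaming output and holding everything before writing can be sketched as follows. This is an illustrative stand-in using only the JDK, not Tika's actual handler code; the class, the CountingWriter sink and both methods are hypothetical names.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

class BufferVsStreamSketch {
    /** Counts characters and discards them; stands in for stdout or a file. */
    static final class CountingWriter extends Writer {
        long written;

        @Override
        public void write(char[] buf, int off, int len) {
            written += len;
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    // Streaming: each chunk goes straight to the sink, so at most one chunk is held.
    // Returns the largest number of characters held at any one time.
    static long stream(Iterable<String> chunks, Writer sink) throws IOException {
        long peakHeld = 0;
        for (String chunk : chunks) {
            sink.write(chunk);
            peakHeld = Math.max(peakHeld, chunk.length());
        }
        return peakHeld;
    }

    // Buffering: every chunk is collected first, so the whole document is held
    // in memory (here in a StringWriter) before anything reaches the sink.
    static long bufferThenWrite(Iterable<String> chunks, Writer sink) throws IOException {
        StringWriter buffer = new StringWriter();
        for (String chunk : chunks) {
            buffer.write(chunk);
        }
        String whole = buffer.toString();
        sink.write(whole);
        return whole.length();
    }
}
```

Both methods deliver the same characters to the sink; only the amount held at once differs, and in the buffered case that amount grows with the size of the extracted content.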



[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-11 Thread Gregory Lepore (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825384#comment-17825384
 ] 

Gregory Lepore commented on TIKA-4208:
--

I extracted all files from the ARC file and went through the sas7bdat files one 
by one. All processed correctly with the JSON option except the attached, which 
threw the above error.

 

Hopefully that will help others to figure out what's going on. Thanks!

> OOM error in SAS7BDATParser
> ---
>
> Key: TIKA-4208
> URL: https://issues.apache.org/jira/browse/TIKA-4208
> Project: Tika
> Issue Type: Bug
> Affects Versions: 3.0.0-BETA
> Reporter: Gregory Lepore
> Priority: Minor
> Attachments: table23.sas7bdat.zip
>
>


[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-11 Thread Gregory Lepore (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825394#comment-17825394
 ] 

Gregory Lepore commented on TIKA-4208:
--

Actually, processing the file separately yields a slightly different error 
message:

java -Xmx4G -Xms4G -jar ../../tika.jar -J table23.sas7bdat
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
   at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
   at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
   at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
   at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
   at java.base/java.io.StringWriter.write(StringWriter.java:99)
   at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:96)
   at org.apache.tika.sax.ToXMLContentHandler.write(ToXMLContentHandler.java:171)
   at org.apache.tika.sax.ToXMLContentHandler.lazyCloseStartElement(ToXMLContentHandler.java:159)
   at org.apache.tika.sax.ToXMLContentHandler.characters(ToXMLContentHandler.java:153)
   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
   at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
   at org.apache.tika.parser.RecursiveParserWrapper$RecursivelySecureContentHandler.characters(RecursiveParserWrapper.java:370)
   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
   at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
   at org.apache.tika.sax.SafeContentHandler.access$101(SafeContentHandler.java:47)
   at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
   at org.apache.tika.sax.SafeContentHandler$$Lambda$325/0x7f587822f2f8.write(Unknown Source)
   at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
   at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
   at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
   at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
   at org.apache.tika.parser.sas.SAS7BDATParser.parse(SAS7BDATParser.java:146)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
   at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
   at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:518)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:489)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:256)


[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827507#comment-17827507
 ] 

Tim Allison commented on TIKA-4208:
---

I think you've just run into a monster of a sas7bdat file. I'm able to run 
{{java -Xmx6g -jar tika-app-xyz.jar -J -t}} on the file successfully. The 
resulting JSON is 2 GB -- there's a lot of compression in the sas7bdat file 
because most of the values are 0.  The metadata says that it has 685 "pages" 
(tables?), 344772 rows and 2120 columns.  With recursive json, there's not much 
of an option but to bump memory, limit the amount that you write to the handler 
or punt on the file altogether.
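
As a rough size check using only the row and column counts quoted above, the number of cells alone already approaches the limits of a single in-memory string. The sketch below is plain arithmetic, not a measurement of Tika's output; the characters-per-cell figures (a one-digit value plus one or two separator characters) are assumptions chosen for illustration.

```java
class CellCountSketch {
    // Total number of cells in a table; multiplyExact guards against silent overflow.
    static long cells(long rows, long cols) {
        return Math.multiplyExact(rows, cols);
    }

    // Characters needed if every cell takes `charsPerCell` characters (an assumption).
    static long charsNeeded(long cellCount, int charsPerCell) {
        return Math.multiplyExact(cellCount, (long) charsPerCell);
    }

    public static void main(String[] args) {
        long rows = 344_772L; // from the metadata quoted in this thread
        long cols = 2_120L;   // from the metadata quoted in this thread
        long cellCount = cells(rows, cols);
        System.out.println("cells: " + cellCount);
        for (int perCell = 2; perCell <= 3; perCell++) {
            long chars = charsNeeded(cellCount, perCell);
            System.out.println(perCell + " chars/cell -> " + chars + " chars, exceeds int range: "
                    + (chars > Integer.MAX_VALUE));
        }
    }
}
```

With 730,916,640 cells, two characters per cell still fits within the int range, while three characters per cell does not, so a table of this shape sits right at the boundary of what one Java string buffer can hold.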

If you can find incorrect recursion or incorrect duplication of data or 
something wrong with what Tika is doing, please let us know.

Separately, for these "package" files like arcs, if you can't process them all 
in memory, you may need to run an initial unraveling step to extract the 
embedded files, along the lines of {{java -jar tika-app.xyz.jar -z input.arc}}.
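
One of the options mentioned above, limiting the amount written to the handler, can be sketched with a plain SAX handler. Tika ships its own write-limit support (for example the `BodyContentHandler(int writeLimit)` constructor); the class below is a hypothetical JDK-only stand-in that shows the idea, not that implementation.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WriteLimitSketch {
    /** Hypothetical handler: keeps at most `limit` characters, then aborts the parse. */
    static final class LimitedTextHandler extends DefaultHandler {
        private final int limit;
        private final StringBuilder text = new StringBuilder();

        LimitedTextHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - text.length();
            if (length > room) {
                // Keep what still fits, then signal the caller to stop parsing.
                text.append(ch, start, Math.max(room, 0));
                throw new SAXException("write limit of " + limit + " characters reached");
            }
            text.append(ch, start, length);
        }

        String text() {
            return text.toString();
        }
    }

    public static void main(String[] args) {
        LimitedTextHandler handler = new LimitedTextHandler(10);
        char[] row = "0\t0\t0\t0\n".toCharArray(); // made-up table row, 8 characters
        try {
            for (int i = 0; i < 1000; i++) {
                handler.characters(row, 0, row.length);
            }
        } catch (SAXException e) {
            System.out.println("stopped early: " + e.getMessage());
        }
        System.out.println(handler.text().length() + " characters kept");
    }
}
```

The trade-off is that the output is truncated: memory stays bounded, but content past the limit is dropped and the caller has to treat the exception as an expected stop rather than a parse failure.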


[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827508#comment-17827508
 ] 

Tim Allison commented on TIKA-4208:
---

File 835.zip (as renamed by Tika) is 141 MB, which yields 0.5 of sas7bdat 
tables, which themselves contain compressed data.



[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-15 Thread Gregory Lepore (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827512#comment-17827512
 ] 

Gregory Lepore commented on TIKA-4208:
--

Hmm, here's what I get:


java -Xmx6g -jar ../tika.jar -J -t table23.sas7bdat 
 
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.Arrays.copyOfRange(Arrays.java:4030)
        at java.base/java.lang.StringLatin1.newString(StringLatin1.java:715)
        at java.base/java.lang.StringLatin1.trim(StringLatin1.java:541)
        at java.base/java.lang.String.trim(String.java:2644)
        at org.apache.tika.sax.RecursiveParserWrapperHandler.addContent(RecursiveParserWrapperHandler.java:148)
        at org.apache.tika.sax.RecursiveParserWrapperHandler.endDocument(RecursiveParserWrapperHandler.java:120)
        at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:180)
        at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:518)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:489)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:256)
        
 
java -Xmx6g -jar ../tika.jar -J -r table23.sas7bdat 
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
        at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:172)
        at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:686)
        at java.base/java.lang.StringBuffer.append(StringBuffer.java:414)
        at java.base/java.io.StringWriter.write(StringWriter.java:99)
        at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:96)
        at org.apache.tika.sax.ToXMLContentHandler.write(ToXMLContentHandler.java:181)
        at org.apache.tika.sax.ToXMLContentHandler.endElement(ToXMLContentHandler.java:140)
        at org.apache.tika.parser.RecursiveParserWrapper$RecursivelySecureContentHandler.endElement(RecursiveParserWrapper.java:360)
        at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:134)
        at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:241)
        at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:134)
        at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:134)
        at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:134)
        at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:201)
        at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:257)
        at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:290)
        at org.apache.tika.parser.sas.SAS7BDATParser.parse(SAS7BDATParser.java:147)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
        at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:164)
        at org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:518)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:489)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:256)

Differences in operating systems? I'm on Linux.

> OOM error in SAS7BDATParser
> ---
>
> Key: TIKA-4208
> URL: https://issues.apache.org/jira/browse/TIKA-4208
> Project: Tika
> Issue Type: Bug
> Affects Versions: 3.0.0-BETA
> Reporter: Gregory Lepore
> Priority: Minor
> Attachments: table23.sas7bdat.zip
>
>

[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-15 Thread Gregory Lepore (Jira)


[ https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827516#comment-17827516 ]

Gregory Lepore commented on TIKA-4208:
--

I don't mind excluding the SAS parser since it's probable that what I want 
isn't in a SAS file, but without knowing there are large SAS files in the ARC 
file, it's a bit hard to anticipate.

Is there a way to set a read timeout on the parsing of files inside the ARC 
file, or just on the ARC file? That would probably also solve my issue with the 
zip bomb (TIKA-4206).
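
For illustration, one generic way to put a deadline on a parse call is to run it on a worker thread and wait on a Future. This is a plain-JDK sketch, not a Tika feature; runWithTimeout and the lambdas standing in for a parse call are hypothetical names. A timeout of this kind only abandons the work: it cannot force a busy thread to stop, and it does not protect the JVM from an OutOfMemoryError, so running the parse in a separate process is the more robust variant of the same idea.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ParseWithTimeout {

    /**
     * Runs a task on a worker thread and gives up after timeoutMillis.
     * Returns the task's result, or null if the deadline passed.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; code that never checks for interruption may keep running.
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // Rethrow what the task itself threw (for example a parse failure).
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-ins for a real parse call on one embedded file.
        String fast = runWithTimeout(() -> "parsed", 1000);
        String slow = runWithTimeout(() -> {
            Thread.sleep(5000);
            return "parsed";
        }, 100);
        System.out.println(fast + " / " + slow);
    }
}
```

Applied per embedded file rather than to the whole ARC, a deadline like this would let one slow entry be skipped while the rest of the archive is still processed.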

Memory shouldn't be a problem since I'm running my code on an EC2 instance with 
124GB of RAM and a 100GB swapfile.
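
A note on why physical RAM may not help here: the JVM heap is bounded by -Xmx rather than by the machine's memory, and a Java array is indexed by int, so a single character buffer cannot grow past roughly Integer.MAX_VALUE elements no matter how large the heap is. "Requested array size exceeds VM limit" under StringWriter.write is consistent with an in-memory text buffer reaching that ceiling. The sketch below shows the general idea of capping the buffer instead of letting it grow without bound. BoundedStringWriter is a hypothetical helper written for this example, not a Tika class.

```java
import java.io.IOException;
import java.io.Writer;

class BoundedWriterDemo {

    /** Hypothetical helper: collects text but refuses to grow past maxChars. */
    static final class BoundedStringWriter extends Writer {
        private final StringBuilder buffer = new StringBuilder();
        private final int maxChars;

        BoundedStringWriter(int maxChars) {
            this.maxChars = maxChars;
        }

        @Override
        public void write(char[] cbuf, int off, int len) throws IOException {
            // Compare with subtraction so the check itself cannot overflow int.
            if (len > maxChars - buffer.length()) {
                throw new IOException("write limit of " + maxChars + " chars reached");
            }
            buffer.append(cbuf, off, len);
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Arrays are indexed by int, so this is the upper bound on any char buffer's length.
        System.out.println("upper bound on array length: " + Integer.MAX_VALUE);

        Writer out = new BoundedStringWriter(10);
        out.write("12345");
        try {
            out.write("67890ABC"); // would exceed the 10-char cap
        } catch (IOException e) {
            System.out.println("stopped: " + e.getMessage());
        }
        System.out.println(out);
    }
}
```

The writer rejects a write that would cross the cap and keeps what was already collected, so the caller gets a catchable exception and partial text rather than an OutOfMemoryError.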

Thanks.

> OOM error in SAS7BDATParser
> ---
>
> Key: TIKA-4208
> URL: https://issues.apache.org/jira/browse/TIKA-4208
> Project: Tika
> Issue Type: Bug
> Affects Versions: 3.0.0-BETA
> Reporter: Gregory Lepore
> Priority: Minor
> Attachments: table23.sas7bdat.zip
>
>


