[jira] [Created] (TIKA-4207) PipesParser should have option to extract raw bytes of embedded files

2024-03-08 Thread Tim Allison (Jira)
Tim Allison created TIKA-4207:
-

 Summary: PipesParser should have option to extract raw bytes of 
embedded files
 Key: TIKA-4207
 URL: https://issues.apache.org/jira/browse/TIKA-4207
 Project: Tika
  Issue Type: New Feature
Reporter: Tim Allison


There are many use cases where text+metadata are important, but users also 
need the raw bytes from embedded files.

Let's make it possible to extract the usual rmeta content _and_ the raw 
bytes. This is a preliminary step that will offer more customization options 
than the proposal in TIKA-3703.

This is targeted to 3.x.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-08 Thread Gregory Lepore (Jira)
Gregory Lepore created TIKA-4208:


 Summary: OOM error in SAS7BDATParser
 Key: TIKA-4208
 URL: https://issues.apache.org/jira/browse/TIKA-4208
 Project: Tika
  Issue Type: Bug
Affects Versions: 3.0.0-BETA
Reporter: Gregory Lepore


For this ARC file:
[https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-000/warc/NARA-PEOT-2004-20041019023240-02598-crawling008-c_NARA-PEOT-2004-20041019053819-01693-crawling007.archive.org.arc.gz]

I'm getting an OOM error:

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
   at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
   at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
   at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
   at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
   at java.base/java.io.StringWriter.write(StringWriter.java:99)
   at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:96)
   at org.apache.tika.sax.ToXMLContentHandler.writeEscaped(ToXMLContentHandler.java:229)
   at org.apache.tika.sax.ToXMLContentHandler.characters(ToXMLContentHandler.java:154)
   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
   at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
   at org.apache.tika.parser.RecursiveParserWrapper$RecursivelySecureContentHandler.characters(RecursiveParserWrapper.java:370)
   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
   at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:253)
   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:143)
   at org.apache.tika.sax.SafeContentHandler.access$101(SafeContentHandler.java:47)
   at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
   at org.apache.tika.sax.SafeContentHandler$$Lambda$327/0x7f94a022d1a8.write(Unknown Source)
   at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
   at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
   at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
   at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
   at org.apache.tika.parser.sas.SAS7BDATParser.parse(SAS7BDATParser.java:146)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:153)
   at org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:259)
   at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:71)
   at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:109)
   at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:455)

when extracting JSON with both the app and server versions of 3.0.0-BETA.





[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-08 Thread Nick Burch (Jira)


[ https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824874#comment-17824874 ]

Nick Burch commented on TIKA-4208:
--

How much heap do you have allocated?

The error suggests that Tika managed to decode the string in the SAS data file, 
but ran out of memory passing the string through the content handler stack to 
plain text. Generally things break at the decode step if they're going to, 
rather than the output!
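
One way to keep the output end of the handler stack from growing without bound is to cap how many characters the final writer accepts. The following is a minimal sketch under that assumption, not Tika's implementation: `BoundedStringWriter` and its `isTruncated()` method are hypothetical names introduced here for illustration. Tika's own `BodyContentHandler` can be constructed with a write limit for a similar purpose, and that is probably the better route in real code.

```java
import java.io.StringWriter;
import java.io.Writer;

/**
 * Hypothetical helper (not part of Tika): a Writer that keeps at most
 * `limit` characters and silently drops the rest, so the backing
 * buffer can never grow past a known size.
 */
final class BoundedStringWriter extends Writer {
    private final StringWriter delegate = new StringWriter();
    private final int limit;
    private int written = 0;
    private boolean truncated = false;

    BoundedStringWriter(int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.limit = limit;
    }

    @Override
    public void write(char[] cbuf, int off, int len) {
        // Only pass through as many chars as still fit under the limit.
        int remaining = limit - written;
        int toWrite = Math.min(len, remaining);
        if (toWrite > 0) {
            delegate.write(cbuf, off, toWrite);
            written += toWrite;
        }
        if (toWrite < len) {
            truncated = true; // remember that some content was dropped
        }
    }

    @Override
    public void flush() {
        // Nothing to flush: everything is held in memory.
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    /** True if at least one character was dropped because of the limit. */
    boolean isTruncated() {
        return truncated;
    }

    @Override
    public String toString() {
        return delegate.toString();
    }
}
```

Wrapping the text sink this way trades complete output for a predictable memory ceiling, and `isTruncated()` lets the caller record that content was dropped.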



[jira] [Commented] (TIKA-4208) OOM error in SAS7BDATParser

2024-03-08 Thread Gregory Lepore (Jira)


[ https://issues.apache.org/jira/browse/TIKA-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824879#comment-17824879 ]

Gregory Lepore commented on TIKA-4208:
--

java -Xmx4G -Xms4G -jar ../tika.jar file.arc.gz 
 
works, but
 
java -Xmx4G -Xms4G -jar ../tika.jar -J file.arc.gz
 
throws the error, as do all higher values for Xmx and Xms (up to 32GB each) 
when used in conjunction with JSON output.
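
A possible reason that extra heap makes no difference: the "Requested array size exceeds VM limit" variant of OutOfMemoryError concerns the length of a single array, not the total heap. Java arrays are indexed by `int`, so one backing array behind a `StringBuffer` cannot grow past roughly `Integer.MAX_VALUE` elements however large -Xmx is. The sketch below only illustrates that arithmetic. `ArrayLimitSketch` and `fitsInOneArray` are hypothetical names, the exact cap is JVM-specific and slightly below `Integer.MAX_VALUE`, and the bytes-per-char parameter assumes the compact-string layout of recent JDKs (1 byte per char for Latin-1 text, 2 for UTF-16).

```java
/** Hypothetical illustration of the single-array size ceiling. */
final class ArrayLimitSketch {
    // Upper bound on any Java array length. Real JVMs cap slightly
    // lower; the precise margin is an implementation detail.
    static final long MAX_ARRAY_LENGTH = Integer.MAX_VALUE;

    /**
     * Returns true if a buffer holding charCount characters, stored at
     * bytesPerChar bytes each, could fit in one byte array.
     */
    static boolean fitsInOneArray(long charCount, int bytesPerChar) {
        return charCount * bytesPerChar <= MAX_ARRAY_LENGTH;
    }
}
```

Under these assumptions, 1.5 billion Latin-1 characters would still fit in one array, but the same number of UTF-16 characters would not, whatever the heap size.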



Re: [PR] Add Github CI workflows for multi-arch Docker images [tika-docker]

2024-03-08 Thread via GitHub


bartek commented on PR #19:
URL: https://github.com/apache/tika-docker/pull/19#issuecomment-1986589544

   Hello! :wave:
   
   This PR has been kicking around for some time and I'm interested in official 
ARM images (right now we are building our own internally).
   
   Is anything blocking this PR from being merged? I see @lewismc requested 
changes, but there have been many updates since.
   
   Does this need shepherding through? It's not clear whether @fpiesche is still 
active on this work.
   
   Thank you! Happy to help as necessary.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Add Github CI workflows for multi-arch Docker images [tika-docker]

2024-03-08 Thread via GitHub


tballison commented on PR #19:
URL: https://github.com/apache/tika-docker/pull/19#issuecomment-1986632975

   We could definitely use some help. This is not an area of strength for me 
and has fallen off my plate.





Re: [PR] Add Github CI workflows for multi-arch Docker images [tika-docker]

2024-03-08 Thread via GitHub


fpiesche commented on PR #19:
URL: https://github.com/apache/tika-docker/pull/19#issuecomment-1986672882

   Sorry for dropping off the radar. I've had a lot of Life coming at me over 
the past few months, so until fairly recently a lot of my personal GitHub stuff 
fell by the wayside. It's getting late here but I'll make some time to sort out 
that regex tomorrow.
   
   As for the Docker username/token, adding your personal account/token to the 
GitHub repo as secrets would indeed be the approach with this workflow. Does 
the ASF maybe have the ability or a process for setting up org-level accounts 
for things like accessing project Docker repos for builds (e.g. an 
`apache-tika` account that's controlled by the ASF and just has a Docker Hub 
access token for external CI processes)?
   
   I don't have a Docker subscription myself, so I honestly have no idea how the 
Docker Hub CI works or how it would be configured to build on ARM; this 
workflow wouldn't apply to that process at all :thinking: I had set the GH 
workflow up on my personal repository because running builds on GitHub's CI and 
then pushing them to Docker Hub and GHCR from there is how I handle most of my 
hobby Docker projects. As those have just me working on them, I hadn't run into 
the token sharing problem...

