date:20240314

[PR] Bump org.springframework:spring-context from 5.3.32 to 5.3.33 [tika]

2024-03-14 Thread via GitHub



dependabot[bot] opened a new pull request, #1662:
URL: https://github.com/apache/tika/pull/1662

   
   
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[PR] Bump aws.version from 1.12.679 to 1.12.680 [tika]

2024-03-14 Thread via GitHub



dependabot[bot] opened a new pull request, #1661:
URL: https://github.com/apache/tika/pull/1661

   Bumps `aws.version` from 1.12.679 to 1.12.680.
   Updates `com.amazonaws:aws-java-sdk-s3` from 1.12.679 to 1.12.680
   
   Changelog
   Sourced from com.amazonaws:aws-java-sdk-s3's changelog (https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).
   
   1.12.680 2024-03-14

   AWS Amplify
   Features
   - Documentation updates for Amplify. Identifies the APIs available only to apps created using Amplify Gen 1.

   AWS EC2 Instance Connect
   Features
   - This release includes a new exception type SerialConsoleSessionUnsupportedException for SendSerialConsoleSSHPublicKey API.

   AWS Fault Injection Simulator
   Features
   - This release adds support for previewing target resources before running a FIS experiment. It also adds resource ARNs for actions, experiments, and experiment templates to API responses.

   AWS Secrets Manager
   Features
   - Doc only update for Secrets Manager

   Amazon Relational Database Service
   Features
   - Updates Amazon RDS documentation for EBCDIC collation for RDS for Db2.

   Elastic Load Balancing
   Features
   - This release allows you to configure HTTP client keep-alive duration for communication between clients and Application Load Balancers.

   Timestream InfluxDB
   Features
   - This is the initial SDK release for Amazon Timestream for InfluxDB. Amazon Timestream for InfluxDB is a new time-series database engine that makes it easy for application developers and DevOps teams to run InfluxDB databases on AWS for near real-time time-series applications using open source APIs.

   Commits
   
   - 03a1164 AWS SDK for Java 1.12.680 (https://github.com/aws/aws-sdk-java/commit/03a1164aa49caf56754378824c84ff74d7a3699b)
   - 888a6b8 Update GitHub version number to 1.12.680-SNAPSHOT (https://github.com/aws/aws-sdk-java/commit/888a6b8ceb729c091a32ae5d492f049f6fe3f4d7)
   - See full diff in compare view (https://github.com/aws/aws-sdk-java/compare/1.12.679...1.12.680)
   
   
   
   
   Updates `com.amazonaws:aws-java-sdk-transcribe` from 1.12.679 to 1.12.680
   
   

[PR] Bump pdfbox.version from 3.0.1 to 3.0.2 [tika]

2024-03-14 Thread via GitHub



dependabot[bot] opened a new pull request, #1660:
URL: https://github.com/apache/tika/pull/1660

   Bumps `pdfbox.version` from 3.0.1 to 3.0.2.
   Updates `org.apache.pdfbox:xmpbox` from 3.0.1 to 3.0.2
   
   Updates `org.apache.pdfbox:fontbox` from 3.0.1 to 3.0.2
   
   Updates `org.apache.pdfbox:pdfbox` from 3.0.1 to 3.0.2
   
   Updates `org.apache.pdfbox:pdfbox-tools` from 3.0.1 to 3.0.2
   
   

[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0

2024-03-14 Thread Hudson (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827248#comment-17827248
 ] 

Hudson commented on TIKA-4166:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1555 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1555/])
TIKA-4166: update mime4j (tilman: 
[https://github.com/apache/tika/commit/91820226e319e7deed535a31997d05f49dd60685])
* (edit) tika-parent/pom.xml


> dependency updates for Tika 3.0
> ---
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
> Issue Type: Task
> Components: build
> Reporter: Tilman Hausherr
> Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (TIKA-4213) Improvements to jdbc pipes reporter

2024-03-14 Thread Tim Allison (Jira)

Tim Allison created TIKA-4213:
-

 Summary: Improvements to jdbc pipes reporter
 Key: TIKA-4213
 URL: https://issues.apache.org/jira/browse/TIKA-4213
 Project: Tika
 Issue Type: New Feature
 Reporter: Tim Allison


We should use the "id" as the key, not the emitter key. We should add a 
timestamp. We should not block on waiting for more data on the queue -- this 
prevents actually writing the buffer when run in tika-server.
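
A minimal sketch of the "do not block on the queue" point, assuming a reporter that buffers rows and flushes them in batches. This is not the tika-pipes reporter code: the class name, the StatusRow fields, and the nextBatch helper are hypothetical and only illustrate a timed poll, so that a partially filled buffer still gets written when no more data arrives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class ReporterBufferSketch {

    /** Hypothetical report row: keyed by the document id and stamped with a time. */
    static final class StatusRow {
        final String id;
        final String status;
        final long timestampMillis;

        StatusRow(String id, String status, long timestampMillis) {
            this.id = id;
            this.status = status;
            this.timestampMillis = timestampMillis;
        }
    }

    /**
     * Collects up to batchSize rows. Each poll waits at most maxWaitMillis,
     * so the method returns a partial (possibly empty) batch instead of
     * blocking forever when the queue runs dry.
     */
    static List<StatusRow> nextBatch(BlockingQueue<StatusRow> queue, int batchSize,
                                     long maxWaitMillis) throws InterruptedException {
        List<StatusRow> batch = new ArrayList<>();
        while (batch.size() < batchSize) {
            StatusRow row = queue.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
            if (row == null) {
                break; // nothing arrived in time: hand back what we have
            }
            batch.add(row);
        }
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<StatusRow> queue = new ArrayBlockingQueue<>(100);
        // The status strings here are placeholders, not Tika's real status values.
        queue.offer(new StatusRow("doc-1", "OK", System.currentTimeMillis()));
        queue.offer(new StatusRow("doc-2", "OK", System.currentTimeMillis()));

        List<StatusRow> batch = nextBatch(queue, 10, 50);
        System.out.println("rows ready to write: " + batch.size());
    }
}
```

With a plain take() the loop would wait for a tenth row that may never come; the timed poll lets the caller write the two buffered rows after a short wait.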




[jira] [Comment Edited] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-14 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827241#comment-17827241
 ] 

Tim Allison edited comment on TIKA-4211 at 3/14/24 8:20 PM:


Step 3: Is there something like this in /ppt/slides/slide2.xml that references 
rId2? Is the structure exactly the same: graphic->graphicData->..->p:oleObj?

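{code:java}
<!-- Element markup was stripped from this snippet in the archive. Only two attribute values survive:
     http://schemas.openxmlformats.org/presentationml/2006/ole
     http://schemas.openxmlformats.org/markup-compatibility/2006 -->
 {code}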
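
One way to answer the Step 3 question without reading the slide XML by eye, as a rough sketch: parse the slide part, find every element whose local name is oleObj, and print its ancestor chain together with its r:id. The class and method names are hypothetical, and the sample XML in main is a hand-written stand-in, not the markup from the attached file. The only fixed value is the standard officeDocument relationships namespace used for the r:id attribute.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class SlideOleSketch {
    // Namespace of the r:id attribute in OOXML parts.
    static final String R_NS =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    /** Returns one line per oleObj element: its ancestor path and its r:id. */
    static List<String> oleObjectPaths(String slideXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(slideXml.getBytes(StandardCharsets.UTF_8)));

        NodeList oleObjs = doc.getElementsByTagNameNS("*", "oleObj");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < oleObjs.getLength(); i++) {
            Element ole = (Element) oleObjs.item(i);
            // Walk up to the root, collecting local names in document order.
            Deque<String> path = new ArrayDeque<>();
            for (Node n = ole; n instanceof Element; n = n.getParentNode()) {
                path.addFirst(n.getLocalName());
            }
            result.add(String.join("->", path) + " r:id=" + ole.getAttributeNS(R_NS, "id"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample, much simpler than a real slide part.
        String sample =
                "<p:graphicFrame"
                + " xmlns:p=\"http://schemas.openxmlformats.org/presentationml/2006/main\""
                + " xmlns:a=\"http://schemas.openxmlformats.org/drawingml/2006/main\""
                + " xmlns:r=\"" + R_NS + "\">"
                + "<a:graphic><a:graphicData"
                + " uri=\"http://schemas.openxmlformats.org/presentationml/2006/ole\">"
                + "<p:oleObj r:id=\"rId2\"/>"
                + "</a:graphicData></a:graphic></p:graphicFrame>";
        oleObjectPaths(sample).forEach(System.out::println);
    }
}
```

Running this against the working and the failing slide part and comparing the printed paths would show whether the failing file nests its oleObj differently or omits it entirely.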


was (Author: talli...@mitre.org):
Step 3: Is there something like this in /ppt/slides/slide2.xml:
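{code:java}
<!-- Element markup was stripped from this snippet in the archive. Only two attribute values survive:
     http://schemas.openxmlformats.org/presentationml/2006/ole
     http://schemas.openxmlformats.org/markup-compatibility/2006 -->
 {code}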

> Tika extractor fails to extract embedded excel from pptx
> 
>
> Key: TIKA-4211
> URL: https://issues.apache.org/jira/browse/TIKA-4211
> Project: Tika
> Issue Type: Bug
> Reporter: Xiaohong Yang
> Priority: Major
> Attachments: config_and_sample_file.zip
>
>
> We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
> excel from PowerPoint presentation.  It works with most pptx files. But it 
> fails to detect the embedded excel with some pptx files.
> Following is the sample code and attached is the tika-config.xml and a pptx 
> file that works.
> We cannot provide the pptx file that does not work because it is client data.
> We noticed a difference between the pptx files that work and the pptx file 
> that does not work:  
> "{*}Worksheet Object{*}" *is in the popup menu when the embedded Excel object 
> is right-clicked in the pptx files that work.*
> "{*}Edit Data{*}" *is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx file that does not work. This file might be created 
> with an old version of PowerPoint.*
>  
> The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
> 2.9.1 and POI version is 5.2.3. 
>  
> import

[jira] [Comment Edited] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-14 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827230#comment-17827230
 ] 

Tim Allison edited comment on TIKA-4211 at 3/14/24 8:17 PM:


Step 2: In this file within the zip: /ppt/slides/_rels/slide2.xml.rels:
Do you see something like this that specifies the xlsx file as "rId2"?
{code:java}
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="..." Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image1.emf"/>
  <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/package" Target="../embeddings/Microsoft_Excel_Worksheet.xlsx"/>
  <Relationship Id="..." Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout" Target="../slideLayouts/slideLayout2.xml"/>
</Relationships>
{code}

Or, if you grep for "embeddings" in the uncompressed zip, can you find a 
link to the xlsx file?
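
The relationships check in Step 2 can also be scripted. The following is a sketch under assumptions, not part of Tika: the class and method names are made up, and the relationship Ids other than rId2 in the sample are invented for illustration. It reads a .rels part and reports every relationship whose Target points into the embeddings folder. Note that zip entry names carry no leading slash.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SlideRelsSketch {

    /** Reads one zip entry as UTF-8 text, or returns null if the entry is absent. */
    static String readEntry(Path pptx, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(pptx.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    /** Maps relationship Id to Target for every Target that mentions "embeddings". */
    static Map<String, String> embeddingTargets(String relsXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(relsXml.getBytes(StandardCharsets.UTF_8)));

        NodeList rels = doc.getElementsByTagNameNS("*", "Relationship");
        Map<String, String> found = new LinkedHashMap<>();
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            String target = rel.getAttribute("Target");
            if (target.contains("embeddings")) {
                found.put(rel.getAttribute("Id"), target);
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            // Usage: java SlideRelsSketch /path/to/file.pptx
            String rels = readEntry(Path.of(args[0]), "ppt/slides/_rels/slide2.xml.rels");
            System.out.println(rels == null ? "no rels part for slide2" : embeddingTargets(rels));
        } else {
            System.out.println("pass the path to a pptx file");
        }
    }
}
```

An empty result for the failing file would suggest the workbook is referenced some other way than through a package relationship on that slide.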


was (Author: talli...@mitre.org):
In this file within the zip: /ppt/slides/_rels/slide2.xml.rels:
Do you see something like this that specifies the xlsx file as "rId2"?
{code:java}
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="..." Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image1.emf"/>
  <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/package" Target="../embeddings/Microsoft_Excel_Worksheet.xlsx"/>
  <Relationship Id="..." Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout" Target="../slideLayouts/slideLayout2.xml"/>
</Relationships>
{code}


[jira] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-14 Thread Tim Allison (Jira)



[ https://issues.apache.org/jira/browse/TIKA-4211 ]


Tim Allison deleted comment on TIKA-4211:
---

was (Author: talli...@mitre.org):
Or, if you grep for "embeddings" in the uncompressed zip, can you find a 
link to the xlsx file?

> Tika extractor fails to extract embedded excel from pptx
> 
>
> Key: TIKA-4211
> URL: https://issues.apache.org/jira/browse/TIKA-4211
> Project: Tika
> Issue Type: Bug
> Reporter: Xiaohong Yang
> Priority: Major
> Attachments: config_and_sample_file.zip
>
>
> We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
> excel from PowerPoint presentation.  It works with most pptx files. But it 
> fails to detect the embedded excel with some pptx files.
> Following is the sample code and attached is the tika-config.xml and a pptx 
> file that works.
> We cannot provide the pptx file that does not work because it is client data.
> We noticed a difference between the pptx files that work and the pptx file 
> that does not work:  
> "{*}Worksheet Object{*}" *is in the popup menu when the embedded Excel object 
> is right-clicked in the pptx files that work.*
> "{*}Edit Data{*}" *is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx file that does not work. This file might be created 
> with an old version of PowerPoint.*
>  
> The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
> 2.9.1 and POI version is 5.2.3. 
>  
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.poi.poifs.filesystem.DirectoryEntry;
> import org.apache.poi.poifs.filesystem.DocumentEntry;
> import org.apache.poi.poifs.filesystem.DocumentInputStream;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.extractor.EmbeddedDocumentExtractor;
> import org.apache.tika.io.FilenameUtils;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
>  
> import java.io.*;
> import java.net.URL;
> import java.nio.file.Path;
>  
> public class ExtractExcelFromPowerPoint {
>     private final Path pptxFile = new File("/home/ubuntu/testdirs/testdir_pptx/sample.pptx").toPath();
>     private final Path outputDir = new File("/home/ubuntu/testdirs/testdir_pptx/tika_output/").toPath();
> 
>     private Parser parser;
>     private ParseContext context;
> 
>     public static void main(String args[]) {
>         try {
>             new ExtractExcelFromPowerPoint().process();
>         }
>         catch(Exception ex) {
>             ex.printStackTrace();
>         }
>     }
> 
>     public ExtractExcelFromPowerPoint() {
>     }
> 
>     public void process() throws Exception {
>         TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_pptx/tika-config.xml");
>         FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor = new FileEmbeddedDocumentExtractor();
> 
>         parser = new AutoDetectParser(config);
>         context = new ParseContext();
>         context.set(Parser.class, parser);
>         context.set(TikaConfig.class, config);
>         context.set(EmbeddedDocumentExtractor.class, fileEmbeddedDocumentExtractor);
> 
>         URL url = pptxFile.toUri().toURL();
>         Metadata metadata = new Metadata();
>         try (InputStream input = TikaInputStream.get(url, metadata)) {
>             ContentHandler handler = new DefaultHandler();
>             parser.parse(input, handler, metadata, context);
>         }
>     }
> 
>     private class FileEmbeddedDocumentExtractor implements EmbeddedDocumentExtractor {
>         private int count = 0;
> 
>         public boolean shouldParseEmbedded(Metadata metadata) {
>             return true;
>         }
> 
>         public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler, Metadata metadata,
>                                   boolean outputHtml) throws SAXException, IOException {
>             String fullFileName = metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
>             if (fullFileName == null) {
>                 fullFileName = "file" + count++;
>             }
> 
>             String[] fileNameSplit = fullFileName.split("/");
>             String fileName = fileNameSplit[fileNameSplit.length - 1];
>             File outputFile = new File(outputDir.toFile(), FilenameUtils.normalize(fileName));
>             System.out.println("Extracting '" + fileName + " to " + outputFile);
>

[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-14 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827241#comment-17827241
 ] 

Tim Allison commented on TIKA-4211:
---

Step 3: Is there something like this in /ppt/slides/slide2.xml:
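{code:java}
<!-- Element markup was stripped from this snippet in the archive. Only two attribute values survive:
     http://schemas.openxmlformats.org/presentationml/2006/ole
     http://schemas.openxmlformats.org/markup-compatibility/2006 -->
 {code}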


[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-14 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827233#comment-17827233
 ] 

Tim Allison commented on TIKA-4211:
---

Or, if you grep for "embeddings" in the uncompressed zip, can you find a 
link to the xlsx file?
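
For anyone who would rather not unzip the file by hand, the grep can be approximated in a few lines of Java. This is only a sketch: the class and method names are hypothetical, and it assumes that scanning the .xml and .rels parts as UTF-8 text is good enough for spotting the string.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class GrepPptxSketch {

    /**
     * Lists entries whose name contains the needle ("name: ...") and
     * XML or .rels parts whose text contains it ("content: ...").
     */
    static List<String> grep(Path pptx, String needle) throws IOException {
        List<String> hits = new ArrayList<>();
        try (ZipFile zip = new ZipFile(pptx.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                String name = entry.getName();
                if (name.contains(needle)) {
                    hits.add("name: " + name);
                    continue;
                }
                // Only text parts are scanned; binary parts are skipped.
                if (name.endsWith(".xml") || name.endsWith(".rels")) {
                    try (InputStream in = zip.getInputStream(entry)) {
                        String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                        if (text.contains(needle)) {
                            hits.add("content: " + name);
                        }
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java GrepPptxSketch /path/to/file.pptx");
            return;
        }
        grep(Path.of(args[0]), "embeddings").forEach(System.out::println);
    }
}
```

A "name:" hit without any matching "content:" hit would mean the workbook is stored in the package but no scanned part links to it by that path.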

> Tika extractor fails to extract embedded excel from pptx
> 
>
> Key: TIKA-4211
> URL: https://issues.apache.org/jira/browse/TIKA-4211
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: config_and_sample_file.zip
>
>
> We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
> excel from PowerPoint presentation.  It works with most pptx files. But it 
> fails to detect the embedded excel with some pptx files.
> Following is the sample code and attached is the tika-config.xml and a pptx 
> file that works.
> We cannot provide the pptx file that does not work because it is client data.
> We noticed a difference between the pptx files that work and the pptx file 
> that does not work:  
> "{*}Worksheet Object{*}" *is in the popup menu when the embedded Excel object 
> is right-clicked in the pptx files that work.*
> "{*}Edit Data{*}" *is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx file that does not work. This file might be created 
> with an old version of PowerPoint.*
>  
> The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
> 2.9.1 and POI version is 5.2.3. 
>  
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.poi.poifs.filesystem.DirectoryEntry;
> import org.apache.poi.poifs.filesystem.DocumentEntry;
> import org.apache.poi.poifs.filesystem.DocumentInputStream;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.extractor.EmbeddedDocumentExtractor;
> import org.apache.tika.io.FilenameUtils;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
>  
> import java.io.*;
> import java.net.URL;
> import java.nio.file.Path;
>  
> public class ExtractExcelFromPowerPoint {
>     private final Path pptxFile = new 
> File("/home/ubuntu/testdirs/testdir_pptx/sample.pptx").toPath();
>     private final Path outputDir = new 
> File("/home/ubuntu/testdirs/testdir_pptx/tika_output/").toPath();
>  
>     private Parser parser;
>     private ParseContext context;
>  
>  
>     public static void main(String args[]) {
>     try {
>     new ExtractExcelFromPowerPoint().process();
>     }
>     catch(Exception ex) {
>     ex.printStackTrace();
>     }
>     }
>  
>     public ExtractExcelFromPowerPoint() {
>     }
>  
>     public void process() throws Exception {
>     TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_pptx/tika-config.xml");
>     FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor = new 
> FileEmbeddedDocumentExtractor();
>  
>     parser = new AutoDetectParser(config);
>     context = new ParseContext();
>     context.set(Parser.class, parser);
>     context.set(TikaConfig.class, config);
>     context.set(EmbeddedDocumentExtractor.class, 
> fileEmbeddedDocumentExtractor);
>  
>     URL url = pptxFile.toUri().toURL();
>     Metadata metadata = new Metadata();
>     try (InputStream input = TikaInputStream.get(url, metadata)) {
>     ContentHandler handler = new DefaultHandler();
>     parser.parse(input, handler, metadata, context);
>     }
>     }
>  
>     private class FileEmbeddedDocumentExtractor implements 
> EmbeddedDocumentExtractor {
>     private int count = 0;
>  
>     public boolean shouldParseEmbedded(Metadata metadata) {
>     return true;
>     }
>  
>     public void parseEmbedded(InputStream inputStream, ContentHandler 
> contentHandler, Metadata metadata,
>   boolean outputHtml) throws SAXException, 
> IOException {
>     String fullFileName = 
> metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
>     if (fullFileName == null) {
>     fullFileName = "file" + count++;
>     }
>  
>     String[] fileNameSplit = fullFileName.split("/");
>     String fileName = fileNameSplit[fileNameSplit.length - 1];
>     File outputFile = new File(outputDir.toFile(), 
> FilenameUtils.normalize(fileName));
>     System.out.println("Extracting '" + fileName + " to

[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-14 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827230#comment-17827230
 ] 

Tim Allison commented on TIKA-4211:
---

In this file within the zip: /ppt/slides/_rels/slide2.xml.rels:
Do you see something like this that specifies the xlsx file as "rId2"?
{code:java}
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="..." Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image1.emf"/>
  <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/package" Target="../embeddings/Microsoft_Excel_Worksheet.xlsx"/>
  <Relationship Id="..." Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout" Target="../slideLayouts/slideLayout2.xml"/>
</Relationships>
{code}
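
A small sketch of reading such a rels part programmatically, in case it helps compare the working and non-working files. It uses only the JDK XML parser; the class name `SlideRelsReader` and the method `packageTargets` are invented for this example and are not Tika or POI APIs. It assumes the rels XML has already been read into a string.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists the relationships in a slide's .rels part that
// point at an embedded package (for example an xlsx under ../embeddings/).
final class SlideRelsReader {
    static final String PACKAGE_TYPE =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/package";

    // Returns relationship Id -> Target for every relationship of the package type.
    static Map<String, String> packageTargets(String relsXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the input may come from untrusted files.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(relsXml.getBytes(StandardCharsets.UTF_8)));
        NodeList rels = doc.getElementsByTagNameNS("*", "Relationship");
        Map<String, String> targets = new LinkedHashMap<>();
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            if (PACKAGE_TYPE.equals(rel.getAttribute("Type"))) {
                targets.put(rel.getAttribute("Id"), rel.getAttribute("Target"));
            }
        }
        return targets;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to an extracted file such as ppt/slides/_rels/slide2.xml.rels
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        packageTargets(xml).forEach((id, target) -> System.out.println(id + " -> " + target));
    }
}
```

If the file that fails prints nothing here, the embedded workbook is probably referenced through a different relationship type, which would be worth reporting on the issue.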


[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-14 Thread Xiaohong Yang (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827221#comment-17827221
 ] 

Xiaohong Yang commented on TIKA-4211:
-

Hi Tim, 

Yes, I found the right file /ppt/embeddings/Microsoft_Excel_Worksheet.xlsx 
after unzipping the pptx.


[jira] [Updated] (TIKA-4212) Tika fails to get file extension of file type image/x-rtf-raw-bitmap

2024-03-14 Thread Xiaohong Yang (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaohong Yang updated TIKA-4212:

Attachment: tika-config-and-sample-file.zip

> Tika fails to get file extension of file type image/x-rtf-raw-bitmap
> 
>
> Key: TIKA-4212
> URL: https://issues.apache.org/jira/browse/TIKA-4212
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: tika-config-and-sample-file.zip
>
>
> We use  org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
> objects from Word documents.  Two embedded objects are extracted from the 
> sample doc file. Their file type is image/x-rtf-raw-bitmap. But Tika fails to 
> get the file extension with the following method call
>   tikaExtension = 
> config.getMimeRepository().forName(contentType.toString()).getExtension();
> Could you fix the problem in the Tika library?  Could you also tell us the 
> file extension for the file type image/x-rtf-raw-bitmap?
> Following is the sample code and attached is the tika-config.xml and the 
> sample Word file.
> The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
> 2.9.1 and POI version is 5.2.3.  
>  
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.poi.poifs.filesystem.DirectoryEntry;
> import org.apache.poi.poifs.filesystem.DocumentEntry;
> import org.apache.poi.poifs.filesystem.DocumentInputStream;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.extractor.EmbeddedDocumentExtractor;
> import org.apache.tika.io.FilenameUtils;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
>  
> import java.io.*;
> import java.net.URL;
> import java.nio.file.Path;
>  
> public class ExtractBitMapFromWord {
>     private final Path docFile = new 
> File("/home/ubuntu/testdirs/testdir_doc/sample.DOC").toPath();
>     private final Path outputDir = new 
> File("/home/ubuntu/testdirs/testdir_doc/tika_output/").toPath();
>  
>     private Parser parser;
>     private ParseContext context;
>  
>  
>     public static void main(String args[]) {
>     try {
>     new ExtractBitMapFromWord().process();
>     }
>     catch(Exception ex) {
>     ex.printStackTrace();
>     }
>     }
>  
>     public ExtractBitMapFromWord() {
>     }
>  
>     public void process() throws Exception {
>     TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_doc/tika-config.xml");
>     ExtractBitMapFromWord.FileEmbeddedDocumentExtractor 
> fileEmbeddedDocumentExtractor = new 
> ExtractBitMapFromWord.FileEmbeddedDocumentExtractor();
>  
>     parser = new AutoDetectParser(config);
>     context = new ParseContext();
>     context.set(Parser.class, parser);
>     context.set(TikaConfig.class, config);
>     context.set(EmbeddedDocumentExtractor.class, 
> fileEmbeddedDocumentExtractor);
>  
>     URL url = docFile.toUri().toURL();
>     Metadata metadata = new Metadata();
>     try (InputStream input = TikaInputStream.get(url, metadata)) {
>     ContentHandler handler = new DefaultHandler();
>     parser.parse(input, handler, metadata, context);
>     }
>     }
>  
>     private class FileEmbeddedDocumentExtractor implements 
> EmbeddedDocumentExtractor {
>     private int count = 0;
>  
>     public boolean shouldParseEmbedded(Metadata metadata) {
>     return true;
>     }
>  
>     public void parseEmbedded(InputStream inputStream, ContentHandler 
> contentHandler, Metadata metadata,
>   boolean outputHtml) throws SAXException, 
> IOException {
>     String fullFileName = 
> metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
>     if (fullFileName == null) {
>     fullFileName = "file" + count++;
>     }
>  
>     TikaConfig config = null;
>     try {
>     config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_doc/tika-config.xml");
>     } catch (Exception ex) {
>     ex.printStackTrace();
>     }
>     if (config == null) {
>     return;
>     }
>  
>     Detector detector = config.getDetector();;
>     MediaType

[jira] [Created] (TIKA-4212) Tika fails to get file extension of file type image/x-rtf-raw-bitmap

2024-03-14 Thread Xiaohong Yang (Jira)

Xiaohong Yang created TIKA-4212:
---

 Summary: Tika fails to get file extension of file type 
image/x-rtf-raw-bitmap
 Key: TIKA-4212
 URL: https://issues.apache.org/jira/browse/TIKA-4212
 Project: Tika
  Issue Type: Bug
Reporter: Xiaohong Yang


We use  org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
objects from Word documents.  Two embedded objects are extracted from the 
sample doc file. Their file type is image/x-rtf-raw-bitmap. But Tika fails to 
get the file extension with the following method call

  tikaExtension = 
config.getMimeRepository().forName(contentType.toString()).getExtension();

Could you fix the problem in the Tika library?  Could you also tell us the 
file extension for the file type image/x-rtf-raw-bitmap?
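
Independently of any change in Tika, calling code can guard against a lookup that yields no extension. The sketch below is plain Java with no Tika dependency. The names `ExtensionFallback`, `withFallback` and `outputName`, and the `.bin` default, are assumptions made for illustration, and it assumes the lookup result is either null or an empty string when no extension is known.

```java
// Hypothetical helper: picks a usable extension when the MIME lookup gives nothing.
final class ExtensionFallback {
    // Assumed default for unknown types; any project-specific value would do.
    static final String DEFAULT_EXTENSION = ".bin";

    // Returns the detected extension with a leading dot, or the default if none was found.
    static String withFallback(String detected) {
        if (detected == null || detected.trim().isEmpty()) {
            return DEFAULT_EXTENSION;
        }
        String ext = detected.trim();
        return ext.startsWith(".") ? ext : "." + ext;
    }

    // Appends an extension only when the file name does not already contain a dot,
    // mirroring the indexOf('.') check in the sample code.
    static String outputName(String fullFileName, String detected) {
        return fullFileName.indexOf('.') == -1
                ? fullFileName + withFallback(detected)
                : fullFileName;
    }

    public static void main(String[] args) {
        System.out.println(outputName("file0", ""));
        System.out.println(outputName("file1", ".png"));
    }
}
```

With this guard the extracted objects are still written to disk under a predictable name while the question about the proper extension is open.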

Following is the sample code and attached is the tika-config.xml and the sample 
Word file.

The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
2.9.1 and POI version is 5.2.3.  

 

import org.apache.pdfbox.io.IOUtils;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.io.FilenameUtils;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.io.*;
import java.net.URL;
import java.nio.file.Path;

public class ExtractBitMapFromWord {
    private final Path docFile = new File("/home/ubuntu/testdirs/testdir_doc/sample.DOC").toPath();
    private final Path outputDir = new File("/home/ubuntu/testdirs/testdir_doc/tika_output/").toPath();

    private Parser parser;
    private ParseContext context;

    public static void main(String args[]) {
        try {
            new ExtractBitMapFromWord().process();
        }
        catch(Exception ex) {
            ex.printStackTrace();
        }
    }

    public ExtractBitMapFromWord() {
    }

    public void process() throws Exception {
        TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_doc/tika-config.xml");
        ExtractBitMapFromWord.FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor =
                new ExtractBitMapFromWord.FileEmbeddedDocumentExtractor();

        parser = new AutoDetectParser(config);
        context = new ParseContext();
        context.set(Parser.class, parser);
        context.set(TikaConfig.class, config);
        context.set(EmbeddedDocumentExtractor.class, fileEmbeddedDocumentExtractor);

        URL url = docFile.toUri().toURL();
        Metadata metadata = new Metadata();
        try (InputStream input = TikaInputStream.get(url, metadata)) {
            ContentHandler handler = new DefaultHandler();
            parser.parse(input, handler, metadata, context);
        }
    }

    private class FileEmbeddedDocumentExtractor implements EmbeddedDocumentExtractor {
        private int count = 0;

        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler, Metadata metadata,
                                  boolean outputHtml) throws SAXException, IOException {
            String fullFileName = metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
            if (fullFileName == null) {
                fullFileName = "file" + count++;
            }

            TikaConfig config = null;
            try {
                config = new TikaConfig("/home/ubuntu/testdirs/testdir_doc/tika-config.xml");
            } catch (Exception ex) {
                ex.printStackTrace();
            }
            if (config == null) {
                return;
            }

            Detector detector = config.getDetector();
            MediaType contentType = detector.detect(inputStream, metadata);
            String tikaExtension = null;
            if(fullFileName.indexOf('.') == -1 && contentType != null){
                try {
                    tikaExtension = config.getMimeRepository().forName(contentType.toString()).getExtension();
                } catch (Exception ex) {
                    ex.printStackTrace();
                }

                if (tikaExtension != null &&

[jira] [Commented] (TIKA-4210) Not able to identify tika extension

2024-03-14 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827193#comment-17827193
 ] 

Tim Allison commented on TIKA-4210:
---

Those files look like this in the rtf file:

{code:java}
{\pict\wbitmap0\picw14\pich26\wbmbitspixel1\wbmplanes1\wbmwidthbytes2\picwGoal210\pichGoal390
 
fffcbffc9ffc8ffc87fc83fc81fc80fc807c803c801c800c8004800c801c803c807c80fc81fc83fc87fc8ffc9ffcbffcfffcfffc}\
{code}
and

{code:java}
 
{\pict\wbitmap0\picw173\pich7\wbmbitspixel1\wbmplanes1\wbmwidthbytes22\picwGoal2076\pichGoal84
 
fff8c7f8ff1fe3fc7f8ff1fe3fc7f8ff1fe3fc7f8ff1fe38b7f6fedfdbfb7f6fedfdbfb7f6fedfdbfb7f6fedfdb893f27e4fc9f93f27e4fc9f93f27e4fc9f93f27e4fc98b7f6fedfdbfb7f6fedfdbfb7f6fedfdbfb7f6fedfdb8cff9ff3fe7fcff9ff3fe7fcff9ff3fe7fcff9ff3fe78fff8}
 {code}
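
As a rough sanity check on those two blobs, the control words in each header already say how much raw data to expect. Assuming \wbmwidthbytes is the number of bytes per raster line and \pich is the number of lines, the first picture should hold 2 x 26 = 52 bytes and the second 22 x 7 = 154 bytes, which are the Content-Length values reported for file_0 and file_1 elsewhere in this thread. The sketch below is plain Java, not Tika code. The class and method names are invented for illustration, and the most-significant-bit-first pixel order in `bitRow` is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for inspecting an RTF {\pict\wbitmap0 ...} group by hand.
final class RtfRawBitmap {
    // Header and hex payload of the first picture quoted above.
    static final String HEADER =
            "\\pict\\wbitmap0\\picw14\\pich26\\wbmbitspixel1\\wbmplanes1\\wbmwidthbytes2\\picwGoal210\\pichGoal390";
    static final String HEX =
            "fffcbffc9ffc8ffc87fc83fc81fc80fc807c803c801c800c8004"
            + "800c801c803c807c80fc81fc83fc87fc8ffc9ffcbffcfffcfffc";

    // Reads the numeric argument of a control word such as \pich26; returns -1 if absent.
    static int controlValue(String header, String word) {
        Matcher m = Pattern.compile("\\\\" + Pattern.quote(word) + "(-?\\d+)").matcher(header);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    // Decodes the hex payload, ignoring any whitespace between digits.
    static byte[] decodeHex(String hex) {
        String clean = hex.replaceAll("\\s+", "");
        byte[] out = new byte[clean.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(clean.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Renders one raster line as 0/1 characters (assumes MSB-first bit order).
    static String bitRow(byte[] data, int row, int rowBytes, int widthPixels) {
        StringBuilder sb = new StringBuilder();
        for (int x = 0; x < widthPixels; x++) {
            int b = data[row * rowBytes + x / 8] & 0xFF;
            sb.append(((b >> (7 - x % 8)) & 1) == 1 ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int width = controlValue(HEADER, "picw");
        int height = controlValue(HEADER, "pich");
        int rowBytes = controlValue(HEADER, "wbmwidthbytes");
        byte[] data = decodeHex(HEX);
        System.out.println("expected bytes: " + (rowBytes * height) + ", decoded bytes: " + data.length);
        for (int row = 0; row < height; row++) {
            System.out.println(bitRow(data, row, rowBytes, width));
        }
    }
}
```

Printing the rows this way gives a quick visual impression of what the one-bit-per-pixel picture contains.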

 

> Not able to identify tika extension
> ---
>
> Key: TIKA-4210
> URL: https://issues.apache.org/jira/browse/TIKA-4210
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Major
> Attachments: sample.DOC
>
>
> Hi Team,
> The attached file contains embedded .MPGA attachments whose extension Tika is 
> not able to identify. We tried Tika versions 2.9.0 and 2.9.1, and the extension 
> still shows as empty. Please look into this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (TIKA-4210) Not able to identify tika extension

2024-03-14 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827191#comment-17827191
 ] 

Tim Allison commented on TIKA-4210:
---

Nick is right. The file is an RTF file. Tika does find two embedded files 
identified as x-rtf-raw-bitmap. I don't think we have a parser for that 
format.
{code:java}
[
    {
        "Content-Length": "19619",
        "Content-Type": "application/rtf",
        "X-TIKA:Parsed-By": [
            "org.apache.tika.parser.DefaultParser",
            "org.apache.tika.parser.microsoft.rtf.RTFParser"
        ],
        "X-TIKA:Parsed-By-Full-Set": [
            "org.apache.tika.parser.DefaultParser",
            "org.apache.tika.parser.microsoft.rtf.RTFParser",
            "org.apache.tika.parser.EmptyParser"
        ],
        "X-TIKA:content": "...",
        "X-TIKA:content_handler": "ToTextContentHandler",
        "X-TIKA:embedded_depth": "0",
        "X-TIKA:parse_time_millis": "143",
        "resourceName": "sample.DOC.rtf"
    },
    {
        "Content-Length": "52",
        "Content-Type": "image/x-rtf-raw-bitmap",
        "Content-Type-Parser-Override": "image/x-rtf-raw-bitmap",
        "X-TIKA:Parsed-By": "org.apache.tika.parser.EmptyParser",
        "X-TIKA:embedded_depth": "1",
        "X-TIKA:embedded_id": "1",
        "X-TIKA:embedded_id_path": "/1",
        "X-TIKA:embedded_resource_path": "/file_0",
        "X-TIKA:parse_time_millis": "1",
        "resourceName": "file_0",
        "rtf_meta:thumbnail": "false"
    },
    {
        "Content-Length": "154",
        "Content-Type": "image/x-rtf-raw-bitmap",
        "Content-Type-Parser-Override": "image/x-rtf-raw-bitmap",
        "X-TIKA:Parsed-By": "org.apache.tika.parser.EmptyParser",
        "X-TIKA:embedded_depth": "1",
        "X-TIKA:embedded_id": "2",
        "X-TIKA:embedded_id_path": "/2",
        "X-TIKA:embedded_resource_path": "/file_1",
        "X-TIKA:parse_time_millis": "0",
        "resourceName": "file_1",
        "rtf_meta:thumbnail": "false"
    }
] {code}


[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-14 Thread Tim Allison (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827190#comment-17827190
 ] 

Tim Allison commented on TIKA-4211:
---

Yes, as you point out, Tika works with the example file that you shared. I can't 
do much without a file to work on. If you're able to share it privately, I can 
take a look.

 

Otherwise, we can debug it together.

Step 1: unzip the pptx. Do you see the embedded xlsx anywhere in the zip file? 
In the attached file, the xlsx file is 
/ppt/embeddings/Microsoft_Excel_Worksheet.xlsx
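
If unzipping client data to disk is awkward, step 1 can also be done by listing the entry names with the JDK alone. A minimal sketch follows. `EmbeddingLister` and `embeddedParts` are hypothetical names, and the `ppt/embeddings/` prefix is taken from the sample file, so a file produced by another PowerPoint version may store the workbook elsewhere.

```java
import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: lists the parts stored under ppt/embeddings/ in a pptx.
final class EmbeddingLister {
    // Folder used by the sample file in this issue (zip entry names have no leading slash).
    static final String EMBEDDINGS_DIR = "ppt/embeddings/";

    // Keeps only file entries below the embeddings folder, sorted by name.
    static List<String> embeddedParts(List<String> entryNames) {
        return entryNames.stream()
                .filter(name -> name.startsWith(EMBEDDINGS_DIR) && !name.endsWith("/"))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the pptx file
        try (ZipFile zip = new ZipFile(args[0])) {
            List<String> names = zip.stream().map(ZipEntry::getName).collect(Collectors.toList());
            embeddedParts(names).forEach(System.out::println);
        }
    }
}
```

Only entry names are printed, so the output can be shared on the issue without exposing the contents of the client file.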


[jira] [Created] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-14 Thread Xiaohong Yang (Jira)

Xiaohong Yang created TIKA-4211:
---

 Summary: Tika extractor fails to extract embedded excel from pptx
 Key: TIKA-4211
 URL: https://issues.apache.org/jira/browse/TIKA-4211
 Project: Tika
  Issue Type: Bug
Reporter: Xiaohong Yang
 Attachments: config_and_sample_file.zip

We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
Excel files from PowerPoint presentations.  It works with most pptx files, but it 
fails to detect the embedded Excel file in some pptx files.

Following is the sample code and attached is the tika-config.xml and a pptx 
file that works.

We cannot provide the pptx file that does not work because it is client data.

We noticed a difference between the pptx files that work and the pptx file that 
does not work:  

"{*}Worksheet Object{*}" *is in the popup menu when the embedded Excel object 
is right-clicked in the pptx files that work.*

"{*}Edit Data{*}" *is in the popup menu when the embedded Excel object is 
right-clicked in the pptx file that does not work. This file might be created 
with an old version fo PowerPoint.*
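
One way to compare a working and a failing file without involving Tika is to list what the pptx package actually contains. The sketch below is only a diagnostic aid, not part of the reporter's code: it relies on a pptx file being a ZIP package and assumes that embedded parts are stored under ppt/embeddings/, which is conventional for PowerPoint output but not guaranteed. The class name PptxEmbeddingLister is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: lists the parts stored under ppt/embeddings/ in a pptx.
final class PptxEmbeddingLister {

    // Assumption: PowerPoint conventionally stores embedded parts under this prefix.
    private static final String EMBEDDINGS_PREFIX = "ppt/embeddings/";

    /** Returns the names of all non-directory ZIP entries under ppt/embeddings/. */
    static List<String> listEmbeddings(Path pptx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(pptx.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory() && entry.getName().startsWith(EMBEDDINGS_PREFIX)) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PptxEmbeddingLister /path/to/file.pptx
        for (String name : listEmbeddings(Path.of(args[0]))) {
            System.out.println(name);
        }
    }
}
```

Running this against both kinds of file and comparing the listed names may hint at whether the two files store the workbook differently.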

 

The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
2.9.1 and POI version is 5.2.3. 

 

import org.apache.pdfbox.io.IOUtils;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.io.FilenameUtils;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.io.*;
import java.net.URL;
import java.nio.file.Path;

public class ExtractExcelFromPowerPoint {
    private final Path pptxFile = new File("/home/ubuntu/testdirs/testdir_pptx/sample.pptx").toPath();
    private final Path outputDir = new File("/home/ubuntu/testdirs/testdir_pptx/tika_output/").toPath();

    private Parser parser;
    private ParseContext context;

    public static void main(String args[]) {
        try {
            new ExtractExcelFromPowerPoint().process();
        }
        catch(Exception ex) {
            ex.printStackTrace();
        }
    }

    public ExtractExcelFromPowerPoint() {
    }

    public void process() throws Exception {
        TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_pptx/tika-config.xml");
        FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor = new FileEmbeddedDocumentExtractor();

        parser = new AutoDetectParser(config);
        context = new ParseContext();
        context.set(Parser.class, parser);
        context.set(TikaConfig.class, config);
        context.set(EmbeddedDocumentExtractor.class, fileEmbeddedDocumentExtractor);

        URL url = pptxFile.toUri().toURL();
        Metadata metadata = new Metadata();
        try (InputStream input = TikaInputStream.get(url, metadata)) {
            ContentHandler handler = new DefaultHandler();
            parser.parse(input, handler, metadata, context);
        }
    }

    private class FileEmbeddedDocumentExtractor implements EmbeddedDocumentExtractor {
        private int count = 0;

        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler, Metadata metadata,
                                  boolean outputHtml) throws SAXException, IOException {
            String fullFileName = metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
            if (fullFileName == null) {
                fullFileName = "file" + count++;
            }

            String[] fileNameSplit = fullFileName.split("/");
            String fileName = fileNameSplit[fileNameSplit.length - 1];
            File outputFile = new File(outputDir.toFile(), FilenameUtils.normalize(fileName));
            System.out.println("Extracting '" + fileName + " to " + outputFile);
            FileOutputStream os = null;
            try {
                os = new FileOutputStream(outputFile);
                if (inputStream instanceof TikaInputStream tin) {
                    if (tin.getOpenContainer() instanceof DirectoryEntry) {
                        try(POIFSFileSystem fs = new POIFSFileSystem()){
                            copy((DirectoryEntry) tin.getOpenContainer(), fs.getRoot());

[jira] [Commented] (TIKA-4210) Not able to identify tika extension

2024-03-14 Thread Tika User (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827036#comment-17827036
 ] 

Tika User commented on TIKA-4210:
-

The attached file has a .doc extension, and Tika should detect two more 
embedded files in it. For those two files, the extension that Tika returns is 
empty.




first image : black arrow symbol

second image : dotted symbol

> Not able to identify tika extension
> ---
>
> Key: TIKA-4210
> URL: https://issues.apache.org/jira/browse/TIKA-4210
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Major
> Attachments: sample.DOC
>
>
> Hi Team,
> The attached file contains embedded .MPGA attachments whose extension Tika is 
> not able to identify. Tried in Tika versions 2.9.0 and 2.9.1; the extension 
> still shows as empty. Please look into this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Re: [PR] Bump com.google.protobuf:protobuf-java from 3.25.3 to 4.26.0 [tika]

2024-03-14 Thread via GitHub



dependabot[bot] commented on PR #1659:
URL: https://github.com/apache/tika/pull/1659#issuecomment-1997093738

   OK, I won't notify you again about this release, but will get in touch when 
a new version is available. If you'd rather skip all updates until the next 
major or minor version, let me know by commenting `@dependabot ignore this 
major version` or `@dependabot ignore this minor version`. You can also ignore 
all major, minor, or patch releases for a dependency by adding an [`ignore` 
condition](https://docs.github.com/en/code-security/supply-chain-security/configuration-options-for-dependency-updates#ignore)
 with the desired `update_types` to your config file.
   
   If you change your mind, just re-open this PR and I'll resolve any conflicts 
on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] Bump com.google.protobuf:protobuf-java from 3.25.3 to 4.26.0 [tika]

2024-03-14 Thread via GitHub



THausherr closed pull request #1659: Bump com.google.protobuf:protobuf-java 
from 3.25.3 to 4.26.0
URL: https://github.com/apache/tika/pull/1659



[jira] [Updated] (TIKA-4210) Not able to identify tika extension

2024-03-14 Thread Tika User (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tika User updated TIKA-4210:

Description: 
Hi Team,

The attached file contains embedded .MPGA attachments whose extension Tika is 
not able to identify. Tried in Tika versions 2.9.0 and 2.9.1; the extension 
still shows as empty. Please look into this.

  was:
Hi Team,

The attached file contains embedded .mega attachments whose extension Tika is 
not able to identify. Tried in Tika versions 2.9.0 and 2.9.1; the extension 
still shows as empty. Please look into this.


> Not able to identify tika extension
> ---
>
> Key: TIKA-4210
> URL: https://issues.apache.org/jira/browse/TIKA-4210
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Major
> Attachments: sample.DOC
>
>
> Hi Team,
> The attached file contains embedded .MPGA attachments whose extension Tika is 
> not able to identify. Tried in Tika versions 2.9.0 and 2.9.1; the extension 
> still shows as empty. Please look into this.




[jira] [Commented] (TIKA-4210) Not able to identify tika extension

2024-03-14 Thread Nick Burch (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827017#comment-17827017
 ] 

Nick Burch commented on TIKA-4210:
--

The attached file seems to be an RTF file. I'm not sure what a ".mega 
attachment" is, but this file doesn't seem to be one of them...

tika-app-2.9.1.jar is able to correctly identify this file as RTF
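
A file's extension says nothing about its content, which is why a file named sample.DOC can turn out to be RTF. As a rough illustration only (this is not Tika's detector, and the class name RtfMagicCheck is hypothetical), the following sketch checks whether a file starts with the `{\rtf` signature that RTF documents begin with:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: a content-based check that ignores the file extension.
final class RtfMagicCheck {

    // RTF documents begin with the ASCII bytes "{\rtf".
    private static final byte[] RTF_MAGIC = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes start with the RTF signature. */
    static boolean looksLikeRtf(byte[] head) {
        return head.length >= RTF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, RTF_MAGIC.length), RTF_MAGIC);
    }

    /** Reads only the first few bytes of the file and checks them. */
    static boolean looksLikeRtf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return looksLikeRtf(in.readNBytes(RTF_MAGIC.length));
        }
    }
}
```

A real detector such as Tika's considers many more signatures and container formats; this only shows the principle of looking at content rather than at the name.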

> Not able to identify tika extension
> ---
>
> Key: TIKA-4210
> URL: https://issues.apache.org/jira/browse/TIKA-4210
> Project: Tika
>  Issue Type: Bug
>Reporter: Tika User
>Priority: Major
> Attachments: sample.DOC
>
>
> Hi Team,
> The attached file contains embedded .mega attachments whose extension Tika is 
> not able to identify. Tried in Tika versions 2.9.0 and 2.9.1; the extension 
> still shows as empty. Please look into this.




[jira] [Created] (TIKA-4210) Not able to identify tika extension

2024-03-14 Thread Tika User (Jira)

Tika User created TIKA-4210:
---

 Summary: Not able to identify tika extension
 Key: TIKA-4210
 URL: https://issues.apache.org/jira/browse/TIKA-4210
 Project: Tika
  Issue Type: Bug
Reporter: Tika User
 Attachments: sample.DOC

Hi Team,

The attached file contains embedded .mega attachments whose extension Tika is 
not able to identify. Tried in Tika versions 2.9.0 and 2.9.1; the extension 
still shows as empty. Please look into this.




[jira] [Commented] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1

2024-03-14 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826996#comment-17826996
 ] 

Tilman Hausherr commented on TIKA-4199:
---

The original error you reported wasn't really a bug in commons-compress, but 
rather a change that caused more bytes to be read than Tika expected; see my 
first comment in COMPRESS-661. It resulted in several fixes in Tika.
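
The comment does not spell out the mechanism, so the following is only a generic illustration, not a reproduction of COMPRESS-661: one common way that "reading more bytes than expected" breaks a caller is when the caller marked a stream with a fixed limit and a callee reads past that limit before the caller calls reset(). The class and method names are hypothetical, and the tiny buffer size is chosen purely to make the effect visible with BufferedInputStream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo of a mark limit being exceeded before reset().
final class MarkLimitDemo {

    /**
     * Marks the stream with the given limit, reads up to bytesToRead bytes,
     * then tries to reset. Returns true if reset() succeeded.
     */
    static boolean peekAndReset(InputStream in, int markLimit, int bytesToRead) {
        in.mark(markLimit);
        try {
            for (int i = 0; i < bytesToRead; i++) {
                if (in.read() < 0) {
                    break; // end of stream
                }
            }
            in.reset(); // throws IOException if the mark has been invalidated
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** A 64-byte stream wrapped in a BufferedInputStream with a deliberately small buffer. */
    static InputStream smallBuffered(int bufferSize) {
        return new BufferedInputStream(new ByteArrayInputStream(new byte[64]), bufferSize);
    }

    public static void main(String[] args) {
        // Reading within the limit keeps the mark valid.
        System.out.println(peekAndReset(smallBuffered(4), 4, 4));
        // Reading past the limit may invalidate the mark, so reset() can fail.
        System.out.println(peekAndReset(smallBuffered(4), 4, 5));
    }
}
```

The stream contract only guarantees reset() while no more than the mark limit has been read, so code that peeks at a stream needs a limit at least as large as anything a downstream library may consume.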

> commons-compress 1.26.0 breaks Apache Tika 2.9.1
> 
>
> Key: TIKA-4199
> URL: https://issues.apache.org/jira/browse/TIKA-4199
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.1
>Reporter: Alexander Veit
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.9.2, 3.0.0
>
>
> An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 
> breaks Tika.
>  
> For more information see https://issues.apache.org/jira/browse/COMPRESS-661.




[jira] [Commented] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1

2024-03-14 Thread Alexander Veit (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826992#comment-17826992
 ] 

Alexander Veit commented on TIKA-4199:
--

The same error also occurs with Tika 2.9.1 and commons-compress 1.26.1.

> commons-compress 1.26.0 breaks Apache Tika 2.9.1
> 
>
> Key: TIKA-4199
> URL: https://issues.apache.org/jira/browse/TIKA-4199
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.1
>Reporter: Alexander Veit
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.9.2, 3.0.0
>
>
> An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 
> breaks Tika.
>  
> For more information see https://issues.apache.org/jira/browse/COMPRESS-661.




Re: [PR] Bump com.google.guava:guava from 33.0.0-jre to 33.1.0-jre [tika]

2024-03-14 Thread via GitHub



THausherr merged PR #1657:
URL: https://github.com/apache/tika/pull/1657



Re: [PR] Bump aws.version from 1.12.678 to 1.12.679 [tika]

2024-03-14 Thread via GitHub



THausherr merged PR #1658:
URL: https://github.com/apache/tika/pull/1658


