[PR] Bump org.springframework:spring-context from 5.3.32 to 5.3.33 [tika]
dependabot[bot] opened a new pull request, #1662:
URL: https://github.com/apache/tika/pull/1662

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

---

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] Bump aws.version from 1.12.679 to 1.12.680 [tika]
dependabot[bot] opened a new pull request, #1661:
URL: https://github.com/apache/tika/pull/1661

Bumps `aws.version` from 1.12.679 to 1.12.680.

Updates `com.amazonaws:aws-java-sdk-s3` from 1.12.679 to 1.12.680

Changelog
Sourced from com.amazonaws:aws-java-sdk-s3's changelog (https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md).

1.12.680 2024-03-14

AWS Amplify
- Features: Documentation updates for Amplify. Identifies the APIs available only to apps created using Amplify Gen 1.

AWS EC2 Instance Connect
- Features: This release includes a new exception type SerialConsoleSessionUnsupportedException for SendSerialConsoleSSHPublicKey API.

AWS Fault Injection Simulator
- Features: This release adds support for previewing target resources before running a FIS experiment. It also adds resource ARNs for actions, experiments, and experiment templates to API responses.

AWS Secrets Manager
- Features: Doc only update for Secrets Manager

Amazon Relational Database Service
- Features: Updates Amazon RDS documentation for EBCDIC collation for RDS for Db2.

Elastic Load Balancing
- Features: This release allows you to configure HTTP client keep-alive duration for communication between clients and Application Load Balancers.

Timestream InfluxDB
- Features: This is the initial SDK release for Amazon Timestream for InfluxDB. Amazon Timestream for InfluxDB is a new time-series database engine that makes it easy for application developers and DevOps teams to run InfluxDB databases on AWS for near real-time time-series applications using open source APIs.
Commits
- 03a1164 AWS SDK for Java 1.12.680 (https://github.com/aws/aws-sdk-java/commit/03a1164aa49caf56754378824c84ff74d7a3699b)
- 888a6b8 Update GitHub version number to 1.12.680-SNAPSHOT (https://github.com/aws/aws-sdk-java/commit/888a6b8ceb729c091a32ae5d492f049f6fe3f4d7)
- See full diff in compare view: https://github.com/aws/aws-sdk-java/compare/1.12.679...1.12.680

Updates `com.amazonaws:aws-java-sdk-transcribe` from 1.12.679 to 1.12.680, with the same changelog and commits as `com.amazonaws:aws-java-sdk-s3`.
[PR] Bump pdfbox.version from 3.0.1 to 3.0.2 [tika]
dependabot[bot] opened a new pull request, #1660:
URL: https://github.com/apache/tika/pull/1660

Bumps `pdfbox.version` from 3.0.1 to 3.0.2.

Updates `org.apache.pdfbox:xmpbox` from 3.0.1 to 3.0.2
Updates `org.apache.pdfbox:fontbox` from 3.0.1 to 3.0.2
Updates `org.apache.pdfbox:pdfbox` from 3.0.1 to 3.0.2
Updates `org.apache.pdfbox:pdfbox-tools` from 3.0.1 to 3.0.2
[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0
[ https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827248#comment-17827248 ]

Hudson commented on TIKA-4166:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1555 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1555/])
TIKA-4166: update mime4j (tilman: [https://github.com/apache/tika/commit/91820226e319e7deed535a31997d05f49dd60685])
* (edit) tika-parent/pom.xml

> dependency updates for Tika 3.0
> -------------------------------
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
> Issue Type: Task
> Components: build
> Reporter: Tilman Hausherr
> Priority: Minor
> Fix For: 3.0.0-BETA
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4213) Improvements to jdbc pipes reporter
Tim Allison created TIKA-4213:
------------------------------

Summary: Improvements to jdbc pipes reporter
Key: TIKA-4213
URL: https://issues.apache.org/jira/browse/TIKA-4213
Project: Tika
Issue Type: New Feature
Reporter: Tim Allison

We should use the "id" as the key, not the emitter key.

We should add a timestamp.

We should not block on waiting for more data on the queue -- this prevents actually writing the buffer when run in tika-server.
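The third point in TIKA-4213 can be illustrated with a small, self-contained sketch. It is not the Tika JDBC reporter's code: the class `BufferedReporterSketch`, the `StatusRow` type, and the in-memory list standing in for a JDBC batch write are all hypothetical. It only shows the general idea of polling the queue with a timeout instead of blocking indefinitely, so that a partially filled buffer is still written when no more data arrives. Each row is keyed by "id" and carries a timestamp, mirroring the first two points.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BufferedReporterSketch {

    /** Hypothetical status row: keyed by "id" and stamped with a timestamp. */
    public static final class StatusRow {
        public final String id;
        public final String status;
        public final long timestampMillis;

        StatusRow(String id, String status, long timestampMillis) {
            this.id = id;
            this.status = status;
            this.timestampMillis = timestampMillis;
        }
    }

    private final BlockingQueue<StatusRow> queue = new ArrayBlockingQueue<>(1000);
    private final List<StatusRow> buffer = new ArrayList<>();
    // Stand-in for the database: each inner list represents one batch write.
    private final List<List<StatusRow>> writtenBatches = new ArrayList<>();
    private final int batchSize;

    public BufferedReporterSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Enqueues a report without blocking; returns false if the queue is full. */
    public boolean report(String id, String status) {
        return queue.offer(new StatusRow(id, status, System.currentTimeMillis()));
    }

    /**
     * One iteration of the writer loop. Waits at most pollMillis for a row,
     * then flushes when the buffer is full or when the queue has gone idle.
     */
    public void pollOnce(long pollMillis) {
        StatusRow row;
        try {
            row = queue.poll(pollMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        if (row != null) {
            buffer.add(row);
        }
        boolean idle = (row == null);
        if (!buffer.isEmpty() && (idle || buffer.size() >= batchSize)) {
            flush();
        }
    }

    private void flush() {
        writtenBatches.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    public List<List<StatusRow>> writtenBatches() {
        return writtenBatches;
    }

    public static void main(String[] args) {
        BufferedReporterSketch reporter = new BufferedReporterSketch(10);
        reporter.report("doc-1", "PARSE_SUCCESS");
        reporter.pollOnce(50); // picks up the row; buffer is not yet full
        reporter.pollOnce(50); // queue is idle, so the partial buffer is flushed
        System.out.println("batches written: " + reporter.writtenBatches().size());
    }
}
```

With a blocking `take()` in place of the timed `poll`, the second iteration would wait forever on an empty queue and the single buffered row would never be written, which is the behaviour the issue describes.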
[jira] [Comment Edited] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx
[ https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827241#comment-17827241 ]

Tim Allison edited comment on TIKA-4211 at 3/14/24 8:20 PM:
------------------------------------------------------------

Step 3: Is there something like this in /ppt/slides/slide2.xml that references rId2? Is the structure exactly the same: graphic->graphicData->..->p:oleObj?

{code:java}
http://schemas.openxmlformats.org/presentationml/2006/ole;> http://schemas.openxmlformats.org/markup-compatibility/2006;>
{code}

was (Author: talli...@mitre.org):
Step 3: Is there something like this in /ppt/slides/slide2.xml:

{code:java}
http://schemas.openxmlformats.org/presentationml/2006/ole;> http://schemas.openxmlformats.org/markup-compatibility/2006;>
{code}
[jira] [Comment Edited] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx
[ https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827230#comment-17827230 ]

Tim Allison edited comment on TIKA-4211 at 3/14/24 8:17 PM:
------------------------------------------------------------

Step 2: In this file within the zip: /ppt/slides/_rels/slide2.xml.rels: Do you see something like this that specifies the xlsx file as "rId2"?

{code:java}
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="…" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image1.emf"/>
  <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/package" Target="../embeddings/Microsoft_Excel_Worksheet.xlsx"/>
  <Relationship Id="…" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout" Target="../slideLayouts/slideLayout2.xml"/>
</Relationships>
{code}

Or, if you grep for "embeddings" in the uncompressed zip, can you find a link to the xlsx file?

was (Author: talli...@mitre.org):
In this file within the zip: /ppt/slides/_rels/slide2.xml.rels: Do you see something like this that specifies the xlsx file as "rId2"?

{code:java}
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="…" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="../media/image1.emf"/>
  <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/package" Target="../embeddings/Microsoft_Excel_Worksheet.xlsx"/>
  <Relationship Id="…" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout" Target="../slideLayouts/slideLayout2.xml"/>
</Relationships>
{code}
[jira] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx
[ https://issues.apache.org/jira/browse/TIKA-4211 ]

Tim Allison deleted comment on TIKA-4211:
-----------------------------------------

was (Author: talli...@mitre.org):
Or, if you grep for "embeddings" in the uncompressed zip, can you find a link to the xlsx file?

> Tika extractor fails to extract embedded excel from pptx
> --------------------------------------------------------
>
> Key: TIKA-4211
> URL: https://issues.apache.org/jira/browse/TIKA-4211
> Project: Tika
> Issue Type: Bug
> Reporter: Xiaohong Yang
> Priority: Major
> Attachments: config_and_sample_file.zip
>
> We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded excel from PowerPoint presentation. It works with most pptx files. But it fails to detect the embedded excel with some pptx files.
> Following is the sample code and attached is the tika-config.xml and a pptx file that works.
> We cannot provide the pptx file that does not work because it is client data.
> We noticed a difference between the pptx files that work and the pptx file that does not work:
> "{*}Worksheet Object{*}" *is in the popup menu when the embedded Excel object is right-clicked in the pptx files that work.*
> "{*}Edit Data{*}" *is in the popup menu when the embedded Excel object is right-clicked in the pptx file that does not work. This file might be created with an old version of PowerPoint.*
>
> The operating system is Ubuntu 20.04. Java version is 17. Tika version is 2.9.1 and POI version is 5.2.3.
>
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.poi.poifs.filesystem.DirectoryEntry;
> import org.apache.poi.poifs.filesystem.DocumentEntry;
> import org.apache.poi.poifs.filesystem.DocumentInputStream;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.extractor.EmbeddedDocumentExtractor;
> import org.apache.tika.io.FilenameUtils;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
>
> import java.io.*;
> import java.net.URL;
> import java.nio.file.Path;
>
> public class ExtractExcelFromPowerPoint {
>     private final Path pptxFile = new File("/home/ubuntu/testdirs/testdir_pptx/sample.pptx").toPath();
>     private final Path outputDir = new File("/home/ubuntu/testdirs/testdir_pptx/tika_output/").toPath();
>
>     private Parser parser;
>     private ParseContext context;
>
>     public static void main(String args[]) {
>         try {
>             new ExtractExcelFromPowerPoint().process();
>         }
>         catch(Exception ex) {
>             ex.printStackTrace();
>         }
>     }
>
>     public ExtractExcelFromPowerPoint() {
>     }
>
>     public void process() throws Exception {
>         TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_pptx/tika-config.xml");
>         FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor = new FileEmbeddedDocumentExtractor();
>
>         parser = new AutoDetectParser(config);
>         context = new ParseContext();
>         context.set(Parser.class, parser);
>         context.set(TikaConfig.class, config);
>         context.set(EmbeddedDocumentExtractor.class, fileEmbeddedDocumentExtractor);
>
>         URL url = pptxFile.toUri().toURL();
>         Metadata metadata = new Metadata();
>         try (InputStream input = TikaInputStream.get(url, metadata)) {
>             ContentHandler handler = new DefaultHandler();
>             parser.parse(input, handler, metadata, context);
>         }
>     }
>
>     private class FileEmbeddedDocumentExtractor implements EmbeddedDocumentExtractor {
>         private int count = 0;
>
>         public boolean shouldParseEmbedded(Metadata metadata) {
>             return true;
>         }
>
>         public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler, Metadata metadata,
>                                   boolean outputHtml) throws SAXException, IOException {
>             String fullFileName = metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
>             if (fullFileName == null) {
>                 fullFileName = "file" + count++;
>             }
>
>             String[] fileNameSplit = fullFileName.split("/");
>             String fileName = fileNameSplit[fileNameSplit.length - 1];
>             File outputFile = new File(outputDir.toFile(), FilenameUtils.normalize(fileName));
>             System.out.println("Extracting '" + fileName + " to " + outputFile);
[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx
[ https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827241#comment-17827241 ]

Tim Allison commented on TIKA-4211:
-----------------------------------

Step 3: Is there something like this in /ppt/slides/slide2.xml:

{code:java}
http://schemas.openxmlformats.org/presentationml/2006/ole;> http://schemas.openxmlformats.org/markup-compatibility/2006;>
{code}
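The Step 3 check, whether the slide XML contains a `p:oleObj` element that points at the relationship id of the embedded workbook, can also be done programmatically. The following is a minimal sketch using only the JDK's DOM parser, not Tika's or POI's own logic. The class name `OleObjRefSketch` and the sample fragment in `main` are hypothetical, and it assumes the reference is carried in an `id` attribute in the officeDocument relationships namespace (`r:id`).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class OleObjRefSketch {

    // Namespace assumed for the r:id attribute on p:oleObj.
    static final String REL_NS =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    /** Returns the r:id value of every oleObj element found in the slide XML. */
    public static List<String> oleObjRelIds(InputStream slideXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder().parse(slideXml);

            // Match on the local name only, so the element is found at any nesting depth.
            NodeList nodes = doc.getElementsByTagNameNS("*", "oleObj");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element oleObj = (Element) nodes.item(i);
                String id = oleObj.getAttributeNS(REL_NS, "id");
                if (!id.isEmpty()) {
                    ids.add(id);
                }
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse slide XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, heavily reduced slide fragment for illustration only.
        String xml = "<p:sld"
                + " xmlns:p=\"http://schemas.openxmlformats.org/presentationml/2006/main\""
                + " xmlns:r=\"" + REL_NS + "\">"
                + "<p:oleObj r:id=\"rId2\"/>"
                + "</p:sld>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(oleObjRelIds(in));
    }
}
```

To use it on a real file, the `ppt/slides/slide2.xml` entry would be read from the pptx zip and passed in as the stream. An empty result for a slide whose rels file does list the xlsx would suggest that the embedded object is referenced through a different structure.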
[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx
[ https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827233#comment-17827233 ]

Tim Allison commented on TIKA-4211:
-----------------------------------

Or, if you grep for "embeddings" in the uncompressed zip, can you find a link to the xlsx file?
[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx
[ https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827230#comment-17827230 ] Tim Allison commented on TIKA-4211: --- In this file within the zip: /ppt/slides/_rels/slide2.xml.rels: Do you see something like this that specifies the xlsx file as "rId2"? {code:java} http://schemas.openxmlformats.org/package/2006/relationships;>http://schemas.openxmlformats.org/officeDocument/2006/relationships/image; Target="../media/image1.emf"/>http://schemas.openxmlformats.org/officeDocument/2006/relationships/package; Target="../embeddings/Microsoft_Excel_Worksheet.xlsx"/>http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout; Target="../slideLayouts/slideLayout2.xml"/> {code}
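The check Tim describes (does a slide's .rels part point at the embedded workbook, and under which Id?) can be scripted rather than done by eye. What follows is only a minimal sketch using the JDK's zip and DOM classes, not anything Tika itself does. The class name `SlideRelsCheck` and the sample XML in `main` are made up for illustration; only the "rId2" Id, the package relationship type, and the Target path come from the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika or POI.
public class SlideRelsCheck {
    static final String PACKAGE_TYPE =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/package";

    /** Returns "Id -> Target" for every Relationship element with the given Type. */
    static List<String> relationshipsOfType(InputStream relsXml, String type) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        NodeList nodes = factory.newDocumentBuilder().parse(relsXml)
                .getElementsByTagNameNS("*", "Relationship");
        List<String> found = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element rel = (Element) nodes.item(i);
            if (type.equals(rel.getAttribute("Type"))) {
                found.add(rel.getAttribute("Id") + " -> " + rel.getAttribute("Target"));
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // Illustrative sample only; a real .rels part lists more relationships.
            String sample = "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">"
                    + "<Relationship Id=\"rId2\" Type=\"" + PACKAGE_TYPE
                    + "\" Target=\"../embeddings/Microsoft_Excel_Worksheet.xlsx\"/>"
                    + "</Relationships>";
            InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
            System.out.println(relationshipsOfType(in, PACKAGE_TYPE));
            return;
        }
        // args[0] is the path to a pptx; scan every slide's relationship part.
        try (ZipFile zip = new ZipFile(args[0])) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                if (name.startsWith("ppt/slides/_rels/") && name.endsWith(".rels")) {
                    try (InputStream in = zip.getInputStream(entry)) {
                        System.out.println(name + ": " + relationshipsOfType(in, PACKAGE_TYPE));
                    }
                }
            }
        }
    }
}
```

Running it against the working and the failing pptx and comparing the output may show whether the failing file references its workbook under a different relationship type or target, which would be a useful detail to add to the ticket.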
[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx
[ https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827221#comment-17827221 ] Xiaohong Yang commented on TIKA-4211: - Hi Tim, Yes, I found the right file /ppt/embeddings/Microsoft_Excel_Worksheet.xlsx after unzipping the pptx.
[jira] [Updated] (TIKA-4212) Tika fails to get file extension of file type image/x-rtf-raw-bitmap
[ https://issues.apache.org/jira/browse/TIKA-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaohong Yang updated TIKA-4212: Attachment: tika-config-and-sample-file.zip
[jira] [Created] (TIKA-4212) Tika fails to get file extension of file type image/x-rtf-raw-bitmap
Xiaohong Yang created TIKA-4212:
-----------------------------------

Summary: Tika fails to get file extension of file type image/x-rtf-raw-bitmap
Key: TIKA-4212
URL: https://issues.apache.org/jira/browse/TIKA-4212
Project: Tika
Issue Type: Bug
Reporter: Xiaohong Yang

We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded objects from Word documents. Two embedded objects are extracted from the sample doc file. Their file type is image/x-rtf-raw-bitmap, but Tika fails to return a file extension from the following method call:

tikaExtension = config.getMimeRepository().forName(contentType.toString()).getExtension();

Could you fix the problem in the Tika library? Could you also tell us the file extension for the file type image/x-rtf-raw-bitmap?

The sample code follows; the tika-config.xml and the sample Word file are attached.

The operating system is Ubuntu 20.04. Java version is 17. Tika version is 2.9.1 and POI version is 5.2.3.

import org.apache.pdfbox.io.IOUtils;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.io.FilenameUtils;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.io.*;
import java.net.URL;
import java.nio.file.Path;

public class ExtractBitMapFromWord {
    private final Path docFile = new File("/home/ubuntu/testdirs/testdir_doc/sample.DOC").toPath();
    private final Path outputDir = new File("/home/ubuntu/testdirs/testdir_doc/tika_output/").toPath();

    private Parser parser;
    private ParseContext context;

    public static void main(String args[]) {
        try {
            new ExtractBitMapFromWord().process();
        }
        catch(Exception ex) {
            ex.printStackTrace();
        }
    }

    public ExtractBitMapFromWord() {
    }

    public void process() throws Exception {
        TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_doc/tika-config.xml");
        ExtractBitMapFromWord.FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor = new ExtractBitMapFromWord.FileEmbeddedDocumentExtractor();

        parser = new AutoDetectParser(config);
        context = new ParseContext();
        context.set(Parser.class, parser);
        context.set(TikaConfig.class, config);
        context.set(EmbeddedDocumentExtractor.class, fileEmbeddedDocumentExtractor);

        URL url = docFile.toUri().toURL();
        Metadata metadata = new Metadata();
        try (InputStream input = TikaInputStream.get(url, metadata)) {
            ContentHandler handler = new DefaultHandler();
            parser.parse(input, handler, metadata, context);
        }
    }

    private class FileEmbeddedDocumentExtractor implements EmbeddedDocumentExtractor {
        private int count = 0;

        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler, Metadata metadata,
                                  boolean outputHtml) throws SAXException, IOException {
            String fullFileName = metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
            if (fullFileName == null) {
                fullFileName = "file" + count++;
            }

            TikaConfig config = null;
            try {
                config = new TikaConfig("/home/ubuntu/testdirs/testdir_doc/tika-config.xml");
            } catch (Exception ex) {
                ex.printStackTrace();
            }
            if (config == null) {
                return;
            }

            Detector detector = config.getDetector();
            MediaType contentType = detector.detect(inputStream, metadata);
            String tikaExtension = null;
            if(fullFileName.indexOf('.') == -1 && contentType != null){
                try {
                    tikaExtension = config.getMimeRepository().forName(contentType.toString()).getExtension();
                } catch (Exception ex) {
                    ex.printStackTrace();
                }
                if (tikaExtension != null &&
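Until the type has a registered extension, the calling code can guard against an empty result. The sketch below is an assumption-laden workaround, not a fix in Tika: it assumes `getExtension()` returns an empty string when no extension is known for a type (as the MimeType javadoc describes), and the `.bin` fallback is an arbitrary placeholder, not the real extension for image/x-rtf-raw-bitmap. The class and method names are hypothetical.

```java
// Hypothetical helper; it only decides which name to write the embedded file under.
public class EmbeddedFileNames {
    /** Placeholder fallback, chosen for illustration only. */
    static final String FALLBACK_EXTENSION = ".bin";

    /**
     * Appends an extension to a name that has none.
     * detectedExtension is whatever the MIME lookup returned; it may be null or empty.
     */
    static String withExtension(String fileName, String detectedExtension) {
        if (fileName.indexOf('.') != -1) {
            return fileName; // already has an extension, same test as the sample code
        }
        boolean missing = detectedExtension == null || detectedExtension.isEmpty();
        return fileName + (missing ? FALLBACK_EXTENSION : detectedExtension);
    }

    public static void main(String[] args) {
        System.out.println(withExtension("file0", ""));      // no extension registered
        System.out.println(withExtension("file1", ".png"));  // extension known
        System.out.println(withExtension("image.emf", ""));  // name left untouched
    }
}
```

In the sample code above, the value passed as `detectedExtension` would be `tikaExtension`.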
[jira] [Commented] (TIKA-4210) Not able to identify tika extension
[ https://issues.apache.org/jira/browse/TIKA-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827193#comment-17827193 ] Tim Allison commented on TIKA-4210: --- Those files look like this in the rtf file: {code:java} {\pict\wbitmap0\picw14\pich26\wbmbitspixel1\wbmplanes1\wbmwidthbytes2\picwGoal210\pichGoal390 fffcbffc9ffc8ffc87fc83fc81fc80fc807c803c801c800c8004800c801c803c807c80fc81fc83fc87fc8ffc9ffcbffcfffcfffc}\ {code} and {code:java} {\pict\wbitmap0\picw173\pich7\wbmbitspixel1\wbmplanes1\wbmwidthbytes22\picwGoal2076\pichGoal84 fff8c7f8ff1fe3fc7f8ff1fe3fc7f8ff1fe3fc7f8ff1fe38b7f6fedfdbfb7f6fedfdbfb7f6fedfdbfb7f6fedfdb893f27e4fc9f93f27e4fc9f93f27e4fc9f93f27e4fc98b7f6fedfdbfb7f6fedfdbfb7f6fedfdbfb7f6fedfdb8cff9ff3fe7fcff9ff3fe7fcff9ff3fe7fcff9ff3fe78fff8} {code} > Not able to identify tika extension > --- > > Key: TIKA-4210 > URL: https://issues.apache.org/jira/browse/TIKA-4210 > Project: Tika > Issue Type: Bug >Reporter: Tika User >Priority: Major > Attachments: sample.DOC > > > Hi Team, > The attached embedded file contains .MPGA attachments whose extension Tika is not able > to identify. Tried in Tika versions 2.9.0 and 2.9.1; the extension still > shows as empty. Please look into this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
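The two \pict groups can be sanity-checked with a little arithmetic. Assuming \wbmwidthbytes is the number of bytes per raster line and \pich the number of lines (my reading of the RTF picture control words, stated here as an assumption), the payload should be wbmwidthbytes × pich bytes: 2 × 26 = 52 for the first picture and 22 × 7 = 154 for the second. A small sketch that decodes the first picture's hex payload and compares the sizes (class and method names are made up):

```java
// Hypothetical check, independent of Tika's RTF parser.
public class RtfRawBitmapCheck {
    /** Decodes a string of hex digits (two per byte) into bytes. */
    static byte[] decodeHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    /** Expected payload size, assuming widthBytes bytes per raster line. */
    static int expectedBytes(int widthBytes, int height) {
        return widthBytes * height;
    }

    public static void main(String[] args) {
        // Payload of the first \pict group quoted above (wbmwidthbytes2, pich26).
        String first = "fffcbffc9ffc8ffc87fc83fc81fc80fc807c803c801c800c8004"
                + "800c801c803c807c80fc81fc83fc87fc8ffc9ffcbffcfffcfffc";
        System.out.println("decoded " + decodeHex(first).length
                + " bytes, expected " + expectedBytes(2, 26));
    }
}
```

So the embedded objects appear to be headerless 1-bit raster data rather than files in a standalone image format, which would be consistent with no file extension being registered for image/x-rtf-raw-bitmap.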
[jira] [Commented] (TIKA-4210) Not able to identify tika extension
[ https://issues.apache.org/jira/browse/TIKA-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827191#comment-17827191 ]

Tim Allison commented on TIKA-4210:
-----------------------------------

Nick is right. The file is an RTF file. Tika does find two embedded files identified as x-rtf-raw-bitmap. We don't have a parser for that format, I don't think.

{code:java}
[
  {
    "Content-Length": "19619",
    "Content-Type": "application/rtf",
    "X-TIKA:Parsed-By": [
      "org.apache.tika.parser.DefaultParser",
      "org.apache.tika.parser.microsoft.rtf.RTFParser"
    ],
    "X-TIKA:Parsed-By-Full-Set": [
      "org.apache.tika.parser.DefaultParser",
      "org.apache.tika.parser.microsoft.rtf.RTFParser",
      "org.apache.tika.parser.EmptyParser"
    ],
    "X-TIKA:content": "...",
    "X-TIKA:content_handler": "ToTextContentHandler",
    "X-TIKA:embedded_depth": "0",
    "X-TIKA:parse_time_millis": "143",
    "resourceName": "sample.DOC.rtf"
  },
  {
    "Content-Length": "52",
    "Content-Type": "image/x-rtf-raw-bitmap",
    "Content-Type-Parser-Override": "image/x-rtf-raw-bitmap",
    "X-TIKA:Parsed-By": "org.apache.tika.parser.EmptyParser",
    "X-TIKA:embedded_depth": "1",
    "X-TIKA:embedded_id": "1",
    "X-TIKA:embedded_id_path": "/1",
    "X-TIKA:embedded_resource_path": "/file_0",
    "X-TIKA:parse_time_millis": "1",
    "resourceName": "file_0",
    "rtf_meta:thumbnail": "false"
  },
  {
    "Content-Length": "154",
    "Content-Type": "image/x-rtf-raw-bitmap",
    "Content-Type-Parser-Override": "image/x-rtf-raw-bitmap",
    "X-TIKA:Parsed-By": "org.apache.tika.parser.EmptyParser",
    "X-TIKA:embedded_depth": "1",
    "X-TIKA:embedded_id": "2",
    "X-TIKA:embedded_id_path": "/2",
    "X-TIKA:embedded_resource_path": "/file_1",
    "X-TIKA:parse_time_millis": "0",
    "resourceName": "file_1",
    "rtf_meta:thumbnail": "false"
  }
]
{code}
[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx
[ https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827190#comment-17827190 ] Tim Allison commented on TIKA-4211: --- Y, as you point out, Tika works with the example file that you shared. I can't do much without a file to work on. If you're able to share it privately, I can take a look. Otherwise, we can debug it together. Step 1: unzip the pptx, do you see the embedded xlsx anywhere in the zip file. In the attached file, the xlsx file is /ppt/embeddings/Microsoft_Excel_Worksheet.xlsx
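Step 1 does not strictly need an unzip tool; a pptx is a zip container, so its entries can be listed directly. A minimal sketch with the JDK's `ZipInputStream` follows. The class name is hypothetical, and the assumption that embedded files live under `ppt/embeddings/` is taken from the working sample mentioned in the thread; a failing file may store them elsewhere, which is exactly what the listing would reveal.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for inspecting a pptx; it does not use Tika or POI.
public class PptxEmbeddingLister {
    /** Returns the names of all non-directory zip entries under ppt/embeddings/. */
    static List<String> embeddings(InputStream pptx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(pptx)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                // Entry names normally have no leading slash; tolerate one anyway.
                String name = entry.getName().startsWith("/")
                        ? entry.getName().substring(1) : entry.getName();
                if (!entry.isDirectory() && name.startsWith("ppt/embeddings/")) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PptxEmbeddingLister <file.pptx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            embeddings(in).forEach(System.out::println);
        }
    }
}
```

For the attached sample this should list ppt/embeddings/Microsoft_Excel_Worksheet.xlsx. An empty list for the failing file, or an entry with a different name or extension, would narrow down where the extractor loses the workbook.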
[jira] [Created] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx
Xiaohong Yang created TIKA-4211:
-----------------------------------

Summary: Tika extractor fails to extract embedded excel from pptx
Key: TIKA-4211
URL: https://issues.apache.org/jira/browse/TIKA-4211
Project: Tika
Issue Type: Bug
Reporter: Xiaohong Yang
Attachments: config_and_sample_file.zip

We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded Excel files from PowerPoint presentations. It works with most pptx files, but it fails to detect the embedded Excel file in some pptx files.

The sample code follows; the tika-config.xml and a pptx file that works are attached. We cannot provide the pptx file that does not work because it is client data.

We noticed a difference between the pptx files that work and the pptx file that does not work:

"Worksheet Object" is in the popup menu when the embedded Excel object is right-clicked in the pptx files that work.

"Edit Data" is in the popup menu when the embedded Excel object is right-clicked in the pptx file that does not work. This file might have been created with an old version of PowerPoint.

The operating system is Ubuntu 20.04. Java version is 17. Tika version is 2.9.1 and POI version is 5.2.3.

import org.apache.pdfbox.io.IOUtils;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.io.FilenameUtils;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.io.*;
import java.net.URL;
import java.nio.file.Path;

public class ExtractExcelFromPowerPoint {
    private final Path pptxFile = new File("/home/ubuntu/testdirs/testdir_pptx/sample.pptx").toPath();
    private final Path outputDir = new File("/home/ubuntu/testdirs/testdir_pptx/tika_output/").toPath();

    private Parser parser;
    private ParseContext context;

    public static void main(String args[]) {
        try {
            new ExtractExcelFromPowerPoint().process();
        }
        catch(Exception ex) {
            ex.printStackTrace();
        }
    }

    public ExtractExcelFromPowerPoint() {
    }

    public void process() throws Exception {
        TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_pptx/tika-config.xml");
        FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor = new FileEmbeddedDocumentExtractor();

        parser = new AutoDetectParser(config);
        context = new ParseContext();
        context.set(Parser.class, parser);
        context.set(TikaConfig.class, config);
        context.set(EmbeddedDocumentExtractor.class, fileEmbeddedDocumentExtractor);

        URL url = pptxFile.toUri().toURL();
        Metadata metadata = new Metadata();
        try (InputStream input = TikaInputStream.get(url, metadata)) {
            ContentHandler handler = new DefaultHandler();
            parser.parse(input, handler, metadata, context);
        }
    }

    private class FileEmbeddedDocumentExtractor implements EmbeddedDocumentExtractor {
        private int count = 0;

        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler, Metadata metadata,
                                  boolean outputHtml) throws SAXException, IOException {
            String fullFileName = metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
            if (fullFileName == null) {
                fullFileName = "file" + count++;
            }

            String[] fileNameSplit = fullFileName.split("/");
            String fileName = fileNameSplit[fileNameSplit.length - 1];
            File outputFile = new File(outputDir.toFile(), FilenameUtils.normalize(fileName));
            System.out.println("Extracting '" + fileName + "' to " + outputFile);
            FileOutputStream os = null;
            try {
                os = new FileOutputStream(outputFile);
                if (inputStream instanceof TikaInputStream tin) {
                    if (tin.getOpenContainer() instanceof DirectoryEntry) {
                        try(POIFSFileSystem fs = new POIFSFileSystem()){
                            copy((DirectoryEntry) tin.getOpenContainer(), fs.getRoot());
[jira] [Commented] (TIKA-4210) Not able to identify tika extension
[ https://issues.apache.org/jira/browse/TIKA-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827036#comment-17827036 ] Tika User commented on TIKA-4210: - The attached file has a .doc extension. Tika should detect two more files embedded in it, but the Tika extension for those files comes back empty. First image: black arrow symbol. Second image: dotted symbol.
Re: [PR] Bump com.google.protobuf:protobuf-java from 3.25.3 to 4.26.0 [tika]
dependabot[bot] commented on PR #1659: URL: https://github.com/apache/tika/pull/1659#issuecomment-1997093738 OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. You can also ignore all major, minor, or patch releases for a dependency by adding an [`ignore` condition](https://docs.github.com/en/code-security/supply-chain-security/configuration-options-for-dependency-updates#ignore) with the desired `update_types` to your config file. If you change your mind, just re-open this PR and I'll resolve any conflicts on it.
Re: [PR] Bump com.google.protobuf:protobuf-java from 3.25.3 to 4.26.0 [tika]
THausherr closed pull request #1659: Bump com.google.protobuf:protobuf-java from 3.25.3 to 4.26.0 URL: https://github.com/apache/tika/pull/1659
[jira] [Updated] (TIKA-4210) Not able to identify tika extension
[ https://issues.apache.org/jira/browse/TIKA-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tika User updated TIKA-4210: Description: Hi Team, The attached embedded file contains .MPGA attachments whose extension Tika is not able to identify. Tried in Tika versions 2.9.0 and 2.9.1; the extension still shows as empty. Please look into this. was: Hi Team, The attached embedded file contains .mega attachments whose extension Tika is not able to identify. Tried in Tika versions 2.9.0 and 2.9.1; the extension still shows as empty. Please look into this.
[jira] [Commented] (TIKA-4210) Not able to identify tika extension
[ https://issues.apache.org/jira/browse/TIKA-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827017#comment-17827017 ] Nick Burch commented on TIKA-4210: -- The attached file seems to be an RTF file. I'm not sure what a ".mega attachment" is, but this file doesn't seem to be one of them... tika-app-2.9.1.jar is able to correctly identify this file as RTF
[jira] [Created] (TIKA-4210) Not able to identify tika extension
Tika User created TIKA-4210: --- Summary: Not able to identify tika extension Key: TIKA-4210 URL: https://issues.apache.org/jira/browse/TIKA-4210 Project: Tika Issue Type: Bug Reporter: Tika User Attachments: sample.DOC Hi Team, The attached embedded file contains .mega attachments whose extension Tika is not able to identify. Tried in Tika versions 2.9.0 and 2.9.1; the extension still shows as empty. Please look into this.
[jira] [Commented] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1
[ https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826996#comment-17826996 ] Tilman Hausherr commented on TIKA-4199: --- The original error you reported wasn't really a bug in commons compress, rather a change that more bytes were read than tika expected, see my first comment in COMPRESS-661. It resulted in several fixes in tika. > commons-compress 1.26.0 breaks Apache Tika 2.9.1 > > > Key: TIKA-4199 > URL: https://issues.apache.org/jira/browse/TIKA-4199 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 2.9.1 >Reporter: Alexander Veit >Assignee: Tilman Hausherr >Priority: Major > Fix For: 2.9.2, 3.0.0 > > > An update to commons-compress 1.26.0 to fix CVE-2024-25710 and CVE-2024-26308 > breaks Tika. > > For more information see https://issues.apache.org/jira/browse/COMPRESS-661. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4199) commons-compress 1.26.0 breaks Apache Tika 2.9.1
[ https://issues.apache.org/jira/browse/TIKA-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826992#comment-17826992 ] Alexander Veit commented on TIKA-4199: -- The same error also occurs with Tika 2.9.1 and commons-compress 1.26.1.
Re: [PR] Bump com.google.guava:guava from 33.0.0-jre to 33.1.0-jre [tika]
THausherr merged PR #1657: URL: https://github.com/apache/tika/pull/1657
Re: [PR] Bump aws.version from 1.12.678 to 1.12.679 [tika]
THausherr merged PR #1658: URL: https://github.com/apache/tika/pull/1658