[jira] [Updated] (TIKA-4276) Tika fails to detect damaged pdf

2024-07-10 Thread Xiaohong Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaohong Yang updated TIKA-4276:

Description: 
We use Tika to check file type and extension. However, Tika detects some 
damaged PDF files as plain text.

We wonder whether Tika can be made to detect a damaged PDF file with the PDF 
file type and extension.

Below is the sample code; the tika-config.xml and the sample PDF file are 
available at [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es]

The operating system is Ubuntu 20.04, the Java version is 21, the Tika version 
is 2.9.2, and the POI version is 5.2.3.

import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeType;

import java.io.FileInputStream;

public class DetectDamagedPDF {

    public static void main(String[] args) {
        try {
            String filePath = "/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf";
            TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml");
            Detector detector = config.getDetector();
            Metadata metadata = new Metadata();
            FileInputStream fis = new FileInputStream(filePath);
            TikaInputStream stream = TikaInputStream.get(fis);
            metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath);
            MediaType mediaType = detector.detect(stream, metadata);
            MimeType mimeType = config.getMimeRepository().forName(mediaType.toString());
            String tikaExtension = mimeType.getExtension();
            System.out.println("tikaExtension = " + tikaExtension);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
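For background on the reported behavior: if a damaged file no longer matches the PDF magic bytes, detection can fall back to text/plain. A caller-side workaround is possible in the meantime. The following is a sketch, not a Tika API; the class name, the scan window, and the extension fallback are assumptions made for illustration. It scans the leading bytes for a "%PDF-" marker and/or trusts the .pdf resource name when generic detection returns text/plain:

```java
import java.nio.charset.StandardCharsets;

public class PdfFallback {

    // Scan the leading bytes for the "%PDF-" version marker; damaged files
    // sometimes bury it behind junk instead of having it at offset 0.
    static boolean looksLikePdf(byte[] head) {
        return new String(head, StandardCharsets.ISO_8859_1).contains("%PDF-");
    }

    // If generic detection says text/plain but the magic marker or the
    // resource name suggests a PDF, override the detected type.
    static String chooseType(String detected, String resourceName, byte[] head) {
        if ("text/plain".equals(detected)
                && (looksLikePdf(head) || resourceName.toLowerCase().endsWith(".pdf"))) {
            return "application/pdf";
        }
        return detected;
    }

    public static void main(String[] args) {
        // Simulated damaged file: junk bytes before the PDF header.
        byte[] damaged = "junk bytes\n%PDF-1.7\n...".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(chooseType("text/plain", "DamagedPDF.pdf", damaged));
        // -> application/pdf
    }
}
```

The override would run after `detector.detect(...)` in the sample above, using the detected media type and the first block of the file's bytes.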

 




[jira] [Created] (TIKA-4276) Tika fails to detect damaged pdf

2024-07-10 Thread Xiaohong Yang (Jira)
Xiaohong Yang created TIKA-4276:
---

 Summary: Tika fails to detect damaged pdf
 Key: TIKA-4276
 URL: https://issues.apache.org/jira/browse/TIKA-4276
 Project: Tika
  Issue Type: Bug
Affects Versions: 2.9.2
Reporter: Xiaohong Yang


[^Tika_Config_and_Sample_PDF.zip]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4245) Tika does not get html content properly

2024-04-26 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841209#comment-17841209
 ] 

Xiaohong Yang commented on TIKA-4245:
-

[~tilman]  Can you detect the right charset (utf-8) and fix the issue?
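For what it's worth, the garbled output quoted in the description is exactly what the bytes of the file's opening tag look like when decoded as UTF-16BE instead of UTF-8: each pair of ASCII bytes fuses into one CJK code point ('<' 0x3C + 'h' 0x68 = U+3C68 '㱨'). A small self-contained check, plain Java with no Tika involved:

```java
import java.nio.charset.StandardCharsets;

// Decoding the ASCII/UTF-8 bytes of the sample file's opening tag as
// UTF-16BE reproduces the CJK "garbage" quoted in the issue, which points
// at the charset detector choosing UTF-16 rather than UTF-8 for this file.
public class CharsetGarbleDemo {
    public static void main(String[] args) {
        String original = "<html xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">\r";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        String misread = new String(bytes, StandardCharsets.UTF_16BE);
        // prints 㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍
        System.out.println(misread);
    }
}
```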

> Tika does not get html content properly 
> 
>
> Key: TIKA-4245
> URL: https://issues.apache.org/jira/browse/TIKA-4245
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: Sample html file and tika config xml.zip
>
>
> We use org.apache.tika.parser.AutoDetectParser to get the content of HTML 
> files, and we found out that it does not extract the content of the sample 
> file properly.
> Following is the sample code and attached is the tika-config.xml and the 
> sample html file.  The content extracted with Tika reads 
> "㱨瑭氠硭汮猺景㴢桴瑰㨯⽷睷⹷㌮潲术ㄹ㤹⽘卌⽆潲浡琢㸍ਉ़桥慤㸼䵅呁⁨瑴瀭敱畩瘽≃潮瑥湴ⵔ祰攢⁣潮瑥湴㴢瑥硴…". That is different 
> from the native file.
>  
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.2.   
>  {code:java}
> import org.apache.commons.io.FileUtils;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
>  
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.PrintWriter;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>  
> public class ExtractTxtFromHtml {
> private static final Path inputFile = new 
> File("/home/ubuntu/testdirs/testdir_html/451434.html").toPath();
>  
> public static void main(String args[]) {
> extractText(false);
> extractText(true);
> }
>  
> static void extractText(boolean largeFile) {
> PrintWriter outputFileWriter = null;
> try {
> BodyContentHandler handler;
> Path outputFilePath = null;
>  
> if (largeFile) {
> // write tika output to disk
> outputFilePath = 
> Paths.get("/home/ubuntu/testdirs/testdir_html/tika_parse_output.txt");
> outputFileWriter = new 
> PrintWriter(Files.newOutputStream(outputFilePath));
> handler = new BodyContentHandler(outputFileWriter);
> } else {
> // stream it in memory
> handler = new BodyContentHandler(-1);
> }
>  
> Metadata metadata = new Metadata();
> FileInputStream inputData = new 
> FileInputStream(inputFile.toFile());
> TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_html/tika-config.xml");
> Parser autoDetectParser = new AutoDetectParser(config);
> ParseContext context = new ParseContext();
> context.set(TikaConfig.class, config);
> autoDetectParser.parse(inputData, handler, metadata, context);
>  
> String content;
> if (largeFile) {
> content = FileUtils.readFileToString(outputFilePath.toFile());
> }
> else {
> content = handler.toString();
> }
> System.out.println("content = " + content);
> }
> catch(Exception ex) {
> ex.printStackTrace();
> } finally {
> if (outputFileWriter != null) {
> outputFileWriter.close();
> }
> }
> }
> }
> {code}





[jira] [Created] (TIKA-4245) Tika does not get html content properly

2024-04-25 Thread Xiaohong Yang (Jira)
Xiaohong Yang created TIKA-4245:
---

 Summary: Tika does not get html content properly 
 Key: TIKA-4245
 URL: https://issues.apache.org/jira/browse/TIKA-4245
 Project: Tika
  Issue Type: Bug
Reporter: Xiaohong Yang
 Attachments: Sample html file and tika config xml.zip






[jira] [Commented] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-28 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831835#comment-17831835
 ] 

Xiaohong Yang commented on TIKA-4228:
-

It is not multithreaded. I will try to get the exit value of the process (if 
possible).  I will also check if there is a core dump on the machine.
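On capturing the exit value: when a child JVM dies from a native crash, the parent typically observes an exit value of 128 plus the signal number (134 for SIGABRT, 139 for SIGSEGV on Linux), and HotSpot usually writes an hs_err_pid*.log file to the working directory. A minimal sketch of that check, assuming a POSIX shell is available; the `sh -c "exit 134"` child stands in for a crashed parsing process:

```java
public class ExitValueCheck {
    public static void main(String[] args) throws Exception {
        // Stand-in for launching the parsing JVM as a child process; a JVM
        // killed by SIGABRT is reported to the parent as exit value 134.
        Process child = new ProcessBuilder("sh", "-c", "exit 134").start();
        int exit = child.waitFor();
        if (exit > 128) {
            System.out.println("child likely died on signal " + (exit - 128));
        } else {
            System.out.println("child exited normally: " + exit);
        }
    }
}
```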

> Tika parser crashes JVM when it gets metadata and embedded objects from pdf
> ---
>
> Key: TIKA-4228
> URL: https://issues.apache.org/jira/browse/TIKA-4228
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: tika-config-and-sample-file.zip
>
>
> [^tika-config-and-sample-file.zip]
>  
> We use org.apache.tika.parser.AutoDetectParser to get metadata and embedded 
> objects from pdf documents.  And we found out that it crashes the program (or 
> the JVM) when it gets metadata and embedded files from the sample pdf file.
>  
> Following is the sample code and attached is the tika-config.xml and the 
> sample pdf file. Note that the sample file crashes the JVM in 1 out of 10 
> runs in our production environment.  Sometimes it happens when it gets 
> metadata and sometimes it happens when it extracts embedded files (the 
> chances are about 50/50).
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.0 and POI version is 5.2.3.   
>  
>  
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.poi.poifs.filesystem.DirectoryEntry;
> import org.apache.poi.poifs.filesystem.DocumentEntry;
> import org.apache.poi.poifs.filesystem.DocumentInputStream;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.extractor.EmbeddedDocumentExtractor;
> import org.apache.tika.io.FilenameUtils;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
>  
> import java.io.*;
> import java.net.URL;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>  
> public class ProcessPdf {
>     private final Path inputFile = new File("/home/ubuntu/testdirs/testdir_pdf/sample.pdf").toPath();
>     private final Path outputDir = new File("/home/ubuntu/testdirs/testdir_pdf/tika_output/").toPath();
>  
>     private Parser parser;
>     private ParseContext context;
>  
>     public static void main(String args[]) {
>         try {
>             System.out.println("Start");
>             ProcessPdf processPdf = new ProcessPdf();
>             System.out.println("Get metadata");
>             processPdf.getMataData();
>             System.out.println("Extract embedded files");
>             processPdf.extract();
>             System.out.println("End");
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         }
>     }
>  
>     public ProcessPdf() {
>     }
>  
>     public void getMataData() throws Exception {
>         BodyContentHandler handler = new BodyContentHandler(-1);
>         Metadata metadata = new Metadata();
>         try (FileInputStream inputData = new FileInputStream(inputFile.toString())) {
>             TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");
>             Parser autoDetectParser = new AutoDetectParser(config);
>             ParseContext context = new ParseContext();
>             context.set(TikaConfig.class, config);
>             autoDetectParser.parse(inputData, handler, metadata, context);
>         }
>         String content = handler.toString();
>     }
>  
>     public void extract() throws Exception {
>         TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");
>         ProcessPdf.FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor =
>                 new ProcessPdf.FileEmbeddedDocumentExtractor();
>  
>         parser = new AutoDetectParser(config);
>         context = new ParseContext();
>         context.set(Parser.class, parser);
>         context.set(TikaConfig.class, config);
>         context.set(EmbeddedDocumentExtractor.class, fileEmbeddedDocumentExtractor);
>  
>         URL url = inputFile.toUri().toURL();
>         Metadata metadata = new Metadata();
>         try (InputStream input = TikaInputStream.get(url, metadata)) {
>             ContentHandler handler = new DefaultHandler();
>             parser.parse(input, handler, metadata, con

[jira] [Commented] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831546#comment-17831546
 ] 

Xiaohong Yang commented on TIKA-4228:
-

Hi Tim, there is no log. Because the program crashes the JVM, it does not get 
a chance to write anything to the log.

 


[jira] [Comment Edited] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831522#comment-17831522
 ] 

Xiaohong Yang edited comment on TIKA-4228 at 3/27/24 8:02 PM:
--

Hi Tim, "crashing" means causing the program (or the JVM) to crash (die). It 
does not happen in every run: it happens in 1 run out of 10 for the same file, 
and in 3 runs out of 8 with a client sample file.


was (Author: xyang200):
Hi Tim,  "crashing" means causing the program (or the JVM)  to crash (die).


[jira] [Commented] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17831522#comment-17831522
 ] 

Xiaohong Yang commented on TIKA-4228:
-

Hi Tim, "crashing" means causing the program (or the JVM) to crash (die).

> Tika parser crashes JVM when it gets metadata and embedded objects from pdf
> ---
>
> Key: TIKA-4228
> URL: https://issues.apache.org/jira/browse/TIKA-4228
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: tika-config-and-sample-file.zip
>
>
> [^tika-config-and-sample-file.zip]
>  
> We use org.apache.tika.parser.AutoDetectParser to get metadata and embedded 
> objects from pdf documents.  And we found out that it crashes the program (or 
> the JVM) when it gets metadata and embedded files from the sample pdf file.
>  
> Following is the sample code and attached is the tika-config.xml and the 
> sample pdf file. Note that the sample file crashes the JVM in 1 out of 10 
> runs in our production environment.  Sometimes it happens when it gets 
> metadata and sometimes it happens when it extracts embedded files (the 
> chances are about 50/50).
>  
> The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
> 2.9.0 and POI version is 5.2.3.   
>  
>  
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.poi.poifs.filesystem.DirectoryEntry;
> import org.apache.poi.poifs.filesystem.DocumentEntry;
> import org.apache.poi.poifs.filesystem.DocumentInputStream;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.extractor.EmbeddedDocumentExtractor;
> import org.apache.tika.io.FilenameUtils;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
>  
> import java.io.*;
> import java.net.URL;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>  
> public class ProcessPdf {
>     private final Path inputFile = new 
> File("/home/ubuntu/testdirs/testdir_pdf/sample.pdf").toPath();
>     private final Path outputDir = new 
> File("/home/ubuntu/testdirs/testdir_pdf/tika_output/").toPath();
>  
>     private Parser parser;
>     private ParseContext context;
>  
>  
>     public static void main(String args[]) {
>     try
> {     System.out.println("Start");     ProcessPdf processPdf 
> = new ProcessPdf();     System.out.println("Get metadata");   
>   processPdf.getMataData();     System.out.println("Extract embedded 
> files");     processPdf.extract();     
> System.out.println("End");     }
>     catch(Exception ex)
> {     ex.printStackTrace();     }
>     }
>  
>     public ProcessPdf()
> {     }
>  
>     public void getMataData() throws Exception {
>     BodyContentHandler handler = new BodyContentHandler(-1);
>  
>     Metadata metadata = new Metadata();
>     try (FileInputStream inputData = new 
> FileInputStream(inputFile.toString()))
> {     TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");     
> Parser autoDetectParser = new AutoDetectParser(config);     
> ParseContext context = new ParseContext();     
> context.set(TikaConfig.class, config);     
> autoDetectParser.parse(inputData, handler, metadata, context);     }
>  
>     String content = handler.toString();
>     }
>  
>     public void extract() throws Exception {
>     TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");
>     ProcessPdf.FileEmbeddedDocumentExtractor 
> fileEmbeddedDocumentExtractor = new 
> ProcessPdf.FileEmbeddedDocumentExtractor();
>  
>     parser = new AutoDetectParser(config);
>     context = new ParseContext();
>     context.set(Parser.class, parser);
>     context.set(TikaConfig.class, config);
>     context.set(EmbeddedDocumentExtractor.class, 
> fileEmbeddedDocumentExtractor);
>  
>     URL url = inputFile.toUri().toURL();
>     Metadata metadata = new Metadata();
>     try (InputStream input = TikaInputStream.get(url, metadata)) {
>     ContentHandler handler = new DefaultHandler();
>     parser.parse(input, handler, metadata, con

[jira] [Updated] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Xiaohong Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaohong Yang updated TIKA-4228:

Description: 
[^tika-config-and-sample-file.zip]

 

We use org.apache.tika.parser.AutoDetectParser to get metadata and embedded 
objects from pdf documents.  And we found out that it crashes the program (or 
the JVM) when it gets metadata and embedded files from the sample pdf file.

 

Following is the sample code and attached is the tika-config.xml and the sample 
pdf file. Note that the sample file crashes the JVM in 1 out of 10 runs in our 
production environment.  Sometimes it happens when it gets metadata and 
sometimes it happens when it extracts embedded files (the chances are about 
50/50).
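Since an intermittent native crash like this takes down the whole JVM, one common mitigation (not part of this report) is to run the parse in a child process and treat an abnormal exit as a failed parse; the tika-app jar, or Tika's ForkParser, can be driven this way. A stdlib-only sketch under that assumption, with `java -version` standing in for the real tika-app command line so it is self-contained:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

public class IsolatedParse {

    // Runs a command in a child process and appends its combined
    // stdout/stderr to 'output'. A hard JVM crash in the child then
    // surfaces as a non-zero exit code instead of killing the caller.
    public static int runIsolated(List<String> command, StringBuilder output) {
        try {
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectErrorStream(true); // merge stderr into stdout
            Process p = pb.start();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    output.append(line).append('\n');
                }
            }
            return p.waitFor();
        } catch (Exception e) {
            output.append(e.toString());
            return -1;
        }
    }

    public static void main(String[] args) {
        // Real usage would be something like (jar name illustrative):
        //   runIsolated(Arrays.asList("java", "-jar", "tika-app.jar",
        //           "-J", "-m", "sample.pdf"), out)
        StringBuilder out = new StringBuilder();
        int exit = runIsolated(Arrays.asList("java", "-version"), out);
        System.out.println("exit code: " + exit);
    }
}
```
The parent process can then retry or quarantine the input file on a non-zero exit, instead of losing the whole batch when one document crashes the parser.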

 

The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
2.9.0 and POI version is 5.2.3.   

 

 

import org.apache.pdfbox.io.IOUtils;

import org.apache.poi.poifs.filesystem.DirectoryEntry;

import org.apache.poi.poifs.filesystem.DocumentEntry;

import org.apache.poi.poifs.filesystem.DocumentInputStream;

import org.apache.poi.poifs.filesystem.POIFSFileSystem;

import org.apache.tika.config.TikaConfig;

import org.apache.tika.detect.Detector;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;

import org.apache.tika.io.FilenameUtils;

import org.apache.tika.io.TikaInputStream;

import org.apache.tika.metadata.Metadata;

import org.apache.tika.metadata.TikaCoreProperties;

import org.apache.tika.mime.MediaType;

import org.apache.tika.parser.AutoDetectParser;

import org.apache.tika.parser.ParseContext;

import org.apache.tika.parser.Parser;

import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.ContentHandler;

import org.xml.sax.SAXException;

import org.xml.sax.helpers.DefaultHandler;

 

import java.io.*;

import java.net.URL;

import java.nio.file.Files;

import java.nio.file.Path;

import java.nio.file.Paths;

 

public class ProcessPdf {

    private final Path inputFile = new 
File("/home/ubuntu/testdirs/testdir_pdf/sample.pdf").toPath();

    private final Path outputDir = new 
File("/home/ubuntu/testdirs/testdir_pdf/tika_output/").toPath();

 

    private Parser parser;

    private ParseContext context;

 

 

    public static void main(String args[]) {

    try {

    System.out.println("Start");

    ProcessPdf processPdf = new ProcessPdf();

    System.out.println("Get metadata");

    processPdf.getMataData();

    System.out.println("Extract embedded files");

    processPdf.extract();

    System.out.println("End");

    }

    catch(Exception ex) {

    ex.printStackTrace();

    }

    }

 

    public ProcessPdf() {

    }

 

    public void getMataData() throws Exception {

    BodyContentHandler handler = new BodyContentHandler(-1);

 

    Metadata metadata = new Metadata();

    try (FileInputStream inputData = new 
FileInputStream(inputFile.toString())) {

    TikaConfig config = new 
TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");

    Parser autoDetectParser = new AutoDetectParser(config);

    ParseContext context = new ParseContext();

    context.set(TikaConfig.class, config);

    autoDetectParser.parse(inputData, handler, metadata, context);

    }

 

    String content = handler.toString();

    }

 

    public void extract() throws Exception {

    TikaConfig config = new 
TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");

    ProcessPdf.FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor 
= new ProcessPdf.FileEmbeddedDocumentExtractor();

 

    parser = new AutoDetectParser(config);

    context = new ParseContext();

    context.set(Parser.class, parser);

    context.set(TikaConfig.class, config);

    context.set(EmbeddedDocumentExtractor.class, 
fileEmbeddedDocumentExtractor);

 

    URL url = inputFile.toUri().toURL();

    Metadata metadata = new Metadata();

    try (InputStream input = TikaInputStream.get(url, metadata)) {

    ContentHandler handler = new DefaultHandler();

    parser.parse(input, handler, metadata, context);

    }

    }

 

    private class FileEmbeddedDocumentExtractor implements 
EmbeddedDocumentExtractor {

    private int count = 0;

 

    public boolean shouldParseEmbedded(Metadata metadata) {

    return true;

    }

 

    public void parseEmbedded(InputStream inputStream, ContentHandler 
contentHandler, Metadata metadata,

  boolean outputHtml) throws SAXException, 
IOException {

    String fullFileName = 
metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);

    if (fullFileName == null) {

    fullFileName = "file" + count++;

    }

 

    TikaConfig config = n

[jira] [Created] (TIKA-4228) Tika parser crashes JVM when it gets metadata and embedded objects from pdf

2024-03-27 Thread Xiaohong Yang (Jira)
Xiaohong Yang created TIKA-4228:
---

 Summary: Tika parser crashes JVM when it gets metadata and 
embedded objects from pdf
 Key: TIKA-4228
 URL: https://issues.apache.org/jira/browse/TIKA-4228
 Project: Tika
  Issue Type: Bug
Reporter: Xiaohong Yang
 Attachments: tika-config-and-sample-file.zip

[^tika-config-and-sample-file.zip]

 

We use org.apache.tika.parser.AutoDetectParser to get metadata and embedded 
objects from pdf documents.  And we found out that it crashes the program (or 
the JVM) when it gets metadata and embedded files.

 

Following is the sample code and attached is the tika-config.xml and the sample 
pdf file. Note that the sample file crashes the JVM in 1 out of 10 runs in our 
production environment.  Sometimes it happens when it gets metadata and 
sometimes it happens when it extracts embedded files (the chances are about 
50/50).

 

The operating system is Ubuntu 20.04. Java version is 21.  Tika version is 
2.9.0 and POI version is 5.2.3.   

 

 

import org.apache.pdfbox.io.IOUtils;

import org.apache.poi.poifs.filesystem.DirectoryEntry;

import org.apache.poi.poifs.filesystem.DocumentEntry;

import org.apache.poi.poifs.filesystem.DocumentInputStream;

import org.apache.poi.poifs.filesystem.POIFSFileSystem;

import org.apache.tika.config.TikaConfig;

import org.apache.tika.detect.Detector;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;

import org.apache.tika.io.FilenameUtils;

import org.apache.tika.io.TikaInputStream;

import org.apache.tika.metadata.Metadata;

import org.apache.tika.metadata.TikaCoreProperties;

import org.apache.tika.mime.MediaType;

import org.apache.tika.parser.AutoDetectParser;

import org.apache.tika.parser.ParseContext;

import org.apache.tika.parser.Parser;

import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.ContentHandler;

import org.xml.sax.SAXException;

import org.xml.sax.helpers.DefaultHandler;

 

import java.io.*;

import java.net.URL;

import java.nio.file.Files;

import java.nio.file.Path;

import java.nio.file.Paths;

 

public class ProcessPdf {

    private final Path inputFile = new 
File("/home/ubuntu/testdirs/testdir_pdf/sample.pdf").toPath();

    private final Path outputDir = new 
File("/home/ubuntu/testdirs/testdir_pdf/tika_output/").toPath();

 

    private Parser parser;

    private ParseContext context;

 

 

    public static void main(String args[]) {

    try {

    System.out.println("Start");

    ProcessPdf processPdf = new ProcessPdf();

    System.out.println("Get metadata");

    processPdf.getMataData();

    System.out.println("Extract embedded files");

    processPdf.extract();

    System.out.println("End");

    }

    catch(Exception ex) {

    ex.printStackTrace();

    }

    }

 

    public ProcessPdf() {

    }

 

    public void getMataData() throws Exception {

    BodyContentHandler handler = new BodyContentHandler(-1);

 

    Metadata metadata = new Metadata();

    try (FileInputStream inputData = new 
FileInputStream(inputFile.toString())) {

    TikaConfig config = new 
TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");

    Parser autoDetectParser = new AutoDetectParser(config);

    ParseContext context = new ParseContext();

    context.set(TikaConfig.class, config);

    autoDetectParser.parse(inputData, handler, metadata, context);

    }

 

    String content = handler.toString();

    }

 

    public void extract() throws Exception {

    TikaConfig config = new 
TikaConfig("/home/ubuntu/testdirs/testdir_pdf/tika-config.xml");

    ProcessPdf.FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor 
= new ProcessPdf.FileEmbeddedDocumentExtractor();

 

    parser = new AutoDetectParser(config);

    context = new ParseContext();

    context.set(Parser.class, parser);

    context.set(TikaConfig.class, config);

    context.set(EmbeddedDocumentExtractor.class, 
fileEmbeddedDocumentExtractor);

 

    URL url = inputFile.toUri().toURL();

    Metadata metadata = new Metadata();

    try (InputStream input = TikaInputStream.get(url, metadata)) {

    ContentHandler handler = new DefaultHandler();

    parser.parse(input, handler, metadata, context);

    }

    }

 

    private class FileEmbeddedDocumentExtractor implements 
EmbeddedDocumentExtractor {

    private int count = 0;

 

    public boolean shouldParseEmbedded(Metadata metadata) {

    return true;

    }

 

    public void parseEmbedded(InputStream inputStream, ContentHandler 
contentHandler, Metadata metadata,

  boolean outputHtml) throws SAXException, 
IOException {

    String fullFileName = 
metadata.get(

[jira] [Comment Edited] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-21 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829616#comment-17829616
 ] 

Xiaohong Yang edited comment on TIKA-4211 at 3/21/24 3:50 PM:
--

Hi Tim,

Yes, it works with tika-app-3.0.0-20240321.135818-429.jar. 


was (Author: xyang200):
Hi Tim,

Yes, it works.

> Tika extractor fails to extract embedded excel from pptx
> 
>
> Key: TIKA-4211
> URL: https://issues.apache.org/jira/browse/TIKA-4211
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Fix For: 2.9.2, 3.0.0
>
> Attachments: config_and_sample_file.zip
>
>
> We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
> excel from PowerPoint presentation.  It works with most pptx files. But it 
> fails to detect the embedded excel with some pptx files.
> Following is the sample code and attached is the tika-config.xml and a pptx 
> file that works.
> We cannot provide the pptx file that does not work because it is client data.
> We noticed a difference between the pptx files that work and the pptx file 
> that does not work:  
> "Worksheet Object" is in the popup menu when the embedded Excel object 
> is right-clicked in the pptx files that work.
> "Edit Data" is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx file that does not work. This file might be created 
> with an old version of PowerPoint.
>  
> The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
> 2.9.1 and POI version is 5.2.3. 
>  
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.poi.poifs.filesystem.DirectoryEntry;
> import org.apache.poi.poifs.filesystem.DocumentEntry;
> import org.apache.poi.poifs.filesystem.DocumentInputStream;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.extractor.EmbeddedDocumentExtractor;
> import org.apache.tika.io.FilenameUtils;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
>  
> import java.io.*;
> import java.net.URL;
> import java.nio.file.Path;
>  
> public class ExtractExcelFromPowerPoint {
>     private final Path pptxFile = new 
> File("/home/ubuntu/testdirs/testdir_pptx/sample.pptx").toPath();
>     private final Path outputDir = new 
> File("/home/ubuntu/testdirs/testdir_pptx/tika_output/").toPath();
>  
>     private Parser parser;
>     private ParseContext context;
>  
>  
>     public static void main(String args[]) {
>     try {
>     new ExtractExcelFromPowerPoint().process();
>     }
>     catch(Exception ex) {
>     ex.printStackTrace();
>     }
>     }
>  
>     public ExtractExcelFromPowerPoint() {
>     }
>  
>     public void process() throws Exception {
>     TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_pptx/tika-config.xml");
>     FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor = new 
> FileEmbeddedDocumentExtractor();
>  
>     parser = new AutoDetectParser(config);
>     context = new ParseContext();
>     context.set(Parser.class, parser);
>     context.set(TikaConfig.class, config);
>     context.set(EmbeddedDocumentExtractor.class, 
> fileEmbeddedDocumentExtractor);
>  
>     URL url = pptxFile.toUri().toURL();
>     Metadata metadata = new Metadata();
>     try (InputStream input = TikaInputStream.get(url, metadata)) {
>     ContentHandler handler = new DefaultHandler();
>     parser.parse(input, handler, metadata, context);
>     }
>     }
>  
>     private class FileEmbeddedDocumentExtractor implements 
> EmbeddedDocumentExtractor {
>     private int count = 0;
>  
>     public boolean shouldParseEmbedded(Metadata metadata) {
>     return true;
>     }
>  
>     public void parseEmbedded(InputStream inputStream, ContentHandler 
> contentHandler, Metadata metadata,
>   boolean outputHtml) throws SAXException, 
> IOException {
>     String fullFileName = 
> metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
>     if (fullFileName == null) {
>     fullFileName = "file" + count++;
>     }
>  
>     String[] fileNameSplit = fullFileName.split("/");
>     String fileName = fileNameSplit[fileNameSplit.length - 1];
>     File outputFile = new Fil
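The resource-name handling in the parseEmbedded method quoted above reduces to the following stdlib-only sketch (the class name is illustrative, not part of the report): fall back to a generated "fileN" name when Tika supplies no resource name, then keep only the last path segment.

```java
public class EmbeddedFileNames {
    private int count = 0;

    // Mirrors the extractor's logic: use the last path segment of the
    // resource name, or a counter-based fallback when it is absent.
    public String embeddedFileName(String resourceName) {
        if (resourceName == null) {
            resourceName = "file" + count++;
        }
        String[] parts = resourceName.split("/");
        return parts[parts.length - 1];
    }

    public static void main(String[] args) {
        EmbeddedFileNames names = new EmbeddedFileNames();
        System.out.println(names.embeddedFileName("ppt/embeddings/Microsoft_Excel_Worksheet.xlsx"));
        System.out.println(names.embeddedFileName(null));
    }
}
```
Note that this keeps only the base name, so two embedded files with the same name in different folders would collide; the report's extractor additionally runs the result through FilenameUtils.normalize before writing.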

[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-21 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829616#comment-17829616
 ] 

Xiaohong Yang commented on TIKA-4211:
-

Hi Tim,

Yes, it works.

> Tika extractor fails to extract embedded excel from pptx
> 
>
> Key: TIKA-4211
> URL: https://issues.apache.org/jira/browse/TIKA-4211
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Fix For: 2.9.2, 3.0.0
>
> Attachments: config_and_sample_file.zip
>
>
> We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
> excel from PowerPoint presentation.  It works with most pptx files. But it 
> fails to detect the embedded excel with some pptx files.
> Following is the sample code and attached is the tika-config.xml and a pptx 
> file that works.
> We cannot provide the pptx file that does not work because it is client data.
> We noticed a difference between the pptx files that work and the pptx file 
> that does not work:  
> "Worksheet Object" is in the popup menu when the embedded Excel object 
> is right-clicked in the pptx files that work.
> "Edit Data" is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx file that does not work. This file might be created 
> with an old version of PowerPoint.
>  
> The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
> 2.9.1 and POI version is 5.2.3. 
>  
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.poi.poifs.filesystem.DirectoryEntry;
> import org.apache.poi.poifs.filesystem.DocumentEntry;
> import org.apache.poi.poifs.filesystem.DocumentInputStream;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.extractor.EmbeddedDocumentExtractor;
> import org.apache.tika.io.FilenameUtils;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
>  
> import java.io.*;
> import java.net.URL;
> import java.nio.file.Path;
>  
> public class ExtractExcelFromPowerPoint {
>     private final Path pptxFile = new 
> File("/home/ubuntu/testdirs/testdir_pptx/sample.pptx").toPath();
>     private final Path outputDir = new 
> File("/home/ubuntu/testdirs/testdir_pptx/tika_output/").toPath();
>  
>     private Parser parser;
>     private ParseContext context;
>  
>  
>     public static void main(String args[]) {
>     try {
>     new ExtractExcelFromPowerPoint().process();
>     }
>     catch(Exception ex) {
>     ex.printStackTrace();
>     }
>     }
>  
>     public ExtractExcelFromPowerPoint() {
>     }
>  
>     public void process() throws Exception {
>     TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_pptx/tika-config.xml");
>     FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor = new 
> FileEmbeddedDocumentExtractor();
>  
>     parser = new AutoDetectParser(config);
>     context = new ParseContext();
>     context.set(Parser.class, parser);
>     context.set(TikaConfig.class, config);
>     context.set(EmbeddedDocumentExtractor.class, 
> fileEmbeddedDocumentExtractor);
>  
>     URL url = pptxFile.toUri().toURL();
>     Metadata metadata = new Metadata();
>     try (InputStream input = TikaInputStream.get(url, metadata)) {
>     ContentHandler handler = new DefaultHandler();
>     parser.parse(input, handler, metadata, context);
>     }
>     }
>  
>     private class FileEmbeddedDocumentExtractor implements 
> EmbeddedDocumentExtractor {
>     private int count = 0;
>  
>     public boolean shouldParseEmbedded(Metadata metadata) {
>     return true;
>     }
>  
>     public void parseEmbedded(InputStream inputStream, ContentHandler 
> contentHandler, Metadata metadata,
>   boolean outputHtml) throws SAXException, 
> IOException {
>     String fullFileName = 
> metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
>     if (fullFileName == null) {
>     fullFileName = "file" + count++;
>     }
>  
>     String[] fileNameSplit = fullFileName.split("/");
>     String fileName = fileNameSplit[fileNameSplit.length - 1];
>     File outputFile = new File(outputDir.toFile(), 
> FilenameUtils.normalize(fileName));
>     System.out.println("Extracting '" + fileName + " to " + 
> outputFile

[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-21 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829602#comment-17829602
 ] 

Xiaohong Yang commented on TIKA-4211:
-

Hi Tim,

Can you tell me when the fix will be released?

By the way, if you can give me the tika-core-3.0.0-20240321.135818-429.jar file 
I can test it in our program.

 

> Tika extractor fails to extract embedded excel from pptx
> 
>
> Key: TIKA-4211
> URL: https://issues.apache.org/jira/browse/TIKA-4211
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: config_and_sample_file.zip
>
>
> We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
> excel from PowerPoint presentation.  It works with most pptx files. But it 
> fails to detect the embedded excel with some pptx files.
> Following is the sample code and attached is the tika-config.xml and a pptx 
> file that works.
> We cannot provide the pptx file that does not work because it is client data.
> We noticed a difference between the pptx files that work and the pptx file 
> that does not work:  
> "Worksheet Object" is in the popup menu when the embedded Excel object 
> is right-clicked in the pptx files that work.
> "Edit Data" is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx file that does not work. This file might be created 
> with an old version of PowerPoint.
>  
> The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
> 2.9.1 and POI version is 5.2.3. 
>  
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.poi.poifs.filesystem.DirectoryEntry;
> import org.apache.poi.poifs.filesystem.DocumentEntry;
> import org.apache.poi.poifs.filesystem.DocumentInputStream;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.extractor.EmbeddedDocumentExtractor;
> import org.apache.tika.io.FilenameUtils;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
>  
> import java.io.*;
> import java.net.URL;
> import java.nio.file.Path;
>  
> public class ExtractExcelFromPowerPoint {
>     private final Path pptxFile = new 
> File("/home/ubuntu/testdirs/testdir_pptx/sample.pptx").toPath();
>     private final Path outputDir = new 
> File("/home/ubuntu/testdirs/testdir_pptx/tika_output/").toPath();
>  
>     private Parser parser;
>     private ParseContext context;
>  
>  
>     public static void main(String args[]) {
>     try {
>     new ExtractExcelFromPowerPoint().process();
>     }
>     catch(Exception ex) {
>     ex.printStackTrace();
>     }
>     }
>  
>     public ExtractExcelFromPowerPoint() {
>     }
>  
>     public void process() throws Exception {
>     TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_pptx/tika-config.xml");
>     FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor = new 
> FileEmbeddedDocumentExtractor();
>  
>     parser = new AutoDetectParser(config);
>     context = new ParseContext();
>     context.set(Parser.class, parser);
>     context.set(TikaConfig.class, config);
>     context.set(EmbeddedDocumentExtractor.class, 
> fileEmbeddedDocumentExtractor);
>  
>     URL url = pptxFile.toUri().toURL();
>     Metadata metadata = new Metadata();
>     try (InputStream input = TikaInputStream.get(url, metadata)) {
>     ContentHandler handler = new DefaultHandler();
>     parser.parse(input, handler, metadata, context);
>     }
>     }
>  
>     private class FileEmbeddedDocumentExtractor implements 
> EmbeddedDocumentExtractor {
>     private int count = 0;
>  
>     public boolean shouldParseEmbedded(Metadata metadata) {
>     return true;
>     }
>  
>     public void parseEmbedded(InputStream inputStream, ContentHandler 
> contentHandler, Metadata metadata,
>   boolean outputHtml) throws SAXException, 
> IOException {
>     String fullFileName = 
> metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
>     if (fullFileName == null) {
>     fullFileName = "file" + count++;
>     }
>  
>     String[] fileNameSplit = fullFileName.split("/");
>     String fileName = fileNameSplit[fileNameSplit.length - 1];
>     File outputFile = new File(outputDir.toFile(), 
> FilenameUt

[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-21 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829597#comment-17829597
 ] 

Xiaohong Yang commented on TIKA-4211:
-

Hi Tim,
I ran the following command and the xlsx is in the result json:



java -jar tika-app-3.0.0-20240321.135818-429.jar  -J -t 
2020_Capacity_Ramp_Plan.pptx

Here is the related part of the json 



[

   {

  "cp:revision": "8",

  "extended-properties:AppVersion": "16.",

  "meta:paragraph-count": "278",

  "meta:word-count": "465",

  "extended-properties:PresentationFormat": "Widescreen",

  "extended-properties:Application": "Microsoft Office PowerPoint",

  "meta:last-author": "Kenneth Nip",

  "X-TIKA:Parsed-By-Full-Set": [

 "org.apache.tika.parser.DefaultParser",

 "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",

 "org.apache.tika.parser.image.JpegParser",

 "org.apache.tika.parser.ocr.TesseractOCRParser"

  ],

  "X-TIKA:content_handler": "ToTextContentHandler",

  "dc:creator": "Kenneth Nip",

  "meta:slide-count": "3",

  "xmpTPg:NPages": "3",

  "resourceName": "2020_Capacity_Ramp_Plan.pptx",

  "dcterms:created": "2020-01-04T05:19:17Z",

  "dcterms:modified": "2020-01-06T07:58:18Z",

  "X-TIKA:Parsed-By": [

 "org.apache.tika.parser.DefaultParser",

 "org.apache.tika.parser.microsoft.ooxml.OOXMLParser"

  ],

  "dc:title": "PowerPoint Presentation",

  "extended-properties:DocSecurityString": "None",

  "extended-properties:TotalTime": "342",

  "X-TIKA:parse_time_millis": "1223",

  "X-TIKA:embedded_depth": "0",

  "X-TIKA:content": "…… / 
Peter\t\t\t\n\n\n\nMicrosoft_Excel_Worksheet.xlsx\n\n\n",

  "Content-Length": "144945",

  "Content-Type": 
"application/vnd.openxmlformats-officedocument.presentationml.presentation"

   },

   {

  "extended-properties:AppVersion": "16.0300",

  "extended-properties:Application": "Microsoft Excel",

  "meta:last-author": "Kenneth Nip",

  "X-TIKA:embedded_id_path": "/1",

  "X-TIKA:content_handler": "ToTextContentHandler",

  "dc:creator": "Kenneth Nip",

  "extended-properties:Company": "",

  "meta:print-date": "2019-11-06T23:43:22Z",

  "resourceName": "Microsoft_Excel_Worksheet.xlsx",

  "dcterms:created": "2019-10-30T16:50:00Z",

  "dcterms:modified": "2020-01-06T07:29:13Z",

  "X-TIKA:origResourceName": "C:\\Users\\kenrw\\Downloads\\",

  "embeddedRelationshipId": "rId3",

  "protected": "false",

  "embeddedResourceType": "ATTACHMENT",

  "X-TIKA:Parsed-By": [

 "org.apache.tika.parser.DefaultParser",

 "org.apache.tika.parser.microsoft.ooxml.OOXMLParser"

  ],

  "extended-properties:DocSecurityString": "None",

  "X-TIKA:embedded_depth": "1",

  "X-TIKA:parse_time_millis": "376",

  "X-TIKA:content": "..",

  "X-TIKA:embedded_resource_path": 
"/Microsoft_Excel_Worksheet.xlsx",

  "X-TIKA:embedded_id": "1",

  "Content-Type": 
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",

  "dc:publisher": ""

   },

…

]

 

> Tika extractor fails to extract embedded excel from pptx
> 
>
> Key: TIKA-4211
> URL: https://issues.apache.org/jira/browse/TIKA-4211
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: config_and_sample_file.zip
>
>
> We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
> excel from PowerPoint presentation.  It works with most pptx files. But it 
> fails to detect the embedded excel with some pptx files.
> Following is the sample code and attached is the tika-config.xml and a pptx 
> file that works.
> We cannot provide the pptx file that does not work because it is client data.
> We noticed a difference between the pptx files that work and the pptx file 
> that does not work:  
> "Worksheet Object" is in the popup menu when the embedded Excel object 
> is right-clicked in the pptx files that work.
> "Edit Data" is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx file that does not work. This file might be created 
> with an old ve

[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-20 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828942#comment-17828942
 ] 

Xiaohong Yang commented on TIKA-4211:
-

For step 0: if you run java -jar tika-app-2.9.1.jar -J -t yourFile.pptx, do you 
see the xlsx info in the json?

I ran the command and did not find the xlsx info in the json.

> Tika extractor fails to extract embedded excel from pptx
> 
>
> Key: TIKA-4211
> URL: https://issues.apache.org/jira/browse/TIKA-4211
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: config_and_sample_file.zip
>
>
> We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
> excel from PowerPoint presentation.  It works with most pptx files. But it 
> fails to detect the embedded excel with some pptx files.
> Following is the sample code and attached is the tika-config.xml and a pptx 
> file that works.
> We cannot provide the pptx file that does not work because it is client data.
> We noticed a difference between the pptx files that work and the pptx file 
> that does not work:  
> "Worksheet Object" is in the popup menu when the embedded Excel object 
> is right-clicked in the pptx files that work.
> "Edit Data" is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx file that does not work. This file might be created 
> with an old version of PowerPoint.
>  
> The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
> 2.9.1 and POI version is 5.2.3. 
>  
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.poi.poifs.filesystem.DirectoryEntry;
> import org.apache.poi.poifs.filesystem.DocumentEntry;
> import org.apache.poi.poifs.filesystem.DocumentInputStream;
> import org.apache.poi.poifs.filesystem.POIFSFileSystem;
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.extractor.EmbeddedDocumentExtractor;
> import org.apache.tika.io.FilenameUtils;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.xml.sax.ContentHandler;
> import org.xml.sax.SAXException;
> import org.xml.sax.helpers.DefaultHandler;
>  
> import java.io.*;
> import java.net.URL;
> import java.nio.file.Path;
>  
> public class ExtractExcelFromPowerPoint {
>     private final Path pptxFile = new 
> File("/home/ubuntu/testdirs/testdir_pptx/sample.pptx").toPath();
>     private final Path outputDir = new 
> File("/home/ubuntu/testdirs/testdir_pptx/tika_output/").toPath();
>  
>     private Parser parser;
>     private ParseContext context;
>  
>  
>     public static void main(String args[]) {
>     try {
>     new ExtractExcelFromPowerPoint().process();
>     }
>     catch(Exception ex) {
>     ex.printStackTrace();
>     }
>     }
>  
>     public ExtractExcelFromPowerPoint() {
>     }
>  
>     public void process() throws Exception {
>     TikaConfig config = new 
> TikaConfig("/home/ubuntu/testdirs/testdir_pptx/tika-config.xml");
>     FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor = new 
> FileEmbeddedDocumentExtractor();
>  
>     parser = new AutoDetectParser(config);
>     context = new ParseContext();
>     context.set(Parser.class, parser);
>     context.set(TikaConfig.class, config);
>     context.set(EmbeddedDocumentExtractor.class, 
> fileEmbeddedDocumentExtractor);
>  
>     URL url = pptxFile.toUri().toURL();
>     Metadata metadata = new Metadata();
>     try (InputStream input = TikaInputStream.get(url, metadata)) {
>     ContentHandler handler = new DefaultHandler();
>     parser.parse(input, handler, metadata, context);
>     }
>     }
>  
>     private class FileEmbeddedDocumentExtractor implements 
> EmbeddedDocumentExtractor {
>     private int count = 0;
>  
>     public boolean shouldParseEmbedded(Metadata metadata) {
>     return true;
>     }
>  
>     public void parseEmbedded(InputStream inputStream, ContentHandler 
> contentHandler, Metadata metadata,
>   boolean outputHtml) throws SAXException, 
> IOException {
>     String fullFileName = 
> metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
>     if (fullFileName == null) {
>     fullFileName = "file" + count++;
>     }
>  
>     String[] fileNameSplit = fullFileName.split("/");
>     String fileName = fileNameSplit[fileNameSplit.length - 1];
>     File outputFile = new File(outputDir.toFile(), 
> FilenameUtils.normalize(fileName));

[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-18 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828155#comment-17828155
 ] 

Xiaohong Yang commented on TIKA-4211:
-

Hi Tim,

I searched Microsoft_Excel_Worksheet.xlsx  in the whole directory and found out 
that it is referenced in the following file

ppt\charts\_rels\chart1.xml.rels:

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
   <Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/package" Target="../embeddings/Microsoft_Excel_Worksheet.xlsx"/>
   <Relationship Type="http://schemas.microsoft.com/office/2011/relationships/chartColorStyle" Target="colors1.xml"/>
   <Relationship Type="http://schemas.microsoft.com/office/2011/relationships/chartStyle" Target="style1.xml"/>
</Relationships>


And rId3 is referenced in the following file.

ppt\charts\chart1.xml:

<c:chartSpace xmlns:c="http://schemas.openxmlformats.org/drawingml/2006/chart" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:c16r2="http://schemas.microsoft.com/office/drawing/2015/06/chart">
   …
</c:chartSpace>


It seems that Microsoft_Excel_Worksheet.xlsx is only used in charts, not in 
slides.

It also seems that you currently only check attachments in slides. You could 
check attachments in charts as well, or treat all files in the directory 
/ppt/embeddings/ as attachments (without checking references in slides or 
charts at all).
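That last suggestion can be sketched with nothing but the JDK zip API, since a pptx is an ordinary OOXML zip package. This is only a sketch, not Tika code; the class name and sample path below are placeholders:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ListPptxEmbeddings {
    // A pptx file is an OOXML zip package; everything under ppt/embeddings/
    // is an embedded object, regardless of which slide or chart references it.
    static List<String> listEmbeddings(String pptxPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(pptxPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                if (!e.isDirectory() && e.getName().startsWith("ppt/embeddings/")) {
                    names.add(e.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // The path is the sample path used elsewhere in this thread.
        for (String name : listEmbeddings("/home/ubuntu/testdirs/testdir_pptx/sample.pptx")) {
            System.out.println(name);
        }
    }
}
```

A detector could then hand every returned entry to the embedded-document extractor instead of relying on slide relationships.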

> Tika extractor fails to extract embedded excel from pptx
> 
>
> Key: TIKA-4211
> URL: https://issues.apache.org/jira/browse/TIKA-4211
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: config_and_sample_file.zip
>
>
> We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
> excel from PowerPoint presentation.  It works with most pptx files. But it 
> fails to detect the embedded excel with some pptx files.
> Following is the sample code and attached is the tika-config.xml and a pptx 
> file that works.
> We cannot provide the pptx file that does not work because it is client data.
> We noticed a difference between the pptx files that work and the pptx file 
> that does not work:  
> "Worksheet Object" is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx files that work.
> "Edit Data" is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx file that does not work. This file might be created 
> with an old version of PowerPoint.
>  
> The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
> 2.9.1 and POI version is 5.2.3. 
>  

[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-15 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827519#comment-17827519
 ] 

Xiaohong Yang commented on TIKA-4211:
-

Hi Tim,

Step 3:

Because Microsoft_Excel_Worksheet.xlsx is not referenced in Step 2, there is 
no need to do this step.

> Tika extractor fails to extract embedded excel from pptx
> 
>
> Key: TIKA-4211
> URL: https://issues.apache.org/jira/browse/TIKA-4211
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: config_and_sample_file.zip
>
>
> We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
> excel from PowerPoint presentation.  It works with most pptx files. But it 
> fails to detect the embedded excel with some pptx files.
> Following is the sample code and attached is the tika-config.xml and a pptx 
> file that works.
> We cannot provide the pptx file that does not work because it is client data.
> We noticed a difference between the pptx files that work and the pptx file 
> that does not work:  
> "Worksheet Object" is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx files that work.
> "Edit Data" is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx file that does not work. This file might be created 
> with an old version of PowerPoint.
>  
> The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
> 2.9.1 and POI version is 5.2.3. 
>  

[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-15 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827518#comment-17827518
 ] 

Xiaohong Yang commented on TIKA-4211:
-

Hi Tim, 

Step 2:

There are three files in */ppt/slides/_rels/*; their contents are as follows.  
Note that Microsoft_Excel_Worksheet.xlsx is not referenced.

Nothing except Microsoft_Excel_Worksheet.xlsx is found in folder 
*ppt/embeddings*.

 

*ppt\slides\_rels\slide1.xml.rels*

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
   <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/chart" Target="../charts/chart1.xml"/>
   <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout" Target="../slideLayouts/slideLayout2.xml"/>
</Relationships>

*ppt\slides\_rels\slide2.xml.rels*

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
   <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout" Target="../slideLayouts/slideLayout2.xml"/>
</Relationships>

*ppt\slides\_rels\slide3.xml.rels*

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
   <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout" Target="../slideLayouts/slideLayout2.xml"/>
</Relationships>
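The manual check above (looking for the workbook in each .rels part) can also be scripted. This is a rough sketch using only the JDK zip API with a plain substring scan rather than real XML parsing; the class name and path are placeholders:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class FindRelsReferences {
    // Returns the names of all .rels parts in the package whose XML mentions
    // the target, e.g. "Microsoft_Excel_Worksheet.xlsx". A substring scan is
    // enough here because .rels parts are small XML files.
    static List<String> findReferences(String pptxPath, String target) throws IOException {
        List<String> hits = new ArrayList<>();
        try (ZipFile zip = new ZipFile(pptxPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                if (!e.getName().endsWith(".rels")) {
                    continue;
                }
                String xml = new String(zip.getInputStream(e).readAllBytes(), StandardCharsets.UTF_8);
                if (xml.contains(target)) {
                    hits.add(e.getName());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Sample path and target taken from this thread.
        for (String part : findReferences("/home/ubuntu/testdirs/testdir_pptx/sample.pptx",
                "Microsoft_Excel_Worksheet.xlsx")) {
            System.out.println(part);
        }
    }
}
```

For the pptx discussed here this should report only ppt/charts/_rels/chart1.xml.rels, confirming that the workbook is reachable only through the chart.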

> Tika extractor fails to extract embedded excel from pptx
> 
>
> Key: TIKA-4211
> URL: https://issues.apache.org/jira/browse/TIKA-4211
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: config_and_sample_file.zip
>
>
> We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
> excel from PowerPoint presentation.  It works with most pptx files. But it 
> fails to detect the embedded excel with some pptx files.
> Following is the sample code and attached is the tika-config.xml and a pptx 
> file that works.
> We cannot provide the pptx file that does not work because it is client data.
> We noticed a difference between the pptx files that work and the pptx file 
> that does not work:  
> "Worksheet Object" is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx files that work.
> "Edit Data" is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx file that does not work. This file might be created 
> with an old version of PowerPoint.
>  
> The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
> 2.9.1 and POI version is 5.2.3. 
>  

[jira] [Commented] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-14 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827221#comment-17827221
 ] 

Xiaohong Yang commented on TIKA-4211:
-

Hi Tim, 

Yes, I found the right file /ppt/embeddings/Microsoft_Excel_Worksheet.xlsx 
after unzipping the pptx.

> Tika extractor fails to extract embedded excel from pptx
> 
>
> Key: TIKA-4211
> URL: https://issues.apache.org/jira/browse/TIKA-4211
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: config_and_sample_file.zip
>
>
> We use org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
> excel from PowerPoint presentation.  It works with most pptx files. But it 
> fails to detect the embedded excel with some pptx files.
> Following is the sample code and attached is the tika-config.xml and a pptx 
> file that works.
> We cannot provide the pptx file that does not work because it is client data.
> We noticed a difference between the pptx files that work and the pptx file 
> that does not work:  
> "Worksheet Object" is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx files that work.
> "Edit Data" is in the popup menu when the embedded Excel object is 
> right-clicked in the pptx file that does not work. This file might be created 
> with an old version of PowerPoint.
>  
> The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
> 2.9.1 and POI version is 5.2.3. 
>  

[jira] [Updated] (TIKA-4212) Tika fails to get file extension of file type image/x-rtf-raw-bitmap

2024-03-14 Thread Xiaohong Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaohong Yang updated TIKA-4212:

Attachment: tika-config-and-sample-file.zip

> Tika fails to get file extension of file type image/x-rtf-raw-bitmap
> 
>
> Key: TIKA-4212
> URL: https://issues.apache.org/jira/browse/TIKA-4212
> Project: Tika
>  Issue Type: Bug
>Reporter: Xiaohong Yang
>Priority: Major
> Attachments: tika-config-and-sample-file.zip
>
>
> We use  org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
> objects from Word documents.  Two embedded objects are extracted from the 
> sample doc file. Their file type is image/x-rtf-raw-bitmap. But Tika fails to 
> get the file extension with the following method call
>   tikaExtension = 
> config.getMimeRepository().forName(contentType.toString()).getExtension();
> Wonder if you can fix the problem in the Tika library.  Also wonder if you 
> can tell us the file extension of file type image/x-rtf-raw-bitmap.
> Following is the sample code and attached is the tika-config.xml and the 
> sample Word file.
> The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
> 2.9.1 and POI version is 5.2.3.  
>  

[jira] [Created] (TIKA-4212) Tika fails to get file extension of file type image/x-rtf-raw-bitmap

2024-03-14 Thread Xiaohong Yang (Jira)
Xiaohong Yang created TIKA-4212:
---

 Summary: Tika fails to get file extension of file type 
image/x-rtf-raw-bitmap
 Key: TIKA-4212
 URL: https://issues.apache.org/jira/browse/TIKA-4212
 Project: Tika
  Issue Type: Bug
Reporter: Xiaohong Yang


We use  org.apache.tika.extractor.EmbeddedDocumentExtractor to get embedded 
objects from Word documents.  Two embedded objects are extracted from the 
sample doc file. Their file type is image/x-rtf-raw-bitmap. But Tika fails to 
get the file extension with the following method call

  tikaExtension = 
config.getMimeRepository().forName(contentType.toString()).getExtension();

Wonder if you can fix the problem in the Tika library.  Also wonder if you can 
tell us the file extension of file type image/x-rtf-raw-bitmap.
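For what it is worth, Tika's MimeType.getExtension() returns an empty string when no extension is registered for a type, which is presumably what happens for image/x-rtf-raw-bitmap. A minimal application-side fallback (my own workaround sketch, not a Tika API) could reuse the extension already present in the resource name:

```java
public class ExtensionFallback {
    // Sketch of a possible application-side workaround (an assumption, not a
    // Tika API): if the MIME repository returned an empty extension, fall
    // back to whatever extension the resource name already carries.
    static String resolveExtension(String mimeRepoExtension, String resourceName) {
        if (mimeRepoExtension != null && !mimeRepoExtension.isEmpty()) {
            return mimeRepoExtension;
        }
        int dot = (resourceName == null) ? -1 : resourceName.lastIndexOf('.');
        return (dot >= 0) ? resourceName.substring(dot) : "";
    }

    public static void main(String[] args) {
        // No extension registered for the type: fall back to the name's suffix.
        System.out.println(resolveExtension("", "image1.bmp"));  // prints ".bmp"
    }
}
```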

Following is the sample code and attached is the tika-config.xml and the sample 
Word file.

The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
2.9.1 and POI version is 5.2.3.  

 

import org.apache.pdfbox.io.IOUtils;

import org.apache.poi.poifs.filesystem.DirectoryEntry;

import org.apache.poi.poifs.filesystem.DocumentEntry;

import org.apache.poi.poifs.filesystem.DocumentInputStream;

import org.apache.poi.poifs.filesystem.POIFSFileSystem;

import org.apache.tika.config.TikaConfig;

import org.apache.tika.detect.Detector;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;

import org.apache.tika.io.FilenameUtils;

import org.apache.tika.io.TikaInputStream;

import org.apache.tika.metadata.Metadata;

import org.apache.tika.metadata.TikaCoreProperties;

import org.apache.tika.mime.MediaType;

import org.apache.tika.parser.AutoDetectParser;

import org.apache.tika.parser.ParseContext;

import org.apache.tika.parser.Parser;

import org.xml.sax.ContentHandler;

import org.xml.sax.SAXException;

import org.xml.sax.helpers.DefaultHandler;

 

import java.io.*;

import java.net.URL;

import java.nio.file.Path;

 

public class ExtractBitMapFromWord {

    private final Path docFile = new 
File("/home/ubuntu/testdirs/testdir_doc/sample.DOC").toPath();

    private final Path outputDir = new 
File("/home/ubuntu/testdirs/testdir_doc/tika_output/").toPath();

 

    private Parser parser;

    private ParseContext context;

 

 

    public static void main(String args[]) {

    try {

    new ExtractBitMapFromWord().process();

    }

    catch(Exception ex) {

    ex.printStackTrace();

    }

    }

 

    public ExtractBitMapFromWord() {

    }

 

    public void process() throws Exception {

    TikaConfig config = new 
TikaConfig("/home/ubuntu/testdirs/testdir_doc/tika-config.xml");

    ExtractBitMapFromWord.FileEmbeddedDocumentExtractor 
fileEmbeddedDocumentExtractor = new 
ExtractBitMapFromWord.FileEmbeddedDocumentExtractor();

 

    parser = new AutoDetectParser(config);

    context = new ParseContext();

    context.set(Parser.class, parser);

    context.set(TikaConfig.class, config);

    context.set(EmbeddedDocumentExtractor.class, 
fileEmbeddedDocumentExtractor);

 

    URL url = docFile.toUri().toURL();

    Metadata metadata = new Metadata();

    try (InputStream input = TikaInputStream.get(url, metadata)) {

    ContentHandler handler = new DefaultHandler();

    parser.parse(input, handler, metadata, context);

    }

    }

 

    private class FileEmbeddedDocumentExtractor implements 
EmbeddedDocumentExtractor {

    private int count = 0;

 

    public boolean shouldParseEmbedded(Metadata metadata) {

    return true;

    }

 

    public void parseEmbedded(InputStream inputStream, ContentHandler 
contentHandler, Metadata metadata,

  boolean outputHtml) throws SAXException, 
IOException {

    String fullFileName = 
metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);

    if (fullFileName == null) {

    fullFileName = "file" + count++;

    }

 

    TikaConfig config = null;

    try {

    config = new 
TikaConfig("/home/ubuntu/testdirs/testdir_doc/tika-config.xml");

    } catch (Exception ex) {

    ex.printStackTrace();

    }

    if (config == null) {

    return;

    }

 

    Detector detector = config.getDetector();
    MediaType contentType = detector.detect(inputStream, metadata);

    String tikaExtension = null;

    if(fullFileName.indexOf('.') == -1 && contentType != null){

    try {

    tikaExtension = 
config.getMimeRepository().forName(contentType.toString()).getExtension();

    } catch (Exception ex) {

    ex.printStackTrace();

    }

 

    if (tikaExtension != null && 

[jira] [Created] (TIKA-4211) Tika extractor fails to extract embedded excel from pptx

2024-03-14 Thread Xiaohong Yang (Jira)
Xiaohong Yang created TIKA-4211:
---

 Summary: Tika extractor fails to extract embedded excel from pptx
 Key: TIKA-4211
 URL: https://issues.apache.org/jira/browse/TIKA-4211
 Project: Tika
  Issue Type: Bug
Reporter: Xiaohong Yang
 Attachments: config_and_sample_file.zip

We use org.apache.tika.extractor.EmbeddedDocumentExtractor to extract embedded 
Excel workbooks from PowerPoint presentations.  It works with most pptx files, 
but it fails to detect the embedded Excel with some pptx files.

Following is the sample code and attached is the tika-config.xml and a pptx 
file that works.

We cannot provide the pptx file that does not work because it is client data.

We noticed a difference between the pptx files that work and the pptx file that 
does not work:  

"Worksheet Object" is in the popup menu when the embedded Excel object is 
right-clicked in the pptx files that work.

"Edit Data" is in the popup menu when the embedded Excel object is 
right-clicked in the pptx file that does not work. This file might be created 
with an old version of PowerPoint.

 

The operating system is Ubuntu 20.04. Java version is 17.  Tika version is 
2.9.1 and POI version is 5.2.3. 

 

import org.apache.pdfbox.io.IOUtils;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.io.FilenameUtils;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.io.*;
import java.net.URL;
import java.nio.file.Path;

public class ExtractExcelFromPowerPoint {

    private final Path pptxFile = new File("/home/ubuntu/testdirs/testdir_pptx/sample.pptx").toPath();
    private final Path outputDir = new File("/home/ubuntu/testdirs/testdir_pptx/tika_output/").toPath();

    private Parser parser;
    private ParseContext context;

    public static void main(String[] args) {
        try {
            new ExtractExcelFromPowerPoint().process();
        }
        catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    public ExtractExcelFromPowerPoint() {
    }

    public void process() throws Exception {
        TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_pptx/tika-config.xml");
        FileEmbeddedDocumentExtractor fileEmbeddedDocumentExtractor = new FileEmbeddedDocumentExtractor();

        parser = new AutoDetectParser(config);
        context = new ParseContext();
        context.set(Parser.class, parser);
        context.set(TikaConfig.class, config);
        context.set(EmbeddedDocumentExtractor.class, fileEmbeddedDocumentExtractor);

        URL url = pptxFile.toUri().toURL();
        Metadata metadata = new Metadata();
        try (InputStream input = TikaInputStream.get(url, metadata)) {
            ContentHandler handler = new DefaultHandler();
            parser.parse(input, handler, metadata, context);
        }
    }

    private class FileEmbeddedDocumentExtractor implements EmbeddedDocumentExtractor {

        private int count = 0;

        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler, Metadata metadata,
                                  boolean outputHtml) throws SAXException, IOException {
            String fullFileName = metadata.get(TikaCoreProperties.RESOURCE_NAME_KEY);
            if (fullFileName == null) {
                fullFileName = "file" + count++;
            }

            String[] fileNameSplit = fullFileName.split("/");
            String fileName = fileNameSplit[fileNameSplit.length - 1];
            File outputFile = new File(outputDir.toFile(), FilenameUtils.normalize(fileName));
            System.out.println("Extracting '" + fileName + "' to " + outputFile);

            FileOutputStream os = null;
            try {
                os = new FileOutputStream(outputFile);
                if (inputStream instanceof TikaInputStream tin) {
                    if (tin.getOpenContainer() instanceof DirectoryEntry) {
                        try (POIFSFileSystem fs = new POIFSFileSystem()) {
                            copy((DirectoryEntry) tin.getOpenContainer(), fs.getRoot());

[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-17 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400729#comment-17400729
 ] 

Xiaohong Yang commented on TIKA-3519:
-

Tried to upload a smaller example file (9 MB). Still got the error:



*File "657673_tiny.zip" was not uploaded*

An internal error has occurred. Please contact your administrator.

> Wonder if you can add a feature for Tika parser to stop reading  metadata and 
> body content if certain amount of memory or body content has reached
> --
>
> Key: TIKA-3519
> URL: https://issues.apache.org/jira/browse/TIKA-3519
> Project: Tika
>  Issue Type: Wish
>  Components: detector
>Affects Versions: 1.25, 1.26
> Environment: Linux
>Reporter: Xiaohong Yang
>Priority: Major
>
> We use  org.apache.tika.parser.AutoDetectParser to get the metadata and body 
> content of MS office files.  We encountered the following exception with some 
> files
>  
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
> array of length 14523048, but 500 is the maximum for this record type. If 
> the file is not corrupt, please open an issue on bugzilla to request 
> increasing the maximum allowable size for this record type. As a temporary 
> workaround, consider setting a higher override value with 
> IOUtils.setByteArrayMaxOverride()
>  
> To resolve the problem we set byteArrayMaxOverride in the tika-config.xml 
> file as follows
>  
>   
>  
>     <param name="byteArrayMaxOverride" type="int">2000</param>
>  
>   
>  
> This helped to parse some files that failed previously. But some other files 
> still failed.  And then we increased the value to 200 MB and 500 MB.
>  
> Some other file may still fail with byteArrayMaxOverride set to 500 MB.  So 
> we wonder if you can add a feature to the Tika parser for it  to stop reading 
>  metadata and body content if certain amount of memory or body content has 
> reached.  The parser will return the  metadata and body content obtained so 
> far. A warning message will be returned to the caller if this happens.  This 
> will help us to get the metadata and body content from some files that 
> requires a lot of memory.  We may not be able to successfully parse some 
> files without this feature because those files fail somewhere else with the 
> out-of-memory error after we set byteArrayMaxOverride to very high values and 
> the above mentioned failure does not happen. With this feature we will get 
> truncated body content with some files but it is better than get nothing. 
> Actually we will truncate the body content ourselves if it is too large. So 
> we do not care if the body content is truncated if it reaches certain amount.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-17 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400685#comment-17400685
 ] 

Xiaohong Yang commented on TIKA-3519:
-

I tried to upload an example file (zipped), but it failed with the following 
error.  The file size is 32 MB. I wonder if it is too large to be uploaded.



*File "657673.zip" was not uploaded*

An internal error has occurred. Please contact your administrator.



[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-11 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397427#comment-17397427
 ] 

Xiaohong Yang commented on TIKA-3519:
-

Can you check whether you can catch the above-mentioned ByteArrayMaxOverride 
error (Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate 
an array of length 14523048, but 500 is the maximum for this record type…), 
stop parsing, and then write the available body content to the ContentHandler, 
so that we can have the body content parsed so far?



[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-11 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397345#comment-17397345
 ] 

Xiaohong Yang commented on TIKA-3519:
-

I tried org.apache.tika.sax.WriteOutContentHandler with writeLimit in a test 
program and found that this is one of the features we want. However, I noticed 
that this approach (setting writeLimit) does not help to avoid the 
ByteArrayMaxOverride error mentioned in the ticket (Caused by: 
org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 
14523048, but 500 is the maximum for this record type…). I also noticed that 
if the ByteArrayMaxOverride error happens we do not get any body text, 
regardless of the value of writeLimit.
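As a standalone illustration of what a writeLimit does, a SAX handler can simply stop collecting body text once a character budget is spent. This is only a sketch of the idea, not Tika's actual WriteOutContentHandler, which signals the limit by throwing an exception rather than silently stopping:

```java
import org.xml.sax.helpers.DefaultHandler;

// Sketch of a writeLimit-style handler: collects text up to a fixed budget,
// records that truncation happened, and ignores everything after that.
public class CappedTextHandler extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    private final int writeLimit;
    private boolean truncated = false;

    public CappedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        int remaining = writeLimit - text.length();
        if (remaining <= 0) {
            truncated = true;
            return;
        }
        // Keep only as much of this chunk as the budget allows.
        int n = Math.min(length, remaining);
        if (n < length) {
            truncated = true;
        }
        text.append(ch, start, n);
    }

    public String getText() {
        return text.toString();
    }

    public boolean isTruncated() {
        return truncated;
    }
}
```

The difference matters for this ticket: a handler like this keeps the text collected so far even when input keeps arriving, which is the behavior the reporter is asking for when the parse aborts.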

When the ByteArrayMaxOverride error happens we can catch the exception, get the 
required override value from the stack trace, set it with 
IOUtils.setByteArrayMaxOverride(), and try the parse method again (it will 
probably succeed if the machine has enough memory).
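Pulling the required value out of the exception message can be sketched with a regular expression over the text quoted in this ticket. The helper class below is hypothetical (it is not part of Tika or POI), and the exact wording of POI's message may differ across versions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the requested array length from the
// RecordFormatException message quoted in this ticket.
public class RequiredOverride {

    private static final Pattern LENGTH =
            Pattern.compile("allocate an array of length (\\d+)");

    static long fromMessage(String message) {
        Matcher m = LENGTH.matcher(message);
        // Return -1 when the message does not look like the quoted error.
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }
}
```

With the parsed value one could decide whether calling IOUtils.setByteArrayMaxOverride() and retrying is feasible on the current machine.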

However, we wonder if you can add a feature so that the body text is still 
available when the ByteArrayMaxOverride error happens. We could then decide 
whether to try again or to use the available body text (and metadata), 
depending on the required override value; a very high value may not be 
feasible, for example when there is not enough memory on the machine.



[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-09 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396336#comment-17396336
 ] 

Xiaohong Yang commented on TIKA-3519:
-

No. We have not. I will try it and let you know. 

Thank you very much.



[jira] [Commented] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-09 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396067#comment-17396067
 ] 

Xiaohong Yang commented on TIKA-3519:
-

We call Tika programmatically.



[jira] [Updated] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-08 Thread Xiaohong Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaohong Yang updated TIKA-3519:

Description: 
We use  org.apache.tika.parser.AutoDetectParser to get the metadata and body 
content of MS office files.  We encountered the following exception with some 
files

 

Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
array of length 14523048, but 500 is the maximum for this record type. If 
the file is not corrupt, please open an issue on bugzilla to request increasing 
the maximum allowable size for this record type. As a temporary workaround, 
consider setting a higher override value with IOUtils.setByteArrayMaxOverride()

 

To resolve the problem we set byteArrayMaxOverride in the tika-config.xml file 
as follows

 

  

 

   <param name="byteArrayMaxOverride" type="int">2000</param>

 

  

 

This helped to parse some files that failed previously. But some other files 
still failed.  And then we increased the value to 200 MB and 500 MB.

 

Some other file may still fail with byteArrayMaxOverride set to 500 MB.  So we 
wonder if you can add a feature to the Tika parser for it  to stop reading  
metadata and body content if certain amount of memory or body content has 
reached.  The parser will return the  metadata and body content obtained so 
far. A warning message will be returned to the caller if this happens.  This 
will help us to get the metadata and body content from some files that requires 
a lot of memory.  We may not be able to successfully parse some files without 
this feature because those files fail somewhere else with the out-of-memory 
error after we set byteArrayMaxOverride to very high values and the above 
mentioned failure does not happen. With this feature we will get truncated body 
content with some files but it is better than get nothing. Actually we will 
truncate the body content ourselves if it is too large. So we do not care if 
the body content is truncated if reaches certain amount.
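For reference, a tika-config.xml that sets this parameter in the standard parser-parameter layout would look roughly as follows. The OfficeParser class name is an assumption (the archive stripped the XML tags from the original), and the value shown is the 200 MB setting mentioned in the text:

```xml
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.microsoft.OfficeParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">200000000</param>
      </params>
    </parser>
  </parsers>
</properties>
```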


[jira] [Updated] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-08 Thread Xiaohong Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaohong Yang updated TIKA-3519:

Description: 
We use  org.apache.tika.parser.AutoDetectParser to get the metadata and body 
content of MS office files.  We encountered the following exception with some 
files

 

Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
array of length 14523048, but 500 is the maximum for this record type. If 
the file is not corrupt, please open an issue on bugzilla to request increasing 
the maximum allowable size for this record type. As a temporary workaround, 
consider setting a higher override value with IOUtils.setByteArrayMaxOverride()

 

To resolve the problem we set byteArrayMaxOverride in the tika-config.xml file 
as follows

 

  

 

   <param name="byteArrayMaxOverride" type="int">2000</param>

 

  

 

This helped to parse some files that failed previously. But some other files 
still failed.  And then we increased the value to 200 MB and 500 MB.

 

Some other file may still fail with byteArrayMaxOverride set to 500 MB.  So we 
wonder if you can add a feature to the Tika parser for it  to stop reading  
metadata and body content if certain amount of memory or body content has 
reached.  The parser will return the  metadata and body content obtained so 
far. A warning message will be returned to the caller if this happens.  This 
will help us to get the metadata and body content from some files that requires 
a lot of memory.  We may not be able to successfully parse some files without 
this feature because those files fail somewhere else with the out-of-memory 
error after we set byteArrayMaxOverride to very high values and the above 
mentioned failure does not happen. With this feature we will get truncated body 
content with some files but it is better than get nothing. Actually we will 
truncate the body content ourselves if it is too large. So we do not care if 
the body content is truncated if it reaches certain amount.


[jira] [Created] (TIKA-3519) Wonder if you can add a feature for Tika parser to stop reading metadata and body content if certain amount of memory or body content has reached

2021-08-08 Thread Xiaohong Yang (Jira)
Xiaohong Yang created TIKA-3519:
---

 Summary: Wonder if you can add a feature for Tika parser to stop 
reading  metadata and body content if certain amount of memory or body content 
has reached
 Key: TIKA-3519
 URL: https://issues.apache.org/jira/browse/TIKA-3519
 Project: Tika
  Issue Type: Wish
  Components: detector
Affects Versions: 1.26, 1.25
 Environment: Linux
Reporter: Xiaohong Yang


We use  org.apache.tika.parser.AutoDetectParser to get the metadata and body 
content of MS office files.  We encountered the following exception with some 
files

 

Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an 
array of length 14523048, but 500 is the maximum for this record type. If 
the file is not corrupt, please open an issue on bugzilla to request increasing 
the maximum allowable size for this record type. As a temporary workaround, 
consider setting a higher override value with IOUtils.setByteArrayMaxOverride()

 

To resolve the problem we set byteArrayMaxOverride in the tika-config.xml file 
as follows

 

  

 

   2000

 

  

 

This helped to parse some files that failed previously. But some other files 
still failed.  And then we increased the value to 200 MB and 500 MB.

 

Some files may still fail with byteArrayMaxOverride set to 500 MB. So we 
wonder if you can add a feature to the Tika parser to stop reading metadata 
and body content once a certain amount of memory has been used or a certain 
amount of body content has been read. The parser would return the metadata 
and body content obtained so far, and a warning message would be returned to 
the caller when this happens. This would help us get the metadata and body 
content from files that require a lot of memory. Without this feature we may 
not be able to parse some files at all, because after we set 
byteArrayMaxOverride to very high values they fail elsewhere with an 
out-of-memory error even though the failure above no longer occurs. With this 
feature we would get truncated body content for some files, but that is 
better than getting nothing. We truncate the body content ourselves anyway if 
it is too large, so we do not mind if it is cut off once it reaches a certain 
size.
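For reference, Tika's BodyContentHandler can already be constructed with a 
character write limit and stops collecting once it is hit (signalling via a 
SAXException); the behaviour requested here, keeping the partial content and a 
warning flag instead of failing, could look roughly like the following 
JDK-only sketch (class and method names are illustrative, not Tika API):

```java
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch (not Tika API): a SAX content handler that stops
// buffering body text after `writeLimit` characters and records that the
// limit was hit, instead of failing the whole parse.
public class LimitedBodyHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int writeLimit;
    private boolean limitReached = false;

    public LimitedBodyHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (limitReached) {
            return; // silently drop everything past the limit
        }
        int remaining = writeLimit - buffer.length();
        if (length >= remaining) {
            buffer.append(ch, start, remaining);
            limitReached = true; // caller can surface this as a warning
        } else {
            buffer.append(ch, start, length);
        }
    }

    public boolean isLimitReached() {
        return limitReached;
    }

    public String getContent() {
        return buffer.toString();
    }

    public static void main(String[] args) {
        LimitedBodyHandler handler = new LimitedBodyHandler(10);
        char[] text = "hello world, this is a long body".toCharArray();
        handler.characters(text, 0, text.length);
        System.out.println(handler.getContent());     // first 10 chars only
        System.out.println(handler.isLimitReached()); // true
    }
}
```

A handler like this would plug into AutoDetectParser.parse(...) in place of 
BodyContentHandler, so the caller keeps whatever text was gathered before the 
limit was reached and can report a warning rather than an exception.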



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3106) Tika Fails to detect some EML files if extension is not .eml

2020-06-08 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128318#comment-17128318
 ] 

Xiaohong Yang commented on TIKA-3106:
-

Instead of testing in integrated mode with our project, I created a standalone 
program and compiled it against the jar files from your build #1820. It is 
confirmed that the problem with the sample file is fixed: it is now detected 
as message/rfc822 (Tika extension .eml) regardless of the actual file 
extension.

This was the only sample file that had the problem. I will let you know if we 
find more sample files.

I believe the fix will be included in the next 1.24.x release, right?

Thank you very much!

> Tika Fails to detect some EML files if extension is not .eml
> 
>
> Key: TIKA-3106
> URL: https://issues.apache.org/jira/browse/TIKA-3106
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, mime
>Affects Versions: 1.24
>Reporter: Xiaohong Yang
>Priority: Critical
> Attachments: EmlFile.txt
>
>
> I have an eml file that can be detected as message/rfc822 only if the file 
> extension is .eml,  otherwise it will be detected as text/plain.  Following 
> is the code that I use to detect the file type and extension.
>    TikaConfig config = TikaConfigFactory.getTikaConfig();
>    Detector detector = config.getDetector();
>    Metadata metadata = new Metadata();
>    TikaInputStream stream = TikaInputStream.get(new 
> FileInputStream(filePath));
>    metadata.add(Metadata.RESOURCE_NAME_KEY, filePath);
>    MediaType mediaType = detector.detect(stream, metadata);
>    MimeType mimeType = 
> config.getMimeRepository().forName(mediaType.toString());
>    String tikaExtension = mimeType.getExtension();
>  
> When the sample file has the .eml extension, mimeType is message/rfc822 and 
> tikaExtension is .eml. When I change the extension to .txt, mimeType is 
> text/plain and tikaExtension is .txt.
>  
> The same mimeType and tikaExtension should be detected regardless of the 
> file extension.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3107) AutoDetectParser.parse failed with error "Initialisation of record 0x85(BoundSheetRecord) left 28 bytes remaining still to be read"

2020-06-05 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127060#comment-17127060
 ] 

Xiaohong Yang commented on TIKA-3107:
-

Thank you for the information.  I filed the following bug in Apache POI. 

Bug 64500 - LeftoverDataException: Initialisation of record 
0x85(BoundSheetRecord) left 28 bytes remaining still to be read 
([https://bz.apache.org/bugzilla/show_bug.cgi?id=64500]). 

We do not know what software generated the sample file. Excel can open it 
properly.

> AutoDetectParser.parse failed with error "Initialisation of record 
> 0x85(BoundSheetRecord) left 28 bytes remaining still to be read"
> ---
>
> Key: TIKA-3107
> URL: https://issues.apache.org/jira/browse/TIKA-3107
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, parser
>Affects Versions: 1.24
>Reporter: Xiaohong Yang
>Priority: Critical
> Attachments: SOJ.NW.00092712.xls
>
>
> When I try to get the metadata of the sample excel file with the 
> AutoDetectParser.parse method with the following Java code, I got an error 
> "Initialisation of record 0x85(BoundSheetRecord) left 28 bytes remaining 
> still to be read".
>  
> InputStream input = new FileInputStream(localFilePath);
> BodyContentHandler handler = new BodyContentHandler(-1);
> Metadata metadata = new Metadata();
> TikaConfig config = TikaConfigFactory.getTikaConfig();
> Parser autoDetectParser = new AutoDetectParser(config);
> ParseContext context = new ParseContext();
> context.set(TikaConfig.class, config);
> autoDetectParser.parse(input, handler, metadata, context);
>  
> Here is the stack trace:
>  
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.microsoft.OfficeParser@2caa5ec
>    at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>    at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>    at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>    …
>    at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
>    at java.util.concurrent.FutureTask.run(FutureTask.java)
>    at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>    at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>    at java.lang.Thread.run(Thread.java:748)
> Caused by: 
> org.apache.poi.hssf.record.RecordInputStream$LeftoverDataException: 
> Initialisation of record 0x85(BoundSheetRecord) left 28 bytes remaining still 
> to be read.
>    at 
> org.apache.poi.hssf.record.RecordInputStream.hasNextRecord(RecordInputStream.java:188)
>    at 
> org.apache.poi.hssf.extractor.OldExcelExtractor.getText(OldExcelExtractor.java:233)
>    at 
> org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:57)
>    at 
> org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
>    at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183)
>    at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
>    at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>    ... 15 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3106) Tika Fails to detect some EML files if extension is not .eml

2020-06-04 Thread Xiaohong Yang (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126157#comment-17126157
 ] 

Xiaohong Yang commented on TIKA-3106:
-

Thank you very much for the quick response. 

We use Gradle to pull your packages into our project. I downloaded the jar 
files (tika-core-2.0.0-20200604.053323-687.jar and 
tika-parsers-2.0.0-20200604.053748-675.jar) from your build and let Gradle 
use these jar files in the project. It compiles, but some other classes are 
missing at runtime, so I cannot confirm the fix in our project.

I will continue to try to run it. If you can help me run it, please let me 
know. Thank you in advance.

> Tika Fails to detect some EML files if extension is not .eml
> 
>
> Key: TIKA-3106
> URL: https://issues.apache.org/jira/browse/TIKA-3106
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, mime
>Affects Versions: 1.24
>Reporter: Xiaohong Yang
>Priority: Critical
> Attachments: EmlFile.txt
>
>
> I have an eml file that can be detected as message/rfc822 only if the file 
> extension is .eml,  otherwise it will be detected as text/plain.  Following 
> is the code that I use to detect the file type and extension.
>    TikaConfig config = TikaConfigFactory.getTikaConfig();
>    Detector detector = config.getDetector();
>    Metadata metadata = new Metadata();
>    TikaInputStream stream = TikaInputStream.get(new 
> FileInputStream(filePath));
>    metadata.add(Metadata.RESOURCE_NAME_KEY, filePath);
>    MediaType mediaType = detector.detect(stream, metadata);
>    MimeType mimeType = 
> config.getMimeRepository().forName(mediaType.toString());
>    String tikaExtension = mimeType.getExtension();
>  
> When the sample file has the .eml extension, mimeType is message/rfc822 and 
> tikaExtension is .eml. When I change the extension to .txt, mimeType is 
> text/plain and tikaExtension is .txt.
>  
> The same mimeType and tikaExtension should be detected regardless of the 
> file extension.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (TIKA-3107) AutoDetectParser.parse failed with error "Initialisation of record 0x85(BoundSheetRecord) left 28 bytes remaining still to be read"

2020-06-04 Thread Xiaohong Yang (Jira)
Xiaohong Yang created TIKA-3107:
---

 Summary: AutoDetectParser.parse failed with error "Initialisation 
of record 0x85(BoundSheetRecord) left 28 bytes remaining still to be read"
 Key: TIKA-3107
 URL: https://issues.apache.org/jira/browse/TIKA-3107
 Project: Tika
  Issue Type: Bug
  Components: metadata, parser
Affects Versions: 1.24
Reporter: Xiaohong Yang
 Attachments: SOJ.NW.00092712.xls

When I try to get the metadata of the sample Excel file with the 
AutoDetectParser.parse method using the following Java code, I get the error 
"Initialisation of record 0x85(BoundSheetRecord) left 28 bytes remaining 
still to be read".

 

InputStream input = new FileInputStream(localFilePath);
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
TikaConfig config = TikaConfigFactory.getTikaConfig();
Parser autoDetectParser = new AutoDetectParser(config);
ParseContext context = new ParseContext();
context.set(TikaConfig.class, config);
autoDetectParser.parse(input, handler, metadata, context);

 

Here is the stack trace:

 

org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.microsoft.OfficeParser@2caa5ec
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
   …
   at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
   at java.util.concurrent.FutureTask.run(FutureTask.java)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.poi.hssf.record.RecordInputStream$LeftoverDataException: 
Initialisation of record 0x85(BoundSheetRecord) left 28 bytes remaining still 
to be read.
   at org.apache.poi.hssf.record.RecordInputStream.hasNextRecord(RecordInputStream.java:188)
   at org.apache.poi.hssf.extractor.OldExcelExtractor.getText(OldExcelExtractor.java:233)
   at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:57)
   at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:158)
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183)
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
   ... 15 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (TIKA-3106) Tika Fails to detect some EML files if extension is not .eml

2020-06-03 Thread Xiaohong Yang (Jira)
Xiaohong Yang created TIKA-3106:
---

 Summary: Tika Fails to detect some EML files if extension is not 
.eml
 Key: TIKA-3106
 URL: https://issues.apache.org/jira/browse/TIKA-3106
 Project: Tika
  Issue Type: Bug
  Components: metadata, mime
Affects Versions: 1.24
Reporter: Xiaohong Yang
 Attachments: EmlFile.txt

I have an eml file that can be detected as message/rfc822 only if the file 
extension is .eml,  otherwise it will be detected as text/plain.  Following is 
the code that I use to detect the file type and extension.

   TikaConfig config = TikaConfigFactory.getTikaConfig();
   Detector detector = config.getDetector();
   Metadata metadata = new Metadata();
   TikaInputStream stream = TikaInputStream.get(new FileInputStream(filePath));
   metadata.add(Metadata.RESOURCE_NAME_KEY, filePath);
   MediaType mediaType = detector.detect(stream, metadata);
   MimeType mimeType = config.getMimeRepository().forName(mediaType.toString());
   String tikaExtension = mimeType.getExtension();

 

When the sample file has the .eml extension, mimeType is message/rfc822 and 
tikaExtension is .eml. When I change the extension to .txt, mimeType is 
text/plain and tikaExtension is .txt.

The same mimeType and tikaExtension should be detected regardless of the file 
extension.
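When the resource name gives no usable hint, detection has to come from the 
magic bytes alone. The toy sniffer below illustrates the kind of 
header-prefix check that message/rfc822 magic relies on (the header list, 
class name, and MIME strings here are simplified illustrations, not Tika's 
actual magic definitions):

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Toy illustration of magic-based EML detection: look for a typical
// RFC 822 header field at the very start of the file. Real detectors
// use a much richer set of magic rules than this simplified list.
public class MagicEmlSniffer {
    private static final Pattern HEADER = Pattern.compile(
            "^(From|Received|Return-Path|Message-ID|Date|Subject|To|MIME-Version):",
            Pattern.CASE_INSENSITIVE);

    public static String sniff(byte[] prefix) {
        // Header fields are ASCII, so a lossless single-byte decoding suffices.
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        return HEADER.matcher(head).find() ? "message/rfc822" : "text/plain";
    }

    public static void main(String[] args) {
        System.out.println(sniff("Received: from mail.example.com".getBytes()));
        System.out.println(sniff("just a plain note".getBytes()));
    }
}
```

A name-independent check of this kind is what should win here; the reported 
behaviour, where the .txt name takes precedence, suggests the magic match was 
not firing for this particular sample.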

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)