Thank you for your reply. I really appreciate it.
This is a high priority for me, because we use Solr and our customer wants to search the text of autoshapes in Excel 2007+ files.

I've been investigating the Tika source code and trying to fix it myself.
I understand that I can extract text from autoshapes with XSSFWorkbook.
But first, I suspect the problem is ExcelExtractor's listenForAllRecords field. I set listenForAllRecords to true as shown below, but that did not fix it.
-----
Class: org.apache.tika.parser.microsoft.OfficeParser
Method: protected void parse(
DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml)
TargetCode:
       case XLR:
          Locale locale = context.get(Locale.class, Locale.getDefault());
          ExcelExtractor ee = new ExcelExtractor(context);
          ee.setListenForAllRecords(true);
          ee.parse(root, xhtml, locale);
          // original code
          // new ExcelExtractor(context).parse(root, xhtml, locale);
          break;
-----

Is this the wrong direction?
If you know which class I should fix, please let me know.



-----Original Message----- From: Allison, Timothy B.
Sent: Monday, July 22, 2013 10:27 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This is one way to access the underlying CTShape that contains the text:

       XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
       XSSFSheet sheet = wb.getSheetAt(0);
       XSSFDrawing drawing = sheet.createDrawingPatriarch();
       for (XSSFShape shape : drawing.getShapes()){
          if (shape instanceof XSSFSimpleShape){
             XSSFSimpleShape simple = ((XSSFSimpleShape)shape);
             System.out.println("CT: "+simple.getCTShape());
          }
       }

Hiroshi, if this is a high priority, you could extract the txBody element with some bean work. I've opened https://issues.apache.org/jira/browse/TIKA-1150 for the longer-term fix.
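
As a stopgap, the text inside txBody can also be read without POI. This is a different technique from the bean work mentioned above: an .xlsx file is a ZIP archive, the shapes of a sheet live in parts named xl/drawings/drawingN.xml, and each text run is an a:t element in the DrawingML namespace. The following is only a minimal sketch using the JDK alone. The class name XlsxShapeText and the method extract are made up for illustration and are not part of Tika or POI, and the sketch assumes the drawing parts sit in the usual xl/drawings/ location.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika or POI.
final class XlsxShapeText {
    // DrawingML main namespace (the "a:" prefix in drawing parts).
    private static final String DRAWINGML_NS =
            "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Returns one string per non-empty a:p paragraph found in the drawing parts.
    static List<String> extract(InputStream xlsx) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the input may be untrusted.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        List<String> texts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Assumption: drawing parts are stored as xl/drawings/*.xml.
                if (!name.startsWith("xl/drawings/") || !name.endsWith(".xml")) {
                    continue;
                }
                // Buffer the entry so the XML parser cannot close the ZIP stream.
                Document doc = builder.parse(new ByteArrayInputStream(readEntry(zip)));
                NodeList paragraphs = doc.getElementsByTagNameNS(DRAWINGML_NS, "p");
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    Element paragraph = (Element) paragraphs.item(i);
                    NodeList runs = paragraph.getElementsByTagNameNS(DRAWINGML_NS, "t");
                    StringBuilder text = new StringBuilder();
                    for (int j = 0; j < runs.getLength(); j++) {
                        text.append(runs.item(j).getTextContent());
                    }
                    if (text.length() > 0) {
                        texts.add(text.toString());
                    }
                }
            }
        }
        return texts;
    }

    // Reads the current ZIP entry fully into memory.
    private static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Calling XlsxShapeText.extract(new FileInputStream(f)) should give the paragraph texts of every shape in the workbook, which could then be indexed alongside the normal Tika output until TIKA-1150 is resolved.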

There's some work going on on XSSFTextCell in POI that might make this more straightforward.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a new feature in both Tika and POI. I've only looked very briefly into the POI libraries, and I may have missed how to extract text from autoshapes. I'll open an issue in both projects.

-----Original Message-----
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried extracting text from several MS Office files.
The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,
Hiro.
