Re: How to extract autoshape text in Excel 2007+

Hiroshi Tatsumi Mon, 22 Jul 2013 09:43:47 -0700

Thank you for your reply. I really appreciate it.
This is a high priority for me, because we use Solr and our customer wants to search autoshape text in Excel 2007+ files.


I've been investigating the Tika source code, and trying to fix it.
I understand that I can extract text from autoshapes with XSSFWorkbook.

But first, I think the problem is the listenForAllRecords field of ExcelExtractor. I set that field to true as shown below, but it didn't work properly.

-----
Class: org.apache.tika.parser.microsoft.OfficeParser
Method: protected void parse(DirectoryNode root, ParseContext context,
        Metadata metadata, XHTMLContentHandler xhtml)

Target code:
       case XLR:
          Locale locale = context.get(Locale.class, Locale.getDefault());
          ExcelExtractor ee = new ExcelExtractor(context);
          ee.setListenForAllRecords(true);
          ee.parse(root, xhtml, locale);
          // original code
          // new ExcelExtractor(context).parse(root, xhtml, locale);
          break;
-----

Is this the wrong direction?
If you know which class I should fix, please let me know.

-----Original Message-----
From: Allison, Timothy B.
Sent: Monday, July 22, 2013 10:27 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This is one way to access the underlying CTShape that contains the text:

       XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
       XSSFSheet sheet = wb.getSheetAt(0);
       XSSFDrawing drawing = sheet.createDrawingPatriarch();
       for (XSSFShape shape : drawing.getShapes()){
          if (shape instanceof XSSFSimpleShape){
             XSSFSimpleShape simple = ((XSSFSimpleShape)shape);
             System.out.println("CT: "+simple.getCTShape());
          }
       }

Hiroshi, if this is a high priority, you could extract the txBody element with some bean work. I've opened https://issues.apache.org/jira/browse/TIKA-1150 for the longer-term fix.

There's some work under way on XSSFTextCell in POI that might make this more straightforward.
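
As a possible stopgap for the txBody text mentioned above, here is a minimal sketch that does not go through Tika or POI at all. It treats the .xlsx as a zip package and collects the DrawingML text runs (the a:t elements inside txBody) from the xl/drawings/*.xml parts, using only the JDK. The class name DrawingTextSketch and the method extractShapeText are hypothetical names made up for this illustration, and the sketch assumes the shape text is stored in those drawing parts, which is where Excel normally writes it. It has not been tested against files with grouped shapes or other unusual layouts.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Tika or POI.
final class DrawingTextSketch {

    // Namespace of DrawingML elements such as <a:t>.
    static final String DRAWINGML_NS =
            "http://schemas.openxmlformats.org/drawingml/2006/main";

    /** Returns the text of every <a:t> run found in xl/drawings/*.xml. */
    static List<String> extractShapeText(InputStream xlsx) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted files cannot trigger XXE.
        factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);

        List<String> runs = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Skips .rels and .vml parts as well as everything outside xl/drawings/.
                if (!name.startsWith("xl/drawings/") || !name.endsWith(".xml")) {
                    continue;
                }
                // Copy the entry first; handing the zip stream to the parser could close it.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int read;
                while ((read = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, read);
                }
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buffer.toByteArray()));
                NodeList textRuns = doc.getElementsByTagNameNS(DRAWINGML_NS, "t");
                for (int i = 0; i < textRuns.getLength(); i++) {
                    runs.add(textRuns.item(i).getTextContent());
                }
            }
        }
        return runs;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DrawingTextSketch <file.xlsx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            for (String run : extractShapeText(in)) {
                System.out.println(run);
            }
        }
    }
}
```

Each returned string is one text run, so a shape with mixed formatting may come back as several pieces. The sketch also does not record which sheet a drawing belongs to; that would need the relationship parts under xl/worksheets/_rels/.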


-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a new feature in both Tika and POI. I've only looked very briefly into the POI libraries, and I may have missed how to extract text from autoshapes. I'll open an issue in both projects.

-----Original Message-----
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried to extract from some MS office files.
The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,

Hiro.
