This is one way to access the underlying CTShape that contains the text:

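        import java.io.FileInputStream;
        import org.apache.poi.xssf.usermodel.*;

        // f is the java.io.File for the .xlsx workbook.
        XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
        XSSFSheet sheet = wb.getSheetAt(0);
        // For a sheet that already has a drawing, this returns the existing one.
        XSSFDrawing drawing = sheet.createDrawingPatriarch();
        for (XSSFShape shape : drawing.getShapes()) {
            if (shape instanceof XSSFSimpleShape) {
                XSSFSimpleShape simple = (XSSFSimpleShape) shape;
                // getCTShape() exposes the XMLBeans object that holds txBody.
                System.out.println("CT: " + simple.getCTShape());
            }
        }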

Hiroshi, if this is a high priority, you could extract the txBody element 
from that CTShape with some XMLBeans work.  I've opened 
https://issues.apache.org/jira/browse/TIKA-1150 for the longer-term fix.
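
As a stopgap, here is a rough sketch of a different route from the bean work 
mentioned above: it skips POI and XMLBeans and reads the drawing XML directly 
with the JDK's DOM parser, collecting the a:t runs under each xdr:txBody.  The 
class and method names are made up for illustration, and the assumption that 
drawing parts are named xl/drawings/drawingN.xml is just the usual Excel 
layout, not something the package format guarantees.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Tika or POI.
class DrawingTextSketch {
    static final String XDR_NS =
        "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing";
    static final String A_NS =
        "http://schemas.openxmlformats.org/drawingml/2006/main";

    /** Returns one string per xdr:txBody; paragraphs are joined by '\n'. */
    static List<String> extractShapeText(InputStream drawingXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            // Refuse DOCTYPE declarations, since the file may be untrusted.
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = dbf.newDocumentBuilder().parse(drawingXml);

            List<String> result = new ArrayList<>();
            NodeList bodies = doc.getElementsByTagNameNS(XDR_NS, "txBody");
            for (int i = 0; i < bodies.getLength(); i++) {
                NodeList paras = ((Element) bodies.item(i)).getElementsByTagNameNS(A_NS, "p");
                StringBuilder sb = new StringBuilder();
                for (int p = 0; p < paras.getLength(); p++) {
                    if (p > 0) sb.append('\n');
                    // Concatenate every a:t text run inside this paragraph.
                    NodeList runs = ((Element) paras.item(p)).getElementsByTagNameNS(A_NS, "t");
                    for (int r = 0; r < runs.getLength(); r++) {
                        sb.append(runs.item(r).getTextContent());
                    }
                }
                result.add(sb.toString());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse drawing XML", e);
        }
    }

    /** Assumes drawing parts are named xl/drawings/drawingN.xml. */
    static List<String> extractFromXlsx(File xlsx) throws IOException {
        List<String> all = new ArrayList<>();
        try (ZipFile zip = new ZipFile(xlsx)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                if (name.startsWith("xl/drawings/drawing") && name.endsWith(".xml")) {
                    try (InputStream in = zip.getInputStream(entry)) {
                        all.addAll(extractShapeText(in));
                    }
                }
            }
        }
        return all;
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Usage: java DrawingTextSketch workbook.xlsx
            extractFromXlsx(new File(args[0])).forEach(System.out::println);
            return;
        }
        // No file given: run against a tiny hand-written drawing fragment.
        String xml = "<xdr:wsDr xmlns:xdr='" + XDR_NS + "' xmlns:a='" + A_NS + "'>"
            + "<xdr:sp><xdr:txBody><a:p><a:r><a:t>autoshape text</a:t></a:r></a:p>"
            + "</xdr:txBody></xdr:sp></xdr:wsDr>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        extractShapeText(in).forEach(System.out::println);
    }
}
```

This only picks up text that sits under xdr:txBody in the drawing parts; it 
does not touch cell contents, so it would have to run alongside whatever 
already extracts the cells.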

There is also ongoing work on XSSFTextCell in POI that might make this more 
straightforward.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a new feature in both Tika and POI.  I've only 
looked very briefly into the POI libraries, and I may have missed how to 
extract text from autoshapes.  I'll open an issue in both projects.

-----Original Message-----
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp] 
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text from Excel 2007+ (.xlsx) files, but I can't.

I tried to extract from some MS office files.
The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,
Hiro. 
