This is one way to access the underlying CTShape that contains the text:

    import java.io.FileInputStream;
    import org.apache.poi.xssf.usermodel.*;

    // f is the .xlsx file to read
    XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
    XSSFSheet sheet = wb.getSheetAt(0);
    // returns the sheet's existing drawing if it already has one
    XSSFDrawing drawing = sheet.createDrawingPatriarch();
    for (XSSFShape shape : drawing.getShapes()) {
        if (shape instanceof XSSFSimpleShape) {
            XSSFSimpleShape simple = (XSSFSimpleShape) shape;
            System.out.println("CT: " + simple.getCTShape());
        }
    }

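
If going through the XMLBeans objects is awkward, the shape text can also be read straight from the drawing XML. This is only a rough sketch under a few assumptions: the text sits in a:t runs under xdr:txBody in the drawing part of the .xlsx package, the class and method names (ShapeTextSketch, extractText) are made up for illustration, and the sample XML is a hand-written fragment rather than output from a real workbook. It uses only the JDK, not POI or Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of POI or Tika.
public class ShapeTextSketch {
    public static final String XDR_NS =
        "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing";
    public static final String A_NS =
        "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Hand-written fragment standing in for a drawing part (assumption).
    public static final String SAMPLE =
        "<xdr:wsDr xmlns:xdr=\"" + XDR_NS + "\" xmlns:a=\"" + A_NS + "\">"
      + "<xdr:twoCellAnchor><xdr:sp><xdr:txBody>"
      + "<a:p><a:r><a:t>Hello </a:t></a:r><a:r><a:t>shape</a:t></a:r></a:p>"
      + "</xdr:txBody></xdr:sp></xdr:twoCellAnchor></xdr:wsDr>";

    /** Returns one string per non-empty a:p paragraph inside any xdr:txBody. */
    public static List<String> extractText(InputStream drawingXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed for the *NS lookups below
        Document doc = dbf.newDocumentBuilder().parse(drawingXml);

        List<String> paragraphs = new ArrayList<>();
        NodeList bodies = doc.getElementsByTagNameNS(XDR_NS, "txBody");
        for (int i = 0; i < bodies.getLength(); i++) {
            NodeList paras = ((Element) bodies.item(i)).getElementsByTagNameNS(A_NS, "p");
            for (int j = 0; j < paras.getLength(); j++) {
                // Concatenate the text runs of this paragraph in document order.
                NodeList runs = ((Element) paras.item(j)).getElementsByTagNameNS(A_NS, "t");
                StringBuilder sb = new StringBuilder();
                for (int k = 0; k < runs.getLength(); k++) {
                    sb.append(runs.item(k).getTextContent());
                }
                if (sb.length() > 0) {
                    paragraphs.add(sb.toString());
                }
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8));
        for (String p : extractText(in)) {
            System.out.println(p); // prints "Hello shape"
        }
    }
}
```

For a real file you would open the .xlsx with java.util.zip.ZipFile and pass in the stream of the drawing entry. The entry is typically something like xl/drawings/drawing1.xml, but that name is an assumption here; a careful implementation would resolve it through the sheet's relationships instead.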
Hiroshi,

If this is a high priority, you could extract the txBody element with some bean work. I've opened https://issues.apache.org/jira/browse/TIKA-1150 for the longer term fix. There's some work going on on XSSFTextCell in POI that might make this more straightforward.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a new feature in both Tika and POI. I've only looked very briefly into the POI libraries, and I may have missed how to extract text from autoshapes. I'll open an issue in both projects.

-----Original Message-----
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried to extract from some MS office files. The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug? If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,
Hiro.