Fixed now. Build from current trunk (r1526498) or pull from https://builds.apache.org/job/Tika-trunk/lastStableBuild/ after Jenkins has had a chance to build.
Best,
Tim

-----Original Message-----
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Wednesday, September 25, 2013 6:30 PM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Hi,

I'm waiting for the fix for this bug:
https://issues.apache.org/jira/browse/TIKA-1100

The POI bug referenced in this issue has already been fixed:
http://issues.apache.org/bugzilla/show_bug.cgi?id=55292

It would be great if you could give me a patch.

Thanks,
Hiroshi Tatsumi

-----Original Message-----
From: Allison, Timothy B.
Sent: Tuesday, July 23, 2013 5:10 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

Hiroshi,

To fix this on your own will take quite a bit of work. I give details below if you do want to go this route. The longer-term path, I think, is:

1) https://issues.apache.org/bugzilla/show_bug.cgi?id=55292 will be committed to POI.
2) A new release of POI will be made.
3) Small fixes to Tika's Excel parser will be made to take advantage of the new functionality from POI bug 55292.

Others on the list may have a simpler solution, but this is what I had to do before https://issues.apache.org/jira/browse/TIKA-1130 was committed. This is a very unappetizing solution; beware of dragons and don't try this at work. Your steps will differ somewhat because you're working with xlsx vs docx. I'm sure I don't remember every step.

1) Modify the underlying POI code to expose a getText() or similar functionality on the object of interest to me (in my original email, I gave some hint of how to do this).
2) Modify XWPFWordExtractorDecorator to take advantage of getText() in the underlying POI object.

There are several options for how to tie it all together.
3) I chose to copy XWPFWordExtractorDecorator and the following classes into a different namespace: OOXMLExtractorFactory
4) Modify the above to call your new version of XWPFWordExtractorDecorator.
5) Finally, register your new office parser in META-INF/services/org.apache.tika.parser.Parser in tika-parsers.

-----Original Message-----
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Monday, July 22, 2013 11:42 AM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Thank you for your reply. I really appreciate it.

This is a high priority for me because we use Solr, and our customer wants to search the text of autoshapes in Excel 2007+ files.

I've been investigating the Tika source code and trying to fix it. I understand that I can extract text from autoshapes with XSSFWorkbook. But first, I think the problem is ExcelExtractor's listenForAllRecords field. I set listenForAllRecords to true as shown below, but it didn't work properly.

-----
Class: org.apache.tika.parser.microsoft.OfficeParser
Method: protected void parse(
            DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml)
TargetCode:
    case XLR:
        Locale locale = context.get(Locale.class, Locale.getDefault());
        ExcelExtractor ee = new ExcelExtractor(context);
        ee.setListenForAllRecords(true);
        ee.parse(root, xhtml, locale);
        // original code
        // new ExcelExtractor(context).parse(root, xhtml, locale);
        break;
-----

Is this the wrong direction? If you know which class I should fix, please let me know.

-----Original Message-----
From: Allison, Timothy B.
Sent: Monday, July 22, 2013 10:27 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This is one way to access the underlying CTShape that contains the text:

    // f is the .xlsx file to inspect
    XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
    XSSFSheet sheet = wb.getSheetAt(0);
    // returns the sheet's existing drawing if it already has one
    XSSFDrawing drawing = sheet.createDrawingPatriarch();
    for (XSSFShape shape : drawing.getShapes()) {
        if (shape instanceof XSSFSimpleShape) {
            XSSFSimpleShape simple = (XSSFSimpleShape) shape;
            System.out.println("CT: " + simple.getCTShape());
        }
    }

Hiroshi,

If this is a high priority, you could extract the txBody element with some bean work. I've opened https://issues.apache.org/jira/browse/TIKA-1150 for the longer-term fix. There's some work under way on XSSFTextCell in POI that might make this more straightforward.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a new feature in both Tika and POI. I've only looked very briefly into the POI libraries, and I may have missed how to extract text from autoshapes. I'll open an issue in both projects.

-----Original Message-----
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1. I'd like to extract autoshape text from Excel 2007+ (.xlsx) files, but I can't. I tried extracting from several MS Office files. The results are below.

Success (I can extract autoshape text.)
- Excel 2003 (.xls)
- Word 2003 (.doc)
- Word 2007+ (.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+ (.xlsx)

Is this a bug? If you know, could you tell me how to extract autoshape text from Excel 2007+ files?

Thanks,
Hiro.
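
-----
A minimal sketch of the "extract the txBody element" route mentioned earlier in this thread, for cases where picking up a newer POI or Tika build is not an option. It does not use POI or Tika at all and is not the fix that went into trunk. It assumes (this is not stated in the thread) that an .xlsx file is a zip archive whose drawing parts are stored under xl/drawings/ and that shape text sits in DrawingML a:t elements. The names DrawingTextSketch and extractDrawingText are hypothetical, chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects the text runs found in the drawing parts of an .xlsx file.
class DrawingTextSketch {

    // DrawingML main namespace; text runs are <a:t> elements in this namespace.
    static final String DRAWINGML_NS =
            "http://schemas.openxmlformats.org/drawingml/2006/main";

    static List<String> extractDrawingText(InputStream xlsx) throws Exception {
        List<String> runs = new ArrayList<>();
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Assumption: drawing parts are named xl/drawings/*.xml
                if (!name.startsWith("xl/drawings/") || !name.endsWith(".xml")) {
                    continue;
                }
                // Buffer the entry so the XML parser cannot close the zip stream.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int read;
                while ((read = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, read);
                }
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buffer.toByteArray()));
                NodeList textNodes = doc.getElementsByTagNameNS(DRAWINGML_NS, "t");
                for (int i = 0; i < textNodes.getLength(); i++) {
                    runs.add(textNodes.item(i).getTextContent());
                }
            }
        }
        return runs;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            for (String run : extractDrawingText(in)) {
                System.out.println(run);
            }
        }
    }
}
```

The method returns one string per text run. Runs are not joined into paragraphs, and no record is kept of which sheet or shape a run came from, so this is only a stopgap for getting the text into an index.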