Thank you for your reply. I really appreciate it.
This is a high priority for me because we use Solr, and our customer wants
to search the text of autoshapes in Excel 2007+ files.
I've been investigating the Tika source code and trying to fix it.
I understand that I can extract text from autoshapes with XSSFWorkbook.
But first, I think the problem is ExcelExtractor's listenForAllRecords
field.
I set listenForAllRecords to true as shown below, but it didn't work
properly.
-----
Class: org.apache.tika.parser.microsoft.OfficeParser
Method: protected void parse(
            DirectoryNode root, ParseContext context, Metadata metadata,
            XHTMLContentHandler xhtml)
TargetCode:
    case XLR:
        Locale locale = context.get(Locale.class, Locale.getDefault());
        ExcelExtractor ee = new ExcelExtractor(context);
        ee.setListenForAllRecords(true);
        ee.parse(root, xhtml, locale);
        // original code
        // new ExcelExtractor(context).parse(root, xhtml, locale);
        break;
-----
Is this the wrong direction?
If you know which class I should fix, please let me know.
-----Original Message-----
From: Allison, Timothy B.
Sent: Monday, July 22, 2013 10:27 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+
This is one way to access the underlying CTShape that contains the text:
    XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
    XSSFSheet sheet = wb.getSheetAt(0);
    XSSFDrawing drawing = sheet.createDrawingPatriarch();
    for (XSSFShape shape : drawing.getShapes()) {
        if (shape instanceof XSSFSimpleShape) {
            XSSFSimpleShape simple = (XSSFSimpleShape) shape;
            System.out.println("CT: " + simple.getCTShape());
        }
    }
Hiroshi, if this is a high priority, you could extract the txBody element
with some bean work. I've opened
https://issues.apache.org/jira/browse/TIKA-1150 for the longer-term fix.
There's some work under way on XSSFTextCell in POI that might make this more
straightforward.
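As a stopgap, the same txBody text can also be read straight from the drawing XML without going through the POI beans. The following is only a minimal sketch under stated assumptions, not the eventual Tika or POI fix: it uses JDK classes only, it assumes the drawing parts sit under the conventional xl/drawings/ path inside the .xlsx package, and the class and method names (DrawingTextSketch, shapeText, shapeTextFromXlsx) are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika or POI.
class DrawingTextSketch {
    // Namespaces used by spreadsheet drawing parts (xdr:) and DrawingML text (a:).
    static final String XDR_NS =
        "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing";
    static final String A_NS =
        "http://schemas.openxmlformats.org/drawingml/2006/main";

    /** Returns one string per non-empty a:p paragraph found under any xdr:txBody. */
    static List<String> shapeText(InputStream drawingXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder().parse(drawingXml);

            List<String> out = new ArrayList<>();
            NodeList bodies = doc.getElementsByTagNameNS(XDR_NS, "txBody");
            for (int i = 0; i < bodies.getLength(); i++) {
                NodeList paras =
                    ((Element) bodies.item(i)).getElementsByTagNameNS(A_NS, "p");
                for (int j = 0; j < paras.getLength(); j++) {
                    // Concatenate the a:t text runs of this paragraph.
                    NodeList runs =
                        ((Element) paras.item(j)).getElementsByTagNameNS(A_NS, "t");
                    StringBuilder sb = new StringBuilder();
                    for (int k = 0; k < runs.getLength(); k++) {
                        sb.append(runs.item(k).getTextContent());
                    }
                    if (sb.length() > 0) {
                        out.add(sb.toString());
                    }
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse drawing part", e);
        }
    }

    /** Collects shape text from every drawing part in the package. */
    static List<String> shapeTextFromXlsx(String path) throws IOException {
        List<String> out = new ArrayList<>();
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                // Assumption: drawing parts use the conventional location.
                if (name.startsWith("xl/drawings/") && name.endsWith(".xml")) {
                    try (InputStream in = zip.getInputStream(entry)) {
                        out.addAll(shapeText(in));
                    }
                }
            }
        }
        return out;
    }
}
```

Calling DrawingTextSketch.shapeTextFromXlsx on the path of a workbook would return the paragraphs of every shape in the file, which could then be added to the Solr document alongside the text Tika already extracts. The sketch does not say which sheet a shape belongs to; that would need the sheet-to-drawing relationships.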
-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+
This looks like an area for a new feature in both Tika and POI. I've only
looked very briefly into the POI libraries, and I may have missed how to
extract text from autoshapes. I'll open an issue in both projects.
-----Original Message-----
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+
Hi,
I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text from Excel 2007+ (.xlsx) files, but I can't.
I tried extracting from several MS Office file types.
The results are below.
Success (I can extract autoshape text.)
- Excel 2003 (.xls)
- Word 2003 (.doc)
- Word 2007+ (.docx)
Failed (I cannot extract autoshape text.)
- Excel 2007+ (.xlsx)
Is this a bug?
If you know how, could you tell me how to extract autoshape text from Excel 2007+ files?
Thanks,
Hiro.