RE: How to extract autoshape text in Excel 2007+

Allison, Timothy B. Mon, 22 Jul 2013 13:12:05 -0700

Hiroshi,
   To fix this on your own will take quite a bit of work.  I give details below 
if you do want to go this route.


The longer-term path, I think, is:
1) https://issues.apache.org/bugzilla/show_bug.cgi?id=55292 will be committed 
to POI.
2) A new release of POI will be made.
3) Small fixes to Tika's Excel parser will be made to take advantage of the new 
functionality from POI bug 55292.

Others on the list may have a simpler solution, but this is what I had to do 
before https://issues.apache.org/jira/browse/TIKA-1130 was committed.

This is a very unappetizing solution; beware of dragons and don't try this at 
work.  Your steps will differ somewhat because you're working with xlsx vs 
docx.  I'm sure that I don't remember each step.

1) Modify the underlying POI code to expose a getText() or similar method 
on the object of interest (my original email gives some hint of how to do 
this).

2) Modify XWPFWordExtractorDecorator to take advantage of getText() in the 
underlying POI object.

There are several options for how to tie it all together.
3) I chose to copy XWPFWordExtractorDecorator and the following classes 
into a different namespace (package):
        OOXMLExtractorFactory
4) Modify the copies to call your new version of XWPFWordExtractorDecorator.
5) Finally, register your new office parser in 
tika-parsers/META-INF/services/org.apache.tika.parser.Parser
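
To check that step 5 took effect, it can help to list what is actually 
registered on the classpath. The standard Java service-provider convention 
places provider class names in files under META-INF/services/ named after 
the service interface, so the sketch below (JDK only) collects every name 
listed for org.apache.tika.parser.Parser. ServiceFileSketch and 
listProviders are hypothetical names, and the sketch only reads the files; 
it does not instantiate any parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper; not part of Tika.
class ServiceFileSketch {

    /** Reads every META-INF/services/<serviceName> file visible to the loader. */
    static List<String> listProviders(String serviceName, ClassLoader cl)
            throws IOException {
        List<String> names = new ArrayList<>();
        Enumeration<URL> urls = cl.getResources("META-INF/services/" + serviceName);
        while (urls.hasMoreElements()) {
            URL url = urls.nextElement();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                for (String line; (line = reader.readLine()) != null; ) {
                    // Service files allow '#' comments and blank lines.
                    int hash = line.indexOf('#');
                    if (hash >= 0) {
                        line = line.substring(0, hash);
                    }
                    line = line.trim();
                    if (!line.isEmpty()) {
                        names.add(line);
                    }
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        for (String name : listProviders("org.apache.tika.parser.Parser", cl)) {
            System.out.println(name);
        }
    }
}
```

Run with the modified tika-parsers jar on the classpath; if the copied 
parser's class name is missing from the output, the registration file was 
not picked up.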



-----Original Message-----
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp] 
Sent: Monday, July 22, 2013 11:42 AM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Thank you for your reply. I really appreciate it.
This is a high priority for me because we use Solr, and our customer wants 
to search autoshape text in Excel 2007+ files.

I've been investigating the Tika source code, and trying to fix it.
I understand that I can extract text from autoshapes with XSSFWorkbook.
But first, I think the problem is ExcelExtractor's listenForAllRecords 
field.
I set the listenForAllRecords field to true as shown below, but it didn't 
work properly.
-----
Class: org.apache.tika.parser.microsoft.OfficeParser
Method: protected void parse(
    DirectoryNode root, ParseContext context, Metadata metadata, 
XHTMLContentHandler xhtml)
TargetCode:
        case XLR:
           Locale locale = context.get(Locale.class, Locale.getDefault());
           ExcelExtractor ee = new ExcelExtractor(context);
           ee.setListenForAllRecords(true);
           ee.parse(root, xhtml, locale);
           // original code
           // new ExcelExtractor(context).parse(root, xhtml, locale);
           break;
-----

Is this the wrong direction?
If you know which class I should fix, please let me know.



-----Original Message----- 
From: Allison, Timothy B.
Sent: Monday, July 22, 2013 10:27 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This is one way to access the underlying CTShape that contains the text:

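        // Needs: import java.io.FileInputStream;
        //        import org.apache.poi.xssf.usermodel.*;
        // f is the java.io.File pointing at the .xlsx workbook.
        XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
        XSSFSheet sheet = wb.getSheetAt(0);
        // Returns the sheet's existing drawing, or creates an empty one.
        XSSFDrawing drawing = sheet.createDrawingPatriarch();
        for (XSSFShape shape : drawing.getShapes()){
           if (shape instanceof XSSFSimpleShape){
              XSSFSimpleShape simple = ((XSSFSimpleShape)shape);
              // The CTShape bean holds the txBody with the shape's text.
              System.out.println("CT: "+simple.getCTShape());
           }
        }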

Hiroshi, if this is a high priority, you could extract the txBody element 
with some bean work.  I've opened 
https://issues.apache.org/jira/browse/TIKA-1150 for the longer term fix.
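
For anyone who cannot patch POI, one possible stopgap is to skip the object 
model and read the drawing XML straight out of the .xlsx package. The sketch 
below is an illustration under stated assumptions, not Tika's or POI's 
method: it uses only the JDK (Java 9+), assumes drawing parts live under the 
conventional xl/drawings/ path, and collects the a:t text runs found inside 
each xdr:txBody element. DrawingTextSketch and extractShapeText are 
hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; not part of POI or Tika.
class DrawingTextSketch {
    static final String XDR_NS =
        "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing";
    static final String A_NS =
        "http://schemas.openxmlformats.org/drawingml/2006/main";

    /** Returns one string per txBody; paragraphs are joined with a newline. */
    static List<String> extractShapeText(InputStream xlsx) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        // Refuse DOCTYPEs so untrusted files cannot pull in external entities.
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        List<String> shapes = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                String name = e.getName();
                // Assumption: drawing parts use the conventional location.
                if (!name.startsWith("xl/drawings/") || !name.endsWith(".xml")) {
                    continue;
                }
                // Buffer the entry so the XML parser cannot close the zip stream.
                byte[] xml = zip.readAllBytes();
                Document doc = dbf.newDocumentBuilder()
                                  .parse(new ByteArrayInputStream(xml));
                NodeList bodies = doc.getElementsByTagNameNS(XDR_NS, "txBody");
                for (int i = 0; i < bodies.getLength(); i++) {
                    String text = bodyText((Element) bodies.item(i));
                    if (!text.trim().isEmpty()) {
                        shapes.add(text);
                    }
                }
            }
        }
        return shapes;
    }

    // Concatenates the a:t runs of each a:p paragraph inside one txBody.
    private static String bodyText(Element txBody) {
        List<String> paragraphs = new ArrayList<>();
        NodeList paras = txBody.getElementsByTagNameNS(A_NS, "p");
        for (int i = 0; i < paras.getLength(); i++) {
            StringBuilder sb = new StringBuilder();
            NodeList runs = ((Element) paras.item(i)).getElementsByTagNameNS(A_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                sb.append(runs.item(j).getTextContent());
            }
            paragraphs.add(sb.toString());
        }
        return String.join("\n", paragraphs);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: DrawingTextSketch <file.xlsx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            extractShapeText(in).forEach(System.out::println);
        }
    }
}
```

This does not map shapes back to their sheets (that would require following 
the package relationships), so it suits indexing for search more than 
faithful reconstruction.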

There's some work under way on XSSFTextCell in POI that might make this more 
straightforward.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a new feature in both Tika and POI.  I've only 
looked very briefly into the POI libraries, and I may have missed how to 
extract text from autoshapes.  I'll open an issue in both projects.

-----Original Message-----
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried to extract from some MS office files.
The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,
Hiro.
