RE: How to extract autoshape text in Excel 2007+

Allison, Timothy B. Thu, 26 Sep 2013 07:08:57 -0700

Fixed now.  Build from current trunk (r1526498) or pull from 
https://builds.apache.org/job/Tika-trunk/lastStableBuild/ after Jenkins has had 
a chance to build.

Best,

       Tim

-----Original Message-----
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp] 
Sent: Wednesday, September 25, 2013 6:30 PM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Hi,

I'm waiting for the fix of this bug.
https://issues.apache.org/jira/browse/TIKA-1150

The POI bug referenced in this issue has already been fixed.
http://issues.apache.org/bugzilla/show_bug.cgi?id=55292

It would be great if you could give me a patch.

Thanks,
Hiroshi Tatsumi

-----Original Message----- 
From: Allison, Timothy B.
Sent: Tuesday, July 23, 2013 5:10 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

Hiroshi,
   To fix this on your own will take quite a bit of work.  I give details 
below if you do want to go this route.

The longer-term path, I think, is:
1) https://issues.apache.org/bugzilla/show_bug.cgi?id=55292 will be 
committed to POI.
2) A new release of POI will be made.
3) Small fixes to Tika's Excel parser will be made to take advantage of the 
new functionality added to POI by bug 55292.

Others on the list may have a simpler solution, but this is what I had to do 
before https://issues.apache.org/jira/browse/TIKA-1130 was committed.

This is a very unappetizing solution; beware of dragons and don't try this 
at work.  Your steps will differ somewhat because you're working with xlsx 
vs docx.  I'm sure that I don't remember each step.

1) Modify the underlying POI code to expose a getText() or similar 
functionality on the object of interest to me (in my original email, I gave 
some hint of how to do this)

2) Modify XWPFWordExtractorDecorator to take advantage of getText() in the 
underlying POI object.

There are several options for how to tie it all together.
3) I chose to copy XWPFWordExtractorDecorator and the following classes 
into a different namespace:
OOXMLExtractorFactory
4) Modify the above to call your new version of XWPFWordExtractorDecorator
5) Finally, register your new office parser in 
tika-parsers/META-INF/services/org.apache.tika.parser.Parser

-----Original Message-----
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Monday, July 22, 2013 11:42 AM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Thank you for your reply. I really appreciate it.
This is a high priority for me because we use Solr, and our customer wants
to search autoshape text in Excel 2007+ files.

I've been investigating the Tika source code, and trying to fix it.
I understand that I can extract text from autoshapes with XSSFWorkbook.
But first, I think the problem is ExcelExtractor's listenForAllRecords
field.
I set the listenForAllRecords field to true as shown below, but it didn't
work properly.
-----
Class: org.apache.tika.parser.microsoft.OfficeParser
Method: protected void parse(
    DirectoryNode root, ParseContext context, Metadata metadata,
    XHTMLContentHandler xhtml)
TargetCode:
        case XLR:
           Locale locale = context.get(Locale.class, Locale.getDefault());
           ExcelExtractor ee = new ExcelExtractor(context);
           ee.setListenForAllRecords(true);
           ee.parse(root, xhtml, locale);
           // original code
           // new ExcelExtractor(context).parse(root, xhtml, locale);
           break;
-----

Is this the wrong direction?
If you know which class I should fix, please let me know.

-----Original Message----- 
From: Allison, Timothy B.
Sent: Monday, July 22, 2013 10:27 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This is one way to access the underlying CTShape that contains the text:

        XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
        XSSFSheet sheet = wb.getSheetAt(0);
        XSSFDrawing drawing = sheet.createDrawingPatriarch();
        for (XSSFShape shape : drawing.getShapes()){
           if (shape instanceof XSSFSimpleShape){
              XSSFSimpleShape simple = ((XSSFSimpleShape)shape);
              System.out.println("CT: "+simple.getCTShape());
           }
        }

Hiroshi, if this is a high priority, you could extract the txBody element
with some bean work.  I've opened
https://issues.apache.org/jira/browse/TIKA-1150 for the longer term fix.

There's some work going on on XSSFTextCell in POI that might make this more
straightforward.
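
Until that lands, the shape text can also be read without POI, because an
.xlsx file is a zip archive and the autoshape text sits in the drawing
parts. What follows is only a minimal sketch under assumptions that are not
from this thread: it uses nothing but the JDK, the class and method names
(DrawingTextSketch, extractShapeText) are made up for illustration, and it
assumes the drawings are stored as xl/drawings/drawing*.xml with the text in
DrawingML <a:t> elements. It returns raw text runs and ignores paragraph
breaks, so treat it as a stopgap rather than a replacement for the POI/Tika
fix.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika or POI.
public class DrawingTextSketch {

    // DrawingML main namespace; <a:t> elements in it hold the text runs.
    private static final String DRAWINGML_NS =
            "http://schemas.openxmlformats.org/drawingml/2006/main";

    /** Collects the text runs from every xl/drawings/drawing*.xml part. */
    public static List<String> extractShapeText(InputStream xlsx) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted files cannot pull in entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        List<String> runs = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Assumption: drawing parts follow the usual naming scheme.
                if (!name.startsWith("xl/drawings/drawing") || !name.endsWith(".xml")) {
                    continue;
                }
                // Buffer the entry so the XML parser cannot close the zip stream.
                Document doc = builder.parse(new ByteArrayInputStream(readAll(zip)));
                NodeList textNodes = doc.getElementsByTagNameNS(DRAWINGML_NS, "t");
                for (int i = 0; i < textNodes.getLength(); i++) {
                    runs.add(textNodes.item(i).getTextContent());
                }
            }
        }
        return runs;
    }

    // Reads the current zip entry to its end without closing the stream.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DrawingTextSketch <file.xlsx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            for (String run : extractShapeText(in)) {
                System.out.println(run);
            }
        }
    }
}
```

Matching on the namespace rather than the "a:" prefix keeps the sketch
working when a file declares the DrawingML namespace under another prefix.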

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a new feature in both Tika and POI.  I've only
looked very briefly into the POI libraries, and I may have missed how to
extract text from autoshapes.  I'll open an issue in both projects.

-----Original Message-----
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried to extract from some MS office files.
The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,
Hiro.
