Re: How to extract autoshape text in Excel 2007+

2013-09-26 Thread Hiroshi Tatsumi

Thank you, Tim.
I tried the latest build, and I can now extract text from autoshapes
in Excel 2007+.

I really appreciate it!!



-Original Message- 
From: Allison, Timothy B.

Sent: Thursday, September 26, 2013 11:07 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

Fixed now.  Build from current trunk (r1526498) or pull from
https://builds.apache.org/job/Tika-trunk/lastStableBuild/ after Jenkins has
had a chance to build.

Best,

  Tim

-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Wednesday, September 25, 2013 6:30 PM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Hi,

I'm waiting for a fix for this bug:
https://issues.apache.org/jira/browse/TIKA-1100

The POI bug referenced in that issue has already been fixed:
http://issues.apache.org/bugzilla/show_bug.cgi?id=55292

It would be great if you could give me a patch.


Thanks,
Hiroshi Tatsumi



-Original Message- 
From: Allison, Timothy B.

Sent: Tuesday, July 23, 2013 5:10 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

Hiroshi,
  To fix this on your own will take quite a bit of work.  I give details
below if you do want to go this route.

The longer term path, I think is:
1) https://issues.apache.org/bugzilla/show_bug.cgi?id=55292 will be
committed to POI.
2) A new release of POI will be made.
3) Small fixes to Tika's Excel parser will be made to take advantage of the
new functionality in POI55292.

Others on the list may have a simpler solution, but this is what I had to do
before https://issues.apache.org/jira/browse/TIKA-1130 was committed.

This is a very unappetizing solution; beware of dragons and don't try this
at work.  Your steps will differ somewhat because you're working with xlsx
vs docx.  I'm sure that I don't remember each step.

1) Modify the underlying POI code to expose a getText() or similar
functionality on the object of interest to me (in my original email, I gave
some hint of how to do this)

2) Modify XWPFWordExtractorDecorator to take advantage of getText() in the
underlying POI object.

There are several options for how to tie it all together.
3) I chose to copy XWPFWordExtractorDecorator and the following classes
into a different namespace:
OOXMLExtractorFactory
4) Modify the above to call your new version of XWPFWordExtractorDecorator.
5) Finally, register your new office parser in
tika-parsers/META-INF/org.apache.tika.parser.Parser



-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Monday, July 22, 2013 11:42 AM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Thank you for your reply. I really appreciate it.
This is a high priority for me because we use Solr, and our customer wants
to search autoshape text in Excel 2007+ files.

I've been investigating the Tika source code and trying to fix it.
I understand that I can extract text from autoshapes with XSSFWorkbook.
But first, I think the problem is ExcelExtractor's listenForAllRecords
field.
I set the listenForAllRecords field to true as shown below, but it didn't
work properly.
-
Class: org.apache.tika.parser.microsoft.OfficeParser
Method: protected void parse(
    DirectoryNode root, ParseContext context, Metadata metadata,
    XHTMLContentHandler xhtml)
TargetCode:
    case XLR:
        Locale locale = context.get(Locale.class, Locale.getDefault());
        ExcelExtractor ee = new ExcelExtractor(context);
        ee.setListenForAllRecords(true);
        ee.parse(root, xhtml, locale);
        // original code
        // new ExcelExtractor(context).parse(root, xhtml, locale);
        break;
-

Is this the wrong direction?
If you know which class I should fix, please let me know.



-Original Message- 
From: Allison, Timothy B.

Sent: Monday, July 22, 2013 10:27 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This is one way to access the underlying CTShape that contains the text:

    XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
    XSSFSheet sheet = wb.getSheetAt(0);
    XSSFDrawing drawing = sheet.createDrawingPatriarch();
    for (XSSFShape shape : drawing.getShapes()) {
        if (shape instanceof XSSFSimpleShape) {
            XSSFSimpleShape simple = (XSSFSimpleShape) shape;
            System.out.println("CT: " + simple.getCTShape());
        }
    }

Hiroshi, if this is a high priority, you could extract the txBody element
with some bean work.  I've opened
https://issues.apache.org/jira/browse/TIKA-1150 for the longer-term fix.

There's some work under way on XSSFTextCell in POI that might make this more
straightforward.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sen

RE: How to extract autoshape text in Excel 2007+

2013-09-26 Thread Allison, Timothy B.
Fixed now.  Build from current trunk (r1526498) or pull from 
https://builds.apache.org/job/Tika-trunk/lastStableBuild/ after Jenkins has had 
a chance to build.

Best,

   Tim

-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp] 
Sent: Wednesday, September 25, 2013 6:30 PM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Hi,

I'm waiting for a fix for this bug:
https://issues.apache.org/jira/browse/TIKA-1100

The POI bug referenced in that issue has already been fixed:
http://issues.apache.org/bugzilla/show_bug.cgi?id=55292

It would be great if you could give me a patch.


Thanks,
Hiroshi Tatsumi



-Original Message- 
From: Allison, Timothy B.
Sent: Tuesday, July 23, 2013 5:10 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

Hiroshi,
   To fix this on your own will take quite a bit of work.  I give details 
below if you do want to go this route.

The longer term path, I think is:
1) https://issues.apache.org/bugzilla/show_bug.cgi?id=55292 will be 
committed to POI.
2) A new release of POI will be made.
3) Small fixes to Tika's Excel parser will be made to take advantage of the 
new functionality in POI55292.

Others on the list may have a simpler solution, but this is what I had to do 
before https://issues.apache.org/jira/browse/TIKA-1130 was committed.

This is a very unappetizing solution; beware of dragons and don't try this 
at work.  Your steps will differ somewhat because you're working with xlsx 
vs docx.  I'm sure that I don't remember each step.

1) Modify the underlying POI code to expose a getText() or similar 
functionality on the object of interest to me (in my original email, I gave 
some hint of how to do this)

2) Modify XWPFWordExtractorDecorator to take advantage of getText() in the 
underlying POI object.

There are several options for how to tie it all together.
3) I chose to copy XWPFWordExtractorDecorator and the following classes
into a different namespace:
OOXMLExtractorFactory
4) Modify the above to call your new version of XWPFWordExtractorDecorator.
5) Finally, register your new office parser in
tika-parsers/META-INF/org.apache.tika.parser.Parser



-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Monday, July 22, 2013 11:42 AM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Thank you for your reply. I really appreciate it.
This is a high priority for me because we use Solr, and our customer wants
to search autoshape text in Excel 2007+ files.

I've been investigating the Tika source code and trying to fix it.
I understand that I can extract text from autoshapes with XSSFWorkbook.
But first, I think the problem is ExcelExtractor's listenForAllRecords
field.
I set the listenForAllRecords field to true as shown below, but it didn't
work properly.
-
Class: org.apache.tika.parser.microsoft.OfficeParser
Method: protected void parse(
    DirectoryNode root, ParseContext context, Metadata metadata,
    XHTMLContentHandler xhtml)
TargetCode:
    case XLR:
        Locale locale = context.get(Locale.class, Locale.getDefault());
        ExcelExtractor ee = new ExcelExtractor(context);
        ee.setListenForAllRecords(true);
        ee.parse(root, xhtml, locale);
        // original code
        // new ExcelExtractor(context).parse(root, xhtml, locale);
        break;
-

Is this the wrong direction?
If you know which class I should fix, please let me know.



-Original Message- 
From: Allison, Timothy B.
Sent: Monday, July 22, 2013 10:27 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This is one way to access the underlying CTShape that contains the text:

    XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
    XSSFSheet sheet = wb.getSheetAt(0);
    XSSFDrawing drawing = sheet.createDrawingPatriarch();
    for (XSSFShape shape : drawing.getShapes()) {
        if (shape instanceof XSSFSimpleShape) {
            XSSFSimpleShape simple = (XSSFSimpleShape) shape;
            System.out.println("CT: " + simple.getCTShape());
        }
    }

Hiroshi, if this is a high priority, you could extract the txBody element
with some bean work.  I've opened
https://issues.apache.org/jira/browse/TIKA-1150 for the longer-term fix.

There's some work under way on XSSFTextCell in POI that might make this more
straightforward.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a  new feature in both Tika and POI.  I've only
looked very briefly into the POI libraries, and I may have missed how to
extract text from a

Re: How to extract autoshape text in Excel 2007+

2013-09-25 Thread Hiroshi Tatsumi

Hi,

I'm waiting for a fix for this bug:
https://issues.apache.org/jira/browse/TIKA-1100

The POI bug referenced in that issue has already been fixed:
http://issues.apache.org/bugzilla/show_bug.cgi?id=55292

It would be great if you could give me a patch.


Thanks,
Hiroshi Tatsumi



-Original Message- 
From: Allison, Timothy B.

Sent: Tuesday, July 23, 2013 5:10 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

Hiroshi,
  To fix this on your own will take quite a bit of work.  I give details 
below if you do want to go this route.


The longer term path, I think is:
1) https://issues.apache.org/bugzilla/show_bug.cgi?id=55292 will be 
committed to POI.

2) A new release of POI will be made.
3) Small fixes to Tika's Excel parser will be made to take advantage of the 
new functionality in POI55292.


Others on the list may have a simpler solution, but this is what I had to do 
before https://issues.apache.org/jira/browse/TIKA-1130 was committed.


This is a very unappetizing solution; beware of dragons and don't try this 
at work.  Your steps will differ somewhat because you're working with xlsx 
vs docx.  I'm sure that I don't remember each step.


1) Modify the underlying POI code to expose a getText() or similar 
functionality on the object of interest to me (in my original email, I gave 
some hint of how to do this)


2) Modify XWPFWordExtractorDecorator to take advantage of getText() in the 
underlying POI object.


There are several options for how to tie it all together.
3) I chose to copy XWPFWordExtractorDecorator and the following classes
into a different namespace:
OOXMLExtractorFactory
4) Modify the above to call your new version of XWPFWordExtractorDecorator.
5) Finally, register your new office parser in
tika-parsers/META-INF/org.apache.tika.parser.Parser




-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Monday, July 22, 2013 11:42 AM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Thank you for your reply. I really appreciate it.
This is a high priority for me because we use Solr, and our customer wants
to search autoshape text in Excel 2007+ files.

I've been investigating the Tika source code and trying to fix it.
I understand that I can extract text from autoshapes with XSSFWorkbook.
But first, I think the problem is ExcelExtractor's listenForAllRecords
field.
I set the listenForAllRecords field to true as shown below, but it didn't
work properly.
-
Class: org.apache.tika.parser.microsoft.OfficeParser
Method: protected void parse(
    DirectoryNode root, ParseContext context, Metadata metadata,
    XHTMLContentHandler xhtml)
TargetCode:
    case XLR:
        Locale locale = context.get(Locale.class, Locale.getDefault());
        ExcelExtractor ee = new ExcelExtractor(context);
        ee.setListenForAllRecords(true);
        ee.parse(root, xhtml, locale);
        // original code
        // new ExcelExtractor(context).parse(root, xhtml, locale);
        break;
-

Is this the wrong direction?
If you know which class I should fix, please let me know.



-Original Message- 
From: Allison, Timothy B.

Sent: Monday, July 22, 2013 10:27 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This is one way to access the underlying CTShape that contains the text:

    XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
    XSSFSheet sheet = wb.getSheetAt(0);
    XSSFDrawing drawing = sheet.createDrawingPatriarch();
    for (XSSFShape shape : drawing.getShapes()) {
        if (shape instanceof XSSFSimpleShape) {
            XSSFSimpleShape simple = (XSSFSimpleShape) shape;
            System.out.println("CT: " + simple.getCTShape());
        }
    }

Hiroshi, if this is a high priority, you could extract the txBody element
with some bean work.  I've opened
https://issues.apache.org/jira/browse/TIKA-1150 for the longer-term fix.

There's some work under way on XSSFTextCell in POI that might make this more
straightforward.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a  new feature in both Tika and POI.  I've only
looked very briefly into the POI libraries, and I may have missed how to
extract text from autoshapes.  I'll open an issue in both projects.

-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried to extract fro

RE: How to extract autoshape text in Excel 2007+

2013-07-22 Thread Allison, Timothy B.
Hiroshi,
   To fix this on your own will take quite a bit of work.  I give details below 
if you do want to go this route.

The longer term path, I think is:
1) https://issues.apache.org/bugzilla/show_bug.cgi?id=55292 will be committed 
to POI.
2) A new release of POI will be made.
3) Small fixes to Tika's Excel parser will be made to take advantage of the new 
functionality in POI55292.

Others on the list may have a simpler solution, but this is what I had to do 
before https://issues.apache.org/jira/browse/TIKA-1130 was committed.

This is a very unappetizing solution; beware of dragons and don't try this at 
work.  Your steps will differ somewhat because you're working with xlsx vs 
docx.  I'm sure that I don't remember each step.

1) Modify the underlying POI code to expose a getText() or similar 
functionality on the object of interest to me (in my original email, I gave 
some hint of how to do this)

2) Modify XWPFWordExtractorDecorator to take advantage of getText() in the 
underlying POI object.

There are several options for how to tie it all together.
3) I chose to copy XWPFWordExtractorDecorator and the following classes
into a different namespace:
OOXMLExtractorFactory
4) Modify the above to call your new version of XWPFWordExtractorDecorator.
5) Finally, register your new office parser in
tika-parsers/META-INF/org.apache.tika.parser.Parser
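
For the registration step, a minimal sketch of what the provider file might
contain.  Tika discovers parsers through the standard Java service-provider
mechanism, so inside a built jar the file normally sits at
META-INF/services/org.apache.tika.parser.Parser and lists one fully qualified
class name per line.  The class name below is hypothetical; substitute
whatever you called your copied parser.

```
# META-INF/services/org.apache.tika.parser.Parser
# Hypothetical class name, for illustration only.
com.example.tika.PatchedOOXMLParser
```

If the stock parser is still registered for the same media types, it is an
assumption on my part that you would also need to remove or override its entry
so that your copy is the one picked up.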



-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp] 
Sent: Monday, July 22, 2013 11:42 AM
To: user@tika.apache.org
Subject: Re: How to extract autoshape text in Excel 2007+

Thank you for your reply. I really appreciate it.
This is a high priority for me because we use Solr, and our customer wants
to search autoshape text in Excel 2007+ files.

I've been investigating the Tika source code and trying to fix it.
I understand that I can extract text from autoshapes with XSSFWorkbook.
But first, I think the problem is ExcelExtractor's listenForAllRecords
field.
I set the listenForAllRecords field to true as shown below, but it didn't
work properly.
-
Class: org.apache.tika.parser.microsoft.OfficeParser
Method: protected void parse(
    DirectoryNode root, ParseContext context, Metadata metadata,
    XHTMLContentHandler xhtml)
TargetCode:
    case XLR:
        Locale locale = context.get(Locale.class, Locale.getDefault());
        ExcelExtractor ee = new ExcelExtractor(context);
        ee.setListenForAllRecords(true);
        ee.parse(root, xhtml, locale);
        // original code
        // new ExcelExtractor(context).parse(root, xhtml, locale);
        break;
-

Is this the wrong direction?
If you know which class I should fix, please let me know.



-Original Message- 
From: Allison, Timothy B.
Sent: Monday, July 22, 2013 10:27 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This is one way to access the underlying CTShape that contains the text:

    XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
    XSSFSheet sheet = wb.getSheetAt(0);
    XSSFDrawing drawing = sheet.createDrawingPatriarch();
    for (XSSFShape shape : drawing.getShapes()) {
        if (shape instanceof XSSFSimpleShape) {
            XSSFSimpleShape simple = (XSSFSimpleShape) shape;
            System.out.println("CT: " + simple.getCTShape());
        }
    }

Hiroshi, if this is a high priority, you could extract the txBody element
with some bean work.  I've opened
https://issues.apache.org/jira/browse/TIKA-1150 for the longer-term fix.

There's some work under way on XSSFTextCell in POI that might make this more
straightforward.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a  new feature in both Tika and POI.  I've only 
looked very briefly into the POI libraries, and I may have missed how to 
extract text from autoshapes.  I'll open an issue in both projects.

-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried to extract from some MS office files.
The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,
Hiro. 



Re: How to extract autoshape text in Excel 2007+

2013-07-22 Thread Hiroshi Tatsumi

Thank you for your reply. I really appreciate it.
This is a high priority for me because we use Solr, and our customer wants
to search autoshape text in Excel 2007+ files.

I've been investigating the Tika source code and trying to fix it.
I understand that I can extract text from autoshapes with XSSFWorkbook.
But first, I think the problem is ExcelExtractor's listenForAllRecords
field.
I set the listenForAllRecords field to true as shown below, but it didn't
work properly.
-
Class: org.apache.tika.parser.microsoft.OfficeParser
Method: protected void parse(
    DirectoryNode root, ParseContext context, Metadata metadata,
    XHTMLContentHandler xhtml)
TargetCode:
    case XLR:
        Locale locale = context.get(Locale.class, Locale.getDefault());
        ExcelExtractor ee = new ExcelExtractor(context);
        ee.setListenForAllRecords(true);
        ee.parse(root, xhtml, locale);
        // original code
        // new ExcelExtractor(context).parse(root, xhtml, locale);
        break;
-

Is this the wrong direction?
If you know which class I should fix, please let me know.



-Original Message- 
From: Allison, Timothy B.

Sent: Monday, July 22, 2013 10:27 PM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This is one way to access the underlying CTShape that contains the text:

    XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
    XSSFSheet sheet = wb.getSheetAt(0);
    XSSFDrawing drawing = sheet.createDrawingPatriarch();
    for (XSSFShape shape : drawing.getShapes()) {
        if (shape instanceof XSSFSimpleShape) {
            XSSFSimpleShape simple = (XSSFSimpleShape) shape;
            System.out.println("CT: " + simple.getCTShape());
        }
    }

Hiroshi, if this is a high priority, you could extract the txBody element
with some bean work.  I've opened
https://issues.apache.org/jira/browse/TIKA-1150 for the longer-term fix.

There's some work under way on XSSFTextCell in POI that might make this more
straightforward.


-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a  new feature in both Tika and POI.  I've only 
looked very briefly into the POI libraries, and I may have missed how to 
extract text from autoshapes.  I'll open an issue in both projects.


-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp]
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried to extract from some MS office files.
The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,
Hiro. 



RE: How to extract autoshape text in Excel 2007+

2013-07-22 Thread Allison, Timothy B.
This is one way to access the underlying CTShape that contains the text:

    XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(f));
    XSSFSheet sheet = wb.getSheetAt(0);
    XSSFDrawing drawing = sheet.createDrawingPatriarch();
    for (XSSFShape shape : drawing.getShapes()) {
        if (shape instanceof XSSFSimpleShape) {
            XSSFSimpleShape simple = (XSSFSimpleShape) shape;
            System.out.println("CT: " + simple.getCTShape());
        }
    }

Hiroshi, if this is a high priority, you could extract the txBody element with
some bean work.  I've opened https://issues.apache.org/jira/browse/TIKA-1150
for the longer-term fix.

There's some work under way on XSSFTextCell in POI that might make this more
straightforward.
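
As a stopgap that avoids patching POI at all, here is a minimal sketch that
reads the shape text straight out of the .xlsx package with only the JDK
(Java 9+ for readAllBytes).  It is a different technique from the CTShape
route above: it assumes the drawing parts sit under the usual xl/drawings/
path and that the shape text is held in DrawingML <a:t> elements.  The class
and method names are made up for illustration and are not part of POI or Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of POI or Tika.
final class DrawingTextSketch {

    // Namespace of DrawingML text elements such as <a:t>.
    private static final String DRAWINGML_NS =
            "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Returns the content of every <a:t> run found in the drawing parts.
    static List<String> extractShapeText(InputStream xlsx) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        List<String> texts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Assumption: drawing parts follow the usual xl/drawings/*.xml layout.
                if (!name.startsWith("xl/drawings/") || !name.endsWith(".xml")) {
                    continue;
                }
                // Buffer the entry first; handing the zip stream to the XML
                // parser directly may close it before the next entry is read.
                byte[] xml = zip.readAllBytes();
                Document doc = builder.parse(new ByteArrayInputStream(xml));
                NodeList runs = doc.getElementsByTagNameNS(DRAWINGML_NS, "t");
                for (int i = 0; i < runs.getLength(); i++) {
                    texts.add(runs.item(i).getTextContent());
                }
            }
        }
        return texts;
    }
}
```

You would call it as DrawingTextSketch.extractShapeText(new FileInputStream(f)).
It returns the runs in file order without grouping them per shape or per
sheet, so treat it as a way to get searchable text quickly, not as a
replacement for proper support in POI and Tika.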

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, July 22, 2013 8:50 AM
To: user@tika.apache.org
Subject: RE: How to extract autoshape text in Excel 2007+

This looks like an area for a  new feature in both Tika and POI.  I've only 
looked very briefly into the POI libraries, and I may have missed how to 
extract text from autoshapes.  I'll open an issue in both projects.

-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp] 
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried to extract from some MS office files.
The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,
Hiro. 



RE: How to extract autoshape text in Excel 2007+

2013-07-22 Thread Allison, Timothy B.
This looks like an area for a  new feature in both Tika and POI.  I've only 
looked very briefly into the POI libraries, and I may have missed how to 
extract text from autoshapes.  I'll open an issue in both projects.

-Original Message-
From: Hiroshi Tatsumi [mailto:honekich...@comet.ocn.ne.jp] 
Sent: Sunday, July 21, 2013 10:16 AM
To: user@tika.apache.org
Subject: How to extract autoshape text in Excel 2007+

Hi,

I am using Tika 1.3 and Solr 4.3.1.
I'd like to extract autoshape text in Excel 2007+(.xlsx), but I can't.

I tried to extract from some MS office files.
The results are below.

Success (I can extract autoshape text.)
- Excel 2003(.xls)
- Word 2003(.doc)
- Word 2007+(.docx)

Failed (I cannot extract autoshape text.)
- Excel 2007+(.xlsx)

Is this a bug?
If you know, could you tell me how to extract autoshape text in Excel 2007+?

Thanks,
Hiro.