[Solved]: MSPowerPointExtractor problem
Hi everybody,

just a quick note for everyone: in the meantime I have managed to solve the problem. Ryan's and Sudhakar's sources work flawlessly (at least with German special characters) after adding one additional method to the source (see below). The extracted String needs to be reinterpreted with the "Cp1252" encoding. It may well be that this is a Mac-specific encoding problem; I cannot verify the Windows or Linux behavior here. Anyway, the following code solved the problem for me.

Again, thanks for all the great work you have done.

Ralph Scheuer

private static String convertEncoding(String incoming) {
    String outgoing = null;
    try {
        // getBytes() without an argument encodes with the platform default
        // charset (MacRoman on my machine); the bytes are then decoded as Cp1252.
        outgoing = new String(incoming.getBytes(), "Cp1252");
    } catch (Exception e) {
        SDLogger.catchException(e);
    }
    return outgoing;
}

PS: If there are no objections, I would like to contact the POI developer team and file a bug in Bugzilla, as I have the feeling that some variant of the code you have both provided would be ideally suited for integration into the POI framework.

Kind regards,
Ralph

On 02.08.2004 at 13:13, Koundinya (Sudhakar Chavali) wrote:

> Hm, basically we have concentrated on the English language, so we never
> faced any such problems. It has become a new task for our team now :-)
> Thanks to Ralph for pointing out that problem. We will work on it and
> let the Jakarta team know :-)
>
> Regards
> Sudhakar

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
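A note on why convertEncoding works: getBytes() with no argument encodes with the platform default charset, so the method undoes a decode that used that default and then re-reads the same bytes as Cp1252. The sketch below shows the same round trip with both charsets passed explicitly, which makes it independent of the machine it runs on. The class name EncodingRepair and the method repair are hypothetical, and ISO-8859-1 only stands in for the wrongly used charset because every Java runtime is guaranteed to have it; on the Mac described in this thread the wrongly used charset would presumably have been MacRoman.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Slide or POI.
final class EncodingRepair {

    /**
     * Re-encodes the text with the charset it was wrongly decoded with,
     * which recovers the original bytes, then decodes those bytes with
     * the charset that was actually intended.
     */
    static String repair(String garbled, Charset wronglyUsed, Charset intended) {
        if (garbled == null) {
            return null;
        }
        byte[] originalBytes = garbled.getBytes(wronglyUsed);
        return new String(originalBytes, intended);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String expected = "Gr\u00FC\u00DFe \u20AC";

        // Bytes as a Cp1252 producer would have stored them.
        byte[] stored = expected.getBytes(cp1252);

        // Decoding with the wrong charset silently yields different text.
        String garbled = new String(stored, StandardCharsets.ISO_8859_1);
        System.out.println("garbled equals expected: " + garbled.equals(expected));

        // The round trip restores the intended characters.
        String fixed = repair(garbled, StandardCharsets.ISO_8859_1, cp1252);
        System.out.println("fixed equals expected:   " + fixed.equals(expected));
    }
}
```

The round trip can only recover the text if the wrongly used charset maps every byte to a distinct character, as ISO-8859-1 does. Passing the charset explicitly also avoids the behavior changing from one platform to the next.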
Re: MSPowerPointExtractor problem
Hm, basically we have concentrated on the English language, so we never faced any such problems. It has become a new task for our team now :-) Thanks to Ralph for pointing out that problem. We will work on it and let the Jakarta team know :-)

Regards
Sudhakar

--- Ralph Scheuer <[EMAIL PROTECTED]> wrote:
> Ryan,
>
> thanks for your reply.
>
> I have also seen the posts on this subject from Sudhakar, who seems to
> be contributing a whole lot of code here, which is a great thing. The
> problem also persists in that code, though, so I think we should solve
> this encoding problem in your code first (it is simpler, and the fix
> could later be integrated into Sudhakar's code if that gets checked in).
>
> I have tested this with a simple PPT file containing just the following
> text:
>
> Umlaut-Test
> Ökologie, Mühsal, Größe, Grätsche
>
> I get the following console output with this text:
>
> Umlaut-Test
> \326kologie, M\374hsal, Gr\374\337e, Gr\344tsche
>
> Here is the output I get in a web browser (through a web app, "view
> HTML source" mode):
>
> Umlaut-Test ÷kologie, M¸hsal, Gr¸?e, Gr?tsche
>
> German umlauts and other special characters work fine whenever I
> extract text from Word documents or Excel spreadsheets using POI and
> Ryan Ackley's TextMining framework.
>
> Just for the record: I have only tested this on my own configuration
> (Mac OS X 10.3.4, Java 1.4.2_03), so I have no idea how these classes
> might behave on Linux or Windows. Can anybody confirm this? I have seen
> some German names on this list ;-)
>
> Thanks for all the work you put into this.
>
> Ralph Scheuer
>
> On 01.08.2004 at 08:07, Ryan Rhodes wrote:
>
> > Hi Ralph,
> >
> > I haven't tested the PPT extractor with any other languages. I
> > remember reading about other people having problems with different
> > character sets though.
> >
> > Could you send a before and after example file here or to bugzilla?
> >
> > -Ryan Rhodes

=====
"No one can earn a million dollars honestly." - William Jennings Bryan (1860-1925)
"Make everything as simple as possible, but not simpler." - Albert Einstein (1879-1955)
"It is dangerous to be sincere unless you are also stupid." - George Bernard Shaw (1856-1950)
Re: MSPowerPointExtractor problem
Ryan,

thanks for your reply.

I have also seen the posts on this subject from Sudhakar, who seems to be contributing a whole lot of code here, which is a great thing. The problem also persists in that code, though, so I think we should solve this encoding problem in your code first (it is simpler, and the fix could later be integrated into Sudhakar's code if that gets checked in).

I have tested this with a simple PPT file containing just the following text:

Umlaut-Test
Ökologie, Mühsal, Größe, Grätsche

I get the following console output with this text:

Umlaut-Test
\326kologie, M\374hsal, Gr\374\337e, Gr\344tsche

Here is the output I get in a web browser (through a web app, "view HTML source" mode):

Umlaut-Test ÷kologie, M¸hsal, Gr¸?e, Gr?tsche

German umlauts and other special characters work fine whenever I extract text from Word documents or Excel spreadsheets using POI and Ryan Ackley's TextMining framework.

Just for the record: I have only tested this on my own configuration (Mac OS X 10.3.4, Java 1.4.2_03), so I have no idea how these classes might behave on Linux or Windows. Can anybody confirm this? I have seen some German names on this list ;-)

Thanks for all the work you put into this.

Ralph Scheuer

On 01.08.2004 at 08:07, Ryan Rhodes wrote:

> Hi Ralph,
>
> I haven't tested the PPT extractor with any other languages. I remember
> reading about other people having problems with different character
> sets though.
>
> Could you send a before and after example file here or to bugzilla?
>
> -Ryan Rhodes
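The octal escapes in the console output above (\326, \374, \337, \344) are byte values in octal notation, so they can be used to check which encoding the text is really in. As a rough sketch, the following turns such a dump back into bytes and decodes it with a named charset. OctalDump and its methods are hypothetical helpers, the input is assumed to be plain ASCII apart from the \ooo escapes, and the availability of the "x-MacRoman" charset on a given Java runtime is an assumption that the code checks before using it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper for inspecting an escaped console dump.
final class OctalDump {

    /** Converts text containing \ooo octal escapes into the bytes it stands for. */
    static byte[] toBytes(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            if (isOctalEscape(escaped, i)) {
                // Three octal digits after the backslash, e.g. 326 is 0xD6.
                // Only values up to \377 fit into one byte.
                out.write(Integer.parseInt(escaped.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                // Assumption: everything else is ASCII, one byte per char.
                out.write(escaped.charAt(i));
                i++;
            }
        }
        return out.toByteArray();
    }

    /** True if a backslash followed by exactly three octal digits starts at index i. */
    private static boolean isOctalEscape(String s, int i) {
        if (s.charAt(i) != '\\' || i + 3 >= s.length()) {
            return false;
        }
        for (int k = i + 1; k <= i + 3; k++) {
            char d = s.charAt(k);
            if (d < '0' || d > '7') {
                return false;
            }
        }
        return true;
    }

    /** Decodes the escaped dump with the given charset name. */
    static String decode(String escaped, String charsetName) {
        return new String(toBytes(escaped), Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String dump = "\\326kologie, M\\374hsal, Gr\\374\\337e, Gr\\344tsche";
        System.out.println(decode(dump, "windows-1252"));
        if (Charset.isSupported("x-MacRoman")) {
            System.out.println(decode(dump, "x-MacRoman"));
        }
    }
}
```

Decoded as windows-1252, the sample line yields proper German umlauts, which fits the Cp1252 interpretation reported elsewhere in this thread. Where x-MacRoman is available, decoding the same bytes with it should give text beginning with the "÷kologie, M¸hsal" seen in the browser output.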
RE: MSPowerPointExtractor problem
Hello all,

this was my first contribution for the Jakarta team:
http://wiki.apache.org/jakarta-lucene-data/attachments/PowerPoint/attachments/PPT2Text.java

It seems another expert (Ryan Rhodes, [EMAIL PROTECTED]) has already started working on that, based on my first contribution. That sounds great to me. So, in order to speed up the development of the PowerPoint extractor, I just wanted to contribute our team's efforts in developing it.

Authors: Sudhakar Chavali ([EMAIL PROTECTED]) and Hari Shanker Goud ([EMAIL PROTECTED])

Have a look at the source code below.

Regards
Sudhakar

/**
 * Title: DocumentParserException class
 * Description: This is the root exception class for the runtime errors
 * that can be raised by the different parsers
 * @author Sudhakar
 * @version 1.0
 */
public class DocumentParserException extends Exception {

    /**
     * Constructs a new exception with null as its detail message.
     */
    public DocumentParserException() {
    }

    /**
     * Constructs a new exception with the specified detail message.
     * @param message
     */
    public DocumentParserException(String message) {
        super(message);
    }

    /**
     * Constructs a new exception with the specified detail message and cause.
     * @param message
     * @param cause
     */
    public DocumentParserException(String message, Throwable cause) {
        super(message, cause);
    }
}

_______________________________________________

import java.io.*;

/**
 * Title: Summary Base
 * Description: A generic interface that reads the document's summary
 * information and returns it through different methods
 * @author Sudhakar Chavali
 * @version 1.0
 */
public interface SummaryBase {

    /**
     * A method that returns the document's author
     * @return String
     */
    public String getDocAuthor();

    /**
     * A method that returns the document's creation date
     * @return String
     */
    public String getDocCreatedDate();

    /**
     * A method that returns the document's keywords
     * @return String
     */
    public String getDocKeywords();

    /**
     * A method that returns the document's comments
     * @return String
     */
    public String getDocComments();

    /**
     * A method that returns the document name
     * @return String
     */
    public String getDocName();

    /**
     * A method that returns the document's subject
     * @return String
     */
    public String getDocSubject();

    /**
     * A method that returns the document's title
     * @return String
     */
    public String getDocTitle();

    /**
     * A method that reads the document's summary information
     * @throws DocumentParserException
     */
    public void read() throws DocumentParserException;

    /**
     * A method that writes the document's summary information as XML into the file
     * @param strXMLFile
     * @throws DocumentParserException
     */
    public void write(String strXMLFile) throws DocumentParserException;

    /**
     * A method that writes the document's summary information as XML into an OutputStream object
     * @param out
     * @throws DocumentParserException
     */
    public void write(OutputStream out) throws DocumentParserException;

    /**
     * A method that returns the document's summary as an XML String
     * @return String
     * @throws DocumentParserException
     */
    public String getSummaryAsXML() throws DocumentParserException;

    /**
     * A method that returns the document's summary information as normal text
     * @return String
     * @throws DocumentParserException
     */
    public String getSummaryAsText() throws DocumentParserException;
}

_______________________________________________

import java.io.*;

/**
 * A generic document that reads the document's text and parses it into
 * normal ASCII text using the different methods.
 */
public interface Document {

    /**
     * A method that returns the document's text after parsing.
     * This method should be called after calling the read method
     * @return String
     * @see #read()
     * @throws DocumentParserException
     */
    public abstract String getText() throws DocumentParserException;

    /**
     * A method that returns the parsed text as a byte array.
     * This method should be called after calling the read method
     * @return byte[]
     * @throws DocumentParserException
     */
    public abstract byte[] getBytes() throws DocumentParserException;

    /**
     * A method that writes the parsed text into the OutputStream object.
     * This method should be called after calling the read method
     * @param out
     * @throws DocumentParserException
     */
    public abstract void write(OutputStream out) throws DocumentParserException, Exception;

    /**
     * A method that reads and parses the document into normal text
     * @throws DocumentParserException
     */
    public abstract void read() throws DocumentParserException, Exception;

    /**
RE: MSPowerPointExtractor problem
Check this:
http://wiki.apache.org/jakarta-lucene-data/attachments/PowerPoint/attachments/PPT2Text.java

--- Ryan Rhodes <[EMAIL PROTECTED]> wrote:
> Hi Ralph,
>
> I haven't tested the PPT extractor with any other languages. I remember
> reading about other people having problems with different character sets
> though.
>
> Could you send a before and after example file here or to bugzilla?
>
> -Ryan Rhodes
RE: MSPowerPointExtractor problem
Hi Ralph,

I haven't tested the PPT extractor with any other languages. I remember reading about other people having problems with different character sets though.

Could you send a before and after example file here or to bugzilla?

-Ryan Rhodes

-----Original Message-----
From: Ralph Scheuer [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 28, 2004 10:01 AM
To: slide
Subject: MSPowerPointExtractor problem

Hello everybody,

When I was searching for a Java class to extract text from PowerPoint files, I accidentally discovered Slide.

I pulled the MSPowerPointExtractor class and some other stuff it depends on via CVS and tried it for some text extraction.

The method I used looks very similar to the provided example main method (see below).

However, when I tried to extract text from a German PowerPoint presentation, I had some problems with the encoding. I did not know which encoding to use; converting the output to ISO Latin 1 with my text editor solved only part of the problem (some German umlauts were displayed correctly, some were not).

Is this a known issue or am I doing something wrong? Any hints for me?

Thanks in advance.

Ralph Scheuer

BTW, I am using Mac OS X 10.3.4 with JDK 1.4.2_03; the native encoding on this platform is MacRoman.

public static String contentStringForData(NSData data) {
    StringBuffer buf = new StringBuffer();
    try {
        ByteArrayInputStream input = data.stream();
        MSPowerPointExtractor ex = new MSPowerPointExtractor(null, null);
        Reader reader = ex.extract(input);
        int c;
        // Stop at -1 (end of stream) before appending, so the marker
        // itself never ends up in the buffer.
        while ((c = reader.read()) != -1) {
            buf.append((char) c);
        }
    } catch (Exception e) {
        // errors are ignored here; the text read so far is returned
    }
    return buf.toString();
}
MSPowerPointExtractor problem
Hello everybody,

When I was searching for a Java class to extract text from PowerPoint files, I accidentally discovered Slide.

I pulled the MSPowerPointExtractor class and some other stuff it depends on via CVS and tried it for some text extraction.

The method I used looks very similar to the provided example main method (see below).

However, when I tried to extract text from a German PowerPoint presentation, I had some problems with the encoding. I did not know which encoding to use; converting the output to ISO Latin 1 with my text editor solved only part of the problem (some German umlauts were displayed correctly, some were not).

Is this a known issue or am I doing something wrong? Any hints for me?

Thanks in advance.

Ralph Scheuer

BTW, I am using Mac OS X 10.3.4 with JDK 1.4.2_03; the native encoding on this platform is MacRoman.

public static String contentStringForData(NSData data) {
    StringBuffer buf = new StringBuffer();
    try {
        ByteArrayInputStream input = data.stream();
        MSPowerPointExtractor ex = new MSPowerPointExtractor(null, null);
        Reader reader = ex.extract(input);
        int c;
        // Stop at -1 (end of stream) before appending, so the marker
        // itself never ends up in the buffer.
        while ((c = reader.read()) != -1) {
            buf.append((char) c);
        }
    } catch (Exception e) {
        // errors are ignored here; the text read so far is returned
    }
    return buf.toString();
}
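Reading a Reader to the end, as contentStringForData does, is a small pattern worth getting right on its own: read() returns -1 at end of stream, and that value has to be checked before anything is appended, otherwise the marker itself is cast to a char and lands in the result. A minimal sketch using only java.io follows. ReaderDrain and drain are hypothetical names, a StringReader stands in for the Reader returned by the extractor, and StringBuilder assumes Java 5 or later (StringBuffer would do on 1.4).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, independent of Slide and POI.
final class ReaderDrain {

    /** Reads all remaining characters from the reader into a String. */
    static String drain(Reader reader) throws IOException {
        StringBuilder buf = new StringBuilder();
        char[] chunk = new char[4096];
        int n;
        // read(char[]) returns the number of chars read, or -1 at end of stream;
        // the check happens before the chunk is appended.
        while ((n = reader.read(chunk)) != -1) {
            buf.append(chunk, 0, n);
        }
        return buf.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = drain(new StringReader("Umlaut-Test"));
        System.out.println(text);
        System.out.println(text.length());
    }
}
```

Reading in chunks rather than one char at a time also avoids a method call per character, which can matter for larger presentations.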