Re: MSPowerPointExtractor problem

Ralph Scheuer Mon, 02 Aug 2004 03:25:48 -0700

Ryan,

thanks for your reply.

I have also seen the posts on this subject from Sudhakar, who seems to be contributing a whole lot of code here, which is a great thing. However, the problem also persists in that code, so I think we should solve this encoding problem in your code first, since it is simpler. The fix could later be integrated into Sudhakar's code if that gets checked in.

I have tested this with a simple PPT file containing just the following text:

Umlaut-Test
Ökologie, Mühsal, Größe, Grätsche

I get the following console output with this text:

Umlaut-Test
\326kologie, M\374hsal, Gr\374\337e, Gr\344tsche
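
As a quick check on what those escapes stand for: each \ooo group is one byte value written in octal, and read as ISO-8859-1 the values are \326 = 0xD6 (Ö), \374 = 0xFC (ü), \337 = 0xDF (ß) and \344 = 0xE4 (ä). The following is only a minimal sketch for decoding such a console line; the class and method names are made up for illustration and are not part of Slide or POI, and it assumes every character outside an escape is plain ASCII.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for inspecting console output; not part of Slide or POI.
final class OctalEscapeDecoder {

    // Treats each \ooo escape as one byte and every other character as a
    // single ASCII byte, then decodes the collected bytes as ISO-8859-1.
    static String decodeLatin1(String escaped) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            int value = octalEscapeAt(escaped, i);
            if (value >= 0) {
                bytes.write(value);
                i += 4; // skip the backslash and its three digits
            } else {
                bytes.write(escaped.charAt(i)); // assumed to be ASCII
                i++;
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    // Returns the byte value of a \ooo escape starting at index i,
    // or -1 if there is no complete escape with a value up to 0xFF there.
    static int octalEscapeAt(String s, int i) {
        if (s.charAt(i) != '\\' || i + 3 >= s.length()) {
            return -1;
        }
        int value = 0;
        for (int k = i + 1; k <= i + 3; k++) {
            char d = s.charAt(k);
            if (d < '0' || d > '7') {
                return -1;
            }
            value = value * 8 + (d - '0');
        }
        return value <= 0xFF ? value : -1;
    }

    public static void main(String[] args) {
        String line = "\\326kologie, M\\374hsal, Gr\\374\\337e, Gr\\344tsche";
        // What gets displayed depends on the console's own encoding.
        System.out.println(decodeLatin1(line));
    }
}
```

If the decoded string shows the umlauts in their expected places, the byte values themselves are probably ISO-8859-1 (or the closely related Windows-1252), and the damage happens wherever those bytes are interpreted with a different encoding.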

Here is the output I get in a web browser (through a web app, "view HTML source" mode):

Umlaut-Test ÷kologie, M¸hsal, Gr¸?e, Gr?tsche

German "umlaute" and other special characters work fine that way whenever I extract text from Word documents or Excel spreadsheets using POI and Ryan Ackley's TextMining framework.
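
The browser output above is at least consistent with ISO-8859-1 bytes being interpreted as MacRoman: in MacRoman, 0xD6 is the division sign and 0xFC is the cedilla, which matches the "÷" and "¸" shown. The sketch below illustrates that mismatch. The class and method names are hypothetical, and "x-MacRoman" is the charset name used by Sun/Oracle JDKs, which may not be available in every runtime, so the code checks for it first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Slide or POI.
final class CharsetMismatchDemo {

    // Encodes text as ISO-8859-1, standing in for the raw bytes in question.
    static byte[] latin1Bytes(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Decodes the same bytes under an explicitly named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        byte[] raw = latin1Bytes("\u00D6kologie, M\u00FChsal");

        // Correct interpretation: the original text comes back unchanged.
        System.out.println(decode(raw, "ISO-8859-1"));

        // Wrong interpretation: 0xD6 and 0xFC map to different characters.
        if (Charset.isSupported("x-MacRoman")) {
            System.out.println(decode(raw, "x-MacRoman"));
        }
    }
}
```

If this is what happens, naming the charset explicitly wherever bytes are turned into characters, instead of relying on the platform default, should give the same result on Mac OS X, Linux and Windows.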

Just for the record: I have only tested this on my own configuration (Mac OS X 10.3.4, Java 1.4.2_03), so I have no idea how these classes behave on Linux or Windows. Can anybody confirm this? I have seen some German names on this list ;-)

Thanks for all the work you put into this.

Ralph Scheuer

On 01.08.2004 at 08:07, Ryan Rhodes wrote:

Hi Ralph,

I haven't tested the PPT extractor with any other languages. I remember reading about other people having problems with different character sets though.

Could you send a before and after example file here or to bugzilla?

-Ryan Rhodes


-----Original Message-----
From: Ralph Scheuer [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 28, 2004 10:01 AM
To: slide
Subject: MSPowerPointExtractor problem

Hello everybody,

When I was searching for a Java class to extract text from PowerPoint
files, I accidentally discovered Slide.

I pulled the MSPowerPointExtractor class and some other stuff it
depends on via CVS and tried it for some text extraction.

The method I used looks very similar to the provided example main
method (see below).

However, when I tried to extract text from a German PowerPoint
presentation, I had some problems with the encoding. I did not know
which encoding to use; converting the output to ISO Latin 1 with my
text editor solved only part of the problem (some German Umlaute were
displayed correctly, some were not).

Is this a known issue or am I doing something wrong? Any hints for me?

Thanks in advance.

Ralph Scheuer

BTW, I am using Mac OS X 10.3.4 with JDK 1.4.2_03; the native encoding
on this platform is MacRoman.


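     // Also needs imports for NSData and MSPowerPointExtractor.
     import java.io.ByteArrayInputStream;
     import java.io.Reader;

     public static String contentStringForData(NSData data){

        StringBuffer buf = new StringBuffer();
        try{
            ByteArrayInputStream input = data.stream();
            MSPowerPointExtractor ex = new MSPowerPointExtractor(null, null);

            Reader reader = ex.extract(input);

            // Check for end of stream (-1) before appending, so the end
            // marker is not stored in the buffer as the character U+FFFF.
            int c;
            while( (c = reader.read()) != -1 ){
                buf.append((char)c);
            }
        }catch(Exception e){
            // Report the failure instead of silently returning partial text.
            e.printStackTrace();
        }

        return buf.toString();
     }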

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
