[Solved]: MSPowerPointExtractor problem

2004-08-02 Thread Ralph Scheuer
Hi everybody,

just a quick note: in the meantime I have managed to solve the problem.
Ryan's and Sudhakar's sources work flawlessly (at least with German
special characters) after adding one method to the source (see below).

The extracted String needs to be reinterpreted with the "Cp1252"
encoding (Windows-1252). This may well be a Mac-specific problem, since
the platform default encoding here is MacRoman; I cannot verify the
Windows or Linux behavior.

Anyway, the following code solved the problem for me.
Again, thanks for all the great work you have done.
Ralph Scheuer
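private static String convertEncoding(String incoming) {
    String outgoing = null;
    try {
        // getBytes() encodes with the platform default charset (MacRoman
        // on my machine); the resulting bytes are then decoded as Cp1252.
        outgoing = new String(incoming.getBytes(), "Cp1252");
    } catch (Exception e) {
        // Log the failure; null is returned in that case.
        SDLogger.catchException(e);
    }
    return outgoing;
}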
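The same idea can be written with both charsets named explicitly, so that the result does not depend on the platform default. This is only a minimal sketch: the class and method names are hypothetical, and ISO-8859-1 stands in for the wrongly used charset because it maps every byte to a character, which keeps the round trip lossless. On the Mac described in this thread the wrongly used charset would presumably be MacRoman (named "x-MacRoman" in Java, assuming the JDK in use ships it).

```java
import java.nio.charset.Charset;

final class EncodingRepair {
    // Re-encodes the garbled text with the charset that was wrongly used for
    // decoding (recovering the original bytes), then decodes those bytes with
    // the charset the data was actually written in.
    static String reinterpret(String garbled, Charset wronglyUsed, Charset actual) {
        return new String(garbled.getBytes(wronglyUsed), actual);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        Charset latin1 = Charset.forName("ISO-8859-1");
        byte[] raw = {(byte) 0x80};               // Euro sign in windows-1252
        String garbled = new String(raw, latin1); // wrongly decoded as U+0080
        String repaired = reinterpret(garbled, latin1, cp1252);
        System.out.println(repaired.equals("\u20AC")); // prints true
    }
}
```

The repair only works when the wrongly used charset can encode every garbled character back to its original byte; if the first decode already replaced unknown bytes with a substitution character, the information is gone.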
PS: If there are no objections, I would like to contact the POI
developer team and file a bug in Bugzilla, as I have the feeling that
some variant of the code you have both provided would be ideally
suited for integration into the POI framework.

Kind regards.
Ralph
On 02.08.2004 at 13:13, Koundinya (Sudhakar Chavali) wrote:
Hm,
Basically we have concentrated on the English language, so we never
faced any problems. It has become a new task for our team now :-)

Thanks to Ralph for pointing out that problem.
We will work on it and let the Jakarta team know :-)
Regards
Sudhakar

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: MSPowerPointExtractor problem

2004-08-02 Thread Koundinya (Sudhakar Chavali)
Hm,

Basically we have concentrated on the English language, so we never faced any
problems. It has become a new task for our team now :-)

Thanks to Ralph for pointing out that problem.

We will work on it and let the Jakarta team know :-)

Regards
Sudhakar







=
"No one can earn a million dollars honestly."- William Jennings Bryan (1860-1925) 

"Make everything as simple as possible, but not simpler."- Albert Einstein (1879-1955)

"It is dangerous to be sincere unless you are also stupid."- George Bernard Shaw 
(1856-1950)







Re: MSPowerPointExtractor problem

2004-08-02 Thread Ralph Scheuer
Ryan,

thanks for your reply.

I have also seen the posts on this subject from Sudhakar, who seems to
be contributing a whole lot of code here, which is a great thing.
However, the problem persists in his code as well, so I think we
should solve this encoding problem in your code first (it is simpler,
and the fix could later be integrated into Sudhakar's code if that
gets checked in).

I have tested this with a simple PPT file containing just the following
text:

Umlaut-Test
Ökologie, Mühsal, Größe, Grätsche

I get the following console output with this text:

Umlaut-Test
\326kologie, M\374hsal, Gr\374\337e, Gr\344tsche

Here is the output I get in a web browser (through a web app, "view
HTML source" mode):

Umlaut-Test ÷kologie, M¸hsal, Gr¸?e, Gr?tsche
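The console output above can be checked by hand: each \ooo sequence is the octal value of one byte, and \326 (0xD6), \374 (0xFC) and \344 (0xE4) are the Windows-1252 codes for Ö, ü and ä, whereas MacRoman maps 0xD6 to ÷ and 0xFC to ¸, which matches the browser output. A minimal sketch that turns the escapes back into bytes and decodes them; the class and method names are hypothetical and not part of Slide or POI:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

final class OctalEscapeDemo {
    // Converts \ooo octal escapes back into single bytes, copies all other
    // (ASCII) characters unchanged, and decodes the bytes with the charset.
    static String decode(String escaped, Charset charset) {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            boolean isEscape = escaped.charAt(i) == '\\'
                    && i + 4 <= escaped.length()
                    && escaped.substring(i + 1, i + 4).matches("[0-3][0-7][0-7]");
            if (isEscape) {
                raw.write(Integer.parseInt(escaped.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                raw.write(escaped.charAt(i));
                i++;
            }
        }
        return new String(raw.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // The second console line, with backslashes doubled for Java source.
        String console = "\\326kologie, M\\374hsal, Gr\\374\\337e, Gr\\344tsche";
        System.out.println(decode(console, Charset.forName("windows-1252")));
    }
}
```

What the terminal shows for the printed line depends on its own output encoding, so the comparison is best done on the String itself rather than on the console.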
German umlauts and other special characters work fine whenever I
extract text from Word documents or Excel spreadsheets using POI and
Ryan Ackley's TextMining framework.

Just for the record: I have only tested this on my own configuration
(Mac OS X 10.3.4, Java 1.4.2_03), so I have no idea how these classes
behave on Linux or Windows. Can anybody confirm this? I have seen
some German names on this list ;-)

Thanks for all the work you put into this.
Ralph Scheuer


RE: MSPowerPointExtractor problem

2004-08-01 Thread Koundinya (Sudhakar Chavali)
Hello all,

This was my first contribution for the Jakarta team:
http://wiki.apache.org/jakarta-lucene-data/attachments/PowerPoint/attachments/PPT2Text.java
It seems another expert (Ryan Rhodes, [EMAIL PROTECTED]) has already
started working on it, based on that first contribution.

That sounds great to me.

So, in order to speed up the development of the PowerPoint extractor, I
wanted to contribute our team's efforts on it.

Authors: Sudhakar Chavali ([EMAIL PROTECTED]) and Hari Shanker Goud
([EMAIL PROTECTED])

Have a look at the source code below.

Regards
Sudhakar



/**
 * Title: DocumentParserException class
 * Description: Root exception class for the runtime errors that can be
 * raised by the different parsers.
 * @author Sudhakar
 * @version 1.0
 */
public class DocumentParserException extends Exception {

  /**
   * Constructs a new exception with null as its detail message.
   */
  public DocumentParserException() {
  }

  /**
   * Constructs a new exception with the specified detail message.
   * @param message
   */
  public DocumentParserException(String message) {
    super(message);
  }

  /**
   * Constructs a new exception with the specified detail message and cause.
   * @param message
   * @param cause
   */
  public DocumentParserException(String message, Throwable cause) {
    super(message, cause);
  }

}
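As an illustration of the two-argument constructor, a parser could wrap a low-level I/O failure so that callers only deal with one exception type while the original cause stays available. This is a sketch only: `FirstByteReader` and `firstByte` are hypothetical names, and the exception class is repeated in reduced form just to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Reduced copy of the exception class, limited to the constructor used here.
class DocumentParserException extends Exception {
    DocumentParserException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class FirstByteReader {
    // Hypothetical helper: returns the first byte of the stream (or -1 if it
    // is empty) and translates I/O failures into DocumentParserException.
    static int firstByte(InputStream in) throws DocumentParserException {
        try {
            return in.read();
        } catch (IOException e) {
            throw new DocumentParserException("Could not read document", e);
        }
    }

    public static void main(String[] args) throws DocumentParserException {
        InputStream in = new ByteArrayInputStream(new byte[] {42});
        System.out.println(firstByte(in)); // prints 42
    }
}
```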
_

import java.io.*;

/**
 * Title: Summary Base
 * Description: A generic interface that reads the document's summary
 * information and returns it through different methods.
 * @author Sudhakar Chavali
 * @version 1.0
 */
public interface SummaryBase {
  /**
   * A method that returns the Document's Author
   * @return String
   */
  public String getDocAuthor();

  /**
   * A method that returns the Document Created Date
   * @return String
   */
  public String getDocCreatedDate();

  /**
   * A method that returns the Document's Key words
   * @return String
   */
  public String getDocKeywords();

  /**
   * A method that returns the Document's comments
   * @return String
   */
  public String getDocComments();

  /**
   * A method that returns the Document Name
   * @return String
   */
  public String getDocName();

  /**
   * A method that returns the Document's Subject
   * @return String
   */
  public String getDocSubject();

  /**
   * A method that returns the Document's title
   * @return String
   */
  public String getDocTitle();

  /**
   * A method that reads the document's Summary Information
   * @throws DocumentParserException
   */
  public void read() throws DocumentParserException;

  /**
   * A method that writes the Document's summary information as XML into a file
   * @param strXMLFile
   * @throws DocumentParserException
   */
  public void write(String strXMLFile) throws DocumentParserException;

  /**
   * A method that writes the document's summary information as XML into an
   * OutputStream object
   * @param out
   * @throws DocumentParserException
   */
  public void write(OutputStream out) throws DocumentParserException;

  /**
   * A method that returns the Document's summary as an XML String
   * @return String
   * @throws DocumentParserException
   */
  public String getSummaryAsXML() throws DocumentParserException;

  /**
   * A method that returns the document's summary information as normal text
   * @return String
   * @throws DocumentParserException
   */
  public String getSummaryAsText() throws DocumentParserException;
}

__

import java.io.*;

/**
 * A generic document that reads the document's text and parses it into plain
 * ASCII text using the different methods.
 */
public interface Document {

  /**
   * A method that returns the document's text after parsing. This method
   * should be called after calling the read method.
   * @return String
   * @see #read()
   * @throws DocumentParserException
   */
  public abstract String getText() throws DocumentParserException;

  /**
   * A method that returns the parsed text as a byte array. This method
   * should be called after calling the read method.
   * @return byte[]
   * @throws DocumentParserException
   */
  public abstract byte[] getBytes() throws DocumentParserException;

  /**
   * A method that writes the parsed text into the OutputStream object. This
   * method should be called after calling the read method.
   * @param out
   * @throws DocumentParserException
   */
  public abstract void write(OutputStream out) throws DocumentParserException, Exception;

  /**
   * A method that reads and parses the document into normal text
   * @throws DocumentParserException
   */
  public abstract void read() throws DocumentParserException, Exception;

  /**
 

RE: MSPowerPointExtractor problem

2004-08-01 Thread Koundinya (Sudhakar Chavali)
Check this,

http://wiki.apache.org/jakarta-lucene-data/attachments/PowerPoint/attachments/PPT2Text.java




RE: MSPowerPointExtractor problem

2004-07-31 Thread Ryan Rhodes
Hi Ralph,

I haven't tested the PPT extractor with any other languages.  I remember
reading about other people having problems with different character sets
though.

Could you send a before and after example file here or to bugzilla?

-Ryan Rhodes





MSPowerPointExtractor problem

2004-07-28 Thread Ralph Scheuer
Hello everybody,
When I was searching for a Java class to extract text from PowerPoint 
files, I accidentally discovered Slide.

I pulled the MSPowerPointExtractor class and some other stuff it 
depends on via CVS and tried it for some text extraction.

The method I used looks very similar to the provided example main 
method (see below).

However, when I tried to extract text from a German PowerPoint
presentation, I had some problems with the encoding. I did not know
which encoding to use; converting the output to ISO Latin 1 with my
text editor solved only part of the problem (some German umlauts were
displayed correctly, some were not).

Is this a known issue or am I doing something wrong? Any hints for me?

Thanks in advance.

Ralph Scheuer

BTW, I am using Mac OS X 10.3.4 with JDK 1.4.2_03; the native encoding
on this platform is MacRoman.

public static String contentStringForData(NSData data) {
    StringBuffer buf = new StringBuffer();
    try {
        ByteArrayInputStream input = data.stream();
        MSPowerPointExtractor ex = new MSPowerPointExtractor(null, null);
        Reader reader = ex.extract(input);

        // Append every character up to, but not including, the
        // end-of-stream marker (-1).
        int c;
        while ((c = reader.read()) != -1) {
            buf.append((char) c);
        }
    } catch (Exception e) {
        // Extraction errors are ignored; whatever was read so far is returned.
    }
    return buf.toString();
}
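When draining a Reader like this, the end-of-stream value -1 has to be tested before anything is appended, otherwise (char) -1, that is U+FFFF, ends up at the end of the result. A minimal sketch of the same loop with a chunked read follows; `ReaderDrain` and `drain` are hypothetical names, and StringBuilder assumes Java 5 or later (StringBuffer would do on 1.4).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class ReaderDrain {
    // Reads the reader to exhaustion in chunks; read(char[]) returns the
    // number of characters read, or -1 once the end of the stream is reached.
    static String drain(Reader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] chunk = new char[4096];
        int n;
        while ((n = reader.read(chunk)) != -1) {
            out.append(chunk, 0, n);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(drain(new StringReader("Umlaut-Test")));
    }
}
```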