Re: Bridge with OpenOffice

2004-04-19 Thread Stephane James Vaucher
I'll make a copy of the code available on the wiki before it disappears 
off the Web.

Now for some info on using OO on a production system:

http://www.oooforum.org/forum/viewtopic.php?t=2913&highlight=jurt

OO works well (though it is slow), but it is 
not multi-threaded (only the communication bridge is).

Quotes from end of 2003:

Kai Sommerfeld from Sun wrote:

Quote:
The answer is not a simple 'yes' or 'no'. It's more: 'partly'. There
are parts of the OOo API that are threadsafe, others are not. Newer
components are generally threadsafe. Components that are mainly
wrappers for old Office code are mostly not. A main problem is that we
cannot state for sure which components are actually thread safe and
which are not. It's as bad as I say it here.

We're trying to solve the multithreading issues for one of the next
major releases of OOo. But this is definitely not an easy task,
especially, since rewriting all non-threadaware code is simply not an
option because of missing developer resources.

Juergen Schmidt from Sun wrote:

Quote:
If you want to use OO in a safe way you shouldn't use it
multi-threaded. But we want to improve the server functionality of OO in
general so that your described scenario should be possible.

Sorry, but currently you have to work around this in your own application
and you should use OO single-threaded. But as I said, we are working on
this feature.

Niklas Nebel from Sun, who seems to have had success running some code 
multi-threaded, wrote:

Quote:
The document API functions use the SolarMutex, so you should be able to
use them from multiple threads without problems (with one call blocking
the next, of course). Listener callbacks might be a problem if handled
by different threads, but at least for the spreadsheet API I'm not aware
of any other problems.

Don't forget that every API call over a connection to a running office
is "multi-threaded", as the connection is handled by a different thread
from office user interactions.
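
One way to follow the "use OO single threaded" advice quoted above is to
funnel every office call through one worker thread. A minimal sketch,
assuming a hypothetical converter function that stands in for whatever code
actually talks to the office bridge (the class name SingleThreadedOffice is
also made up for illustration):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/**
 * Serializes all calls to a non-thread-safe converter by running them on
 * one dedicated worker thread. The converter is a placeholder (hypothetical);
 * plug in the code that talks to the office bridge.
 */
class SingleThreadedOffice implements AutoCloseable {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Function<String, String> converter;

    SingleThreadedOffice(Function<String, String> converter) {
        this.converter = converter;
    }

    /** Queues a conversion; callers on any thread get a Future back. */
    Future<String> submit(final String documentPath) {
        Callable<String> task = () -> converter.apply(documentPath);
        return worker.submit(task);
    }

    /** Stops accepting new work; already queued conversions still run. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

Callers on any number of threads can call submit(); the conversions still
run one after another, which is the same "one call blocking the next"
behaviour described for the SolarMutex, only enforced on the client side.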

sv

On Mon, 19 Apr 2004, Magnus Johansson wrote:

> Yes I have tried it and it seems to work ok.
> I haven't really used it in a production environment
> however.
> 
> There was some code here
> 
> http://www.gzlinux.org/docs/category/dev/java/doc2txt.pdf
> 
> It is no longer there, but a Google HTML version is
> available at
> 
> http://66.102.9.104/search?q=cache:549doYEZTD4J:www.gzlinux.org/docs/category/dev/java/doc2txt.pdf+Appending+the+favoured+extension+to+the+origin+document+name&hl=en&ie=UTF-8
> 
> 
> /magnus
> 
> 
> > Anyone try what Joerg suggested here?
> >
> > http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=6231
> >
> > sv
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >



Re: Bridge with OpenOffice

2004-04-19 Thread Mario Ivankovits
Stephane James Vaucher wrote:

> Anyone try what Joerg suggested here?
> http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=6231

I don't know what you would like to do, but if you simply want to
extract text, you could try this snippet:


---snip---
JarFile jar = new JarFile(file, false);
ZipEntry entry = jar.getEntry("content.xml");
if (entry == null)
{
    throw new IOException("content.xml missing in file: " + file);
}
InputStream is = jar.getInputStream(entry);

XMLReader xr = XMLReaderFactory.createXMLReader("org.apache.crimson.parser.XMLReaderImpl");
xr.setEntityResolver(new EntityResolver()
{
    public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException
    {
        if (systemId.toLowerCase().endsWith(".dtd"))
        {
            StringReader stringInput = new StringReader(" ");
            return new InputSource(stringInput);
        }
        else
        {
            return null;
        }
    }
});

final StringBuffer sbText = new StringBuffer(10240);
xr.setContentHandler(new ContentHandler()
{
    public void skippedEntity(String name) throws SAXException
    {
    }

    public void setDocumentLocator(Locator locator)
    {
    }

    public void ignorableWhitespace(char ch[], int start, int length) throws SAXException
    {
    }

    public void processingInstruction(String target, String data) throws SAXException
    {
    }

    public void startDocument() throws SAXException
    {
    }

    public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException
    {
        if (qName.equals("text:p"))
        {
            if (sbText.length() > 0 && sbText.charAt(sbText.length() - 1) != '\n')
            {
                sbText.append('\n');
            }
        }
    }

    public void endPrefixMapping(String prefix) throws SAXException
    {
    }

    public void characters(char ch[], int start, int length) throws SAXException
    {
        sbText.append(ch, start, length);
    }

    public void endElement(String namespaceURI, String localName, String qName) throws SAXException
    {
    }

    public void endDocument() throws SAXException
    {
    }

    public void startPrefixMapping(String prefix, String uri) throws SAXException
    {
    }
});

InputSource source = new InputSource(is);
source.setPublicId("");
source.setSystemId("");
xr.parse(source);

System.err.println("TXT: " + sbText.toString());
---snip---
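
For reference, the same idea can be written more compactly by extending
DefaultHandler, which already supplies the empty callbacks. This is only a
sketch under a few assumptions: it uses whatever SAX parser the JDK provides
through SAXParserFactory instead of Crimson, it closes the archive when done,
and the names OoTextExtractor and extractText are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class OoTextExtractor {

    /** Opens the document archive and extracts the text of content.xml. */
    static String extractText(File file)
            throws IOException, SAXException, ParserConfigurationException {
        try (ZipFile zip = new ZipFile(file)) {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                throw new IOException("content.xml missing in file: " + file);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return extractText(in);
            }
        }
    }

    /** Collects character data, starting a new line at each text:p element. */
    static String extractText(InputStream contentXml)
            throws IOException, SAXException, ParserConfigurationException {
        final StringBuilder text = new StringBuilder(10240);
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Hand back an empty DTD instead of trying to load the real one.
                if (systemId != null && systemId.toLowerCase().endsWith(".dtd")) {
                    return new InputSource(new StringReader(""));
                }
                return null;
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("text:p") && text.length() > 0
                        && text.charAt(text.length() - 1) != '\n') {
                    text.append('\n');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // The default factory is not namespace-aware, so qName is "text:p".
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.setEntityResolver(handler);
        reader.parse(new InputSource(contentXml));
        return text.toString();
    }
}
```

The InputStream overload exists mainly so the parsing part can be exercised
without a real document file.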


Ciao,

Mario








Re: Optimize crash

2004-04-19 Thread Paul
Dear all,

I hate to be insistent, but I have a large live website with a growing,
un-optimizable Lucene index, which therefore has its appointment
with destiny pencilled into The Diary of Doom on a date roughly
three weeks hence.

So if I'm doing something stupid, or there's a workaround, or someone
is already looking into this problem, *please* let me know. My alternative
is to spend two days re-indexing the archive, and then to just wait for the 
inevitable repeat of this problem, like Groundhog Day, which isn't a
particularly attractive option.

(NB: The original message is under the same subject line in the archive.)

Thanks.

Cheers,
Paul.




Re: Bridge with OpenOffice

2004-04-19 Thread Stephane James Vaucher
Actually, the objective would be to use OO to extract text from MSOffice 
formats. If I read your code correctly, it should only work with OO 
documents, since those are stored as XML. 

Thanks for the code for OO docs though,
sv

On Mon, 19 Apr 2004, Mario Ivankovits wrote:

> Stephane James Vaucher wrote:
> 
> > Anyone try what Joerg suggested here?
> > http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=6231 
> >
> >  
> >
> I don't know what you would like to do, but if you simply want to 
> extract text, you could try this snippet:
> 
> Ciao,
> Mario
> 





Re: Bridge with OpenOffice

2004-04-19 Thread Tatu Saloranta
On Monday 19 April 2004 14:01, Mario Ivankovits wrote:
> Stephane James Vaucher wrote:
> > Anyone try what Joerg suggested here?
> > http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]
> >pache.org&msgNo=6231
>
> I don't know what you would like to do, but if you simply want to
> extract text, you could try this snippet:

This leads to a question I have been thinking about. It seems this thread 
originally started with someone pointing out that OO can be used as a 
converter from other formats... but how about a tokenizer for native OO 
documents? I have written full-featured converters from OO to (simplified) 
DocBook and HTML, and creating one that just tokenizes for use by Lucene 
would be much easier. Even if it tokenized into separate fields (document 
metadata, content, maybe bibliography separately, etc.), it'd be easy to do.

Would anyone find a full-featured, customizable OpenOffice document tokenizer 
useful?
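
To make the "separate fields" idea concrete, here is a rough sketch that
pulls a few metadata fields out of meta.xml and the body text out of
content.xml. It is not the converter code mentioned above. The class and
method names are hypothetical, and the element names (dc:title, dc:creator,
meta:keyword, text:p, text:h) are assumptions about what OO 1.x files
contain, so check them against a real document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class OoFieldExtractor {

    /** Appends the text of selected elements to named field buffers. */
    private static class FieldCollector extends DefaultHandler {
        private final Map<String, String> elementToField; // qName -> field name
        private final Map<String, StringBuilder> fields;
        private final Deque<String> open = new ArrayDeque<>(); // fields of open mapped elements

        FieldCollector(Map<String, String> elementToField, Map<String, StringBuilder> fields) {
            this.elementToField = elementToField;
            this.fields = fields;
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            // Never load external entities such as the DTD; substitute empty input.
            return new InputSource(new StringReader(""));
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            String field = elementToField.get(qName);
            if (field != null) {
                open.push(field);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            String field = open.peek(); // null when outside every mapped element
            if (field != null) {
                buffer(field).append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (elementToField.containsKey(qName)) {
                buffer(open.pop()).append('\n'); // one line per mapped element
            }
        }

        private StringBuilder buffer(String field) {
            return fields.computeIfAbsent(field, k -> new StringBuilder());
        }
    }

    /** Returns field name -> text for the given meta.xml and content.xml streams. */
    static Map<String, String> extractFields(InputStream metaXml, InputStream contentXml)
            throws IOException, SAXException, ParserConfigurationException {
        Map<String, String> metaElements = new HashMap<>();
        metaElements.put("dc:title", "title");        // assumed element names
        metaElements.put("dc:creator", "creator");
        metaElements.put("meta:keyword", "keywords");

        Map<String, String> contentElements = new HashMap<>();
        contentElements.put("text:p", "content");
        contentElements.put("text:h", "content");

        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.newSAXParser().parse(metaXml, new FieldCollector(metaElements, buffers));
        factory.newSAXParser().parse(contentXml, new FieldCollector(contentElements, buffers));

        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : buffers.entrySet()) {
            result.put(e.getKey(), e.getValue().toString().trim());
        }
        return result;
    }
}
```

Each entry of the returned map could then be added to a Lucene document as
its own field. Making it customizable would mostly mean letting the caller
supply the two element-to-field maps.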

-+ Tatu +-






Re: Bridge with OpenOffice

2004-04-19 Thread Argyn
I have the same requirement. I used antiword, xlhtml and ppthtml on win2k, 
calling them with Runtime.exec(). There are still problems: all three hang 
sometimes. Otherwise, it worked. I indexed several hundred thousand files 
in development mode, but I never got into production.
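
A hanging external converter can at least be contained with a timeout. The
sketch below is not the code described above; it assumes Java 8 or later
(for Process.waitFor with a timeout), and the ExternalConverter class and its
run method are made-up names. Output goes to a temporary file so that a
chatty tool cannot block on a full pipe, and it is decoded as UTF-8, which is
also an assumption.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;
import java.util.concurrent.TimeUnit;

class ExternalConverter {

    /**
     * Runs a command and returns everything it wrote to stdout and stderr.
     * Throws IOException if the command does not finish within the timeout
     * or exits with a non-zero code.
     */
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        File output = File.createTempFile("converter", ".out");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .redirectOutput(output)    // write to a file, not a pipe
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();     // kill the hung converter
                throw new IOException("Timed out after " + timeoutSeconds + "s: " + command);
            }
            if (process.exitValue() != 0) {
                throw new IOException("Exit code " + process.exitValue() + ": " + command);
            }
            return new String(Files.readAllBytes(output.toPath()), StandardCharsets.UTF_8);
        } finally {
            output.delete();
        }
    }
}
```

A call might look like ExternalConverter.run(Arrays.asList("antiword",
"report.doc"), 60), assuming the tool is on the PATH and writes its text to
standard output. A file that times out can then be logged and skipped
instead of stalling the whole indexing run.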

Argyn

On Mon, 19 Apr 2004 16:53:41 -0400 (EDT), Stephane James Vaucher 
<[EMAIL PROTECTED]> wrote:

> Actually, the objective would be to use OO to extract text from MSOffice
> formats. If I read your code correctly, your code should only work with
> OO as the docs are in xml.
>
> Thanks for the code for OO docs though,
> sv



Re: Bridge with OpenOffice

2004-04-19 Thread Peter Becker
We did a simple one a while ago. It could probably be a bit more 
sophisticated, but it seems to do its job in the little bit of testing we 
did.

See 
http://cvs.sourceforge.net/viewcvs.py/toscanaj/docco/source/org/tockit/docco/documenthandler/OpenOfficeDocumentHandler.java?rev=1.4&view=auto

HTH,
 Peter
PS: sorry for the broken whitespace -- I just noticed that myself.

Tatu Saloranta wrote:

> On Monday 19 April 2004 14:01, Mario Ivankovits wrote:
> > Stephane James Vaucher wrote:
> > > Anyone try what Joerg suggested here?
> > > http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]
> > > pache.org&msgNo=6231
> >
> > I don't know what you would like to do, but if you simply want to
> > extract text, you could try this snippet:
>
> This leads to a question I have been thinking about. It seems this thread 
> originally started with someone pointing out that OO can be used as a 
> converter from other formats... but how about a tokenizer for native OO 
> documents? I have written full-featured converters from OO to (simplified) 
> DocBook and HTML, and creating one that just tokenizes for use by Lucene 
> would be much easier. Even if it tokenized into separate fields (document 
> metadata, content, maybe bibliography separately, etc.), it'd be easy to do.
>
> Would anyone find a full-featured, customizable OpenOffice document tokenizer 
> useful?
>
> -+ Tatu +-


