Re: Using the Cocoon pipeline outside web apps

admin Mon, 16 Dec 2002 17:59:01 -0800

It must be possible, since an OpenOffice file is just a set of XML files 
zipped into one.  I did the opposite: I created one XML file from the zip 
file in order to publish it through Cocoon.  I used Perl.  Here's my script.
It doesn't do what you want, but hey, it's Open Source, right?



#!/usr/bin/perl
# Written by Yves Vindevogel - [EMAIL PROTECTED]
# 14-Nov-2002

# This script opens an OpenOffice document (which is a zip file)
# and exports all the files in the document to XML
#
# Usage: oo2xml inputfile outputfile


# Check if the input file exists
unless (-e $ARGV[0])
{       die "oo2xml error: Could not find input file\n" ;
} ;

# Run a system command to unzip the file into a temp xml file
# unzip -p opens the zip file and puts the content in the pipe
# Since the content of an OpenOffice file is plain XML,
# all the files in the OO file are put into the pipe.
# The pipe is then redirected into a file, so the temp file
# contains all the content, in XML.
# The temp file is not a valid XML document yet !!
# Some modifications must be made to it first.
# system returns 0 on success, so compare against 0 before dying
system("unzip -p \"$ARGV[0]\" > /tmp/tmp.xml") == 0
        or die "oo2xml error: Could not unzip the input file\n";

# Open the temp xml file
open (tmp, "/tmp/tmp.xml")
        || die "oo2xml error: Could not open temp file\n" ;

# Open second temp file to split the tags
# When the tags are not split and a <!tag> comes second on a line,
# the complete line is dropped, resulting in bugs
# Therefore, in a first pass, the tags are rewritten to separate lines
open (tmp2, "> /tmp/tmp2.xml")
        || die "oo2xml error: Could not open temp split file\n" ;

# Loop through lines and split by entering a \n between the > and <
while ($line = <tmp>)
{
        $line =~ s/></>\n</g ;

        print tmp2 $line ;
} ;

# Close them
close tmp2 ;
close tmp ;

# Open the filtered input file
open (tmp, "/tmp/tmp2.xml")
        || die "oo2xml error: Could not open split file\n" ;

# Open the output file
open (xml, "> $ARGV[1]")
        || die "oo2xml error: Could not open output file\n" ;

# Print the office:document tag
# The complete document needs to be enclosed by one root element
# The root element will thus be <office:document>
print xml "<?xml version=\x221.0\x22 encoding=\x22UTF-8\x22?>\n" ; # \x22 = "
print xml "<office:document " ;
print xml "xmlns:office=\x22http://openoffice.org/2000/office\x22 " ;
print xml "xmlns:style=\x22http://openoffice.org/2000/style\x22 " ;
print xml "xmlns:text=\x22http://openoffice.org/2000/text\x22 " ;
print xml "xmlns:table=\x22http://openoffice.org/2000/table\x22 " ;
print xml "xmlns:draw=\x22http://openoffice.org/2000/drawing\x22 " ;
print xml "xmlns:fo=\x22http://www.w3.org/1999/XSL/Format\x22 " ;
print xml "xmlns:xlink=\x22http://www.w3.org/1999/xlink\x22 " ;
print xml "xmlns:number=\x22http://openoffice.org/2000/datastyle\x22 " ;
print xml "xmlns:svg=\x22http://www.w3.org/2000/svg\x22 " ;
print xml "xmlns:chart=\x22http://openoffice.org/2000/chart\x22 " ;
print xml "xmlns:dr3d=\x22http://openoffice.org/2000/dr3d\x22 " ;
print xml "xmlns:math=\x22http://www.w3.org/1998/Math/MathML\x22 " ;
print xml "xmlns:form=\x22http://openoffice.org/2000/form\x22 " ;
print xml "xmlns:script=\x22http://openoffice.org/2000/script\x22 " ;
print xml "xmlns:config=\x22http://openoffice.org/2001/config\x22 " ;
print xml "xmlns:meta=\x22http://openoffice.org/2000/meta\x22 " ;
print xml "xmlns:manifest=\x22http://openoffice.org/2001/manifest\x22 " ;
print xml "xmlns:dc=\x22http://purl.org/dc/elements/1.1/\x22 " ;

print xml ">\n" ;

# Loop through the lines in the temp XML file
# Lines with DOCTYPE declarations and version info are omitted
while ($line = <tmp>)
{
        # temp var to see if we need to write the line
        $ok = 1 ;

        # Two reasons not to write the line: processing instructions
        # and doctypes
        if ($line =~ /<\x3F/) { $ok = 0; } ; # \x3F = ?
        if ($line =~ /<!/) { $ok = 0; } ;

        # Remove any xmlns info from the line,
        # all the namespace information is already written in the root element
        # If you don't remove them, you get errors
        if ($line =~ /xmlns/)
        {
                # Split on white space
                @tags = split / /, $line ;

                # Loop through the tags, 
                # if xmlns, check to see if it was the first or last tag
                # If so, write the opening or closing tag
                # otherwise simply write the tag and a white space
                foreach $tag (@tags)
                {
                        if ($tag =~ /xmlns/)
                        {
                                if ($tag =~ /</) { print xml "<"} ;
                                if ($tag =~ />/) { print xml ">\n"} ;
                        }
                        else
                        {
                                print xml $tag, " ";
                        }  ;
                } ;

                # Don't need to write the line, already written
                $ok = 0 ;
        } ;

        # Write the line if the temp var is still 1
        unless ($ok == 0) { print xml $line ; } ;
} ;

# Write document end tag
print xml "</office:document>\n" ;

# Close the remaining file handles
close xml ;
close tmp ;

# Delete the temp files
unlink("/tmp/tmp.xml")
        or warn "oo2xml warning: Temp file could not be deleted\n" ;

unlink("/tmp/tmp2.xml")
        or warn "oo2xml warning: Temp split file could not be deleted\n" ;
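
For illustration only, the two filtering passes of the script can also be
sketched as a single function that works on a string instead of temp files.
The names filter_fragments and $sample are hypothetical, and the sketch
removes namespace declarations with a regular expression rather than by
splitting each line on white space, assuming the declarations are
double-quoted attributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): takes the concatenated
# XML that "unzip -p" produces and returns the lines that would go inside
# the <office:document> wrapper element.
sub filter_fragments {
    my ($text) = @_;

    # First pass: put every tag on its own line.
    $text =~ s/></>\n</g;

    my @kept;
    for my $line (split /\n/, $text) {
        # Drop processing instructions (<?xml ...?>) and DOCTYPE lines.
        next if $line =~ /<\?/ or $line =~ /<!/;

        # Assumption: namespace declarations are double-quoted attributes.
        # Remove them, since the wrapper element already declares them.
        $line =~ s/\s+xmlns(?::[\w.-]+)?="[^"]*"//g;

        push @kept, $line;
    }
    return join("\n", @kept) . "\n";
}

# Made-up sample input resembling one member of an OpenOffice file.
my $sample = '<?xml version="1.0" encoding="UTF-8"?>'
           . '<office:document-meta'
           . ' xmlns:office="http://openoffice.org/2000/office"'
           . ' office:version="1.0">'
           . '<office:meta/></office:document-meta>';

print filter_fragments($sample);
```

As in the script, a line containing a comment or a processing instruction is
dropped as a whole, so the tag-splitting pass has to run first.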






Quoting Olivier Mengué <[EMAIL PROTECTED]>:

> Hi,
> 
> I'm working on a project that will generate OpenOffice.org documents from
> data extracted from a database. Our aim is to automate the publishing of
> the program of hikes for my hikers association. It is currently done with a
> Microsoft Word document merge, which is not satisfying. PDF is not an option
> as publishers have to do additional editing after the automatic step.
> The output document will be many pages long, so we want to process in batch
> instead of as a web application.
> 
> As OpenOffice.org document format is XML, I would like to reuse the Cocoon
> pipeline with an ESQL transformer from a simple Java application.
> 
> My questions are:
> - is it possible ? I mean, is it possible to reuse just the pipeline in a
> standard Java application, without the sitemap and servlet stuff, without
> too much code or too many dependencies. The pipeline would be either
> hard-coded or specified with a simpler sitemap-like configuration file.
> - how ? The package org.apache.cocoon.components.pipeline seems interesting,
> but I don't know which class to use and how to build a simple pipeline with
> a generator, a transformer and serialiser. Then, how to feed the pipeline ?
> 
> Could you point me to the important classes, and the order to create them ?
> 
> 
> Thank you for your help,
> 
> Olivier Mengué
> 
> 
> ---------------------------------------------------------------------
> Please check that your question  has not already been answered in the
> FAQ before posting.     <http://xml.apache.org/cocoon/faq/index.html>
> 
> To unsubscribe, e-mail:     <[EMAIL PROTECTED]>
> For additional commands, e-mail:   <[EMAIL PROTECTED]>

