Word Import

Richard Heck Sun, 01 Aug 2010 06:37:26 -0700

On 07/31/2010 10:10 PM, Rob Oakes wrote:

...then running a filter on the HTML...

I think this is really the key. You really can't assume that the converter---any converter---is going to produce nice LaTeX, suitable for import into LyX. But the kinds of mistakes the converter makes will be fairly predictable, and that means you can write a script to fix them.

What I'd propose is that someone begin a proper word2lyx project based upon Rob's ideas. The actual work in the project would involve writing one or more filters to take the output of html2latex or writer2latex and clean it up in a way that is appropriate for LyX. And, possibly additionally, to run some kind of filter on the LyX document generated by tex2lyx, which might want cleaning up in various ways. Note that this could even handle Steve's problem: As long as the converters do something predictable, we can fix it programmatically. Just as an example, I've pasted below the perl script I used to clean up the output of wp2latex, when I was converting WordPerfect files. It's pretty trivial, really. The nice thing, of course, is that this kind of filter can easily develop over time, as people find more annoyances.

The coding for this project should be done in Python, preferably, so that it could actually be shipped with LyX.


Richard

=====


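#!/usr/bin/perl
use strict;
use warnings;

# Usage: pass the wp2latex output file as the only argument;
# the cleaned LaTeX is written to standard output.
my $file = shift @ARGV;
die "Usage: $0 file.tex\n" unless defined $file;
open(TEXFILE, "<", $file) or die "Couldn't open file $file: $!\n";

# Copy the first two lines through unchanged, then define a helper macro.
my $line = <TEXFILE>;
print $line;
$line = <TEXFILE>;
print $line;
print "\\newcommand{\\IndexNum}{\\refstepcounter{Index}\\theIndex }\n";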

while (<TEXFILE>) {
        $line = $_;

        # Strip layout environments and spacing commands.
        $line =~ s/\\begin\{flushleft\}//g;
        $line =~ s/\\end\{flushleft\}//g;
        $line =~ s/\\begin\{tabbing\}//g;
        $line =~ s/\\end\{tabbing\}//g;
        $line =~ s/\\kill//g;
        $line =~ s/\\begin\{indenting\}\{.*cm\}//g;
        $line =~ s/\\end\{indenting\}//g;
        $line =~ s/\\begin\{center\}//g;
        $line =~ s/\\end\{center\}//g;
        $line =~ s/\\bigskip//g;
        $line =~ s/\\testlastline//g;
        $line =~ s/\\zerotestlastline//g;
        $line =~ s/\\baselineskip\s*=\s*\d*\.\d*ex//g;

        # Turn the converter's hand-numbered headings into real sections.
        $line =~ s/\\stepcounter\{Section\}\\theSection \{\\bf\.\s*(.*)\}/\\section{$1}/g;

        # Replace font switches with semantic markup; drop italic corrections.
        $line =~ s/\{\\it ([^}]+)\}/\\emph{$1}/g;
        $line =~ s#\\/##g;
        $line =~ s/\\stepcounter\{Index\}\\theIndex/\\IndexNum/g;

        # Remove empty math, penalties, manual spacing, hyphenation hints and tab stops.
        $line =~ s/\$\s*\$/ /g;
        $line =~ s/\{\\penalty\d*\}//g;
        $line =~ s/\\hspace\*?\{[^}]*?\}//g;
        $line =~ s/\\-//g;
        $line =~ s|\\>||g;
        $line =~ s|\\=||g;
        $line =~ s/kern-\d*.\d*cm//g;

        # Forced line breaks become paragraph breaks.
        $line =~ s/\\\\/\n\n/g;
#       $line =~ s/\$[^{}]*\$//g;
        $line =~ s/\\hyphenpenalty \d+ //g;
        $line =~ s/\\nwln//g;
        $line =~ s/\{?\\nobreak\}?//g;
        $line =~ s/\{\}//g;
        $line =~ s/ I~/ I /g;
        $line =~ s/\\endnote/\\footnote/g;

        # Map the box-drawing glyphs used as corner quotes to LaTeX symbols.
        $line =~ s/\^\{\{\\rm \{\\bf \\GrBox\(1001\)\}\}\}/\\ulcorner/g;
        $line =~ s/\^\{\{\\rm \{\\bf \\GrBox\(0011\)\}\}\}/\\urcorner/g;
        print $line;
}

close (TEXFILE);
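
Since the suggestion above is to do this in Python, here is a rough sketch of how a few of the same substitutions might look as a rule table. It is only an illustration: the names RULES and clean_latex are invented for this example, only a handful of the Perl rules are ported, and a real word2lyx filter would need rules tuned to what html2latex or writer2latex actually emit.

```python
import re

# Ordered (pattern, replacement) pairs, ported from a few of the Perl
# substitutions above. RULES and clean_latex are hypothetical names.
RULES = [
    # Drop layout environments that only carry visual formatting.
    (re.compile(r"\\(?:begin|end)\{(?:flushleft|tabbing|center)\}"), ""),
    (re.compile(r"\\bigskip"), ""),
    # {\it text} becomes \emph{text}.
    (re.compile(r"\{\\it ([^}]+)\}"), r"\\emph{\1}"),
    # Remove italic corrections (\/).
    (re.compile(r"\\/"), ""),
    # Endnotes become footnotes.
    (re.compile(r"\\endnote"), r"\\footnote"),
    # Remove empty groups left behind by the converter.
    (re.compile(r"\{\}"), ""),
]


def clean_latex(text):
    """Apply every rule in order and return the cleaned text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Keeping the rules in a list means a newly discovered annoyance is handled by adding one more line to the table, and the order of the entries is the order in which they are applied.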
