On 07/31/2010 10:10 PM, Rob Oakes wrote:
...then running a filter on the HTML...

I think this is really the key. You can't assume that the converter, or any converter, is going to produce nice LaTeX, suitable for import into LyX. But the kinds of mistakes a given converter makes will be fairly predictable, and that means you can write a script to fix them.

What I'd propose is that someone begin a proper word2lyx project based upon Rob's ideas. The actual work in the project would involve writing one or more filters that take the output of html2latex or writer2latex and clean it up in a way that is appropriate for LyX. Possibly there would also be a filter run on the LyX document generated by tex2lyx, which might want cleaning up in various ways. Note that this could even handle Steve's problem: As long as the converters do something predictable, we can fix it programmatically.

Just as an example, I've pasted below the Perl script I used to clean up the output of wp2latex when I was converting WordPerfect files. It's pretty trivial, really. The nice thing, of course, is that this kind of filter can easily develop over time, as people find more annoyances.

The coding for this project should preferably be done in Python, so that it could actually be shipped with LyX.
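
To make the idea concrete, here is a minimal sketch of how such a filter might be organized in Python. It is only an illustration: the names RULES and clean_line are made up for this example, and the handful of rules are ported from the Perl script below, not taken from any existing LyX code.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement) pairs, applied in
# order. Only a few rules from the Perl script are ported here.
RULES = [
    # Drop layout environments that LyX does not need.
    (re.compile(r"\\(?:begin|end)\{(?:flushleft|center|tabbing)\}"), ""),
    (re.compile(r"\\bigskip"), ""),
    # Old-style italics become \emph{...}.
    (re.compile(r"\{\\it ([^}]+)\}"), r"\\emph{\1}"),
    (re.compile(r"\\endnote"), r"\\footnote"),
    # Remove empty groups left behind by the rules above.
    (re.compile(r"\{\}"), ""),
]


def clean_line(line: str) -> str:
    """Apply every rule, in order, to a single line of LaTeX."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line


def clean_lines(lines):
    """Yield cleaned lines from any iterable of lines, e.g. an open file."""
    for line in lines:
        yield clean_line(line)
```

Adding a new fix then amounts to appending one more (pattern, replacement) pair. Since the rules run in order, later rules can tidy up after earlier ones, as the empty-group rule does here.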

Richard

=====


#!/usr/bin/perl
# Cleans up wp2latex output before import into LyX.
# Reads the .tex file named on the command line; writes to standard output.

use strict;
use warnings;

my $file = $ARGV[0];
die "Usage: $0 file.tex\n" unless defined $file;
open(my $tex, '<', $file) or die "Couldn't open file $file: $!\n";

# Copy the first two lines unchanged, then define \IndexNum, which one of
# the substitutions below relies on.
for (1 .. 2) {
        my $header = <$tex>;
        print $header if defined $header;
}
print "\\newcommand{\\IndexNum}{\\refstepcounter{Index}\\theIndex }\n";

while (my $line = <$tex>) {
        # Literal braces are escaped throughout; newer Perl versions may
        # warn about or reject unescaped ones.
        $line =~ s/\\begin\{flushleft\}//g;
        $line =~ s/\\end\{flushleft\}//g;
        $line =~ s/\\begin\{tabbing\}//g;
        $line =~ s/\\end\{tabbing\}//g;
        $line =~ s/\\kill//g;
        $line =~ s/\\begin\{indenting\}\{[^}]*cm\}//g;
        $line =~ s/\\end\{indenting\}//g;
        $line =~ s/\\begin\{center\}//g;
        $line =~ s/\\end\{center\}//g;
        $line =~ s/\\bigskip//g;
        $line =~ s/\\testlastline//g;
        $line =~ s/\\zerotestlastline//g;
        $line =~ s/\\baselineskip\s*=\s*\d*\.\d*ex//g;
        $line =~ s/\\stepcounter\{Section\}\\theSection \{\\bf \.\s*(.*)\}/\\section{$1}/g;
        $line =~ s/\{\\it ([^}]+)\}/\\emph{$1}/g;
        $line =~ s#\\/##g;
        $line =~ s/\\stepcounter\{Index\}\\theIndex/\\IndexNum/g;
        $line =~ s/\$\s*\$/ /g;
        $line =~ s/\{\\penalty\d*\}//g;
        $line =~ s/\\hspace\*?\{[^}]*?\}//g;
        $line =~ s/\\-//g;
        $line =~ s|\\>||g;
        $line =~ s|\\=||g;
        $line =~ s/kern-\d*\.\d*cm//g;
        # Turn forced line breaks into paragraph breaks.
        $line =~ s/\\\\/\n\n/g;
#       $line =~ s/\$[^{}]*\$//g;
        $line =~ s/\\hyphenpenalty \d+ //g;
        $line =~ s/\\nwln//g;
        $line =~ s/\{?\\nobreak\}?//g;
        $line =~ s/\{\}//g;
        $line =~ s/ I~/ I /g;
        $line =~ s/\\endnote/\\footnote/g;
        $line =~ s/\^\{\{\\rm \{\\bf \\GrBox\(1001\)\}\}\}/\\ulcorner/g;
        $line =~ s/\^\{\{\\rm \{\\bf \\GrBox\(0011\)\}\}\}/\\urcorner/g;
        print $line;
}

close($tex);
