On 07/31/2010 10:10 PM, Rob Oakes wrote:
> ...then running a filter on the HTML...
I think this is the key. You really can't assume that the
converter---any converter---is going to produce nice LaTeX, suitable for
import into LyX. But the kinds of mistakes the converter makes will be
fairly predictable, and that means you can write a script to fix them.
What I'd propose is that someone begin a proper word2lyx project based
upon Rob's ideas. The actual work in the project would involve writing
one or more filters, to take the output of html2latex or writer2latex
and clean it up in a way that is appropriate for LyX. Possibly it would
also involve running some kind of filter on the LyX document generated
by tex2lyx, which might want cleaning up in various ways. Note that this
could even handle Steve's problem: As long as the converters do
something predictable, we can fix it programmatically. Just as an
example, I've pasted below the Perl script I used to clean up the output
of wp2latex, when I was converting WordPerfect files. It's pretty
trivial, really. The nice thing, of course, is that this kind of filter
can easily develop over time, as people find more annoyances.
The coding for this project should preferably be done in Python, so
that it could actually be shipped with LyX.
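To make the two-stage idea concrete, here is a minimal Python sketch: one
rule list cleans the converter's LaTeX before tex2lyx runs, and a second
(empty for now) cleans the LyX file afterwards. The names LATEX_RULES,
LYX_RULES, apply_rules and convert are made up for illustration, and the
sketch assumes a tex2lyx binary on the PATH that accepts -f (overwrite)
followed by the input and output files. It has not been tried on real
converter output.

```python
import re
import subprocess
from pathlib import Path

# Hypothetical pre-filter rules: (pattern, replacement) pairs applied to the
# converter's LaTeX output before tex2lyx sees it.
LATEX_RULES = [
    (re.compile(r"\\bigskip"), ""),
    (re.compile(r"\{\\it ([^}]+)\}"), r"\\emph{\1}"),
]

# Hypothetical post-filter rules for the .lyx text that tex2lyx writes.
# Left empty here; to be filled in as annoyances turn up.
LYX_RULES = []


def apply_rules(text, rules):
    """Run every (pattern, replacement) pair over the text, in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


def convert(tex_in, lyx_out, run_tex2lyx=None):
    """Clean the LaTeX, hand it to tex2lyx, then clean the resulting LyX."""
    tex_in, lyx_out = Path(tex_in), Path(lyx_out)
    cleaned = tex_in.with_name(tex_in.stem + ".clean.tex")
    cleaned.write_text(apply_rules(tex_in.read_text(), LATEX_RULES))

    if run_tex2lyx is None:
        # Assumption: tex2lyx is installed and on the PATH.
        def run_tex2lyx(src, dst):
            subprocess.run(["tex2lyx", "-f", str(src), str(dst)], check=True)

    run_tex2lyx(cleaned, lyx_out)
    lyx_out.write_text(apply_rules(lyx_out.read_text(), LYX_RULES))
```

Something like convert("thesis.tex", "thesis.lyx") would then be the whole
user-visible interface; the run_tex2lyx argument is only there so the
filters can be exercised without LyX installed.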
Richard
=====
#! /usr/bin/perl
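# Open the LaTeX file named on the command line.
open(TEXFILE, "<", $ARGV[0]) || die ("Couldn't open file " . $ARGV[0]);
# Copy the first two lines through untouched.
$line = <TEXFILE>;
print $line;
$line = <TEXFILE>;
print $line;
# Define \IndexNum, which a substitution below uses in place of
# \stepcounter{Index}\theIndex.
print "\\newcommand{\\IndexNum}{\\refstepcounter{Index}\\theIndex }\n";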
while (<TEXFILE>) {
$line = $_;
$line =~ s/\\begin{flushleft}//g;
$line =~ s/\\end{flushleft}//g;
$line =~ s/\\begin{tabbing}//g;
$line =~ s/\\end{tabbing}//g;
$line =~ s/\\kill//g;
$line =~ s/\\begin{indenting}{.*cm}//g;
$line =~ s/\\end{indenting}//g;
$line =~ s/\\begin{center}//g;
$line =~ s/\\end{center}//g;
$line =~ s/\\bigskip//g;
$line =~ s/\\testlastline//g;
$line =~ s/\\zerotestlastline//g;
$line =~ s/\\baselineskip\s*=\s*\d*\.\d*ex//g;
$line =~ s/\\stepcounter{Section}\\theSection {\\bf \.\s*(.*)}/\\section{$1}/g;
$line =~ s/{\\it ([^}]+)}/\\emph{$1}/g;
$line =~ s#\\/##g;
$line =~ s/\\stepcounter{Index}\\theIndex/\\IndexNum/g;
$line =~ s/\$\s*\$/ /g;
$line =~ s/{\\penalty\d*}//g;
$line =~ s/\\hspace\*?{[^}]*?}//g;
$line =~ s/\\-//g;
$line =~ s|\\>||g;
$line =~ s|\\=||g;
$line =~ s/kern-\d*.\d*cm//g;
$line =~ s/\\\\/\n\n/g;
# $line =~ s/\$[^{}]*\$//g;
$line =~ s/\\hyphenpenalty \d+ //g;
$line =~ s/\\nwln//g;
$line =~ s/{?\\nobreak}?//g;
$line =~ s/{}//g;
$line =~ s/ I~/ I /g;
$line =~ s/\\endnote/\\footnote/g;
$line =~ s/\^\{\{\\rm \{\\bf \\GrBox\(1001\)\}\}\}/\\ulcorner/g;
$line =~ s/\^\{\{\\rm \{\\bf \\GrBox\(0011\)\}\}\}/\\urcorner/g;
print ($line);
}
close (TEXFILE);
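
For comparison, a rough Python sketch of the same approach, porting only
a handful of the substitutions above. The names RULES and clean_line are
made up for illustration. Unlike the Perl script, it does not pass the
first two lines through untouched or emit the \IndexNum definition, and
it has not been run against real wp2latex output.

```python
import re
import sys

# A few of the Perl substitutions, as (pattern, replacement) pairs.
# Order matters: later rules see the output of earlier ones.
RULES = [
    (r"\\(?:begin|end)\{(?:flushleft|tabbing|center)\}", ""),
    (r"\\bigskip", ""),
    (r"\{\\it ([^}]+)\}", r"\\emph{\1}"),  # old-style italics -> \emph
    (r"\\/", ""),                           # drop italic corrections
    (r"\\hspace\*?\{[^}]*?\}", ""),
    (r"\\\\", "\n\n"),                      # forced line break -> paragraph break
    (r"\\endnote", r"\\footnote"),
]
COMPILED = [(re.compile(pattern), repl) for pattern, repl in RULES]


def clean_line(line):
    """Apply every rule, in order, to one line of converter output."""
    for pattern, repl in COMPILED:
        line = pattern.sub(repl, line)
    return line


def main(path):
    """Print the cleaned version of the file at `path` to stdout."""
    with open(path) as tex:
        for line in tex:
            sys.stdout.write(clean_line(line))


# Only run when given a file name on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Keeping the rules in a plain list means that adding a fix for a newly
found annoyance is a one-line change.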