[Sorry, I've just subscribed, having been thrown a link to this 
discussion by a friend; apologies for breaking the thread]

Hello!

        I've been pointed to a couple of recent posts discussing Project 
Gutenberg texts in Plucker format. I'm working with PG on this stuff, 
and would rather like to chuck some info into the mix about where we're 
going with format conversion.

        Historically, PG texts are in pretty bad shape. The plain-text format 
is inconsistent and lossy, and needs to be cleaned up. The general 
consensus (not without its detractors, but PG does rather tend to work 
like that :-\) is that in the long term, we're moving to storing all 
the texts marked up in XML, probably some variant of TEI. This would 
then be automatically converted to all manner of different formats, 
using XSL transforms and/or other programmatic means.

        However, we're being terribly slow and disorganised about getting going 
with this project, and no matter how fast we are, it will take a 
non-trivial amount of time to convert the entire archive to this format. 
So it's still very worthwhile looking at a solution which can cope with 
the text (and a few HTML) files in their current state.

        I've constructed a generalised framework for format conversion, which 
can manage both lossless (from an XML master) and lossy 
(heuristics-based) transformations, of which there are demos on the PG 
site. I'm currently re-setting it up, so if there are one or two things 
which are thoroughly broken (as opposed to ordinary bugs and 
deficiencies, of which there are quite a few too), I do apologise. I 
was wondering if I could take a peek at the conversion tools some of 
you have been using, and see if it's possible to work them in as a 
filter for this system. Are they publicly available?
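
        To give a rough idea of what "working a tool in as a filter" could 
look like, here is a minimal sketch in Python. It is purely illustrative 
and not the actual framework: the FILTERS registry, the register() 
decorator, the convert() function and the "heuristic" flag are all 
hypothetical names, and the XML is assumed to use plain, un-namespaced 
<p> elements rather than real TEI. One filter reads structure directly 
from an XML master; the other guesses at paragraphs in a plain-text 
file.

```python
import re
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical registry: (source format, target format) -> (filter, heuristic flag).
FILTERS = {}

def register(source, target, heuristic):
    """Record a conversion filter and whether it relies on guesswork."""
    def wrap(func):
        FILTERS[(source, target)] = (func, heuristic)
        return func
    return wrap

@register("xml", "text", heuristic=False)
def xml_to_text(data):
    # Structure comes straight from the markup: one output paragraph per <p>.
    # Assumes un-namespaced <p> elements (real TEI files may differ).
    root = ET.fromstring(data)
    paras = [" ".join("".join(p.itertext()).split()) for p in root.iter("p")]
    return "\n\n".join(paras)

@register("text", "html", heuristic=True)
def text_to_html(data):
    # Heuristic: a blank line is assumed to separate paragraphs.
    blocks = [b for b in re.split(r"\n\s*\n", data.strip()) if b.strip()]
    return "\n".join("<p>%s</p>" % escape(" ".join(b.split())) for b in blocks)

def convert(data, source, target):
    """Look up the registered filter and run it; also report the flag."""
    func, heuristic = FILTERS[(source, target)]
    return func(data), heuristic
```

        An external converter (a Plucker one, say) would slot in as one more 
registered function, and a caller would use it via something like 
convert(data, "text", "html").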

Thanks,
Meredydd
_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
