How is the current list of articles being generated?

On Sun, Sep 7, 2008 at 7:22 PM, Chris Ball <[EMAIL PROTECTED]> wrote:
> Hi,
>
> > Where is the code for this? Lede-detection code is a priority for
> > me, and I'd like to work on it. It should be easy to sense the
> > start of the first H2 and drop the rest of the article.
>
> There is no code for lead detection. You'd have to write it from
> scratch; take the enwiki.xml.bz2 from ¹, run it through your script,
> and output a new enwiki.xml.bz2 with articles substituted for leads
> if the article isn't present in ².
>
> ¹: http://download.wikimedia.org/enwiki/20080724/enwiki-20080724-pages-meta-current.xml.bz2
> ²: http://dev.laptop.org/~cjb/enwiki/en-g1g1-8k.txt
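A minimal sketch of that H2-cutoff idea, assuming the lede is everything
before the first level-2 heading ("== Heading ==") in the wikitext; the
regex and function name below are illustrative, not existing code:

    import re

    # A level-2 wikitext heading is a line like "== History ==" (exactly
    # two equals signs on each side).  The lede is everything before the
    # first such line.
    H2_RE = re.compile(r'^==[^=].*==\s*$', re.MULTILINE)

    def extract_lead(wikitext):
        """Return only the lede: the text before the first level-2 heading."""
        m = H2_RE.search(wikitext)
        return wikitext[:m.start()] if m else wikitext

Articles whose titles appear in en-g1g1-8k.txt would be left untouched;
everything else would be passed through extract_lead() before being
re-bzipped into the new dump.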
Now I can't tell whether this was humour ;-)  There's already been a
subset made for the 27,000 articles in Wikipedia 0.7, thanks to Martin
and CBM -- which is the set we are talking about.  So there's no need to
download and parse 8 GB of xml.

I meant: which code takes a list of articles and extracts that subset
from a larger xml file?  If I know the single-file-id/extraction
routine, I can run a size-reduction script on each article before
moving it from the larger xml collection to the smaller one.

> This sounds complicated, and we don't try to do it for pages. Templates
> are probably on average the same size as each other (or rather, they're
> all small enough that the difference is not very meaningful); find out
> the size of an average-looking one and how much disk space we have left,

Oh, right - they are still being transcluded dynamically.  That's what I
was asking about, forgetting that mwlib conserves space here.  That
seems to me a reason to include many templates -- the good ones enhance
dozens of articles.

SJ

_______________________________________________
Sugar mailing list
Sugar@lists.laptop.org
http://lists.laptop.org/listinfo/sugar
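A rough sketch of the kind of subset-extraction routine with a
per-article size-reduction hook being asked about here; the namespace
URI, function names, and file handling are assumptions (this is not the
script Martin and CBM used), and a real version would also carry over
the dump's <siteinfo> header:

    import bz2
    import xml.etree.ElementTree as ET

    # Namespace used by 2008-era MediaWiki export XML; adjust to match the dump.
    NS = '{http://www.mediawiki.org/xml/export-0.3/}'

    def extract_subset(dump_in, dump_out, titles_file, reduce_article=None):
        """Copy only the pages listed in titles_file from the big dump into
        a smaller one, optionally running a size-reduction hook (e.g. lede
        extraction or template stripping) on each article's wikitext first."""
        with open(titles_file, encoding='utf-8') as f:
            wanted = set(line.strip() for line in f)
        with bz2.open(dump_in, 'rb') as fin, \
             bz2.open(dump_out, 'wt', encoding='utf-8') as fout:
            fout.write('<mediawiki>\n')
            for _, page in ET.iterparse(fin):
                if page.tag != NS + 'page':
                    continue
                title = page.findtext(NS + 'title')
                if title in wanted:
                    text_el = page.find(NS + 'revision/' + NS + 'text')
                    if reduce_article and text_el is not None and text_el.text:
                        text_el.text = reduce_article(text_el.text)
                    fout.write(ET.tostring(page, encoding='unicode'))
                page.clear()  # keep memory flat while streaming the 8 GB dump
            fout.write('</mediawiki>\n')

    # e.g. extract_subset('enwiki-20080724-pages-meta-current.xml.bz2',
    #                     'enwiki-subset.xml.bz2', 'en-g1g1-8k.txt',
    #                     reduce_article=extract_lead)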