I have something that maybe could be made into one: http://uncorpora.org/
It contains resolutions of the United Nations General Assembly in the 6 official languages, aligned at the paragraph level, in TMX (Translation Memory eXchange) XML format. The 6 languages are: English, French, Spanish, Arabic, Chinese, Russian.

Facets could be derived from already encoded information for:
1) Session number: 55-62
2) Committee number: 0-6
3) Operative/preambulatory phrase (for some of the paragraphs)
4) Resolution number (which is part of the record ID)
5) Cross-reference information that is embedded in the text, but is marked off with XML tags

Markup and all, it is about 170 MB across the 6 languages.

If that looks useful, I would be happy to work with more experienced Solr users to beat it into the right shape.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
- I think age is a very high price to pay for maturity (Tom Stoppard)

On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> Is there a standard index like what Lucene uses for contrib/benchmark for
> executing faceted queries over? Or maybe we can randomly generate one that
> works in conjunction with wikipedia? That way we can execute real world
> queries against faceted data. Or we could use the Lucene/Solr mailing lists
> and other data (ala Lucid's faceted site) as a standard index?
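As a rough starting point for beating the data into shape, here is a minimal sketch of flattening TMX translation units into one record per aligned paragraph. It relies only on the generic TMX element names (tu, tuv, seg) and the xml:lang attribute. The sample record ID, the ref tag, and the text_<lang> field names are made up for illustration and are not taken from the actual corpus, so session, committee, and resolution facets would still have to be parsed out of the real record IDs.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_docs(tmx_text):
    """Turn each TMX translation unit into one flat dict (a candidate index document)."""
    root = ET.fromstring(tmx_text)
    docs = []
    for tu in root.iter("tu"):
        doc = {"id": tu.get("tuid")}
        for tuv in tu.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() drops inline tags but keeps their text content.
                doc["text_" + lang] = "".join(seg.itertext()).strip()
        docs.append(doc)
    return docs


# Made-up sample: the tuid value and the <ref> tag are hypothetical.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu tuid="hypothetical-55-1-p1">
<tuv xml:lang="EN"><seg>Recalling its <ref>resolution 54/1</ref>,</seg></tuv>
<tuv xml:lang="FR"><seg>Rappelant sa résolution 54/1,</seg></tuv>
</tu></body></tmx>"""

if __name__ == "__main__":
    for doc in tmx_to_docs(SAMPLE):
        print(doc)
```

Each resulting dict has one field per language, so it could be posted to Solr as a single document, with the facet fields added once the real ID layout is known.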