I have something that could perhaps be made into one: http://uncorpora.org/

It is a collection of United Nations General Assembly resolutions in
the 6 official languages, aligned at the paragraph level in TMX
(Translation Memory eXchange), an XML format. The 6 languages are:
English, French, Spanish, Arabic, Chinese and Russian.

Facets could be derived from already encoded information for:
1) Session number: 55-62
2) Committee number: 0-6
3) Operative/preambulatory phrase (for some of the paragraphs)
4) Resolution number (which is part of the record ID)
5) Cross-reference information that is embedded in the text, but is
marked off with XML tags
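
To make the facet idea a bit more concrete, here is a minimal sketch of
turning TMX translation units into flat records that could later be fed
to Solr. It relies only on the generic TMX elements (tu, prop, tuv,
seg). The sample record, the prop type names ("session", "committee"),
the tuid value and the "_facet"/"text_" field naming are all made up
for illustration; the real corpus files may well be laid out
differently.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample record. The prop types and the tuid format are
# assumptions, not the actual uncorpora.org layout.
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu tuid="hypothetical-id-1">
      <prop type="session">55</prop>
      <prop type="committee">0</prop>
      <tuv xml:lang="EN"><seg>Recalling its previous resolutions,</seg></tuv>
      <tuv xml:lang="ES"><seg>Recordando sus resoluciones anteriores,</seg></tuv>
    </tu>
  </body>
</tmx>"""


def tmx_to_docs(tmx_text):
    """Return one flat dict per translation unit (aligned paragraph)."""
    docs = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        doc = {"id": tu.get("tuid")}
        # Each <prop> becomes a candidate facet field.
        for prop in tu.findall("prop"):
            doc[prop.get("type") + "_facet"] = (prop.text or "").strip()
        # One text field per language; itertext() also keeps text that
        # sits inside nested tags such as cross-reference markup.
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                lang = (tuv.get(XML_LANG) or "").lower()
                doc["text_" + lang] = "".join(seg.itertext()).strip()
        docs.append(doc)
    return docs


docs = tmx_to_docs(SAMPLE_TMX)
```

Each resulting dict has one field per facet and one text field per
language, so the session and committee values could be indexed as plain
string fields and faceted on directly.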

Markup and all, it is about 170 MB across the 6 languages.

If that looks useful, I would be happy to work with more experienced
Solr users to beat it into the right shape.

Regards,
    Alex.

Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
- I think age is a very high price to pay for maturity (Tom Stoppard)

On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> Is there a standard index like what Lucene uses for contrib/benchmark for
> executing faceted queries over? Or maybe we can randomly generate one that
> works in conjunction with wikipedia? That way we can execute real world
> queries against faceted data. Or we could use the Lucene/Solr mailing lists
> and other data (ala Lucid's faceted site) as a standard index?
