Re: Wikipedia or reuters like index for testing facets?

2009-07-18 Thread Alexandre Rafalovitch
I have something that maybe could be made into one: http://uncorpora.org/ It is resolutions of the United Nations General Assembly in 6 official languages aligned on a paragraph level in an XML (Translation Memory eXchange) format. The 6 languages are: English, French, Spanish, Arabic, Chinese, Ru

Re: Wikipedia or reuters like index for testing facets?

2009-07-18 Thread Grant Ingersoll
It's only really effective if the number of tokens in the Sink is expected to be significantly less than (my various tests showed around < 50%, but YMMV) so it isn't likely useful for most copy fields situations. For Solr to utilize, the schema would have to allow for giving ids to the var

Re: Wikipedia or reuters like index for testing facets?

2009-07-17 Thread Jason Rutherglen
I saw the discussion about TeeSinkTokenFilter on java-user, and was wondering how Solr performs copy fields? Couldn't Solr by default utilize a TeeSinkTokenFilter-like class for copying fields? > That link is meant to be stable for benchmarking purposes within Lucene. The fields are different? O

Re: Wikipedia or reuters like index for testing facets?

2009-07-17 Thread Grant Ingersoll
It's likely quite different. That link is meant to be stable for benchmarking purposes within Lucene. Note, one thing I wish I had time for: Hook in Tee/Sink capabilities into Solr such that one could use the WikipediaTokenizer and then Tee the Categories, etc. off to separate fields autom

Re: Wikipedia or reuters like index for testing facets?

2009-07-17 Thread Jason Rutherglen
The question that comes to mind is how it's different than http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2 Guess we'd need to download it and take a look! On Thu, Jul 16, 2009 at 8:33 PM, Peter Wolanin wrote: > AWS provides some standard data sets, including an

Re: Wikipedia or reuters like index for testing facets?

2009-07-16 Thread Peter Wolanin
AWS provides some standard data sets, including an extract of all wikipedia content: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2345&categoryID=249 Looks like it's not being updated often, so this or another AWS data set could be a consistent basis for benchmarking? -Pe

Re: Wikipedia or reuters like index for testing facets?

2009-07-15 Thread Jason Rutherglen
Yeah that's what I was thinking of as an alternative, use enwiki and randomly generate facet data along with it. However for consistent benchmarking the random data would need to stay the same so that people could execute the same benchmark consistently in their own environment. On Tue, Jul 14, 20
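
One way the "random but repeatable" facet data discussed here might look, as a minimal sketch only. The function name `make_facets`, the field names `category` and `price`, and the default seed and ranges are all hypothetical and not taken from Lucene, Solr, or this thread. The idea is to seed a generator from a fixed benchmark seed plus the document id, so that every run (and every person running it) gets the same facet values for the same document, while the number of unique values and the numeric range stay under the caller's control.

```python
import random


def make_facets(doc_id, seed=42, num_categories=50, price_range=(0, 1000)):
    """Return deterministic pseudo-random facet values for one document.

    All names and defaults here are illustrative assumptions.
    """
    # Seed a private generator from the benchmark seed and the document id,
    # so the result does not depend on the order documents are processed in.
    rng = random.Random(f"{seed}:{doc_id}")
    return {
        # num_categories controls how many unique facet values can occur.
        "category": f"cat_{rng.randrange(num_categories)}",
        # price_range controls the (inclusive) range of the numeric facet.
        "price": rng.randint(*price_range),
    }


if __name__ == "__main__":
    # The same document id and seed always yield the same facets.
    print(make_facets(1))
    print(make_facets(1) == make_facets(1))
```

Publishing the seed alongside the benchmark would then be enough for others to regenerate the same facet fields locally, assuming the same generator implementation is used.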

Re: Wikipedia or reuters like index for testing facets?

2009-07-14 Thread Mark Miller
Why don't you just randomly generate the facet data? That's probably the best way, right? You can control the uniques and ranges. On Wed, Jul 15, 2009 at 1:21 AM, Grant Ingersoll wrote: > Probably not as generated by the EnwikiDocMaker, but the WikipediaTokenizer > in Lucene can pull out richer syntax

Re: Wikipedia or reuters like index for testing facets?

2009-07-14 Thread Grant Ingersoll
Probably not as generated by the EnwikiDocMaker, but the WikipediaTokenizer in Lucene can pull out richer syntax which could then be Teed/Sinked to other fields. Things like categories, related links, etc. Mostly, though, I was just commenting on the fact that it isn't hard to at least us
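
The tee/sink idea mentioned here can be illustrated conceptually as follows. This is not the Lucene API: the `Sink` class, the `tee` function, and the `"word"` / `"category"` token types are hypothetical stand-ins, and the token list is made up for the example. The sketch only shows the general pattern of analyzing text once and siphoning selected tokens off to a second field during the same pass.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# A token is modeled as (text, type); the type labels are assumptions.
Token = Tuple[str, str]


class Sink:
    """Collects the tokens that pass its filter while the main stream is read."""

    def __init__(self, accept: Callable[[Token], bool]) -> None:
        self.accept = accept
        self.tokens: List[Token] = []


def tee(tokens: Iterable[Token], sinks: List[Sink]) -> Iterator[Token]:
    """Yield every token unchanged, copying accepted ones into each sink."""
    for tok in tokens:
        for sink in sinks:
            if sink.accept(tok):
                sink.tokens.append(tok)
        yield tok


# Made-up token stream standing in for the output of a wiki-aware tokenizer.
stream = [
    ("lucene", "word"),
    ("Search engines", "category"),
    ("index", "word"),
    ("Java libraries", "category"),
]

categories = Sink(lambda tok: tok[1] == "category")
body_tokens = list(tee(stream, [categories]))      # single pass over the text
category_values = [text for text, _ in categories.tokens]
```

After the single pass, `body_tokens` holds the full stream for the main field and `category_values` holds only the category tokens, which could then feed a separate facet field.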

Re: Wikipedia or reuters like index for testing facets?

2009-07-14 Thread Jason Rutherglen
You think enwiki has enough data for faceting? On Tue, Jul 14, 2009 at 2:56 PM, Grant Ingersoll wrote: > At a min, it is trivial to use the EnWikiDocMaker and then send the doc over > SolrJ... > > On Jul 14, 2009, at 4:07 PM, Mark Miller wrote: > >> On Tue, Jul 14, 2009 at 3:36 PM, Jason Ruthergle

Re: Wikipedia or reuters like index for testing facets?

2009-07-14 Thread Grant Ingersoll
At a min, it is trivial to use the EnWikiDocMaker and then send the doc over SolrJ... On Jul 14, 2009, at 4:07 PM, Mark Miller wrote: On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen < jason.rutherg...@gmail.com> wrote: Is there a standard index like what Lucene uses for contrib/ benchmark

Re: Wikipedia or reuters like index for testing facets?

2009-07-14 Thread Mark Miller
On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen < jason.rutherg...@gmail.com> wrote: > Is there a standard index like what Lucene uses for contrib/benchmark for > executing faceted queries over? Or maybe we can randomly generate one that > works in conjunction with wikipedia? That way we can exe

Wikipedia or reuters like index for testing facets?

2009-07-14 Thread Jason Rutherglen
Is there a standard index like what Lucene uses for contrib/benchmark for executing faceted queries over? Or maybe we can randomly generate one that works in conjunction with wikipedia? That way we can execute real world queries against faceted data. Or we could use the Lucene/Solr mailing lists an