I have something that could maybe be made into one: http://uncorpora.org/
It is resolutions of the United Nations General Assembly in 6 official
languages, aligned at the paragraph level in the XML-based TMX
(Translation Memory eXchange) format. The 6 languages are English,
French, Spanish, Arabic, Chinese, and Russian.
It's only really effective if the number of tokens in the Sink is
expected to be significantly less than the number in the source stream
(my various tests put the break-even at around 50%, but YMMV), so it
isn't likely useful for most copyField situations. For Solr to utilize
it, the schema would have to allow for giving ids to the various sinks.
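To make the Tee/Sink mechanics concrete, here is a minimal sketch against the Lucene 2.9-era API (the field names are invented): the text is tokenized once, the tee caches each token, and the sink replays them into a second field.

import java.io.StringReader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TeeSinkTokenFilter;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class TeeSinkSketch {
  /** Index the same text into two fields while analyzing it only once. */
  static Document makeDoc(String text) {
    TeeSinkTokenFilter tee =
        new TeeSinkTokenFilter(new WhitespaceTokenizer(new StringReader(text)));
    TeeSinkTokenFilter.SinkTokenStream sink = tee.newSinkTokenStream();

    Document doc = new Document();
    // The tee'd field must be added first: the sink can only replay
    // tokens that the tee has already consumed and cached.
    doc.add(new Field("body", new LowerCaseFilter(tee)));
    doc.add(new Field("body_sink", sink));
    return doc;
  }
}

The per-token caching cost is presumably why the break-even sits around 50%: if the sink keeps most of the tokens anyway, simply re-analyzing the text (as copyField does) is about as cheap.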
I saw the discussion about TeeSinkTokenFilter on java-user, and was
wondering how Solr performs copy fields. Couldn't Solr by default
utilize a TeeSinkTokenFilter-like class for copying fields?
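For reference, copy fields are declared in schema.xml, and each destination field re-analyzes the raw source text with its own analyzer, which is exactly the duplicated work a Tee/Sink approach could avoid. A tiny illustrative snippet (field and type names made up):

<field name="title"       type="text"   indexed="true" stored="true"/>
<field name="title_exact" type="string" indexed="true" stored="false"/>

<!-- copyField duplicates the raw value before analysis;
     title_exact then runs its own analysis pass over the same text -->
<copyField source="title" dest="title_exact"/>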
> That link is meant to be stable for benchmarking purposes within Lucene.
The fields are different?
It's likely quite different. That link is meant to be stable for
benchmarking purposes within Lucene.
Note, one thing I wish I had time for: hook Tee/Sink capabilities into
Solr such that one could use the WikipediaTokenizer and then tee the
categories, etc. off to separate fields automatically.
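A rough sketch of what that could look like with the 2.9-era contrib WikipediaTokenizer, using a SinkFilter keyed off the token type (the field names are invented):

import java.io.Reader;

import org.apache.lucene.analysis.TeeSinkTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.util.AttributeSource;
import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

public class WikipediaTeeSketch {
  /** Accepts only the tokens WikipediaTokenizer typed as categories. */
  static final TeeSinkTokenFilter.SinkFilter CATEGORY_FILTER =
      new TeeSinkTokenFilter.SinkFilter() {
        @Override
        public boolean accept(AttributeSource source) {
          TypeAttribute type = source.addAttribute(TypeAttribute.class);
          return WikipediaTokenizer.CATEGORY.equals(type.type());
        }
      };

  static Document makeDoc(Reader wikiMarkup) {
    TeeSinkTokenFilter tee =
        new TeeSinkTokenFilter(new WikipediaTokenizer(wikiMarkup));
    TokenStream categories = tee.newSinkTokenStream(CATEGORY_FILTER);

    Document doc = new Document();
    doc.add(new Field("body", tee));            // the full token stream
    doc.add(new Field("category", categories)); // just the category tokens
    return doc;
  }
}

The same pattern would work for internal links, external links, etc., since the tokenizer marks each kind of token with its own type constant.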
The question that comes to mind is how it's different from
http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2
Guess we'd need to download it and take a look!
On Thu, Jul 16, 2009 at 8:33 PM, Peter Wolanin wrote:
> AWS provides some standard data sets, including an extract of all wikipedia content:
AWS provides some standard data sets, including an extract of all
wikipedia content:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2345&categoryID=249
Looks like it's not being updated often, so this or another AWS data
set could be a consistent basis for benchmarking?
-Peter
Yeah, that's what I was thinking of as an alternative: use enwiki and
randomly generate facet data along with it. However, for consistent
benchmarking, the random data would need to stay the same so that
people could execute the same benchmark consistently in their own
environment.
On Tue, Jul 14, 2009, …
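One way to square "random" with "consistent": derive the seed from something stable in the article itself, such as its title, so every run on every machine produces identical facet values. A hypothetical sketch (field names and cardinalities are made up):

import java.util.Random;

public class FacetDataSketch {
  private static final String[] COLORS =
      {"red", "green", "blue", "black", "white"};

  /**
   * Same title in, same facets out: String.hashCode() is specified by
   * the JLS, so the values are reproducible across JVMs and runs.
   */
  static String[] facetsFor(String title) {
    Random rnd = new Random(title.hashCode());
    String color = COLORS[rnd.nextInt(COLORS.length)]; // 5 uniques
    String userId = "user" + rnd.nextInt(10000);       // ~10k uniques
    return new String[] {color, userId};
  }
}

Tuning the nextInt() bounds gives direct control over the uniques and ranges mentioned below.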
Why don't you just randomly generate the facet data? That's probably
the best way, right? You can control the uniques and ranges.
On Wed, Jul 15, 2009 at 1:21 AM, Grant Ingersoll wrote:
> Probably not as generated by the EnwikiDocMaker, but the WikipediaTokenizer
> in Lucene can pull out richer syntax which could then be Teed/Sinked to
> other fields.
Probably not as generated by the EnwikiDocMaker, but the
WikipediaTokenizer in Lucene can pull out richer syntax which could
then be Teed/Sinked to other fields. Things like categories, related
links, etc. Mostly, though, I was just commenting on the fact that it
isn't hard to at least use …
You think enwiki has enough data for faceting?
On Tue, Jul 14, 2009 at 2:56 PM, Grant Ingersoll wrote:
> At a min, it is trivial to use the EnwikiDocMaker and then send the doc over
> SolrJ...
>
> On Jul 14, 2009, at 4:07 PM, Mark Miller wrote:
>
>> On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen wrote:
At a min, it is trivial to use the EnwikiDocMaker and then send the
doc over SolrJ...
On Jul 14, 2009, at 4:07 PM, Mark Miller wrote:
On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
Is there a standard index like what Lucene uses for contrib/benchmark
for executing faceted queries over?
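From memory of the 2.4/2.9-era APIs, the "trivial" version looks roughly like this; the docs.file property and the docid/doctitle/body field names EnwikiDocMaker emits are worth double-checking, and the Solr schema is assumed to have matching fields:

import java.util.Properties;

import org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker;
import org.apache.lucene.benchmark.byTask.utils.Config;
import org.apache.lucene.document.Document;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class EnwikiToSolr {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.setProperty("docs.file", args[0]); // path to the pages-articles dump

    EnwikiDocMaker maker = new EnwikiDocMaker();
    maker.setConfig(new Config(props));

    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    for (int i = 0; i < 1000; i++) { // first 1000 articles as a smoke test
      Document doc = maker.makeDocument();
      SolrInputDocument sdoc = new SolrInputDocument();
      sdoc.addField("id", doc.get("docid"));
      sdoc.addField("title", doc.get("doctitle"));
      sdoc.addField("body", doc.get("body"));
      solr.add(sdoc);
    }
    solr.commit();
  }
}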
On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> Is there a standard index like what Lucene uses for contrib/benchmark for
> executing faceted queries over? Or maybe we can randomly generate one that
> works in conjunction with wikipedia? That way we can execute real-world
> queries against faceted data.
Is there a standard index like what Lucene uses for contrib/benchmark for
executing faceted queries over? Or maybe we can randomly generate one that
works in conjunction with wikipedia? That way we can execute real-world
queries against faceted data. Or we could use the Lucene/Solr mailing
lists and …
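And for the query side, running a faceted query from SolrJ against whatever index comes out of this would look roughly like the following (the "category" facet field is hypothetical):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetQuerySketch {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery q = new SolrQuery("*:*");
    q.setFacet(true);
    q.addFacetField("category");
    q.setFacetMinCount(1);
    q.setRows(0); // we only want the facet counts, not the documents

    QueryResponse rsp = solr.query(q);
    for (FacetField.Count c : rsp.getFacetField("category").getValues()) {
      System.out.println(c.getName() + ": " + c.getCount());
    }
  }
}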