It's only really effective if the number of tokens in the Sink is expected to be significantly less than the number in the original stream (my various tests showed around < 50%, but YMMV), so it isn't likely useful for most copy field situations. For Solr to utilize it, the schema would have to allow for giving ids to the various TokenFilters so that you could identify the Tees and the Sinks. At least that was my first thought on it.
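
For reference, a minimal sketch of how the Lucene TeeSinkTokenFilter of that era gets wired up outside of Solr; the tokenizer choice and the field names ("body", "body_copy") are placeholders for illustration, not anything Solr provides:

import java.io.StringReader;

import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.sinks.TeeSinkTokenFilter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class TeeSinkCopyFieldSketch {
  /** Builds a Document whose two fields share a single tokenization pass. */
  public static Document makeDoc(String text) {
    // The tee wraps the real tokenizer; the sink replays the cached tokens.
    TeeSinkTokenFilter tee =
        new TeeSinkTokenFilter(new WhitespaceTokenizer(new StringReader(text)));
    TeeSinkTokenFilter.SinkTokenStream sink = tee.newSinkTokenStream();

    Document doc = new Document();
    // The tee field must be consumed before the sink field at index time.
    doc.add(new Field("body", tee));
    doc.add(new Field("body_copy", sink));
    return doc;
  }
}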

-Grant
On Jul 17, 2009, at 7:50 PM, Jason Rutherglen wrote:

I saw the discussion about TeeSinkTokenFilter on java-user, and was wondering how Solr performs copy fields? Couldn't Solr by default utilize a TeeSinkTokenFilter-like class for copying fields?

That link is meant to be stable for benchmarking purposes within Lucene.

The fields are different?

On Fri, Jul 17, 2009 at 9:57 AM, Grant Ingersoll<gsing...@apache.org> wrote:
It's likely quite different.  That link is meant to be stable for
benchmarking purposes within Lucene.

Note, one thing I wish I had time for:
Hook Tee/Sink capabilities into Solr such that one could use the WikipediaTokenizer and then Tee the Categories, etc. off to separate fields automatically for faceting, etc.
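
A sketch of what the Lucene side of that could look like, using the contrib WikipediaTokenizer and a SinkFilter keyed on token type; the schema wiring to Solr fields is exactly the part that doesn't exist yet, and the package/class locations below are from the 2.9-era contrib and may differ by version:

import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.sinks.TeeSinkTokenFilter;
import org.apache.lucene.analysis.sinks.TeeSinkTokenFilter.SinkFilter;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.AttributeSource;
import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

public class WikipediaCategorySinkSketch {

  /** Lets only tokens typed as wiki categories through to the sink. */
  static final SinkFilter CATEGORY_FILTER = new SinkFilter() {
    public boolean accept(AttributeSource source) {
      TypeAttribute type = (TypeAttribute) source.addAttribute(TypeAttribute.class);
      return WikipediaTokenizer.CATEGORY.equals(type.type());
    }
  };

  /** Wires one Wikipedia tokenization so categories can feed a facet field. */
  public static TokenStream[] wire(Reader wikiText) {
    // One pass over the wikitext: index 0 is the full stream for the main
    // text field, index 1 carries only the category tokens (e.g. for a
    // facet field). Binding these two streams to Solr fields by id is the
    // schema support that does not exist yet.
    TeeSinkTokenFilter tee =
        new TeeSinkTokenFilter(new WikipediaTokenizer(wikiText));
    return new TokenStream[] { tee, tee.newSinkTokenStream(CATEGORY_FILTER) };
  }
}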

-Grant

On Jul 17, 2009, at 10:48 AM, Jason Rutherglen wrote:

The question that comes to mind is how it's different than

http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2

Guess we'd need to download it and take a look!

On Thu, Jul 16, 2009 at 8:33 PM, Peter Wolanin<peter.wola...@acquia.com> wrote:

AWS provides some standard data sets, including an extract of all
wikipedia content:


http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2345&categoryID=249

Looks like it's not being updated often, so this or another AWS data
set could be a consistent basis for benchmarking?

-Peter

On Wed, Jul 15, 2009 at 2:21 PM, Jason Rutherglen<jason.rutherg...@gmail.com> wrote:

Yeah, that's what I was thinking of as an alternative: use enwiki and randomly generate facet data along with it. However, for consistent benchmarking the random data would need to stay the same so that people could execute the same benchmark consistently in their own environment.
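
Purely to illustrate the fixed-seed idea (the field naming, seed, and cardinality here are made up):

import java.util.Random;

/** Deterministic synthetic facet values: same seed, same data, every run. */
public class SyntheticFacets {
  private static final long SEED = 42L;          // fixed so runs are repeatable
  private static final int NUM_CATEGORIES = 200; // controls facet cardinality

  private final Random random = new Random(SEED);

  /** Facet value to attach to the next document, e.g. alongside an enwiki doc. */
  public String nextCategory() {
    return "cat_" + random.nextInt(NUM_CATEGORIES);
  }
}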

On Tue, Jul 14, 2009 at 6:28 PM, Mark Miller<markrmil...@gmail.com> wrote:

Why don't you just randomly generate the facet data? That's probably the best way, right? You can control the uniques and ranges.

On Wed, Jul 15, 2009 at 1:21 AM, Grant Ingersoll <gsing...@apache.org> wrote:

Probably not as generated by the EnwikiDocMaker, but the WikipediaTokenizer in Lucene can pull out richer syntax which could then be Teed/Sinked to other fields. Things like categories, related links, etc. Mostly, though, I was just commenting on the fact that it isn't hard to at least use it for getting docs into Solr.

-Grant

On Jul 14, 2009, at 7:38 PM, Jason Rutherglen wrote:

You think enwiki has enough data for faceting?

On Tue, Jul 14, 2009 at 2:56 PM, Grant Ingersoll<gsing...@apache.org> wrote:

At a min, it is trivial to use the EnwikiDocMaker and then send the doc over SolrJ...
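
Roughly along these lines (a sketch only, not a tested recipe: the benchmark property names, the Config/EnwikiDocMaker lifecycle, and the field mapping are assumptions that may need adjusting for a given Lucene/Solr version):

import java.util.Properties;

import org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker;
import org.apache.lucene.benchmark.byTask.feeds.NoMoreDataException;
import org.apache.lucene.benchmark.byTask.utils.Config;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class EnwikiToSolr {
  public static void main(String[] args) throws Exception {
    // Feed the benchmark doc maker straight from the wikipedia dump.
    Properties props = new Properties();
    props.setProperty("docs.file", args[0]);         // path to enwiki-*.xml(.bz2)
    props.setProperty("doc.maker.forever", "false"); // stop at end of the dump
    EnwikiDocMaker maker = new EnwikiDocMaker();
    maker.setConfig(new Config(props));

    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    int id = 0;
    while (true) {
      Document doc;
      try {
        doc = maker.makeDocument();  // next wikipedia article as a Lucene doc
      } catch (NoMoreDataException e) {
        break;
      }
      SolrInputDocument sdoc = new SolrInputDocument();
      sdoc.addField("id", id++);
      for (Object o : doc.getFields()) {  // copy stored text fields across by name
        Fieldable f = (Fieldable) o;
        if (f.stringValue() != null) {
          sdoc.addField(f.name(), f.stringValue());
        }
      }
      solr.add(sdoc);
    }
    solr.commit();
  }
}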

On Jul 14, 2009, at 4:07 PM, Mark Miller wrote:

On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:

Is there a standard index like what Lucene uses for contrib/benchmark for executing faceted queries over? Or maybe we can randomly generate one that works in conjunction with wikipedia? That way we can execute real world queries against faceted data. Or we could use the Lucene/Solr mailing lists and other data (ala Lucid's faceted site) as a standard index?


I don't think there is any standard set of docs for solr testing - there is not a real benchmark contrib - though I know more than a few of us have hacked up pieces of Lucene benchmark to work with Solr - I think I've done it twice now ;)

Would be nice to get things going. I was thinking the other day: I wonder how hard it would be to make Lucene Benchmark generic enough to accept Solr impls and Solr algs?

It does a lot that would suck to duplicate.
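
To picture what that might mean, here is a rough sketch of a Solr-flavored benchmark task, assuming the PerfTask/DocMaker APIs stay roughly as they are; the task name "AddSolrDoc", the hard-coded URL, and the field mapping are invented for illustration:

import org.apache.lucene.benchmark.byTask.PerfRunData;
import org.apache.lucene.benchmark.byTask.tasks.PerfTask;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

/** Hypothetical "AddSolrDoc" task, usable from an .alg file like AddDoc. */
public class AddSolrDocTask extends PerfTask {

  private final SolrServer solr;

  public AddSolrDocTask(PerfRunData runData) throws Exception {
    super(runData);
    // Hard-coded for the sketch; a real impl would read the URL from the config.
    solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
  }

  public int doLogic() throws Exception {
    // Reuse the existing doc makers (enwiki, Reuters, ...) but push the doc
    // to Solr instead of a local IndexWriter.
    Document doc = getRunData().getDocMaker().makeDocument();
    SolrInputDocument sdoc = new SolrInputDocument();
    for (Object o : doc.getFields()) {
      Fieldable f = (Fieldable) o;
      if (f.stringValue() != null) {
        sdoc.addField(f.name(), f.stringValue());
      }
    }
    solr.add(sdoc);
    return 1; // one unit of work completed
  }
}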

--
--
- Mark

http://www.lucidimagination.com


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search



--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search




--
--
- Mark

http://www.lucidimagination.com





--
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search



--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
