If we want to test with huge amounts of data, we can feed in portions of the internet. The problem is that it takes a lot of bandwidth and a lot of computing power to get to a "reasonable" size. On the positive side, you are dealing with real text, so it's easier to tune for relevance.
I think it's easier to create a simple XML generator with mock data: prices, popularity ratings, etc. It's fast to generate millions of mock products, and once you have a large set of XML files you can easily index, test, change the config or schema, and reindex (a minimal sketch of such a generator follows the quoted question below). On the other hand, the sample data that ships with the Solr example is a good set too, since it demonstrates the concepts well, especially with the stock Velocity templates. We know Solr will handle enormous sets, but sheer quantity is not always part of a PoC.

> Hello Everyone,
>
> I have a goal of populating Solr with a million unique products in
> order to create a test environment for a proof of concept. I started
> out by using DIH with Amazon RSS feeds, but I quickly realized that
> there's no way I can glean a million products from one RSS feed. And
> I'd go mad if I just sat at my computer all day looking for feeds and
> punching them into the DIH config for Solr.
>
> Has anyone ever had to create large mock/dummy datasets for test
> environments or for POCs/demos to convince folks that Solr was the
> wave of the future? Any tips would be greatly appreciated. I suppose
> it sounds a lot like crawling, even though it started out as innocent
> DIH usage.
>
> - Pulkit
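
For what it's worth, here is a minimal sketch of the kind of generator I mean, written in Python and targeting Solr's XML update format. The field names (id, name, price, popularity, inStock) are assumed to match the stock example schema; adjust them, the vocabulary, and the value ranges to whatever your PoC schema actually looks like.

```python
#!/usr/bin/env python
# Minimal sketch: generate mock product documents in Solr's <add>/<doc> XML
# update format. Field names assume the stock example schema; change as needed.
import random
import xml.etree.ElementTree as ET

ADJECTIVES = ["Portable", "Wireless", "Compact", "Rugged", "Deluxe"]
NOUNS = ["Camera", "Router", "Headset", "Keyboard", "Monitor"]

def make_doc(i):
    """Build one <doc> element with randomized mock values."""
    doc = ET.Element("doc")
    fields = {
        "id": "PROD-%07d" % i,
        "name": "%s %s %d" % (random.choice(ADJECTIVES), random.choice(NOUNS), i),
        "price": "%.2f" % random.uniform(5, 500),
        "popularity": str(random.randint(1, 10)),
        "inStock": random.choice(["true", "false"]),
    }
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return doc

def write_batch(path, start, count):
    """Write one <add> file containing `count` docs, ready to post to /update."""
    add = ET.Element("add")
    for i in range(start, start + count):
        add.append(make_doc(i))
    ET.ElementTree(add).write(path, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    # 100 files x 10,000 docs = 1,000,000 mock products.
    for batch in range(100):
        write_batch("products-%03d.xml" % batch, batch * 10000, 10000)
```

Once the files are on disk, you can post each one to the update handler, for example `curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type: text/xml' --data-binary @products-000.xml` (the exact URL depends on your Solr version and core name), then tweak the schema or config and reindex as often as the PoC requires.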